Showing posts with label Databases. Show all posts

Monday, May 5, 2014

Jepsen & Distributed DataStores

Kyle Kingsbury is doing an amazing job with his Jepsen project. TheHackerCIO has long been disturbed by the tendency for people to make assertions and claims without the experimental evidence to back them up, or any basis on which to assess them.

Especially in the database world.

Here are a handful of the problems:

I can't tell you how many times I've heard, "Oh, in the inner-join using RDBMS X, a nested-loop algorithm will of course perform better depending on which table is the outer and which is the inner."

No doubt.
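
To make the claim concrete, here is a minimal sketch of an index nested-loop join. It is a toy under stated assumptions, not how any particular RDBMS works: the tables are in-memory lists, a Python dict stands in for an index on the inner table, the table and column names (customers, orders, cust_id) are made up, and "cost" is counted only as the number of index probes.

```python
def index_nested_loop_join(outer, inner, key):
    """Join two lists of dicts on `key`. Returns (joined_rows, probe_count)."""
    # A dict stands in for an existing index on the inner table (assumption).
    index = {}
    for row in inner:
        index.setdefault(row[key], []).append(row)

    probes = 0
    joined = []
    for outer_row in outer:
        probes += 1  # one index lookup per outer row
        for inner_row in index.get(outer_row[key], []):
            joined.append({**outer_row, **inner_row})
    return joined, probes


# Hypothetical data: 10 customers, 1000 orders spread evenly across them.
customers = [{"cust_id": c, "name": f"customer-{c}"} for c in range(10)]
orders = [{"cust_id": n % 10, "order_id": n} for n in range(1000)]

rows_small_outer, probes_small_outer = index_nested_loop_join(customers, orders, "cust_id")
rows_large_outer, probes_large_outer = index_nested_loop_join(orders, customers, "cust_id")

print(len(rows_small_outer), probes_small_outer)
print(len(rows_large_outer), probes_large_outer)
```

Both orderings produce the same joined rows, but the probe count follows the size of the outer table. Whether a real engine's probe cost behaves like this toy is precisely the kind of thing that has to be measured rather than assumed.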

But these DBMSs have an optimizer. They have tables full of statistics about the data, presumably updated on a regular basis. These vendors have had 20 years to tweak optimizations. Yet the documentation gives no indication as to whether their "optimizer" can pick the right outer table and inner table, or whether you must explicitly pick the right one yourself.

So lots of people just assume that the optimizer can/will do this. Which isn't unreasonable.

But the days have come where things need to be specified tighter.

We simply need clear black/white, preferably not greatly hedged, statements in the documentation. Statements that can be tested. Verified. Proven. Or disproven.

The newer world of NoSql is no exception to this rule or problem.

But Kyle has been there.

Kyle got interested in understanding the issues around the NoSql databases. But he did things the right way: he set up a controlled environment, and began systematically testing, examining, and proving out how the CAP theorem implications actually work in a partitioning environment. This led to a number of surprises for the vendors, ... not to mention the users???

You can take a look at his full Jepsen Project here. He's tested Cassandra (my current focus), Redis, Kafka, NuoDB, ZooKeeper, Riak, MongoDB, Postgres, possibly others ...

To get a proper sense for this correct, test-based approach, TheHackerCIO recommends this. Here are just a few enticing flavor notes, taken from a section to which you should devote your most careful attention, entitled "Testing Partitions":

  • Theory bounds a design space, but real software may not achieve those bounds. We need to test a system's behavior to really understand how it behaves.
  • To cause a partition, you'll need a way to drop or delay messages: for instance, with firewall rules. 
  • Running these commands repeatably on several hosts takes a little bit of work.
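
As a rough sketch of what "firewall rules on several hosts" can look like, the snippet below only builds the iptables command strings for a two-sided partition; it does not run them. The host names n1 through n5 and both helper functions are assumptions made for illustration, and getting the commands onto each host (ssh or otherwise) is left out entirely.

```python
def partition_commands(side_a, side_b):
    """Map each host to the iptables commands that drop traffic from the other side."""
    commands = {}
    for host in side_a:
        commands[host] = [f"iptables -A INPUT -s {peer} -j DROP" for peer in side_b]
    for host in side_b:
        commands[host] = [f"iptables -A INPUT -s {peer} -j DROP" for peer in side_a]
    return commands


def heal_commands(hosts):
    """Map each host to a command that flushes its filter rules.

    Caution: `iptables -F` removes ALL rules in the filter table, not just
    the ones added above, so this is only suitable for a throwaway test-bed.
    """
    return {host: ["iptables -F"] for host in hosts}


# Hypothetical five-node cluster split into a minority and a majority.
minority = ["n1", "n2"]
majority = ["n3", "n4", "n5"]

plan = partition_commands(minority, majority)
for host in sorted(plan):
    for command in plan[host]:
        print(f"{host}: {command}")
```

Generating the plan as data first makes the "repeatably" part easier: the same plan can be printed, reviewed, applied, and healed on every run.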

Work might be a necessary evil. But understanding isn't going to come without it. Or without actual, experimental testing.

In this article, you will see exactly what to set up to get started with your own multi-node, partition-able, experimental test-bed, within which you can see how your NoSql is going to behave.

Because there's no short-cut.

Or, as an earlier time might have put it,

There is no royal road to enlightenment.

I Remain,


Tuesday, November 5, 2013

Cassandra Last Night at the TRG

Cassandra was the topic at TRG last night.

That is to say, Apache Cassandra. I'm not clear why the project chose to name itself after a Greek prophetess who was doomed always to prophesy correctly, but also never to be believed.

Perhaps the eventual consistency model?

Still, it doesn't seem like the greatest PR approach and it doesn't seem like the Big Data initiatives would like to think of their correct insights always being disregarded.

But such is the name of the product.

Our presenter, Adrian Rodriguez, did a nice hands-on tutorial where he built up a data model for a social web application centered around dog photos. He provided a GitHub account where the full-blown application can be browsed.

He also pointed us to a very helpful consistency calculator website, where the implications of your consistency level choice are clearly shown: Cassandra Parameters for Dummies.

Adrian recommended the very sound policy of defining calls at quorum consistency and then relaxing this only where necessary, in keeping with the dictum: "don't prematurely optimize."
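
The arithmetic behind that advice is small enough to sketch. This is not the calculator site's code, just the standard rule of thumb: a quorum is a majority of the replicas, and a read is guaranteed to touch at least one replica that took the latest write when read replicas plus write replicas exceed the replication factor. The function names and the returned dictionary keys are made up for illustration.

```python
def replicas_required(level, replication_factor):
    """Number of replicas that must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a majority of the replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level in this sketch: {level}")


def describe(replication_factor, read_level, write_level):
    """Summarize what a read/write consistency choice implies."""
    reads = replicas_required(read_level, replication_factor)
    writes = replicas_required(write_level, replication_factor)
    return {
        "read_replicas": reads,
        "write_replicas": writes,
        # Reads overlap the latest write only if R + W > RF.
        "reads_see_latest_write": reads + writes > replication_factor,
        # How many replicas can be down before reads / writes start failing.
        "reads_tolerate_down": replication_factor - reads,
        "writes_tolerate_down": replication_factor - writes,
    }


print(describe(3, "QUORUM", "QUORUM"))
print(describe(3, "ONE", "ONE"))
```

With a replication factor of 3, quorum on both sides keeps reads and writes overlapping while still tolerating one replica down; dropping both to ONE tolerates more failures but gives up the overlap, which is the relaxation to make only where it is actually needed.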

I also liked his way of explaining that Cassandra databases grow out left to right, with everything attaching to the primary key as a new column, and with all the join overhead done upfront at update time in all the other relevant rows; in contrast to the Relational Model, where databases grow top to bottom as new rows are added. This is an excellent way for beginners to start wrapping their heads around this NoSql database.
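
A toy model of that left-to-right growth, using plain Python dicts rather than Cassandra itself. This is not Adrian's data model: the timeline structure, the post_photo function, and the user names are all hypothetical. The point it tries to show is that the "join" is paid at write time, by adding a column to every relevant row, so a read is a single row lookup.

```python
from collections import defaultdict

# One row per user (the key); each photo in that user's feed is a new column.
timeline = defaultdict(dict)


def post_photo(author, photo_id, caption, followers):
    """Write the photo into the author's row and every follower's row."""
    column = {"author": author, "caption": caption}
    for user in [author, *followers]:
        # The row for `user` grows sideways by one column.
        timeline[user][photo_id] = column


def read_timeline(user):
    """A single-row read: no join needed, columns come back ordered by photo id."""
    return [timeline[user][photo_id] for photo_id in sorted(timeline[user])]


post_photo("alice", 1, "Rex at the beach", followers=["bob", "carol"])
post_photo("bob", 2, "Fido asleep", followers=["alice"])

print(read_timeline("alice"))
print(read_timeline("carol"))
```

The row count stays at one per user however many photos are posted; it is the number of columns in each row that grows.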

Tonight is the Java Users group, so a report will be in order tomorrow on Groovy.

Full details of the presentation may be read here.

I Remain,