TheHackerCIO [TheH4ck3rC10]: Chaos Monkey

Thursday, November 14, 2013

Throwing A Monkey Into the Works

Not a MonkeyWrench.

But throwing either into the works isn't the normal way to treat a production system.

Still, here is yet another example of the need to check your assumptions in the face of technological change.

Netflix did just that when they created the Chaos Monkey. It was released to the world at large a little over a year ago, in 2012. It's now available on GitHub, along with several other variants, as the Simian Army. The Chaos Monkey is a service that randomly seeks out AWS ASGs (Auto Scaling groups), then finds virtual instances in them and kills them, thus applying the ultimate acid test of your failover capability.
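To make the selection step concrete, here is a toy simulation in Python. It is only a sketch of the idea described above, not Netflix's implementation: the fleet is a plain dictionary, the names pick_victim and unleash are hypothetical, and "killing" an instance just removes it from a list instead of calling AWS.

```python
import random


def pick_victim(groups, rng):
    """Pick one random instance from one random non-empty group, or None."""
    candidates = sorted(name for name, instances in groups.items() if instances)
    if not candidates:
        return None
    group = rng.choice(candidates)
    instance = rng.choice(sorted(groups[group]))
    return group, instance


def unleash(groups, rng):
    """Select a victim and remove it from its group (a stand-in for a real terminate call)."""
    victim = pick_victim(groups, rng)
    if victim is None:
        return None
    group, instance = victim
    groups[group].remove(instance)
    return victim


if __name__ == "__main__":
    # Hypothetical fleet: group name -> list of instance ids.
    fleet = {"web": ["i-web-1", "i-web-2"], "api": ["i-api-1"], "batch": []}
    print("terminated:", unleash(fleet, random.Random()))
    print("survivors:", fleet)
```

Passing the random generator in as a parameter keeps the sketch testable: seed it and the "chaos" becomes repeatable.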

The idea is to fail early, fail often.

And to see how your failover works in this context.

The Chaos Monkey can be configured to run during normal working hours, so that the problems which may result from its chaos can be addressed by the staff during regular hours, rather than at 3am.
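A working-hours guard of this kind can be sketched in a few lines of Python. The function name and the 9-to-15 weekday window are my own assumptions for illustration; the real tool has its own configuration, which I am not reproducing here.

```python
from datetime import datetime


def in_working_hours(now, start_hour=9, end_hour=15):
    """True on Monday-Friday from start_hour (inclusive) to end_hour (exclusive).

    The default window is an assumed example, not a setting taken from the real tool.
    """
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour


if __name__ == "__main__":
    now = datetime.now()
    if in_working_hours(now):
        print("Staff are around: the monkey may strike.")
    else:
        print("Out of hours: the monkey stays in its cage.")
```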

And, in fact, the reason for throwing this monkey into the works is to avoid those 3am calls. And to get used to planning for failure.

This is where the re-thinking comes into play. The new environment for technology is one shaped by managing enormous volumes of data, not to mention millions or even billions of users, where systems need to be:

available 24x7x365 without a maintenance window
flexibly scalable, ideally linearly
based on commodity hardware, subject to failure and outage
resilient within this context; failover should be automatic and transparent to the user

Instead of designing with the assumption of avoiding component failure at all costs, the Netflix approach says we deal with failure like a RAID system. We build using cheap commodity hardware. That's the "I" in RAID, by the way (Inexpensive). We build in a lot of redundancy. That's the "R". Then, we automate failover and make it transparent to the user. But, in this new approach, we need to know how dependable our system is.

Unless something is measured, it is an unknown quantity. The more we deal with failure on a regular basis, the more prepared we are for the unexpected. It's like regular fire drills. Or terrorism simulations.

So here is the paradoxical need for the Chaos Monkey. The real threat is that commodity hardware is too dependable. It's so dependable that it can lull us into not properly planning for failure. But our whole approach with cheap commodity hardware and full, 100% uptime is predicated on and assumes regular failure. Solution: produce failure regularly, but at random.

Perfect example of putting new wine into new bottles.

You can read more about all the members of the Simian Army here.

And just last month, they released the Cassandra Chaos Monkey, so that NoSQL database instances can experience the same Chaos as your other tiers. The announcement is here.

I Remain,

TheHackerCIO

Wednesday, November 13, 2013

An Evening's Evangelism

Last night was spent playing hooky from the Geeky Book club. But only because a particularly special speaker was in town. Patrick McFadin, Chief Evangelist for Apache Cassandra, was speaking at DreamWorks in Glendale.

So, TheHackerCIO slogged through an hour and a half of LA traffic to get out to Glendale in time to see the talk. Not to mention hearing it.

Patrick is a good presenter, so the talk was well organized and interesting. His purpose was to convince us that C* [the semi-official abbreviation for Cassandra] was the best persistence tier for your application.

He predicated this on the tunable consistency available in C*, pointing out that if you were willing to specify ALL, and take the performance hit, you could construct the most consistent distributed database system possible: one where every replica had to acknowledge before an operation completed.
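The trade-off is easy to see with a little arithmetic. The Python below is a toy model, not the Cassandra driver API: the function names are hypothetical and only three of the consistency levels are covered. It assumes the usual definition of a quorum as a majority of replicas, that is, the replication factor divided by two (rounded down) plus one.

```python
def required_acks(level, replication_factor):
    """Number of replica acknowledgements a toy coordinator waits for."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1  # a majority of the replicas
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def write_succeeds(level, replication_factor, live_replicas):
    """True if enough replicas are alive to satisfy the requested level."""
    return live_replicas >= required_acks(level, replication_factor)


if __name__ == "__main__":
    # Three replicas, one of them down.
    for level in ("ONE", "QUORUM", "ALL"):
        print(level, write_succeeds(level, replication_factor=3, live_replicas=2))
```

With three replicas and one of them down, ONE and QUORUM can still be satisfied but ALL cannot. That is the price of the strictest setting: a single unavailable replica blocks the operation.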

The talk was too long to go too in-depth, but I was particularly interested in the architecture of writing all files out immutably. Even compaction is accomplished by reading in the fragmented files and writing a new compacted one. So, in theory, you could always recover -- even from programmatic database corruption. Ideally, you use a snapshot to do point-in-time recovery, followed by writing a script to extract "post-point-in-time" updates from the files and apply them where required.
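Here is a minimal Python sketch of that idea, under heavy simplifying assumptions: each "file" is a dictionary mapping a key to a (timestamp, value) pair, and the newest timestamp wins. The real on-disk format is far richer (deletions, for instance, need their own markers), so treat the compact function as a hypothetical illustration only.

```python
def compact(segments):
    """Merge immutable segments into one new segment, keeping the newest value per key.

    Each segment maps key -> (timestamp, value). The inputs are only read,
    never modified, so the old segments remain available for recovery.
    """
    merged = {}
    for segment in segments:
        for key, (timestamp, value) in segment.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return merged


if __name__ == "__main__":
    older = {"user:1": (100, "Ann"), "user:2": (100, "Bob")}
    newer = {"user:1": (200, "Anne")}
    print(compact([older, newer]))
    print(older)  # the original segment is untouched
```

Because the merge writes a fresh result and leaves its inputs alone, the old files stay intact until you choose to discard them, which is what makes the recovery story plausible.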

He mentioned that the joke among C* cognoscenti is that CQL has an UPSERT statement, because update and insert are so very similar. If a row doesn't exist, update will insert it, and if it exists, insert will replace the data in it! UPSERT is a fun way to remember this similarity of statements.
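A toy in-memory table shows why the joke works. This is a Python sketch, not CQL and not a real client: ToyTable and its methods are hypothetical names, and both operations are simply routed to the same write path.

```python
class ToyTable:
    """In-memory stand-in for a table in which insert and update behave alike."""

    def __init__(self):
        self.rows = {}  # primary key -> dict of column values

    def _write(self, key, columns):
        # Create the row if it is missing, then overwrite the given columns.
        self.rows.setdefault(key, {}).update(columns)

    def insert(self, key, **columns):
        self._write(key, columns)

    def update(self, key, **columns):
        self._write(key, columns)


if __name__ == "__main__":
    table = ToyTable()
    table.update("u1", name="Ann")   # no such row yet, so this "inserts"
    table.insert("u1", name="Anne")  # row exists, so this "updates"
    print(table.rows)
```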

Patrick also pointed out that Netflix -- the poster boy for C* -- has just released the Chaos Monkey for C*! He challenged the Mainframe person attending to introduce the Chaos Monkey to the Mainframe systems, and see how they compare in terms of failover and availability. If you don't know about the Chaos Monkey, tomorrow I'll fill you in on it. Because it's important.

To summarize his talk, I liked his zinger the best: Use Oracle to count your money; use Cassandra to make it.

I Remain,

TheHackerCIO
