Thursday, November 14, 2013

Throwing A Monkey Into the Works



Not a MonkeyWrench.

But throwing either into the works isn't the normal way to treat a production system.

Here is yet another example of the need to check your assumptions in the face of technological change.

Netflix did just that when they created the Chaos Monkey. It was released to the world at large a little over a year ago, in 2012. It's now available on GitHub, along with several other variants, as the SimianArmy. The Chaos Monkey is a service that randomly seeks out AWS ASGs (Auto Scaling groups), finds virtual instances in them, and kills them, thus applying the ultimate acid test of your failover capability.
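To make the mechanics concrete, here is a minimal Python sketch of that selection logic. It is not the SimianArmy's code and it makes no AWS calls: the groups are a plain dictionary, and the names pick_victim, unleash, and the terminate callback are all hypothetical stand-ins for illustration.

```python
import random


def pick_victim(groups, rng=random):
    """Pick one random instance from one random non-empty group.

    `groups` maps a (hypothetical) group name to a list of instance ids.
    Returns (group_name, instance_id), or None if nothing is running.
    """
    candidates = {name: ids for name, ids in groups.items() if ids}
    if not candidates:
        return None
    name = rng.choice(sorted(candidates))
    return name, rng.choice(candidates[name])


def unleash(groups, terminate, rng=random):
    """Terminate one randomly chosen instance via the `terminate` callback.

    `terminate(group_name, instance_id)` is an assumed hook; in a real
    system it would call the cloud provider, here it can be any function.
    """
    victim = pick_victim(groups, rng)
    if victim is not None:
        name, instance_id = victim
        terminate(name, instance_id)
        groups[name].remove(instance_id)
    return victim
```

Passing the terminate step in as a callback keeps the random choice separate from the destructive part, so the choice can be exercised safely against fake data.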

The idea is to fail early, fail often.

And to see how your failover works in this context.

The Chaos Monkey can be configured to run during normal working hours, so that the problems which may result from its chaos can be addressed by the staff during regular hours, rather than at 3am.
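A schedule gate of that kind might look like the following sketch. The 9am to 3pm weekday window is an assumption picked for illustration, not a documented Chaos Monkey default, and in_working_hours is a hypothetical helper name.

```python
from datetime import datetime, time


def in_working_hours(now, start=time(9, 0), end=time(15, 0)):
    """Return True on Monday through Friday, from `start` up to (not including) `end`.

    The default window is an illustrative assumption.
    """
    return now.weekday() < 5 and start <= now.time() < end


# Only let the monkey loose when staff are around to respond.
if in_working_hours(datetime.now()):
    print("Monkey may run now.")
else:
    print("Outside working hours; monkey stays caged.")
```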

And, in fact, the reason for throwing this monkey into the works is to avoid those 3am calls. And to get used to planning for failure.

This is where the re-thinking comes into play. The new environment for technology is one of enormous data volumes, not to mention millions or even billions of users, where systems need to be:

  • available 24x7x365 without a maintenance window 
  • flexibly scalable, ideally linearly
  • based on commodity hardware, subject to failure and outage
  • resilient within this context; failover should be automatic and transparent to the user

Instead of designing with the assumption of avoiding component failure at all costs, the Netflix approach says we deal with failure like a RAID system. We build using cheap commodity hardware. That's the "I" in RAID, by the way (Inexpensive). We build in a lot of redundancy (that's the "R"). Then we automate failover and make it transparent to the user. But in this new approach, we need to know how dependable our system is.
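
A bit of arithmetic shows why redundancy buys so much, under one big assumption: that replicas fail independently. If each replica is down with probability p, then all n of them are down at once with probability p to the power n. The function name and the sample numbers below are illustrative only.

```python
def chance_all_down(p_down, replicas):
    """Probability that every replica is down at the same moment.

    Assumes each replica fails independently with probability `p_down`.
    That independence is an assumption, and often an optimistic one.
    """
    if not 0.0 <= p_down <= 1.0:
        raise ValueError("p_down must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p_down ** replicas


# Compare one, two, and three replicas that are each down 1% of the time.
for n in (1, 2, 3):
    print(n, chance_all_down(0.01, n))
```

With p = 0.01 and three replicas, that works out to roughly one in a million. Real outages are often correlated (a bad deploy, a whole zone going dark), which is exactly why the figure needs to be measured rather than assumed.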

Unless something is measured, it is an unknown quantity. The more we deal with failure on a regular basis, the more prepared we are for the unexpected. It's like regular fire drills. Or terrorism simulations. 

So here is the paradoxical need for the Chaos Monkey. The real threat is that commodity hardware is too dependable. It's so dependable that it can lull us into not properly planning for failure. But our whole approach, cheap commodity hardware with full, 100% uptime, is predicated on regular failure. Solution: produce failure regularly, but at random.

Perfect example of putting new wine into new bottles.

You can read more about all the members of the Simian Army here

And just last month, they released the Cassandra Chaos Monkey, so that NoSQL database instances can experience the same chaos as your other tiers. The announcement is here

I Remain,

TheHackerCIO