Netflix Releases Chaos Monkey Into the Wild
Netflix has released its Chaos Monkey tool as open source. Chaos Monkey is a testing utility that randomly shuts down virtual machine instances across large systems, forcing those systems to be built with a high degree of resilience. Chaos Monkey was built on Amazon Web Services, but its design is flexible enough for it to be used with cloud providers other than Amazon and with other instance groupings, Netflix said.
07/31/12 5:00 AM PT
Netflix on Monday released the source code for its Chaos Monkey tool into the open source jungle.
Chaos Monkey is a tool the company built in the Amazon Web Services (AWS) system. Its job is to randomly kill virtual machines and services within Netflix's architecture.
The idea is to test systems in real-world conditions. Chaos Monkey's ability to cause frequent failures forces Netflix's engineers to ensure their infrastructure is built to be resilient.
"Crafting automated services that regularly check and test the status of a set of infrastructure in the cloud using an API is a preferred practice to ensure optimal performance while minimizing financial risk," said Michael Sheehan, technology evangelist at GoGrid.
"My wife does a lot of stuff on application performance management, and once an app drifts off into the cloud, it's very hard to track where it is and what problems it's facing, and this kind of tool sounds excellent to me," Joe Clabby, president of Clabby Analytics, told LinuxInsider.
Amazon did not respond to our request for comment for this story.
Going Wild in the Cloud
Chaos Monkey runs in AWS. It seeks out Auto Scaling Groups (ASGs) and terminates virtual machines randomly in each group.
Auto Scaling lets users scale their Amazon Elastic Compute Cloud (EC2) capacity up or down as required. An Auto Scaling Group is a collection of Amazon EC2 instances to which users apply a set of scaling conditions.
An instance is a virtual machine.
When Chaos Monkey terminates an instance, the ASG should detect it and automatically bring up a new instance configured identically to the one that was taken out.
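The terminate-and-replace cycle can be illustrated with a small in-memory simulation. This is only a sketch of the idea, not Netflix's code and not the AWS API: the `AutoScalingGroup` class, its `reconcile` method, the `terminate_random_instance` helper and the instance ID format are all hypothetical names invented for the example.

```python
import itertools
import random
from dataclasses import dataclass, field

# Hypothetical ID generator; real EC2 instance IDs look different.
_ids = itertools.count(1)


@dataclass
class AutoScalingGroup:
    """Toy stand-in for an ASG: a named pool with a target size."""
    name: str
    desired_capacity: int
    instances: list = field(default_factory=list)

    def reconcile(self):
        """Launch replacements until the group is back at desired capacity."""
        launched = []
        while len(self.instances) < self.desired_capacity:
            instance_id = f"i-{next(_ids):04d}"
            self.instances.append(instance_id)
            launched.append(instance_id)
        return launched


def terminate_random_instance(group, rng):
    """Remove one randomly chosen instance, as the monkey would."""
    if not group.instances:
        return None
    victim = rng.choice(group.instances)
    group.instances.remove(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random()
    group = AutoScalingGroup(name="api-servers", desired_capacity=3)
    group.reconcile()                      # initial launch
    victim = terminate_random_instance(group, rng)
    replacements = group.reconcile()       # the group heals itself
    print(f"terminated {victim}, launched {replacements}")
```

In the simulation, the group always returns to its desired capacity after a termination, which is the behavior a resilient service depends on.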
Take Two Bananas and Call Me in the Morning
Chaos Monkey allows for an opt-in or opt-out model. Netflix uses the opt-out model so that if an application owner doesn't respond, the tool acts on the application. Chaos Monkey has a tunable probability that it uses to control the chance of a termination. A probability of one, which is 100 percent, will terminate one instance per day per ASG. Users can set the probability to the level required.
Further, Chaos Monkey has a configurable schedule that runs by default on non-holiday weekdays between 9 a.m. and 5 p.m. so that engineers will be able to respond as needed.
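A rough sketch of how a tunable probability and a configurable schedule could combine into a single daily decision per ASG follows. It assumes one random draw per group per day, which is a simplification made for illustration and not a description of Chaos Monkey's internals. The `Schedule` class and `should_terminate` function are hypothetical names, and the active days and holidays are passed in as parameters rather than hard-coded.

```python
import random
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Schedule:
    """Hypothetical schedule: when terminations are allowed to happen."""
    active_days: frozenset               # datetime.weekday() values, 0 = Monday ... 6 = Sunday
    holidays: frozenset = frozenset()    # dates on which nothing is terminated
    open_hour: int = 9                   # 9 a.m., as in the article
    close_hour: int = 17                 # 5 p.m., as in the article

    def allows(self, now: datetime) -> bool:
        return (
            now.weekday() in self.active_days
            and now.date() not in self.holidays
            and self.open_hour <= now.hour < self.close_hour
        )


def should_terminate(probability, schedule, now, rng):
    """Decide whether one instance in a group is terminated today.

    Assumption: a single draw per group per day, so a probability of 1.0
    always yields one termination and 0.0 never does.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    if not schedule.allows(now):
        return False
    # rng.random() returns a float in [0.0, 1.0)
    return rng.random() < probability
```

Because the draw is strictly less than the probability, a setting of 1.0 always fires inside the allowed window and a setting of 0.0 never does, matching the "one instance per day per ASG" reading of a 100 percent probability.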
Chaos Monkey terminated more than 65,000 instances in Netflix's production and testing environments over the past year. Although most of them didn't cause noticeable problems, those that stood out could be isolated and resolved so they wouldn't recur.
Simple Precautions for the User
Chaos Monkey's design is flexible enough for it to be used with cloud providers other than Amazon and with other instance groupings, Netflix said. The tool can be extended to add that support.
However, "Each cloud IaaS [Infrastructure as a Service] provider employs different methodologies for creating, controlling and destroying its cloud infrastructure," GoGrid's Sheehan told LinuxInsider.
Although standards are being developed, tested and used, the concepts and tactics Chaos Monkey uses might not be sound for CloudStack, OpenStack, GoGrid or other cloud infrastructure vendors, Sheehan said.
Before implementing Chaos Monkey or any type of automation, users should back up, clone and document existing instances, because "automation ... can also be detrimental if not thoroughly tested and explored prior to implementation." The quality assurance (QA) testing process should also be expanded.
FOSS-tering Out the Code
In the past, Netflix hasn't exactly endeared itself to the Linux community. The company brought its streaming to Linux relatively late compared to the Mac and Windows platforms.
That raises the question of whether Netflix is trying to make up with the FOSS community.
Perhaps that prompted Netflix to point out that it uses a wide range of open source technologies and that it has released many of its internally developed components and libraries, starting with Curator for ZooKeeper and, recently, Asgard.
Netflix said it's seeking to pay back the open source community by releasing its source code. More is on the way.
Netflix did not respond to our request for further comment for this story.