Chaos Engineering: Building Immunity in Production Systems
Modern software-based services are implemented as large-scale, highly distributed systems running in the cloud or in data centers. Disruptive real-world events such as hardware failures or software bugs can create turbulent conditions in the environments where these systems run, and can lead to unpredictable outcomes. Chaos Engineering is the study of a system’s ability to withstand such turbulent conditions. It works by purposefully injecting failures that mirror actual failure modes into the production environment and monitoring how the system recovers.
Chaos Engineering uses experimentation to study the effects of such disruptions. An experiment typically starts by defining the “steady state” of the system and choosing metrics that can be used to measure it. Events that mirror the failure modes possible in the production environment (the “chaos”, e.g. a server crash) are then injected into the system systematically and in a controlled manner.
The effect of the injected “chaos” is observed by collecting and analyzing the metrics identified above. If the system recovers successfully, this builds confidence in its ability to handle an actual unplanned outage.
If the system fails to recover, that failure becomes a target for improvement before the behavior manifests in the system at large. Running these experiments continuously makes it possible to identify several such vulnerabilities, and fixing them strengthens the system over time. Extensive monitoring and logging are essential for Chaos Engineering to achieve its goal of improving the resiliency of the system.
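The experiment loop described above (measure the steady state, inject a failure, observe the same metric, roll back) can be sketched in a few lines of Python. This is only an illustrative toy, not a real chaos tool: the `Cluster` class, the `success_rate` metric, the `run_experiment` helper and the 1% tolerance are all hypothetical names and values chosen for this example, and the "failure" is simulated in memory rather than injected into a real system.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a replicated service (hypothetical, for illustration)."""
    replicas: int = 3
    down: set = field(default_factory=set)  # indices of crashed replicas
    retry: bool = True                      # do clients retry on another replica?

    def handle_request(self, rng: random.Random) -> bool:
        first = rng.randrange(self.replicas)  # load balancer picks a replica
        if first not in self.down:
            return True
        if not self.retry:
            return False
        # A retry succeeds as long as at least one replica is still healthy.
        return len(self.down) < self.replicas


def success_rate(cluster: Cluster, requests: int = 1000, seed: int = 0) -> float:
    """Steady-state metric: fraction of requests served successfully."""
    rng = random.Random(seed)
    ok = sum(cluster.handle_request(rng) for _ in range(requests))
    return ok / requests


def run_experiment(cluster: Cluster, kill_replica: int, tolerance: float = 0.01) -> dict:
    baseline = success_rate(cluster)        # 1. measure the steady state
    cluster.down.add(kill_replica)          # 2. inject the failure ("chaos")
    try:
        observed = success_rate(cluster)    # 3. observe the same metric
    finally:
        cluster.down.discard(kill_replica)  # 4. always roll the failure back
    return {
        "baseline": baseline,
        "observed": observed,
        # Hypothesis: the metric stays within `tolerance` of the steady state.
        "hypothesis_holds": baseline - observed <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment(Cluster(retry=True), kill_replica=0))
    print(run_experiment(Cluster(retry=False), kill_replica=0))
```

In this sketch the cluster with client retries keeps its success rate after a replica crash, so the hypothesis holds, while the cluster without retries drops well below the tolerance and would be flagged as a vulnerability to fix. A real experiment would replace the simulated metric with production monitoring data and the in-memory crash with an actual fault injection.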
Outline/Structure of the Talk
- Introduction: What is Chaos Engineering?
- A Brief History: Available tools & methodologies
- Building a Hypothesis around Steady-state behavior
- Understanding the Failure Modes & coming up with Chaos definitions
- Observing the system: Defining & Measuring the metrics
- Setting up Control & Experimental Groups
- Running the experiments in Production
- Automating Experiments to run Continuously: “Chaos Monkey”
- Collecting feedback & Using it to improve the system
- Final remarks & Conclusion
- Various techniques to improve the resiliency of your systems
developers, infrastructure engineers, technical leads, architects