How resilience testing is the path to a stable infrastructure

Most of us rely on some form of backup and failover for when one or more components fail, but are you sure these backups will kick in when a failure happens?

Resilience testing is based on the principle that bugs and failures will always creep into your infrastructure despite your best efforts.

Human error, ever-changing load, and a changing environment will affect our setup and, if the setup is not tested regularly, will lead to unsatisfactory results.

To make sure our system always works as expected, we need to embrace the concept of resilience testing.


Outline/Structure of the Experience Report

Infrastructure resilience is the ability of an infrastructure to continue operating even when there has been an equipment failure, power outage, or other disruption.
Resilience testing is the practice of subjecting your infrastructure to failures, attacks, and other scenarios so that it is always ready to face potential problems and threats, and you can provide a reliable, always-running service to your customers.
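
To make the idea concrete, here is a minimal sketch of a failure-injection experiment in Python. It is a toy model, not a real tool: the Node and FailoverService classes and the run_failure_experiment helper are hypothetical names invented for illustration, and a real experiment would stop an actual server or process instead of flipping a flag.

```python
class Node:
    """Hypothetical stand-in for a server; a real check would call a health endpoint."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class FailoverService:
    """Tries nodes in order and returns the first successful response."""

    def __init__(self, nodes):
        self.nodes = nodes

    def handle(self, request):
        for node in self.nodes:
            try:
                return node.handle(request)
            except ConnectionError:
                continue  # this node failed, fall through to the next one
        raise RuntimeError("all nodes are down")


def run_failure_experiment(service, victim, request="ping"):
    """Inject a failure into `victim`, check the service still answers, then restore it."""
    victim.healthy = False
    try:
        service.handle(request)
        return True
    except RuntimeError:
        return False
    finally:
        victim.healthy = True  # always undo the injected failure


primary, backup = Node("primary"), Node("backup")
service = FailoverService([primary, backup])
survived = run_failure_experiment(service, primary)
print("failover works:", survived)
```

Running an experiment like this on a schedule, rather than once before launch, is what turns a failover you hope works into one you have seen work.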

In this talk I will start off with a few real-life experiences where resilience testing could have saved us from potential disasters.

Then we will talk about some of the big tech companies that invest heavily in resilience testing, and the techniques they use to implement it, make their systems more secure, and increase their uptime. We will also discuss popular tools like Simian Army.

After this we will take a simple infrastructure and discuss resilience testing techniques that we can implement on it.

I have attached the slides from my recent talk at DevOpsDays India 2015. I will use a lot of images in my talk, as you can see in the slides.

Learning Outcome

DevOps teams focus on setting up infrastructure but not on testing it regularly.

For example, load testing is mostly done when we are about to go to production, but people rarely do it again, even 4-5 months into production.

All this leads to potential downtime. With the addition of resilience testing, the infrastructure will keep improving. From the experiences shared during the talk and the techniques for avoiding such failures, DevOps engineers will be able to make their infrastructure more resilient. Failures will be few, and the customer experience will be as expected.

Target Audience

DevOps engineers

