How resilience testing is the path to a stable infrastructure

Most of us rely on some form of backup or failover for when one or more components fail, but are you sure these backups will kick in when a failure actually happens?

Resilience testing is based on the principle that bugs and failures will always creep into your infrastructure despite your best efforts.

Human error, ever-changing load, and a changing environment will affect our setup, and if the setup is not tested regularly, the results will be unsatisfactory.

To make sure our systems always work as expected, we need to embrace the concept of resilience testing.


Outline/structure of the Session

Infrastructure resilience is the ability of an infrastructure to continue operating even when there has been an equipment failure, power outage, or other disruption.
Resilience testing is the practice of subjecting your infrastructure to failures, attacks, and other scenarios so that it is always ready to face potential problems and threats, and so that you can provide a reliable, always-running service for your customers.

In this talk I will start off with a few real-life experiences where resilience testing could have saved us from potential disasters.

Then we will talk about some of the big tech companies that invest heavily in resilience testing and the techniques they use to implement it, make their systems more secure, and increase their uptime. We will also discuss popular tools like Simian Army.

After this we will take a simple infrastructure and discuss resilience testing techniques that we can implement on it.
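As a rough illustration of the idea (not taken from the talk or its slides), the sketch below models a toy infrastructure of three servers behind a simple failover and injects one failure per round to check whether the service still answers. The names `Server`, `Service`, `kill_random_server`, and `run_experiment` are hypothetical and exist only for this example; real tools work against actual machines, not in-memory objects.

```python
import random
from dataclasses import dataclass


@dataclass
class Server:
    """A hypothetical server that can be marked as down."""
    name: str
    healthy: bool = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


@dataclass
class Service:
    """A toy service: servers are tried in order (simple failover)."""
    servers: list

    def handle(self, request: str) -> str:
        for server in self.servers:
            try:
                return server.handle(request)
            except ConnectionError:
                continue  # fail over to the next server
        raise RuntimeError("all servers are down")


def kill_random_server(service: Service, rng: random.Random) -> Server:
    """Inject a failure: mark one randomly chosen healthy server as down."""
    healthy = [s for s in service.servers if s.healthy]
    if not healthy:
        raise ValueError("no healthy servers left to kill")
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


def run_experiment(service: Service, rounds: int, rng: random.Random) -> list:
    """Kill one server per round and record whether the service still answers."""
    results = []
    for i in range(rounds):
        victim = kill_random_server(service, rng)
        try:
            service.handle(f"request-{i}")
            survived = True
        except RuntimeError:
            survived = False
        results.append((victim.name, survived))
        if not survived:
            break  # nothing left to test once the service is fully down
    return results


if __name__ == "__main__":
    service = Service([Server("web-1"), Server("web-2"), Server("web-3")])
    for name, survived in run_experiment(service, 3, random.Random()):
        print(f"killed {name}: service {'survived' if survived else 'FAILED'}")
```

The point of the sketch is the loop in `run_experiment`: a failure is caused on purpose, and the result is checked immediately, instead of waiting for a real outage to reveal whether the failover works.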

I have attached the slides from my recent talk at DevOpsDays India 2015. I will use a lot of images in my talk, as you can see in the slides.

Learning Outcome

DevOps teams focus on setting up infrastructure but not on testing it regularly.

For example, load testing is mostly done when we are about to go to production, but people rarely do it again, even 4-5 months into production.

All of this leads to potential downtime. With the addition of resilience testing, the infrastructure will keep improving. Using the experiences shared during the talk and the techniques for avoiding such failures, DevOps engineers will be able to make their infrastructure more resilient. Failures will be fewer, and the customer experience will be as expected.

Target Audience

DevOps engineers

Submitted 3 years ago

Comments
  • Naresh Jain
    By Naresh Jain  ~  2 years ago

    Hi Vipul,

    This is a very important topic, which does not get much attention. IMHO if you turn this into an experience report or a case study, where you can talk about how you guys first realised the need for resilience testing, how you went about setting it up (with good technical insights) and how it has helped you... that would really help the participants walk away with a nice story to remind them of the importance of resilience testing.

    If you agree, please update the proposal accordingly.

    • Vipul Sharma
      By Vipul Sharma  ~  2 years ago

      Hey Naresh,

      I completely agree with your point. In fact, I gave the talk the same way at DevOpsDays India 2015.

      We started off with a few real-life examples where resilience testing could have saved us from potential disasters. Most of these situations are from personal experience.

      Then, after establishing the need for resilience testing, we discussed the modern tech giants who actively do resilience testing. In the end we took a sample infrastructure and discussed a few simple resilience testing techniques to make it more secure and resilient to potential disasters.

      I will add these details to the proposal.

      • Naresh Jain
        By Naresh Jain  ~  2 years ago

        Thanks for the prompt response, Vipul. Do you think 20 mins would be sufficient for this talk? Also, can you please update the Session Type to experience report?

        • Vipul Sharma
          By Vipul Sharma  ~  2 years ago

          Hey Naresh,

          20 min will be enough for my talk. I had the same time limit for my DevOpsDays 2015 talk.

          I have updated the session type as requested.

  • Joel Tosi
    By Joel Tosi  ~  3 years ago

    Hi Vipul,

    Have you presented this session before?  If so, could you please link to the video and / or slide deck so we can get an idea of your presentation style?


    Also, if someone attends your session, what would they be able to do when they get back to work?  Are you talking about how to start with resiliency testing or just why it is important?


    Thanks much,


    • Vipul Sharma
      By Vipul Sharma  ~  3 years ago


      I have presented this session at DevOpsDays India 2015. Check schedule here

      The video for the talk hasn't been released yet.

      I am attaching the link to the slides which I used in the DevOpsDays presentation.

      The goal of this session is to introduce the audience to resilience testing and help them add it to their systems. I will be talking about why resilience testing is important and what tools we can use to implement it in our infrastructure, and in the end I will take a sample infra and apply different tools to it to make it more resilient.


  • Liked Sankalp Singh

    Sankalp Singh / Shobhit Srivastava - Mopping up network nightmares

    20 Mins
    In today's world, with modern tools and technologies, we are getting more and more abstracted from the underlying networking concepts. But these concepts are still essential and relevant when we work in a real-life situation.
    We will look at 3 very different infrastructures in terms of tools and technologies, with seemingly entirely unrelated problems. But we will see how networking is the simple key which unravels each of these mysteries.