Confused Tester in Chaotic World #ChaosTesting

Bengaluru | Oct 16th, 02:45 - 03:05 PM | Mysore Hall 2

"You can’t legislate against failure, focus on fast detection and response."

You can think of this as a fairy tale:

Once upon a time, in theory, everything worked perfectly, and we had a plan to survive the disasters we had thought of in advance.

But the big question is: how did that work out?

We are here to answer that big question in this session.

While it is possible to sit down and anticipate some of the issues you can expect when a system fails, knowing what actually happens is another thing entirely.

What you prepare for really depends on your tolerance for failure and on the likelihood of each failure actually happening.

The result is that you are forced to design and build highly fault-tolerant systems that can withstand massive outages with minimal downtime.

The prevailing wisdom is that you will see failures in production; the only question is whether you'll be surprised by them or inflict them intentionally to test system resilience and learn from the experience. The latter approach is chaos engineering.

A key aspect of Chaos Engineering is Chaos Testing.

Historically, the emphasis has always been on mean time to failure (MTTF): working hard to extend the time between system failures, with little emphasis on how fast a failure could be corrected.

In today's world, the emphasis needs to shift to mean time to recover (MTTR), minimizing the time it takes to recover from a failure.

At a high level, chaos testing is simply creating the capability to continuously, but randomly, cause failures in your production system. This practice is meant to test the resiliency of the systems and the environment, as well as to determine the MTTR.

Adopting chaos testing will improve your MTTR, improve organizational confidence in the resiliency of your production environment, and keep you out of tomorrow's headlines.
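To make this concrete, the heart of a chaos-testing harness is essentially a loop like the sketch below (Windows PowerShell, purely illustrative; the host names and service name are placeholders, not the actual tooling used in the case study):

    # Illustrative skeleton: continuously, at random intervals, inject a failure
    # into a randomly chosen node and record how long recovery takes (MTTR).
    $targets = @('app-node-01', 'app-node-02')          # placeholder host names
    $service = 'CallProcessingService'                   # placeholder Windows service name

    while ($true) {
        $target = Get-Random -InputObject $targets
        $clock  = [System.Diagnostics.Stopwatch]::StartNew()

        # Inject the failure: abruptly stop the service on the chosen node.
        Get-Service -Name $service -ComputerName $target | Stop-Service -Force

        # Wait until the service is running again (failover / self-healing), then log MTTR.
        while ((Get-Service -Name $service -ComputerName $target).Status -ne 'Running') {
            Start-Sleep -Milliseconds 500
        }
        $clock.Stop()
        "$target recovered in $($clock.Elapsed.TotalSeconds) s" | Out-File -Append mttr.log

        # Random gap before the next experiment.
        Start-Sleep -Seconds (Get-Random -Minimum 600 -Maximum 3600)
    }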

This is a case study showcasing how, in the real world, we can handle failures by testing proactively instead of waiting for an outage.

The product under study here is one of the key products serving major contact centers across the globe.

The impact of an outage in a contact center with 40K+ agents, especially during peak seasons, is huge. Contact centers are considered the backbone of industries like e-commerce, telecom, and travel, and they deal directly with people.

We will show how we ensured a seamless takeover between contact centers across the globe even when an entire high-availability contact center goes down, how established calls at a scale of many thousands recover in a matter of milliseconds, and how all of this was achieved by continuously testing the unknowns in controlled environments.

 
 

Outline/Structure of the Experience Report

  1. Case Study - What we learnt while testing a leading contact center product in production - 10 Mins
  2. What are the advantages and challenges of Chaos Testing? - 8 Mins
  3. Feedback, Q&A, Reflection - 2 Mins

Learning Outcome

  1. What is Chaos Testing?
  2. Should we be doing Chaos Testing?
  3. How to start Chaos Testing?
  4. What are the advantages of Chaos Testing?
  5. MTTF/MTTR
  6. A story of how we identified the need for testing in production and what we learnt from it.
  7. We will not tell you "this is how you should do it", but I hope you will find some of the lessons useful in your context.

Target Audience

Anybody and everybody who is part of a professional career

Prerequisites for Attendees

- Come with an open learning mind

- Ask questions with focus on discovery and learning

- Participate fully in activities - this is a co-learning session, and I am not an expert giving a one-way presentation.

- Share experiences

Submitted 10 months ago

Public Feedback

Suggest improvements to the Speaker
  • Leena S N  ~  6 months ago

    Hi Ashish,

    The topic seems interesting. But the overall proposal looks more like a "how-to" of Chaos testing which may not be very useful for the audience.

    Can you change the proposal to present it as an experience report? How Chaos testing helped you and your team? What impacts it had on the product and the business?

    Thanks,

    Leena

     

    • Ashish Kumar  ~  6 months ago

      Hi Leena,

      Thanks for the feedback. The thought behind keeping it more of a "how-to" was that this is not a very popular concept right now, and the audience might be interested in learning the concept.

      Based on your feedback, I will update the proposal from a talk to more of an experience report.

      Regards,

      Ashish

      • Naresh Jain  ~  5 months ago

        Hi Ashish,

        I don't see the updated proposal. Can you please check?

        Also, please make sure you provide specific examples to help the review team understand the depth of the experience.

        • Ashish Kumar  ~  6 months ago

          Hi Naresh,

          I have updated the proposal and look forward to feedback from you and the review team to make it even better.

          Regards,

          Ashish

          • Naresh Jain  ~  5 months ago

            Thanks, Ashish.

            You mentioned:

            How we ensured a seamless takeover between contact centers across the globe even when an entire high-availability contact center goes down, and how established calls at a scale of many thousands recover in a matter of milliseconds

            This is exactly the kind of topic we are looking for. However, the proposal lacks any technical details on how you did Chaos Testing, what tools you used, etc. Where can I find out more about this?

            • Ashish Kumar  ~  5 months ago

              Hi Naresh,

              Thanks for taking the time to look into this topic and give your feedback.

              I thought at first that it would be too much information for the abstract, so I did not go into details.

              Let me take this opportunity to chalk out the picture for you.

              In a contact center, the router is the brain of the entire solution. Almost all enterprise contact center applications come with redundant/HA/duplex sides, where one side acts as the Active node and does the call processing, while the other stays connected to the Active node and receives all the events.

              Chaos Goal:

              During a high-volume call load on the contact center, there can be unanticipated issues such as a network communication problem with the duplex partner (the fault could be at the active call-processing node or at the inactive node) or with other dependent components (public/private), a burst of calls beyond capacity, or latency issues.

              In any of these scenarios, contact center functionality should not be impacted with respect to existing/active calls or new call processing, and there should be no impact on the agents.

              Chaos Experiments:

              We simulate real call load in our labs and in production, then execute failure scenarios to validate the impact on the contact center and measure the mean time to recover, i.e. the time for the other side to go Active, take over the existing calls, and handle the new calls.

              Here are a couple of examples (a minimal scripted sketch of one such experiment follows after them):

              As already mentioned, each application is duplex, with one side active and the other on standby.

              ○ Using the VMware APIs, we cut the communication between the active node and the standby node, or we disable the network at the Active call-processing node on a random basis (similar in spirit to wire-level mocking).

              Expected Outcome/Behavior: In all scenarios there should be no call drops, the other side should take over seamlessly, and there should be no split-brain (this, of course, depends on how the product is designed).

              ○ Using the VMware APIs, we invoke commands to cut the communication to one complete site (for example, in a geographically distributed deployment).

              Expected Outcome/Behavior: The other side takes over seamlessly.

              ○ Using internally developed tools and PowerShell scripting, we invoke commands to randomly kill the process/.exe on the Active node.

              Expected Outcome/Behavior: An automatic whisper should play to the agents, and the standby should take over.

              Record the MTTR.

              ○ Push 10x excess load/traffic onto the components using internal tools, beyond the capacity of each component (simulating an attack intended to take the contact center down), to check the stability of the application.

              Expected Outcome/Behavior: If the setting is 18 cps (calls per second) and the inflow is 200 cps, the excess calls should be rejected.

              The application should reject the excess call volume through IVR or disconnect it straight away (depending on configuration).

              The aim is to ensure the contact center and the agents are not brought down.

              ○ Introduce latency between components using simulators while the systems are running at peak load.

              ○ Abruptly fail power on the VMs that are Active nodes and on the ESXi hosts, both via the VMware APIs and physically.

              In all of the above conditions, we do not announce the failure to the application; still, the contact center should continue with no downtime/outage.
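              To give a flavour of how such an experiment can be scripted, here is a minimal sketch. The proposal mentions the VMware APIs and PowerShell; VMware PowerCLI is one way to combine them and is an assumption here. The vCenter address, VM names, and process name are placeholders, and the recovery check is a simple network probe standing in for the product's real "standby has gone Active" health check.

                  # Minimal chaos-injection sketch (assumes VMware PowerCLI; all names are placeholders).
                  Connect-VIServer -Server 'vcenter.example.local'

                  $activeNode  = Get-VM -Name 'router-active-01'
                  $standbyNode = 'router-standby-01'

                  # Pick one fault at random, in the spirit of the experiments above.
                  $fault = Get-Random -InputObject @('DisconnectNic', 'PowerOff', 'KillProcess')
                  $clock = [System.Diagnostics.Stopwatch]::StartNew()

                  switch ($fault) {
                      'DisconnectNic' {
                          # Cut network communication from the active node, so the duplex partner loses contact.
                          Get-NetworkAdapter -VM $activeNode |
                              Set-NetworkAdapter -Connected:$false -Confirm:$false
                      }
                      'PowerOff' {
                          # Abrupt power failure of the active node.
                          Stop-VM -VM $activeNode -Kill -Confirm:$false
                      }
                      'KillProcess' {
                          # Kill the call-processing process inside the guest (process name is a placeholder).
                          Invoke-VMScript -VM $activeNode -ScriptType Powershell `
                              -ScriptText 'Stop-Process -Name "router" -Force' `
                              -GuestCredential (Get-Credential)
                      }
                  }

                  # Stand-in recovery check: a real harness would poll the product's own "which side
                  # is Active" API; here we simply wait until the standby host answers a ping.
                  while (-not (Test-Connection -ComputerName $standbyNode -Count 1 -Quiet)) {
                      Start-Sleep -Milliseconds 200
                  }
                  $clock.Stop()
                  "MTTR for '$fault': $($clock.Elapsed.TotalSeconds) s" | Out-File -Append chaos-mttr.log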

              These are our learnings from various successful chaos experiments, and we are now heading in the direction of deploying tools from the Simian Army. Specifically, we are starting to run a few experiments with Chaos Monkey and Latency Monkey.

              Please let me know if this information is sufficient or if you are looking for anything specific.

              • Naresh Jain  ~  5 months ago

                Thank you, Ashish. Can you please elaborate on: What are the advantages and challenges of Chaos Testing?


  • Liked Ashish Kumar

    Ashish Kumar - Shifting Gears for better quality or faster delivery - Shift Left or Shift Right!

    Ashish Kumar
    Agile & Lean Coach
    Siemens
    10 months ago
    Sold Out!
    45 Mins
    Case Study
    Intermediate

    In an Agile world, we are being asked to move faster, reducing the time to delivery while continuing to improve quality. At the same time, we are faced with increased pressure to reduce testing costs. The main aim of shifting left is to move from early defect detection to defect prevention.

    Bugs are cheaper when caught young. As a testing professional at the start of my career, I saw an agile transformation from ground zero and experienced everything a tester typically faces at the start of an agile way of working.

    The most difficult phase is the shift from a defect identification/detection mode towards a more collaborative approach of defect prevention.

    The tester was no longer the quality police and no longer solely responsible for quality; quality became the whole team's responsibility. Development teams need to focus on quality from the beginning, instead of waiting for errors and bugs to be discovered late in the game.

    These learnings, and many more from early in my career, helped me when I started coaching agile teams to deliver better-quality products. More recently, while leading a transformation, one of the major asks was to break the boundaries and silos between the development and system testing teams. This topic is close to my heart because I have practiced it, performed it, and tried to perfect it time and again.

    I want to share how "shift-left" testing can help your product quality and what it means for the testing community: people with vastly different skill sets are getting involved in the testing process. More specifically, it means that development teams are being incorporated into the testing cycle earlier than ever before.
