As we continue to experience multiple pandemic waves, we are quickly learning to adapt to and sustain new ways of living while building on the opportunities created by the crisis. Today, users expect IT systems to be ‘Always-on’: to treat chaos as part of Business-As-Usual (BAU) while staying healthy enough to recover from failures.

This shift in user expectations demands a rigorous move towards resilience engineering practices to build anti-fragile applications. A smart balance of proactive (shift-left) and post-production (shift-right) strategies is required to ensure the application is designed and built with resiliency first, instead of treating reactive chaos testing as a quality gate ritual before a production release.

To facilitate an early understanding of system recovery characteristics, we need to develop a resilience culture of experimenting and failing fast, early in the life cycle, and to implement chaos engineering activities as part of the delivery pipeline. A robust observability platform with built-in intelligence (AI/ML) that traces, correlates, and reports the impact of failures on critical business processes gives the product engineering team the opportunity to proactively detect, diagnose, and improve system resiliency.

Our approach combines early chaos testing with a low blast radius and ongoing failure attacks (across application, network, and infrastructure) on the production environment. It has helped systems recover from real-time failures while reducing MTTR by more than 50%. Resiliency improves further over time as more intensive failure experiments are carried out in both production and test environments and self-healing automation is employed.

Managing stringent SLOs to meet ever-increasing user demand, through continuous monitoring of service level indicators, meaningful management of error budgets, and toil automation, all powered by Site Reliability Engineering (SRE) principles, helps maintain customer trust and preserve the promise of high availability despite a complex, distributed, multi-cloud ecosystem.
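To make the error budget idea concrete, here is a minimal sketch of the arithmetic behind an availability SLO. The function names, the 30-day window, and the request counts are illustrative assumptions, not part of any specific SRE tooling discussed in the session.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO allows over a window.

    `slo` is a fraction such as 0.9999; the 30-day default is an assumption.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLI.

    The SLI is taken to be good_events / total_events. A negative result
    means the budget has been overspent.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.99% SLO leaves only a few minutes of downtime per 30 days.
    print(f"Budget: {error_budget_minutes(0.9999):.2f} min / 30 days")
    # Hypothetical traffic: 50 failed requests out of 1,000,000.
    print(f"Remaining: {budget_remaining(0.9999, 999_950, 1_000_000):.0%}")
```

A team could use a figure like the remaining budget to decide how aggressive the next production chaos experiment should be: plenty of budget left suggests room for a wider blast radius, while a nearly spent budget argues for pausing experiments.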

In this tutorial session, you will learn about the practical challenges faced during the culture transformation journey of a Financial Fintech. Join the session to see how marrying chaos engineering practices with SRE practices can help improve Business Resilience.


Outline/Structure of the Tutorial

  • Introduction to Chaos Engineering (10 mins)
  • Drivers for Engineering for Business Resilience (5 mins)
  • Gaps and transformation opportunities in current Chaos Engineering practices (10 mins)
  • The New Normal in Business Process Resilience (10 mins)
  • 3 secret ingredients of best-in-class Resilient Applications (30 mins)
    • Smart balancing of Proactive and Reactive resilience practices across SDLC phases
    • Embracing the power of continuous Observability platform  
    • Site Reliability Engineering practices for assuring high availability 
  • Challenges faced in adapting to the New Normal of Business Process Resilience (10 mins)
  • Experience share – Transforming and building a failure-driven culture for a Financial Fintech (15 mins)

Learning Outcome

  • Best practices for implementing Chaos Engineering principles
  • Knowledge about popular Chaos Engineering tools in the market
  • Shift-left and shift-right practices for assuring business resiliency
  • Benefits of setting up a robust observability platform
  • SRE principles for assuring high availability
  • How to strategize a resilience engineering transformation journey

Target Audience

Product Owners, Architects, Developers, Performance Testers / Engineers, Site Reliability Engineers and DevOps Engineers

Prerequisites for Attendees

  • Basic knowledge of Agile & DevOps
  • Knowledge of Non-Functional Testing / Engineering
  • Basic understanding of various solution architectures
  • Cloud engineering basics



Submitted 1 year ago

  • Ramya Ramalinga Moorthy

    Ramya Ramalinga Moorthy / Sai Subramanian Sivasailem - Building an anti-fragile, highly scalable system to assure Business Resilience

    20 Mins
    Experience Report

    With the increased adoption of cloud-native applications, microservices, and distributed deployments, infrastructure failure points have multiplied, and outages now cause significant business impact. Although these new-age architectures provide the velocity and flexibility to release application features rapidly, isolating faults to triage and debug issues has become extremely complex, resulting in high MTTR.

    Chaos engineering can help assure application resiliency through controlled injection of infrastructure failures, thereby proactively detecting probable outage hotspots and enabling architects to design anti-fragile, resilient systems.

    Conventional chaos engineering experiments are siloed and do not provide E2E business process visibility into impacted service calls and performance bottlenecks. In this session, we will dive deep into how Chaos Engineering tests deliver significantly more value when correlated with application business processes, improving the application’s performance, observability, and resiliency.

    The session will also cover an experience share on transforming and building a failure-driven culture for a Fintech banking customer, where the scope of the engagement was to digitize the wealth management platform using resilience principles to achieve a 99.99% availability SLO.