Practical approach to chaos engineering for startups and SMEs

The principles of chaos engineering have been around for as long as the "resilience" of systems has needed testing, but in the last few years the concept, and the tools and guidelines around it, have gained a lot of exposure, and for good reasons.

Apparently, testing if a system will fail due to a fault is a "thing" now. #sarcasm

In the last couple of years, the very definition of "chaos engineering" has changed in a way that is subtle in nature, but profound in impact.

As per "Principles of Chaos", the 2017 definition mentioned "distributed systems"; in the 2018 update, the word "distributed" is gone, leaving only "systems".

2017: ".... experimenting on distributed systems....."

2018: ".... experimenting on systems....."

This obviously does not mean that chaos engineering has "dropped support for distributed systems" (duh!). Rather, it is a very simple, yet powerful, statement that testing system resilience is not just for distributed systems (aka "the big shots"), but also for relatively smaller-scale systems.

As per March 2019 data, more than 55% of the start-ups founded in the last 5 years are still in business, and 40% of these start-ups are profitable.

In the start-up world, the idea of having a system that will not crash, break, or cause panic (basically "chaos") struggles to compete with the idea of "running it successfully".

The decision to have a resilient system in place is not in the hands of the investors or founders. It is in fact the responsibility of the people who are here (the audience, the organizers, the speakers), for these are the people who have the skill set to do what is needed (#spidermanUncleBenGreatPowerResponsibility).

While working with start-ups and identifying their business challenges, we (at ucreate) make sure that the founders understand the purpose of having a resilient system in place and how it impacts the post-MVP phase.

For Netflix, Medium, or confengine, the cause of chaos could be instance corruption, a cache-server connectivity issue, corrupted or misconfigured DNS records, etc. For an SME, the chaos might have a different projection and scale: "auto-sleep configuration triggered due to low traffic on a cloud" or "the person who deploys the builds during recovery left the organization 6 months back". These are not really the kind of things for which chaos engineering principles can be used as a guide, but the underlying point is that the challenges of practically implementing chaos engineering in start-ups are radically different from what is being done on distributed systems.
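For a small system, a resilience experiment does not need dedicated tooling. The following is only a minimal sketch in Python, not a method taken from the talk: it wraps a dependency call so that it fails on purpose at a chosen rate, then checks that the caller degrades gracefully. All names here (InjectedFault, with_fault_injection, call_with_fallback) are hypothetical and exist purely for illustration.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised by the wrapper to simulate a failing dependency (hypothetical name)."""


def with_fault_injection(
    call: Callable[[], T],
    failure_rate: float,
    rng: Optional[random.Random] = None,
) -> Callable[[], T]:
    """Return a version of `call` that fails with probability `failure_rate`."""
    rng = rng or random.Random()

    def wrapped() -> T:
        # rng.random() is in [0.0, 1.0): a rate of 1.0 always fails, 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return call()

    return wrapped


def call_with_fallback(call: Callable[[], T], fallback: T) -> T:
    """Run `call`; if it raises, return `fallback` instead of crashing."""
    try:
        return call()
    except Exception:
        return fallback


if __name__ == "__main__":
    # Assumed stand-in for a real dependency, e.g. a pricing service lookup.
    flaky_lookup = with_fault_injection(lambda: 42.0, failure_rate=0.5)
    print(call_with_fallback(flaky_lookup, fallback=0.0))
```

Setting failure_rate to 1.0 in a test environment forces the fallback path on every call, which is a cheap way to confirm that a single broken dependency does not take the whole application down with it.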


Outline/Structure of the Talk

1. Why chaos engineering?

2. The real source of chaos - people.

3. How to build systems around chaos.

4. Scaling chaos.

5. When it really happens.

6. When it still happens.

Learning Outcome

1. Understand the challenges of applying chaos engineering principles to start-ups and small and medium-sized enterprises, where the focus is less on preventing system crashes and more on other business vectors.

2. Architect applications with a "chaos-first" (or "chaos-ready") approach.

3. Bridge the gap between "best practices" and "limited practices" of chaos engineering, and learn how to stay agile while moving in the direction of progress.

4. Be humane in realizing that a failure will happen, and that after you fix it, it will still happen.

Target Audience

Developers, DevOps Engineers, Release Engineers, Site Reliability Engineers, QA, Scrum Masters, Tech Evangelists, Cloud Evangelists, CTOs, Founders

Prerequisites for Attendees

While the topic pertains to technical experts who work on making and breaking things, anyone with an interest in understanding the technical and human factors in creating and maintaining applications can attend, provided they bring the adventurer inside them.

Submitted 1 year ago
