Rolling out Error Budgets across a 1000-person global engineering organisation
Zendesk has struggled with reliability since its beginning - in many ways it has been a victim of its own overnight success. Over the last few years we’ve had to take drastic measures to address major outages, such as implementing company-wide change freezes.
These measures hurt when you have 1000 engineers in 120 product development teams across the globe, and in many ways create more risk when the freeze begins to thaw.
To avoid these freezes, we have recently adopted concepts from the Site Reliability Engineering (SRE) discipline, specifically Error Budgets along with SLOs and SLIs. The aim is to “scope” a freeze to those systems that have more reliability issues.
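As a rough illustration of what “scoping” a freeze could look like, here is a minimal Python sketch. It assumes a request-based SLI and a simple rule (freeze a service once its error budget for the window is spent). The service names, numbers, and the rule itself are hypothetical and are not a description of Zendesk's actual tooling or policy.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one service over one SLO window (hypothetical shape)."""
    name: str
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int


def budget_remaining(window: ServiceWindow) -> float:
    """Fraction of the error budget left: 1.0 is untouched, <= 0 is exhausted."""
    # The error budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1 - window.slo_target) * window.total_requests
    if allowed_failures <= 0:
        # A 100% target (or no traffic) leaves no budget to spend.
        return 1.0 if window.failed_requests == 0 else 0.0
    return 1 - window.failed_requests / allowed_failures


def services_to_freeze(windows: list[ServiceWindow]) -> list[str]:
    """Assumed policy: freeze only the services whose budget is exhausted."""
    return [w.name for w in windows if budget_remaining(w) <= 0]


# Illustrative data only: two made-up services with the same 99.9% target.
windows = [
    ServiceWindow("service-a", 0.999, 1_000_000, 500),    # about half the budget spent
    ServiceWindow("service-b", 0.999, 1_000_000, 1_500),  # budget overspent
]
frozen = services_to_freeze(windows)
```

Under these assumptions only `service-b` would be frozen, while `service-a` keeps shipping changes, which is the contrast with a company-wide freeze.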
We’ve had some wins in introducing this approach, but are still very much at the beginning of this journey. This talk tells the story of that journey and offers practical suggestions on tooling and practices.
Outline/Structure of the Talk
1. Who is Zendesk (products we build, the scale of the systems, size of the engineering org)
2. History of our reliability challenges
3. Recent reliability events
4. Refresher on key SRE Concepts (SLIs/SLOs/Error Budgets)
5. How we rolled them out at Zendesk
6. Lessons learnt from the practical application of Error Budgets
7. Where we are on the journey and what we still have to do
Learning Outcome
- Basic concepts of SRE such as Error Budgets / Service Level Indicators / Service Level Objectives
- When and why to introduce SRE Concepts
- How to do it effectively
- The benefits to expect
Target Audience
Engineering managers and executives who are responsible for the reliability of the systems they build
Prerequisites for Attendees
A high-level understanding of SRE (Site Reliability Engineering) concepts