Automating Operations with Machine Learning
How much money would you save if AI could detect and fix your outages as soon as they happen? In a multibillion-dollar business, outages are very expensive. MTTR has a direct effect on the bottom line, so every second counts in resolving issues. But with millions of metrics being generated by thousands of microservices, how do you choose which metrics to pay attention to? How do you make your alerts meaningful to avoid alert fatigue and desensitisation? How do you respond to those alerts in a timely manner?
In this talk, Matt covers how Expedia is using Machine Learning to "close the loop" of detecting, diagnosing and remediating outages post-release. You will learn how to use ML to build anomaly detection models for metrics. You will also learn about "ML-Ops" and how to build a platform for training and deploying ML models.