Automating Operations with Machine Learning

How much money would you save if AI could detect and fix your outages as soon as they happen? In a multi-billion-dollar business, outages are very expensive. MTTR has a direct effect on the bottom line, so every second counts when resolving issues. But with millions of metrics being generated by thousands of microservices, how do you choose which metrics to pay attention to? How do you make your alerts meaningful enough to avoid alert fatigue and desensitisation? How do you respond to those alerts in a timely manner?

In this talk, Matt covers how Expedia is using Machine Learning to "close the loop" on detecting, diagnosing and remediating outages post-release. You will learn how to use ML to build anomaly detection models for metrics. You will also learn about "ML-Ops" and how to build a platform for training and deploying ML models.

 
 

Target Audience

All

Submitted 1 week ago
