Real-Time Anomaly Detection on Telemetry Data Using Neural Networks
Observability is key to quickly detecting and repairing problems in a modern microservice architecture. Observability platforms have evolved beyond simple application logs and now include distributed tracing systems such as Haystack. Combining them with real-time, intelligent alerting that produces accurate alerts helps automate the detection of these problems.
At Expedia, our mission is to connect people to places through the power of technology. To accomplish this, we build and run hundreds of microservices, many of which work together to serve a single customer request. What happens when one or more of these services fail at the same time? We will look at how Expedia identifies failed services in an automated manner to maintain a high quality of service, which has led to huge improvements in our mean time to detect (MTTD) and mean time to know (MTTK).
In this talk, we will present the journey of distributed tracing at Expedia, which started with Zipkin as a prototype and ended with us building our own open-source solution using the OpenTracing APIs. We will take a deep dive into our architecture and demonstrate how we ingest terabytes of tracing data in production for hundreds of microservices and use this data to trend service errors, latencies, and rates. As the number of microservices grew, we felt the need for a real-time, intelligent alerting and monitoring system to help reduce MTTD and MTTK and move towards 24/7 reliability.
Each service's errors follow their own behavioural pattern, so using neural networks to learn the behaviour of each microservice and raise an alert when it changes was a challenging task. It uncovered a few unexpected difficulties, and the solution was less straightforward than we initially estimated. Ultimately, though, the neural-network anomaly detector produced results that beat our expectations, once again validating the industry's growing interest in neurocomputing.
To achieve this, we predict service failures in the microservices using recurrent neural networks on telemetry data and perform anomaly detection on the predicted values. We will show how we train a recurrent neural network and auto-tune its hyperparameters using Bayesian optimization methods. We will also take a deep dive into the architecture of the automated training pipeline and show how anomaly detection works in a streaming manner, using Kafka (Kafka Streams) as the backbone, with the model deployed in the cloud in a cost-effective manner. Finally, we will discuss possible areas for improvement to reduce false positives, including using human intervention as a feedback loop.
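As a rough illustration of "anomaly detection on predicted values", the sketch below compares each observed value with a forecast and flags points whose forecast error is unusually large. It is not Expedia's implementation: the rolling-median forecaster is only a placeholder for the trained recurrent network, and the function names, window sizes, threshold, and sample data are all assumptions made for the example.

```python
from statistics import median, stdev


def placeholder_forecast(history, window=5):
    """Stand-in for the trained recurrent network (an assumption):
    predicts the next value as the median of the last `window` points."""
    return median(history[-window:])


def detect_anomalies(series, warmup=10, threshold=3.0, min_residuals=5,
                     forecast=placeholder_forecast):
    """Return the indices whose observed value deviates from the forecast
    by more than `threshold` standard deviations of earlier residuals."""
    residuals = []   # forecast errors of points considered normal
    anomalies = []
    for i in range(warmup, len(series)):
        residual = series[i] - forecast(series[:i])
        if len(residuals) >= min_residuals:
            spread = stdev(residuals)
            if spread > 0 and abs(residual) > threshold * spread:
                anomalies.append(i)
                continue  # keep anomalous errors out of the baseline
        residuals.append(residual)
    return anomalies


# Hypothetical per-minute error counts for one service, with a spike at index 15.
error_counts = [5, 6, 5, 7, 6, 5, 6, 7, 5, 6, 6, 5, 7, 6, 5, 40, 6, 5, 7, 6]
print(detect_anomalies(error_counts))  # → [15]
```

Swapping `placeholder_forecast` for a call to a trained model leaves the detection step unchanged, which is the reason for separating the forecast from the thresholding in this sketch.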
Outline/Structure of the Talk
- Expedia's business use-cases
- Recurrent Neural Networks - LSTM
- Current Anomaly Detection Methods
- Issues with the current anomaly detection methods
- Expedia’s Five Step Methodology
- High-level architecture of the fully automated training and deployment pipeline
- Using Bayesian optimization to auto-tune hyperparameters of neural networks
- Leveraging Kafka to perform real-time anomaly detection
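The idea behind the last outline item, detecting anomalies per service as events arrive, can be sketched without a broker. The example below is only an approximation of keyed stream processing: a plain Python iterable stands in for a Kafka topic keyed by service name, and the class, function, and service names are hypothetical. Each event carries a forecast error (observed value minus predicted value), and per-key state is kept with Welford's online algorithm so that no history has to be stored.

```python
import math
from collections import defaultdict


class RunningStats:
    """Welford's online mean and variance for one key."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def stddev(self):
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


def detect_stream(events, threshold=3.0, min_samples=5):
    """Consume (service, residual) events in arrival order and yield those
    whose residual is far from that service's running baseline."""
    state = defaultdict(RunningStats)  # one state object per service key
    for service, residual in events:
        stats = state[service]
        spread = stats.stddev()
        is_anomaly = (stats.count >= min_samples and spread > 0
                      and abs(residual - stats.mean) > threshold * spread)
        if is_anomaly:
            yield service, residual
        else:
            stats.update(residual)  # only normal points update the baseline


# Hypothetical interleaved forecast errors for two services.
events = [
    ("search", 0.1), ("booking", -0.2), ("search", -0.1), ("booking", 0.3),
    ("search", 0.2), ("booking", -0.1), ("search", -0.2), ("booking", 0.2),
    ("search", 0.0), ("booking", -0.3), ("search", 0.1), ("booking", 0.1),
    ("search", 9.0), ("booking", 0.0),
]
print(list(detect_stream(events)))  # → [('search', 9.0)]
```

In a real deployment the per-key state would live in the stream processor's state store rather than in an in-memory dictionary; the dictionary here only mimics that behaviour for a single process.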
At the end of the talk, the audience will:
- Understand how to train neural networks and tune hyperparameters for hundreds of time-series metrics in an automated fashion.
- Understand how to leverage Kafka Streams along with neural networks to perform anomaly detection in real time.
- Understand how to build a simple and automated training and deployment pipeline running in production without human intervention.
- Understand how to use telemetry data to improve developer productivity.
Technical folks and data scientists with intermediate knowledge of neural networks and streaming
Prerequisites for Attendees
Familiarity with neural networks, basic knowledge of streaming, and an interest in understanding how observability data fits in with neural networks to reduce MTTR and MTTK.