Real-time anomaly detection on telemetry data using neural networks

Aug 8th, 02:45 - 03:05 PM | Neptune | 175 Interested

Description:

Observability is key in modern architectures for quickly detecting and repairing problems in microservices. Modern observability platforms have evolved beyond simple application logs and now include distributed tracing systems such as Haystack. Combining them with real-time, intelligent alerting mechanisms that produce accurate alerts helps automate the detection of these problems.

Abstract

At Expedia, our mission is to connect people to places through the power of technology. To accomplish this, we build and run hundreds of microservices that provide different functionality to serve a single customer request. What happens when one or more services fail at the same time? We will look at how Expedia identifies these failed services in an automated manner while maintaining a high quality of service, which has led to significant improvements in our mean time to detect (MTTD) and mean time to know (MTTK).

In this talk, we will present the journey of distributed tracing at Expedia, which started with Zipkin as a prototype and ended with us building our own open-source solution using the OpenTracing APIs. We will take a deep dive into our architecture and demonstrate how we ingest terabytes of tracing data in production for hundreds of our microservices and use this data to trend service errors, latencies, and rates. As the number of microservices grew, we felt the need for a real-time, intelligent alerting and monitoring system to help reduce MTTD and MTTK and move towards 24/7 reliability.

Because each service's errors follow their own behavioural patterns, using neural networks to learn the behaviour changes of each microservice and raise alerts was a challenging task. It uncovered a few unexpected challenges, and the solution was less straightforward than we initially estimated. Ultimately, though, the neural-network-based anomaly detector produced results that beat our expectations, once again validating the interest in neurocomputing that is overtaking the industry.

To achieve this, we predict service failures in the microservices using recurrent neural networks on telemetry data and perform anomaly detection on the predicted values. We will show how we train a recurrent neural network and auto-tune its hyperparameters using Bayesian optimization methods. We will also take a deep dive into the architecture of the automated training pipeline and show how anomaly detection works in a streaming manner, using Kafka (Kafka Streams) as the backbone, with the model deployed in the cloud in a cost-effective manner. Finally, we will discuss possible areas for improvement to reduce false positives, including human intervention as a feedback loop.
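
As a rough illustration of what "anomaly detection on predicted values" can look like in a streaming setting, the sketch below compares each incoming metric value with a forecast and flags points whose forecast error is unusually large. Everything in it is an assumption made for illustration: the forecaster is a simple moving average standing in for the trained recurrent network, the plain Python iterable stands in for a Kafka topic, and the threshold rule, window size, and function names are hypothetical rather than the method presented in the talk.

```python
from collections import deque
from statistics import mean, pstdev


def moving_average_forecast(history):
    """Hypothetical stand-in for the trained RNN: predict the mean of recent values."""
    return mean(history)


def detect_anomalies(stream, window=5, k=3.0, min_sigma=1.0):
    """Yield (index, value, predicted, is_anomaly) for each point after warm-up.

    A point is flagged when |value - predicted| > k * sigma, where sigma is the
    standard deviation of recent forecast errors, floored at min_sigma so that
    tiny fluctuations in a very stable series are not flagged.
    """
    history = deque(maxlen=window)    # recent normal values fed to the forecaster
    residuals = deque(maxlen=window)  # recent forecast errors on normal points
    for i, value in enumerate(stream):
        if len(history) < window:
            history.append(value)     # warm-up: collect values, emit nothing
            continue
        predicted = moving_average_forecast(history)
        residual = value - predicted
        sigma = max(pstdev(residuals), min_sigma) if len(residuals) >= 2 else min_sigma
        is_anomaly = abs(residual) > k * sigma
        yield i, value, predicted, is_anomaly
        if not is_anomaly:
            # Only normal points update the baseline, so a spike does not
            # distort the forecasts that follow it.
            history.append(value)
            residuals.append(residual)


if __name__ == "__main__":
    # Made-up error counts per minute for one service, with a spike at index 7.
    error_counts = [10, 11, 10, 12, 11, 10, 11, 50, 10, 11]
    for index, value, predicted, is_anomaly in detect_anomalies(error_counts):
        marker = "ANOMALY" if is_anomaly else "ok"
        print(f"t={index} value={value} predicted={predicted:.1f} {marker}")
```

Keeping flagged points out of the baseline is one simple design choice among several; a feedback loop with human review, as mentioned above, could instead decide which points are allowed to update it.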

 
 

Outline/Structure of the Talk

  • Expedia's business use-cases
  • Recurrent Neural Networks - LSTM
  • Current Anomaly Detection Methods
  • Issues with the current anomaly detection methods
  • Expedia’s Five-Step Methodology
  • High-level architecture of the fully automated training and deployment pipeline
  • Using Bayesian optimization to auto-tune the hyperparameters of neural networks
  • Leveraging Kafka to perform real-time anomaly detection
  • Demo
  • Results

Learning Outcome

At the end of the talk, the audience will:

  • Understand how to train neural networks and tune hyperparameters for hundreds of time-series metrics in an automated fashion.
  • Understand how to leverage Kafka Streams along with neural networks to perform anomaly detection in real time.
  • Understand how to build a simple and automated training and deployment pipeline running in production without human intervention.
  • Understand how to use telemetry data to improve developer productivity.

Target Audience

Technical folks and data scientists with intermediate knowledge of neural networks and streaming

Prerequisites for Attendees

Familiarity with neural networks, basic knowledge of streaming, and an interest in understanding how observability data fits in with neural networks to reduce MTTK and MTTD.

Submitted 7 months ago

Public Feedback

  • Naresh Jain
    By Naresh Jain  ~  6 months ago

    Thank you for the proposal.

    I noticed that there are 3 presenters on this 45 mins topic. Can you please clarify why we need 3 speakers and how all 3 speakers plan to contribute?

    • Ashish Aggarwal
      By Ashish Aggarwal  ~  6 months ago

      Thanks Naresh for the review.
      I think there will be one or maximum two presenters for the final talk (if selected).

      At the time of submission, we listed all three people who contributed to this project in one way or another.

      Let me know if you need any corrections at our end.


      Thanks!

      • Naresh Jain
        By Naresh Jain  ~  6 months ago

        Request you to please put the final speakers on the proposal. It will help the program committee make the selection.

        • Ashish Aggarwal
          By Ashish Aggarwal  ~  5 months ago

          Hey Naresh,

          Sorry for the late response. We have identified our speakers for the talk. As discussed last Friday, do you want to hold a 5-10 minute call with the presenters?

          • Naresh Jain
            By Naresh Jain  ~  5 months ago

            Hi Ashish,

            Can you transfer the proposal to the actual speaker? After looking at that we can decide if we need a 5-10 mins video call with the speaker.

            • Keshav Peswani
              By Keshav Peswani  ~  5 months ago

              Hi Naresh,

              We have added both speakers. We would be happy to do a video call in case any explanation is needed from our side.

              Thanks

              Keshav

        • Ashish Aggarwal
          By Ashish Aggarwal  ~  6 months ago

          Done