A Spurious Outlier Detection System For High Frequency Time Series Data

In the age of IoT, more and more processes rely on information gathered from well-placed sensors to make better inferences and predictions about the business. Sensor data are typically continuous and of enormous volume. Like any other data source, they are contaminated by noise (outliers), which may or may not be preventable. The presence of these outlier points adversely affects the performance of any analytical model. Note that we differentiate between contextual anomalies and noisy outliers: the former are important to us for building predictive models, while the latter are the spurious points we want to detect. Here we propose an integrated and scalable approach to detecting spurious outliers. The main modules of the proposed system are taken from the literature, but to our knowledge no comparable end-to-end, robust system has been proposed. Although the method was developed specifically on manufacturing IoT data, it is equally applicable to any domain dealing with time series data, such as CPG, retail, healthcare, and agrotech.

 
 

Outline/Structure of the Talk

  1. Introduction - 1min
  2. Problem Statement and Objectives - 2min
  3. Basic Thresholding and Transformations - 3min
  4. EWMA based approach - 2min
  5. Basic Framework and its components with results - 8min
  6. Updated Framework (work in progress) - 1min
  7. Summary and scope of improvements - 1min
  8. Q&A with suggestions - 2min

Learning Outcome

  1. Appreciate the challenges of dealing with high-volume time series data and how we at Noodle try to solve them every day.
  2. Hopefully, some members of the audience will come away curious about time series analysis.

Target Audience

Data Scientists, Business Analysts, Product Managers

Submitted 5 months ago

Public Feedback

  • Ashay Tamhane
    By Ashay Tamhane  ~  5 months ago

    Thanks Soham for an interesting proposal. Could you give a brief insight/example on why traditional techniques did not work for you? Will help us in better evaluating the proposal.

    • Soham Chakraborty
      By Soham Chakraborty  ~  5 months ago

      Hi Ashay

      I will try to address the key reasons why traditional techniques are not suitable for high-frequency time series data.

      1. Commonly adopted methods such as quantile-based trimming or other thresholding techniques (mean ± stddev) are insufficient for identifying local outliers, as they mostly target global outlier points. Hence we adopted a window-based approach to solve this issue.
      2. Even within a window-based setup, the methods above were found to be inadequate for this type of data (high-frequency time series). Other established methods such as LOF (Local Outlier Factor) performed equally poorly.
      3. Our proposal was built using data from an actual manufacturing setup where thousands of sensors capture information constantly. Such complex processes produce dynamic sensor time series as products and product grades change. A typical problem we faced was that the time series were highly non-stationary, with even the ranges of sensor values changing over time. None of the usual techniques worked to a satisfactory level under these constraints.
      4. To mitigate these issues, we propose a novel window-based outlier detection technique, which we found to perform significantly better than the conventional techniques found in the literature.
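
      As a minimal illustration of points 1 and 2 (not the proposed technique itself, which is not described in this thread), the sketch below contrasts a global mean ± k·stddev rule with a rolling-window rule on a toy series whose level shifts from about 10 to about 100. The rolling median/MAD rule, the function names, the window half-width, and the thresholds are all assumptions chosen for illustration.

```python
import statistics


def global_sigma_flags(values, k=3.0):
    """Flag indices lying more than k standard deviations from the global mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mean) > k * std]


def rolling_mad_flags(values, half_width=5, k=3.5):
    """Flag indices far from the median of a centred window (hypothetical rule).

    The scale is the median absolute deviation (MAD) of the window, multiplied
    by 1.4826 so that it is comparable to a standard deviation for Gaussian data.
    Windows are truncated at the ends of the series.
    """
    flags = []
    for i, v in enumerate(values):
        window = values[max(0, i - half_width): i + half_width + 1]
        med = statistics.median(window)
        mad = statistics.median([abs(x - med) for x in window])
        scale = 1.4826 * mad
        # A zero scale means the window is (nearly) constant; skip it here.
        if scale > 0 and abs(v - med) > k * scale:
            flags.append(i)
    return flags


# Toy series (made up for illustration): a small deterministic wiggle around 10,
# then around 100, mimicking a change of operating range between product grades.
series = [(10.0 if i < 50 else 100.0) + ((i * 7) % 5 - 2) * 0.1 for i in range(100)]
series[20] = 30.0  # a local spike that still lies inside the global range

global_flags = global_sigma_flags(series)
local_flags = rolling_mad_flags(series)
print("global rule flags:", global_flags)
print("rolling rule flags:", local_flags)
```

      On this toy series the global rule flags nothing, because the level shift inflates the overall standard deviation, while the windowed rule flags only the spike at index 20.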

      Hope this answers your query.

      Thanks

      Soham

       

      • Ashay Tamhane
        By Ashay Tamhane  ~  4 months ago

        Thanks for the detailed response.

  • Kuldeep Jiwani
    By Kuldeep Jiwani  ~  5 months ago

    Hi Soham,

    A quick question: what kind of datasets have you tested this on? Statistics such as measures of skewness can vary significantly across datasets.

    Is the data from your work, or is it open source data?

    • Soham Chakraborty
      By Soham Chakraborty  ~  5 months ago

      Hi Kuldeep

      This is a problem we faced while working with actual data from clients such as metal recyclers and steel manufacturers. As I mentioned in the abstract, we work with industrial clients, providing AI-powered solutions. As such, all of our data are IoT sensor-based, i.e. high-frequency time series data. We have not applied this to any open source data. It is internal work developed to solve this particular issue.

       

      Hope this answers your question.

      Thanks

      Soham

  • Natasha Rodrigues
    By Natasha Rodrigues  ~  5 months ago

    Hi Soham,

    Thanks for your proposal! To help the program committee understand your presentation style, could you provide a link to a past recording, or record a short 1-2 minute trailer of your talk and share the link?

    Thanks,

    Natasha

    • Soham Chakraborty
      By Soham Chakraborty  ~  5 months ago

      Hi Natasha

      I made a short video introducing myself and giving an overview of the subject matter in the proposal. I am attaching the link here. Do let me know if you can't access it. I will wait to hear from you.

      Video Link: https://drive.google.com/open?id=1xNMs41JoS3idU_TCmL47vOHgyI2TLg3i

      Thanks

      Soham

      • Natasha Rodrigues
        By Natasha Rodrigues  ~  5 months ago

        Hi Soham,

        Thank you for this. Kindly add it to the video/link section of your proposal as well.

        Regards,

        Natasha

        • Soham Chakraborty
          By Soham Chakraborty  ~  5 months ago

          Updated in the proposal.

          Thanks,

          Soham