Non-Parametric PDF estimation for advanced Anomaly Detection
Anomaly Detection has been one of the most sought-after analytical solutions for businesses operating in Network Operations, Service Operations, Manufacturing and other sectors where continuity of operations is essential. Any degradation in operational service, or an outage, implies high losses and possible customer churn. The data in such real-world applications is generally noisy, has complex patterns and is often correlated.
Techniques like Auto-Encoders are available for modelling complex patterns, but they cannot explain the cause of an anomaly in the original feature space. The traditional univariate anomaly detection techniques use the z-score and p-value methods. These rely on unimodality and on the choice of a correct parametric form. If these assumptions are not satisfied, the result is a high number of False-Positives and False-Negatives.
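To make the unimodality assumption concrete, here is a minimal sketch of a z-score / p-value check applied to bimodal data. The data, the helper name `zscore_flags` and the threshold of 3 are illustrative assumptions, not material from the talk.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical metric with two operating regimes, centred near 10 and 50.
data = np.concatenate([rng.normal(10, 1, 500), rng.normal(50, 1, 500)])

def zscore_flags(values, reference, threshold=3.0):
    """Return z-scores, two-sided Gaussian p-values and |z| > threshold flags."""
    mu = reference.mean()
    sigma = reference.std(ddof=1)
    z = (np.asarray(values, dtype=float) - mu) / sigma
    # The p-value is only meaningful if the reference really is Gaussian.
    p = 2 * stats.norm.sf(np.abs(z))
    return z, p, np.abs(z) > threshold

# A value of 30 lies in the empty gap between the two regimes.
candidate = np.array([30.0])
z, p, flagged = zscore_flags(candidate, data)
print(z, p, flagged)
```

Because the sample mean sits between the two modes, the gap value gets a z-score near zero and is not flagged, even though no observed data lies anywhere near it. This is the kind of False-Negative a wrong parametric form produces.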
This is where the need arises to estimate a PDF (Probability Density Function) without assuming a prior parametric form, i.e. a Non-Parametric approach. The PDF needs to be modelled as close to the true distribution as possible. That is, it should have low bias (to avoid over-smoothing) and low variance (to avoid under-smoothing). Only then do we have a better chance of identifying true anomalies.
Approaches like KDE (Kernel Density Estimation) assist in such non-parametric estimation. Research indicates that the type of kernel plays a lesser role than the bandwidth in obtaining a good PDF estimate. The default bandwidth selection techniques used in both Python and R packages tend to over-smooth the PDF and are therefore not suitable for Anomaly Detection.
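The effect of bandwidth can be illustrated with a short sketch using SciPy's `gaussian_kde`, whose default bandwidth is Scott's rule of thumb. The synthetic data and the narrower factor of 0.05 are assumptions chosen for illustration only; the narrower value is hand-picked, not an optimal bandwidth.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical bimodal sample: two tight modes at 0 and 4.
data = np.concatenate([rng.normal(0.0, 0.5, 500), rng.normal(4.0, 0.5, 500)])

kde_default = gaussian_kde(data)                  # Scott's rule (SciPy default)
kde_narrow = gaussian_kde(data, bw_method=0.05)   # hand-picked smaller factor

# Kernel bandwidth in data units is the square root of the kernel covariance.
bw_default = np.sqrt(kde_default.covariance[0, 0])
bw_narrow = np.sqrt(kde_narrow.covariance[0, 0])

grid = np.linspace(-2.0, 6.0, 801)
peak_default = kde_default(grid).max()
peak_narrow = kde_narrow(grid).max()

# Density in the gap between the modes, where almost no data was observed.
gap_default = kde_default([2.0])[0]
gap_narrow = kde_narrow([2.0])[0]

print(f"bandwidth: default={bw_default:.3f}, narrow={bw_narrow:.3f}")
print(f"peak:      default={peak_default:.3f}, narrow={peak_narrow:.3f}")
print(f"gap:       default={gap_default:.5f}, narrow={gap_narrow:.5f}")
```

The rule-of-thumb bandwidth scales with the overall spread of the sample, which is inflated here by the distance between the two modes. The default estimate therefore flattens the peaks and assigns noticeably more density to the empty gap than the narrower bandwidth does.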
We will explain another method, in which we run an optimisation over a cost function based on modelling the Gaussian kernel via FFT (Fast Fourier Transform) to obtain the appropriate bandwidth. We will then show how to apply it to Anomaly Detection even when the data is multi-modal (has multiple peaks) and the distribution can be of any shape.
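As a rough sketch of how an estimated PDF can drive Anomaly Detection on multi-modal data, the example below flags points whose estimated density falls below a low quantile of the densities seen in training. The bandwidth factor of 0.1 is a hand-picked placeholder and does not implement the FFT-based cost function; the data, the 1% quantile and the helper name `is_anomaly` are likewise assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# Hypothetical bimodal training data with modes at 0 and 4.
train = np.concatenate([rng.normal(0.0, 0.5, 500), rng.normal(4.0, 0.5, 500)])

# Placeholder bandwidth factor; an optimised bandwidth would be plugged in here.
kde = gaussian_kde(train, bw_method=0.1)

# Threshold: the density value below which only 1% of training points fall.
threshold = np.quantile(kde(train), 0.01)

def is_anomaly(x):
    """Flag points whose estimated density is below the training threshold."""
    return kde(np.atleast_1d(np.asarray(x, dtype=float))) < threshold

# Near mode 1, in the gap, near mode 2, far outside the observed range.
new_points = np.array([0.1, 2.0, 4.2, 9.0])
flags = is_anomaly(new_points)
print(dict(zip(new_points.tolist(), flags.tolist())))
```

With this setup the points near either mode are treated as normal, while the point in the gap between the modes and the point far outside the observed range fall below the density threshold. A mean-and-standard-deviation rule would not separate the gap point from normal behaviour.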
Based on the research paper under publication, "Optimal Kernel Density Estimation using FFT based cost function", currently scheduled for ICDM 2020, New York.
Outline/Structure of the Talk
High level flow:
- Traditional Anomaly Detection techniques
- How to use z-score and p-value
- The shortcomings of such approaches
- Case studies of a wrong parametric form leading to false alerts
- Introduction to KDE and role of bandwidth in estimating PDF
- Existing bandwidth techniques
- Proposed technique for bandwidth estimation
- Finally applying all the above to detect complex anomalies
Audience will learn the following:
- Concepts of Anomaly detection based on z-score and p-value
- How to use KDE, which is also useful in the EDA phase
- How to find the optimal PDF for the given data
- How to apply Anomaly detection where the data has complex patterns
Data Scientists, Applied Analytics folks, Industry experts
Prerequisites for Attendees
No significant prerequisites needed as the talk will build up from basic concepts.