Sessionisation via stochastic periods for root event identification
In today's world, the majority of information is generated by self-sustaining systems such as bots, crawlers, servers, and various online services. This information flows along the axis of time and is generated by these actors under some complex logic: for example, a stream of buy/sell order requests from an order gateway in the financial world, a stream of web requests from a monitoring or crawling service, or a hacker's bot sitting on the internet and attacking various computers. Although we may not be able to know the motive or intention behind these data sources, unsupervised techniques let us infer patterns and correlate events based on their repeated occurrences on the axis of time. Associating a chain of events in time order helps in performing root event analysis. In certain cases, time-ordered correlation and root event identification are good enough to automatically identify the signatures of various malicious actors and take appropriate corrective actions, such as stopping cyber attacks or malicious social campaigns.
Sessionisation is one such unsupervised technique that tries to find the signal in a stream of timestamped events. In an ideal world this would reduce to finding the periods of a mixture of sinusoidal waves. In the real world it is a much more complex activity, as even the systematic events generated by machines over the internet behave erratically. So the notion of a period also changes: we can no longer associate it with a single number; it has to be treated as a random variable, with an expected value and an associated variance. Hence we need to model "stochastic periods" and learn their probability distributions in an unsupervised manner.
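As a minimal illustration of treating a period as a random variable, one can look at the inter-arrival gaps of an event stream and summarise them by their mean and variance. The event source below is synthetic (events roughly once a minute with jitter); it is an assumption for illustration, not data from the talk:

```python
import numpy as np

# Hypothetical event timestamps (seconds) from a periodic-but-noisy source:
# nominally one event per minute, with Gaussian jitter.
rng = np.random.default_rng(0)
timestamps = np.cumsum(rng.normal(loc=60.0, scale=5.0, size=500))

# Treat the period as a random variable: the inter-arrival gaps.
gaps = np.diff(timestamps)
print(f"expected period: {gaps.mean():.1f}s, std: {gaps.std():.1f}s")
```

The point of the sketch is that the "period" of this source is only meaningful as a distribution over gaps, not as a single number.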
The main focus of this talk is to showcase applied data science techniques for discovering stochastic periods. There are many ways to obtain periods from data, so the journey begins with a walk-through of existing techniques such as the FFT (Fast Fourier Transform), followed by Gaussian Mixture Models. After highlighting the shortcomings of these techniques, we will succinctly explain one of the most general non-parametric Bayesian approaches to this problem. Without going too deep into the complex math, we will then return to applied data science and discuss a much simpler technique that can solve the same problem when certain assumptions are satisfied.
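As a sketch of the FFT starting point mentioned above: bin the event stream into a pulse train and read the dominant period off the strongest non-DC frequency. The 8-second period, pulse width, and 1-second binning below are assumptions for illustration:

```python
import numpy as np

# Hypothetical pulse train: a 3-sample-wide pulse every 8 seconds, 1s bins.
n = 1024
signal = np.zeros(n)
for offset in (0, 1, 2):
    signal[offset::8] = 1.0

# The strongest non-DC frequency in the spectrum gives the dominant period.
spectrum = np.abs(np.fft.rfft(signal - signal.mean()))  # subtract mean to kill DC
freqs = np.fft.rfftfreq(n, d=1.0)
dominant = freqs[np.argmax(spectrum)]
print(f"dominant period: {1.0 / dominant:.0f}s")  # -> 8s
```

This works well for clean, regular pulse trains; the talk's premise is that real streams are jittery enough that a single spectral peak stops being a faithful description.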
In this talk we will demonstrate some time-based patterns we discovered while working on a security analytics use case that uses Sessionisation. The patterns will be demonstrated on an open source malware attack dataset that is publicly available.
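The talk's actual dataset and method are not reproduced here, but a naive gap-threshold sketch conveys the basic idea of Sessionisation: start a new session whenever the gap between consecutive events exceeds a threshold. The timestamps and the 10-second threshold below are made-up assumptions:

```python
import numpy as np

# Hypothetical timestamps: three bursts of activity separated by quiet gaps.
timestamps = np.array([0, 1, 2, 30, 31, 33, 90, 91])

# Naive sessionisation: a new session starts after any gap over the threshold.
threshold = 10.0
session_ids = np.concatenate([[0], np.cumsum(np.diff(timestamps) > threshold)])
print(session_ids.tolist())  # -> [0, 0, 0, 1, 1, 1, 2, 2]
```

A fixed threshold is exactly what the stochastic-period modelling in the talk aims to replace: the cut-off should come from the learned gap distribution, not a hand-picked constant.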
Key concepts explained in the talk: Sessionisation, Bayesian machine learning techniques, Gaussian Mixture Models, kernel density estimation, FFT, stochastic periods, probabilistic modelling, Bayesian non-parametric methods
Outline/Structure of the Talk
The presentation will proceed in the following flow:
- Setting the context, explaining the relevance of time stamped data
- Visuals from real world examples to illustrate the concept of Sessionisation
- Showcasing how to apply Sessionisation in real world applications
- Root event identification via time ordered correlation
- Decomposing a time sequence of events into a pulse train
- Statistically showing which part of the signal we need to capture
- Explaining and demonstrating usage of FFT
- Emphasising the need for modelling via mixture models such as GMM (Gaussian Mixture Models)
- Limitations of GMM: the number of components k must be specified in advance
- Non-parametric Bayesian modelling for the infinite GMM
- Unsupervised applied data science approach to model probability distributions
- Use of kernel density estimation
- Bring it all together to obtain sessions and patterns from timestamped data
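To make the mixture-model and density-estimation outline points concrete, here is a hedged sketch using scikit-learn's `BayesianGaussianMixture` (a truncated Dirichlet-process GMM, so k need not be fixed in advance) and SciPy's `gaussian_kde` on synthetic inter-arrival gaps. The two latent periods (~10s and ~60s) are assumptions for illustration, not results from the talk:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.mixture import BayesianGaussianMixture

# Hypothetical gaps drawn from two latent periods (~10s and ~60s).
rng = np.random.default_rng(1)
gaps = np.concatenate([rng.normal(10, 1, 300), rng.normal(60, 5, 200)])

# Dirichlet-process-style GMM: cap the components, let the data prune them.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(gaps.reshape(-1, 1))
active = dpgmm.weights_ > 0.05  # components the data actually supports
print("inferred periods:", np.sort(dpgmm.means_[active].ravel()).round(1))

# Kernel density estimate of the same gap distribution (non-parametric).
kde = gaussian_kde(gaps)
```

The DP-GMM recovers components near the true periods without being told k, while the KDE gives a fully non-parametric view of the same distribution, which is the simpler route the talk alludes to.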
Learning Outcome
The following should be the learning outcomes of the talk:
- Understanding the importance of time stamped data
- Need for probabilistic modelling
- Understanding of existing techniques like FFT, GMM, etc
- How to solve the problem in an unsupervised manner
- Most importantly, learning how to figure out your own way when faced with tricky problems
Target Audience
Data science enthusiasts and aspirants
Prerequisites for Attendees
The talk will present things in an intuitive manner, so little prior knowledge is required.
Video
Links
I have given talks at the following:
- ODSC India 2018: "Topological space creation and clustering at BigData scale"
- https://confengine.com/odsc-india-2018/proposal/6545/topological-space-creation-and-clustering-at-bigdata-scale
- Includes PPT slides, video links, brief description of the talk
- OracleCode 2017: "Performance Diagnostic Techniques for Big Data Solutions Using Machine Learning"
- https://developer.oracle.com/code/newdelhi
- Video of the talk has not been posted
- https://www.pscp.tv/w/1gqGvbWlAaAGB
- GIDS (Great International Developer Summit) 2008: "Performance diagnostics tool for Java: AD4J"
Submitted 4 years ago
People who liked this proposal also liked:
Anuj Gupta - Natural Language Processing Bootcamp - Zero to Hero
480 Mins
Workshop
Intermediate
Data is the new oil and unstructured data, especially text, images and videos contain a wealth of information. However, due to the inherent complexity in processing and analyzing this data, people often refrain from spending extra time and effort in venturing out from structured datasets to analyze these unstructured sources of data, which can be a potential gold mine. Natural Language Processing (NLP) is all about leveraging tools, techniques and algorithms to process and understand natural language based unstructured data - text, speech and so on.
Being specialized in domains like computer vision and natural language processing is no longer a luxury but a necessity which is expected of any data scientist in today’s fast-paced world! With a hands-on and interactive approach, we will understand essential concepts in NLP along with extensive case studies and hands-on examples to master state-of-the-art tools, techniques and frameworks for actually applying NLP to solve real-world problems. We leverage Python 3 and the latest and best state-of-the-art frameworks including NLTK, Gensim, SpaCy, Scikit-Learn, TextBlob, Keras and TensorFlow to showcase our examples. You will be able to learn a fair bit of machine learning as well as deep learning in the context of NLP during this bootcamp.
In our journey in this field, we have struggled with various problems, faced many challenges, and learned various lessons over time. This workshop is our way of giving back a major chunk of the knowledge we’ve gained in the world of text analytics and natural language processing, where building a fancy word cloud from a bunch of text documents is not enough anymore. You might have had questions like ‘What is the right technique to solve a problem?’, ‘How does text summarization really work?’ and ‘Which are the best frameworks to solve multi-class text categorization?’ among many other questions! Based on our prior knowledge and learnings from publishing a couple of books in this domain, this workshop should help readers avoid some of the pressing issues in NLP and learn effective strategies to master NLP.
The intent of this workshop is to make you a hero in NLP so that you can start applying NLP to solve real-world problems. We start from zero and follow a comprehensive and structured approach to make you learn all the essentials in NLP. We will be covering the following aspects during the course of this workshop with hands-on examples and projects!
- Basics of Natural Language and Python for NLP tasks
- Text Processing and Wrangling
- Text Understanding - POS, NER, Parsing
- Text Representation - BOW, Embeddings, Contextual Embeddings
- Text Similarity and Content Recommenders
- Text Clustering
- Topic Modeling
- Text Summarization
- Sentiment Analysis - Unsupervised & Supervised
- Text Classification with Machine Learning and Deep Learning
- Multi-class & Multi-Label Text Classification
- Deep Transfer Learning and its promise
- Applying Deep Transfer Learning - Universal Sentence Encoders, ELMo and BERT for NLP tasks
- Generative Deep Learning for NLP
- Next Steps
With over 10 hands-on projects, the bootcamp will be packed with plenty of examples for you to go through, try out and practice, and we will try to keep theory to a minimum considering the limited time we have and the amount of ground we want to cover. We hope that at the end of this workshop you can take away some useful methodologies to apply when solving NLP problems in the future. We will be using Python to showcase all our examples.
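As a flavour of the kind of hands-on example such a bootcamp covers, here is a minimal scikit-learn text classification sketch (bag-of-words representation feeding a linear classifier). The tiny corpus and labels are toy assumptions, not workshop material:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy hypothetical corpus for a binary sentiment task (1 = positive).
texts = [
    "great movie, loved it",
    "terrible plot, boring",
    "loved the acting",
    "boring and terrible",
]
labels = [1, 0, 1, 0]

# TF-IDF bag-of-words features feeding a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["loved it, great acting"]))
```

The same pipeline pattern scales from this toy setup to the multi-class and multi-label classification topics listed in the outline, by swapping in a real corpus and classifier.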