In today's world, the majority of information is generated by self-sustaining systems such as bots, crawlers, servers, and various online services. This information flows along the time axis and is generated by these actors under some complex logic. Examples include a stream of buy/sell order requests from an Order Gateway in the financial world, a stream of web requests from a monitoring or crawling service in the web world, or a hacker's bot sitting on the internet and attacking various computers. We may not be able to know the motive or intention behind these data sources, but with unsupervised techniques we can try to infer patterns or correlate events based on their repeated occurrences over time. We could thus automatically identify the signatures of various actors and take appropriate action.

Sessionisation is one such unsupervised technique that tries to find the signal in a stream of timestamped events. In an ideal world this would reduce to finding the periods in a mixture of sinusoidal waves. In the real world it is a much more complex activity, as even the systematic events generated by machines over the internet behave erratically. The notion of a period therefore changes as well: we can no longer associate it with a single number, and must treat it as a random variable with an expected value and an associated variance. Hence we need to model "stochastic periods" and learn their probability distributions in an unsupervised manner. This is done via non-parametric Bayesian techniques with a Gaussian prior.
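
As a rough illustration of treating a period as a random variable, the sketch below summarises the gaps between consecutive events by their mean and standard deviation. The function name `stochastic_period` and the synthetic 60-second beacon are assumptions made for this example only; they are not taken from the talk, and a plain mean/variance summary is far simpler than the Bayesian modelling the talk describes.

```python
import random
import statistics

def stochastic_period(timestamps):
    """Return (mean, standard deviation) of the gaps between consecutive events."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return statistics.mean(gaps), statistics.stdev(gaps)

# Hypothetical beacon: nominally one event every 60 s, with Gaussian jitter.
rng = random.Random(0)
t, events = 0.0, []
for _ in range(500):
    t += rng.gauss(60.0, 5.0)
    events.append(t)

# The "period" is reported as an expected value with a spread, not a single number.
period_mean, period_std = stochastic_period(events)
print(f"period ~ {period_mean:.1f} s +/- {period_std:.1f} s")
```

A single mean and variance only makes sense when one actor dominates the stream; with several actors the gap distribution becomes multi-modal, which is where mixture models come in.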

In this talk we will walk through a real security use case solved via Sessionisation for the SOC (Security Operations Centre) of an international firm whose offices in 56 countries are monitored by a central SOC team.

We will also go through a Sessionisation technique based on stochastic periods. The journey begins by extracting relevant data from a sequence of timestamped events. We then apply techniques such as FFT (Fast Fourier Transform), kernel density estimation, optimal signal selection, and Gaussian Mixture Models to eventually discover patterns in the timestamped events.
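
To make the Fourier step concrete, here is a minimal sketch that turns timestamps into a unit-pulse train and picks the strongest non-zero frequency. It uses a direct DFT from the standard library so that it stays self-contained; in practice an FFT routine such as `numpy.fft.rfft` would normally be used on the binned signal. The function name `dominant_period`, the 1-second bin width, and the synthetic jittered events are assumptions for illustration, not details from the talk.

```python
import cmath
import random

def dominant_period(timestamps, n_bins):
    """Estimate the strongest period (in bins) of a unit-pulse train via a direct DFT."""
    # One unit pulse per event, placed in a 1-second bin (assumed bin width).
    bins = [int(t) % n_bins for t in timestamps]
    best_k, best_mag = 1, 0.0
    for k in range(1, n_bins // 2 + 1):  # skip k = 0, the DC component
        # DFT coefficient of the pulse train, summed over the non-zero samples only.
        coeff = sum(cmath.exp(-2j * cmath.pi * k * b / n_bins) for b in bins)
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return n_bins / best_k

# Hypothetical actor firing roughly every 16 s, each event jittered independently.
rng = random.Random(1)
events = [16.0 * i + rng.gauss(0.0, 2.5) for i in range(1, 127)]
period = dominant_period(events, n_bins=2048)
print(f"dominant period ~ {period:.2f} s")
```

The jitter damps the higher harmonics of the pulse train, which is why the fundamental stands out here; with larger jitter the spectral peak flattens, and that is one motivation for modelling the period as a distribution instead.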

Key concepts explained in the talk: Sessionisation, Bayesian techniques of Machine Learning, Gaussian Mixture Models, kernel density estimation, FFT, stochastic periods, probabilistic modelling

Outline/Structure of the Talk

The presentation will proceed in the following flow:

  • Setting the context, explaining the relevance of timestamped data
  • Visuals from real world examples to illustrate the concept of Sessionisation
    • Showcasing how to apply Sessionisation in real world applications
  • Decomposing a time sequence of events into a pulse train
    • Showing statistically which part is relevant to capture
  • Emphasising the need for modelling via mixture models such as GMM (Gaussian Mixture Models)
    • Limitation of GMM: the number of components k must be specified in advance
  • Unsupervised approach to model probability distributions
    • How to use FFT in such scenarios
    • Use of kernel density estimation
  • Bring it all together to obtain sessions and patterns from timestamped data
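
One possible way to sketch the final step, assuming the gaps between events fall into two groups (short gaps within a session, long gaps between sessions): estimate the density of the log-gaps with a Gaussian kernel, take the lowest-density point between the smallest and largest gap as the cut-off, and split the stream there. The helper names, the bandwidth of 0.25, and the synthetic data are all assumptions for this example; it is a simplified stand-in and not the method presented in the talk.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def gap_threshold(timestamps, bandwidth=0.25, grid_points=200):
    """Pick the gap (in seconds) at the density valley of the log10 inter-event gaps."""
    # Assumes strictly increasing timestamps and a bimodal gap distribution.
    logs = [math.log10(b - a) for a, b in zip(timestamps, timestamps[1:])]
    density = gaussian_kde(logs, bandwidth)
    lo, hi = min(logs), max(logs)
    grid = [lo + (hi - lo) * i / (grid_points - 1) for i in range(grid_points)]
    return 10.0 ** min(grid, key=density)

def sessionise(timestamps, threshold):
    """Start a new session whenever the gap to the previous event exceeds threshold."""
    sessions = [[timestamps[0]]]
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > threshold:
            sessions.append([])
        sessions[-1].append(cur)
    return sessions

# Hypothetical stream: 5 bursts of 20 events (1-3 s apart), about 10 minutes between bursts.
events, t = [], 0.0
for _ in range(5):
    for i in range(20):
        events.append(t)
        t += [1.0, 2.0, 3.0][i % 3]
    t += 600.0

threshold = gap_threshold(events)
sessions = sessionise(events, threshold)
print(len(sessions), [len(s) for s in sessions])
```

Unlike a GMM, the kernel density estimate does not need the number of components fixed in advance, but it does depend on the bandwidth, and the valley is only meaningful when the gap distribution really has more than one mode.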

Learning Outcome

The following should be the learning outcomes of the talk:

  • Understanding the importance of timestamped data
  • Need for probabilistic modelling
  • Understanding of existing techniques like FFT, GMM, etc.
  • How to solve the problem in an unsupervised manner
  • Most importantly, learning how to figure out your own way when faced with tricky problems

Target Audience

Data science enthusiasts and aspirants

Prerequisites for Attendees

The talk will try to present things in an intuitive manner, so not much prior knowledge is needed.

Submitted 1 month ago

  • Liked Dipanjan Sarkar

    Dipanjan Sarkar - A Hands-on Introduction to Natural Language Processing

    Dipanjan Sarkar
    Data Scientist
    Red Hat
    3 months ago
    480 Mins

    Data is the new oil and unstructured data, especially text, images and
    videos contain a wealth of information. However, due to the inherent
    complexity in processing and analyzing this data, people often refrain
    from spending extra time and effort in venturing out from structured
    datasets to analyze these unstructured sources of data, which can be a
    potential gold mine. Natural Language Processing (NLP) is all about
    leveraging tools, techniques and algorithms to process and understand
    natural language-based data, which is usually unstructured like text,
    speech and so on. In this workshop, we will be looking at tried and tested
    strategies, techniques and workflows which can be leveraged by
    practitioners and data scientists to extract useful insights from text data.

    Being specialized in domains like computer vision and natural language
    processing is no longer a luxury but a necessity which is expected of
    any data scientist in today’s fast-paced world! With a hands-on and
    interactive approach, we will understand essential concepts in NLP along
    with extensive case studies and hands-on examples to master
    state-of-the-art tools, techniques and frameworks for actually applying
    NLP to solve real-world problems. We leverage Python 3 and the latest and
    best state-of-the-art frameworks including NLTK, Gensim, SpaCy,
    Scikit-Learn, TextBlob, Keras and TensorFlow to showcase our examples.

    In my journey in this field so far, I have struggled with various problems,
    faced many challenges, and learned various lessons over time. This
    workshop will contain a major chunk of the knowledge I’ve gained in the world
    of text analytics and natural language processing, where building a
    fancy word cloud from a bunch of text documents is not enough
    anymore. Perhaps the biggest problem with regard to learning text
    analytics is not a lack of information but too much information, often
    called information overload. There are so many resources,
    documentation, papers, books, and journals containing so much content
    that they often overwhelm someone new to the field. You might have
    had questions like ‘What is the right technique to solve a problem?’,
    ‘How does text summarization really work?’ and ‘Which are the best
    frameworks to solve multi-class text categorization?’ among many other
    questions! Based on my prior knowledge and learnings from publishing a
    couple of books in this domain, this workshop should help attendees avoid
    the pressing issues I’ve faced in my journey so far and learn the
    strategies to master NLP.

    This workshop follows a comprehensive and structured approach. First it
    tackles the basics of natural language understanding and Python for
    handling text data in the initial chapters. Once you’re familiar with the
    basics, we cover text processing, parsing and understanding. Then, we
    address interesting problems in text analytics in each of the remaining
    chapters, including text classification, clustering and similarity analysis,
    text summarization and topic models, semantic analysis and named
    entity recognition, sentiment analysis and model interpretation. The last
    chapter is an interesting chapter on the recent advancements made in
    NLP thanks to deep learning and transfer learning and we cover an
    example of text classification with universal sentence embeddings.