Revisiting Market Basket Analysis (MBA) with the help of SQL Pattern Matching

Market Basket Analysis, or Affinity Analysis, using an Association Rules based model is a cross-domain solution framework used in Retail Analytics (shopping baskets), Clickstream/Web Traffic Analytics, Customer Behaviour Analytics, Fraud Analytics, etc.

Market Basket Analysis (MBA) is used to discover/identify patterns in transactional data (a master-detail set of transaction line items) and serves many downstream business processes like Recommendations, Merchandising/Inventory Planning, Product Assortments, etc.

MBA is extensively used in the industry, and quite a few extensions to it are possible: (a) Multi-Level Association Rules, by allowing the core item/product hierarchy level to be flexible; (b) Multi-Dimensional Association Rules, by including additional nuggets of information ('tags') along additional dimensions of interest; (c) Sequential Association Rules, by considering the order of events within the transaction and eliciting signals relating to the directionality of the Rule, including possible causal indicators.

MBA is typically performed as an offline batch/ETL/analytic process, with the results of the modeling extracted and saved for subsequent perusal by the Domain/Business Analyst.

In this solution/revisiting of the MBA process, we decouple the Rule/Pattern discovery phase (finding patterns/rules via an Association Rules model build) from the Rule/Pattern KPI calculation phase, which evaluates the usefulness of the patterns (scoring patterns/rules via KPIs).

MBA Rules/Patterns are typically evaluated via the Support, Confidence and Lift KPIs. Some experts have advocated additional KPIs like Conviction, Imbalance Ratio (IR) and the Kulczynski (Kulc) measure to identify interesting Rules/Patterns. We define these KPIs, as well as many custom KPIs, which help qualify the Rules/Patterns and aid the Rule/Pattern discovery/exploration phase.
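The arithmetic behind these KPIs is simple enough to precompute in a SQL view from basket counts; a minimal Python sketch of the same formulas is shown below. The function name and the sample counts are illustrative, not part of any library.

```python
# KPIs for a rule A => B, computed from basket counts.
# n: total baskets; n_a: baskets containing A; n_b: containing B;
# n_ab: baskets containing both A and B. All values illustrative.

def rule_kpis(n, n_a, n_b, n_ab):
    support    = n_ab / n
    confidence = n_ab / n_a
    lift       = confidence / (n_b / n)
    # Conviction: ratio of expected to observed rule failures;
    # infinite when the rule never fails (confidence = 1).
    conviction = (1 - n_b / n) / (1 - confidence) if confidence < 1 else float("inf")
    # Kulczynski measure: average of the two conditional probabilities.
    kulc = 0.5 * (n_ab / n_a + n_ab / n_b)
    # Imbalance Ratio: 0 for balanced itemsets, approaches 1 when skewed.
    ir = abs(n_a - n_b) / (n_a + n_b - n_ab)
    return {"support": support, "confidence": confidence, "lift": lift,
            "conviction": conviction, "kulc": kulc, "ir": ir}

kpis = rule_kpis(n=1000, n_a=200, n_b=250, n_ab=100)
# support 0.1, confidence 0.5, lift 2.0, conviction 1.5, kulc 0.45
```

Because every KPI reduces to four counts, re-slicing the dataset in a BI tool only needs those counts recomputed; the formulas stay fixed.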

The SQL approach to MBA allows us to:

=> Include the pattern matching capability within an offline ETL workflow (match and pre-calculate results), within a view (match on demand, with dynamic calculation), or a combination of both (pre-calculated as well as on-demand), for regular BI tools to leverage.

=> Cover special/edge cases of interest in domains like Fraud, where patterns have insufficient coverage (very low support) but need to be identified nevertheless. The pattern space can be very voluminous, but in certain cases we can identify/analyze user-defined seeded patterns using SQL without having to build the MBA model.

=> Address Sequential Rules/Patterns, where the transaction order of items is considered during the matching process. For example, if the Market Basket Rule is "b,p,r => c", we can use SQL sequential logic to derive that the most dominant sequential pattern within the antecedents "b", "p" and "r" is "p,b,r" 67% of the time, and that overall, across both the antecedents (b, p, r) and the consequent (c), the dominant sequential pattern amongst the 4 basket products is "p,b,c,r" 50% of the time. This acts as a nudge to the domain analyst/business user to perhaps approve an update/transform workflow process that changes the business rule from "b,p,r => c" (initially sorted by product ids, indicating pure association from the Apriori model) to "p,b,r => c" (reflecting the dominant sequential pattern amongst the antecedents).

=> Allow the Domain Analyst/Business User to perform ad hoc reporting via standard BI operations like slice-and-dice on the dataset, recalculating the Rule/Pattern KPIs.

=> Re-evaluate a Rule/Pattern against a dataset different from the one on which it was identified (say, a recent/streaming input data stream) -- for example, see how patterns discovered during the "Big Sale" period are doing in the current Promotion/Campaign.

=> Establish a Rule/Pattern lifecycle beyond that of an MBA 'model' -- a Rules curation process to determine how a discovered Rule/Pattern can be designated as an 'Insight' for further use in related (downstream) systems.
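The seeded-pattern idea above (checking a user-defined pattern directly against raw baskets, with no model build) can be sketched as follows. The basket data and item ids are purely illustrative, standing in for what a SQL query over the transaction line items would scan.

```python
# Check a user-defined "seeded" pattern against raw baskets without building
# a full Association Rules model. Baskets and item ids are illustrative.

baskets = {
    "t1": ["card_a", "atm_x", "wire_y"],
    "t2": ["card_a", "wire_y"],
    "t3": ["atm_x", "wire_y", "card_a"],
    "t4": ["card_b"],
}

def seeded_support(baskets, seed):
    """Fraction of baskets containing every item of the seeded pattern,
    regardless of order -- low-support/edge patterns included."""
    seed = set(seed)
    hits = sum(1 for items in baskets.values() if seed <= set(items))
    return hits / len(baskets)

supp = seeded_support(baskets, ["card_a", "wire_y"])  # 3 of 4 baskets match
```

Even a pattern far below any Apriori minimum-support threshold gets an exact coverage number this way, which is precisely the fraud-style edge case the model-free route is for.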
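The sequential-dominance calculation behind the "b,p,r => c" example can likewise be sketched in a few lines of Python; in the SQL solution this is where MATCH_RECOGNIZE does the ordered matching. The sample baskets below are invented to mirror the 67% figure from the example.

```python
from collections import Counter

# Which ordering of the antecedents {b, p, r} dominates in the raw baskets?
# Baskets are ordered line items; the data is illustrative.
baskets = [
    ["p", "b", "x", "r", "c"],
    ["p", "b", "r"],
    ["b", "p", "r", "c"],
]

def dominant_sequence(baskets, items):
    """Return (most common ordering of `items`, share of qualifying baskets)."""
    items = set(items)
    orders = Counter(
        tuple(i for i in basket if i in items)   # keep only the rule's items, in order
        for basket in baskets
        if items <= set(basket)                  # basket must contain all of them
    )
    (seq, hits), = orders.most_common(1)
    return list(seq), hits / sum(orders.values())

seq, share = dominant_sequence(baskets, {"b", "p", "r"})
# seq == ["p", "b", "r"], share == 2/3 for this sample
```

The same function applied to all four items (b, p, r, c) would yield the overall dominant sequence, supporting the suggested rewrite of the rule's antecedent order.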


Outline/Structure of the Case Study

Intro to SQL Pattern Matching - 5 min

Data Model/Structure details - 5 min

MBA model implementation using in Database Data Mining - Oracle Machine Learning (OML) - 5 min

Post-processing: SQL Pattern Matching process (MBA use case) - 5 min

Post-processing: Additional KPIs as extensions to MBA - 5 min

BI Tool Semantic Model: Business Layer Model for the schema/data model - 5 min

Demo in Oracle Analytics Cloud - 10 min

Q&A - 5 min

Learning Outcome

* Learn about Advanced SQL techniques like MATCH_RECOGNIZE (ANSI SQL:2016) used for row pattern matching

* A SQL-based approach to MBA makes it possible to extend the functionality of typical MBA in many useful/interesting ways

=> Evaluate the Sequential nature and/or assess Directionality of the Rule/Pattern

=> Model within a standard BI solution by encapsulating the MBA Rule/Pattern scoring mechanism within a DB view (SQL + ML = win-win)

Target Audience

Developers, Architects, Data Engineers, ML Practitioners

Prerequisites for Attendees

A general understanding of Market Basket Analysis (MBA), also referred to as building an Association Rules model in Data Mining, will help. This will be touched upon briefly in the talk, along with some of the issues/concerns with the traditional MBA KPIs that necessitate additional KPIs to qualify the Rules/Patterns.

Submitted 3 months ago
