Unsupervised learning approach for identifying retail store employees using footfall data

Analysis of customer visits (footfall) to a store, traced via geolocation-enabled devices, helps digital firms better understand customers and their buying behavior. Insights from geo-footfall analysis help clients and advertisers make informed decisions, choose profitable regions, recognize relevant advertising opportunities, and analyze their competitors to increase their success rate. But this information can be misleading if people who walk past the store without entering, and the store's own staff, are not excluded. These two groups, passers-by and store employees, can therefore be treated as outliers in the footfall data, since their behavior is expected to differ from that of actual customers.

The data collected by geofencing the stores and from SDK pings of geo-enabled devices does not directly tag these outliers, so they are not obvious and cannot be removed by simple extreme value analysis. To tackle this problem, we formulated a multivariate approach to identify and remove these outliers from our source data. Because we have no labeled data marking a footfall as an employee or a customer, we use an unsupervised outlier detection model based on the DBSCAN algorithm to produce a coherent, complete dataset with the outliers labeled. In the process, several techniques were considered to assess the effectiveness of the features. Features such as the time a visitor spends in and around the store compared to other locations, monthly visit frequency, and daily visit frequency were dominant in tagging the outliers.
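As a rough illustration of the idea, the sketch below runs a minimal pure-Python DBSCAN over standardized visitor features and treats the noise points as candidate outliers. The two features (hours per visit, visits per month), the synthetic data, and the values of `eps` and `min_pts` are all assumptions made up for this example; they are not the features or settings used in the talk.

```python
import math
import random

NOISE = -1  # label given to points that belong to no dense region

def standardize(rows):
    """Scale every column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE              # may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point: joins, never expands
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # core point: keep expanding
    return labels

# Synthetic, made-up data: (hours per visit, visits per month).
random.seed(0)
customers = [(random.uniform(0.2, 1.0), random.randint(1, 6)) for _ in range(200)]
employees = [(random.uniform(7.0, 9.0), random.randint(20, 26)) for _ in range(5)]
points = standardize(customers + employees)

labels = dbscan(points, eps=0.5, min_pts=10)
flagged = [i for i, label in enumerate(labels) if label == NOISE]
print(f"{len(flagged)} of {len(labels)} visitors flagged as outliers")
```

In this toy setup the employees are too few and too far from the customer cloud to form a dense region of their own, so they end up as noise. With real data, a group of employees could be large enough to form its own cluster, in which case small clusters would need to be inspected as well.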

Discovering the structure of the data was another key step in tuning the two DBSCAN parameters for our use case: epsilon (the neighborhood radius) and minimum points (the number of neighbors required to form a dense region).
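One common way to let the structure of the data suggest a value for epsilon is the k-distance heuristic: sort every point's distance to its k-th nearest neighbor and look for the "knee" where the curve starts rising sharply. The sketch below is only an assumption about how such tuning could be done, not a description of the procedure used in the talk; the helper names `k_distances` and `knee`, the chord-based knee rule, and the toy points are all hypothetical.

```python
import math

def k_distances(points, k):
    """Sorted distance from every point to its k-th nearest neighbour."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)

def knee(values):
    """Index of the value lying farthest below the chord from first to last."""
    n = len(values)
    first, last = values[0], values[-1]

    def gap(i):
        chord = first + (last - first) * i / (n - 1)
        return chord - values[i]

    return max(range(n), key=gap)

# Toy data: a dense 10 x 10 grid plus three isolated points.
grid = [(0.1 * x, 0.1 * y) for x in range(10) for y in range(10)]
strays = [(5.0, 5.0), (-4.0, 6.0), (6.0, -3.0)]

curve = k_distances(grid + strays, k=4)
eps = curve[knee(curve)]
print(f"suggested eps: {eps:.3f}")
```

The value of k is usually tied to the minimum-points setting, so the two parameters are chosen together. On real footfall data the knee is rarely this clean, and the suggested value is better treated as a starting point for manual inspection.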

Finally, we evaluated the results against those obtained with the k-means algorithm; the comparison showed that DBSCAN has a higher detection rate and a low rate of false positives in discovering outliers for the given problem statement.
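For reference, the two metrics named above can be computed as in the sketch below. It assumes that some ground truth is available for a validation sample, which the abstract does not describe, and the labels and predictions shown are invented purely to exercise the function; they are not results from the talk.

```python
def detection_and_false_positive_rate(truth, predicted):
    """Both arguments are sequences of booleans, True meaning 'outlier'.

    Returns (detection rate, false positive rate):
      detection rate      = TP / (TP + FN)
      false positive rate = FP / (FP + TN)
    """
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    detection = tp / (tp + fn) if tp + fn else 0.0
    false_positive = fp / (fp + tn) if fp + tn else 0.0
    return detection, false_positive

# Invented example: 3 true outliers among 10 visitors, two hypothetical models.
truth   = [True, True, True, False, False, False, False, False, False, False]
model_a = [True, True, True, False, False, False, False, False, False, True]
model_b = [True, False, False, True, True, False, False, False, False, False]

rates_a = detection_and_false_positive_rate(truth, model_a)
rates_b = detection_and_false_positive_rate(truth, model_b)
print("model A:", rates_a)
print("model B:", rates_b)
```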


Outline/Structure of the Talk

  1. Brief Introduction to location-based targeting (2 min)
  2. Problem statement (2 min)
  3. Impact on footfall insights (3 min)
  4. Datasets and Target Variable (3 min)
  5. Approaches (4 min)
  6. Challenges (2 min)
  7. Results (2 min)
  8. Q&A (2 min)

Learning Outcome

  1. How unsupervised learning can be used in outlier detection.

  2. How relevant outlier detection is in the digital marketing domain.

  3. Customer footfall analysis and insights.

  4. Impact of the proposed approach on pre-campaigns.

Target Audience

Data Scientists, Students, Marketers, Digital Advertisers, Business Owners, Machine Learning Practitioners, Data Analysts, Business Analysts

Prerequisites for Attendees

Basics of Classification Modelling and Statistics.

Submitted 11 months ago

Public Feedback

    • Ujwala Musku

      Ujwala Musku - Supply Path Optimization in Video Advertising Landscape

      Ujwala Musku
      Data Scientist II
      MiQ Digital
      11 months ago
      20 Mins

      In the programmatic era, with many players in the market, it is quite complex for a buyer to reach the destination (the advertising slot) from the source (the publisher). Auction duplication, internal deals between DSPs and SSPs, and fraudulent activity are making this already complex route more complex by the day. As a result, a single impression is sold through multiple routes by multiple sellers at multiple prices. This raises a new dilemma: which route should the buyer choose, and what is a fair price to pay?

      In this talk, we will discuss a framework for choosing the best path at the right price in programmatic video advertising. We will first give an overview of the approaches we tried: clustering, classification modeling, Data Envelopment Analysis (DEA), and scoring based on classification modeling. Of these, DEA and the scoring methodology gave better results, so we will compare their results in detail and illustrate why a particular approach worked better. The final framework covers these two best-performing techniques: 1. Data Envelopment Analysis and 2. Scoring based on classification modeling. DEA is a non-parametric method used here to rank an unlabeled dataset of supply paths by estimating their relative efficiencies. These efficiencies are calculated by comparing all the possible production frontiers of decision-making units (here, supply paths). The scoring method, a hybrid of statistics and machine learning, calculates a score for each supply path, helping us decide whether a path is worth bidding on.

      The results of these models are compared with each other to choose the best one based on campaign KPIs, i.e., CPM (cost per 1,000 impressions) and CPCV (cost per completed view of the video ad). A 4-8% improvement in CPM was observed in multiple test video ad campaigns, although with a dip in the number of impressions delivered. This is tackled by including impressions as an input to both techniques. These clear improvements in CPM indicate that the technique yields better ROI than the heuristic approach. The approach can also be used in other sectors, such as banking (determining credit scores) and retail (supply path optimization in operations).

    • Aditya Jain

      Aditya Jain - Optimizing ROI of Digital Advertising using Bid Landscaping

      Aditya Jain
      Data Scientist II
      MiQ Digital
      11 months ago
      20 Mins

      The world economy is an $80 trillion economy driven by $500 billion spent on advertising. Of this, digital advertising formed the largest chunk, at more than $300 billion for 2019. With COVID-19 hitting the world and several economists anticipating a recession, business continuity and the protection of livelihoods are of paramount importance to businesses. Programmatic advertising offers unparalleled targetability, flexibility, and measurability that can help businesses control their advertising costs. Programmatic advertising enabled by real-time bidding lets advertisers competitively evaluate the value that each potential ad space delivers and place a real-time bid to win it, giving businesses a unique opportunity for effective advertising.

      Effective advertising has two aspects: knowing where to bid and knowing what to bid. Together, these two pieces of information enable efficient management of a digital ad campaign. For the purpose of this talk, I will consider "where" a solved problem; "what", however, is still an actively researched problem. A bid landscape is a model that maps the distribution of bids against wins and lets us calculate an optimal bid for an auction in a real-time bidding scenario. Typically, Gaussian and log-normal distributions are used to approximate the distribution of bids and wins. Such assumptions seldom hold in practice and do not produce a generalized model. To complicate things further, this modeling has to be done on left-censored data. Here, left-censored refers to an arrangement where only the auction winner knows the winning price, whereas the other participants know only that they have lost the auction.

      To avoid assuming a distribution, our team used a deep neural network to learn the bid landscape. We also leveraged domain embeddings, a novel approach for this task. The embedding is learned using a character-level CNN model, which allows us to extrapolate the learning to unseen domain names. We use feed-forward layers to model the bid landscape. The dataset used is left-censored and contains approximately 14 billion rows.

      In this talk, I will discuss the various approaches we tried and what worked. The bid landscape is a dynamic distribution that changes over time, so I will also discuss the validity of a model across multiple time periods.

      I will conclude the talk with a small discussion on the business impact of this model and results obtained on a few live digital ad-campaigns.
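To make the bid-landscape idea from the abstract above concrete, the sketch below shows the classical parametric baseline it mentions: a log-normal curve stands in for the probability of winning at a given bid, and the bid is chosen to maximize expected surplus. Everything here is an assumption for illustration: the first-price auction setting, the surplus objective, the parameter values `mu` and `sigma`, and the impression value. The talk itself replaces this assumed distribution with a learned neural model, which is not reproduced here.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """P(market price <= x) when log(price) ~ Normal(mu, sigma)."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def expected_surplus(bid, value, mu, sigma):
    """Assumed first-price setting: win with probability F(bid), keep value - bid."""
    return (value - bid) * lognormal_cdf(bid, mu, sigma)

def best_bid(value, mu, sigma, steps=1000):
    """Grid search over bids in [0, value] for the largest expected surplus."""
    grid = [value * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda b: expected_surplus(b, value, mu, sigma))

# Hypothetical numbers: an impression worth 5.0 to the advertiser and a
# market price whose logarithm is Normal(0.0, 0.5).
value, mu, sigma = 5.0, 0.0, 0.5
bid = best_bid(value, mu, sigma)
print(f"bid {bid:.2f}, win probability {lognormal_cdf(bid, mu, sigma):.2f}")
```

Bidding the full value wins often but leaves no surplus, while bidding near zero almost never wins, so the chosen bid sits strictly between the two. How well such a bid performs depends entirely on how closely the assumed curve matches the real landscape, which is the weakness the abstract points out.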