Five Key Pitfalls in Data Analysis

Data Science is all about deriving actionable insights through data analysis.
Such insights have tremendous business value.
But what if:
  • Some crucial data has been left out of consideration?
  • Wrong inferences have been drawn during analysis?
  • Results have been graphically misrepresented?
Imagine the adverse impact on your business if you take wrong decisions based on such flawed results.

In this talk we will discuss the following 5 key pitfalls to look out for in data analysis results before you take any decisions based on them:
1. Selection Bias
2. Survivor Bias
3. Confounding Effects
4. Spurious Correlations
5. Misleading Visualizations

These are some of the most common points that beginners in Data Science overlook.

The talk will draw on many real-life examples to illustrate these points.
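As one illustration of the second pitfall, survivor bias, here is a minimal sketch using simulated data. The fund returns, the threshold and the function name are all hypothetical and are not taken from the talk; the point is only that averaging over the "survivors" overstates performance even when there is no real skill.

```python
import random
from statistics import mean

def simulate_survivor_bias(n_funds=1000, threshold=0.0, seed=42):
    """Compare the mean over all funds with the mean over survivors only."""
    rng = random.Random(seed)
    # Hypothetical annual returns: the true average is 0, i.e. no skill at all.
    returns = [rng.gauss(0.0, 0.10) for _ in range(n_funds)]
    # Only funds that beat the threshold "survive" and end up in the dataset.
    survivors = [r for r in returns if r > threshold]
    return mean(returns), mean(survivors), len(survivors)

all_mean, survivor_mean, n_survivors = simulate_survivor_bias()
print(f"mean return, all funds:      {all_mean:+.3f}")
print(f"mean return, survivors only: {survivor_mean:+.3f} ({n_survivors} funds)")
```

An analyst who only ever sees the surviving funds would conclude that the typical fund makes money, although the simulated population has no edge.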


Outline/Structure of the Talk

  • Goal of Data Science (1 min.)
  • Data Wrangling - Absolutely Necessary but Not Sufficient (1 min.)
  • Issues Not Addressed by Data Wrangling (1 min.)
  • Selection Bias (6 mins.)
  • Survivor Bias (5 mins.)
  • Confounding Effects (6 mins.)
  • Spurious Correlations (5 mins.)
  • Misleading Visualizations (5 mins.)
  • Conclusions (5 mins.)
  • Q & A (5 mins.)

[Please Note:

Link to a free preview video of my course "What is Data Science?" hosted on the Udemy platform.

I don't have the video for the topic I am proposing here.]

Learning Outcome

  1. A beginner will be able to avoid the pitfalls discussed in this talk and produce more reliable data analysis results.
  2. By understanding these pitfalls, a decision maker will be able to constructively question the analyst's findings and avoid taking wrong and costly decisions.

Target Audience

Data Science Beginners; People who take data-driven decisions

Prerequisites for Attendees

No Prerequisites.

Submitted 1 year ago

  • Subhasish Misra - Causal data science: Answering the crucial ‘why’ in your analysis.

    45 Mins

    Causal questions are ubiquitous in data science. For example, questions such as whether changing a feature on a website led to more traffic, or whether digital ad exposure led to incremental purchases, are deeply rooted in causality.

    Randomized tests are considered the gold standard for estimating causal effects. However, experiments are in many cases infeasible or unethical. In such cases one has to rely on observational (non-experimental) data to derive causal insights. The crucial difference between randomized experiments and observational data is that in the former, test subjects (e.g. customers) are randomly assigned a treatment (e.g. digital advertisement exposure). This helps curb the possibility that user response (e.g. clicking on a link in the ad and purchasing the product) differs across the treated and non-treated groups owing to pre-existing differences in user characteristics (e.g. demographics, geo-location). In essence, we can then attribute divergences in key outcomes (e.g. purchase rate) observed post-treatment to the causal impact of the treatment.

    This treatment assignment mechanism, which makes causal attribution possible via randomization, is absent when using observational data. Thankfully, there are scientific (statistical and beyond) techniques that help circumvent this shortcoming and get to causal reads.
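The contrast between randomized and observational assignment can be sketched with a small simulation. Everything here is an assumption made for illustration (the hidden "engaged" trait, the probabilities, the 0.05 true effect and the function name); it is not the speaker's data or method.

```python
import random
from statistics import mean

TRUE_EFFECT = 0.05  # assumed lift in purchase probability caused by the ad

def naive_effect(n, randomized, seed=0):
    """Difference in purchase rate between exposed and unexposed users."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        engaged = rng.random() < 0.5  # pre-existing user characteristic
        if randomized:
            p_treat = 0.5  # a coin flip decides who sees the ad
        else:
            p_treat = 0.8 if engaged else 0.2  # engaged users see the ad more often
        got_ad = rng.random() < p_treat
        # Engaged users buy more regardless of the ad.
        p_buy = 0.10 + 0.20 * engaged + TRUE_EFFECT * got_ad
        bought = 1 if rng.random() < p_buy else 0
        (treated if got_ad else control).append(bought)
    return mean(treated) - mean(control)

observational_est = naive_effect(20000, randomized=False)
randomized_est = naive_effect(20000, randomized=True)
print(f"true effect:            {TRUE_EFFECT:.3f}")
print(f"observational estimate: {observational_est:.3f}")
print(f"randomized estimate:    {randomized_est:.3f}")
```

Under these assumptions the observational comparison mixes the ad's effect with the pre-existing difference between engaged and non-engaged users, while the randomized comparison stays close to the true effect.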

    The aim of this talk is to offer a practical overview of the above aspects of causal inference, a discipline that lies at the fascinating confluence of statistics, philosophy, computer science, psychology, economics, and medicine, among others. Topics include:

    • The fundamental tenets of causality and measuring causal effects.
    • Challenges involved in measuring causal effects in real world situations.
    • Distinguishing between randomized and observational approaches to measuring causal effects.
    • An introduction to measuring causal effects from observational data using matching and its extension, propensity-score-based matching, with a focus on a) the intuition and statistics behind it, b) tips from the trenches, based on the speaker's experience with these techniques, and c) practical limitations of such approaches.
    • A walkthrough of how matching was applied to get causal insights into the effectiveness of a digital product for a major retailer.
    • Finally, why a nuanced understanding of causality is all the more important in the big data era we are in.
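To give a feel for the matching idea mentioned in the topics above, here is a minimal sketch of exact matching on a single binary covariate, using simulated data. The data generator, the column meanings and the function names are hypothetical. With only one binary covariate the propensity score takes just two values, so matching on the score would coincide with matching on the covariate itself; real propensity score matching with many covariates needs a fitted model and is not shown here.

```python
import random
from collections import defaultdict
from statistics import mean

def simulate(n, seed=0):
    """Rows of (engaged, used_product, purchased) with a true effect of 0.05."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        engaged = rng.random() < 0.5
        used = rng.random() < (0.8 if engaged else 0.2)  # non-random uptake
        p_buy = 0.10 + 0.20 * engaged + 0.05 * used
        rows.append((engaged, used, 1 if rng.random() < p_buy else 0))
    return rows

def naive_estimate(rows):
    """Plain difference in means, ignoring the covariate."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return mean(treated) - mean(control)

def matched_estimate(rows):
    """Compare each treated unit with the controls sharing its covariate value."""
    controls = defaultdict(list)
    for x, t, y in rows:
        if not t:
            controls[x].append(y)
    control_mean = {x: mean(ys) for x, ys in controls.items()}
    # Treated units without any comparable control are dropped.
    diffs = [y - control_mean[x] for x, t, y in rows if t and x in control_mean]
    return mean(diffs)

rows = simulate(50000)
naive = naive_estimate(rows)
matched = matched_estimate(rows)
print(f"naive estimate:   {naive:.3f}")
print(f"matched estimate: {matched:.3f}  (simulated true effect: 0.050)")
```

Matching only removes bias from covariates that are actually observed and matched on, which is one of the practical limitations of such approaches.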
  • Chaitanya Krishna Thanneeru - Taxonomy Building using ML

    45 Mins
    Case Study

    Topic modeling is the art of extracting latent topics/themes that exist in a set of documents. In this talk we will discuss the use cases of topic modeling, particularly Latent Dirichlet Allocation (LDA), and the implementation work by the Data Science Applications team at Meredith to design auto-taggers and classifiers for the topics in the custom enterprise taxonomy against hundreds of thousands of documents. We will talk about best practices for choosing the optimal number of topics for hundreds of thousands of documents, how named entity extraction is employed to derive context in the feature space, how machine learning techniques are aligned to support the work of taxonomists, and the integration with the enterprise architecture to support expert assessor population for curating training data for Google’s AutoML and other deep learning capabilities.
    Latent semantic analysis has been shown to be ideal for quickly clustering the document space. Applying it hierarchically on top-level clusters to derive child clusters, informed by inputs from subject matter experts and taxonomists (namely taxonomy terms and synonyms), makes it possible to get a sense of the coverage in the content space against the enterprise taxonomy model.
    Where there are shortcomings, additional training data needs to be obtained in order to effectively build auto-tagging solutions. One technique for data augmentation is query formulation, again utilizing entity extraction from owned content along with the taxonomy categories and synonyms, to construct social listening streams to surface new off-property content to become part of the training corpus.

  • Shrutika Poyrekar / Kiran Karkera / Usha Rengaraju - Introduction to Bayesian Networks

    90 Mins

    { This is a hands-on workshop. The use case is traffic analysis. }

    Most machine learning models assume independent and identically distributed (i.i.d.) data. Graphical models can capture almost arbitrarily rich dependency structures between variables by encoding conditional independence structure with graphs. A Bayesian network, a type of graphical model, describes a probability distribution over all variables by putting edges between the variable nodes, wherein the edges represent the conditional probability factors in the factorized probability distribution. Thus Bayesian networks provide a compact representation for dealing with uncertainty using an underlying graphical structure and probability theory. These models have a variety of applications, such as medical diagnosis, biomonitoring, image processing, turbo codes, information retrieval, document classification and gene regulatory networks, among many others. These models are interpretable, as they can capture the causal relationships between different features. They can work efficiently with small data and can also deal with missing data, which gives them more power than conventional machine learning and deep learning models.

    In this session, we will discuss the concepts of conditional independence, d-separation, the Hammersley-Clifford theorem, Bayes' theorem, Expectation Maximization and Variable Elimination. There will be a code walkthrough of a simple case study.
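The factorized distribution described above can be sketched with a toy three-node network for the traffic use case. The structure (Rain → Jam ← Accident) and all probabilities are invented for illustration and are not the workshop's case study. Inference is done by brute-force enumeration rather than Variable Elimination, which is only practical for very small networks.

```python
from itertools import product

# Hypothetical prior probabilities for the two root nodes.
P_RAIN = {True: 0.2, False: 0.8}
P_ACCIDENT = {True: 0.1, False: 0.9}
# Hypothetical P(jam=True | rain, accident), keyed by (rain, accident).
P_JAM = {
    (True, True): 0.95,
    (True, False): 0.60,
    (False, True): 0.80,
    (False, False): 0.10,
}

def joint(rain, accident, jam):
    """P(rain, accident, jam) = P(rain) * P(accident) * P(jam | rain, accident)."""
    p_jam = P_JAM[(rain, accident)]
    return P_RAIN[rain] * P_ACCIDENT[accident] * (p_jam if jam else 1 - p_jam)

def posterior_accident_given_jam():
    """P(accident=True | jam=True) via Bayes' theorem, summing out rain."""
    numerator = sum(joint(r, True, True) for r in (True, False))
    evidence = sum(joint(r, a, True) for r, a in product((True, False), repeat=2))
    return numerator / evidence

posterior = posterior_accident_given_jam()
print(f"P(accident)       = {P_ACCIDENT[True]:.3f}")
print(f"P(accident | jam) = {posterior:.3f}")
```

Observing a jam raises the probability of an accident above its prior, because a jam is more likely when an accident has happened.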

  • 45 Mins

    In this digital era, when the attention span of customers is shrinking drastically, it is imperative for marketers to understand the following 4 aspects, popularly known as "The 4R's of Marketing", if they want to increase their ROI:

    - Right Person

    - Right Time

    - Right Content

    - Right Channel

    Only when we design and send our campaigns in such a way that they reach the right customers at the right time through the right channel, telling them about things they like or are interested in, can we expect higher conversions with lower investment. This is a problem that most organizations need to solve to stay relevant in this age of high market competition.

    Among all these, we will put special focus on generating appropriate content for a targeted user base using Markov-based models, and do a quick hack session.

    The time breakup can be:

    5 mins: Difference between Martech and traditional marketing. The 4R's of marketing and why solving for them is crucial

    5 mins: What is Smart Segments and how to solve for it, with a short demo

    5 mins: How marketers use output from Smart Segments to execute targeted campaigns

    5 mins: What is STO, how it can be solved and what is the performance uplift seen by clients when they use it

    5 mins: What is Channel Optimization, how it can be solved and what is the performance uplift seen by clients when they use it

    5 mins: Why sending the right message to customers is crucial, and introduction to appropriate content creation

    15 mins: Different text generation nuances, and a live demo with a walkthrough of a toy code implementation
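A toy version of Markov-based text generation might look like the sketch below. It is a first-order word-level chain built from a made-up corpus of campaign-style phrases; the corpus, the function names and the parameters are assumptions and do not reproduce the speakers' implementation.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, max_words=12, seed=0):
    """Walk the chain from `start`, picking each next word at random."""
    rng = random.Random(seed)
    word, output = start, [start]
    # Stop at the length limit or when the current word has no known successor.
    while len(output) < max_words and chain.get(word):
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)

# Hypothetical corpus of past campaign copy.
corpus = (
    "new arrivals for you this week "
    "big savings for you this weekend "
    "new deals this week"
)
chain = build_chain(corpus)
message = generate(chain, "new")
print(message)
```

Because repeated successors appear several times in each list, more frequent word pairs are proportionally more likely to be chosen. Training on copy that performed well for a given segment is one way such a model could be targeted.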