Beyond Individual Contribution: How to Lead Data Science Teams

Despite the increasing number of data scientists who are being asked to take on managerial and leadership roles as they grow in their careers, there are still few resources on how to manage data scientists and lead data science teams. There is also scant practical advice on how to serve as head of a data science practice: how to set a vision and craft a strategy for an organization to use data science.

In this talk, I will describe my experience as a data science leader both at a political party (the Democratic Party of the United States of America) and at a fintech startup (Even.com), share lessons learned from these experiences and conversations with other data science leaders, and offer a framework for how new data science leaders can better transition to both managing data scientists and heading a data science practice.

 
 

Outline/Structure of the Talk

  • Introduction to my background as a data science leader
    • Lead of the digital analytics team at Democratic Party of the United States
    • Head of Data Science at Even.com
  • Overview of general management philosophy as context for the remainder of the presentation
    • The Upside Down Organization (i.e., managers report to their team)
    • The Boundary Detector and Breaker (e.g., leaders push the rest of the company to think about and integrate data science)
  • Presentation of framework on how to lead data science teams
    • Crafting strategy and defining vision for the team (including identifying leverage opportunities for the data science practice in an organization with implications for hiring, growing, and deploying data scientists)
    • Creating a culture that balances scientific rigor with business pragmatism (including an emphasis on measuring impact normalized by bandwidth spent)
    • Mentoring and growing data scientists from diverse backgrounds and with diverse talents (helping data scientists manage up and finding opportunities to help data scientists grow and learn from each other)
  • Closing with synthesis and suggestions on implementation
    • Wrap-up slides to share with external audiences on framework and how to use it

Learning Outcome

Participants will learn a framework for managing data scientists and leading a data science practice, as well as why and when this framework can be useful to them. Participants should leave the presentation better prepared to tackle new or existing roles as data science managers and leaders, or better able to identify promising candidates for these roles.

Target Audience

Data scientists looking to move up into management roles, data science managers, and data science executives

Submitted 1 week ago

Public Feedback


  • Liked Subhasish Misra

    Subhasish Misra - Causal data science: Answering the crucial ‘why’ in your analysis.

    45 Mins
    Talk
    Intermediate

    Causal questions are ubiquitous in data science. For example, questions such as whether changing a feature on a website led to more traffic, or whether digital ad exposure led to incremental purchases, are deeply rooted in causality.

    Randomized tests are considered the gold standard for estimating causal effects. However, experiments are in many cases infeasible or unethical. In such cases one has to rely on observational (non-experimental) data to derive causal insights. The crucial difference between randomized experiments and observational data is that in the former, test subjects (e.g. customers) are randomly assigned a treatment (e.g. digital advertisement exposure). This helps curb the possibility that user response (e.g. clicking on a link in the ad and purchasing the product) differs across the treated and non-treated groups owing to pre-existing differences in user characteristics (e.g. demographics, geo-location). In essence, we can then attribute divergences observed post-treatment in key outcomes (e.g. purchase rate) to the causal impact of the treatment.

    This treatment assignment mechanism, which makes causal attribution possible via randomization, is absent when using observational data. Thankfully, there are statistical (and other) techniques that let us work around this shortcoming and arrive at causal estimates.

    The aim of this talk is to offer a practical overview of the above aspects of causal inference, a discipline that lies at the fascinating confluence of statistics, philosophy, computer science, psychology, economics, and medicine, among others. Topics include:

    • The fundamental tenets of causality and measuring causal effects.
    • Challenges involved in measuring causal effects in real world situations.
    • Distinguishing between randomized and observational approaches to measuring causal effects.
    • An introduction to measuring causal effects from observational data using matching and its extension, propensity-score-based matching, with a focus on (a) the intuition and statistics behind it, (b) tips from the trenches, based on the speaker's experience with these techniques, and (c) practical limitations of such approaches.
    • A walk-through of an example of how matching was applied to get causal insights into the effectiveness of a digital product for a major retailer.
    • A conclusion on why having a nuanced understanding of causality is all the more important in the big data era we are in.
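
    The matching idea in the topics above can be illustrated with a minimal sketch. This is not the speaker's implementation: the data set, the `loyal` covariate, and all function names are hypothetical, and the propensity score is estimated as a simple treated share per covariate value (in practice a model such as logistic regression over many covariates would usually be used). Loyal customers are constructed to be both more likely to see the ad and more likely to buy, so a naive comparison overstates the effect.

```python
from collections import defaultdict


def make_customers():
    """Build a tiny synthetic data set (hypothetical, for illustration only).

    Each row is (loyal, treated, purchases). Loyalty is a confounder: it
    raises both the chance of ad exposure and the number of purchases.
    The true effect of the ad is +2 purchases by construction.
    """
    rows = []
    for loyal, n_treated, n_control in [(1, 80, 20), (0, 20, 80)]:
        rows += [(loyal, 1, 10 * loyal + 2)] * n_treated
        rows += [(loyal, 0, 10 * loyal)] * n_control
    return rows


def mean(values):
    return sum(values) / len(values)


def naive_difference(rows):
    """Difference in mean outcome between treated and untreated, ignoring confounding."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def propensity_scores(rows):
    """Estimate P(treated | loyal) as the treated share within each covariate value."""
    counts = defaultdict(lambda: [0, 0])  # loyal -> [n_treated, n_total]
    for loyal, t, _ in rows:
        counts[loyal][0] += t
        counts[loyal][1] += 1
    return {loyal: n_t / n for loyal, (n_t, n) in counts.items()}


def matched_att(rows):
    """Effect on the treated via 1-nearest-neighbour matching on the score, with replacement."""
    score = propensity_scores(rows)
    controls = [(score[loyal], y) for loyal, t, y in rows if t == 0]
    diffs = []
    for loyal, t, y in rows:
        if t == 1:
            # Pick the control unit whose propensity score is closest.
            _, y_match = min(controls, key=lambda c: abs(c[0] - score[loyal]))
            diffs.append(y - y_match)
    return mean(diffs)


rows = make_customers()
naive = naive_difference(rows)
att = matched_att(rows)
print(f"naive difference: {naive:.1f}")
print(f"matched estimate: {att:.1f}")
```

    On this constructed data the naive difference is 8.0 while the matched estimate recovers the built-in effect of 2.0. Real observational data will not be this clean, and matching only adjusts for confounders that are actually observed.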
  • 20 Mins
    Demonstration
    Advanced

    In this digital era, when customers' attention spans are shrinking drastically, it is imperative for a marketer to understand the following four aspects, popularly known as "The 4R's of Marketing", if they want to increase their ROI:

    - Right Person

    - Right Time

    - Right Content

    - Right Channel

    Only when we design and send our campaigns so that they reach the right customers at the right time through the right channel, telling them about things they like or are interested in, can we expect higher conversions with lower investment. This is a problem that most organizations need to solve to stay relevant in this age of high market competition.

    Among all these, we will put special focus on generating appropriate content for a targeted user base using Markov-based models, and do a quick hack session.

    The time breakup can be:

    5 mins: Difference between Martech and traditional marketing. The 4R's of marketing and why solving for them is crucial

    5 mins: What is Smart Segments and how to solve for it, with a short demo

    5 mins: How marketers use output from Smart Segments to execute targeted campaigns

    5 mins: What is STO, how it can be solved and what is the performance uplift seen by clients when they use it

    5 mins: What is Channel Optimization, how it can be solved and what is the performance uplift seen by clients when they use it

    5 mins: Why sending the right message to customers is crucial, and introduction to appropriate content creation

    15 mins: Covering different text generation nuances, and a live demo with a walk-through of a toy code implementation
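
    As a rough idea of what a toy Markov-based text generator can look like, here is a minimal sketch. It is not the code from the demo: the corpus, the function names, and the first-order word-level chain are all assumptions chosen for brevity.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words, rng):
    """Walk the chain from `start`, sampling one observed successor per step."""
    output = [start]
    while len(output) < max_words and output[-1] in chain:
        output.append(rng.choice(chain[output[-1]]))
    return " ".join(output)


# Hypothetical campaign snippets; a real corpus would be far larger.
corpus = ("new shoes on sale today . new arrivals on sale this week . "
          "shoes on offer today .")
chain = build_chain(corpus)
message = generate(chain, "new", 8, random.Random(0))
print(message)
```

    Because successors are stored with repetition, a word that follows "on" twice in the corpus is twice as likely to be sampled. Every generated word pair has been seen in the corpus, which is also the main limitation of such a simple model.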

  • Liked Jitendra Rudravaram

    Jitendra Rudravaram / aswin narayanan - Bayesian Modeling with PyMC3

    20 Mins
    Talk
    Beginner

    Bayesian modeling with PyMC3 to predict dividends: a classic small-data problem.

  • Liked Pushker Ravindra

    Pushker Ravindra - Data Science Best Practices for R and Python

    20 Mins
    Talk
    Intermediate

    How many times have you felt that you could not understand someone else’s code, or sometimes not even your own? It’s mostly because of bad or missing documentation and not following best practices. Here I will demonstrate some of the best practices in Data Science for R and Python, the two most important programming languages in the world for Data Science, which help in building sustainable data products.

    - Integrated Development Environment (RStudio, PyCharm)

    - Coding best practices (Google’s R Style Guide and Hadley’s Style Guide, PEP 8)

    - Linter (lintr, Pylint)

    - Documentation – Code (roxygen2, reStructuredText), README/Instruction Manual (R Markdown, Jupyter Notebook)

    - Unit testing (testthat, unittest)

    - Packaging

    - Version control (Git)

    These best practices significantly reduce technical debt in the long term, foster more collaboration, and promote the building of more sustainable data products in any organization.
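
    Two of the practices listed above, code documentation and unit testing with `unittest`, can be shown together in a short Python sketch. The `conversion_rate` function and its tests are hypothetical examples, not material from the talk.

```python
import unittest


def conversion_rate(conversions, visitors):
    """Return conversions / visitors as a fraction between 0 and 1.

    Raises ValueError if visitors is not positive or if conversions
    is negative or larger than visitors.
    """
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    if not 0 <= conversions <= visitors:
        raise ValueError("conversions must be between 0 and visitors")
    return conversions / visitors


class ConversionRateTest(unittest.TestCase):
    def test_typical_value(self):
        self.assertAlmostEqual(conversion_rate(25, 200), 0.125)

    def test_rejects_zero_visitors(self):
        with self.assertRaises(ValueError):
            conversion_rate(1, 0)

    def test_rejects_more_conversions_than_visitors(self):
        with self.assertRaises(ValueError):
            conversion_rate(5, 3)


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ConversionRateTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```

    The docstring states the contract, and each test pins down one part of it, so a later change that breaks the contract fails visibly instead of silently.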

  • Liked Siboli mukherjee

    Siboli mukherjee - Real time Anomaly Detection in Network KPI using Time Series

    20 Mins
    Experience Report
    Intermediate

    Abstract:

    Accurately detecting Key Performance Indicator (KPI) anomalies is a critical issue in cellular network management. In this talk I shall introduce CNR (Cellular Network Regression), a unified performance anomaly detection framework for KPI time-series data. CNR uses simple statistical modelling and machine-learning-based regression for anomaly detection; in particular, it takes into account seasonality and trend components and supports automated retraining of the prediction model based on prior detection results. I demonstrate how CNR detects two types of anomalies of practical interest, namely sudden drops and correlation changes, based on a large-scale real-world KPI dataset collected from a metropolitan LTE network. I explore various prediction algorithms and feature selection strategies, and provide insights into how regression analysis can make automated and accurate KPI anomaly detection viable.

    Index Terms—anomaly detection, NPAR (Network Performance Analysis)
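
    To make the idea of predicting a KPI and flagging sudden drops concrete, here is a minimal sketch. It is not CNR: it swaps in a plain per-position seasonal average as the predictor, ignores trend and correlation changes, and uses hypothetical names, data, and a 3-sigma threshold chosen only for illustration.

```python
from statistics import mean, stdev


def seasonal_baseline(history, period):
    """Average the history at each position in the cycle (e.g. hour of day)."""
    return [mean(history[i::period]) for i in range(period)]


def detect_drops(history, new_values, period, n_sigmas=3.0):
    """Return indices of new values that fall far below the seasonal baseline.

    Assumes len(history) is a multiple of `period`, so that new_values
    continue the cycle exactly where the history ends.
    """
    baseline = seasonal_baseline(history, period)
    residuals = [v - baseline[i % period] for i, v in enumerate(history)]
    threshold = n_sigmas * stdev(residuals)
    anomalies = []
    for i, value in enumerate(new_values):
        expected = baseline[(len(history) + i) % period]
        if expected - value > threshold:  # only drops, not spikes
            anomalies.append(i)
    return anomalies


# Hypothetical KPI with a cycle of 4 samples, observed for 3 cycles.
history = [101, 121, 151, 111,
           99, 119, 149, 109,
           100, 120, 150, 110]
new_values = [101, 119, 90, 110]  # the third sample drops sharply
anomalies = detect_drops(history, new_values, period=4)
print(anomalies)
```

    Here only the third new sample (index 2) is flagged, since it sits about 60 units below its expected value while normal variation is around one unit. A regression model with trend terms and periodic retraining, as described in the abstract, would replace the simple seasonal average.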

    1. INTRODUCTION

    The continuing advances of cellular network technologies make high-speed mobile Internet access a norm. However, cellular networks are large and complex by nature, and hence production cellular networks often suffer from performance degradations or failures due to various reasons, such as background interference, power outages, malfunctions of network elements, and cable disconnections. It is thus critical for network administrators to detect and respond to performance anomalies of cellular networks in real time, so as to maintain network dependability and improve subscriber service quality. To pinpoint performance issues in cellular networks, a common practice adopted by network administrators is to monitor a diverse set of Key Performance Indicators (KPIs), which provide time-series data measurements that quantify specific performance aspects of network elements and resource usage. The main task of network administrators is to identify any KPI anomalies, which refer to unexpected patterns that occur at a single time instant or over a prolonged time period.

    Today’s network diagnosis still mostly relies on domain experts to manually configure anomaly detection rules; such a practice is error-prone, labour-intensive, and inflexible. Recent studies propose to use (supervised) machine learning for anomaly detection in cellular networks.