Data Science Project Governance Framework

The Data Science Project Governance Framework is a framework that any new Data Science business or team can follow. It helps in formulating strategies around how to leverage Data Science as a business, how to architect Data Science based solutions, team formation, ROI calculation approaches, the components of a typical Data Science project lifecycle, commonly available Deep Learning toolsets and frameworks, and best practices used by Data Scientists. I will use an actual use case while covering each of these aspects of building a team, and refer to examples from my own experience of setting up Data Science teams in a corporate/MNC setup.

A lot of research is happening around the world, across domains, to leverage Deep Learning, Machine Learning and Data Science based solutions to solve problems that would otherwise be impossible to solve with simple rule based systems. All the major players in the market are also setting up new Data Science teams to take advantage of modern State-of-the-Art ML/DL techniques. Even though most Data Scientists have a strong grasp of mathematical modeling techniques, they often lack the business acumen and management knowledge needed to drive Data Science based solutions in a corporate/MNC setup. On the other hand, management executives in most corporates/MNCs do not have first-hand knowledge of setting up a new Data Science team or of approaching business problems with Data Science. This session will help bridge that gap and give Executives and Data Scientists a common ground from which they can build a Data Science business/team from the ground up.

GitHub Link -> https://github.com/indranildchandra/DataScience-Project-Governance-Framework

 
 

Outline/Structure of the Talk

  • How to launch a Data Science based solution as a business
  • How to define ROI (Return-on-Investment) for your Data Science based solution
  • ROI calculation worksheets and samples
  • Plan for building a Data Science based solution
  • How to form a team to build a Data Science based solution
  • Lifecycle phases of a typical Supervised Machine Learning project and their interdependence
  • Dataflow management
  • Risk evaluation & Regulatory compliance strategy
  • Comparative review of Deep Learning toolsets and frameworks
  • Best practices followed by Data Scientists

Learning Outcome

You will learn how to set up a team/business for building a Data Science based solution. You will also get a detailed understanding of how to estimate the business impact of your Data Science based solution up front, using ROI calculation techniques, for better transparency.
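To make the ROI discussion concrete, here is a minimal sketch of the kind of calculation an ROI worksheet typically captures; the cost/benefit categories, the function name and all figures below are illustrative assumptions, not the actual worksheets from the talk or repository.

```python
# Illustrative sketch only: a simple ROI estimate for a Data Science
# initiative. Cost/benefit categories and figures are assumptions.

def estimate_roi(annual_benefit, annual_team_cost, annual_infra_cost,
                 annual_data_cost, years=3):
    """Return ROI as the ratio of net benefit to total cost over the horizon."""
    total_cost = (annual_team_cost + annual_infra_cost + annual_data_cost) * years
    total_benefit = annual_benefit * years
    return (total_benefit - total_cost) / total_cost


if __name__ == "__main__":
    # Hypothetical figures purely for illustration.
    roi = estimate_roi(
        annual_benefit=1_200_000,   # e.g. cost savings + incremental revenue per year
        annual_team_cost=600_000,   # Data Scientist / Engineer salaries per year
        annual_infra_cost=150_000,  # cloud, GPUs, tooling per year
        annual_data_cost=50_000,    # data acquisition / labelling per year
    )
    print(f"Projected 3-year ROI: {roi:.0%}")  # -> Projected 3-year ROI: 50%
```

A real worksheet would also discount future cashflows and separate one-time from recurring costs; the point of the sketch is only that benefits and costs are quantified before the project starts.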

Target Audience

Executives, Data Scientists, Data Engineers, Data Specialists, Machine Learning Engineers, and Data Science enthusiasts setting up or looking to set up a new Data Science business or team.

Prerequisites for Attendees

Principles of Data Science, Machine Learning and Programming.

Submitted 1 year ago

  • Liked Kabir Rustogi

    Kabir Rustogi - Generation of Locality Polygons using Open Source Road Network Data and Non-Linear Multi-classification Techniques

    Kabir Rustogi
    Head - Data Sciences
    Delhivery
    45 Mins
    Case Study
    Intermediate

    One of the principal problems in the developing world is the poor localization of addresses. This inhibits the discoverability of local trade, reduces the availability of amenities such as the creation of bank accounts and the delivery of goods and services (e.g., e-commerce), and delays emergency services such as fire brigades and ambulances. In general, people in the developing world identify an address based on neighbourhood/locality names and points of interest (POIs), which are neither standardized nor backed by official records that can help locate them systematically. In this paper, we describe an approach to build accurate geographical boundaries (polygons) for such localities.

    As training data, we are provided with two pieces of information for millions of address records: (i) a geocode, which is captured by a human for the given address, and (ii) the set of localities present in that address. The latter is determined either by manual tagging or by an algorithm which takes a raw address string as input and outputs the meaningful locality information present in that address. For example, for the address, “A-161 Raheja Atlantis Sector 31 Gurgaon 122002”, its geocode is given as (28.452800, 77.045903), and the set of localities present in that address is given as (Raheja Atlantis, Sector 31, Gurgaon, Pin-code 122002). Development of this algorithm is part of another project we are working on; details about the same can be found here.

    Many industries, such as the food-delivery, courier-delivery and KYC (know-your-customer) data-collection industries, are likely to have huge amounts of such data. Such crowdsourced data usually contain a large amount of noise, acquired either due to machine/human error in capturing the geocode, or due to error in identifying the correct set of localities from a poorly written address. For example, for the address, “Plot 1000, Sector 31 opposite Sector 40 road, Gurgaon 122002”, a machine may output the set of localities present in this address as (Sector 31, Sector 40, Gurgaon, Pin-code 122002), even though it is clear that the address does not lie in Sector 40.

    The solution described in this paper is expected to consume the provided data and output polygons for each of the localities identified in the address data. We assume that the localities for which we must build polygons are non-overlapping, e.g., this assumption is true for pin-codes. The problem is solved in two phases.

    In the first phase, we separate the noisy points from the points that lie within a locality. This is done by formulating the problem as a non-linear multi-classification problem. In the training data, the latitudes and longitudes of all non-overlapping localities act as features, and the corresponding locality names act as labels. The classifier is expected to partition the 2D space containing the latitudes and longitudes of the union of all non-overlapping localities into disjoint regions corresponding to each locality. These partitions are defined as non-linear boundaries, which are obtained by optimizing for two objectives: (i) the area enclosed by the boundaries should maximize the number of points of the corresponding locality and minimize the number of points of other localities, (ii) the separation boundary should be smooth. We compare two algorithms, decision trees and neural networks, for creating such partitions.

    In the second phase, we extract all the points that satisfy the partition constraints, i.e., lie within the boundary of a locality L, as candidate points for generating the polygon for locality L. The resulting polygon must contain all candidate points and should have the minimum possible area while maintaining the smoothness of the polygon boundary. This objective can be achieved by algorithms such as concave hull. However, since localities are always bounded by roads, we have further enhanced our locality polygons by leveraging open source data of road networks. To achieve this, we solve a non-linear optimisation problem which decides the set of roads to be selected, so that the enclosed area is minimized, while ensuring that all the candidate points lie within the enclosed area. The output of this optimisation problem is a set of roads, which represents the boundary of a locality L. (A simplified sketch of both phases is shown below.)
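    As a rough, purely illustrative sketch of the two phases (not the authors' actual implementation): the file name and column names below are assumptions, and a convex hull stands in for the concave hull refined by road networks described in the abstract.

```python
# Toy sketch of the two phases; file name, columns, and the convex-hull
# stand-in for the concave-hull + road-network step are assumptions.
import pandas as pd
from shapely.geometry import MultiPoint
from sklearn.tree import DecisionTreeClassifier

# Phase 1: partition (lat, lon) space by locality with a decision tree.
# Limiting depth / leaf size keeps the boundaries relatively smooth.
df = pd.read_csv("address_geocodes.csv")  # assumed columns: lat, lon, locality
clf = DecisionTreeClassifier(max_depth=12, min_samples_leaf=50)
clf.fit(df[["lat", "lon"]], df["locality"])

# Keep only points whose predicted locality matches their tag (noise removal).
df["predicted"] = clf.predict(df[["lat", "lon"]])
candidates = df[df["predicted"] == df["locality"]]

# Phase 2: build one polygon per locality from its candidate points.
polygons = {
    locality: MultiPoint(list(zip(group["lon"], group["lat"]))).convex_hull
    for locality, group in candidates.groupby("locality")
}
print({name: poly.wkt[:60] for name, poly in polygons.items()})
```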

  • Liked Favio Vázquez

    Favio Vázquez - Complete Data Science Workflows with Open Source Tools

    90 Mins
    Tutorial
    Beginner

    Cleaning, preparing, transforming, exploring and modeling data is what we hear about all the time in data science, and these steps may be the most important ones. But that's not all there is to data science: in this talk you will learn how the combination of Apache Spark, Optimus, the Python ecosystem and Data Operations can form a complete framework for data science that will allow you and your company to go further, beyond common sense and intuition, to solve complex business problems. (A minimal PySpark sketch follows below.)
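    As a minimal sketch of the cleaning/exploration step such a workflow starts with, here is a plain PySpark example; the file name and column names are assumptions, and the Optimus and Data Operations layers from the talk are not shown.

```python
# Minimal PySpark cleaning/exploration sketch; input file and columns assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("workflow-sketch").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)   # assumed input
clean = (df.dropna(subset=["amount"])                              # prepare: drop missing target
           .withColumn("amount", F.col("amount").cast("double"))   # transform
           .filter(F.col("amount") > 0))

# Explore: average amount per region.
clean.groupBy("region").agg(F.avg("amount").alias("avg_amount")).show()
```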

  • Liked Karthik Bharadwaj T

    Karthik Bharadwaj T - Failure Detection using Driver Behaviour from Telematics

    45 Mins
    Case Study
    Beginner

    Telematics data have the potential to unlock 1.5 trillion in revenue. Unfortunately, this data has not been tapped by many users.

    In this case study, Karthik Thirumalai discusses how telematics data can be used to identify driver behaviour and enable preventive maintenance in automobiles.

  • Liked Karthik Bharadwaj T

    Karthik Bharadwaj T - 7 Habits to Ethical AI

    45 Mins
    Talk
    Beginner

    While AI is being put to use to solve great problems of the world, it is subject to questions about the morality of how it is constructed and used. Karthik Thirumalai addresses the 7 habits of building ethical AI solutions and how they can be put to use for a better world. These habits (Data Governance, Fairness, Privacy and Security, Accountability, Transparency, and Education) help organizations successfully implement an AI strategy that reflects fundamental human principles and moral values.

  • Liked SUDIPTO PAL

    SUDIPTO PAL - Use cases of Financial Data Science Techniques in retail

    SUDIPTO PAL
    STAFF DATA SCIENTIST
    Walmart Labs
    20 Mins
    Talk
    Intermediate

    Financial domains like Insurance and Banking have uncertainty as an inherent product feature, and hence make extensive use of statistical models to develop, value and price their products. This presentation will showcase some of the techniques popularly used in financial products, such as survival models and cashflow prediction models, and show how they can be used in Retail data science by drawing out analogies and similarities.

    Survival models were traditionally used for modeling mortality, then got extended to modeling queues, waiting times and attrition. We showcase: 1) how the waiting-time aspect can be used to model customers' repeat purchase behaviour, and how to use it for product recommendation at particular time intervals; 2) how the same survival or waiting-time problem can be solved using discrete-time binary-response survival models (as opposed to traditional proportional hazards and AFT models for survival), as sketched at the end of this abstract; 3) quick coverage of other use cases like attrition, CLTV (customer lifetime value) and inventory management.

    We show a use case where survival models can be used to predict the timing of events (e.g. attrition/renewal, purchase, purchase order for procurement), and use that to predict the timing of cashflows associated with events (e.g. subscription fee received from renewals, procurement cost etc.), which are typically used for capital allocation.

    We also show how the backdated predicted cashflows can be used as a baseline to make causal inferences about a strategic intervention (e.g. a campaign launched to contain attrition) by comparing with actual cashflows post-intervention. This can be used to retrospectively evaluate the impact of strategic interventions.
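    To illustrate the discrete-time binary-response formulation mentioned above, here is a minimal sketch (not the speaker's code): each customer is expanded into one row per period with a binary "event in this period" outcome, and a logistic regression over the period index and covariates yields the discrete-time hazard. All column names and numbers are invented for illustration.

```python
# Discrete-time survival as a binary-response model; data is person-period
# format (one row per customer per week), all values are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    "weeks_since_last_purchase": [1, 2, 3, 1, 2, 1, 2, 3, 4],
    "avg_basket_value":          [30, 30, 30, 75, 75, 20, 20, 20, 20],
    "purchased_this_week":       [0, 0, 1, 0, 1, 0, 0, 0, 1],
})

X = data[["weeks_since_last_purchase", "avg_basket_value"]]
y = data["purchased_this_week"]

model = LogisticRegression().fit(X, y)

# Predicted probability per period is the discrete-time hazard; recommending
# a product when the hazard peaks times the recommendation to the customer.
data["hazard"] = model.predict_proba(X)[:, 1]
print(data)
```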

  • Liked Dr Hari Krishna Maram

    Dr Hari Krishna Maram - Future of Technology

    Dr Hari Krishna Maram
    Chairman
    Vision Digital
    20 Mins
    Talk
    Executive

    Future of Technology covers trends in technology across the globe and the innovations changing the future.

  • Liked AbdulMajedRaja

    AbdulMajedRaja - What happens out there? In the Real-World, With R

    AbdulMajedRaja
    Analyst (IC)
    Cisco Systems
    45 Mins
    Talk
    Beginner

    This talk has two main sections: first, an explanation of the (non-obvious) things that are possible with R, and second, how well-known organizations are using R in their companies. R is one of the most popular programming languages in Data Science / Analytics.