Privacy-Law Aware ML Data Preparation

India is implementing the new Personal Data Protection (PDP) law,
which is similar to GDPR and CCPA. All enterprise data services within
the scope of the law, including analytics and data science, are
required to comply with it. Almost all major geographies have now
passed similar laws, and organizations are increasingly expected to
handle data responsibly.

Enrich, our product, is a high-trust data preparation platform for
enterprises that provides data input to analysts and models at scale
every day. Because of their ‘fan-out’ nature, such data preparation
services sit on the critical path of an organization’s compliance and
privacy activities. They provide a convenient place to enforce policy
and safety mechanisms.

In this talk we discuss some of the mechanisms that we are building
for clients in our data preparation platform, Enrich. They include an
open-source compliance checklist to help with the process, a ‘right to
forget’ service built on an anonymized lookup key service, and a
metadata service to enable tracking of datasets. The focus will be on
the generic capabilities, and not on Scribble or our product.
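
As a rough illustration of the ‘right to forget’ idea, the sketch below
shows one way an anonymized lookup key service could work. It is not
Enrich's actual implementation: the class name `LookupKeyService`, the
helper `pseudonymize`, and the field names `email` and `subject_key` are
all hypothetical. Prepared datasets carry only a random key, and
forgetting a person means deleting the mapping that ties the key back to
them.

```python
import secrets


class LookupKeyService:
    """Hypothetical in-memory service mapping identifiers to random keys."""

    def __init__(self):
        self._key_by_identifier = {}

    def key_for(self, identifier: str) -> str:
        # Issue a random key on first sight and reuse it afterwards,
        # so rows about the same person can still be joined.
        if identifier not in self._key_by_identifier:
            self._key_by_identifier[identifier] = secrets.token_hex(16)
        return self._key_by_identifier[identifier]

    def forget(self, identifier: str) -> bool:
        # Deleting the mapping severs the link between the person
        # and any rows that carry the old key.
        return self._key_by_identifier.pop(identifier, None) is not None


def pseudonymize(rows, service, id_field="email"):
    """Replace the identifying field in each row with its lookup key."""
    prepared = []
    for row in rows:
        clean = {k: v for k, v in row.items() if k != id_field}
        clean["subject_key"] = service.key_for(row[id_field])
        prepared.append(clean)
    return prepared
```

In this sketch the keys are random rather than derived from the
identifier, so the key alone reveals nothing once the mapping is gone.
Whether the remaining columns are enough to re-identify someone is a
separate question that this example does not address.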

Note: Will update this over the next few days and weeks

 
 

Outline/Structure of the Talk

  • 1. PDP and Impact (4 mins)
    • Provisions with Architectural Significance
  • 2. Scribble as Data Processor (2 mins)
  • 3. PDP Awareness (10 mins)
    • Open-source compliance checklist
    • Data Inventory & Classification
    • Data Quality Monitoring
    • Consent manager & Data sanitization
  • 4. Open Challenges (1-2 mins)
    • Extending to enterprise beyond ML data prep
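
To make the checklist and data classification items in the outline more
concrete, here is a minimal sketch of what a machine-readable checklist
could look like. The item IDs, the metadata fields (`columns`,
`sensitivity`, `consent_purpose`), and the checks themselves are
hypothetical; they are not taken from the PDP law or from the actual
open-source checklist.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ChecklistItem:
    item_id: str
    description: str
    check: Callable[[dict], bool]  # takes dataset metadata, returns pass/fail


def has_personal_data(metadata: dict) -> bool:
    # A dataset counts as personal if any column is classified "personal".
    return any(c.get("sensitivity") == "personal" for c in metadata["columns"])


# Hypothetical checklist entries, for illustration only.
CHECKLIST = [
    ChecklistItem(
        "inventory-001",
        "Every column has a sensitivity class",
        lambda m: all(c.get("sensitivity") for c in m["columns"]),
    ),
    ChecklistItem(
        "consent-001",
        "Datasets with personal data record a consent purpose",
        lambda m: not has_personal_data(m) or bool(m.get("consent_purpose")),
    ),
]


def evaluate(metadata: dict) -> dict:
    """Run every checklist item against one dataset's metadata."""
    return {item.item_id: item.check(metadata) for item in CHECKLIST}
```

A pipeline could run `evaluate` on each dataset's metadata before
publishing it and block or flag datasets with failing items.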

Learning Outcome

Audience will learn:

1. What it will take to support PDP Law

2. Machine-readable compliance checklist

3. Example implementation from Scribble

Target Audience

Data Scientists, Data Product Managers, ML Engineers

Prerequisites for Attendees

1. Familiarity with production data science and feature engineering

Submitted 8 months ago

Public Feedback

  • Ravi Balasubramanian
    By Ravi Balasubramanian  ~  7 months ago

    Hi Venkata,

    Thanks for the proposal. Can you please expand your talk to address a few of the details below:

    • Most governments have implemented some form of GDPR-like, in-country data movement law. The talk in its current form highlights the work your company is undertaking in this area.
    • Can you please highlight and provide more details on the technical ML aspects of your work, with some concrete examples?
    • Can you please expand on the challenges of preparing data to satisfy these laws, through a concrete example or use case, so that the broader audience can understand and appreciate them?
    • Venkata Pingali
      By Venkata Pingali  ~  7 months ago
      Hi Ravi, did my note address your thoughts/questions? Happy to take suggestions or further questions.
      • Ravi Balasubramanian
        By Ravi Balasubramanian  ~  7 months ago

        Yes, Thanks.

        • Venkata Pingali
          By Venkata Pingali  ~  7 months ago
          Great! Looking forward to community interactions around this and related topics. Had a good experience at ODSC 2019.
    • Venkata Pingali
      By Venkata Pingali  ~  7 months ago

      Hi Ravi,

      Thanks for your interest. A few thoughts inline:

      > Most governments have implemented some form of GDPR-like, in-country data movement law. The talk in its current form highlights the work your company is undertaking in this area.

      > Can you please highlight and provide more details on the technical ML aspects of your work, with some concrete examples?

      I spoke at length at ODSC 2019 about the architecture of our feature engineering platform. The video linked to this proposal is from that talk.

      The focus of this talk is on compliance with the PDP law. BTW, it covers cross-country movement of data as well. Because we process substantial data for our customers, we had to study the law and evolve our thinking and systems.

      The talk shares the significant problems we had to look at and our approaches to them. Where reusable open-source tools are available, we highlight them.

      Given the short time (20 mins, ~12 slides), we had to keep everything crisp and to the point.

      > Can you please expand on the challenges of preparing data to satisfy these laws, through a concrete example or use case, so that the broader audience can understand and appreciate them?

      There are 60 or so elements (contained in the checklist). We picked the four that had the highest impact (e.g., right to forget) and highlighted them on Slide 6. The rest of the slides address these four elements. I plan to give (anonymized) examples from our deployments, as I did before.

      The objectives are NOT:
      (a) An extensive tutorial on PDP
      (b) A product demo

      The objectives are:
      (a) Grasp the implications of PDP
      (b) Learn from our experience

      Let me know if this note addresses your concerns. If not, I am happy to take suggestions to make the talk more interesting to the audience.

  • Madalasa Venkataraman
    By Madalasa Venkataraman  ~  7 months ago

    Thanks for the proposal, Venkata!

    Could you share the slideware for the current presentation with a time breakup?

    • Venkata Pingali
      By Venkata Pingali  ~  7 months ago

      Please check now. It is not quite done but is beginning to take shape. 

    • Venkata Pingali
      By Venkata Pingali  ~  7 months ago
      Working on them. When do you need them by to meet your process requirements? 
  • Natasha Rodrigues
    By Natasha Rodrigues  ~  7 months ago

    Hi Venkata,


    Thanks for your proposal! Could you update the Outline/Structure section of your proposal with a time-wise breakup of how you plan to use the 20 mins for the topics you've highlighted?

    To help the program committee understand your proposal a little better, can you add the slides related to this topic?

    Thanks, Natasha

    • Venkata Pingali
      By Venkata Pingali  ~  7 months ago
      Hi Natasha, I was able to select only 20 mins. Is it possible to select a 40-min slot? Based on your suggestion, I will update the flow timings.
      • Natasha Rodrigues
        By Natasha Rodrigues  ~  7 months ago

        Hi Venkata,

        Currently we are requesting only 20-min sessions.

        Thanks,

        Natasha 

         

        • Venkata Pingali
          By Venkata Pingali  ~  7 months ago
          Got it. Will update the outline with timing.