Respecting privacy with synthetically generated "look-alike" data sets
Safely handling data that contains sensitive or private information about people is a multi-million-dollar problem at many companies. It adds time to the data engineering process, can require expensive licenses for specialised tools, and carries a range of reputational and legal risks.
Recent advances in deep learning suggest an interesting way to attack this problem. By fitting a certain class of model to a source data set that contains sensitive information, we can produce a generator that outputs a supply of synthetic "look-alike" data. This output preserves many of the statistical relationships between fields in the source data, while offering mathematical guarantees around the identifiability of individuals in the source data set.
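As a much-simplified illustration of the "look-alike" idea (not the GAN approach the talk actually covers), one can fit a multivariate Gaussian to the source data and sample synthetic rows from it. The column names and distributions below are invented for the example, and plain Gaussian sampling provides none of the privacy guarantees discussed in the talk; it only shows what "preserving statistical relationships between fields" means in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "sensitive" source data: two correlated columns.
n = 5000
age = rng.normal(40, 10, n)
income = 1000 * age + rng.normal(0, 5000, n)
source = np.column_stack([age, income])

# Fit the simplest possible generative model: mean + covariance.
mean = source.mean(axis=0)
cov = np.cov(source, rowvar=False)

# Sample a synthetic "look-alike" data set from the fitted model.
synthetic = rng.multivariate_normal(mean, cov, size=n)

# The field-to-field relationship survives in the synthetic data.
src_corr = np.corrcoef(source, rowvar=False)[0, 1]
syn_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(round(src_corr, 2), round(syn_corr, 2))
```

A GAN plays the same role as the Gaussian here (a learned generator sampled in place of the real data), but can capture far richer, non-linear structure across many fields.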
This talk will provide an overview of the approach and show how it can speed up data engineering and reduce risk.
Outline/Structure of the Talk
4 minutes: frame problem and who suffers from it
3 minutes: overview of traditional solutions and where they fail (light and humorous)
3 minutes: description of the specific solution suggested here (a pair of deep neural nets trained as a GAN) and why it solves the traditional problems
8 minutes: walk-through of a worked example
2 minutes: implementation tips and unanswered questions ("future work")
4 minutes: summary
6 minutes: Q&A
Learning Outcome
Awareness of a new and helpful method for respecting customer privacy, intuition around how to implement it, and pointers (papers, other talks) to implementation details
Target Audience
Data engineers, managers (analytics, IT), machine learning practitioners
Prerequisites for Attendees
Familiarity with the problems of ETL and the basic concepts of machine learning