Respecting privacy with synthetically generated "look-alike" data sets

Location: Sydney. Schedule: May 15th, 09:50 - 10:20 AM. Place: Wesley Theatre.

Safely handling data that contains sensitive or private information about people is a multimillion-dollar problem at many companies. It adds time to the data engineering process, can cost a lot in software licenses for specialised tools, and brings a range of reputational and legal risks.

Recent advances in deep learning have prompted an interesting way to attack this problem. By fitting a certain class of model to a source data set that contains sensitive information, we can produce a generator that outputs a supply of synthetic "look-alike" data. This output data preserves many of the statistical relationships between fields that are present in the source, and the approach offers mathematical guarantees around the identifiability of individuals in the source data set.
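
To give a rough feel for the fit-then-sample idea described above, here is a minimal sketch in Python. It is deliberately not the GAN approach this talk covers, and it offers no privacy guarantees: it swaps in a simple two-field Gaussian model as a stand-in for the generator, purely to show that sampled records can preserve a relationship between fields without copying any source row. The (age, income) data and all function names are hypothetical and invented for illustration.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate the means, variances and covariance of two numeric fields."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in rows) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return mx, my, sxx, syy, sxy


def make_generator(params, seed=0):
    """Return a function that samples synthetic rows from the fitted model.

    Stand-in for the trained generator: a bivariate Gaussian sampled via the
    Cholesky factor of the 2x2 covariance matrix. Not a GAN.
    """
    mx, my, sxx, syy, sxy = params
    rng = random.Random(seed)
    a = math.sqrt(sxx)
    b = sxy / a
    c = math.sqrt(max(syy - b * b, 0.0))

    def sample():
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        return mx + a * z1, my + b * z1 + c * z2

    return sample


def correlation(rows):
    """Pearson correlation between the two fields."""
    _, _, sxx, syy, sxy = fit_gaussian(rows)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical "sensitive" source data: (age, income) pairs, made up here.
src_rng = random.Random(42)
source = []
for _ in range(2000):
    age = src_rng.uniform(20, 65)
    income = 1000 * age + src_rng.gauss(0, 8000)
    source.append((age, income))

# Fit the stand-in model, then draw a synthetic "look-alike" data set.
generate = make_generator(fit_gaussian(source), seed=1)
synthetic = [generate() for _ in range(2000)]

print("source correlation:   ", round(correlation(source), 3))
print("synthetic correlation:", round(correlation(synthetic), 3))
print("rows copied from source:", len(set(source) & set(synthetic)))
```

A GAN-based generator would replace `fit_gaussian` and `make_generator` with a trained pair of neural networks, which can capture far richer relationships than a single correlation; the formal guarantees mentioned above come from that approach, not from this toy model.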

This talk will provide an overview of the approach and show how it can speed up data engineering work and reduce risk.


Outline/Structure of the Talk

4 minutes: frame problem and who suffers from it
3 minutes: Overview of traditional solutions and where they fail (light and humorous)
3 minutes: Description of specific solution suggested here (pair of deep neural nets trained as part of a GAN) and why it solves traditional problems
8 minutes: walk-through of a worked example
2 minutes: implementation tips and unanswered questions ("future work")
4 minutes: summary
6 minutes: Q&A

Learning Outcome

Knowledge of a new and helpful method for respecting customer privacy, intuition around its implementation, and where to look (papers, other talks) for implementation details

Target Audience

Data engineers, managers (analytics, IT), machine learning practitioners

Prerequisites for Attendees

Familiarity with the problems of ETL, basic concepts of machine learning



Submitted 3 years ago