Privacy-preserving entity resolution and logistic regression on encrypted data
We consider two data providers, A and B, each holding a private dataset. The two datasets contain different feature sets describing customers/entities that the providers have in common. Together they aim to learn a linear model using a stochastic gradient method such as SGD or SAG. The setting is federated learning: data is kept locally and a shared model is learned on top of local computation. In contrast with the large majority of work on distributed learning, in our scenario the data is split vertically, i.e. by features. We also assume that only A knows the target variable.
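To make the vertical split concrete, here is a minimal plaintext sketch in Python. It is an illustration only, not the system described in this talk: there is no encryption and no coordinator, and every name in it (`vertical_sgd_step`, `centralised_sgd_step`, the toy weights and features) is hypothetical. It shows that when features are split by provider, the linear score is the sum of two locally computed partial scores, and each provider's block of the logistic-loss gradient depends only on its own features and a shared residual, which requires the label held by A.

```python
import math


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def vertical_sgd_step(w_a, w_b, x_a, x_b, y, lr=0.1):
    """One SGD step on the logistic loss for a single matched entity.

    A holds (w_a, x_a, y); B holds (w_b, x_b). Everything is in the clear
    here, which a privacy-preserving protocol would of course avoid.
    """
    # Each provider computes its partial score locally; the sum is the logit.
    z = dot(w_a, x_a) + dot(w_b, x_b)
    # The residual needs the label y, which only A knows.
    residual = sigmoid(z) - y
    # Each gradient block uses only that provider's own features.
    new_w_a = [w - lr * residual * x for w, x in zip(w_a, x_a)]
    new_w_b = [w - lr * residual * x for w, x in zip(w_b, x_b)]
    return new_w_a, new_w_b


def centralised_sgd_step(w, x, y, lr=0.1):
    """The same step computed on the joined feature vector, for comparison."""
    residual = sigmoid(dot(w, x)) - y
    return [wi - lr * residual * xi for wi, xi in zip(w, x)]


# Toy data for one entity known to both providers (values are made up).
w_a, w_b = [0.2, -0.1], [0.05, 0.3, -0.4]
x_a, x_b = [1.0, 2.0], [0.5, -1.0, 3.0]

split_a, split_b = vertical_sgd_step(w_a, w_b, x_a, x_b, y=1)
joint = centralised_sgd_step(w_a + w_b, x_a + x_b, y=1)
```

Up to floating-point rounding, `split_a + split_b` matches `joint`, so the only quantities that would need to cross the boundary between providers are the partial scores and the residual.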
We have developed a secure system that solves this problem in two phases: privacy-preserving entity resolution, followed by logistic regression over encrypted data. With the aid of a coordinator, C, we design a three-party protocol that is secure under the honest-but-curious adversary model. Our system allows A and B to learn a classifier collaboratively without exposing their data in the clear and without even revealing which entities they have in common. Privacy-preserving entity resolution is achieved with Bloom filters in the form of Cryptographic Longterm Keys, and computation on encrypted values is based on the Paillier cryptosystem, a partially (additively) homomorphic encryption scheme.
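The two building blocks named above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation behind this talk. The Bloom-filter part hashes character bigrams with a keyed HMAC into a fixed-length bit set and compares encodings with the Dice coefficient; the filter length, hash count, and hashing scheme are assumptions chosen for brevity. The Paillier part uses the common simplification g = n + 1 and tiny hard-coded primes, so it is insecure and useful only for seeing the additive homomorphism. All function names are hypothetical.

```python
import hashlib
import hmac
import math
import secrets

# --- Entity resolution: simplified keyed Bloom-filter encoding ---

def bigrams(value):
    """Character bigrams of a padded, lower-cased string."""
    padded = f" {value.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode_record(fields, key, length=1024, k=20):
    """Hash every bigram of every field into one bit set (stored as an int)."""
    bits = 0
    for field in fields:
        for gram in bigrams(field):
            for i in range(k):
                digest = hmac.new(key, f"{i}:{gram}".encode(), hashlib.sha256).digest()
                bits |= 1 << (int.from_bytes(digest[:8], "big") % length)
    return bits


def dice(a, b):
    """Dice coefficient of two bit sets: 2|a & b| / (|a| + |b|)."""
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * bin(a & b).count("1") / total if total else 0.0


# --- Encrypted computation: toy Paillier with g = n + 1 ---

def paillier_keygen(p, q):
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)  # modular inverse (Python 3.8+)
    return n, (phi, mu)


def encrypt(n, m):
    n2 = n * n
    while True:  # pick a random r that is invertible modulo n
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m = 1 + m*n (mod n^2), so no large exponentiation is needed for g^m.
    return (1 + m * n) % n2 * pow(r, n, n2) % n2


def decrypt(n, private_key, c):
    phi, mu = private_key
    u = pow(c, phi, n * n)
    return (u - 1) // n * mu % n


def add_encrypted(n, c1, c2):
    """Ciphertext product decrypts to the sum of the plaintexts."""
    return c1 * c2 % (n * n)


def mul_plain(n, c, scalar):
    """Ciphertext power decrypts to scalar times the plaintext."""
    return pow(c, scalar, n * n)


key = b"secret shared by A and B"  # hypothetical linkage key
sim_close = dice(encode_record(["john", "smith"], key), encode_record(["jon", "smith"], key))
sim_far = dice(encode_record(["john", "smith"], key), encode_record(["mary", "jones"], key))

n, priv = paillier_keygen(1009, 1013)  # toy primes, far too small for real use
enc_sum = add_encrypted(n, encrypt(n, 17), encrypt(n, 25))
enc_scaled = mul_plain(n, encrypt(n, 7), 6)
```

In this sketch the near-duplicate name pair scores a higher Dice similarity than the unrelated pair, and `decrypt(n, priv, enc_sum)` and `decrypt(n, priv, enc_scaled)` both recover 42 without the party doing the arithmetic ever seeing 17, 25, or 7. Addition of ciphertexts and multiplication by a plaintext scalar are exactly the operations a gradient computation over encrypted partial results can rely on.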
Outline/structure of the Session
The talk will begin with a motivating example that illustrates the particular restrictions and complexities of performing machine learning in a privacy-preserving context. It will then introduce and explain the technology used to build a system that addresses these problems for logistic regression trained by stochastic gradient descent. Next, it will present benchmark results from real deployments that demonstrate the viability of the solution in practice. It will conclude with a brief overview of improvements in progress and ideas for future work.
This talk introduces the audience to the difficulties of applying machine learning techniques in situations where the data may be subject to access and/or movement restrictions on account of its sensitive nature. The talk will explain the details of a practical solution to the specific problem of performing logistic regression in this context.
Target audience: data analysts, data scientists, chief data officers, information security compliance officers, and anyone concerned with data privacy.