Covariate Shift - Challenges and Good Practice
What is covariate shift?
A fundamental assumption in supervised machine learning is that the training and query data are drawn from the same distribution. In real-life applications this is very often not the case, because the query data distribution is unknown and cannot be guaranteed a priori. For example, selection bias in collecting training samples makes the distribution of the training data differ from that of the overall population. When the distribution of the inputs (covariates) differs between training and query data while the relationship between inputs and targets stays the same, the problem is known as covariate shift in the machine learning literature. Using a machine learning algorithm in this situation can result in spurious and often over-confident predictions.
How do we detect covariate shift?
Covariate shift is only detectable when we have access to query data. Visualising the training and query data side by side is a helpful way to gain an initial impression. Machine learning models can also be used to detect covariate shift. For example, a Gaussian process can model the similarity between each query point and the training data in feature space, and a one-class SVM fitted to the training data can flag query points that are outliers with respect to it. Both strategies detect query points that lie in a different region of feature space from the training dataset.
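As a rough illustration of the one-class SVM approach, the sketch below fits a detector on training features and compares the fraction of points flagged as outliers in the training and query sets. The synthetic data, the `nu` and `gamma` settings, and the variable names are assumptions made for this example, not part of the session material.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Synthetic data (assumed for illustration): the query set is shifted
# away from the training set in feature space.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_query = rng.normal(loc=3.0, scale=1.0, size=(200, 2))

# Scale using training statistics only, then fit the detector on training data.
scaler = StandardScaler().fit(X_train)
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
detector.fit(scaler.transform(X_train))

# predict() returns +1 for inliers and -1 for outliers.
train_outlier_rate = np.mean(detector.predict(scaler.transform(X_train)) == -1)
query_outlier_rate = np.mean(detector.predict(scaler.transform(X_query)) == -1)

print(f"outliers in training data: {train_outlier_rate:.2f}")
print(f"outliers in query data:    {query_outlier_rate:.2f}")
```

A query outlier rate well above the training outlier rate suggests that many query points fall outside the region covered by the training data. What counts as "well above" is a judgement call that depends on the application.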
Strategies for handling covariate shift
We suggest two strategies to mitigate covariate shift: re-weighting training data, and active learning with probabilistic models.
First, re-weighting assigns each training point a weight so that the distribution statistics of the weighted training set match those of the query set in feature space. When the model is trained (and validated) on re-weighted data, it is expected to generalise better to query data. However, this requires significant overlap between the training and query datasets in feature space: weights cannot compensate for regions that the training data never covers.
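One common way to obtain such weights is to train a classifier to distinguish training points from query points and convert its probabilities into density-ratio estimates. The sketch below uses that technique, which is only one of several options and is not necessarily the method presented in the session; the synthetic data, the weight clipping threshold, and the choice of `Ridge` as the downstream model are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

# Synthetic data (assumed): query inputs are shifted relative to training inputs.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 1))
X_query = rng.normal(1.0, 1.0, size=(500, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=len(X_train))

# Domain classifier: label 0 for training points, 1 for query points.
X_domain = np.vstack([X_train, X_query])
d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_query))])
domain_clf = LogisticRegression().fit(X_domain, d)

# Density-ratio estimate w(x) ~ p_query(x) / p_train(x), corrected for the
# different sizes of the two sets.
p_query = domain_clf.predict_proba(X_train)[:, 1]
weights = (p_query / (1.0 - p_query)) * (len(X_train) / len(X_query))
weights = np.clip(weights, 0.0, 20.0)  # assumed cap to limit variance
weights = weights / weights.mean()     # normalise to mean 1

# Train the downstream model on the re-weighted training data.
model = Ridge(alpha=1.0).fit(X_train, y_train, sample_weight=weights)

# Sanity check: the weighted training mean should move towards the query mean.
weighted_mean = np.average(X_train[:, 0], weights=weights)
print(weighted_mean, X_query[:, 0].mean())
```

The same weights can be applied to the validation loss so that model selection also reflects the query distribution. Very large weights are a sign that the overlap between the two sets is poor.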
Second, we may be able to acquire labels for a small portion of the query set, potentially at great expense, to reduce the effects of covariate shift. Probabilistic models are required in this case because they quantify the uncertainty in their predictions. Active learning uses that uncertainty to select small subsets of query points whose labels are expected to shrink the uncertainty in our overall prediction the most.
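The loop below is a minimal sketch of this idea using a Gaussian process and a simple greedy heuristic: at each step, label the query point with the largest predictive standard deviation. This heuristic is a stand-in for more principled selection criteria and is not claimed to be optimal. The `oracle` function is hypothetical and stands for whatever expensive labelling process is available; the data and the fixed kernel settings are also assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def oracle(X):
    """Hypothetical labelling function standing in for an expensive measurement."""
    return np.sin(X[:, 0])


# Synthetic data (assumed): training and query inputs cover different regions.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3.0, 0.0, size=(30, 1))
y_train = oracle(X_train) + 0.05 * rng.normal(size=len(X_train))
X_query = rng.uniform(0.0, 3.0, size=(200, 1))

X_lab, y_lab = X_train.copy(), y_train.copy()
unlabelled = np.ones(len(X_query), dtype=bool)
mean_std_history = []

for _ in range(6):
    # Kernel hyperparameters are held fixed (optimizer=None) to keep the sketch simple.
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                  alpha=0.05 ** 2, optimizer=None)
    gp.fit(X_lab, y_lab)
    _, std = gp.predict(X_query, return_std=True)
    mean_std_history.append(std.mean())

    # Pick the still-unlabelled query point with the highest uncertainty.
    idx = int(np.argmax(np.where(unlabelled, std, -np.inf)))
    X_lab = np.vstack([X_lab, X_query[idx:idx + 1]])
    y_lab = np.append(y_lab, oracle(X_query[idx:idx + 1]))
    unlabelled[idx] = False

print(mean_std_history)
```

Each entry of `mean_std_history` is the average predictive uncertainty over the query set before the next label is acquired, so the list shows how quickly a handful of well-chosen labels reduces it.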
Outline/structure of the Session
- Outline the issue of covariate shift and how it can lead to pathological prediction problems
- How to detect covariate shift when we have query data on hand
- Detection of covariate shift in live production
- Regression/classification methods that are robust to covariate shift
- Two strategies for handling covariate shift
- First strategy: Re-weighting training data (for training and validation)
- Second strategy: Active learning with probabilistic models
Learning outcomes
- Gain an understanding of covariate shift and its effects in production
- Learn methods for detecting covariate shift in a number of situations you are likely to encounter in production
- Learn how to effectively deal with covariate shift
Target audience
Data scientists and machine learners