After performing all the other tasks of machine learning, one interesting task still remains: analyzing your models and evaluating their performance. To do this, you divide the dataset into two parts: the first part, which usually contains the majority of the samples, is used to train the models, while the remaining samples are used to test how well they perform.
Once the trained models produce good results on various metrics, you decide to deploy the model that performs best on the test set. Before deploying it in production, however, there is an important mechanism to understand: even when a model performs impressively on the test data, deploying it in real time can undermine the value the algorithm creates if the phenomenon of data leakage has not been checked beforehand.

What is Data Leakage?

Data leakage refers to a mistake in which information is accidentally shared between the training and test datasets. When splitting a dataset into training and test sets, the goal is to ensure that no data is shared between the two, because the purpose of the test set is to simulate real-world, unseen data. Since we have full access to both sets while evaluating a model, it is up to us to ensure that no data from the training set is present in the test set.
Data leakage often results in unrealistically high performance on the test set, because the model is being run on data it has already seen during training. The model effectively memorizes the training dataset and can easily output the correct labels for those test examples. Clearly, this is not ideal, as it misleads the person evaluating the model: when such a model is used on truly unseen data, its performance will be much lower than expected.
Data leakage is a serious problem for three reasons:

  1. You may be creating overly optimistic models that are practically useless and cannot be used in production.
  2. It is a problem if you are running a machine learning competition: top models will exploit the leaky data rather than be good general models of the underlying problem.
  3. It is a problem when you are a company providing your data: reversing an anonymization can result in a privacy breach that you did not expect.

In machine learning, data leakage (also known as target leakage) is the use of information in the model training process which would not be expected to be available at prediction time, causing the predictive scores to overestimate the model’s utility when run in a production environment.


Source: Leakage (machine learning) – Wikipedia

Data Leakage during Data Preprocessing

When solving a machine learning problem, we first perform data cleansing and preprocessing, which involves steps such as:

  • Evaluating the parameters for scaling features.
  • Finding the minimum and maximum values of a particular feature.
  • Handling outliers.
  • Filling in or removing missing values in the dataset.

These steps should be performed using only the training set. If we use the entire dataset for these operations, data leakage may occur: applying preprocessing techniques to the whole dataset lets the model learn not only from the training set but also from the test set, even though the test set should remain new and previously unseen by the model.
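
As a minimal sketch of this idea (assuming a scikit-learn workflow, with hypothetical random data standing in for a real dataset), the scaler below is fit on the training split only and then merely applied to the test split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data standing in for a real dataset: 1,000 samples, 10 features
rng = np.random.default_rng(42)
X = rng.random((1000, 10))
y = rng.integers(0, 2, size=1000)

# Split first, so the test set plays no part in preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from train only
X_test_scaled = scaler.transform(X_test)        # the same statistics merely applied
```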

How to detect Data Leakage?

Let’s consider the following ways of detecting data leakage:

  • In general, if the model we build performs far better than expected (i.e., predicted and actual outputs match almost perfectly), we should get suspicious. The model might somehow be memorizing the relation between features and target rather than learning one that generalizes to unseen data. It is therefore advisable to weigh the results against prior documented or expected results before testing.
  • While doing Exploratory Data Analysis, we may detect features that are very highly correlated with the target variable. Of course, some features are more correlated than others, but a surprisingly high correlation needs to be checked and handled carefully; a quick way to run this check is sketched after this list.
  • After model training is complete, features with very high weights deserve close attention: they might be leaky.
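
As a rough illustration of the correlation check above (using a tiny hypothetical DataFrame with a deliberately leaky column), one could do something like:

```python
import pandas as pd

# Tiny hypothetical dataset where "leaky_feature" encodes the target almost directly
df = pd.DataFrame({
    "feature_a":     [1.2, 0.7, 3.1, 2.4, 0.9, 2.8],
    "leaky_feature": [0.99, 0.01, 1.00, 0.98, 0.02, 1.00],
    "target":        [1, 0, 1, 1, 0, 1],
})

# Absolute correlation of every feature with the target, highest first
corr_with_target = (
    df.corr()["target"]
      .drop("target")
      .abs()
      .sort_values(ascending=False)
)
print(corr_with_target)  # a near-perfect value flags a candidate leak
```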

Techniques to overcome Data Leakage

Perform data preparation within k-fold cross validation

One of the best ways to avoid data leakage is to perform k-fold cross validation, where the overall data is divided into k parts. Each part is used in turn as the validation data while the remaining parts serve as training data, and the performance measured across the k folds is averaged to give the model’s overall performance. Note that if you scale your entire dataset first and only then estimate your model’s performance using cross validation, you have already committed the sin of data leakage: data preparation must happen inside each fold, as in the sketch below.
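
One common way to achieve this, assuming a scikit-learn workflow (the data here is synthetic, and the choice of StandardScaler plus LogisticRegression is just for illustration), is to wrap the preprocessing and the model in a Pipeline, so that cross_val_score re-fits the preprocessing inside every fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# The pipeline re-fits the scaler inside every fold, on that fold's training
# portion only, so the held-out fold never influences the scaling parameters
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print(f"Mean accuracy across folds: {scores.mean():.3f} (+/- {scores.std():.3f})")
```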

Dropping the duplicates

When the dataset contains duplicate rows, one copy of a row may end up in the training set while the other lands in the test set. Since the model was already trained on that exact row, its presence in the test set inflates performance in a way that does not reflect real generalization. It is therefore a good idea to check whether the dataset contains any duplicates and drop them before splitting.
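
A minimal sketch with pandas (the tiny DataFrame here is hypothetical) might look like this:

```python
import pandas as pd

# Tiny hypothetical dataset with one exact duplicate row
df = pd.DataFrame({
    "feature": [1.0, 2.0, 2.0, 3.0],
    "target":  [0, 1, 1, 0],
})

print(f"Duplicate rows found: {df.duplicated().sum()}")

# Drop duplicates before the train/test split, so an identical row
# cannot land in both sets
df = df.drop_duplicates().reset_index(drop=True)
```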

Performing temporal splitting

In time series forecasting, methods such as random splitting randomly permute the rows before dividing the data into training and test sets. This should not be done, because of the temporal dependency of the target variable on both the present input and previous time steps: if we split the dataset randomly, future information is already within the model’s reach when it makes its predictions, so the metric under consideration will look very good for the wrong reason. Care must therefore be taken to split the data temporally rather than randomly, as in the sketch below.
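
A minimal sketch, assuming the rows are already sorted oldest-first (the series here is synthetic), is to cut the data at a time index instead of shuffling; scikit-learn’s TimeSeriesSplit does the analogous thing for cross validation:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Synthetic daily series for illustration: 365 observations, oldest first
rng = np.random.default_rng(0)
X = np.arange(365).reshape(-1, 1)
y = rng.random(365)

# Split on time, not at random: everything before the cutoff trains the model,
# everything after it tests the model, so no future information leaks backwards
cutoff = int(len(X) * 0.8)
X_train, X_test = X[:cutoff], X[cutoff:]
y_train, y_test = y[:cutoff], y[cutoff:]

# For cross validation, TimeSeriesSplit keeps every validation fold
# strictly after its corresponding training fold
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X_train):
    X_fold_train, X_fold_val = X_train[train_idx], X_train[val_idx]
```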

Using validation dataset

Another approach is to split your training dataset into train and validation sets and store the validation set away. When you have completed your model creation, evaluate the final model on the validation set.
This gives you a sanity check: if performance on the held-back data is noticeably worse, your earlier estimate was overly optimistic and leakage is a likely culprit.
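
One way to set this up, sketched below with hypothetical random data and scikit-learn’s train_test_split, is to carve off the hold-out set first and never touch it during model development:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for a real dataset
rng = np.random.default_rng(42)
X = rng.random((1000, 10))
y = rng.integers(0, 2, size=1000)

# Carve off the final validation (hold-out) set first and lock it away
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Develop, tune, and compare models using only the remaining data
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# Only when model creation is finished: evaluate once on X_val / y_val.
# A large gap between the test score and the hold-out score hints at leakage.
```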


Summary
We have seen how the presence of leaky data can lead us to believe that our models perform well on the test set when this is far from true, and we have looked at ways to reduce data leakage. Thanks for taking the time to read this post.

