24 January 2017


Avoiding Accidental Contamination of Data [3 Examples]

You’re doing it wrong! It’s time to learn the right way to validate models. Something important to be aware of? Accidental contamination.

In the past, we’ve talked about the importance of calculating the predictive accuracy of a model by applying it to data points it has not been trained on and comparing the model’s predictions with the known true results for this test set.

If you repeat this multiple times on non-overlapping test sets, as you do in a cross validation, you end up with a reliable estimate of how well your model will perform on new cases.
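The examples in this post are built in RapidMiner Studio, but the idea is tool-agnostic. As a rough illustration only, here is a minimal sketch of a 10-fold cross validation in scikit-learn; the dataset and model are placeholders, not the exact setup used later in this post:

```python
# Minimal sketch of a 10-fold cross validation with scikit-learn.
# Dataset and model are illustrative placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each of the 10 folds is used once as the test set while the model is
# trained on the remaining 9 folds; the fold scores are then averaged.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print("Estimated accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```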

In general, this approach is a good basis for model selection, i.e., for answering the question of which type of model (think: “random forest or linear regression?”) will perform best on your data in the future.

So all is good then, right? Wrong! The problem is that it is still very easy to leak information about the testing data into the training data if you perform a cross validation in the wrong way.

We call this phenomenon contamination of the training data. Contamination gives the machine learning method access to information it should not have during training.

With contamination, the model will perform better than it would if this information were not available. This is exactly why accidental contamination leads to an over-optimistic estimate of how well the model will perform in the future.

This effect can be as drastic as in the example of the k-Nearest Neighbors classifier from our previous post, where the training error (all information is available in this case) was 0% while the test error was 50%, i.e., no better than randomly guessing the class of the data points.

As we will see below, the same effect occurs if you leak information about the test data into the training phase. The test error then effectively becomes a training error and is lower than it would be without the contamination. Therefore, all the effort you put into a cross validation to avoid leakage of information is pointless if you do it wrong.

And unfortunately, most data science platforms do not even support the correct way of performing a cross validation! Shocking, isn’t it? This is the main reason why so many data scientists make this mistake.

3 examples of contamination

Here are a few typical situations in which information about the test data can leak into the training data and accidentally contaminate it.

This is by no means a complete list of examples, but it should be enough to give you the general idea about the problem and what to look for if you want to avoid this.

1. Contamination through normalization

Let’s start with a very common situation: You would like to normalize data so that all columns have a similar range and no column overshadows others.

This is particularly important before using any similarity-based models, e.g., k-Nearest Neighbors.

This seems to be a simple task: just normalize the data, then train the model and validate it with a cross validation. This is how it would look in RapidMiner:

Figure 1: This RapidMiner Studio process simply performs a normalization (z-transformation) on the data before the model is validated.

Looks good, you say? Wrong! The approach shown above leads to a wrong error estimation.

If you normalize the data before the cross validation, you actually leak information about the distribution of the test data into the way you normalize the training data.

Although you are not using the test data for training the model, you nevertheless instill some information about it into the training data. This is exactly why we call this effect contamination.

What you should do instead is perform the normalization inside of the cross validation and only on the training data. You then take the information you gathered about the training data and use it for the transformation of the test data. This is where the RapidMiner Studio visual interface comes in handy since you can easily see the correct setup:

Figure 2: This is how you properly evaluate the impact of the normalization on the model accuracy. The parameters derived from the normalization on the training data are delivered into the test phase and used there instead of the other way around.

Please note that we are feeding data directly into the cross validation operator and performing all necessary steps inside. The first thing you do in the training phase on the left is to normalize only the training data.

The transformed data is then delivered to the machine learning method (Logistic Regression in this example). Both the predictive model and the transformation parameters from the normalization are then delivered into the testing phase on the right.

Here we first transform the test data based on the distributions found in the training data before we apply the predictive model on the transformed test data. Finally, we calculate the error as usual.
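Figures 1 and 2 show this setup visually in RapidMiner Studio. If you prefer to think in code, the same contrast can be sketched with scikit-learn, where a pipeline plays the role of the nested process; the dataset and model here are placeholders, not the ones used for the table below:

```python
# Sketch of the wrong vs. the correct way to validate a normalization,
# as a rough scikit-learn analogue of the RapidMiner processes above.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Wrong: the z-transformation sees the full data set, so statistics of the
# test folds leak into the training folds (contamination).
X_scaled = StandardScaler().fit_transform(X)
wrong = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=10)

# Correct: the scaler is fitted on each training fold only, and its
# parameters are then applied to the corresponding test fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
nested = cross_val_score(pipe, X, y, cv=10)

print("Wrong  (contaminated): %.3f" % wrong.mean())
print("Nested (correct):      %.3f" % nested.mean())
```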

This table summarizes the cross validated test errors, once with the normalization performed before the cross validation as in Figure 1, and once with the normalization inside the cross validation as in Figure 2.

It clearly shows the effect on the error if you validate the model without measuring the impact of the data pre-processing itself:

Table 1: The cross validated test errors with the normalization before the cross validation (“Wrong”) or inside it (“Nested”). You can clearly see that the contamination of training data in the “Wrong” cases leads to overly optimistic error estimations.

If the normalization is done before the cross validation, the calculated error is too optimistic in all cases and the data scientist would run into a negative surprise when going into production.

On average, the actual error is 3.9% higher for Logistic Regression and 3.4% higher for k-Nearest Neighbors with k=5. The difference is 0% for Random Forest simply because the model does not change at all whether the data is normalized or not.

In general, you can see that the effect is not as drastic as simply using the training error, but there are still differences of more than 8% in some cases, higher than most people would expect. This is caused by only validating the model itself and not the impact of the data pre-processing.

One final important point: the model will actually perform roughly the same way in production whether you perform the normalization outside or inside the cross validation.

When building a production model, you would typically use the complete dataset anyway and hence also apply the normalization on the complete data.

But the performance of the model will be the lower one, roughly in the range shown in the “Nested” column of the table above.

So the correct validation does not help you create better models; instead, it tells you the truth about how well (or poorly) the model will work in production, without letting you run into a catastrophic failure later on.

2. Contamination through parameter optimization

Next, let’s look at the impact of optimizing the parameters of a model, e.g., in the case of Random Forest the number of trees in the ensemble, or in the case of k-Nearest Neighbors the parameter k determining how many neighbors are used for the prediction.

Data scientists often search for optimal parameters to improve a model’s performance. They can do this manually, by changing parameter values and measuring the change in the test error with a cross validation of the model using those parameters.

Or they can use automatic approaches like grid searches or evolutionary algorithms. The goal is always to find the best-performing model for the data set at hand.

As above in the case of normalization, using the cross validated test errors found during this optimization sounds like a good plan to most data scientists – but unfortunately, it’s not.

This should become immediately clear if you think of an (automatic) parameter optimization method as just another machine learning method which tries to adapt a prediction function to your data.

If you only validate errors inside of this “bigger” machine learning method and deliver those errors to the outside, you effectively turn the inner test error into a kind of overly optimistic training error.

In the case of parameter optimization, this is a well-known phenomenon in machine learning. If you select parameters so that the error on a given test set is minimized, you are optimizing this parameter setting for this particular test set only – hence this error is no longer representative of your model.

Many data scientists suggest using a so-called validation set in addition to a test set. You can first use your test set for optimizing your parameter settings, and then use your validation set to validate the error of the parameter optimization plus the model.

The problem with this validation-set approach is that it comes with the same kind of single-test-set problems we discussed before. You won’t get all the advantages of a proper cross validation if you use validation sets only.

But you can do this correctly by nesting multiple levels of cross validation into each other. The inner cross validation is used to determine the error for a specific parameter setting and can be used inside an automatic parameter optimization method.

This parameter optimization becomes your new machine learning algorithm. Now you can simply wrap another, outer cross validation around this automatic parameter optimization method to properly calculate the error which can be expected in production scenarios.

This sounds complex in writing, but again a visual representation helps us understand the correct setup. The two figures below show the outer cross validation with the parameter optimization inside:

Figure 3: Properly validating a parameter optimization requires two cross validations, an inner one to guide the search for optimal parameters and an outer one (shown here) to validate those parameters on an independent validation set.

Figure 4: Inside of the outer cross validation. In the training phase on the left is an automated search for an optimal parameter set which uses another, inner cross validation (not shown). Then the optimal parameters are used for evaluating the final model.
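The same nesting can be sketched in code: the grid search, driven by an inner cross validation, becomes the “learner”, and an outer cross validation estimates the error of the whole procedure. This is only an illustrative scikit-learn analogue of the RapidMiner setup, with a placeholder dataset and parameter grid:

```python
# Sketch of a nested cross validation for parameter optimization.
# The inner cross validation (inside GridSearchCV) guides the search;
# the outer cross validation estimates the error of the whole procedure.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}  # illustrative grid only

# Inner cross validation: picks the best k on the training data it is given.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)

# "Wrong": reporting the best inner score as the expected production error.
search.fit(X, y)
print("Wrong (inner score): %.3f" % search.best_score_)

# "Nested": the outer cross validation treats the whole search as the learner.
nested = cross_val_score(search, X, y, cv=10)
print("Nested (correct):    %.3f" % nested.mean())
```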

Since the need for an independent validation set is fairly well known in machine learning, many people believe that the impact of correct validation on the errors must be quite high.

But it turns out that while there definitely is an impact on the test errors, it is smaller than the one we saw in the normalization case above. The following table shows the results:

Table 2: We used an automatic grid search for parameter optimization. The column “Wrong” shows the naive approach of showing the cross validated error which was delivered from the validation guiding the search. “Nested” shows the correct estimations.

As expected, the properly validated errors using the “Nested” approach are higher than the errors taken directly out of the validation used for guiding the search (the “Wrong” column).

But the differences are rather small, roughly 0.5% on average with a maximum deviation of 1%. Still, even a difference of 1% between your expectation and how well the model performs in production can be a huge and costly surprise.

3. Contamination through feature selection

As a last example, let’s look at another pre-processing step which is frequently performed for optimizing the accuracy of machine learning models, namely feature selection.

The goal is to select an optimal subset of features or data columns used by the model. In theory, many models can make this selection themselves, but in reality noisy columns can throw models off and it’s better to keep only those columns which provide meaningful information.

Similar to the parameter optimization, data scientists could manually select a subset of features and evaluate the model’s accuracy by calculating a cross validated test error using only the data of the reduced feature set.

Since the number of possible combinations grows exponentially with the number of columns in the input data (with just 30 columns there are already more than a billion possible subsets), this is often not a feasible approach, which is why automated optimization techniques are widely used.

But just as in the case of parameter optimization, picking the optimal feature set is basically an extension of training the predictive model, and it needs to be properly validated as well.

Of course, you could use an extra hold-out set for evaluating any built model at the end. But as before, you end up with the disadvantage of only using a single validation set.

The better alternative is to nest the automatic feature selection (using an inner cross validation to guide the search for the optimal set of features) into an outer cross validation again, just as we did in case of the parameter optimization.

The correct setup starts with an outer cross validation just as described above where we used an outer cross validation to evaluate the performance of the parameter optimization. The inner setup of this cross validation looks slightly different though, as follows:

Figure 5: An automatic evolutionary feature selection is used on the training data of the outer cross validation. The optimal feature set is selected before the model is trained and delivered to the test phase where it is applied again before prediction.
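In code, the structure is the same as for the parameter optimization: the feature selection runs on the training data of each outer fold only. RapidMiner uses an evolutionary search here; scikit-learn has no direct equivalent, so this sketch substitutes a greedy forward selection purely for illustration, again on a placeholder dataset:

```python
# Sketch of validating an automatic feature selection with a nested setup.
# A greedy forward selection (stand-in for the evolutionary search) is put
# inside a pipeline, so the selection is redone on the training data of
# every outer fold and then applied to the corresponding test fold.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5)

# The selector uses an inner cross validation to guide its search.
selector = SequentialFeatureSelector(knn, n_features_to_select=10, cv=5)
pipe = make_pipeline(selector, knn)

# The outer cross validation validates selection and model together.
nested = cross_val_score(pipe, X, y, cv=10)
print("Nested (correct) accuracy: %.3f" % nested.mean())
```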

And here are the results for the wrong version of the cross validation, where you don’t properly validate the effect of the feature selection and only use the performance value of the optimal feature set as found during the optimization runs. The values in the “Nested” column are determined with the correct approach using an outer cross validation as depicted above:

Table 3: We used an automatic evolutionary feature selection. The column “Wrong” shows the naive approach of showing the cross validated error which was delivered from the validation guiding the search while “Nested” delivers the proper estimations.

Here you see that the amount by which your expectations would be off is up to 4.2% on average. The largest deviation was as high as 11% (5-NN on Sonar), which is even higher than the maximum of 8% we saw in the normalization case above.

Think about this: your error in production would be double what you expected (22% vs. 11%), just because you did not use the right kind of validation, one which takes the effect of the data pre-processing into account. You can very easily avoid this type of negative surprise!

It is worth pointing out that the calculated average and maximum deviations greatly depend on the data sets and model types you use. We have seen above that the effect of the wrong validation for normalization was zero for Random Forest since this model type does not care about the different scales of the data columns.

On the other hand, the effect was larger for feature selection using a Random Forest. And this is the problem: you never know in advance how large the effect will really be.

Key takeaways

Remember, the only way to be prepared for what is coming in the future is to properly validate the effects of the model itself, as well as of all other optimizations, with a correct form of modular, nested cross validations.

Here are some key takeaways that I hope you consider:

- Contamination leaks information about the test data into the training data and always leads to overly optimistic error estimates.
- Data pre-processing such as normalization belongs inside the cross validation: derive its parameters from the training data only and apply them to the test data.
- Parameter optimization and feature selection are effectively part of model building and must be validated with an additional, outer cross validation.
- A correct validation does not make your models better; it tells you the truth about how they will perform in production.

Want to follow along with other examples? Load this file of data and processes into your RapidMiner repository: Data & processes.zip for “Learn the Right Way to Validate Models”. If you need direction on how to add files to your repository, this post will help: How to Share RapidMiner Repositories.

Download RapidMiner Studio, which offers all of the capabilities to support the full data science lifecycle for the enterprise.
