## You’re Doing It Wrong! Learn the Right Way to Validate Models

Part 4: *Accidental Contamination*

Be sure to check out Part 1, Part 2 & Part 3 of this series.

In the last blog post, you learned to calculate the predictive accuracy of a model by applying it to data points it has not been trained on and comparing its predictions with the known true results for this test set. If you repeat this multiple times for non-overlapping test sets, as cross-validation does, you end up with a reliable estimate of how well your model will perform on new cases. This approach is generally a good basis for model selection, i.e., for answering the question of which type of model (think: “random forest or linear regression?”) is going to perform best on your data in the future.

So all is good then, right? **Wrong!** *The problem is that it is still very easy to leak information about the testing data into the training data if you perform a cross-validation in the wrong way.*

We call this phenomenon **contamination** of the training data. Contamination gives the machine learning method access to information it should not have during training. With contamination, the model appears to perform better than it would in situations where this information is not available. This is exactly why

*accidental contamination leads to an over-optimistic estimation about how well the model will perform in the future*.

This effect can be as drastic as the example of the k-Nearest Neighbor classifier in Part 2, where the training error (all information is available in this case) was 0% while the testing error was 50% and hence no better than randomly guessing the class of the data points. As we will see below, the same effect happens if you leak information about the test data into the training phase. The test error then effectively becomes a training error and is lower than you would expect without the contamination. Therefore, all the effort you put into using a cross-validation to avoid leakage of information is pointless if you are doing it wrong.

And unfortunately, most data science platforms do not even support the correct way of performing a cross-validation! Shocking, isn’t it? This is the main reason why so many data scientists make this mistake.

Let’s look at a few typical situations where accidental contamination of training data by leaking information about the test data can happen. This is by no means a complete list of examples, but it should be enough to give you a general idea of the problem and what to look for if you want to avoid it.

**Example 1: Contamination through Normalization**

Let’s start with a very common situation: you would like to normalize data so that all columns have a similar range and no column overshadows others. This is particularly important before using any similarity-based models, e.g., k-Nearest Neighbors.

This seems to be a simple task: just normalize the data, then train the model and validate it with a cross-validation. This is what this would look like in RapidMiner:

Figure 1: This RapidMiner Studio process simply performs a normalization (z-transformation) on the data before the model is validated.

Looks good, you say? **Wrong!** The approach shown will lead to a wrong error estimation. If you normalize the data *before* the cross-validation, you actually leak information about the distribution of the test data into the way you normalize the training data. Although you are not using the test data for training the model, you nevertheless instill some information about it into the training data. This is exactly why we call this effect *contamination*.

What you should do instead is perform the normalization *inside* of the cross-validation and *only* on the training data. You then take the information you gathered about the training data and use it for the transformation of the test data. This is where the RapidMiner Studio visual interface comes in handy since you can easily see the correct setup:

Figure 2: This is how you properly evaluate the impact of the normalization on the model accuracy. The parameters derived from the normalization on the training data are delivered into the test phase and used *there* instead of the other way around.

Please note that we are feeding data directly into the cross-validation operator and performing all necessary steps *inside*. The first thing you do in the training phase on the left is to normalize *only* the training data. The transformed data is then delivered to the machine learning method (Logistic Regression in this example). Both the predictive model and the transformation parameters from the normalization are then delivered into the testing phase on the right. Here we first transform the test data based on the distributions found in the training data *before* we apply the predictive model on the transformed test data. Finally, we calculate the error as usual.
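For readers who prefer code to process diagrams, the same idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the RapidMiner process itself: the helper names (`fit_z_transform`, `apply_z_transform`, `cross_validated_error`), the tiny 1-nearest-neighbor classifier, and the synthetic data are all made up for this example. The point to look for is that the normalization parameters are learned from the training fold only and then reused on the test fold.

```python
import random
import statistics


def make_folds(n_rows, k, seed=0):
    """Shuffle row indices and split them into k non-overlapping folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_z_transform(rows):
    """Learn per-column mean and standard deviation from the given rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Constant columns have a deviation of 0; use 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_z_transform(rows, params):
    """Transform rows with previously learned parameters."""
    means, stds = params
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def predict_1nn(train_x, train_y, row):
    """Tiny 1-nearest-neighbor classifier using squared Euclidean distance."""
    distances = [sum((a - b) ** 2 for a, b in zip(t, row)) for t in train_x]
    return train_y[distances.index(min(distances))]


def cross_validated_error(x, y, k=5, seed=0):
    """Cross-validation with the normalization fitted inside each training fold."""
    errors = []
    for test_idx in make_folds(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        test_x = [x[i] for i in test_idx]
        test_y = [y[i] for i in test_idx]

        params = fit_z_transform(train_x)            # training data only
        train_x = apply_z_transform(train_x, params)
        test_x = apply_z_transform(test_x, params)   # reuse training parameters

        wrong = sum(predict_1nn(train_x, train_y, row) != label
                    for row, label in zip(test_x, test_y))
        errors.append(wrong / len(test_y))
    return statistics.fmean(errors)


# Synthetic two-class data (an assumption for illustration): one informative
# column on a small scale and one noisy column on a much larger scale.
rng = random.Random(42)
y = [label for label in (0, 1) for _ in range(30)]
x = [[rng.gauss(label, 1.0), rng.gauss(0.0, 100.0)] for label in y]

error = cross_validated_error(x, y)
print(f"cross-validated error with nested normalization: {error:.3f}")
```

The contaminated variant would call `fit_z_transform(x)` once on all rows before the loop. The code would look almost identical, which is part of why this mistake is so easy to make.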

This table summarizes the cross-validated test errors, once with the normalization performed *before* the cross-validation like in Figure 1 and the second with the normalization *inside* of the cross-validation like in Figure 2. It clearly shows the effect on the error if you validate the model without measuring the impact of the data pre-processing itself:

Table 1: The cross-validated test errors with a normalization before the cross-validation (“Wrong”) or inside of it (“Nested”). You can clearly see that the contamination of training data in the “Wrong” cases leads to overly optimistic error estimations.

If the normalization is done before the cross-validation, the calculated error is too optimistic in all cases and the data scientist would be in for a negative surprise when going into production. On average, the actual error is 3.9% higher for Logistic Regression and 3.4% higher for k-Nearest Neighbors with k=5. The difference is 0% for Random Forest simply because the models do not change at all whether or not the data is normalized. In general, you can see that the effect is not as drastic as just using the training error, but there are still differences of more than 8% in some of the cases, higher than most people would expect. This is caused by only validating the model itself and not the impact of the data pre-processing.

A final thing which is important to notice: the model will actually perform roughly the same way in production, no matter if you perform the normalization outside or inside the cross-validation. When building a production model, you would typically use the complete dataset anyway and hence also apply the normalization on the complete data. But the performance of the model will be the *lower* one, pretty much in the range of the one shown in the column “Nested” in the table above. So the correct validation is not helping you create *better models*, instead it is *telling you the truth about how well (or poorly) the model will work in production* without letting you run into a catastrophic failure later on.

**Example 2: Contamination through Parameter Optimization**

Next, let’s look at the impact of optimizing parameters of the model, e.g., the number of trees in the ensemble in the case of Random Forest, or the parameter k determining how many neighbors are used for finding the prediction in the case of k-Nearest Neighbors. Data scientists often search for optimal parameters to improve a model’s performance. They can do this manually, by changing parameter values and measuring the change of the test error by cross-validating the model with the specified parameters, or they can use automatic approaches like grid searches or evolutionary algorithms. The goal is always to find the best performing model for the data set at hand.

As above in the case of normalization, using the cross-validated test errors found during this optimization sounds like a good plan to most data scientists – *but unfortunately, it’s not*. This should become immediately clear if you think of a (automatic) parameter optimization method just as another machine learning method which tries to adapt a prediction function to your data. If you only validate errors inside of this “bigger” machine learning method and deliver those errors to the outside, you effectively turn the inner test error into a kind of overly optimistic training error.

In the case of parameter optimization, this is a well-known phenomenon in machine learning. If you select parameters so that the error on a given test set is minimized, you are optimizing this parameter setting for this particular test set only, so this error is no longer representative of your model. Many data scientists suggest using a so-called *validation set* in addition to a test set. You can first use your test set for optimizing your parameter settings, and then use your validation set to validate the error of the parameter optimization *plus* the model on this set.

The problem with this validation set approach is that it comes with the same kind of single-test-set-problems we discussed before. You won’t get all the advantages of a proper cross-validation if you use validation sets only.

But you can do this correctly by *nesting* multiple levels of cross-validation into each other. The inner cross-validation is used to determine the error for a specific parameter setting, and can be used *inside* of an automatic parameter optimization method. This parameter optimization becomes your new machine learning algorithm. Now you can simply wrap another, outer cross-validation around this automatic parameter optimization method to properly calculate the error which can be expected in production scenarios.

This sounds complex in writing but again a visual representation can help us to understand this correct setup. The two figures below show the outer cross validation with the parameter optimization inside:

Figure 3: Properly validating a parameter optimization requires two cross-validations, an inner one to guide the search for optimal parameters and an outer one (shown here) to validate those parameters on an independent validation set.

Figure 4: Inside of the outer cross validation. In the training phase on the left is an automated search for an optimal parameter set which uses another, inner cross-validation (not shown). Then the optimal parameters are used for evaluating the final model.
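The nesting shown in the figures can also be sketched in plain Python. This is only an illustration under assumptions that are not part of the original processes: a tiny hand-written k-NN, a hypothetical `grid_search` helper over candidate values of k, and synthetic data. What matters is the structure: `grid_search` runs its own inner cross-validation on the outer training fold only, and the outer loop then scores the chosen parameter on data the search never saw.

```python
import random
import statistics
from collections import Counter


def make_folds(indices, k, seed=0):
    """Shuffle the given indices and split them into k non-overlapping folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def knn_predict(train, row, k):
    """Majority vote among the k nearest (features, label) training examples."""
    nearest = sorted(
        train, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], row))
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_error(data, k_neighbors, folds=5, seed=0):
    """Plain cross-validated error of k-NN with a fixed k on the given data."""
    errors = []
    for test_idx in make_folds(range(len(data)), folds, seed):
        held_out = set(test_idx)
        train = [ex for i, ex in enumerate(data) if i not in held_out]
        wrong = sum(knn_predict(train, data[i][0], k_neighbors) != data[i][1]
                    for i in test_idx)
        errors.append(wrong / len(test_idx))
    return statistics.fmean(errors)


def grid_search(train, candidates):
    """The "bigger" learner: pick k with an inner cross-validation on train only."""
    best_error, best_k = min((cv_error(train, k), k) for k in candidates)
    return best_k, best_error


def nested_cv_error(data, candidates, outer_folds=5, seed=1):
    """Outer cross-validation wrapped around the whole parameter search."""
    errors = []
    for test_idx in make_folds(range(len(data)), outer_folds, seed):
        held_out = set(test_idx)
        train = [ex for i, ex in enumerate(data) if i not in held_out]
        best_k, _ = grid_search(train, candidates)  # never sees the outer test fold
        wrong = sum(knn_predict(train, data[i][0], best_k) != data[i][1]
                    for i in test_idx)
        errors.append(wrong / len(test_idx))
    return statistics.fmean(errors)


# Synthetic two-class data, assumed purely for illustration.
rng = random.Random(7)
data = [([rng.gauss(label, 1.0), rng.gauss(label, 1.0)], label)
        for label in (0, 1) for _ in range(30)]
candidates = [1, 3, 5, 7]

# Naive estimate: the error reported by the search that picked the parameter.
best_k, naive_error = grid_search(data, candidates)
# Nested estimate: the search itself is validated by the outer loop.
nested_error = nested_cv_error(data, candidates)
print(f"best k={best_k}, naive={naive_error:.3f}, nested={nested_error:.3f}")
```

The naive number comes from the very validation that guided the search, while the nested number treats the search as part of the model and validates both together.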

Since the need for an independent validation set is somewhat known in machine learning, many people believe that the impact of correct validation on the errors must be quite high. But it turns out that while there definitely is an impact on the test errors, it is smaller than the impact we saw in the normalization case above. The following table shows the results:

Table 2: We used an automatic grid search for parameter optimization. The column “Wrong” shows the naive approach of showing the cross-validated error which was delivered from the validation guiding the search. “Nested” shows the correct estimations.

As expected, the properly validated errors using the “Nested” approach are higher than the errors taken directly out of the validation used for guiding the search (the “Wrong” column). The differences are rather small, with roughly 0.5% on average and a maximum deviation of 1%. Still, even a difference of 1% between your expectation and how well the model performs in production can be a huge and costly surprise.

**Example 3: Contamination through Feature Selection**

As a last example, let’s look at another pre-processing step which is frequently performed for optimizing the accuracy of machine learning models, namely feature selection. The goal is to select an optimal subset of features or data columns used by the model. In theory, many models can make this selection themselves, but in reality noisy columns can throw models off and it’s better to keep only those columns which provide meaningful information.

Similar to the parameter optimization, data scientists could manually select a subset of features and evaluate the model’s accuracy by calculating a cross-validated test error using only the data of the reduced feature set. Since the number of combinations grows exponentially with the number of columns in the input data, this is often not a feasible approach, which is why automated optimization techniques are widely used. But just as in the case of parameter optimization, picking the optimal feature set is basically an extension of training the predictive model, and it needs to be properly validated as well.

Of course you could use an extra hold-out set for evaluating any built model at the end. But as before, you end up with the disadvantage of only using a single validation set. The better alternative is to nest the automatic feature selection (using an inner cross-validation to guide the search for the optimal set of features) into an outer cross-validation again, just as we did in case of the parameter optimization.

The correct setup starts with an outer cross-validation just as described above where we used an outer cross-validation to evaluate the performance of the parameter optimization. The inner setup of this cross-validation looks slightly different though, as follows:

Figure 5: An automatic evolutionary feature selection is used on the training data of the outer cross validation. The optimal feature set is selected before the model is trained and delivered to the test phase where it is applied again before prediction.
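The difference between the two setups can be sketched in plain Python as well. Note the assumptions: instead of the evolutionary feature selection used in the process above, this sketch swaps in a very simple, hypothetical filter (`select_features` keeps the columns whose class means differ the most), the classifier is a tiny 1-nearest-neighbor, and the data is pure random noise generated for illustration. The flag `select_inside` switches between selecting features on all rows up front and selecting them on the training fold only.

```python
import random
import statistics


def make_folds(n_rows, k, seed=0):
    """Shuffle row indices and split them into k non-overlapping folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def select_features(rows, labels, n_keep):
    """Hypothetical filter: keep the columns whose class means differ the most."""
    scores = []
    for col in range(len(rows[0])):
        class0 = [r[col] for r, lab in zip(rows, labels) if lab == 0]
        class1 = [r[col] for r, lab in zip(rows, labels) if lab == 1]
        gap = abs(statistics.fmean(class0) - statistics.fmean(class1))
        scores.append((gap, col))
    return sorted(col for _, col in sorted(scores, reverse=True)[:n_keep])


def project(rows, columns):
    """Reduce every row to the selected columns."""
    return [[row[c] for c in columns] for row in rows]


def predict_1nn(train_x, train_y, row):
    """Tiny 1-nearest-neighbor classifier using squared Euclidean distance."""
    distances = [sum((a - b) ** 2 for a, b in zip(t, row)) for t in train_x]
    return train_y[distances.index(min(distances))]


def cv_error(x, y, n_keep, select_inside, k=5, seed=0):
    """Cross-validated 1-NN error; select_inside decides where selection is fitted."""
    all_rows_columns = select_features(x, y, n_keep)  # contaminated: sees every row
    errors = []
    for test_idx in make_folds(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        if select_inside:
            columns = select_features(train_x, train_y, n_keep)  # training fold only
        else:
            columns = all_rows_columns
        train_sel = project(train_x, columns)
        test_sel = project([x[i] for i in test_idx], columns)  # same columns reused
        wrong = sum(predict_1nn(train_sel, train_y, row) != y[i]
                    for row, i in zip(test_sel, test_idx))
        errors.append(wrong / len(test_idx))
    return statistics.fmean(errors)


# Pure noise (an assumption for illustration): no column carries real signal.
rng = random.Random(3)
y = [i % 2 for i in range(40)]
x = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in y]

wrong_error = cv_error(x, y, n_keep=5, select_inside=False)
nested_error = cv_error(x, y, n_keep=5, select_inside=True)
print(f"selection outside: {wrong_error:.3f}, selection inside: {nested_error:.3f}")
```

On noise like this, no honest estimate should be much better than guessing, so an error that looks clearly better than 50% in the first setup would be a sign of contamination rather than of a useful model.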

Here are the results. The values in the “Wrong” column come from the wrong version of the cross-validation, where you don’t properly validate the effect of the feature selection and only use the performance value of the optimal feature set as found during the optimization runs. The values in the “Nested” column are determined with the correct approach using an outer cross-validation as depicted above:

Table 3: We used an automatic evolutionary feature selection. The column “Wrong” shows the naive approach of showing the cross-validated error which was delivered from the validation guiding the search while “Nested” delivers the proper estimations.

Here you see that the amount by which your expectations would be off is on average up to 4.2%. The largest deviation was as high as 11% (5-NN on Sonar), which is even higher than the maximum of 8% we saw in the normalization case above. Think about this: your error in production would be double (22% vs. 11%) what you expected, just because you did not use the right kind of validation, one which takes into account the effect of the data pre-processing. You can very easily avoid this type of negative surprise!

It is worth pointing out that the calculated average and maximum deviations greatly depend on the data sets and model types you use. We have seen above that the effect of the wrong validation for normalization was zero for Random Forest since this model type does not care about the different scales of the data columns. On the other hand, the effect was larger for feature selection using a Random Forest. And this is the problem: *you never know in advance how large the effect will really be.* **The only way to be prepared for what is coming in the future is to properly validate the effects of the model itself as well as all other optimizations with a correct form of modular, nested cross-validations.**

**Key Takeaways**

- You can *contaminate* your training data set by applying data transformations before the cross-validation which leads to information leakage about the test data into the complete data set.
- Think about pre-processing, parameter optimizations, or other (automatic) optimization schemes just as *another form of machine learning* trying to find the best predictive model.
- This *“bigger” machine learning needs to be evaluated* just like the actual machine learning method itself.
- Most data science products do *not* allow you to perform the model validation in a correct way, i.e. taking into account the effect of pre-processing or model optimizations.
- We have seen three examples showing *how large the effect on model validation* can be when done incorrectly.
  - For an un-validated *normalization*, the effect is on average up to 3.9% with a maximum of 8%.
  - For an un-validated *parameter optimization*, the effect is on average up to 0.6% with a maximum of 1%.
  - For an un-validated *feature selection*, the effect is on average up to 4.2% with a maximum of 11%.
- It is important to understand that these effects can be *even higher* for different data sets or different forms of un-validated pre-processing.
- The only way to avoid surprises when going into production is to *properly validate before*.

Next week: In the next post we’ll discuss the Consequences of Accidental Contamination. See you then!

*Want to follow along with all of the examples in this Series? Load the below zip file of data and processes into your RapidMiner repository. If you need directions on how to add files to your repository, the following RapidMiner Community post will walk you through the process: **How to Share RapidMiner Repositories**.*

Data & processes.zip for “Learn the RIGHT Way to Validate Models” blog post series

Hi!

This post was extremely informative, thank you so much! And because of it I am actually now implementing a few new practices in my own work, mainly to do with preventing contamination!

I often use PCA before the classifier, and I am aware that we should make sure that the coefficients used to transform the training data are then used to transform the test data. In terms of parameter optimisation, I was already using k-fold cross-validation. However, the classifier performance I was reporting was indeed the one resulting from this optimisation phase, which I now understand to be incorrect.

After reading your blog post, I am left with a few questions. I would be very grateful if you could shed some light on them!

1) PCA is always performed before the classification. When doing k-fold cross-validation for the purpose of optimising the classifier, should we re-calculate the matrix of PCA coefficients for each one of the k-folds?

2) When doing nested cross-validation, is the entire dataset used both for the inner loop and the outer loop?

As far as I understand, for the *inner loop*, the entire dataset is divided into k-folds.

A classification model is calculated for a certain combination of classifier parameters. The k-folds are run, and the final misclassification rate (for example) is calculated.

This is repeated for X number of combinations of classifier parameters, and the best are chosen based on which ones returned the lowest misclassification rate.

Moving on to the *outer loop*. The entire dataset is divided into k-folds (different to the ones used in the inner loop). The best model (with parameters found in the inner loop), is then tested and trained with the k-folds, and a final performance is calculated.

The performance of our model will be the performance of the outer loop.

3) What about the use of PCA in the outer loop?

Which coefficients should I use to transform the data in the outer loop? I assume these coefficients should not derive from the inner loop?

Should I instead repeat the procedure of calculating PCA coefficients for each one of the k-folds here in the outer loop?

Thank you so much, and I would appreciate any help on these questions!

Hi Barbara,

Glad to hear that the post had an impact!

Let me try to answer your questions.

1) … should we re-calculate the matrix of PCA coefficients for each one of the k-folds?

Yes. In RapidMiner, you can re-calculate the PCA for each training fold and deliver the so-called “preprocessing model” (output port named “pre” of the PCA operator) via the “through” port to the test subprocess. There you will then have two “Apply Model” operators, one for applying the same PCA to the test data and another one for applying the prediction model on the transformed data set. I will try to add the XML of a simple example process at the end of this comment.

2) When doing nested cross-validation, is the entire dataset used both for the inner loop and the outer loop?

No, for the inner loop only (k-1)/k * number of examples are used (e.g. 90% for an outer 10-fold cross-validation). The outer loop divides the data as usual into training and testing parts. Then only the training parts are delivered to the inner loop, which then again makes the division based on that subset of data.

3) What about the use of PCA in the outer loop? Which coefficients should I use to transform the data in the outer loop? I assume these coefficients should not derive from the inner loop?

You mean if you have both the PCA and let’s say some parameter optimization? I would do the PCA ONLY in the inner loop right before the learner (and use the preprocessing model in the test subprocess of the inner loop). The outer loop only “focuses” on the impact of the parameter optimization then.

Hope this helps,

Ingo

Looks like posting the XML here does not work. Let me know if you need it and I can post it on the community portal…