You’re Doing It Wrong! Learn the Right Way to Validate Models
Part 3: Cross-Validation as the Gold Standard.
Now that we’ve established that using the training error is a terrible idea, let’s dive deeper into different ways to calculate test errors.
Validate Using Holdout Datasets
The first thing to notice is that it is often very difficult or expensive to get more data where you have values for y. For example:
- If you create a model to predict which customers are more likely to churn, then you build the model on data about the people who have churned in the past. Generating more churn data is exactly what you try to prevent so this is not a good idea.
- In a predictive maintenance case, you want to predict if and when a machine needs maintenance or will fail. Nobody in their right mind will purposely break more machines just to create more training data!
So training data is expensive and generating more of it for testing is difficult.
It is a good practice in such cases to use a part of the available data for training and a different part for testing the model. This part of the data used for testing is also called a holdout dataset. Practically all data science platforms have functions for performing this data split. In fact, below is the RapidMiner Studio process we used to calculate the test errors for the datasets in the previous section:
Of course you have a conflict here. Typically, a predictive model is better the more data it gets for training. So this would suggest that you use as much data as possible for training. At the same time, you want to use as much data as possible to test in order to get reliable test errors. Often a good practice is to use 70% of your data for training and the remaining 30% for testing.
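To make the 70/30 split concrete outside of RapidMiner Studio, here is a minimal sketch in plain Python. The helper name `holdout_split`, the fixed seed, and the list of 100 integers standing in for labeled data rows are all assumptions made for illustration; they are not part of the process used in this series.

```python
import random

def holdout_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 labeled data rows (hypothetical)
train, test = holdout_split(rows)
print(len(train), len(test))  # 70 30
```

The important property is that no row appears in both parts, so the model never sees the rows it is later tested on.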
Repeated Hold-Out Testing
Using a hold-out dataset taken from your training data to calculate the test error is an excellent way to get a much more reliable estimate of the future accuracy of a model. But there is still a problem: how do we know that the hold-out set was not particularly easy for the model? It could be that the random sample you selected is not so random after all, especially if you only have small training datasets available. What if you end up with all the tough data rows for building the model and the easy ones for testing, or the other way around? In both cases your test error might be less representative of the model accuracy than you think.
One idea might be to just repeat the sampling of a hold-out set multiple times and use different samples each time for the hold-out set. For example, you might create 10 different hold-out sets and 10 different models on the remaining training datasets. And in the end you can just average those 10 different test errors and will end up with a better estimate which is less dependent on the actual sample of the test set. This procedure has a name – repeated hold-out testing. It was the standard way of validating models for some time, but nowadays it has been replaced by a different approach.
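The procedure described above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the synthetic two-class data, the toy "nearest class mean" model, and the function names. It is not the model or data used elsewhere in this series, only a way to show the repeat-and-average loop.

```python
import random
import statistics

def make_data(n=200, seed=0):
    """Synthetic rows (x, y): class 1 tends to have a larger x than class 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append((rng.gauss(2.0 * y, 1.0), y))
    return rows

def fit_centroids(train):
    """Toy 'model': the mean x value of each class."""
    return {c: statistics.mean(x for x, y in train if y == c) for c in (0, 1)}

def predict(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda c: abs(x - model[c]))

def error_rate(model, test):
    """Fraction of test rows that are misclassified."""
    return sum(predict(model, x) != y for x, y in test) / len(test)

def repeated_holdout(rows, repeats=10, test_fraction=0.3, seed=1):
    """Draw a fresh random hold-out set each time and collect the test errors."""
    rng = random.Random(seed)
    n_test = round(len(rows) * test_fraction)
    errors = []
    for _ in range(repeats):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        errors.append(error_rate(fit_centroids(train), test))
    return errors

errors = repeated_holdout(make_data())
print(f"average test error over {len(errors)} hold-out sets: {statistics.mean(errors):.3f}")
```

Each repetition trains a new model on the 70% that is left after removing the hold-out set, so no single lucky or unlucky split decides the final number.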
Although in principle the averaged test errors on the repeated hold-out sets are superior to a single test error on any particular test set, this approach still has one drawback: some data rows end up in several of the test sets while other rows are never used for testing at all. As a consequence, the errors you make on those repeated rows have a higher impact on the test error, which is just another form of selection bias. Hmm… what’s a good data scientist to do?
The answer: k-fold cross-validation.
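Before looking at how that works, the drawback of repeated hold-out testing can be made concrete with a small counting experiment. This is only a sketch under assumed numbers (100 rows, 10 repetitions, 30% hold-out, an arbitrary seed): it counts how often each row lands in a test set.

```python
import random
from collections import Counter

def test_set_counts(n_rows=100, repeats=10, test_fraction=0.3, seed=7):
    """Count how many of the repeated hold-out sets each row index ends up in."""
    rng = random.Random(seed)
    counts = Counter({i: 0 for i in range(n_rows)})  # start every row at zero
    n_test = round(n_rows * test_fraction)
    for _ in range(repeats):
        counts.update(rng.sample(range(n_rows), n_test))
    return counts

counts = test_set_counts()
never_tested = sum(1 for c in counts.values() if c == 0)
most_tested = max(counts.values())
print(f"rows never tested: {never_tested}, most-tested row appears {most_tested} times")
```

On average each row is tested three times here, but the individual counts are spread around that value, which is exactly the uneven weighting described above.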
Cross-Validation As the Gold Standard
With k-fold cross-validation you aren’t just creating multiple test samples repeatedly, but are dividing the complete dataset you have into k disjoint parts of the same size. You then train k different models on k-1 parts each while you test those models always on the remaining part of data. If you do this for all k parts exactly once, you ensure that you use every data row equally often for training and exactly once for testing. And you still end up with k test errors similar to the repeated holdout set discussed above. The picture below shows how cross-validation works in principle:
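For readers who prefer code to pictures, the splitting logic can be sketched in plain Python as follows. The generator name `k_fold_indices` and the toy sizes (20 rows, 5 folds) are assumptions for illustration; this is a sketch of the general idea, not the implementation inside any particular platform.

```python
import random

def k_fold_indices(n_rows, k=10, seed=42):
    """Yield (train_idx, test_idx) pairs for k disjoint folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one row
    for i, test_idx in enumerate(folds):
        # Training rows are all rows from the other k-1 folds
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

test_counts = [0] * 20
for train_idx, test_idx in k_fold_indices(20, k=5):
    assert not set(train_idx) & set(test_idx)  # no overlap within a split
    for j in test_idx:
        test_counts[j] += 1
print(test_counts)  # every row is tested exactly once
```

In a real validation you would train one model on each `train_idx` and compute its error on the matching `test_idx`, giving k test errors in total.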
For the reasons discussed above, a k-fold cross-validation is the go-to method whenever you want to validate the future accuracy of a predictive model. It is a simple method which guarantees that there is no overlap between the training and test sets (which would be bad as we have seen above!). It also guarantees that there is no overlap between the k test sets, which is good since it does not introduce any form of negative selection bias. And last but not least, the fact that you get multiple test errors for different test sets allows you to calculate an average and a standard deviation for these test errors. This means that instead of getting a test error like 15% you will end up with an error average like 14.5% +/- 2%, giving you a better idea of the range in which the actual model accuracy will likely fall when you put the model into production.
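As a small illustration of that last point, the following sketch turns a list of fold errors into an "average +/- standard deviation" figure. The ten error values are made up for this example (chosen so that they average 14.5%); they are not results from the datasets in this series.

```python
import statistics

# Made-up test errors from a hypothetical 10-fold cross-validation
fold_errors = [0.12, 0.15, 0.17, 0.14, 0.16, 0.13, 0.15, 0.18, 0.12, 0.13]

mean_error = statistics.mean(fold_errors)
std_error = statistics.stdev(fold_errors)  # sample standard deviation (divides by n - 1)

print(f"test error: {mean_error:.1%} +/- {std_error:.1%}")
```

Whether to use the sample or the population standard deviation is a judgment call; the sketch uses the sample version, which is slightly larger for a small number of folds.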
Despite all its obvious advantages, a proper cross-validation is still not implemented in all data science platforms on the market, which is a shame and part of the reason why many data scientists fall into the traps we are discussing in this blog series. The images below show how to perform a cross validation in RapidMiner Studio:
Before we move on to the next section, let’s also perform a cross validation on our four datasets we introduced in Part 2 of this series, using the three machine learning models Random Forest, Logistic Regression, and a k-Nearest Neighbors learner with k=5. Here are the results:
If you compare the average test errors above with the single hold-out test errors we calculated before, you can see that the differences are sometimes quite large. In general, the single test errors lie within about one standard deviation of the average value delivered by the cross-validation, but the differences can still be dramatic (see for example Random Forest on Ionosphere). This is exactly the result of the selection bias and the reason you should always go with a cross-validation instead of a single calculation on only one hold-out dataset.
But there is also a drawback: the higher runtime. Performing a 10-fold cross-validation on your data means that you now need to build 10 models instead of one, which dramatically increases the computation time. If this becomes an issue, the number of folds is often reduced to as few as 3 to 5.
Key Takeaways
- You can always set aside a holdout set of your data that is not used for training in order to calculate a much more reliable test error.
- Cross-validation is a perfect way to make full use of your data without leaking information into the training phase. It should be your standard approach for validating any predictive model.
Next up: Accidental Contamination! Read Part 4 of the series here.
Want to follow along with all of the examples in this Series? Load the below zip file of data and processes into your RapidMiner repository. If you need directions on how to add files to your repository, the following RapidMiner Community post will walk you through the process: How to Share RapidMiner Repositories.