You’re doing it wrong! It’s time to learn the right way to validate models.
Unfortunately, you will find a lot of references to training errors in machine learning literature. Reporting the training error as a measure of model quality is a bad practice and should be avoided altogether. Training errors can be dangerously misleading! Here’s why.
Why training errors are misleading
Let’s assume we have a data set with 2 x-columns and 1 y-column completely filled with random data, i.e., the data for the x-columns are just random numbers and the values “positive” and “negative” for y are also just randomly assigned to the rows. The result looks like the table here:
Table 1: A table with random data for the x-columns as well as for the classes in y.
The image below shows a chart of this data with the two x-columns on the axes of the chart and the random values for y used as color:
Figure 1: This is a data set in 2 dimensions where all values are completely random, including the class y which is depicted with the colors red (positive) and blue (negative).
The data set we used for this image has 1,000 rows and is balanced, i.e., it contains 500 positive and 500 negative rows.
Now think about this: this data is completely random for both the positions of the points and the color of the points. Do you think it is possible to build a predictive model for such a random data set?
In fact, it is not. If the data is truly random, there is no pattern to be found and hence no function can be trained. The best you can achieve is to either produce random predictions or just go with one of the two classes “positive” or “negative” in all cases.
And what kind of classification error should such a predictive model have? Well, that’s easy.
Both classes are equally distributed (500 out of 1,000 rows for each class) and if we always predict one class in all cases, we should be correct in exactly half of all the cases and wrong for the other half, i.e., we should end up with a classification error of 50%.
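If you want to check this 50% baseline yourself, here is a minimal Python sketch. It is not the RapidMiner process used for this article; the helper names `make_random_data` and `classification_error` and the seed are assumptions made for illustration only.

```python
import random

def make_random_data(n_rows=1000, seed=0):
    """Random 2-D points with random but exactly balanced class labels."""
    rng = random.Random(seed)  # arbitrary seed, chosen only for reproducibility
    points = [(rng.random(), rng.random()) for _ in range(n_rows)]
    labels = ["positive"] * (n_rows // 2) + ["negative"] * (n_rows - n_rows // 2)
    rng.shuffle(labels)  # assign the classes to rows at random
    return points, labels

def classification_error(predicted, actual):
    """Fraction of rows where the prediction differs from the true class."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

points, labels = make_random_data()

# A "model" that ignores the x-columns and always predicts the same class.
always_positive = ["positive"] * len(labels)
baseline_error = classification_error(always_positive, labels)
print(f"Error when always predicting 'positive': {baseline_error:.1%}")  # 50.0%
```

Because the labels are split exactly 500 to 500, always predicting one class is wrong on exactly half of the rows, no matter how the points are placed.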
So let’s do a little experiment and calculate the training error and test error for a widely used machine learning method, k-Nearest Neighbors.
Here’s how this algorithm works: all data points from the training set are simply stored during the training phase. When the model is applied to a new data point, the predictive function looks for the k most similar data points in the stored training data and uses their classes (or the majority class in case they are different) as the prediction.
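The idea described above can be sketched in a few lines of Python. This is a simplified illustration, not RapidMiner’s implementation: the function name `knn_predict`, the use of Euclidean distance as the similarity measure, and the toy points are all assumptions.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=1):
    """Predict the class of `query` from its k nearest stored training points."""
    # "Training" is just keeping the data around; the work happens here.
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),  # Euclidean distance
    )
    nearest_labels = [train_labels[i] for i in by_distance[:k]]
    # Majority vote; on a tie, the class seen first among the neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two points near the origin, three near (1, 1).
train_points = [(0.0, 0.0), (0.1, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
train_labels = ["negative", "negative", "positive", "positive", "positive"]

prediction_k3 = knn_predict(train_points, train_labels, (0.8, 0.8), k=3)
prediction_k1 = knn_predict(train_points, train_labels, (0.05, 0.0), k=1)
print(prediction_k3, prediction_k1)
```

Sorting all training points for every query is the simplest way to write this down. Real implementations usually use smarter search structures, but the prediction logic is the same.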
And here is the result: the training error of the 1-Nearest Neighbor classifier is 0% (!), a perfect classifier, and much better than the 50% error we expected. How is this possible?
The reason is simple: if you apply a 1-Nearest Neighbor classifier on the same data you trained it on, the nearest neighbor for each point will always be the point itself. Which, of course, will lead to correct predictions in all cases, even though this is random data.
But for data points which are not part of the training data, the randomness will kick in and the 0% classification error you hoped for will quickly turn into the 50% we expected.
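You can reproduce the 0% training error without RapidMiner. The following Python sketch is an assumption-laden stand-in for the actual process: it generates its own random data with an arbitrary seed, uses Euclidean distance, and the helper name `predict_1nn` is made up for this example.

```python
import math
import random

rng = random.Random(42)  # arbitrary seed, not the data behind the figures
n_rows = 1000
points = [(rng.random(), rng.random()) for _ in range(n_rows)]
labels = ["positive"] * (n_rows // 2) + ["negative"] * (n_rows // 2)
rng.shuffle(labels)  # classes are assigned to rows at random

def predict_1nn(train_points, train_labels, query):
    """Return the class of the single closest training point."""
    nearest = min(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    return train_labels[nearest]

# Apply the model to the very same rows it has stored.
wrong = sum(predict_1nn(points, labels, p) != y for p, y in zip(points, labels))
training_error = wrong / n_rows
print(f"1-NN training error on random data: {training_error:.1%}")
```

Every query point finds itself at distance zero, so the model simply reads back the stored label. This assumes no two rows share identical coordinates, which is practically certain for random floating-point values.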
And this is exactly the kind of negative surprise you can get if you decide to go into production after validating the model with nothing but the training error. The training error gives you an inaccurate picture of how well the model performs.
Below is the RapidMiner Studio process we have used to calculate the training error for the 1-Nearest Neighbors classifier on the random data set. It delivers a 0% classification error.
Figure 2: Calculating the training error of a 1-Nearest Neighbor classifier always leads to a 0% training error – even on completely random data.
We can now compare this 0% training error of the 1-Nearest Neighbor classifier to the test error calculated on a completely independent test set that was generated with the same random procedure. The process below uses two disjoint sets of random data, each with 1,000 rows:
Figure 3: Calculating the test error of 1-Nearest Neighbors for random data shows a completely different picture: we get a test error of 51.5% – close to our expectation of 50%.
This process delivers a test error of 51.5%, which is very close to our expected error of 50%. You can see that the test error is a much better estimate of how your model will perform on unseen data than the training error, which would give you the false belief of a perfect model.
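The same comparison can be sketched in Python. Again, this is only an illustration under assumptions: the data is freshly generated with an arbitrary seed, the helper names `make_random_data` and `error_1nn` are invented here, and the exact test error will differ from the 51.5% above because the random rows are different.

```python
import math
import random

def make_random_data(n_rows, rng):
    """Random 2-D points with random, balanced class labels."""
    points = [(rng.random(), rng.random()) for _ in range(n_rows)]
    labels = ["positive", "negative"] * (n_rows // 2)
    rng.shuffle(labels)
    return points, labels

def error_1nn(train, test):
    """Classification error of a 1-Nearest Neighbor model on `test`."""
    train_points, train_labels = train
    test_points, test_labels = test
    wrong = 0
    for query, actual in zip(test_points, test_labels):
        nearest = min(
            range(len(train_points)),
            key=lambda i: math.dist(train_points[i], query),
        )
        wrong += train_labels[nearest] != actual
    return wrong / len(test_points)

rng = random.Random(7)  # arbitrary seed
train = make_random_data(1000, rng)
test = make_random_data(1000, rng)  # drawn separately, so disjoint from train

test_error = error_1nn(train, test)
print(f"1-NN test error on independent random data: {test_error:.1%}")
```

With 1,000 test rows the result should land within a few percentage points of 50%. The remaining deviation is just sampling noise.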
Now, you might argue that a 1-Nearest Neighbor classifier is not a realistic approach (which I would argue against!) and that data typically is not random (which I would accept for many cases).
So how do the training error and the test error look different for more realistic scenarios?
The table below shows the difference between both error types for three different machine learning methods on four different datasets from the UCI data set repository for machine learning:
Table 2: The table shows the test and the training errors for three different machine learning methods on four different data sets. The differences are sometimes dramatic (up to 18%) with an average of almost 10%.
The column “default” is the error rate you would get if you went with just the majority class as prediction in all cases. You can clearly see that all machine learning methods have been able to improve the test error from the default error in all cases, so the model has clearly learned something useful.
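For clarity, here is how such a default error can be computed. This is a small Python sketch with a made-up label list; the function name `default_error` and the 60/40 class split are purely hypothetical and are not taken from any of the four data sets.

```python
from collections import Counter

def default_error(labels):
    """Error rate of always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    # Every row that is not in the majority class is misclassified.
    return (len(labels) - majority_count) / len(labels)

# Hypothetical data set: 60 positive rows and 40 negative rows.
toy_labels = ["positive"] * 60 + ["negative"] * 40
toy_default = default_error(toy_labels)
print(f"Default error: {toy_default:.0%}")  # 40%
```

A model is only useful if its test error is lower than this number, which is why the default column is a handy reference point.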
But you can also see how far off the training errors are – in fact, none of them would turn out to be true if you went into production. The test errors on the other hand will at least be very close to what you will observe in production.
On average, the deviation between the two error types is about 10% – in one case, namely Random Forest on the Sonar data set, the difference went as high as 18%!
One thing is very important: the best thing we can do is deliver an estimate of how well the model will perform in the future. If done in the right way, this estimate will be close to what can be achieved, but there is no guarantee that the actual performance will exactly match it.
In any case, however, the test error is a much better estimate of how well the model will perform for new and unseen cases in the future. The training error is not helpful at all, as we have clearly seen above.
That’s why I recommend that you don’t use training errors at all. They are misleading because they always deliver an overly optimistic estimation about model accuracy.
In conclusion, you should never use the training error for estimating how well a model will perform. In fact, it’s better to ignore the training error altogether.
Now that we’ve established that using the training error is a terrible idea, let’s dive into why cross-validation is the gold standard.
Want to follow along with other examples? Load this file of data and processes into your RapidMiner repository: Data & processes.zip for “Learn the Right Way to Validate Models”. If you need direction on how to add files to your repository, this post will help: How to Share RapidMiner Repositories.