## You’re Doing It Wrong! Learn the Right Way to Validate Models

### Part 1: Training Error & Test Error

Every data scientist has been in a situation where they *think* a machine learning model will do a great job of predicting something, but once it’s in production, it doesn’t perform as well as expected. In the best case, this is only an annoying waste of time. But in the worst case, a model performing unexpectedly can cost millions of dollars – or potentially even human lives!

So was the predictive model wrong in those cases? Possibly. But often it is not the model that’s wrong, but *how the model was validated*. A flawed validation produces over-optimistic expectations of what will happen in production.

Since the consequences are often dire, I’m going to discuss how to prevent mistakes in model validation and the necessary components of a *correct* validation.

To kick off the discussion, let’s get grounded in some of the basic concepts of validating machine learning models: predictive modeling, training error, test error, and cross-validation.

**What Models can be Validated?**

Let’s make sure that we are on the same page and quickly define what we mean by a “predictive model.” We start with a data table consisting of multiple columns **x₁**, **x₂**, **x₃**, … as well as one special column **y**. Here’s a small example:

*Table 1: A data table for predictive modeling. The goal is to find a function that maps the x-values to the correct value of y.*

A *predictive model* is a function which maps a given set of values of the **x**-columns to the correct corresponding value of the **y**-column. Finding a function for the given dataset is called *training the model*.

Good models not only avoid errors for **x**-values they already know; more importantly, they can also make predictions for situations that are only somewhat similar to those stored in the existing data table. For example, such a model can predict that the **y** value for the **x**-values of (1, 2, 5,…) should be “negative,” since those values are closer to the values in the second row of our table. This ability to generalize from *known* situations to *unknown* *future* situations is the reason we call this particular type of model *predictive*.

In this blog series, we’ll focus only on predictive models for a column **y** with categorical values. In principle, the validation concepts also hold if we want to predict numerical values (called regression), or if there is no such column at all (we call this *unsupervised learning*).

**Training Error vs. Test Error**

One thing holds for every machine learning method, whether it is a decision tree or deep learning: you want to know how well your model will perform. You find out by measuring its accuracy.

Why? First of all, because measuring a model’s accuracy can guide you to select the best-performing algorithm for it and fine-tune its parameters so that your model becomes more accurate. But most importantly, you will need to know how well the model performs *before* you use it in production. If your application requires the model to be correct for more than 90% of all predictions but it only delivers correct predictions 80% of the time, you might not want the model to go into production at all.

So how can we calculate the accuracy of a model? The basic idea is that you can *train* a predictive model on a given dataset and then use that underlying function on data points where you already know the value for **y**. This results in two values of **y**: the actual one, as well as the prediction from the model, which we will call **p**. The following table shows a dataset where we applied the trained model on the training data itself, leading to a situation where there is a new prediction **p** for each row:

*Table 2: A table with training data. We created a predictive model and applied it to the same data. This leads to a prediction for each row, stored in column p. Now we can easily compare how often our predictions are wrong.*

It is now relatively easy to calculate how often our predictions are wrong by comparing the predictions in **p** to the true values in **y** – this is called the *classification error*. To get the classification error, all we need to do is count how often the values for **y** and **p** differ in this table, and then divide this count by the number of rows in the table.
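The counting rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classification_error` and the labels in `y` and `p` are made up for this example and are not the values from Table 2.

```python
def classification_error(y_true, y_pred):
    """Fraction of rows where the prediction p differs from the true y."""
    if len(y_true) != len(y_pred):
        raise ValueError("y and p must have the same number of rows")
    if not y_true:
        raise ValueError("cannot compute an error rate on an empty table")
    # Count the rows where y and p disagree, then divide by the row count.
    wrong = sum(1 for y, p in zip(y_true, y_pred) if y != p)
    return wrong / len(y_true)


# Made-up labels purely for illustration (assumption, not data from the post).
y = ["positive", "negative", "negative", "positive", "negative"]
p = ["positive", "negative", "positive", "positive", "negative"]

error = classification_error(y, p)
print(error)  # one wrong row out of five -> 0.2
```

The accuracy mentioned earlier is simply one minus this value, so an error of 0.2 corresponds to 80% correct predictions.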

There are two important concepts used in machine learning: the *training error* and the *test error*.

- **Training error.** We get this by calculating the classification error of a model on the *same data the model was trained on* (just like the example above).
- **Test error.** We get this by using *two completely disjoint datasets*: one to train the model and the other to calculate the classification error. Both datasets need to have values for **y**. The first dataset is called *training data* and the second, *test data*.

Let’s walk through an example of each. We will use the data science platform RapidMiner Studio to illustrate how the calculations and validations are actually performed. You can download RapidMiner Studio for free at www.rapidminer.com and follow along with these examples if you like.

Let’s get started with the basic process you can use to calculate the **training error** for any given dataset and predictive model:

*Figure 1: Creating a Random Forest model in RapidMiner Studio and applying it to training data. The last operator called “Performance” then calculates the training error.*

First we load a dataset (“Retrieve Sonar”) and deliver this training data to a “Random Forest” operator, which trains the model, and to an “Apply Model” operator, which creates the predictions and adds them to the input training data. The last operator on the right, called “Performance,” then calculates the training error based on both the true values for **y** and the predictions in **p**.
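For readers who prefer code to a process diagram, the same train, apply, and measure sequence can be sketched in plain Python. This is not the RapidMiner process: a toy nearest-neighbor model stands in for the Random Forest, the data is made up rather than taken from the Sonar dataset, and the names `train_nearest_neighbor` and `classification_error` are hypothetical helpers for this sketch.

```python
import math


def classification_error(y_true, y_pred):
    """Fraction of rows where the prediction p differs from the true y."""
    wrong = sum(1 for y, p in zip(y_true, y_pred) if y != p)
    return wrong / len(y_true)


def train_nearest_neighbor(rows, labels):
    """Toy 'training': memorize the table and return a prediction function."""
    memory = list(zip(rows, labels))

    def predict(x):
        # Return the label of the stored row closest to x (Euclidean distance).
        _, label = min(memory, key=lambda pair: math.dist(pair[0], x))
        return label

    return predict


# Made-up training table (assumption): two x-columns and one y-column.
train_x = [(1.0, 2.0), (1.5, 1.8), (1.2, 0.9), (5.0, 8.0), (6.0, 9.0), (7.0, 7.5)]
train_y = ["negative", "negative", "negative", "positive", "positive", "positive"]

# Step 1: train the model on the training data.
model = train_nearest_neighbor(train_x, train_y)

# Step 2: apply the model to the *same* data to get the column p.
train_p = [model(x) for x in train_x]

# Step 3: compare y and p to get the training error.
training_error = classification_error(train_y, train_p)
print(f"training error: {training_error:.2f}")
```

Because this toy model memorizes its table, every row finds itself as its own nearest neighbor, so the sketch prints a training error of 0.00.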

Next let’s look at the process to calculate the **test error**. It will soon be apparent why it is so important that the datasets to calculate the test error are completely disjoint (i.e., no data point used in the training data should be part of the test data and vice versa).

*Figure 2: The test error comes from using two disjoint datasets: one to train the model and a separate one to calculate the classification error.*
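A minimal Python sketch of the same idea follows, under several assumptions: the ten labeled rows are made up, the 70/30 split ratio and the random seed are arbitrary choices, and a toy nearest-neighbor model again stands in for the Random Forest used in the RapidMiner process. The point it illustrates is only that the rows used for training and the rows used for measuring the error never overlap.

```python
import math
import random


def classification_error(y_true, y_pred):
    """Fraction of rows where the prediction p differs from the true y."""
    wrong = sum(1 for y, p in zip(y_true, y_pred) if y != p)
    return wrong / len(y_true)


def train_nearest_neighbor(rows, labels):
    """Toy 'training': memorize the table and return a prediction function."""
    memory = list(zip(rows, labels))

    def predict(x):
        # Return the label of the stored row closest to x (Euclidean distance).
        _, label = min(memory, key=lambda pair: math.dist(pair[0], x))
        return label

    return predict


# Made-up labeled data (assumption): each entry is (x-values, y).
data = [
    ((1.0, 2.0), "negative"), ((1.5, 1.8), "negative"),
    ((1.2, 0.9), "negative"), ((2.0, 2.5), "negative"),
    ((4.0, 4.2), "negative"), ((5.0, 8.0), "positive"),
    ((6.0, 9.0), "positive"), ((7.0, 7.5), "positive"),
    ((3.8, 4.5), "positive"), ((6.5, 6.0), "positive"),
]

# Shuffle the row indices and cut them into two disjoint groups (70% / 30%).
indices = list(range(len(data)))
random.Random(42).shuffle(indices)  # fixed seed so the split is repeatable
cut = (7 * len(indices)) // 10
train_idx, test_idx = indices[:cut], indices[cut:]

# No row may appear in both the training data and the test data.
assert set(train_idx).isdisjoint(test_idx)

train_x = [data[i][0] for i in train_idx]
train_y = [data[i][1] for i in train_idx]
test_x = [data[i][0] for i in test_idx]
test_y = [data[i][1] for i in test_idx]

# Train on the training data only, then predict the unseen test rows.
model = train_nearest_neighbor(train_x, train_y)
test_p = [model(x) for x in test_x]

test_error = classification_error(test_y, test_p)
print(f"test error: {test_error:.2f}")
```

With only three test rows the resulting number is very coarse; the sketch is meant to show the mechanics of the disjoint split, not to produce a meaningful estimate.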

Calculating any form of error rate for a predictive model is called *model validation*. As we discussed, you need to validate your models before they go into production in order to decide whether the expected model performance will be good enough. The same performance measure is also often used to guide your efforts to optimize the model parameters or select the best model type. It is very important to understand the difference between a training error and a test error. *Remember that the training error is calculated by using the same data for training the model and calculating its error rate. For calculating the test error, you use completely disjoint datasets for the two tasks.*

**Key Takeaways**

- In machine learning, *training a predictive model* means finding a function which maps a set of values **x** to a value **y**
- We can calculate how well a predictive model is doing by comparing the predicted values with the true values for **y**
- If we apply the model to the data it was trained on, we are calculating the *training error*
- If we calculate the error on data which was *unknown* in the training phase, we are calculating the *test error*

In Part 2 of our series “Learn the Right Way to Validate Models,” we’ll discuss **why you should ignore the training error**. Wait, what? That’s right – I said it! Stay tuned.

*Want to follow along with all of the examples in this series? Load the zip file of data and processes below into your RapidMiner repository. If you need directions on how to add files to your repository, the following RapidMiner Community post will walk you through the process: How to Share RapidMiner Repositories.*

Data & processes.zip for “Learn the Right Way to Validate Models” blog post series