05 March 2020


Model Accuracy Isn’t Enough: You Need Resilient Models

There are two things that still surprise me in data science. The first is how much time we waste on optimizing models. The second is how little impact this time has on models in production.

Key phases of data science projects

Let me explain. The first observation is about where data scientists spend most of their time. I don’t mean the typical 80%-on-data-prep argument here.

Rather, I’m referring to the various stages of a data science project—prototyping, substantiation, and operationalization.

Machine learning project phases


Every data science project starts with a prototyping phase where you explore the data, prep it, and build many, many model candidates. Since every project starts there, all projects reach at least this phase.


The next phase is called substantiation. Models are further refined in this phase, often being retrained on larger data sets. You also perform more feature engineering here, and of course this is also where most of the internal selling takes place. You will need buy-in from the stakeholders in the business to make a change.

However, most projects never make it to this phase, either because there is not enough signal in the data or because there was no buy-in. We estimate that only 30% of all projects make it to this second phase.

Of the models that make it to this stage, there are certainly some models that are beneficial and should go into production. But the challenges aren’t over yet.


In the operationalization stage, additional technical hurdles are waiting, and without proper change management the organization won’t get much out of the models you’ve built. We did some analysis, and it looks like fewer than 1% of all projects get to this stage.

Data scientists naturally spend most of their time on the prototyping phase, since that’s where most of the work is and where most of their projects remain.

Tool vendors and even RapidMiner have also been guilty of supporting this behavior for a very long time. RapidMiner features like the visual process designer, Auto Model, and Turbo Prep are all designed to make your work more productive in the prototyping phase.

What’s weird, though, is that this investment is inversely proportional to the value those models have for the organizations they’re being built for. People and tools are concentrated in the prototyping stage, but a model that is not in production has no value for a business.

Let’s discuss one of the greatest examples of this mistaken thinking: the famous Netflix Prize challenge!

Netflix Prize and Production Failure

The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for movies: if you give a certain new movie a high rating, you’re also going to like this other movie – so let’s recommend it to you and you’ll stay with Netflix longer. The grand prize was the sum of one million dollars given to the team that was able to improve Netflix’s own algorithm by at least 10%.

Netflix prize timeline

And indeed, the contributions of over 40,000 data science teams globally led to an impressive improvement of more than 10% for recommender success. Surely this would translate into business impact, right?

Well, if you think that Netflix started using this winning model to get business impact, you could not be more wrong! In fact, the finalist models were never operationalized. Netflix said that “the increase in accuracy did not seem to justify the effort needed to bring those models into production”.

Look, I am not saying that this nearly three-year race among 40,000 experts was without merit. It was definitely fun and also educational for many of the data scientists involved. But it didn’t pay off for Netflix.

Of course, some people will argue that sometimes an additional 0.01% in accuracy can mean a lot of money. I’ve heard this argument a hundred times before. Data scientists seem to quote use cases from financial services for this, even if they are not working in that field themselves.

I will argue though that this is just not true for most applications—just like it was not true in the Netflix case. And this is for two reasons:

  1. There is a well-known discrepancy between testing error and what you will see in production
  2. While business impact and data science performance criteria are often correlated, “optimal” models are not always the best for delivering the biggest business impact

Why the “best” model is often not the best

I won’t go into the details here, but there is something called overfitting to the test set. Let’s assume that you’re holding out some of your data to create a test set in order to validate your model during the development process. While you’re building and refining your model, you use this hold-out set over and over to evaluate the model.

Keep in mind that there’s a long-standing culture in data science of chasing the highest levels of accuracy or AUC or whatever metric you’ve decided means that your model is the “best”. So obviously you’ll prefer models which perform better on the test set according to that metric.

But this creates a problem because, by doing this, you’re slowly adapting your model to the test set; in a sense, it becomes a part of your training data. The same applies if you perform a cross-validation instead, although the adaptation typically happens a bit more slowly. Instead of adapting to a single test set, you’ll adapt to k test sets in the case of a k-fold cross-validation.
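To make this adaptation effect concrete, here is a minimal simulation sketch. It is not taken from any real project: the function name, the number of candidates, and the data sizes are all made up for illustration. Every "model" is a coin flip with a true accuracy of exactly 50%, yet picking the best one on a reused hold-out set makes it look clearly better than chance.

```python
import random

def simulate_holdout_selection(n_candidates=200, n_rows=100, seed=0):
    """Pick the best of many no-skill models on a reused hold-out set.

    Returns (hold-out accuracy of the selected model,
             accuracy of that same model on fresh data).
    """
    rng = random.Random(seed)
    # Labels for the reused hold-out set and for fresh, unseen data.
    holdout = [rng.randint(0, 1) for _ in range(n_rows)]
    fresh = [rng.randint(0, 1) for _ in range(n_rows)]

    best_holdout_acc, best_fresh_acc = -1.0, 0.0
    for _ in range(n_candidates):
        # Each candidate predicts at random, so it has no real skill.
        preds_holdout = [rng.randint(0, 1) for _ in range(n_rows)]
        preds_fresh = [rng.randint(0, 1) for _ in range(n_rows)]
        holdout_acc = sum(p == y for p, y in zip(preds_holdout, holdout)) / n_rows
        if holdout_acc > best_holdout_acc:
            # Keep the candidate that looks best on the hold-out set.
            best_holdout_acc = holdout_acc
            best_fresh_acc = sum(p == y for p, y in zip(preds_fresh, fresh)) / n_rows
    return best_holdout_acc, best_fresh_acc

holdout_acc, fresh_acc = simulate_holdout_selection()
print(f"selected model on hold-out: {holdout_acc:.2f}")
print(f"same model on fresh data:   {fresh_acc:.2f}")
```

The hold-out score of the selected model is inflated purely by the selection step. Averaged over many runs, the same model scores around 50% on fresh data, which is what you should expect to see in production.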

Because of this adaptation to the test set, machine learning competition sites like Kaggle use a clever trick: they introduce an additional validation set to choose the winner. What makes this clever is the fact that the contestants do not know which of the rows are part of the test set and which rows are part of the validation set.

While you can still easily overfit to the test set in this scenario, such a model may not perform best on the unknown validation set. The picture below shows how Kaggle calculates the winner based on the private leaderboard position.

Kaggle winner determination

In addition, contestants are only allowed to make a single final submission of scores. This reduces the risk that someone sends in many random variations of their scores and wins because one of them, just by chance, performs a little better on the private leaderboard.

This is all very reasonable for friendly competitions. In fact, without these methods in place you could win on the public leaderboard without ever even reading the data.

The picture below shows the distribution of Kaggle scores on the private leaderboard (inspired by this blog post). Most models will show an average performance on the private leaderboard. The winner is of course the model with the highest score.

However, because many people and teams are submitting models, there is still some level of overfitting happening. Just by random chance, some of the models will perform a little bit better on the unknown validation set even though they would show the same level of accuracy on other, additional test sets.

The best model is...

So, what is ACTUALLY the best model? It may not be the best one for the unknown test set used to determine rankings on the private leaderboard. The best model is the one that will hold up in production and will perform well on a variety of different test sets. This model is more likely further down the ranks and not at the top of the leaderboard. It’s probably still above average, but not necessarily the one at the top.

I want to emphasize here that my point is NOT to criticize Kaggle or its grandmasters. Kaggle contests can be a lot of fun and very educational. And the people who repeatedly perform well in Kaggle competitions obviously have a lot of skill. But, from a business perspective, the winning model is often not the model that should be put into production.

Because in production, you want to have a model that’s not just good by chance in specific situations but is performing well consistently. I call models that do well in a variety of circumstances resilient models – more about this later!

For now, let’s focus on the other point above: that the best model from a data science perspective is often not the one with the biggest business impact.

Accuracy does not equal business impact

To demonstrate the difference between accuracy and business impact, I created a cost-sensitive churn model with RapidMiner’s Auto Model; the picture below shows the results. You can see that chasing higher accuracies is often just a waste of time because it’s not the accuracy that’s most important, but rather the business impact that a model has.

Auto Model dashboard

If we sort the models according to the Gains column, we can see that the deep learning model produces additional gains of $80,000 even though its accuracy is lower than that of, for example, the naïve Bayes model. The same is true for AUC and other criteria. Naïve Bayes is simply better in this case according to all data science criteria, but if you assign costs and benefits to the different outcomes, deep learning delivers twice the business impact.
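The arithmetic behind this kind of result is easy to sketch. The confusion matrices and the per-outcome values below are invented for illustration and are not the numbers from the Auto Model run; the function names are hypothetical as well. The point is only that once each outcome has a cost or a benefit, the ranking by gains can differ from the ranking by accuracy.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def gains(tp, fp, fn, tn, value_tp, cost_fp, cost_fn):
    """Net business value: reward for every churner caught,
    minus wasted retention offers and minus missed churners."""
    return tp * value_tp - fp * cost_fp - fn * cost_fn

# Assumed values per outcome (illustrative only).
VALUES = dict(value_tp=500, cost_fp=20, cost_fn=500)

# Two hypothetical models scored on 1,000 customers, 100 of whom churn.
cautious = dict(tp=30, fp=10, fn=70, tn=890)     # few alarms, misses many churners
aggressive = dict(tp=80, fp=150, fn=20, tn=750)  # many alarms, catches most churners

for name, cm in [("cautious", cautious), ("aggressive", aggressive)]:
    print(name, f"accuracy={accuracy(**cm):.2f}", f"gains={gains(**cm, **VALUES)}")
```

In this made-up setup the cautious model has the higher accuracy (0.92 versus 0.83), but the aggressive model produces the higher gains, because a missed churner is assumed to cost far more than a wasted offer.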

This is a phenomenon I’ve seen hundreds of times in the past 20 years, but for many data scientists, this is a huge surprise because they rarely—or never—put models into production.

Resilience over accuracy

Here’s another surprise. Overfitting models to the test set and “inaccurate” models delivering the biggest business impact are related phenomena.

The data you have available to you during modeling is limited. The more time you spend on optimizing your models, the more likely it is that you will start to overfit to the data you have.

The result of this overfitting is that, while the model will look better and better on your test and validation sets, its accuracy will likely be worse on unknown data sets. And as we’ve just seen, the models with the biggest impact are often not the ones with the highest accuracies anyway.

So, the next time you catch yourself optimizing a model for minuscule improvements, ask yourself if it’s worth the effort. Are you improving business impact, or only AUC? And do you truly believe that your model will hold up in production with consistent performance levels?

This last point leads us to the concept of resilience. A resilient model, while not the optimal model with respect to data science measures like accuracy or AUC, will perform well on a wide range of data sets beyond just the test set. It will also perform better for a longer period of time, as it’s more robust and less overfitted. This means that you don’t need to constantly monitor and retrain the model, which can disrupt model use in production and potentially even create losses for the organization.

While there is no single KPI for measuring model resilience, there are a few ways you can evaluate how resilient your model is:

If you have model operations in place, you can detect many of these signals easily. Below is a screenshot of RapidMiner’s Model Ops. As you can see in the top right, the deep learning model has some early days with a lower error rate, but the error rates slowly creep up over time. There’s also more volatility in the error rates.

In this case, the gradient boosted trees (GBT) model is more consistent. It also continues to deliver steady business impact, as shown in the bottom right corner. While the deep learning model started off better, all the points above are good indicators that the GBT is not just the better model in general but also the more resilient one.

Churn model dashboard

Seeing this kind of performance over time and under real-world circumstances is the best way to determine the state of your model. Is the model just accurate, or is it also resilient? If you examine model performance over time, it’s easy to determine if you’ve built a resilient model or just a one-hit wonder.
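If you log error rates per day or per week, you can put rough numbers on the two signals described above: error rates creeping up over time, and volatility. The sketch below is one possible way to do that, not how RapidMiner’s Model Ops computes anything. The function names and both error-rate series are hypothetical.

```python
from statistics import mean, pstdev

def error_trend(error_rates):
    """Least-squares slope of the error rate per time step.

    A positive value means the error rate is creeping up.
    Needs at least two observations.
    """
    n = len(error_rates)
    x_mean = (n - 1) / 2
    y_mean = mean(error_rates)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(error_rates))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

def resilience_summary(error_rates):
    """Average error, volatility, and drift of a production error series."""
    return {
        "mean_error": mean(error_rates),
        "volatility": pstdev(error_rates),
        "trend": error_trend(error_rates),
    }

# Made-up error rates for eight consecutive periods.
deep_learning = [0.10, 0.11, 0.14, 0.12, 0.17, 0.15, 0.20, 0.18]
gbt = [0.14, 0.13, 0.14, 0.15, 0.14, 0.13, 0.14, 0.14]

dl_summary = resilience_summary(deep_learning)
gbt_summary = resilience_summary(gbt)
print("deep learning:", dl_summary)
print("GBT:          ", gbt_summary)
```

In this invented example the deep learning series starts with the lower error but shows both a clearly positive trend and higher volatility, while the GBT series stays almost flat. That is the pattern of a one-hit wonder versus a resilient model.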

Takeaways and next steps

I highly recommend that you always prioritize building a resilient model. Linear or other simple models are typically good since they overfit less to a specific test set or a moment in time.

Then you can add more powerful models as challengers, running them next to the resilient model to monitor their performance. You can even keep tuning those more complex models and, thanks to the monitoring, you can prove that they are truly better over time.
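A simple way to decide when a challenger has earned its place is to require that it beats the resilient model in most monitoring periods, not just in total. The sketch below assumes you record a business metric such as gains per period for both models. The function name, the 80% threshold, and the weekly figures are illustrative assumptions, not a recommended standard.

```python
def challenger_is_consistently_better(champion, challenger, min_win_rate=0.8):
    """Return True if the challenger beats the champion in at least
    `min_win_rate` of the monitored periods (higher metric = better)."""
    if not champion or len(champion) != len(challenger):
        raise ValueError("need two non-empty series of equal length")
    wins = sum(new > old for old, new in zip(champion, challenger))
    return wins / len(champion) >= min_win_rate

# Made-up weekly gains (in thousands of dollars) for three models.
resilient_model = [10, 11, 10, 12, 11]
steady_challenger = [12, 13, 9, 14, 13]    # wins 4 of 5 weeks
spiky_challenger = [15, 8, 9, 20, 10]      # higher total, but wins only 2 of 5 weeks

print(challenger_is_consistently_better(resilient_model, steady_challenger))
print(challenger_is_consistently_better(resilient_model, spiky_challenger))
```

Note that the spiky challenger has the higher total gains in this example and would still be rejected. Whether that is the right call depends on how much you value consistency, so treat the threshold as something to set with your stakeholders.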

Changing your mindset to focus on resilience instead of accuracy is a great way to guarantee continuous business impact.

If you’d like help with your next machine learning project, you can sign up for a free AI Assessment and someone from our team will help you evaluate potential use cases and their business impacts, with no obligation.
