Machine Learning is becoming mainstream. There are hundreds of tutorials online about how to do machine learning, and model fitting programs and software are becoming commodities. There are great APIs in Python that let you do this, alongside “Auto ML” and graphical data science tools that make building models much easier.
One of the drivers of the current model impact epidemic is not only the difficulty of putting models into production, where they can have the desired effect, but also the fact that models must be continually maintained to keep delivering impact. In my day-to-day work, I see many meaningful models sitting idle on hard disks because of a lack of maintenance.
During the development phase of a recent RapidMiner release, we dove deep into the problem of detecting drift as a key component of model maintenance. Drift, or more precisely, drift of concept, occurs when there are changes in the underlying pattern that a model is designed to detect. We can see a clear example of this if we look at the sentiment of the word “terrific” over time.
In modern American English, the word “terrific” has a positive connotation. But in the past, that wasn’t the case. Originally, the word “terrific” meant “frightful”—the word shares a root with the word “terror”. The meaning or concept of the word changed over time. If we imagine a model designed to detect sentiment, this change in meaning would lead the model to make errors as the word developed a positive connotation.
It’s thus of vital importance for us as data scientists to detect these drifts early and take countermeasures. In many cases I thought:
That’s easy! I learned to do this in my second semester math course!
But as it turned out, it’s not that trivial. In this blog post, I’d like to share my personal journey of exploring ways to detect drift, so that you can avoid my mistakes.
I need to admit up front that I am not a statistician. I am a particle physicist, so some parts of the following may be an ‘of course’ thing for you if you studied statistics. Nevertheless, I want to put it here for those of you who are like me and aren’t statisticians but may have the same kinds of questions about how to detect model drift.
My first idea to detect drift was hypothesis testing. I take a reference time frame and look at the distribution of scores. Then, I look at a more recent dataset and calculate the likelihood that these are drawn from the same underlying distribution. Looking back at my physicist view on stats, I quickly determined my default tests: chi-square, Kolmogorov-Smirnov, and t-test for normally distributed data.
While exploring these methods I realized that the chi-square test doesn’t work on numerical observations, but rather on counts. This means we need to bin the numerical data first. Kolmogorov-Smirnov, however, works directly on numerical data, and I prefer it to the chi-square test.
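To make the binning step concrete, here is a minimal Python sketch using only the standard library. It assumes scores lie in [0, 1]; the function names `bin_scores` and `chi_square_statistic`, the number of bins, and the synthetic data are all made up for illustration and are not what I ran in RapidMiner. It computes only the Pearson chi-square statistic for a 2 x k table of counts; turning that into a p-value needs a chi-square distribution from a stats package.

```python
import random

def bin_scores(scores, n_bins=10):
    """Turn numerical scores in [0, 1] into counts per equal-width bin."""
    counts = [0] * n_bins
    for s in scores:
        idx = min(int(s * n_bins), n_bins - 1)  # a score of exactly 1.0 goes to the last bin
        counts[idx] += 1
    return counts

def chi_square_statistic(ref_counts, new_counts):
    """Pearson chi-square statistic for a 2 x k table of binned counts."""
    n_ref, n_new = sum(ref_counts), sum(new_counts)
    total = n_ref + n_new
    stat = 0.0
    for r, c in zip(ref_counts, new_counts):
        col = r + c
        if col == 0:
            continue  # a bin that is empty in both samples contributes nothing
        expected_ref = n_ref * col / total
        expected_new = n_new * col / total
        stat += (r - expected_ref) ** 2 / expected_ref
        stat += (c - expected_new) ** 2 / expected_new
    return stat

# Synthetic example data (an assumption, not real model scores).
rng = random.Random(0)
reference = [rng.betavariate(2, 5) for _ in range(1000)]
recent = [rng.betavariate(2, 4) for _ in range(1000)]

stat = chi_square_statistic(bin_scores(reference), bin_scores(recent))
print("chi-square statistic:", round(stat, 2))
```

Note that the result depends on the choice of bins, which is one reason I prefer a test that works on the raw numbers.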
After running some quick explorations with these tests, I saw a big problem: the p-values were changing! For slightly different data, like a different random seed, we get very different p-values.
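The effect is easy to reproduce. The sketch below is a hand-rolled two-sample Kolmogorov-Smirnov test in plain Python, with the p-value taken from the usual asymptotic series approximation rather than an exact computation; `ks_statistic` and `ks_p_value` are illustrative names, not a library API. Both samples are drawn from the same Gaussian, and only the seed changes. When the two samples really come from the same distribution, p-values are expected to be spread roughly uniformly between 0 and 1, so large swings from seed to seed are normal.

```python
import math
import random

def ks_statistic(x, y):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        v = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= v:
            i += 1
        while j < len(ys) and ys[j] <= v:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d

def ks_p_value(d, n, m):
    """Asymptotic p-value for the two-sample KS statistic (series approximation)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.1:
        return 1.0  # the series converges slowly here; the p-value is essentially 1
    total = 0.0
    for k in range(1, 101):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return max(0.0, min(1.0, 2.0 * total))

# Same distribution for both samples; only the random seed differs.
for seed in range(5):
    rng = random.Random(seed)
    reference = [rng.gauss(0.0, 1.0) for _ in range(200)]
    recent = [rng.gauss(0.0, 1.0) for _ in range(200)]
    d = ks_statistic(reference, recent)
    print(seed, round(d, 3), round(ks_p_value(d, 200, 200), 3))
```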
But why is this a problem? Well, reading up on p-values will give you some food for thought, but the important thing in our case is a common misconception. According to Wikipedia:
Another concern is that the p-value is often misunderstood as being the probability that the null hypothesis is true.
Okay, but what does this mean for us? Our hypothesis test asked the question:
Can we confirm that the two samples are drawn from the same distribution?
A low p-value only means that we cannot confirm it. This does not mean that we can prove that these distributions are different! Not very useful…
When you dig down deeper into p-values and common misconceptions, you will see a lot of problems with the concept. So I asked myself—isn’t there something else to use? And there is—distance measures for distributions!
In financial services, it is common to use distribution measures, most prominently Kullback-Leibler (KL) and Jensen-Shannon (JS) divergence, to measure the “distance” between two probability distributions. That sounds promising!
KL is an old friend of any data scientist. There are similar measures with the names “entropy” and “information gain” in many areas of machine learning. Perhaps the most prominent examples are purity measures in decision trees.
When working with KL and JS, we need to remind ourselves that these are divergences, and not truly distances, although they’re often referred to as such. For KL, this means that
D(X,Y) != D(Y,X)
but as soon as we define one of the distributions as the ‘reference’ distribution, we are fine here. JS is symmetric by construction, but it still does not satisfy the triangle inequality (only its square root does).
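A tiny example makes the direction dependence visible. This is a plain-Python sketch for discrete distributions given as lists of probabilities over the same bins; the function names and the two example distributions are invented for illustration. It assumes the second distribution is non-zero wherever the first one is. It also shows that JS, unlike KL, returns the same value in both directions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical binned score distributions.
reference = [0.7, 0.2, 0.1]
recent = [0.4, 0.4, 0.2]

print("KL(reference || recent):", kl_divergence(reference, recent))
print("KL(recent || reference):", kl_divergence(recent, reference))
print("JS in both directions: ", js_divergence(reference, recent),
      js_divergence(recent, reference))
```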
The other problem is that this is a divergence, so you will not have a nicely defined p-value but will instead get an unnormalized value. It’s up to the user to define the threshold for what counts as “different”.
Implementing this kind of test seemed straightforward to me. Most stats packages include the divergences, including my new friend SMILE. I went for:
double[] x = …; // reference data observations
double[] y = …; // test data observations
double p = Math.KullbackLeiblerDivergence(x, y);
The result of this bit of code is a negative value for the divergence. Wait a second—negative? How can a distance be negative? That’s a bit out of the scope of this post, but you can read for yourself if you’re curious about the details.
Essentially, the problem lies in the definition of JS and KL: both are defined on probability distributions, not on raw observations.
This means we need to take a look at the probability density function and not the observations by themselves. In order to do this, we need to find a density estimator. I took the easy route and (again) used a histogram for this job, which works well. Defining the threshold is still a bit of a head-scratcher to me, but it wouldn’t be data science if there were not at least one parameter to play with.
What I do have difficulty with is trust. While KL is a standard technique in machine learning and also in quantitative analytics in financial services, there is still a lot of math involved, which can create barriers to adoption. In my day-to-day job I work with customers of all different skill levels, including many analysts coming from Tableau or Excel, as well as engineers. Both of these groups are sometimes reluctant to use concepts they cannot easily interpret and understand. They would like to have measures they already know from their own domains.
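Here is a rough Python sketch of the histogram route, standard library only. It first shows that plugging raw observations into the KL formula can produce a negative number, because the inputs are not probabilities, and then estimates bin probabilities before computing the divergence. The function names, the pseudo-count used to avoid empty bins, the synthetic data, and the threshold of 0.1 are all assumptions for illustration, not recommended values.

```python
import math
import random

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def histogram_probabilities(values, lo, hi, n_bins=10, pseudo_count=1.0):
    """Histogram density estimate: the share of values per equal-width bin.

    The pseudo-count keeps empty bins above zero so the logarithm stays defined.
    """
    counts = [pseudo_count] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, n_bins - 1))  # clamp out-of-range values into the edge bins
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

# Raw observations are not probabilities, so the "divergence" can go negative.
raw = kl_divergence([0.2, 0.3, 0.4], [0.5, 0.6, 0.7])
print("KL on raw observations:", raw)

# Synthetic scores in (0, 1), binned into probabilities first.
rng = random.Random(42)
reference = [rng.betavariate(2, 5) for _ in range(1000)]
recent = [rng.betavariate(5, 2) for _ in range(1000)]
ref_probs = histogram_probabilities(reference, 0.0, 1.0)
new_probs = histogram_probabilities(recent, 0.0, 1.0)

drift = kl_divergence(ref_probs, new_probs)
threshold = 0.1  # hypothetical; this is the parameter you have to play with
print("KL on histograms:", drift, "drift flagged:", drift > threshold)
```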
Learning from Statistical Process Control?
Statistical Process Control and Six Sigma are two methodologies used in production processes to monitor the ‘drift’ of control parameters that are used to ensure that your production quality is stable. This is the same thing we want to ensure in our model ops. So why not use SPC or 6σ techniques on machine learning deployments?
Well, one answer may be that the whole Six Sigma model is driven by normal distribution assumptions, which may or may not be a problem depending on the data being considered. On the other hand, the clear advantage is that this is already in the language of engineers and is a proven tool kit for similar scenarios. One of my next subjects to learn about is SPC. I will update you as soon as I have new results.
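To give a flavor of what this could look like, here is a very small Python sketch of the classic three-sigma control limit idea applied to the mean model score per batch. I have not validated this on real deployments: the function names and the numbers are invented, the sigma estimate is a plain sample standard deviation rather than one of the estimators used in proper control charts, and the approach assumes the batch means are roughly normally distributed.

```python
import statistics

def control_limits(reference_values, n_sigma=3.0):
    """Lower and upper control limits from a reference window."""
    center = statistics.mean(reference_values)
    sigma = statistics.stdev(reference_values)
    return center - n_sigma * sigma, center + n_sigma * sigma

def out_of_control(values, lower, upper):
    """Indices of values that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical mean scores per batch.
reference_batch_means = [0.50, 0.52, 0.49, 0.51, 0.48, 0.50, 0.53, 0.47]
recent_batch_means = [0.51, 0.49, 0.58, 0.61]

lower, upper = control_limits(reference_batch_means)
flagged = out_of_control(recent_batch_means, lower, upper)
print("limits:", round(lower, 3), round(upper, 3), "flagged batches:", flagged)
```

The appeal is that the output is a simple "inside or outside the limits" signal, which is exactly the kind of measure engineers already work with.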
If you’re curious about exploring model drift with your data, give RapidMiner Studio a try. The latest version has quick, code-free deployments for machine learning models, including drift detection, notifications, and challenger models.