Ingo Mierswa, RapidMiner
Kaggle is useless. Most models don’t provide business impact. Data scientists are wasting time. And deepfake is for losers. Dr. Ingo Mierswa has worked for 20 years on hundreds of data science projects, and in that time, he’s seen it all.
In this presentation, he’ll discuss some of the mistakes we make as data scientists – often inadvertently, but sometimes even though we should know better. What’s worse, the complexity of the data science field lets us hide the consequences of these mistakes from others. So how do we break free? Dr. Mierswa presents a manifesto for data science, a set of basic principles designed to guide our work and make sure that our models have the desired impact.
00:08 [music] [applause] Thank you, Carol, for the introduction. Yeah, so we will talk about Deepfake is for Losers and Other Secret Confessions of a Data Scientist. First of all, I’m truly honored to kick off RapidMiner Wisdom here today. And we all know, I mean, today is a sunny day, but we all know who is to blame in case that changes in the next 24 hours: Carol over here. But before we really jump into this, I want to share one thing with you, and we will come back to it later, so it will even make sense. Before I moved to Boston, I actually checked out European cities, okay? That was seven years ago. European cities on the same latitude as Boston. Do you know that Barcelona is on the same latitude? Beautiful. And the lovely city of Dubrovnik, here in Croatia. Look at this. Some of you may even recognize Dubrovnik because it was used as King’s Landing in the TV show Game of Thrones. And did you ever see it snow in King’s Landing? Right. Neither did I. So you will understand my disappointment when I encountered my first New England winter. [laughter]
01:17 So anyway, enough about that. We will talk about exciting things today. In fact, about things people typically do not like to talk about. We’ll make a couple of confessions, and I will start with one right away. I have been doing this for 20 years now, in fact. I still do it every day, and I love it. I love doing data science. I love working with data. I love building machine learning models. And in the last couple of years, it has been truly remarkable, all the great breakthroughs we have made in artificial intelligence and machine learning. It was even covered by the mainstream media. Finally, people are talking about this, and not just in science fiction movies. So, fantastic. We will discuss some of those examples today and what we can learn from them. And to kick things off, since we are a group of data scientists in this room, I have to ask you a question first. Okay? So whenever you want, you chime in. What is your favorite data set when you are introducing data science to other people? What are you using? Any data set which comes to mind?
02:30 [laughter] You guys are so predictable. [laughter] Of course, the Titanic data set. So let’s quickly talk about the Titanic. But not about the data set. What did you folks think? I am talking about the classic American film Titanic, of course. [music] Who does not remember young Leonardo DiCaprio as Jack Dawson in this movie? Things in this scene went very, very well for him. Well, as we all know, things went a little bit downhill from there. But do you also remember that the main actor in this movie was actually Chinese? No? Well, let’s see. Most recently, new techniques emerged based on neural networks which generate complex outcomes like videos or images. So think about what a remarkable breakthrough this is. Instead of just predicting numbers or classes, we can now predict how a face would look or how a voice would sound. The possibilities are endless. So naturally, the first thing we do is recreate Leonardo DiCaprio movies with Chinese actors instead. But all silliness aside, what is actually truly impressive about this application here is that those videos have been recreated from one still image only. In fact, this image here on the left. And that is truly impressive.
03:59 Well, I promised you we’d come back to this. We saw Dubrovnik as a fake King’s Landing earlier in the presentation already today. So let’s have a look at another deepfake video, based on Game of Thrones this time. They used the same face for both actors: Sam here, and Jon Snow. And I’m not sure about Sam. I don’t find him that convincing. But look at Jon Snow in Chinese, [laughter] I mean, not bad. Obviously, it’s pretty confusing to use the same face for multiple actors in the same scene. But that’s not the only confusing thing about deepfake videos, because in the future, it will be incredibly hard to tell whether a video or an image is actually real news or fake. Well, the good news is I’m pretty sure that when the time comes, we will be able to create just another machine learning algorithm which helps us to detect whether a video created by a first machine learning algorithm is actually fake or not. I mean, machines controlling other machines. That sounds totally reasonable to me. So this is probably going to be our future. So let’s have a look at a couple of other examples.
05:02 In particular, this one here, which I actually found pretty impressive. That was DeepMind’s AlphaGo winning against the world champion in Go. And Go seems to be an interesting game. I don’t play it myself, but I hear it’s a very strategic game. And the space of possibilities is so large that many people even claimed that, unlike for chess, computers would never be able to beat good human players in Go. Well, it turned into just another example of why it is a bad idea to bet against AI in the long run, because, in fact, AlphaGo won four out of five games against Lee Sedol, here on the right, the world champion in Go. An interesting side note here is that Lee Sedol actually retired from professional play after this loss in Go. So this win now actually also turns into an example of how machine learning is taking your jobs. [laughter] Lee Sedol, by the way, said, “Even if I become the number one, there is an entity which cannot be defeated.” That does not sound that nice, but it brings me to my last example of AIs beating humans. And that’s the now-famous Jeopardy case. I’m sure you all know about this. It got a lot of public attention when, a couple of years ago, IBM’s Watson won the 1 million dollar Jeopardy prize against the human Jeopardy champions, Ken Jennings and Brad Rutter. Ken Jennings, here on the left, even wrote down, “I for one welcome our new computer overlords.” Really? Do you now, Ken? I mean, do you really? What could possibly go wrong if computers constantly outperform humans? [laughter]
06:49 Anyway, back to Watson. After this great success IBM had with Watson on Jeopardy, IBM decided to put Watson to work in one of the most critical areas of modern societies: medicine. And in 2014, the world was truly excited about the prospect that an AI would be the best doctor in the world. An IBM spokesperson even declared that if it was not already the world’s best diagnostician, it would be soon. Well, not really. I mean, things started very promisingly, though. In 2013, IBM declared that MD Anderson, the cancer treatment center which is part of the University of Texas, was using the IBM Watson cognitive computing system, whatever that means, for its mission to eradicate cancer. And that’s, of course, fantastic. But a few years later, unfortunately, this whole project was put on ice after investments of more than 62 million dollars had already been made. Watson was even called a flop in medicine. And doctors are losing faith in Watson’s AI doctor.
08:02 So why are we doing all of this? Why are some of the smartest people on the planet spending their time creating deepfake videos and winning board games or game shows like Jeopardy, while the same smart people seem to utterly fail on real-world problems like cancer detection? I mean, sure, it is fun to project your own face onto pretty much the only character who survived your favorite TV show. [laughter] But that can’t be all. Maybe the truth is simply that those fun applications like deepfakes are actually more successful. At least, it feels like they are. Because we seem to fail more often with machine learning in the real world. And also, the cost of failure is much higher. Think about this. The cost can be devastating. It can even cost real human lives, and not just the lives of the people in [inaudible]. So this brings me to the three key questions of this keynote today. First, what is the current state of data science? Second, how did we get there? What happened? And third, and probably most important, what do we need to do to get to a better position?
09:25 So let’s get started with the current state. We’ll begin with the various stages of each data science project. Each project starts with what I call the prototyping phase, where you explore the data, prep it. You generate lots and lots of model candidates to figure out if there is something in your data. So sure, every project starts there. So naturally, all projects reach this phase. However, the next phase which is called substantiation– and things are a little bit different there. In this phase, you actually retrain models on larger data. You further refine your models. And that is also the phase where the internal selling takes place. It turns out that most projects actually never make it into this phase, either because there is just not enough in the data or because there is not enough buy-in from the organization. However, some models should actually go into production. And this is where additional hurdles, both technical and cultural, are waiting for you. We did some analysis recently, and it looks like less than 1% of all projects ever reach this phase.
10:32 So first of all, naturally, data scientists spend most of their time in the early prototyping phase because that’s where most of their projects remain. That makes sense. And given that, it is actually surprising how many mistakes we are still making in this phase. I mean, I have a list of some of the mistakes I made here. Look, I confess, I have been doing this for 20 years, and in that time I made a couple of mistakes. These mistakes are things I started to look for in future situations or when I was managing data science teams. Making those mistakes is, of course, not a problem. I mean, we learn from them. We can grow. The real problem actually starts when we start to hide them. But more about that later.
11:22 I mean, obviously, those are too many mistakes to discuss them all, so I will pick one which I think is a common one, and which is also sometimes hard to detect, so you don’t even know that you are making it. And this one mistake I would like to discuss is a new form of overfitting introduced as a result of feature engineering. Feature engineering, by the way, often makes the difference between a successful data science project and utter failure. But it’s also a creative process. It can be fun. So it is no surprise that many data scientists spend a lot of time on feature engineering. Some, actually most, data scientists, though, do it wrong. So let’s have a look at a couple of examples.
12:04 I have here a list of all the planets in the solar system. So I didn’t talk to him before, but where’s Martin? Martin, any comments on that list? [laughter]
12:17 Yeah, the dwarf planets.
12:21 The other two dwarf planets. Okay. You didn’t say exactly what I predicted you would say, but almost. Martin is pointing out that Pluto is no longer a planet but a dwarf planet. So let’s quickly fix this slide. So we have here a list of the eight main planets and the one dwarf planet, and apparently, there are two more in the solar system. Martin, you’re also predictable. [laughter] So the thing is, this is the list, and our goal is to predict the number of deaths on each planet. So here’s the data. [laughter] We don’t have a lot of information. I mean, we have this picture here on the left, and we have the names, really. Maybe the names are helpful. Let’s see. So Earth, that has five letters, ooh, ooh. Ah, so does Venus, and also Pluto, our dwarf planet here. Yeah, that doesn’t help. Ah, maybe it’s the picture. Maybe it’s the dominant color in the picture. Earth is pretty blue, right? That should help. Ah, dammit, so is Neptune. I don’t actually think that Neptune is blue, but who cares? That’s not it either. Ooh, ooh. I’m jumping on stage. Maybe I shouldn’t do that. [laughter] So I have it. I found something. I did a lot of feature engineering, folks, and I came up with the perfect feature helping me with this prediction task. It is the question of whether the planet has RapidMiner or not. Nailed it. There you have it. It’s also totally obvious that RapidMiner kills people. So you probably think this is a silly example, and you would be right. However, in reality, things are actually worse, because in reality, this mistake is really, really hard to detect.
14:00 So let’s see another example, a more realistic one. This is based on the flight data set. Many of you may be familiar with this one. It’s about predicting late arrivals for domestic flights in the United States. I did some basic data preparation, some basic cleansing, and then put it just into RapidMiner’s Auto Model. I did not even churn on the full-blown automatic feature engineering in Auto Model. I just did the basic feature engineering like extracting information from text columns or from date columns. Okay? Nothing fancy. But the results; the results look great. Look at the error rates for those three models. They are all are between 1.5 and 2 percent. Who wouldn’t use one of these models to predict flight delays and optimize airport and cargo operations, right? Well, not so fast. Let’s just have another look, just to be safe into the model. In fact, into the weights of the different attributes. So these weights indicate how important each of the columns is for this specific model here. And it turns out that one of the columns, which has been a result of the feature engineering, is somewhat important. It’s called date underscore div, ARR time, CRS ARR time. Sure, why not? Seems to be some difference between two dates, [smirk?]. I’ll take it. Well, maybe just to be sure, let’s have a quick look into the data. So this orange column is the data. Yeah, it’s a bunch of numbers. Looks good. All right, let’s go into production. Right, let’s rock this thing. Again, not so fast because it turns out that this column, CRS ARR time, is actually the scheduled arrival time. And while as such, that doesn’t tell us a lot, if you built the difference between the scheduled arrival time and the actual arrival time, well, that’s pretty much exactly the delay you want to predict. Oopsy. I just accidentally leaked the label information into my model. Yeah, that’s obviously a big no, no.
16:00 So look, don’t get me wrong. This type of feature engineering is often very useful. Think about building a churn model. If you take the last transaction date and, let’s say, today’s date, and you build the difference, then the longer that is, the more likely the customer has churned, so that can be valuable information to predict customer churn. That all makes sense. So doing this in an automated fashion, all fantastic. All good. But in this case, this was leaking information into the model. This is information the model should not have. So the right thing, of course, is to remove those columns. And while this is, of course, the right thing to do, it has a dramatic impact on the model, on the model accuracy, as you can see here. Forget the other three, but those are the same models as before. The error rates went up from 1.5 to 10 percent. So that’s six times worse. And this is a common problem. Doing the right things often makes us look bad – oopsy – as data scientists.
16:53 So look, feature engineering done well can be amazing. It can truly improve a model. Done poorly, it misuses the so-called curse of dimensionality to its advantage. And that leads to models which look good but won’t hold up in production. So they’re useless, really. And even in the case of full automation, and I love automation, no question about that, please, let’s always check the results, and let’s use common sense to make sure that we won’t have any negative surprises down the road. But is there even a down the road? Because the reality is this. Even if we don’t make any mistakes at all, we have a bigger problem, because most of those models are actually never operationalized. Remember this slide from before, and that I said most data scientists spend their time in the early phase? And by the way, most tool vendors have been guilty of this behavior and actually supported it. RapidMiner did this for a very long time, too. Think about this. The process designer, Auto Model, Turbo Prep, all these and more are features which are supposed to make you more productive in the prototyping phase. And there’s nothing wrong with that. What is really weird, though, is that all those investments are almost exactly inversely proportional to the value those models actually have for organizations. Because people and tools keep investing in the early phase. But if a model is not in production, it doesn’t really have a lot of value, or any value, really. And that’s a huge mistake.
18:26 So let’s discuss one of the greatest examples, in my opinion, for this mistake next. The famous Netflix prize challenge. The Netflix prize, for those of you who didn’t know, it was an open competition for the best collaborative filtering algorithm to predict user ratings for movies. So the idea is very simple. If I don’t know a movie, but Netflix predicts that I would give a movie which is unknown to me a high rating, they recommend the movie to me. I watch it. I like it. More likely, at least, I like it. So I spend more time on Netflix. For Netflix, this means ka-ching, so everybody’s happy. So the grand prize was the sum of 1 million dollars given to the team which is able to improve Netflix’s own algorithm by at least 10%. In addition, Netflix offered annual progress prizes for 1% improvements over the original state first, and then later over the previous year state. And those annual progress prizes still have been $50,000 each. And this race for the 1 million dollars turned into a true thriller.
19:33 It started on October 2nd, 2006, and only one week later, the first team had already beaten Netflix’s own algorithm. That was impressive. And by June 2007, over 20,000 teams had entered the competition. In August of the same year, many of those folks actually met at the KDD Cup, fantastic conference, by the way, where they exchanged ideas. And shortly after, the first annual progress prize for the 1% improvement over the original algorithm was granted. In fact, that improvement was already 8.43% after one year. So some money was spent, and it seems that this heated things up slightly, because in the next year, more than 40,000 data science teams from 186 countries had entered the contest. And the friendly competition was somewhat replaced by espionage and strategic coalitions. I mean, some of the top teams actually united to get an advantage over other teams and also to eliminate some of the competition. But even after two years, when another progress prize of $50,000 was granted, the original goal of 10% improvement was not quite met yet. But they came really close, and they made it in 2009. Because on June 26th, the first team submitted a solution with more than 10% improvement over the original algorithm. According to the rules of the contest, this triggered a final 30-day period. And in this period, other teams also had the chance to pull that off. And at the end of this period, actually, two teams managed to improve it by more than 10%, and they ended up with exactly the same results on the test set. That was unbelievable. What was more unbelievable, though, is that because of that, the winner was selected based on submission time. One of the teams submitted their solution 20 minutes earlier than the other team. 20 minutes. 20 minutes decided who was getting the 1 million dollars.
21:58 Anyway, I told you, it’s a thriller. Anyway, contributions of over 40,000 teams globally led to an impressive 10% improvement over the original state for Netflix. And this would, of course, truly translate into business impact for Netflix, right? So Netflix hired some of the top teams’ members, and they even licensed the algorithm. So if you think now that this data science success, this achievement, was also paving their way to success as a public company, that’s a reasonable thought. So they probably put the whole model into production, right? Wrong. They did not. In fact, those final models have actually never been operationalized. Netflix said that the increase in accuracy did not seem to justify the effort needed to bring those models into production. So look, I’m not saying that this almost four-year race among 40,000 data science expert teams was without merits. I mean, it was surely a lot of fun and education for many of the data scientists involved. Me, by the way, being one of them. But it surely did not pay off for Netflix. And that became a bit of a pattern. As I mentioned before, we think less than 1% of all models are operationalized. VentureBeat reported that only 13% of all projects make it into this phase. And Gartner– sorry, Mike. Gartner says that even out of those models which are supposed to be operationalized, 60% are actually still not.
23:31 So, to summarize, it is easy to make mistakes. We all make them. But luckily, it does not really seem to matter, because hardly any of those models are ever in production anyway. Great. [laughter] So how did we end up here? Well, let’s discuss this next. I mean, the first observation is very simple. Data scientists are people, and people, just like dogs, do whatever they are incentivized to do. This can become a problem, though, if the incentives are wrong. In data science, we have a long culture of focusing on model performance measurements, things like accuracy, classification error, AUC, R-squared, you name it. Okay? And we have seen already, for example in the Netflix Prize, that the question was not, hey, how can we make better TV shows, but how can we improve the accuracy of our algorithm by 10%? And this long tradition of focusing too much on data science criteria like accuracy eventually led to a platform which is focused solely on this. Companies can post data science challenges, and then people around the globe compete for the highest levels of accuracy and AUC. This platform, of course, is called Kaggle. And you thought that the 40,000 participants in the Netflix Prize were a lot already? Well, Kaggle claims that more than 3 million people have registered on their website.
24:59 You also remember this idea that we could train a model to detect deepfake videos? Well, it turns out that somebody already came up with this idea and is even spending 1 million dollars in prize money on it. So is Kaggle a good idea then? I would say it depends. What is certainly not great is that Kaggle further promotes this wrong incentive of model accuracy being the most important thing to focus on. So, for example, this deepfake competition here is using log loss. Sure, why not? I mean, look, there is nothing wrong with optimizing that thing here. I mean, why not? Just for the sake of fun and education. But fun and education are most likely not the main reasons why organizations are spending money on this. Probably those organizations want to get the truly best model. The model which has the best and biggest impact on their organizations. But are they actually getting this model? Well, to understand this, we first need to discuss how Kaggle actually selects a winner. And given that Kaggle is a data science platform, they, of course, know their stuff. So they’re using a hold-out test data set. You have a training data set with the label information. You can build models on that. Then you can take the model, apply it on some test data, create a set of scores, and whenever you want, you can submit a set of scores to Kaggle. All right. So far, so good. But here’s where things get a little more interesting.
26:34 This test data set is divided into two parts, but Kaggle does not tell you which of the rows of the test data belong to which of the two parts. All they tell you is your error rate, your average error rate, on one of the two parts. And this error rate goes here into this public leaderboard. Why are they doing this? So you can compare yourself with others during the competition. How well you are doing. Where you stand while the competition is still running. Sounds good, but that’s not how they pick a winner. Why not? I could just create a new model type. I could invent a new algorithm. I call it the randomizer. The randomizer is a fantastic machine learning model. All it does, it doesn’t care about the data, it just produces random outcomes. And I keep producing those scores and submitting them. And sooner or later, just by chance, I will be number one on the public leaderboard. But I didn’t learn anything. That model can’t be used, obviously, but still, I would win the competition. So obviously, I would overfit to this test set. So the winner, in fact, is not chosen based on the public leaderboard; instead, they use the rank on the other half, which defines the private leaderboard. And whoever is number one there at the end of the competition is the winner.
27:45 And there’s one more thing. Kaggle, for most competitions at least, also does not allow you to submit more than one final set of scores. So you can keep submitting for the final leaderboard, but at the end, you need to select one of your submissions as your submission for the ranking on the private leaderboard. Why are they doing this? Pretty much for the same reason. You could just otherwise generate thousands of small variants of your score set, just hoping that just by chance you end up in a higher position on the private leaderboard. So obviously, that should not happen. This all sounds perfect, doesn’t it? They really thought it through. Yeah, they did. There’s still a problem, though. Let me illustrate this problem with a little experiment.
28:28 I had a dozen or so magic coins. Stupid me, I lost them. I accidentally placed them on your chairs together with normal coins. They all look exactly the same, by the way. That’s pretty stupid on my side. Luckily, it’s really easy to detect a magic coin, though. I actually have one right here. All right. So this is one of the coins. One side shows a cross. So whenever you toss a coin, you count the crosses. If you toss a magic coin 10 times, it will show a cross 8 or more times. That’s magic. All right? Let’s do it together. Help me find the magic coins, please. Everybody, please start tossing your coins 10 times and count how many crosses you are getting. Toss your coins 10 times. Count the crosses. By the way, I have a bet with my wife. If I manage to do this without dropping it– I’m not getting anything, but. Ooh, ooh, almost. All right, keep tossing, everyone. All right, some folks are still tossing. Let’s give them another second. I, of course, have a magic coin. I told you before. So I had 8 crosses.
29:59 So now, could those folks with the magic coins, those are the coins which showed 8 or more crosses, please stand up. Only one, two, three, four, five– yeah, that sounds right. Okay, fantastic. Here is my dozen magic coins. Guys, you can keep them. You can sit down again. This is fantastic. You got yourself a magic coin. You are the winners. You’re the best. From now on, we should call you the coin grandmasters. There is nothing special about the stupid coins, we all know that. They are all the same coins. But take the perspective of the winners. All they knew is: if I got 8 crosses, I have a magic coin. Fantastic. So they believe it. By the way, it is actually statistically insignificant at the 5% level, but that’s a different topic. [laughter]
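The arithmetic behind the trick checks out quickly. With a fair coin, the chance of 8 or more crosses in 10 tosses follows directly from the binomial distribution, and in a room full of people, a handful of "magic coins" is exactly what you should expect (the audience size of 100 below is just my assumption for illustration):

```python
from math import comb

# Probability that a fair coin shows 8 or more crosses in 10 tosses:
# sum of binomial probabilities for k = 8, 9, 10 out of 2^10 outcomes.
p_magic = sum(comb(10, k) for k in range(8, 11)) / 2 ** 10
print(f"P(8+ crosses in 10 tosses) = {p_magic:.4f}")        # 0.0547

# Expected number of "magic coins" among 100 people tossing fair coins
print(f"expected winners among 100 people: {100 * p_magic:.1f}")  # 5.5
```

At about 5.5%, the result also sits just above the 5% threshold, which is why a single coin showing 8 crosses is, as the speaker notes, not even statistically significant at the 5% level.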
30:46 But all coins have been exactly the same, and still, some people won just by chance. So let’s now imagine that we are not tossing stupid coins, but are actually participating in a competition with fixed test sets. Just by chance, although we all have, let’s say, models of the same quality, some models may perform a little bit better on this particular hidden test set. And here’s the thing. Kaggle came up with the idea: well, we can’t allow 200 submissions from any one team, and that makes sense. But mathematically, it doesn’t make any difference. It doesn’t make any difference from a mathematical standpoint whether you have one team submitting 200 solutions, or 200 teams submitting one solution each. It’s the same. So due to random chance, some percentage of models will outperform the other ones, even if they are all just as good as each other. And this, of course, has an impact on who is crowned the winner. So if this is the distribution of the scores on the private leaderboard, most models will be somewhere around average performance, and the winner here is on the right. It all makes sense. However, we have just seen that some of the equal coins, or equal models, models of the same level of quality, could just by chance end up as the winner because they just did a little bit better on this particular hidden private leaderboard test set.
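This equivalence is easy to see in a toy simulation (my own sketch, not Kaggle's actual setup; all numbers are illustrative): 200 equally clueless "randomizer" models are scored on a public and a private split, and the public-leaderboard winner looks clearly better than chance on the public half while regressing back to chance on the private one.

```python
import random

random.seed(7)
n_rows = 1000
truth_pub = [random.randint(0, 1) for _ in range(n_rows)]   # public split labels
truth_priv = [random.randint(0, 1) for _ in range(n_rows)]  # private split labels

def accuracy(truth):
    """One 'randomizer' submission: pure guessing, true accuracy 0.5."""
    return sum(random.randint(0, 1) == t for t in truth) / len(truth)

# 200 teams with one submission each (or one team with 200 -- same math)
pub_scores = [accuracy(truth_pub) for _ in range(200)]
priv_scores = [accuracy(truth_priv) for _ in range(200)]

# Crown the winner by public-leaderboard rank, then check the private score
best = max(range(200), key=lambda i: pub_scores[i])
print(f"winner's public score:  {pub_scores[best]:.3f}")   # inflated by selection
print(f"winner's private score: {priv_scores[best]:.3f}")  # back near 0.500
```

Sorting by the public score and picking the top entry rewards nothing but selection noise: the winner's private score is, on average, just another draw from the same 50% distribution.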
32:07 So what is actually the best model? I mean, not the best one for that particular test set, but one which holds up, which works well on a variety of test sets. This model is probably more likely– oops, sorry, is further down the ranks. Not necessarily the one at the top. So if our goal was to actually find the best model, it looks like we’re not doing a good job, because all we found is a model which is good on a particular test set, but not a model which holds up in production, which works well on a variety of test sets. Not a model which is resilient against the changes of the world, because the world constantly does change. And that is what we really want from a model. We want a resilient model. So let me be very clear at this point that my point is not to criticize Kaggle or anyone who is organizing machine learning competitions. Kaggle and the smart people working there have contributed a tremendous amount of value to both industry and education. But it also needs to be pointed out that Kaggle does contribute to this idea of the wrong incentive. That everything is supposed to be about optimizing model performance. And data scientists, being people, just do whatever they’re incentivized to do.
33:29 Kaggle even has a ranking system where the highest rank is called grandmaster, okay? Similar to chess. And some of these grandmasters desperately want to be the number one. And this level of ambition even led to the first case of fraud in a Kaggle competition. This particular competition was about optimizing the online profiles of animals living in shelters to improve their chances of being adopted. But one team found a way to leak the label information from the test set into the model. It is very sad indeed that such brilliant people, including a highly respected Kaggle grandmaster, have gone to such lengths to defraud a welfare competition aimed at saving precious animal lives solely for their own financial gain. Yeah, it is. It is indeed very sad. And the financial gain wasn’t even that massive, because this contest from Malaysia was offering $10,000. Not a million or something like that. And the Kaggle grandmaster, who I will not name here today, but who is known, admitted himself that it was never about the money, but rather about the Kaggle points. The constant struggle to become number one in the rating had compromised his judgment.
34:48 This all blew my mind as it is, to be honest. But another interesting thing – it's a bit of a side note, actually – is how this cheating was actually done. How this grandmaster actually cheated. So remember this one here? They knew the attributes and the values for the test data set, obviously, because that's what they had to create the [inaudible] for. So all they did was some smart web scraping: go to the website Petfinder.com and figure out whether those animals had actually been adopted in the past. Once they had that information, they had the label. Now, obviously they couldn't just add the label to the data set, because that would have been easy to detect. So what did they do? You may guess it at this point: some smart feature engineering. In fact, they added multiple columns which together encoded the label information. And they did this in a format which was not readable to humans, so nobody was able to detect it easily. But a model that is 100% perfect would have been very, very suspicious, as we all know. So they also added some random noise. And here we go.
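To illustrate the mechanism described here – this is emphatically not the actual fraud code, just a hypothetical sketch of how a scraped label can be spread across several noisy-looking columns whose combination reconstructs it:

```python
import numpy as np

rng = np.random.default_rng(0)

def leak_label(labels, n_cols=4, noise=0.01):
    """Encode each label as n_cols random-looking columns whose SUM equals
    the label, plus a little noise so the resulting model is suspiciously
    good but not a perfect 100%. No single column correlates with the
    target, which makes the leak hard to spot by eye."""
    labels = np.asarray(labels, dtype=float)
    parts = rng.normal(size=(len(labels), n_cols - 1))  # pure-noise columns
    # The last column completes the sum to the label value.
    last = labels - parts.sum(axis=1) + rng.normal(scale=noise,
                                                   size=len(labels))
    return np.column_stack([parts, last])

y = np.array([0, 1, 1, 0, 1], dtype=float)  # scraped adoption labels
X_leaky = leak_label(y)
# Any linear model immediately discovers that the columns sum to the label:
print(np.round(X_leaky.sum(axis=1)))
```

Each individual column looks like Gaussian noise; only their sum carries the signal, which is exactly why this kind of leakage evades casual inspection.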
35:46 Look, I told you, feature engineering can make or break a model. But what is more shocking, we have seen that doing things right often makes your models look worse. So it begs the question: how often does this happen? How often are data scientists making mistakes and hiding them, or even committing fraud on purpose? Did your organization maybe suffer from this already? I don't know. But it certainly happened more than zero times at this point. And we should not be surprised, because this is exactly what happens if you combine a system of wrong incentives with a lack of control. And this is not a new problem. In fact, even scientific publications have been suffering from this for decades now. I highly recommend, by the way, reading one of those two papers. This is really fascinating. And here's what our very own Dr. Martin Schmitz, the planet guy back there, has to say about those publications. "This paper investigated if new cool deep learning algorithms are outperforming traditional methods. The results are shocking to me. In most cases, simple methods are more than competitive, and a lot of the reported results are not even reproducible. The craziest thing is the last chapter on SpectralCF. It seems that a favorable test split was chosen for better results. And those are scientific publications." But even if you do not commit any fraud, okay, those wrong incentives can still lead to massive problems, because chasing higher accuracies is often nothing but a waste of time. It is not the accuracy of a model that should matter in the first place; it's the business impact a model has.
37:36 So let's have a look at the results from another Auto Model run. If you sort those results according to the gains, we can see that in third position there's this deep learning model delivering 80,000 in gains over the baseline model. Okay? But the classification error of this model is actually quite high, almost 13%. In fact, this is the highest error rate out of this group of four models here. The model at the bottom of this group, Naive Bayes, is much better – it's actually the best out of this group at 11.5% – but it only offers 44,000 in gains. Wait a second. So it is the better model from a data science perspective, but it actually has less business impact. So if you would use the better model here, the more accurate model, the thing you have been chasing for such a long time, you would actually lose half of your gains. That doesn't make any sense. By the way, another interesting side note: at the bottom of this list there is even a model which would cost you money if you put it into production. Obviously, that would not be a good idea. But you can't tell something like that from the error rate alone.
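Here is a small sketch of why accuracy and business impact can disagree like this. The confusion-matrix counts and per-outcome values below are invented to mirror the pattern from the slide, not taken from the actual Auto Model run:

```python
# Monetary value of each prediction outcome (hypothetical figures):
# a true positive wins money, a false positive costs money, etc.
VALUE = {"TP": 50.0, "FP": -10.0, "TN": 0.0, "FN": -5.0}

def gains(tp, fp, tn, fn):
    """Total monetary gain implied by a confusion matrix."""
    return (tp * VALUE["TP"] + fp * VALUE["FP"]
            + tn * VALUE["TN"] + fn * VALUE["FN"])

def error_rate(tp, fp, tn, fn):
    """Plain classification error: fraction of wrong predictions."""
    return (fp + fn) / (tp + fp + tn + fn)

# "Deep learning" model: more errors overall, but catches far more positives.
dl = dict(tp=1800, fp=800, tn=6900, fn=500)
# "Better" model by error rate: fewer mistakes, but misses many positives.
nb = dict(tp=1100, fp=450, tn=7750, fn=700)

print(error_rate(**dl), gains(**dl))  # higher error, higher gains
print(error_rate(**nb), gains(**nb))  # lower error, lower gains
```

Because the outcomes have asymmetric values, the model that makes more mistakes can still earn more money – which is exactly why the error rate alone cannot tell you which model to deploy, or whether a model would even lose money in production.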
38:38 So here we have it. We waste a lot of time on modeling, and we are doing it for the wrong reasons. And sometimes we hide mistakes or even commit fraud to make ourselves look better. What could make this even worse? This. Because here's the answer of another unnamed Kaggle grandmaster when he was challenged about Kaggle competitions. And the response pretty much said, "Well, it all does not really matter. Those models never go into production anyways." Jeez. [laughter] I mean, it is important to me, again, that I do not want to attack Kaggle grandmasters. However, I hope it became clear that we have a lot of issues here, and those issues are actually much bigger than just Kaggle.
39:22 So let's quickly summarize again. Our models are not getting operationalized. Some data scientists even fake their way through life because of the wrong incentives. And we often do not even notice that we have a problem because, well, even if we put a model into production, we often do not have any model management in place, and we have no culture of accountability. So what can we do about this? As so often, the solution is a combination of two different things: technology and culture. And we need both to make this work. Let's start with the easier one, which is technology.
39:56 We need to put the right technologies in place. First of all, a model which is not in production does not have any value, or at least not a lot of it. We mentioned that. So we need to stop wasting time chasing optimal models, and we need to start generating true business impact as early as possible. And then we can iterate from there. It's truly a more agile approach to data science, to be honest. However, just putting a model into production is not good enough. We also need thorough monitoring of those models. I said it before: our current situation is the result of a dangerous combination, wrong incentives mixed with a lack of control. Model management can provide us some insight into the health of our models, and it can deliver at least some level of control. It's also the main tool to avoid what we call the spiral of disillusionment. What is that? Without model ops, you have no easy way to detect that there are problems, or that your model is no longer performing well, and it will be very hard to sustain the impact. That obviously leads to bad experiences. Sometimes even massive losses. And that, as a result, will lead to less buy-in from the organization. So you will end up building fewer models, because why would the organization keep investing in something which was a failure the last time? And this is where the whole thing becomes a spiral: if you have fewer models, you have less and less reason to invest in model management in the first place, and things go downwards from there. Just ask Jack Dawson.
41:26 And this is actually not that hard to achieve. Good model management, good model monitoring, really only needs four different components. First, you need to see the performance over time. Is your model just accurate, or is it also resilient? If you see the model performance over time, it's easy to detect whether you have a resilient model, one which keeps up, or just a one-hit wonder. Second, drift detection: you also want to see if there are changes in the world. Concept drift is the major reason why model performance deteriorates. Third, you should use a champion-versus-challenger approach. I highly recommend always starting with the resilient model first. Linear models or other simpler models come to mind, because they tend to overfit less to a particular test set or a specific moment in time. And then, after you have this model in place, you can add more complex models as challengers. You can keep tuning them and prove out over time that they are truly better, that they are truly resilient. And then, with one click, you can switch them over to become the new active model. And last but not least, business impact over time. We have seen before that some models can even have a negative impact. And even a good model can turn negative because of concept drift if it is not resilient enough. So you need to monitor this. And obviously, if the model starts costing you money, you really need to retrain it or stop using it. More than that is always nice, but those four things are a super helpful minimum to avoid most of the problems we have discussed today.
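As one concrete example of the drift-detection component, here is a minimal sketch of the Population Stability Index (PSI), a common way to compare a feature's distribution in production against its training baseline. The data is simulated, and the 0.25 threshold is a common rule of thumb rather than a universal constant:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and a live
    sample of one feature. Small values mean the distributions match;
    values above roughly 0.25 are commonly treated as significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Open the outermost bins so live values outside the baseline
    # range are still counted.
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) for empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(7)
baseline = rng.normal(0.0, 1.0, 5000)   # training-time distribution
stable   = rng.normal(0.0, 1.0, 5000)   # the world has not changed
drifted  = rng.normal(0.8, 1.5, 5000)   # the world has changed

print(psi(baseline, stable))   # small -> no action needed
print(psi(baseline, drifted))  # large -> investigate, retrain the champion
```

Running a check like this on every scored feature, on a schedule, is what turns "the world constantly does change" from a slogan into an alert you actually receive.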
43:07 Which brings me to the second aspect, which is the culture, because having those technologies in place can help you build this culture of accountability. But that's not enough. We also need to make some additional changes as data scientists. And this brings me to what I call the data science manifesto. Ooh, that's a big thing. I call it RADIUS, by the way. I thought that's kind of fitting for a data scientist. Look at this. Beautiful. Anyway, I think these principles can really help us become better data scientists. So let's go through them. First: resilience over accuracy. Do not build models which work well for a short time; build models which stand the test of time. Accountability: own your models. Take responsibility for accidental mistakes and be aware that your actions will impact your organization. Deploy early, monitor, and iterate from there: only deployments show true success. Monitor those models and use challengers for additional improvements. Impact over optimization: focus on the business impact, not on the data science challenge. I know this is hard, but always keep the ROI in mind. Upright and ethical: data science should be about finding the truth. Your job is not to look good; your job is to have a true impact. And finally, simplicity first, no waste. The simplest possible model is often the best model. It is easier to understand, anyway. So be thoughtful about your resources.
44:50 So there you have it. A data science manifesto called RADIUS, and I hope this will help us all to become better data scientists and avoid some of the traps we have discussed here today. I’m convinced that if we all follow these basic rules and principles, we will be able to do data science and machine learning with impact, and we will not only predict but even positively shape our future. This is artificial intelligence for winners. Not just with board games, but in the real world. In contrast, deepfake is, of course, for losers. But I have to admit, it is also a lot of fun. I mean, look at me with a beard. [laughter] That stuff’s hot, right? Right? Wait, what’s happening now? Wait, I don’t like this beard. Wait, okay, that goes way too far. Right, stop it here. All right. Enjoy RapidMiner Wisdom, everybody. Thank you. [applause] [music]