Everyone talks about how machine learning will transform business forever and generate massive outcomes. However, it’s surprisingly simple to draw completely wrong conclusions from statistical models, and “correlation does not imply causation” is just the tip of the iceberg.
The trend of the democratization of data science further increases the risk for applying models in a wrong way. In this webinar, Founder and President Dr. Ingo Mierswa discusses:
- How highly-correlated features can overshadow the patterns your machine learning model is supposed to find – this leads to models which will perform worse in production than during model building
- How incorrect cross-validation leads to over-optimistic estimations of your model accuracy, and especially the impact of data pre-processing on the accuracy of machine learning models
- How feature engineering can lift simple models like linear regression to the accuracy of deep learning – but comes with the advantages of understandability
Hello, everyone, and thank you for joining us for today’s webinar, How to Ruin Your Business with Data Science and Machine Learning. I’m Hayley Matusow with RapidMiner, and I’ll be your moderator for today’s session. We’re joined today by our very own founder and president of RapidMiner, Dr. Ingo Mierswa. Ingo is an industry veteran data scientist who started to develop RapidMiner at the artificial intelligence division of TU Dortmund University in Germany. He has authored numerous award-winning publications about predictive analytics and big data, and takes pride in bringing the world’s best team to RapidMiner. Ingo will get started in just a few minutes, but first, a few quick housekeeping items for those on the line. Today’s webinar is being recorded and you will receive a link to the on-demand version via email within one to two business days. You’re free to share that link with colleagues who are not able to attend today’s live session. Second, if you have any trouble with audio or video today, your best bet is to try logging out and logging back in, which should resolve the issue in most cases. Finally, we’ll have a question and answer session at the end of today’s presentation. Please feel free to ask questions at any time via the questions panel on the right-hand side of your screen. We’ll leave time at the end to get to everyone’s questions. I’ll now go ahead and pass it over to Ingo.
Hey. Thanks, Hayley. Yeah. Welcome everybody, also from my side. I’m really excited to do this presentation today for you, and I hope that we can all learn a lot of things, but also have a little bit of fun. Of course, that’s hard in a webinar because I can’t see your faces, but I will do my best to bring across some of the interesting and important topics we are talking about today. So as Hayley has said, this is about how to ruin your business with data science. And of course, I am not expecting 2,000 people to register for this webinar just to learn how to really ruin their business. So of course the goal should be how to not ruin your business. So what can we actually do to avoid some of the typical mistakes? And this is very important. One of the reasons why this is so important is because there is so much at stake. So let me start actually on the positive side before we go to the more negative side of data science and machine learning, and remind ourselves what is actually the opportunity we have with machine learning. Let’s have a look at Amazon, for example. And I mean, we all know Amazon, obviously. Amazon, last year, had a total revenue of 136 billion US dollars. And many insiders, especially in the machine learning space, believe that between 15 and 20 percent of this revenue is actually a direct result of all their recommender systems and their up- and cross-selling efforts.
So we all know about this. They figure out, “Hey, what are other items you might be interested to buy from Amazon?” That is a pretty impressive number. I mean, up to 20% out of 136 billion, I’d take that, of course. So this is one of the biggest success stories we know about and where we actually can see some of the numbers and results which are connected directly to machine learning. Maybe an even bigger success story, before we get to the not so great stories, is actually Google AdWords. I claim that practically 100% of the more than 60 billion of revenue Google AdWords made last year is a direct result of machine learning. Why? Because Google really cracked the code about how to match whatever you’re looking for, whatever you’re searching for on Google– so basically the interest you are expressing– with a matching advertisement, and make it more likely that you actually click on this advertisement, and eventually, also buy the products which the advertisement was done for.
So 60 billion, practically 100% out of a good matching algorithm, if you like, and this matching algorithm, again, is done mainly through machine learning. So those are really impressive results. On the other hand– okay. On the other hand, we have a company like Tesco. Not all of you might be familiar with Tesco. It’s I think the largest retailer in the UK. And a couple of years ago, they also tried to enter the US market, which as we’ll see in a second, actually failed. Tesco started great in their efforts around predictive analytics and big data. So already in the mid of the ’90s, they started to invest into new data structures and do everything to support their customer loyalty programs. So you see a couple of those important points in time like those orange dots, like how much data they used. And more important, on this line, you see how much they’ve been able to increase their profits as a result of machine learning. So they made more promotions. They made better promotions. They actually made better advertisement, more personalized advertisement, and that actually was leading to a pretty significant growth for Tesco. So basically, up to a 7– they were almost a 7 X growth on the profits in 10, 15 years, which for a retailer of that size is really hard to achieve.
On the other hand, that started all very well, but then a couple of years ago, it ended pretty shockingly. So what happened was that Tesco really had kind of like a downfall, and insiders connect this downfall with basically two big things. One was that they needed to write off an astounding 416 million in profits, which actually led to the resignation of their chairman and everything else– and yes, there have been a couple of shady accounting practices, and we’re not going to talk about that. So it’s not just the big data investment which backfired, but that happened as well. So a good chunk of this amount was also– but all of a sudden, this nice growth curve you have seen on the previous slide actually was no longer working like that. It was actually falling down again. And what has happened is that the customer sentiment quickly turned against Tesco, mainly because the customers felt that they were sharing more and more of their data, but they were not really getting anything of value back from Tesco, and the only winner in this game was Tesco. So it’s not necessarily just wrong models or predictive models which have been created, it was also just how Tesco has used those models. And that really didn’t work out very well, and this whole thing turned against them. They also failed entering the US market, and really, it all went down pretty much at the same time. So it’s not just the machine learning problems they had here. So that is interesting.
We saw Amazon and Google AdWords. Very positive examples where it’s publicly known like, “Okay. What was the impact of machine learning?” We have a couple of other examples. We have one we already saw. We will discuss another one later where things don’t go that well. The big question is what is the difference? What makes the difference between a great success, something which really works very well, where machine learning is creating lots of value, and cases where everything goes down and nothing really works very well? Well, in order to answer this question, it’s pretty obvious that we have to analyze some aliens. And before you now stop watching this webinar, bear with me. It will all make sense to you in a second. Okay. Let’s analyze aliens. What can UFO sightings tell us about extraterrestrials? So this is a database which I really recommend to everybody listening in today to check out yourself. There’s a link at the bottom for the original work, but you can also check out the whole NUFORC database, which is the National UFO Reporting Center. This is unstructured data mainly. So people describe their UFO sightings, but it also comes with some structured information like the date and the time. By the way, if you’re interested in text analytics, this is really an awesome data set to work with. There’s a lot of interesting patterns you can find in those UFO sightings.
But today, we do actually something much simpler. Let’s have a look at this chart here. So please, all look at this chart. And what this chart is showing to us is the average– not the average. The total number of UFO sightings per year since 1963. So you see this blue line and it was kind of flat for all the time, but then something happened in 1993, and all of a sudden, the number of UFO sightings has been dramatically increased per year. I know this is a little bit harder to try if that actually works. You guys can actually ask questions and just go to webinar panel. So think about this for 20 seconds. And if you know the answer, why did the number of UFO sightings all of a sudden increase that much, please type the answer as a question to the panel, and I’ll check this here and see if I see one of the answers. So nobody yet or at least I can’t see it. Let’s make this a little bit bigger here. Oh, there’s very good points here. All right. So yeah. Here’s somebody who has it right. So many people actually said, “Hey, was there some movie?” And another person said, “Maybe the FBI in on something.” And then I’ve found actually one person who said, “What year did X-Files start airing?” And that’s spot on.
In fact, it was in September 1993 that the first episode of The X-Files aired. And this is a cult TV series. I mean, the first episode was seen by more than 5 million people in the United States. At its peak, more than 25 million people watched the series. That was back then more than 10% of the total US population. It was really that popular. And I think, actually, in those 25 million, they are not counting the real population watching this. They only count the human population. But think about the topics of the X-Files. Of course, extraterrestrials are a natural extension of the human viewership. So Nielsen and all the others, they got it wrong because they only counted humans. They also should have been counting aliens who all came to earth to watch the X-Files together with us. So clear to me, aliens are fans of the X-Files. I mean, it’s in the data. You all can see this. Yeah. X-Files starts airing, and all of a sudden, the numbers go up because aliens are very social beings. And that brings me to the next topic. If you look at this chart here, you will see basically per day of the week, Monday through Sunday, and for all the different times, the number of UFO sightings by hour and day. And orange or yellow means there’s more UFO sightings, and blue means there’s less. And as you can clearly see, most UFO sightings happen on Saturday nights. And I think it cannot be a coincidence that that is also the peak time for fraternity parties. I mean, think about this. We saw already on the previous slide that aliens are very social beings. And now, for aliens who come from galaxies far, far away, from planets flying around their own suns, there is no reason that there is a seven-day-per-week schedule. There is no reason that there is a 24-hour-per-day schedule. But those aliens are such social beings, they fly to earth, they quickly adapt to our own schedules of working and partying.
And it’s pretty clear that they do the same thing as we do because they are such social beings. They work very hard, and they party very hard together with us. That’s why we see them.
And on this topic of being social beings, my fellow German friend, Roland Emmerich, who is the director of the movie Independence Day, got it completely wrong. Because here is the average number of reported UFO sightings per week since 2010. And we can clearly see that in the week of 4th of July there’s more UFO sightings than in any other week. So he got it wrong because we know already those are social beings. Those aliens, they’re not coming down to earth to destroy the White House or do other things. No way. They are social beings. They come to earth. They come to America on 4th of July because aliens love America and they do love fireworks. It’s completely obvious to all of us. It’s in the data. Is it though? I mean, this is exactly the kind of thing where you might think now, “Come on, Ingo. That’s ridiculous. I mean, this is so clear that there are no alien sightings. These are probably fireworks, and this is totally clear that, whatever, those are drunk people doing fraternity parties who see some light and think, ‘This is a UFO.’” But actually, you can’t be that sure about this by just looking at the data. You would look into the data and you would believe they love the X-Files. And that, actually, still is one of the biggest problems around machine learning and data science. And we will see this in a second in recommender systems as well.
But before we get there, let’s actually have a look at the three big problems. The first one is what we just have discussed. Still, the confusion between correlation and causation is one of the biggest problems in machine learning. People see something, they model something, or they model things in the wrong way. And we’ll see how dangerous this actually can be in a second. And that actually leads to models which won’t perform that well in practice later on. And on that topic, there’s another big problem. Even if you do model your data in the right way but you validate your model in the wrong way, you still won’t know how well your model actually works. And that is a problem which is much harder to grasp, but it’s actually leading to really big, big losses. And we’ll discuss this as well. And finally, and this is something I was guilty of myself for many, many years as well, people focus way too much on models and on using the latest and most fancy hype model types, especially when they are coming fresh from college and have learned about all those very difficult models, instead of focusing more on the data and especially the data preparation. And I will show a couple of very simple examples later on to explain what exactly I mean with this. And we will discuss also how to do better in all three cases.
All right. Let’s start with number one, the confusion between correlation and causation. So we saw already, actually, the alien example which was exactly that. But before we actually do the mistake very likely together, in RapidMiner in a second, I would like to give you another business case. In this case, it’s coming from a government example because it’s really hard to find any business sharing how they wasted millions of dollars by doing this mistake. But here, we actually have a good example. Some of you might have heard about this. In Illinois, I think it was like in 2002, 2003, there was some research, some study done by the University of Chicago if I remember correctly, and this study showed that if there are more books in home– so the more books are available at a home with children, those children are actually getting better test marks. So they do better during the exams. And that, actually, if you think about this, makes sense. I mean, children who have access to books, they read more. They learn more. They will do better in school. They’re used to reading. They’re maybe even better in taking information in to their brain coming from those books. So of course, yeah, they have better learning, so they will actually probably do better exams. And that is something we can easily understand.
This was also something the governor of Illinois in 2004, a gentleman called Rod Blagojevich who actually later on ended up in some scandal– but that’s not the topic for today at all. Here, I actually almost have to applaud him here because he had a long-term vision and a great plan. He thought, “Well, if people have more books actually learn better and those children actually do better in the exams, they later on will do better in school then they’ll get better jobs, make more money, pay more taxes. So it will pay back for the state of Illinois.” So he came up with a plan, and his team I suppose. He came up with a plan, “Let’s send a book every month to every child in Illinois from the time they were born until they entered kindergarten.” This plan would actually cost $26 million per year. But hey, again, there might be a big payback down the road in form of higher taxes. So again, I applaud him for this long-term plan. Unfortunately, I can’t applaud him for being a good data scientist. Why is that? Because there was a follow up study which actually has shown that it didn’t really matter if those students ever read the books or not. Just the mere fact that there have been books in this home led to better test marks independent of the fact if the children read the books or not. And that, all of a sudden, destroys our whole hypothesis here.
And if you now think about that, again, knowing that, you will come up with a new one which probably is around, “Hey, it’s actually not about reading the books. It’s about the parents and the question if they create an environment for those children where knowledge and learning is maybe valued a little bit higher.” So if the parents invest a lot into books and read a lot, it’s very likely that also the rest of the environment for those children actually will invest more into their education. And that is the reason why their test marks are better. It has nothing to do with reading the books in the first place. It has to do with the environment that the parents create. And of course that environment won’t drastically change by just sending them some books. So luckily enough, this plan was never implemented. So the state of Illinois didn’t waste this $26 million per year, but it was close. And it shows also how easy it is to take this correlation, draw the wrong conclusion, and actually make a very costly mistake.
All righty. Let’s do a demo here together. So I’m firing up RapidMiner now, and I would like to model something together with you and that’s the good old Titanic accident. Well, keep in mind that there is something around correlations and causation here which might be a problem. So you should really see RapidMiner now. For those of you who know RapidMiner already, the next 30 seconds is probably not necessary. But for all the others, the whole idea is that you build analytical workflows like the one we see here in the center, consisting of data sets you can take from your repository here on the left and operators which are here on the bottom left. And every block here is an operator doing something. So here, for example, you load some data, and here we do a cross-validation, and if I go inside of this cross-validation, I can see that I actually train a gradient boosted trees model here on the training data, and then I apply this model and calculate the performance on the testing data, and the cross-validation is looping over all of the, in this case, 10 different test sets for me and does the average. So that’s a pretty simple process.
Okay. So let’s have a look into the data sets. This is the Titanic data some of you might know already. We have all the passengers. 1,309 passengers in total of the Titanic in 1912. And we have the names of the passengers and the most important column is this green column here which tells us if the passengers survived, yes or no. And then we have a couple of other columns here, for example, the passenger class. And I would ask you now who did watch the movie, whatever, 15 years ago or 20 years almost. Who would have watched the movie and what you believe has really an influence on survival. And most people would say like, “Oh, yeah. Sure. The passenger class clearly has an influence on survival. I mean, people in first class look like they have a higher likelihood of surviving the accident.” We have other information like the gender, the age. We know for every single passenger, for example, here for Mr. Ellison here. He didn’t survive. And we know all the information here. So the goal is to build a model based on this information to predict if a passenger is surviving yes or no. Well, that’s exactly what we do here. And let’s run the rest of the process which is this cross-validation here with the gradient boosted trees.
And as we can see here, we get a pretty good model, and the accuracy of correct predictions is 97%. That is really good. That is actually really good. I would go that far and say that is way too good. I mean, can we be that sure, based on that data, that we can predict almost perfectly who’s going to survive the Titanic accident? Wait a second. That’s suspicious. And by the way, in general, whenever you get a result that is a little bit too good, that should be suspicious. So I can ask again. If you like, we can actually put this into the question panel here again. What do you think– what was the mistake we made in this very simple process? So you can write into the questions again if you have an idea. What was the mistake we made? Test/train data, split data, overfit the data. So far, all three. This one gets closer. I’ll give you five more seconds. The model’s overfitting the data. Nope. That is all not the reason. Aha. We can stop here. We have a correct answer here. Somebody asked did you leave in the lifeboat information? And let’s have a look at the data set again. Yes. We left it in. And as you can clearly see, if you didn’t make it onto any lifeboat, well, you didn’t really survive. Okay? But if you made it onto a lifeboat, well, yeah. Guess what? You did survive. So we left in this information. And well, again, you might think, “Well, that’s easy to see.” But think again. If you have 200 columns, and it’s not that clear for you what they actually all mean, it’s actually hard to see these kinds of things.
Why is this a problem? The problem is that the lifeboat information and the fact if you survive or not are highly correlated. So I can actually calculate the correlation. So I basically load the data again and just handle the survive column as any other column here and build the correlation matrix. And if I do this, I can see, well of course, every attribute is perfectly correlated with itself. I mean, whatever, passenger class is perfectly correlated with passenger class. No surprise. You can see also that the passenger fare is highly correlated to the passenger class, as it should be, but not perfectly correlated. But what is almost perfect is, in fact, the correlation between the question if you make it onto the lifeboats and the question if you survive. This kind of high correlation should always be thought about. I’m not saying this is always a problem. Sometimes it is highly correlated and it is okay, and it just makes your model building easier. So then you have good luck. But in many other cases, that actually is a problem. This is the case here. Why? Because if you know already if you made it onto the lifeboat or not, there is no longer any value in predicting this from this point in time on. The only value would be when you actually can still do something about the survival. And that would be in 1912, at least, when you boarded the ship. In this moment, let’s imagine you board the Titanic and somebody tells you, “Oh Mr. Mierswa, I see you have a third class ticket. Just in case we are hitting an iceberg, you’re going to die.” I might have second thoughts. I might actually decide to not board the ship. Because today, sure, I can try to take out my cellphone and call for a rescue helicopter, but back then, I was pretty much screwed.
So if I know already if I made it on this lifeboat or not, it’s actually too late to really do anything else. So that’s why, actually, this high correlation is a problem. And what we need to do is we need to take this column out. So now I just remove this whole column. I run the whole cross-validation with the gradient boosted trees again, and as a result, aha, now we drop down to 80% accuracy from our 97%. And this is now an important thing. We are so trained as data scientists to always get to better and better and better models that we sometimes forget what is actually the business scenario around this and if those good models, which are actually using information they shouldn’t use, are maybe just good because we kind of cheated and we should not do that. So in this case, it is actually better to accept the fact that we can only be 80% correct, but at least we know we will only be 80% correct in any realistic scenario instead of actually optimizing for the best model possible which, unfortunately, we are pretty much trained in doing and ending up with the best prediction.
Okay. So coming back, what can we do to avoid those problems? First of all, always check for correlations before you start modeling and understand what they mean. And this is also the reason why I believe in automation for data science and machine learning. We can automate the parameter optimization, the feature selection. You can run through multiple models and I can try out lots of them. And you can do all of those things with RapidMiner as well. What I don’t believe though, is in magic wands because this is exactly the kind of thing where a human being should be in the loop. If ever anybody says to me, “Well, upload your data, click this red button here, and magically it will create the best model for you.” Well, the best model would be keep this column in. Because by taking it out, unfortunately, we will actually get worse. That’s why I want the human brain in the loop here because only we can really understand where this model is going to be used. So sometimes, as I said, it’s good. You can keep it in. But sometimes you need to remove it because the business circumstances require you to do so. And really important is the last point here. So if it’s highly correlated, often that means it also is only available pretty much at the point of time that the label information would also be available to you. Think about if that’s actually a point in time that you can do something. And if that is too late anyway, well then take it out please. Even if it means your model will be worse in that case.
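The correlation check described in the demo can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the passenger table is synthetic (it is not the real Titanic data), the survival rates per class are made up, and the names `make_toy_passengers`, `suspicious_columns`, and the 0.9 threshold are hypothetical choices, not something from the webinar or from RapidMiner.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def make_toy_passengers(n=500, seed=0):
    """Synthetic stand-in for the passenger table (NOT the real data)."""
    rng = random.Random(seed)
    survival_rate = {1: 0.60, 2: 0.40, 3: 0.25}  # made-up rates per class
    data = {"passenger_class": [], "age": [], "lifeboat": [], "survived": []}
    for _ in range(n):
        pclass = rng.choice([1, 2, 3])
        survived = 1 if rng.random() < survival_rate[pclass] else 0
        # "lifeboat" is only known after the accident, so it nearly copies the label.
        lifeboat = survived if rng.random() < 0.98 else 1 - survived
        data["passenger_class"].append(pclass)
        data["age"].append(rng.uniform(1, 80))
        data["lifeboat"].append(lifeboat)
        data["survived"].append(survived)
    return data


def suspicious_columns(data, label, threshold=0.9):
    """Columns whose absolute correlation with the label exceeds the threshold."""
    flagged = {}
    for name, values in data.items():
        if name == label:
            continue
        r = pearson(values, data[label])
        if abs(r) > threshold:
            flagged[name] = round(r, 3)
    return flagged


data = make_toy_passengers()
flagged = suspicious_columns(data, label="survived")
print(flagged)  # columns a human should review before modeling
```

The sketch only flags candidates. Whether a flagged column has to be removed still depends on when that information becomes available in the business process, which is the judgment call the talk leaves to a human.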
Okay. So that was the first big topic around correlation and causation. And you see it’s actually not that difficult to fix it, but you need to accept that your models might be worse afterwards. The next big topic is around wrong model validation. This is a topic I discuss a lot because I know it’s a little bit more difficult. I know it’s a little more technical, but it’s so easy to do it wrong. And unfortunately, most products also force you to do it wrong. But it doesn’t have to be that way. Let’s go back actually into RapidMiner, and let’s have a quick look. So the first quick thing I will take the time just very quickly is please always forget training errors. Never ever even calculate them. There is no good argument. And I know this sounds harsh, and I know this might cause some discussion if you only could discuss now because people believe, “Well it’ll help if I compare the training error to the testing error. I might detect overfitting.” Nope. You can also detect overfitting by actually seeing if the curve of the testing error gets worse again. There is really no good reason to ever pay attention to the training error. And here’s a quick example to show you why that actually is the case. Here, I generate a random data set. Purely random.
Let’s have a quick look here. It’s two columns. Complete random data, and then I also applied a random label to those two columns there. Sorted here by negative and positive, but it doesn’t really matter. If I plot those two columns like I’m doing now, you’ll see, well, everything is random. I mean, the point locations are random and the coloring is also random. How could any predictive model be anything better than 50%? I mean, they’re equally distributed. The error should be 50%. Now I’m training a model on this data set, and then I apply the model on the same data set and compare the performance and guess what? If I do this, look at that. This model is 100% perfect. It has 0% classification error. How is this even possible? This was random data. You’re not supposed to be able to learn anything from random data. Well it’s easy. It’s one of the most simple models around, k-nearest neighbors, and I just set the value of k to one, and what’s happening literally is– well, if I go back to the chart, let’s have a look at this data point here or let’s say this here on the left. This red one here. I hope you can see it. I just put a small rectangle around this. Well, let’s find the nearest neighbor to this data point. Oh, wait a second. That is the data point itself. It’s red. Brilliant. It’s positive. I’m done. It’s correct. And that is the reason why we always end up with 0%. So you might think, “Well, yeah. Then training error is not a good idea for KNN with K equals one,” and I would agree. But you actually have the same problem for all other learners as well. And let’s just imagine you would actually optimize for the training error for whatever reason, you would actually optimize the value K in an automated parameter optimization. Well, you will always get the perfect model. So my big tip here is let’s forget about training error. Always use an independent, different test set. Ignore training errors.
And if you do that like I’m doing this process here, your classification error will be close to the 50% you also expected. And that is important to know.
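The random-data experiment can be reproduced with a minimal sketch like the following. The assumptions here: 400 synthetic points, a simple 300/100 hold-out split instead of a full cross-validation, and a hand-written 1-nearest-neighbor rule. The helper names `nearest_label` and `error_rate` are hypothetical and do not correspond to RapidMiner operators.

```python
import random


def nearest_label(point, train_points, train_labels):
    """Predict with 1-nearest-neighbor using squared Euclidean distance."""
    best = min(
        range(len(train_points)),
        key=lambda i: (train_points[i][0] - point[0]) ** 2
        + (train_points[i][1] - point[1]) ** 2,
    )
    return train_labels[best]


def error_rate(points, labels, train_points, train_labels):
    """Fraction of points whose predicted label differs from the true label."""
    wrong = sum(
        nearest_label(p, train_points, train_labels) != y
        for p, y in zip(points, labels)
    )
    return wrong / len(points)


rng = random.Random(42)
# Two random columns and a random label: there is nothing to learn here.
points = [(rng.random(), rng.random()) for _ in range(400)]
labels = [rng.choice(["positive", "negative"]) for _ in range(400)]

train_p, train_y = points[:300], labels[:300]
test_p, test_y = points[300:], labels[300:]

# Scoring on the training rows: each point's nearest neighbor is itself.
training_error = error_rate(train_p, train_y, train_p, train_y)
# Scoring on rows the model has never seen.
test_error = error_rate(test_p, test_y, train_p, train_y)

print("training error:", training_error)
print("test error:", test_error)
```

The training error comes out as zero because every training point finds itself at distance zero, while the error on the held-out rows should land somewhere near one half, which is all that random labels allow.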
But while we’re talking about what is really problematic around validation is the following. So here, I take a different data set. It’s the Sonar data set. Let me quickly show it to you in case you don’t know it. It’s 60 attributes, all numeric, and there’s this one class attribute here and there’s two possible values, rock and mine. So our goal is basically to predict out of all those frequency values here from this frequency spectrum on the right– to basically predict if it’s a rock or mine. If you look to its chart, that almost looks like random. Even if you look actually into this parallel plots here, it looks like a hard task. I mean, every line is now representing a row in our data set, and in fact, it seems to be almost chaotic. It seems to be hard to differentiate. And even with a different type of visualization, you see already that there are certain areas like around attribute 11 here, for example, where things differ a little bit more or here on the right, but it’s still going to be a tough task because the rows really overlap a lot.
Okay. So let’s now say inside of this cross-validation, I’m actually going to use– so I’m doing a proper cross-validation. I’m using a KNN learner again with K equals one. But here, it’s not a problem because I do a proper cross-validation. So I calculate only test errors. So let’s add some normalization of the data. Why? Because KNN uses a similarity measure, so we need to normalize the data so that all the columns, all the attributes are roughly in the same range. Otherwise, some of the columns might dominate some of the other columns. And while there is already this little hint here, well, let’s just add the normalization here. And that, in fact, is what typically– actually, all products will force you to do. Because you can’t really go inside of the cross-validation. And even if you program your data science model with R or Python, it’s much easier to normalize the data before you actually do the model building and the cross-validation. So it’s typically a hard task to do it in a different way. But let’s have a look what happens now. I normalize the data. I run my cross-validation and just let’s remember this number which is 86.6. Let’s just remember 86.6, okay, as an accuracy. But again, we made a mistake. And I’m going to ask again so you can actually think with me what might this mistake be. Any idea?
I’m looking at some of the questions here. Normalization outside of the cross-validation. That is exactly the problem. I mean, I have to be honest here. That’s also kind of the only thing I did. Of course, you had to recognize this. I did the normalization outside of the cross-validation. And that means I’m now leaking information about the test sets into the training sets. Why? Because if I use the full data set inside of the normalization, I look at all the rows at the same time, but I’m not supposed to know anything about the rows I’m going to use my model on and use for scoring. So you have to do the normalization actually inside of the cross-validation. So you see there’s nothing in between if I now go inside. Before I build the model, I put the normalization before. I then deliver the parameters I use for this normalization into the testing side here of the cross-validation. I apply the same normalization on the test set, then I apply the predictive model I created on the training set, and then I actually calculate the performance. And if I do this now– you might have remembered 86.6%. But if I now do the normalization inside of the cross-validation, I see that because of this leakage of information, my model was actually evaluated a little bit too optimistically. In practice, it wouldn’t be that good. It would only be 85.6% good. So 1% less. Now, you might argue what is the big deal? Who cares? 1%. That doesn’t move the needle. Well, maybe not. Maybe this 1% is not the problem. But those mistakes, they stack up. You start with the normalization, then you do the same thing for parameter optimization, which most people know they shouldn’t. And they have some external validation set, etc. But still, it’s easy to do it. Feature selection. Everything, basically, which uses all the data, every pre-processing step which uses the knowledge about the complete data set should not be outside of the cross-validation. If you do this, those mistakes stack up.
And actually, it’s only 1% now, but then after they stack up, all of a sudden, your model looks like 10% better. And that is a problem.
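For readers working in Python, the difference between normalizing outside and inside the cross-validation can be sketched roughly as follows. This is a minimal sketch, not the RapidMiner process from the demo: it assumes scikit-learn, uses `StandardScaler` as the normalization, and replaces the Sonar data with a synthetic stand-in of the same shape (the 208-row size and all generator settings are assumptions), so the printed accuracies will not match the 86.6% and 85.6% quoted in the talk.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Sonar data (assumption): 60 numeric attributes, 2 classes.
X, y = make_classification(
    n_samples=208, n_features=60, n_informative=10, random_state=0
)

# Leaky variant: the scaler sees every row, including rows later used for testing.
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=1), X_scaled_all, y, cv=10
)

# Proper variant: the scaler is part of the model, so it is refit on each
# training fold and only applied (not fit) on the matching test fold.
proper_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
proper_scores = cross_val_score(proper_model, X, y, cv=10)

print(f"normalization outside CV: {leaky_scores.mean():.3f}")
print(f"normalization inside CV:  {proper_scores.mean():.3f}")
```

The design point is that the pipeline, not the bare learner, is what gets cross-validated; any other row-spanning step (feature selection, parameter tuning) would be added to the same pipeline for the same reason.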
Why? Because this, again, can actually ruin you. Let’s have a look at this quick example here. Let’s imagine a company which is losing 200 million per year due to customer churn. I mean, every company loses customers. That is just normal. So 200 million in revenue is lost every year to customer churn. That means you need to replace it somehow. Or you do something, some activity to keep those customers, which typically is cheaper than acquiring new customers to replace them. So that’s why customer churn prediction is such an important topic for machine learning. So a machine learning model has been created, and unfortunately, using this improper validation, it has been shown to reduce this churn by 20%, meaning 40 million. But of course, you’re not getting those 40 million for free by just knowing who is going to churn. You also need to do something with them. Maybe give them some incentive to stay. Maybe a discount, or invest more into service to actually make them happier as a client. So let’s say you figured out what you need to do for those clients in order to save those 40 million in revenues. But those measures would actually cost you 20 million. Well, you pay 20 million but you keep 40 million, so that’s still a very good investment, okay? So sure, somebody made the call. All right. Let’s move forward here. Unfortunately, afterwards, after you are in production, because that is, unfortunately, the first time you will realize that you validated your model incorrectly, that it’s actually overoptimistic, and that your model won’t perform as well as you thought it would– afterwards, you realize, oops. It’s not a 20% reduction. It’s only 5%, so we only have 10 million in revenue savings. Nobody pays 20 million to save 10 million, so the company just lost 10 million. Well, that’s not really good. And I guess there’s a data scientist looking for a new job.
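The arithmetic of this example can be written down in a few lines. The figures are the ones from the talk; the function and variable names are only illustrative.

```python
def net_outcome(annual_churn_loss: int, churn_reduction_pct: int, campaign_cost: int) -> int:
    """Revenue saved by the retention campaign minus what the campaign costs."""
    revenue_saved = annual_churn_loss * churn_reduction_pct // 100
    return revenue_saved - campaign_cost

annual_churn_loss = 200_000_000  # revenue lost to churn per year
campaign_cost = 20_000_000       # cost of the retention measures

# What the over-optimistic validation promised: a 20% churn reduction.
expected = net_outcome(annual_churn_loss, 20, campaign_cost)
# What production delivered: only a 5% reduction.
actual = net_outcome(annual_churn_loss, 5, campaign_cost)

print(expected)  # 20000000, i.e. 40 million saved minus 20 million spent
print(actual)    # -10000000, i.e. 10 million saved minus 20 million spent
```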
You might think, “Well, that’s not happening.” It happens all the time. People are almost forced to validate the model in the wrong way. And as a result, they make the call to go into production. And while it’s not always 20 and 40 million or whatever kind of numbers, the fact that the model performs worse in production than you originally thought is a direct result of trying to optimize the model, optimize the validation, get better percentages, and it looks so good. But unfortunately, later on, it will perform worse, and that can actually mean this kind of loss. It happens all the time. I saw this too often. So a couple of hints on how to avoid this. First of all, ignore training errors completely – that was the first thing – because it really doesn’t lead to anything good. Always use a cross-validation. That’s already a problem because many products actually don’t even support a proper cross-validation. But the next thing – and that’s basically not supported by any product, at least not officially; RapidMiner is really the only one – is that all the data transformations which work across rows need to be inside of the cross-validation. And that is what makes it so hard. And unfortunately, that is hard even when you program it yourself in R or Python. Not impossible, of course, but you are responsible for doing that, and it’s not that easy. And just as a reminder, because at this point, while you’re validating, you should again be thinking, “Am I using any information I am not supposed to have at the point in time I’m doing the prediction?” So think about this again.
Which brings me to the last section here, on focusing on models versus data and data preparation. And as I said in the beginning, I’m really guilty of that one as well. I mean, especially if you’re a young data scientist, or even if you’re not young but you just learned about this, it’s so exciting, all those deep learning models, all the gradient boosted trees. It’s all awesome. But unfortunately, it’s not solving all the problems automatically for you. And there’s a big trend, and I see it a little bit too often, and I’ve been doing it myself in the early years of my career. What’s happening in many cases is like, “Oh, my God. That’s great. Let’s take the data. Let’s not really spend a lot of time understanding and investigating the data, because at the end of the day, the algorithm doesn’t care. It’s mathematics. It doesn’t care what the use case is. It’s up to the algorithm to figure out what is important and what is not important. Well, let’s do some basic data cleansing, obviously, but besides that, whatever. Let the model figure it out.” And then you of course come up with the most complicated-sounding model because you are a data science superhero, and you just mindlessly plug your data into this model. And then, of course, because this model is so complex, when you’re going back to whoever is the business stakeholder and present your results, nobody is going to understand anything. So there are hardly any insights. In particular, there’s nothing people can really do. There’s nothing actionable really as a result. And really, all your work has been for nothing. And I have been doing this too often. And at one point, I realized, “What am I doing here? It really is not working as it should.”
And I give you now my personal solution for that, which is actually starting with simpler models. You can always do the more complex stuff later on after you have built some trust. But start with simpler models. Think decision trees or linear regression. These kinds of things. But put some work, more work, into feature engineering. First of all, you will be surprised how well those models actually can perform, sometimes even outperform those more complex models – we will see this in a second – even on very simple data sets. That’s point one. And point two, really, it helps your audience to understand what you’re actually doing, because features are something people can actually understand. Let’s say an SVM model with a radial basis function kernel is something nobody can understand. And I am not even talking about deep learning here. Okay. So what could that look like? Let’s go into the last demo part here. Here’s the simplest data set I could think of. It’s about buying lots to build houses on, okay? I will show the data set in a second, then afterwards, we again do a cross-validation with a complex model, gradient boosted trees. So let’s look at the data. This is as simple as it can be. I mean, I have only four rows and three attributes and the one column I want to predict, the right column here. So it should be easy.
Let’s solve this here. Let’s think. Okay. Let’s say. Oh, yeah. So if the length is smaller than 100, then you’re not buying– oh, wait. Here’s the length, this 100, and we are buying. Okay. Then it’s the width. Shoot. Same thing here. Okay. That means probably the price. I mean, you’re not buying if the price is too high. Here, we are actually buying. So it can’t be that easy. But hey, use the gradient boosted trees. It’s supposed to figure it out on their own. But it’s not. The accuracy is actually 50% on this data set. It’s not able to figure it out. And even if I go with 1,000 of those examples or data rows, it wouldn’t. But of course, every human being would immediately think about and say like, “Wait a second. Length, width. That’s not how I would calculate this. I would, of course, do some feature engineering myself, and for example, generate the area first. The area’s, of course, the length times the width. And then I actually generate the price per square meter or square foot,” or whatever. I’m still new to use– I have a hard time getting used to the imperial system. But okay. Let’s go with the price per square meter here. The unit doesn’t really matter. If I do that, I get a couple of new columns here, the area and price per square meter. And if I sort, for example, according to this one, now we actually see already, “Ah. If the price per square meter is low, then I’ll buy it. If the price per square meter is high, then I’m not buying it.” And of course, a simple decision tree– not even a gradient boosted model. It’s a decision tree, a linear regression, they all would work perfectly well.
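The lot-buying idea can be sketched in Python as follows. This is a toy reconstruction under stated assumptions, not the data set shown in the demo: the lot sizes and prices are randomly generated, the buy rule (price per square meter below the median) is hypothetical, and a one-split decision tree stands in for the simple model.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400

# Hypothetical lots: length and width in meters, price in some currency.
length = rng.uniform(10, 100, n)
width = rng.uniform(10, 100, n)
price = rng.uniform(10_000, 500_000, n)

# Feature engineering by common sense: area first, then price per square meter.
area = length * width
price_per_sqm = price / area

# Hypothetical decision rule: buy when the price per square meter is low.
buy = (price_per_sqm < np.median(price_per_sqm)).astype(int)

X_raw = np.column_stack([length, width, price])
X_engineered = np.column_stack([length, width, price, area, price_per_sqm])

# A decision tree with a single split is about as simple as a model gets.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
acc_raw = cross_val_score(stump, X_raw, buy, cv=5).mean()
acc_engineered = cross_val_score(stump, X_engineered, buy, cv=5).mean()

print(f"raw columns only:        {acc_raw:.3f}")
print(f"with engineered columns: {acc_engineered:.3f}")
```

With the engineered column, a single threshold separates the two classes, which is why even the one-split tree does well; how much a more complex model recovers from the raw columns alone depends on the data and is not something this sketch tries to settle.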
I mean, yes, this is a bit of a toy example, but the argument still holds for larger and more complex data sets as well. What can you do to avoid this kind of problem? So first of all, don’t always follow every hype right away, but think much more about the data and the question, the problem you want to solve, and start from that end. And do me a favor and actually think about what a good model actually looks like and how good is good. So what is an expected accuracy? Or whatever you need, a false positive rate. Think about this before you even look at the data, before you actually start doing anything. Because if you don’t define success before you start, you will have some benchmarking already after you’ve built the first couple of models, and that will actually influence your notion of what is good and what is not good. Then understand the business problem. Collaborate with stakeholders. And if you do this right, you can actually use your common sense and invent new features, basically doing all the feature engineering. And that, in combination with much simpler models, often leads to better, but definitely to more robust, models. And by robust, I mean they don’t change quickly just because some of the data points changed a little bit. The more complex models, thanks to overfitting, sometimes tend to be not very robust. And that makes it hard to explain why a certain prediction is as it is. And if that is hard to explain, and nobody understands anything because everything looks like a black box, that is a problem. Because your business stakeholders won’t be able to build the trust necessary to really use those models. Of course, after you have set this benchmark, after you have built this trust, you should also try more complex modelling. And you will optimize parameters and do feature selection and maybe even automatic feature generation.
But I think, again, there is nothing better, especially in the early phases of a project, to understand the problem, use your common sense, and spend some time on feature engineering. That actually helps you a lot in the early days.
As I said, I was doing this myself. I think the very first project I did, like 17 years ago, was for a telco in Europe, and I was building a very complex model. It was, by the way, on a churn prediction problem. And it was working very well. I think, actually, my model would have been able to reduce churn by 30% if the right actions had been taken. The interesting thing is it was never implemented. And the reason why that was the case is because I made it too complex. I was not able to build this trust. I was not able to convince the people who, at the end of the day, make the decision to disrupt their business processes, so it never happened. So that was actually what later led me to the idea of building the whole RapidMiner platform. As a mission, from day one, I wanted to do the real thing. I wanted to do the same thing you could do today with R or Python. Back then, it was Java and Perl, but it doesn’t really matter what programming language you are using. I wanted to make it as powerful as programming but in a much faster and simpler way. And as a side product, also something which is easier to understand. So we do not compromise on the quality of results or completeness. That’s the real data science. We make people more productive. That’s the fast aspect. And we empower more people to do and use real data science, which stands for the simple. So that’s what we are doing. You saw RapidMiner. Some of you might know it already. It really covers the whole analytical spectrum from data preparation to modelling and validation, and that’s a really important part. People talk about models. Unfortunately, they don’t talk enough about how to correctly validate them. This is super important. Otherwise, you will fail and you might ruin your business down the road. And operationalization: what happens after you build a model? How can you integrate this with the other parts of your IT infrastructure, with your business applications?
And sometimes that just means– let’s say, into some reporting product like Qlik or Tableau to also share those results with other people so they actually can make the right course of action.
So RapidMiner covers all three aspects related to end to end platform. And yeah. We are pretty proud of what we have achieved. I mean, it’s the number one most widely used open source data science platform as a generic data science platform. We have a huge user community, more than a quarter of a million people, lots of clients which I’m very grateful for like those organizations, thousands of users, lots of channel partners. By analyst, we are leaders in industry since four years now in a row, leader in Forrester. Yeah. You see the other accolades here. Yeah. We achieved a lot here. If you’re already a community member, I am very grateful for that, and hope you enjoy using the products. If you’re not yet, you should become a community member, and I hope that we can welcome you to the RapidMiner community then soon as well. And at this point, I would like to open up for a Q and A session, and yeah, take any kind of questions which might already be asked during the webinar. Hayley, please, back to you.
Great. Thanks Ingo. Thanks for a great presentation. For those on the line, just a reminder. We will be sending a recording within the next few business days via email. So I had a couple of questions on that. So don’t worry. We will be sending the recording to the email that you registered with. So like Ingo said now, time to go ahead and get the audience questions. So go ahead and enter your questions in the questions panel on the right-hand side of your screen. It looks like a lot of questions have already come in. So I’ll go ahead and ask. The first one I see here to you Ingo– this person says, “I work in credit risk, and we use weight of evidence logistic regression extensively as it is predictive, transparent, and easy to implement. Have you ever used weights of evidence transformations from machine learning models, and how can we better bridge the gap between traditional modeling machine learning to make machine learning more transparent if variables and trends have to be explained?” Sorry that’s a long question there.
It was. And probably it’s too long to actually answer all of this. I’ll start with the most important thing. No. I personally have not. I know people who did, but I personally didn’t try that specific model. So that was the first one. Have you ever do this. What was the second part of the question, Hayley?
I’m sorry. What is the question?
What was the second part of the question? So I personally did not use it, so.
The second part of the question is, yeah, how can we better bridge the gap between traditional modeling and machine learning?
I don’t think there is actually a real gap. I mean, this is kind of like a philosophical question, and I could talk about this for an hour, but I won’t. What I mean by traditional modeling is, well, you create, for example, a hypothesis and do hypothesis testing. But of course, that’s not happening with machine learning. I mean, that’s really like, okay, you kind of mindlessly plug in your data. I don’t think there is or has to be a huge gap as long as you use the model in the right ways. Meaning if you actually have a hypothesis and prove it right and then use your hypothesis to extend or extrapolate this to future values, that’s basically the same thing you would do with a machine learning model, but you don’t necessarily know all the distributions, the hypothesis, and everything else up front. I personally don’t think this is a problem, but if that’s important and you feel you have a gap, you should then of course afterwards understand what it is the machine learning model figured out and basically prove this hypothesis, and then the gap actually goes away. I mean, I did my PhD both in computer science and statistics. I felt that statistics has moved a lot actually in that regard, and it’s no longer that big of a gap in my opinion. But again, it’s a little bit philosophical. I’m sure other people would have a different opinion on that.
Great. Thanks, Ingo. Next question here is what are your thoughts about automated machine learning such as tools that run thousands of models and take the best performing one?
Well, in theory, there’s nothing wrong with this. I mean, in RapidMiner, we support the same thing for a good reason, because there is nothing per se wrong with trying a lot of different model types. But what I don’t like is to also optimize the feature engineering part, I’m sorry, to fully automate the feature engineering part. And the reason is exactly the problems we have seen in the first and the third example today. Even if you optimize– sorry, automate the feature generation, for which, for example, in RapidMiner you use operators like YAGGA, yet another generating genetic algorithm– if you use this one, it’s always dangerous if you don’t put this into an outer validation again, and if you don’t really think through, “What are the types of features which have been generated?” and again, “Is this maybe something I’m not supposed to know?” So you should not keep yourself completely out of the loop, just running through thousands of models. There is nothing wrong with this, but at the same time, if that’s the only thing you do, you won’t really end up with a great model. You’ll end up with a mediocre model quickly, which, if that’s all you’re looking for, is good. That’s why we support it as well. I just think you can turn this into an awesome model if you put some brainpower into this in addition to the automation. And then again, parameter optimization, all those things. We automate all of this, and you should. That’s way too cumbersome work to not automate it.
Great. Thanks, Ingo. Another question here. This person is asking can we use programming languages such as R and Python directly into the platform such as Jupyter Notebook?
Okay. So let’s start on R and Python. So there’s operators here, for example, going with this execute Python operator. And you can basically write arbitrary Python scripts as part of this, return the results, and use the results in the rest of the workflow. That works for data sets, models. Basically, everything of importance. The same is true for R. I’m going to type in execute R. Oh, here we go. We have the R execution. By the way, it can also push both via PySpark and SparkR into Hadoop clusters for execution, which by the way, is true for all RapidMiner processes. So all the 1,500 operators, you can also push into Hadoop clusters as well. So yes is the answer on that. On Notebooks, it’s interesting. I mean, integrating a Notebook into RapidMiner doesn’t really make a lot of sense for reasons which probably would go a little bit too far to explain. But since this here is a complete visual process which then also, as you see, is kind of common sense as well, it’s kind of a Notebook on its own. So that wouldn’t probably make a lot of sense. On the other hand, exporting things into Notebooks, yes, can be done but it’s not super native. I mean, you can do it, but it’s not as elegant as it could be. But I think the more important thing is since this acts a little bit like a Notebook that you can also put some results back into the process, and stuff like that, it’s probably more important to actually be able to execute R and Python scripts. I think that’s the use case we see much more frequently. And that’s super seamlessly integrated. Hayley.
Okay. Another question here for you, Ingo. This person is asking how do we know which is the best feature selection method, such as forward selection or backward selection? Do we validate it against misclassification error for logistic modelling?
Yes. But be careful, because again, even a feature selection that uses a model is looking at the complete data set. So basically, this one also needs to go inside of another cross-validation. That’s why it’s so important to have this nestable cross-validation concept. So you probably can’t see it, but if I take this cross-validation here, you see this little symbol on the bottom right, and that means you can actually go inside of this cross-validation and do something. So for example, you could do this cross-validation, and then inside, you do a feature selection, and inside of that you again find another cross-validation validating the performance of this feature set. Well, of the feature sets you’re trying out. That is actually an important point here. So again, whenever you do something which takes into account the whole data set, it’s very likely that it has an impact on the validation of the model. So how to pick it? Well, you try it out, forward selection, backward elimination, so it’s all here. I mean, if I type in feature selection, you see they’re all here, forward, backward. And the one I personally like a lot is an evolutionary approach. Why I like it a lot, I can probably quickly show you. I mean, if you want me to. Nope. I have not saved this here. If I just go with this data set here, Sonar. Okay. Now I’m doing exactly the opposite of what I just told you you should do, because I’m not putting this into an outer validation. So you can use this building block here, let’s say, for this cross-validation. Put this inside here. The performance there. Okay. I now only have a cross-validation inside of the feature selection.
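The nesting described here (an outer cross-validation, a feature selection inside it, and an inner cross-validation inside the feature selection) can be sketched in Python roughly like this. It is a sketch under assumptions: it uses scikit-learn’s `SequentialFeatureSelector` for forward selection rather than RapidMiner’s operators, a small synthetic data set instead of Sonar, and an arbitrary target of four features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Small synthetic data set (assumption) so the nested loops run quickly.
X, y = make_classification(
    n_samples=150, n_features=12, n_informative=4, random_state=0
)

# Inner level: forward selection scores each candidate feature set
# with its own 3-fold cross-validation.
selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=1),
    n_features_to_select=4,
    direction="forward",
    cv=3,
)

# The selection is a pipeline step, so it only ever sees training rows.
pipeline = Pipeline([
    ("select", selector),
    ("model", KNeighborsClassifier(n_neighbors=1)),
])

# Outer level: 5-fold cross-validation of "selection + model" as one unit.
outer_scores = cross_val_score(pipeline, X, y, cv=5)
print(f"outer CV accuracy: {outer_scores.mean():.3f}")
```

The outer score is the one to report; the inner scores only guide which features get picked within each training fold.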
So one of the nice things of this one here – it probably takes too long to build the whole thing – is that you can also make a multi-objective feature selection here or a multi-objective optimization for the feature selection by changing this parameter, the selection scheme, to non-dominated sorting. And why is this important? Because now, you can optimize for both the optimal number of features– or maybe I just build it. I think it’s easier to see than to explain really. So we would like to actually optimize– let’s say we would like to optimize for accuracy on one hand– accuracy in one hand. Why is it complaining? Okay. Classification here. This one. Okay. Accuracy on one hand and the number of features on the other hand. We want to minimize this. So now, actually, I can run this whole thing here. I know it takes a little bit of time, but it’s actually going to be interesting – I hope it is – if it runs. Of course, I screwed it up. But here we go. So what’s happening now is I get this pareto front of features and you see that this pareto front moves slowly to the top-right corner. And the interesting thing is so while it moves here, you see that out of the 60 attributes, you can go for a sample with 21 attributes here, and then we end up with 0.66 accuracy. Or we go with, whatever, 19 attributes here. Now, we’re actually moving too quickly to really explain it. Now, we have 13 here and 0.72. And here, it’s 21 to .76. My point really is to get the full decision function of this pareto front so you know how well, for every feature set size, your model will actually perform. So I stopped it here. It will run a little bit longer, but it’s really getting this pareto front here in one optimization run. That, in my opinion, is one of the best feature selection schemes out there. Because you get everything you need to know for every feature set size, every performance, and all in the same time on one optimization run. I really, really like this. 
That’s also why I wrote my PhD about this, just as a side note. Anyway, sorry. Got a little bit sidetracked, a little off the demo here on this one, but I think that’s actually a technique worth knowing, and not many people do know it. And not many people actually use it.
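The idea of the Pareto front, meaning the feature sets that no other set beats on both size and accuracy, can be illustrated with a small helper. This is not the evolutionary algorithm from the demo, only the non-dominated filtering step applied to a finished list of candidates, and the candidate numbers below are illustrative rather than the demo’s actual results.

```python
from typing import List, Tuple

Candidate = Tuple[int, float]  # (number of features, accuracy)

def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate dominates.

    One candidate dominates another if it uses no more features AND is at
    least as accurate, and is strictly better on at least one of the two.
    """
    front = []
    for n, acc in candidates:
        dominated = any(
            n2 <= n and acc2 >= acc and (n2 < n or acc2 > acc)
            for n2, acc2 in candidates
        )
        if not dominated:
            front.append((n, acc))
    return sorted(set(front))

# Hypothetical (feature count, accuracy) pairs for illustration only.
candidates = [(21, 0.66), (19, 0.70), (13, 0.72), (21, 0.76), (30, 0.75), (8, 0.60)]

print(pareto_front(candidates))  # [(8, 0.6), (13, 0.72), (21, 0.76)]
```

Reading the front from left to right gives, for each feature set size, the best accuracy found so far, which is the trade-off curve the talk refers to.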
Thanks, Ingo. Another question here in terms of simple models, where the model needs to be put into production, why not use logistic regression over machine learning models, for example, random forest?
Okay. Logistic regression for me is a machine learning model. Linear regression for me is a machine learning model. I mean, yes. It was invented 200 years ago by Gauss, but it doesn’t matter. I can basically go to modeling, and if you look at the predictive section here, logistic regression. It’s all in here. Linear regression. It doesn’t really matter. The outcome is always a model. I can learn something from the model. I can see the coefficients and everything else. But I can also then use the model for predicting new values. That’s true for linear regression and for logistic regression. So it behaves exactly like every other machine learning model. So that’s why I don’t make the distinction. But again, people coming from more traditional statistics might see a difference there. I take a more pragmatic view on this. And if two things behave the same and look the same, for me, they are the same. It’s just new words. It’s marketing, really. Sorry, Hayley.
Thanks, Ingo. Another question here. This person says there are many real world problems which involve highly imbalanced data. Data preprocessing typically resamples data which inevitably changes representativeness of training data. Do you share this concern with them?
Yes, I do. I mean, of course, again just to make it a little bit more– yeah. So you can see basically– of course you can do this here. You can balance the data. You can resample the different classes, and there are also a couple of other simple algorithms available to do this a little bit smarter. But I share the concern that if you build a model on this redistributed data set, often this model cannot just be applied to the application data in production, which follows a different distribution. So I don’t think there’s anything wrong with building a model on balanced data, because sometimes it makes it easier. If one class is overshadowing the other class, it sometimes is hard for the model to pick up anything, and the easiest solution looks like, well, let’s just take the majority class. So during model building, changing the distribution all makes sense. During model validation though, it doesn’t. So at least during validation, go back to your original distributions, and you will get a little better feeling for how well this model will work. I think that’s kind of a piece of advice. But yeah. It’s never great. Let me summarize it like that. Yeah.
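One way to follow this advice in Python is to rebalance only the training part of each fold and leave the test part at its original class distribution. The sketch below assumes scikit-learn, a synthetic data set with roughly 10% minority class, plain random oversampling, and logistic regression as the learner; none of these choices come from the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data (assumption): about 90% class 0, 10% class 1.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

scores = []
test_minority_shares = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    minority = y_train == 1

    # Oversample the minority class, in the training fold only.
    X_min_up, y_min_up = resample(
        X_train[minority], y_train[minority],
        replace=True, n_samples=int((~minority).sum()), random_state=0,
    )
    X_balanced = np.vstack([X_train[~minority], X_min_up])
    y_balanced = np.concatenate([y_train[~minority], y_min_up])

    model = LogisticRegression(max_iter=1000).fit(X_balanced, y_balanced)

    # The test fold is untouched, so it keeps the original distribution.
    scores.append(balanced_accuracy_score(y[test_idx], model.predict(X[test_idx])))
    test_minority_shares.append(y[test_idx].mean())

print(f"balanced accuracy on original distribution: {np.mean(scores):.3f}")
```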
Great. We have another question here. We often have difficulty in dealing with spatial nature of our target such as building new territorial features of spatial smoothing. Are there any features in RapidMiner that could help us?
Well, a good friend of mine, who is a data science consultant in Austria, he actually created connections to Postgres which comes with a lot of geospatial analytics features, so we can use them from RapidMiner. But I do think– I’m not 100% sure though. I do think actually he also came up with a couple of scripts and other things you can use. In all honesty, it’s not as deeply integrated as many of the other functions you can see here. So you will always need to go through a little bit of workaround either through R or Python scripting, through Postgres or other means. So yes. It makes it a little bit more complex than just dragging and dropping a couple of operators. But doing it that way, it’s definitely possible. Yeah.
Great. So it looks like we’re just about time here. So if we weren’t able to address your question here on the line, we will make sure to follow up with you via email within the next few business days. We’ll go ahead and get your questions answered. So thanks, again, Ingo. And thanks, everyone, for joining us for today’s presentation, and we hope you have a great day.