In this series of four videos, RapidMiner founder Ingo Mierswa demonstrates a complete automated data science project from end to end. Part one starts with data preparation. In part two, he will show you how to build machine learning models. And finally, he will show you how to put those machine learning models into production and how to manage them. Join him for a complete artificial intelligence project experience and learn how to leverage the fully automated data science offerings in the RapidMiner platform.
Part 1: RapidMiner Turbo Prep
Ingo’s demo starts with data preparation using RapidMiner Turbo Prep. In this section you will learn how to load data, explore your data, mash up data sets, and transform the data, all steps which will help you to build better models.
06:05 So maybe weather has some impact. Again, let’s remember that and we can add additional weather information to the flights later on as well. Last but not least, let’s actually have a look at the destination name here, and here you can see the highest delays are for Nantucket. That’s a tiny, tiny airport though, and you can see this easily: if you change over to count, there are only 280 flights in total, while to Boston there are 123,000. We also saw that before.
06:31 So maybe let’s focus the rest of the analysis on Boston Logan. So now let’s do some of the data preparation we came up with before, for example, adding the total number of flights per day to each of the flights. We call this augmented data preparation, and you will see why: we can sometimes suggest matching data sets to you, to help you get your data prepared in the right way and also more efficiently.
07:14 But this is how you could do this in a manual way. Okay. So here’s our data set again. Since I’m going to join it with itself, so to say, the first thing I want to do is to create a copy again, because I’m going to do a pivot here, similar to what we have seen before when we explored the data. So let’s get started with a pivot, and what we’re going to do this time is get the total number of flights for each day.
08:49 First thing is this column now is stable. It’s a constant column. Everything is Boston. So let’s hit the delete key. Let’s remove this column, and while we are in this session here, let’s also rename this one here. I don’t like that count name too much. So let’s call this total flights. Okay.
09:07 So that’s looking good. Let’s commit this change here. And I think you know already that we are going to merge this total flights information together with the original data. But before we do that, let’s also filter down just to Boston here. So again, we start a transformation session, pick this column, and you can just filter it here like this.
10:00 So let’s care about our own flights, and let’s just assume I’m somebody working for JetBlue, for the sake of this demonstration. All right, so that looks good. Now that we have Boston here and JetBlue, all is perfect. Let’s now merge it with our total data, and this is where the augmented data prep is coming in: right off the bat, we are suggesting to you that this other data set is a perfect match.
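As a rough illustration of the pivot described here, the per-day count can be sketched in plain Python. This is not what Turbo Prep runs internally; the records, field names, and dates below are made up for the example.

```python
from collections import Counter

# Hypothetical flight records; field names and values are assumptions.
flights = [
    {"date": "2007-01-01", "carrier": "JetBlue", "dest": "Boston"},
    {"date": "2007-01-01", "carrier": "Other", "dest": "Boston"},
    {"date": "2007-01-02", "carrier": "JetBlue", "dest": "Boston"},
]

def flights_per_day(rows):
    """Count rows per date: a pivot grouped by 'date' with count as the aggregate."""
    return Counter(row["date"] for row in rows)

total_flights = flights_per_day(flights)
print(dict(total_flights))
```

The result has one entry per day, which is the shape needed to join the totals back onto the individual flights.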
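The filter and merge steps could look roughly like this in plain Python. It is only a sketch of the idea, assuming a per-day totals table already exists; the field names, dates, and the `total_flights` values are hypothetical.

```python
# Hypothetical per-day totals, as a pivot over all Boston flights might produce.
total_flights = {"2007-01-01": 2, "2007-01-02": 1}

# Hypothetical flight records; field names and values are assumptions.
flights = [
    {"date": "2007-01-01", "carrier": "JetBlue", "dest": "Boston"},
    {"date": "2007-01-01", "carrier": "Other", "dest": "Boston"},
    {"date": "2007-01-01", "carrier": "JetBlue", "dest": "Nantucket"},
    {"date": "2007-01-02", "carrier": "JetBlue", "dest": "Boston"},
]

def keep(rows, **conditions):
    """Keep only rows whose fields equal all the given values."""
    return [r for r in rows if all(r[k] == v for k, v in conditions.items())]

def merge_totals(rows, totals, key="date"):
    """Left join: attach the per-day total to every row (None if the day is missing)."""
    return [{**r, "total_flights": totals.get(r[key])} for r in rows]

own_flights = keep(flights, dest="Boston", carrier="JetBlue")
merged = merge_totals(own_flights, total_flights)
```

Each remaining JetBlue flight into Boston now carries the total number of flights for its day as an extra column.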
14:46 No error messages. And as we can see, yep: smaller than 15, on-time; larger than 15, delayed. Perfect. Let’s commit this. So there’s the column we’re going to predict, but before we do this, we need to do some data cleansing and do something about our data quality.
17:35 But wait, there is one more thing. It’s great that you can use this interactive, data-centric way of Turbo Prep to prepare our data. And we did a lot of things: we cleaned the data, we matched it up, we filtered the data, and that’s fantastic.
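The rule behind the new column is simple enough to write down. A minimal sketch, assuming the delay is measured in minutes and that a delay of exactly 15 counts as on-time (the function name is made up for illustration):

```python
def delay_class(arrival_delay, threshold=15):
    """Label a flight 'delayed' if its arrival delay exceeds the threshold, else 'on-time'."""
    return "delayed" if arrival_delay > threshold else "on-time"

labels = [delay_class(d) for d in (-5, 15, 16, 90)]
```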
Part 2: RapidMiner Auto Model
In this video, Ingo gives you an overview of how RapidMiner Auto Model itself works with his Auto Model Blueprint. Then he demonstrates how to build models with Auto Model using the same data that we cleaned up in part 1. Auto Model can go through thousands of candidates for model types and all the different parameter values and find the optimal models for you.
00:05 So now finally we can start with some modeling. Before we jump into RapidMiner’s Auto Model, let me give you a high-level overview. We’re not going through all the details here. I mean, this overview is not showing them anyway, but let’s just focus on some of the boxes, like the dashed-border boxes you see on the screen here right now.
01:38 I got this basic feature engineering because that should be a given and not too exciting. But now the really exciting thing for RapidMiner’s Auto Model is that there’s also automatic feature engineering. For that we use a multi-objective optimization technique, which by the way is close to my own heart because this was one of the major outcomes of my own PhD thesis. So this is a really, really advanced technique to generate additional features out of the existing ones. So after doing the basic feature engineering, we do this full-blown automatic feature engineering.
03:00 So there’s a lot of great things going on there, and you don’t need to worry about any of this because for you, it’s just one single click. And that was actually kind of a lie because it’s not one single click, it’s five clicks, but that’s okay. And why is it only five clicks? Because that’s why we call it augmented machine learning. We try to support you and give you good recommendations as much as needed. And whenever we can, we actually fully automate the potentially still necessary data pre-processing, like the feature engineering we have seen before, or the model selection, hyperparameter tuning, and all of that, generating the results, and allowing you to compare the different models. Okay.
04:53 So, for example, this column here is highlighted as a yellow one, but everything else is green. So what is this column? Well, it’s the arrival delay. Wait a second. That’s the one we use to calculate our delay class. The one you want to predict. You’ll remember if it’s bigger than 15 then it’s delayed.
05:10 Otherwise it’s on-time. Well, yeah, I mean, if this is still in the data, all the models would pick up on this and say, yeah, if it’s more than 15, then it’s delayed. That’s obviously not what we are going to be interested in, and RapidMiner finds this column highly suspicious. Sometimes you’re just lucky and it’s okay to have a column like that in the data. But in this particular case, certainly not. So we should get rid of this. We just highlight it and give you this recommendation. That’s why we call it augmented analytics: only you know your business case, so only you can make the final call on this. But we at least give you a hint, and the same is true here.
05:41 So those are the models RapidMiner believes will perform well on your data and deliver good results in a feasible amount of time. You can override this decision, for example, turn this one on, but that will probably take some time, so maybe let’s not do it. And in general, maybe just focus on a couple of the more popular models like those here: logistic regression, deep learning, gradient boosted trees, those kinds of popular models. Typically, we do the automatic hyperparameter optimization for you so you don’t need to worry about that. And the same is true for this basic feature engineering.
06:13 For example, extracting information out of the date columns like month or quarter of the year, or, if there are multiple date columns like in our data set, we even build the differences between all of them because that’s often very helpful as well. We’ll come back to text later. And the same is true for the automatic feature engineering, both feature selection and feature generation, but I’m not going to do this for now.
06:34 Now let’s just start this run here on this data set. Well, now you can see it’s running, and those are the five clicks that I promised to you, as really kind of an easy wizard-based approach. And while we are waiting for the results, we’re not going to wait for the full time, but we can have a look into some of the results which are already in.
06:55 So here’s the data set. As you can see, the arrival delay column has gone at this point. But we have a couple of additional columns now, like the extracted information out of the date columns, or those differences between dates I mentioned before; they are here as well. The statistics are here already, but the models will still take some time. Why is that? Because we are actually not just creating one model for each class; we are generating dozens, sometimes hundreds of models for each of the classes, to figure out what is the best model, what is the best network architecture, what is the best number of trees for gradient boosted trees, and so on. So I will stop this run here and just go back to the beginning, because here we actually have a couple of pre-calculated results from a previous run.
08:06 And if you would like to see other results like AUC or others, you can just switch it over, and the same is true for the ROC curve. But to be really honest, I actually find this almost too good. So what’s going on here? It’s good that you can actually have a look into the model. So, for example, for this gradient boosted trees model here, let’s have a look at this tree, or this one, or this one. Wait a second. They are all very similar. They are all using this difference between the arrival time and, what is this column about? Well, that actually is the scheduled arrival time.
08:39 Well, what is the difference between the actual arrival time and the scheduled arrival time? Well, that is our flight delay in minutes or seconds or whatever it is, and obviously that’s the same thing as this arrival delay column we based our target on. Oh, okay. So here we made a mistake. We did use a column we shouldn’t have been using. We can see the same result also here, if you look into the model-specific weights. This is another novel algorithm coming from RapidMiner: you can actually build model-specific weights for all the model types, no matter if it’s deep learning, gradient boosted trees, or logistic regression. It’s not just regular global weights like correlation-based weights or anything.
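One simple way to see why such a column is suspicious is to check how well a single threshold on it reproduces the target. The transcript does not say how RapidMiner flags columns, so the check below is only an assumed stand-in, with made-up data and a hypothetical function name.

```python
def best_threshold_accuracy(values, labels, positive="delayed"):
    """Best accuracy of any rule 'value > t means positive', trying each observed value as t."""
    best = 0.0
    for t in sorted(set(values)):
        correct = sum((v > t) == (y == positive) for v, y in zip(values, labels))
        best = max(best, correct / len(values))
    return best

# Hypothetical columns: the target was derived from arrival_delay (> 15 means delayed).
arrival_delay = [3, 40, 10, 16, 15, 90]
distance = [200, 150, 900, 300, 250, 700]
delay_class = ["on-time", "delayed", "on-time", "delayed", "on-time", "delayed"]

leak_score = best_threshold_accuracy(arrival_delay, delay_class)
normal_score = best_threshold_accuracy(distance, delay_class)
```

A score of 1.0 means one column alone determines the target, which usually points to leakage rather than to a genuinely predictive input.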
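The basic date features mentioned here can be sketched with the standard library. The function names, column names, and timestamps are hypothetical, and the output naming is not necessarily what Auto Model generates.

```python
from datetime import datetime

def date_features(name, value):
    """Extract month and quarter of the year from a single date column value."""
    return {
        f"{name}_month": value.month,
        f"{name}_quarter": (value.month - 1) // 3 + 1,
    }

def date_difference_minutes(later, earlier):
    """Difference between two date columns, expressed in minutes."""
    return (later - earlier).total_seconds() / 60

# Hypothetical values for two date columns of one flight.
scheduled_arrival = datetime(2008, 10, 3, 14, 0)
actual_arrival = datetime(2008, 10, 3, 14, 25)

features = date_features("scheduled_arrival", scheduled_arrival)
features["arrival_difference"] = date_difference_minutes(actual_arrival, scheduled_arrival)
```

Note that a difference like the one between actual and scheduled arrival is exactly the delay itself, so generated features of this kind should be checked against the target before they are used.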
09:46 So let’s do it right this time. In the second session about augmented machine learning, we will focus on creating a robust model which also has a true business impact. So it works like it should work, and you can prove what this business impact is going to be. And that’s really important because it’s not just about model accuracy or model error rates or AUC or whatever. It’s really about, what is this model really doing for our organization and can you show
10:51 So in general in machine learning, there are techniques where you can define what happens for certain types of errors, for example, if you predict on-time but it actually is a delayed flight. Especially for two-class problems, there are certainly a dozen or so algorithms around to solve this problem in a more or less efficient way. But for more than two classes, it actually is a really hard problem. There are not many algorithms for doing this. And those which are solving this for more than two classes typically build ensembles around your actual model, and that comes at a price because this model itself might already be an ensemble.
11:28 Like, for example, gradient boosted trees or random forest, and then you build 10 or 20 of those around those ensembles. It makes the model harder to understand, and it also increases training times dramatically. So that is not really acceptable, but sometimes you just have more than two classes, and what can we do? So we came up with a new algorithm, we call it profit-sensitive scoring, which doesn’t have those problems: it works for two or more classes without the increase in training time. It really is a great tool in your tool belt, and you should make use of this.
11:59 So how are you using this? Well, by looking into the business impact each of the predictions can have. So I’m just quickly going through this. If I, for example, predict on-time arrival and it actually is on-time, that may have a positive impact on customer loyalty. I’ll just put a value of 10,000 here, a positive number. But if I predict on-time and it’s actually delayed, then I have two problems. I will lose some customer loyalty and I actually audit the gate too early. So that means that I’ve increased airport costs. And I’m not an expert for that, so I don’t really know. But that’s what’s happening. It’s just an example. Okay.
12:55 So this is just an example here, and I will show you in the product how to use those numbers. Okay. So we’re back in Auto Model. So there’s still the column you want to predict. Let’s go through this. And in addition to focusing on delayed, I can now enter those values here. So 10,000 was for the first one, if I remember correctly. Then you get minus 20,000 here. We had still minus 5,000 here and minus 10,000 here.
14:09 I can do this here. So let’s not change the models this time, but I can do it here by actually forcing RapidMiner to treat this as a text column. So let’s do this. All the rest is the same. So we will still extract information from the dates, even though we now also extract information from the text. But I’m still not turning on the automatic feature engineering, both selection and generation. That is for a third one. Well, instead of running it again now, I have the results here from a previous run, so I’m just going to load them into RapidMiner, and I’ll quickly explain the results to you in a second. Okay, here we go.
14:48 First thing we can see is that we have this text information over here. For example, here is a word cloud, where red means on-time and blue means delayed. And we can see that normal is a little bit more frequent for on-time flights. But rain, for example, is bigger for delayed, and fog is bigger for delayed, and even snow is bigger for delayed. So weather potentially really has some impact here. So that’s one thing we can see.
15:15 I mean, there are not lots of different words, so not much to see here, but still. Then let’s have a look here again into the accuracies. Yeah, it looks more realistic: some of the models are no longer performing that well. Some others are doing a pretty good job. Deep learning in particular is doing a great job. It’s not just getting the highest accuracy or lowest classification error. It’s also the model which delivers the most business impact.
16:09 But then we can also use whatever the model predicts here. If you do this, then we actually come here to 38 million. And that is a gain of 13 and a half million. That’s the number you’re seeing here. So you can clearly see that this model here delivers the biggest business impact. Otherwise, deep learning is pretty much a black box model.
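To make the role of those four numbers concrete, here is a generic expected-value sketch. It is not RapidMiner’s profit-sensitive scoring algorithm, whose details are not given here. The first two values follow the explanation above; which of the remaining two cells gets minus 5,000 and which gets minus 10,000 is an assumption.

```python
# Profit per (predicted, actual) pair.
PROFIT = {
    ("on-time", "on-time"): 10_000,
    ("on-time", "delayed"): -20_000,
    ("delayed", "on-time"): -5_000,    # assumed cell assignment
    ("delayed", "delayed"): -10_000,   # assumed cell assignment
}

def expected_profit(predicted, confidences):
    """Expected profit of announcing `predicted`, given the model's class confidences."""
    return sum(p * PROFIT[(predicted, actual)] for actual, p in confidences.items())

def most_profitable_prediction(confidences):
    """Pick the class whose announcement has the highest expected profit."""
    return max(confidences, key=lambda label: expected_profit(label, confidences))

def total_profit(pairs):
    """Sum the profit over (predicted, actual) outcomes, e.g. on a validation set."""
    return sum(PROFIT[pair] for pair in pairs)

likely_on_time = most_profitable_prediction({"on-time": 0.6, "delayed": 0.4})
likely_delayed = most_profitable_prediction({"on-time": 0.3, "delayed": 0.7})
```

Summing `total_profit` over a model’s predictions and comparing it with a baseline is one way to arrive at a gain figure of the kind shown in the results.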
16:28 So what else can we do with this model here? Well, one thing I like is those prediction tabs here. So wherever you create the prediction together with the confidence for the different predictions, but then also we can see for each row, what is supporting or contradicting the prediction. So for example, here, it’s the delayed prediction and in fact it was delayed. First of all, it’s supported by the origin here.
18:14 How can you make sure that this problem of feature-space overfitting doesn’t occur? We use multi-objective optimization and a special form of regularization to keep this overfitting under control. And if that doesn’t mean anything to you, that’s totally fine as well. You’ll still be blown away by the results because you can actually learn something from this. So I will show you in this section on this automatic feature engineering approach how to explore those results.
18:59 I loaded the results already, so we had a total of 175,000 models and so on. You’ll see this here at the top, and overall our models look like they are performing a little bit better, especially the linear models. They really benefited from this feature-space transformation and the generation of new features.
19:15 Deep learning is doing this inherently already as part of its approach, but still it often can benefit from additional feature generation as well because it can actually help speeding up the runtime for deep learning. Talking about deep learning.
19:50 Well, good old linear modeling, especially in combination with powerful feature engineering, often outperforms other methods like in this case, no obvious, yeah, get some fun out of this. But since deep learning was the best one, let’s actually focus our attention on the feature sets for deep learning. So whenever you turn on feature engineering, either feature selection or generation, in this previous screen, the model types screen we’ve seen before, you will get this additional result.
20:17 And what does it mean? So first of all, you get this point up here in the top right corner. That is the original feature-space. So it came with like 30-something features here. You see all of them here at the bottom, the extracted text features and everything else as well. But this model wasn’t actually that great. It had like a nine, 10% error rate. You could go with a different model than this one here, which has a much lower complexity, shown on the Y axis here, actually only using one feature, the NAS_delay. It’s not just less complex, it also performs better.
21:14 We tried almost 600 other ones for this particular model class here, but RapidMiner figured that there is no value in adding even more complexity. And this is exactly what we call feature bloat, or this overfitting problem, because you could actually drive the error rates further down by increasing the complexity up here, but not significantly. That’s why we went with this point. And with this exploration here you can actually learn something about the interactions between the feature sets. That’s really a fantastic feature of this feature-space exploration tool here.
21:51 Finally, before we actually click on this nice little green button here for deploying some models, I would like to show you that, like for Turbo Prep, also for Auto Model you can click on this open process button, and that will generate the complete process for generating all those models here. It’s fully annotated. You can go inside here, you can have a look into the preprocessing, you can see every single thing.
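The trade-off between complexity and error rate can be illustrated with a small Pareto-front filter. This only shows the multi-objective idea, not RapidMiner’s optimization itself; the candidate feature sets and their numbers are invented, apart from the rough shape of the original feature space (about 30 features, 9 to 10% error).

```python
def pareto_front(candidates):
    """Return the (complexity, error) points that no other point beats on both objectives."""
    front = []
    for c in candidates:
        dominated = any(
            o != c and o[0] <= c[0] and o[1] <= c[1]
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(set(front))

# Hypothetical feature sets as (number of features, error rate).
candidates = [
    (30, 0.095),  # original feature space: complex and not that accurate
    (1, 0.08),    # a single feature
    (3, 0.06),
    (10, 0.058),  # slightly better, but much more complex
    (12, 0.07),
]

front = pareto_front(candidates)
```

Picking a point on the front where extra complexity no longer buys a meaningful drop in error corresponds to the choice described above.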
Part 3: RapidMiner Model Ops
After creating great machine learning models in part 2, Ingo shows you how to put them into production, manage them, and automatically create all the necessary management processes with RapidMiner Model Ops.
00:04 Sometimes it’s easy to forget that building a model is not actually the end of the story. It’s really where it starts, because now you need to put this into production and you need to integrate the predictions, for example, with other pieces of your infrastructure. And then you need to make sure that the model stays current and that it’s not getting worse over time. And this is really what the whole Model Ops solution of RapidMiner is for.
00:28 So let’s get started with the basics first. Let’s start with the deployments. I will show you how to deploy a model, what kind of different model types there are, we call them Champions or Active models and Challengers, and how you can see how a model was created. Again, this is really important for compliance reasons. And, well, how you really work with Model Ops overall.
02:21 That’s why it takes a few moments. But after it’s done, we will see that we have our first model in this new deployment. And since it’s the first one, it automatically became the Active model. So let’s add another one and see what happens then. Again, it doesn’t matter where you click. Let’s deploy the deep learning model here. And if you do this, if you add it to the same deployment, you will see that the second model becomes the Challenger model.
02:50 So what is the idea behind Active models and Challenger models? Well, the Active model is the one which produces the predictions. Whenever you use the deployment for scoring, no matter if you upload some data and do the scoring that way or if you actually use the automatically created web services, the Active model is the one which produces those predictions. But all the Challengers are producing predictions as well. They are just not delivered, but they can still be used to calculate how well those Challengers perform. And you’ll see later in this demonstration that this can be very useful, because if a Challenger becomes better over time, all you need to do is click here and change this one to the Active model to replace it.
03:31 Finally, you can also see the details of each model. So I can click on this model, I can see the model which has been generated. I see a snapshot of the full input data, and that actually is really important because this input data here is not just a reference, it’s a full copy. And why is that important? Because otherwise the data could have been changed in the meantime, and you may need to prove how this model has been built.
03:57 Think about GDPR in Europe, for example, and the right to explanation: you better know all the details, without any option that anybody could have changed anything in the meantime. All the other results have been produced as well, and you can explore them here. And that even includes the generated process here.
04:13 You can load this back into the design view. Well, that’s not very exciting because it’s pretty much the same process as the one we have seen before. But again, you can really prove how this model has been created. When we are talking about scoring, we also need to talk about how to explain the predictions a machine learning model is making. I mean, for some models it’s really easy: for a decision tree, you can pretty much follow along as a human. But for most models it’s just not, especially for the more powerful ones. And let’s be honest, I sometimes ask people if they understand how exactly a linear regression model is working, and most people wouldn’t know that either.
04:50 So the whole topic of explainable AI became really important. And yeah, we at RapidMiner totally believe in a complete no-black-boxes policy. And that means both: it’s about how the models have been generated, and we saw this before, we can always open up the processes, but it’s also about the predictions which are created by the models. So let’s have a look into how those predictions are created, the scoring, and then how we can explain those predictions. I already have been showing those model-specific but model-agnostic weights before. And then there’s another thing we call the model simulator, which I personally like a lot, and many of our users do too. Okay.
06:09 Now we take the 2008 data, also from October 3rd, and we load it in. So here’s the data. I mean, I’m not using the arrival delay, obviously, but we detected that we have this target column already, and if you have this, it’s of course useful: you can calculate error rates right away. Okay, so let’s feed this data in here. Let’s do the scoring. And after a couple of moments we will get all the predictions together with the confidence values and also the explanations like the ones we have seen before.
07:19 In this case we knew the actual value already. If that is not the case, you can also define the actual outcomes for given IDs, which is really helpful to calculate those error rates. We’ll see this a little bit later here as well. So let’s move to the simulator. Before we do this, we see in general those delayed columns are a bit more important. We don’t need those scores any longer, so let’s just throw them away. So the simulator: what this is doing is, we see the input factors on the left side and then we see how the model behaves on the right side.
10:31 So this is one way to see this, but then there’s an extra tab up here. We call this drifts. Here we show you, for each input, where there is the biggest difference. So for example, remember this column we have created, this total flights column. Between the training data, the dark greenish here, and the scoring data, we can see that overall there seem to be fewer flights now, in 2018, going into Boston. So there was a bit of a change.
15:35 So let’s just jump right into this. Yep, we have seen this before. Let’s move over to the alerts tab here, and you can see that I’ve already created three different alerts here. The first one is the drift alert: if the drift is greater than, let’s say, 8% for the last week, and I check this once per day, then that sends an email to data scientist number seven. And the same is true for the error alert. Or if you have less than a hundred scores on any given day, then again, you send an email. Whenever an alert is triggered, we can see what has been triggered down here until we acknowledge this.
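The Active and Challenger idea can be sketched as a small class. This is a conceptual illustration only and does not reflect RapidMiner’s implementation or API; the class, method, and model names are hypothetical.

```python
class Deployment:
    """Holds several models: one Active model delivers predictions, the rest are Challengers."""

    def __init__(self):
        self.models = {}      # model name -> prediction function
        self.active = None    # name of the Active model
        self.shadow_log = []  # predictions of all models, kept for later comparison

    def add_model(self, name, predict):
        self.models[name] = predict
        if self.active is None:  # the first model automatically becomes Active
            self.active = name

    def score(self, row):
        """Score with every model, log all results, deliver only the Active one."""
        predictions = {name: predict(row) for name, predict in self.models.items()}
        self.shadow_log.append(predictions)
        return predictions[self.active]

    def promote(self, name):
        """Make a Challenger the new Active model."""
        if name not in self.models:
            raise KeyError(name)
        self.active = name

deployment = Deployment()
deployment.add_model("first_model", lambda row: "on-time")
deployment.add_model("deep_learning", lambda row: "delayed")
delivered = deployment.score({"total_flights": 300})
```

Because the Challenger predictions are logged alongside the delivered ones, their error rates can be tracked over time before deciding to promote one.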
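How the drift value is computed is not stated here. One common way to quantify such a shift is to bin both data sets and compare the histograms, for example with the total variation distance. The sketch below makes that assumption and uses invented numbers for the total flights column.

```python
from collections import Counter

def distribution(values, bin_width):
    """Relative frequency of each bin, where a bin is the value floor-divided by bin_width."""
    counts = Counter(v // bin_width for v in values)
    return {b: c / len(values) for b, c in counts.items()}

def drift(train, scoring, bin_width):
    """Total variation distance between binned distributions: 0 is identical, 1 is disjoint."""
    p = distribution(train, bin_width)
    q = distribution(scoring, bin_width)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in set(p) | set(q))

# Hypothetical total_flights values: fewer flights per day in the scoring data.
train_total_flights = [300, 310, 360, 370]
scoring_total_flights = [310, 260, 270, 280]

total_flights_drift = drift(train_total_flights, scoring_total_flights, bin_width=50)
```

Computing this per input and sorting by the result gives a ranking of the columns with the biggest difference between training and scoring data.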
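Two of the alert rules described here can be written as simple checks. The 8% and 100 thresholds come from the demo; averaging the drift over the last seven daily values is an assumption, and the function and alert names are hypothetical. The error alert is left out because its threshold is not given.

```python
def triggered_alerts(daily_drift, daily_score_counts, drift_threshold=0.08, min_scores=100):
    """Return the names of alerts that fire for the given daily history."""
    alerts = []
    last_week = daily_drift[-7:]
    # Assumption: "drift for the last week" means the mean of the last seven daily values.
    if last_week and sum(last_week) / len(last_week) > drift_threshold:
        alerts.append("drift")
    # Fires if any single day had fewer scores than the minimum.
    if any(count < min_scores for count in daily_score_counts):
        alerts.append("low score volume")
    return alerts

alerts = triggered_alerts(daily_drift=[0.1] * 7, daily_score_counts=[150, 90])
quiet = triggered_alerts(daily_drift=[0.01] * 7, daily_score_counts=[150, 120])
```

In a real setup a daily scheduled check would run rules like these and send the email when the returned list is not empty.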
16:08 So at this moment this explains why we have the seven alerts here. If I, for example, acknowledge all of them then this KPI would actually go back to zero here and everything is green. So let’s go back to our alerts and let’s trigger a new one.