In this series of four videos, RapidMiner founder Ingo Mierswa demonstrates a complete automated data science project from end to end. Part one starts with data preparation.
In part two, he will show you how to build machine learning models. And finally, he will show you how to put those machine learning models into production and how to manage them. Join him for a complete artificial intelligence project experience and learn how to leverage the fully automated data science offerings in the RapidMiner platform.
00:00 Welcome to this RapidMiner platform demonstration. My name is Ingo Mierswa and I am the founder of RapidMiner. Today I would like to show you a complete data science project from end to end, starting with the data preparation.
00:19 Then we’ll build machine learning models, and finally we will put them into production and manage those models. So it’s a truly complete artificial intelligence project experience you’re getting here. The data we’re going to use is a very famous dataset. It’s about domestic flights in the United States.
00:37 We are going to use the years 2007 and 2008. Those are the last two years available for this data set. And one of the columns describes how much each flight was delayed, if at all. Our goal is going to be to predict whether a flight is going to be delayed or not at takeoff time.
00:55 So the plane leaves the airport, and at that point in time we are going to predict if the flight will be late or not. You see the URL for the datasets here at the bottom.
01:05 And we are also going to use some supplemental data sets like weather information and other things. So it’s a true, realistic data science project
you’ll be seeing here. Before we start with the demonstration, let me quickly explain the RapidMiner platform to you. In fact,
RapidMiner is the only platform which supports everything needed from data
preparation to automated machine learning down to model operations and
management, all in an augmented and automated way.
01:30 So we start with Turbo Prep. It is obviously about learning more about your data or mashing up data sets, transforming the data, which often helps you to build better models.
01:39 Auto Model itself then can go through thousands of candidates for model types and all the different parameter values and finds the optimal models for you. After creating great machine learning models, that’s not the end of the story. You need to put them into production and manage them, and we again automatically create all the necessary management processes for you in workflows, so you don’t need to worry about that at all.
00:04 So obviously we will start with Turbo Prep. So let’s get to it. The first step for any data oriented project is probably to get the data into the system. So naturally we will do the same.
00:15 And let’s focus a little on data loading first. Data can come from local sources like a file, remote sources such as a database, or even cloud-based data sources as well, no matter where it’s coming from.
00:26 Everything is managed in repositories, including the metadata, versions, and everything else. So in this section I will show you how to set up those connections, this repository or data catalog, and also some of the metadata and versioning.
00:41 Okay, so this is RapidMiner. At the top here you can switch between the different parts. As I mentioned, we start with Turbo Prep here, and before we actually load some data in, here at the top is where you can create new connections, for example to databases, cloud-based data sources, or others. You can even extend those lists by adding additional extensions.
01:02 So if I go on and load data now, I could for example load the data from some of those connections, or import the data, let’s say from some file. If I do this, for example going with the 2007 flights data here, almost a gigabyte of data, I can go through this wizard here. So here’s what the data looks like.
01:20 And then, in the next step, RapidMiner will guess what the best type is for each column, and you can make all these kinds of changes here. I’m not going to do this here. As I said, it’s a gigabyte of data, already imported before. And the only change I made is reducing the data set to only flights going to New England airports; this is where I live, close to Boston.
01:41 So there’s the data set. On the right side here you can actually see the size of the data, the columns, and other metadata. And as the final step, let’s just load it into the system, into Turbo Prep. And here we go.
01:58 In the next section of this demonstration, we will do some data exploration. In fact, we will look into the traffic flows and also analyze some of the delays. And in this section we will show how to detect data quality issues, although we will only fix those quality issues later on. For now we just ignore them, but that’s okay.
02:16 We’ll also see some of the new visualizations in RapidMiner and how to explore the data. This is all very helpful for collaborating with your line-of-business managers or stakeholders, so they can understand what’s actually going on in the organization. So we start with the data again: roughly 220,000 flights coming into New England, 28 columns.
02:34 Some of those columns actually have some data quality problems. For example, this red bar here means that most of the values are missing. Or if I go to the far right, here is the year: all flights are from 2007 in this data set, so obviously this is a constant column, and if I hover over it, you see a stability of 100%, which is exactly what I just said about this column. Both columns are actually not very useful, but we will fix those a little bit later.
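As an aside, the "stability" measure shown in the tooltip is simply the share of rows taken up by the most frequent value, so a constant column has a stability of 100%. This is not RapidMiner's internal code, just a minimal plain-Python sketch of the idea:

```python
from collections import Counter

def stability(values):
    """Fraction of rows taken up by the most frequent value (1.0 = constant column)."""
    if not values:
        return 0.0
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values)

# A constant column like "year" in the 2007-only data set has stability 1.0.
years = [2007, 2007, 2007, 2007]
mixed = ["delayed", "on-time", "on-time", "on-time"]
```

A column whose stability is at or near 1.0 carries almost no information and is a candidate for removal, which is exactly what the auto cleansing does later.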
03:00 You can do basic things like, for example, making a copy of this data. So now we have two data sets, and that will come in handy, because if I start a transformation session, like for example this pivot here, that won’t change the original data. So let’s start this pivot here. I could, for example, throw in the states, the origin state here.
03:17 So those are all the states from which there have been flights into New England airports. And then let’s throw in the destination name here. So here are the destinations. And then let’s throw this in here as well, and it automatically selects the count, but you can change those functions here as well. So for example, from Florida to Bradley International, we got 4,800 flights. So that looks good. So far, the data has not actually been changed unless I lock in those changes by committing them. And that’s exactly what I’m going to do now.
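Conceptually, this pivot is just a group-by with a count aggregate over two keys. A rough stdlib sketch of the same operation, using invented flight records (the real column names in the data set may differ):

```python
from collections import Counter

def pivot_count(rows, row_key, col_key):
    """Count rows for each (row_key, col_key) combination, like a pivot with COUNT."""
    return Counter((r[row_key], r[col_key]) for r in rows)

flights = [
    {"origin_state": "Florida", "dest_name": "Bradley International"},
    {"origin_state": "Florida", "dest_name": "Bradley International"},
    {"origin_state": "New York", "dest_name": "Boston Logan"},
]

counts = pivot_count(flights, "origin_state", "dest_name")
```

Swapping `Counter` for a sum or mean accumulator would give the other aggregation functions the pivot dialog offers.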
03:48 So we still have our original data here and the changed data over there. So let’s generate some visualizations. I’m clicking on charts here, which will bring up all the possible visualizations. Right now we have a scatter plot here, but let’s use one of those fancy Sankey plots.
04:04 That is to say, obviously, we haven’t defined the from and to columns yet. Let’s do this, and we can see all the flight flows from the states here on the left to the different airports. Boston Logan is obviously the largest one; most flights to Logan have been coming from New York state. I guess that makes sense. Virginia probably makes sense as well, with Washington, etc. But then there’s also Florida down at the end. That was a bit of a surprise, at least to me. And if I have a look into which airports are well served by Florida flights, we have those three at the top.
04:41 So you can explore these kinds of things easily here. Let’s cancel this and clean up our data set again; we no longer need this one here. Let’s go back to our original data and create a couple of charts for this data, the 2007 data. We will actually start with the histogram for arrival delay; that is the column of interest here. Luckily, most flights are not delayed.
05:01 They are even early, or at most one hour late, although I guess technically that is already a delay. But then there are a couple of flights where this is really crazy. So let’s create a bar chart here. Well, this is not very helpful, but we could for example also aggregate the data and, for example, calculate the average flight delay, which is roughly 11 minutes. Okay. And then, for example, plot this for the day of the week. And you can see that it’s Fridays, which is number six,
05:28 The week in the United States starts with Sunday, so Friday is number six, and Fridays have lower flight delays. So why is that? Well, one reason could be that there are also fewer flights in total on Fridays, and that gives me an idea: maybe later on we can generate another feature, another column, with the total number of flights for each of the days. 05:53 We can do the same thing. Let’s go back to average here for, let’s say, the month. If we do this, we can see that summer months have higher delays; that’s typical vacation time. But the winter is harder here in the Northeast as well.
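The aggregation behind this chart boils down to a group-by with a mean. Here is a hedged stdlib sketch with invented numbers, just to make the operation concrete:

```python
from collections import defaultdict

def mean_by_group(rows, group_key, value_key):
    """Average value_key per group_key value, like an aggregate with AVERAGE."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[r[group_key]] += r[value_key]
        counts[r[group_key]] += 1
    return {g: sums[g] / counts[g] for g in sums}

flights = [
    {"day_of_week": 6, "arr_delay": 5},   # Friday (week starts on Sunday)
    {"day_of_week": 6, "arr_delay": 7},
    {"day_of_week": 1, "arr_delay": 20},  # Sunday
]
avg_delay = mean_by_group(flights, "day_of_week", "arr_delay")
```

Running the same function with `"month"` as the group key would reproduce the per-month view discussed next.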
06:05 So maybe weather has some impact. Again, let’s remember that; we can add additional weather information to the flights later on as well. Last but not least, let’s have a look at the destination name here, and you can see the highest delays are for Nantucket. That’s a tiny, tiny airport, though, and you can see this easily if you change over to count: there are only 280 flights in total, while to Boston there are 123,000. We also saw that before.
06:31 So maybe let’s focus the rest of the analysis on Boston Logan. Now let’s do some of the data preparation we came up with before, for example adding the total number of flights per day to each of the flights. We call this augmented data preparation, and you’ll see why: we can suggest matching data sets and columns to help you get your data prepared in the right way, and more efficiently.
07:01 So we’ll cover pivots, and transformations like filters, joins, and mash-ups; this is all manual feature engineering. There is also automatic feature engineering and augmented feature engineering, and we will cover those later in this demonstration.
07:14 But this is how you could do this in a manual way. Okay. So here’s our data set again. Since I’m going to join it with itself, so to speak, the first thing I want to do is create a copy again, because I’m going to do a pivot here, similar to what we saw before when we explored the data. So let’s get started with the pivot, and what we’re going to do this time is get the total number of flights for each day.
07:40 So let’s put in the days here and also the month. And if I do this, I get all the combinations. That’s all the days of a year, obviously 365.
07:48 Now finally we also throw in the destination, and that is the airport code. So BOS, for example, would be short for Boston, and that’s what we want to focus on. Let’s throw in the destination here as well. We see now that on January 1st there were 86 flights to BDL, whatever airport that is. And since we don’t care, let’s filter this down to Boston with this filter here, and we’ll be back to 365 rows: for each day, the number of flights into Boston that day. Yep, I like that result.
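In code terms, this pivot-then-filter step is a count of flights per (month, day) restricted to one destination. A minimal sketch with hypothetical records and column names:

```python
from collections import Counter

def flights_per_day(rows, airport):
    """Total flights per (month, day_of_month) into one destination airport."""
    return Counter(
        (r["month"], r["day_of_month"])
        for r in rows
        if r["dest"] == airport
    )

flights = [
    {"month": 1, "day_of_month": 1, "dest": "BOS"},
    {"month": 1, "day_of_month": 1, "dest": "BOS"},
    {"month": 1, "day_of_month": 1, "dest": "BDL"},
    {"month": 1, "day_of_month": 2, "dest": "BOS"},
]
totals = flights_per_day(flights, "BOS")
```

On the full 2007 data this would yield 365 (month, day) keys, one per day of the year, matching the 365 rows seen in the product.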
08:22 Let’s lock this in. Let’s commit this. But now, thinking about it, I don’t like it that much, because I would actually like to make a couple more changes. The first is that I would like to rename this data set
to something like “total flights”. I often like to do this; it’s easier for me to remember what the different data sets are about. But you will always see what the original data source was, and you see the changes to the data, not many so far, but you get the full data lineage, so you can always see what happened to your dataset. So what else would I like to change?
08:49 First, this column is now stable; it’s a constant column, everything is Boston. So let’s hit the delete key and remove this column. And while we are in this session, let’s also rename this one here; I don’t like that “count” name too much, so let’s call it “total flights”. Okay.
09:07 So that’s looking good. Let’s commit this change here. And I think you already know that we want to merge this total flights information with the original data. But before we do that, let’s also filter down to just Boston here. So again, we start a transformation session, pick this column, and we can just filter it here like this.
09:29 So let’s go with Boston first. And while we are here, let’s also make this a little bit more realistic, because right now this covers all the different carriers, and that doesn’t even make a lot of sense. We can even see the details here, like US Airways and so on. I don’t want to endorse anyone here, but let’s say I often go with JetBlue, so let’s go with JetBlue Airways. Realistically, I would focus on a single airport if I built a model, and since I’m doing this only for my own flights, why would I care about the flights from other airlines?
10:00 So let’s care about our own flights and just assume I’m somebody working for JetBlue, for the sake of this demonstration. All right, so that looks good. Now we have Boston here, JetBlue; all is perfect. Let’s now merge it with our total flights data, and this is where the augmented data prep comes in: right out of the gate, we suggest to you that this other data set is a perfect match.
10:21 I mean, that’s not too impressive. But just imagine there were a dozen other data sets; you would get the recommendation for the best matching data sets right here. And the same is true for the columns here.
10:33 So for example, if I go with day of month here, the other dataset’s day of month will be here at the top. And again, it’s only three columns here, and it even has the same name, but this could be dozens of columns, and then this really helps you to make the perfect matches in a fast way. The same is true for month here.
10:48 So that all looks good. Let’s update the preview. Just as a reminder, what is this join doing? We have the two join keys here; those two keys make sure that we are only joining the total number of flights for each of the days. Then we have all the original information here in blue, and at the end we have this new green column here, which is the total number of flights.
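The join itself is a key lookup: for each flight row, attach the precomputed daily total using (month, day of month) as the composite key. A minimal sketch (the column names are assumptions, not the data set's actual headers):

```python
def join_totals(flights, totals):
    """Left-join total flights per day onto each flight row via (month, day_of_month)."""
    joined = []
    for r in flights:
        key = (r["month"], r["day_of_month"])
        # totals maps (month, day) -> total flight count for that day
        joined.append({**r, "total_flights": totals.get(key)})
    return joined

flights = [{"month": 1, "day_of_month": 1, "carrier": "B6"}]
totals = {(1, 1): 86}
result = join_totals(flights, totals)
```

Because the totals table has exactly one row per day, every flight row picks up exactly one green "total flights" value, which is what the preview shows.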
11:07 Let’s scroll down a little bit; you’ll see how this changes depending on the date here. Okay, I like the result. Let’s commit it, let’s lock it in. And that brings me now to the second idea, which is about the weather information. So let’s get this in. We no longer need this one here, so let’s remove it. But let’s load another dataset. There’s another dataset I imported earlier into one of those repositories, you remember; this is the weather information for all the US airports for all the days of the year. I see the names of the columns here, so that looks about right.
11:42 Let’s load in this data set and have a look. Yeah, it all looks good. So for example, the visibility, or min or max temperature, or dew point, whatever that number is. I like this column here, and we’ll use it later on. It’s the weather events, like: was it fog and rain, or normal weather, or snow, or something like that. You can see the least frequent one would be fog, rain, snow, and hail together; that only happened once. That’s probably good. So anyway, all this kind of information we have in this data set.
12:10 So now we would like to do the same thing again: figure out what the weather was in Boston on a particular day, and join those two data sets together. The problem, though, is that this is a proper date column, as you can see here, while here, if you remember, we had the day of month and the month. So we need to transform one of the two: either combine the two columns into one single date column, or the other way around. I think it’s easier to actually get started with the date.
12:33 So let’s start a transformation session here. Let’s first copy this column into a second column, so we have the date column twice, and, just for consistency, we call this one W_Day. The type didn’t change, but now we can just change the type. We can, for example, change it into a number by extracting the month relative to the year. Done. And we do the same thing again for the day relative to the month. Done.
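Extracting "month relative to the year" and "day relative to the month" from a proper date column is plain calendar decomposition. In Python it could look like the sketch below (the key names are invented for illustration):

```python
from datetime import date

def split_date(d):
    """Decompose a date into month-of-year and day-of-month join keys."""
    return {"w_month": d.month, "w_day": d.day}

weather_date = date(2007, 1, 15)
keys = split_date(weather_date)
```

After this step, both data sets expose the same (month, day) pair and can be joined on it.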
13:00 That was good. So let’s commit this transformation. Let’s go back to our original data, and now we can merge it. So again, the weather data is a pretty perfect match at this point. I can do a join again, but this time I need three keys. The first key: I would like to use the destination here, and I will match it with the airport code. So why is this only 0%?
13:23 Because, remember, the weather data has the information for all the airports. That’s the reason why it doesn’t look like a perfect match; but still, it is a pretty good match, and that’s why it’s at the top, just not as impressive as the other matches.
13:36 For day of month, it’s the same story as before. So even with a different name here, you get the matching columns at the top. I really like this; it just makes life so much easier. Update the preview. Yup, those are the three join keys. Then again, in blue we have all our flight information, but now, in green, we have all the weather information for each of the days in Boston. So that all looks good.
13:59 Let’s do two more things. Actually, one more thing before we turn our attention to data cleansing and then modeling. The last thing: I don’t really care about the exact number of minutes of delay here. I would like to generate a new column, which is also what we’ll call our target column: delay class. You can use all kinds of functions here for generating new columns, like text transformations or mathematical functions; everything is in here. One of the functions is “if”, and you can always see an example here.
14:29 You can just drag those functions in here and build a formula with drag and drop. You can also just type. So for example, if the delay is larger than 15, let’s call this a delayed flight; otherwise we call it on-time. And that looks good.
14:46 No error messages. And as we can see: yep, smaller than 15, on-time; larger than 15, delayed. Perfect. Let’s commit this. So there’s the column we’re going to predict. But before we do this, we need to do some data cleansing and do something about our data quality.
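The generated target column follows a simple if rule: more than 15 minutes late is "delayed", otherwise "on-time". As a sketch of the same formula in Python (the exact handling of exactly 15 minutes is an assumption based on the wording in the video):

```python
def delay_class(arr_delay_minutes):
    """Label a flight 'delayed' if it arrived more than 15 minutes late."""
    return "delayed" if arr_delay_minutes > 15 else "on-time"

labels = [delay_class(d) for d in [-5, 10, 16, 120]]
```

This single expression is all the "if" function in the formula editor amounts to; the point is that the model will later predict the label, not the raw minutes.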
15:01 So in this next section we will talk about augmented data preparation, and there’s even more augmentation here than what you’ve seen so far. Think about it: we could now go through our data set and clean everything column by column, but the product is really very powerful here, as we actually offer a completely automated way of cleaning up your data sets.
15:21 So let’s have another look into the data quality measurements, the profiling of your data, and let’s see this automated data cleansing. It can do all kinds of things: normalizations, PCA, whatever; handling problematic values, missing values, and so on. So let’s have a look.
15:38 So here’s the dataset. We still have those data quality issues here, like the missing values, or this one here, which is now also completely stable. As you can see, there are a couple of grey and reddish columns here. So again, I could now go through this here and delete those columns, or I could go into this cleansing mode here and do the same thing, for example remove or replace the missing values for this one here. There are other functionalities, like removing duplicates and others. But why would I? The easiest way is what we call auto cleansing, especially if you prepare the data for a predictive machine learning model, i.e. supervised learning.
16:18 You can even specify here what the class is that you would like to predict, and this information is then taken into account. You don’t need to do this, obviously, if you want to do clustering, or outlier detection, or other forms of machine learning. And then you just go through this wizard here, and you see that this column would be removed because it’s very stable, this one here because it has many missing values, and it fixes all the different problems here.
16:41 You can see that you can even change the data types, so that, for example, everything is numerical or categorical afterwards. If you go with RapidMiner Auto Model for modeling, you don’t need to worry about any of that, because frankly, this is something Auto Model will take care of anyway. But this can be useful if, for whatever reason, you would like to do the modeling somewhere else. So I can just leave it where it is. The same is true for normalization; it even says so here.
17:07 If you’re not really sure what you want to do, just leave the default settings here. You get a summary of the changes, apply auto cleansing, and, as we have seen before, everything is nice and good: all the problems are gone. Nothing is red here, nothing too grey; a little bit of grey is totally fine. But nothing has actually changed until I commit this change here. And now we are ready for modeling.
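At its core, the automated cleansing described here drops quality-problem columns by rule: remove near-constant columns and columns dominated by missing values. This is a simplified sketch of such rules, not RapidMiner's actual implementation, and the thresholds are purely illustrative:

```python
from collections import Counter

def auto_clean(columns, max_stability=0.95, max_missing=0.5):
    """Drop columns that are near-constant or mostly missing.

    columns: dict mapping column name -> list of values (None = missing).
    """
    kept = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        if missing_ratio > max_missing:
            continue  # mostly missing: drop
        top = Counter(present).most_common(1)[0][1] if present else 0
        if present and top / len(present) > max_stability:
            continue  # near-constant (stable): drop
        kept[name] = values
    return kept

data = {
    "year": [2007] * 4,                      # constant -> dropped
    "cancel_code": [None, None, None, "A"],  # 75% missing -> dropped
    "arr_delay": [5, 30, -2, 12],            # kept
}
cleaned = auto_clean(data)
```

Real auto cleansing also handles imputation, normalization, and type changes, but the drop rules above capture the red and grey column indicators seen in the profiler.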
17:35 But wait, there is one more thing. It’s great that we can use this interactive, data-centric way of Turbo Prep to prepare our data. And we did a lot of things: we cleaned the data, we mashed it up, we filtered the data, and that’s fantastic.
17:48 But how can you see what exactly you did? How can you reuse all the changes and apply them, for example, on new data sets? How can you automate something like that? How can somebody else understand what was going on? That is really what this next section is about.
18:02 How can you turn all those interactive changes into a visual workflow which is repeatable, can be used to build trust, and is sometimes even necessary for complying with legal or regulatory requirements? This section will show you how to create those workflows and how to repeat them or reapply them on new data sets.
18:23 I will briefly explain this visual composition framework, which we call the RapidMiner Studio designer, and then I will also point out what we call the Wisdom of Crowds, which is another form of augmented analytics. Okay. So here’s the data set again. I briefly showed the history before; this is obviously one way of seeing what happened to your data.
18:44 But there’s something even more powerful, and that’s this “create process” button. If you click this, RapidMiner generates a fully annotated workflow. Instead of actually moving things around here, I already did this before, so this is the same workflow, just cleaned up a little bit, and it’s fully annotated.
19:02 This explains every single step. So for example, this step here was the one where we renamed the date column, and if I want to change this, I could just change the name here. I can follow along, I can see exactly what happened to my data, and I can do other changes. For example, I added this additional operator here, which stores the results under the name 2008, and that’s the only other change. At the beginning I also changed the year from 2007 to 2008, so it’s just another dataset. Either way, now I can reapply this whole process.
You can see it’s running right now. I can reapply it on this new data set, and that generates all the results for the year 2008. I can even put this into the cloud or on the server and schedule the whole process, so it’s re-executed once every night. And that’s important for trust-building and repeatability; it’s important for understanding exactly what’s going on. But you can also tweak this process, as we have seen, which is very useful. Lastly, we even put this idea of augmented analytics into the workflow building and design itself. Whenever you see something green, like here, this is what we call the Wisdom of Crowds.
20:13 Those are other operations you may want to consider, based on other users’ behavior, and you can just drag them in. The same is also true for the parameters. So for example, for this join here, whenever you see something green like this, you can see what the most frequently used parameter values are. Same here: most people go with a subset; I was going with a regular expression here, but subset is pretty popular as well. And that’s extremely helpful, especially for machine learning models, to see which parameters are changed most frequently and what values most users change those parameters to.
00:05 So now, finally, we can start with some modeling. Before we jump into RapidMiner’s Auto Model, let me give you a high-level overview. We’re not going through all the details here; this overview isn’t showing them anyway. Let’s just focus on some of the boxes you see, like the dash-bordered boxes on the screen right now.
00:28 So obviously the data comes in, and then we start with what we call basic preparation. We prepare the target column if there’s a need for that. We then remove some of the columns, for example stable columns, ID-like columns, et cetera. If you already used automatic cleansing in Turbo Prep, there’s typically not much need for that, but it happens first; there’s a bit of automation going on there. Then we separate the data into training and testing sets, or validation sets, and so on.
And we move on to the next bigger block, the feature engineering block. For example, if there are date columns in your data, we extract those dates. We remember what the known values in your data are, so that if later, during the application phase, the scoring phase of the model, some new data values come along, the model is not confused and knows how to treat them. We handle missing values if there are still many missing values in the data; again, if you used auto cleansing in Turbo Prep, you probably don’t need this, but for other data sources you may. We handle nominal columns with dummy or one-hot encoding. And if there are, for example, text columns in the data, we extract features from those text columns as well.
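The dummy/one-hot encoding mentioned for nominal columns turns each category into its own 0/1 indicator column. A minimal sketch with invented category values (the weather events column would be a typical candidate):

```python
def one_hot(values):
    """Encode a nominal column as indicator columns, one per observed category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

encoded = one_hot(["fog", "snow", "fog"])
```

Most learners that expect numeric inputs (logistic regression, for instance) need this transformation, which is why Auto Model applies it automatically.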
01:38 I call this basic feature engineering because it should be a given and is not too exciting. But the really exciting thing in RapidMiner’s Auto Model is that there’s also automatic feature engineering, where we use a multi-objective optimization technique, which, by the way, is close to my own heart because it was one of the major outcomes of my own PhD thesis. So this is a really, really advanced technique to generate additional features out of the existing ones. After doing the basic feature engineering, we do this full-blown automatic feature engineering.
02:08 And this can be very dangerous, because you may introduce something we call overfitting into the feature space. But that is not going to happen with this multi-objective approach, because it actually introduces some form of regularization. We will see this a little bit later. Around the automatic feature engineering, we do all the hyperparameter tuning: we first do some pre-tuning, then the automatic feature engineering, and then the full-blown hyperparameter optimization. And then the actual model training starts.
02:35 And here’s another thing which is really important: we call this profit-sensitive scoring. This is another novel algorithm which allows you to not just optimize for accuracy or error rates or things like that, but actually tells the model how it can best impact the business gains or profits. We obviously do robust model validation, which is again another novel technique we are using here.
03:00 So there are a lot of great things going on there, and you don’t need to worry about any of this, because for you it’s just one single click. And that was actually kind of a lie, because it’s not one single click, it’s five clicks, but that’s okay. And why is it only five clicks? Because this is what we call augmented machine learning. We try to support you and give you good recommendations as much as needed, and whenever we can, we fully automate the potentially still necessary data pre-processing, like the feature engineering we have seen before, or the model selection, hyperparameter tuning, and all of that, generating the results and allowing you to compare the different models. Okay.
03:45 So now, finally, let’s go back into the product and have a look. Here’s the data we had in Turbo Prep before. I could obviously export it first and load it back into Auto Model, but there’s a shortcut: I can just click on Model, and here is the data in RapidMiner’s Auto Model. So it’s the same data set. I’m not doing clustering or outlier detection here, so let’s ignore those options. What I would like to do is predict this column here, the one we have generated. All the predictors are pre-selected; all looks good. Moving on.
04:19 We will come back later to this “define costs and benefits” button here. For now, the only change I’m going to make is to tell RapidMiner that we would really like to focus on the delayed flights first; that’s what we want to avoid. And this next step here is what I would call the heart of RapidMiner’s Auto Model. It’s not too exciting at this point, because we already did so much work in Turbo Prep and cleaned up so much. You see the simple traffic lights; a couple of red columns could, for example, mean you should get rid of them, but Turbo Prep already took care of this. Here, however, is something absolutely mind-blowing.
04:53 So, for example, this column here is highlighted as a yellow one, but everything else is green. What is this column? Well, it’s the arrival delay. Wait a second, that’s the one we used to calculate our delay class, the one we want to predict. You remember: if it’s bigger than 15, then it’s delayed;
05:10 otherwise it’s on-time. Well, yeah, if this is still in the data, all the models would pick up on it and say: if it’s more than 15, then it’s delayed. That’s obviously not what we are going to be interested in, and RapidMiner finds this column highly suspicious. Sometimes you’re just lucky and it’s okay to have a column like that in the data, but in this particular case, certainly not. So we should get rid of it, and RapidMiner highlights it and gives you this recommendation. That’s why we call it augmented analytics: only you know your business case, only you can make the final call on this, but we at least give you a hint. And the same is true here.
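The suspicious column here is a textbook case of target leakage: the arrival delay deterministically yields the label. One simple heuristic for flagging such columns (not necessarily the one RapidMiner uses) is to check whether a column alone predicts the target almost perfectly, i.e. each of its values maps to a single class. A rough sketch:

```python
from collections import defaultdict

def looks_leaky(column, target, threshold=0.99):
    """Flag a column whose values each map (almost) always to one target class."""
    by_value = defaultdict(list)
    for v, t in zip(column, target):
        by_value[v].append(t)
    # Count rows whose column value agrees with that value's majority class.
    consistent = sum(
        max(ts.count(c) for c in set(ts)) for ts in by_value.values()
    )
    return consistent / len(target) >= threshold

arr_delay = [5, 30, 16, -2]
label = ["on-time", "delayed", "delayed", "on-time"]
# arr_delay determines the label exactly, so this heuristic flags it.
```

As the video notes, such a flag is only a hint: only someone who knows the business case can decide whether a highly predictive column is leakage or a legitimately available signal.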
05:41 So those are the models RapidMiner believes will perform well on your data and deliver good results in a feasible amount of time. You can override this decision, for example turn this one on, but that would probably take some time, so maybe let’s not do it, and in general just focus on a couple of the more popular models, like those here: logistic regression, deep learning, gradient boosted trees, the popular kind. Typically, we do the automatic hyperparameter optimization for you, so you don’t need to worry about that. And the same is true for this basic feature engineering.
06:13 For example, extracting information out of the date columns, like the month or the quarter of the year; or, if there are multiple date columns like in our data set, we even build the differences between all of them, because that’s often very helpful as well. We’ll come back to text later. And the same is true for the automatic feature engineering, both feature selection and feature generation, but we’re not going to do this now.
06:34 Now let’s just start this run on this data set. You can see it’s running, and those were the five clicks I promised you: a really easy, wizard-based approach. While we are waiting for the results (we’re not going to wait the full time), we can have a look into some of the results which are already in.
06:55 So here’s the data set. As you can see, the arrival delay column is gone at this point, but we have a couple of additional columns now, like the information extracted from the date columns, or those differences between dates I mentioned before; they are here as well. The statistics are here already, but the models will still take some time. Why is that? Because we are not just creating one model per model class; we are generating dozens, sometimes hundreds of models for each class, to figure out what the best model is: what is the best network architecture, what is the best number of trees for gradient boosted trees, and so on. So I will stop this run here and just go back to the beginning, because we actually have a couple of pre-calculated results from a previous run.
07:39 So I’ll just load them in here so we can explore those results. Okay, here we have it. We start with the classification errors here on the left, and those models really look very good: between 1% and 2% error rates. You can see that gradient boosted trees was taking a little longer, but in total we generated more than 250 models in less than eight minutes, because a lot of this is happening in parallel. That’s pretty good.
08:06 And if you would like to see other results, like AUC or others, you can just switch over, and the same is true for the ROC curve. But to be really honest, I actually find this almost too good. So what’s going on here? It’s good that you can actually have a look into the model. So, for example, for this gradient boosted trees model, let’s have a look at this tree, or this one, or this one. Wait a second. They are all very similar. They all use this difference between the arrival time and... what is this column about? Well, that actually is the scheduled arrival time.
08:39 Well, what is the difference between the actual arrival time and the scheduled arrival time? That is our flight delay, in minutes or seconds or whatever it is, and obviously that’s the same thing as the arrival delay column we based our target on. Oh, okay. So here we made a mistake: we did use a column we shouldn’t have been using. We can see the same result here, if you look into the model-specific weights. There’s another novel algorithm from RapidMiner with which you can build model-specific weights for all the model types, no matter if it’s deep learning, gradient boosted trees, or logistic regression. These are not just regular global weights, like correlation-based weights or anything like that.
09:13 They are really model-specific, based on the local explanations; I will show this a little bit later. And here again, this difference is far too important; that is just not great. Well, we will do this again, this time without this column. But before we do, let me show you what I mentioned before, the multiple models. For example, here you can see how well the models performed, red being good and blue not so good, for the different parameter combinations. So you can inspect this here as well.
09:46 So let’s do it right this time. In this second session, about automated machine learning, we will focus on creating a robust model which also has a true business impact. So it works like it should, and you can prove what this business impact is going to be. And that’s really important, because it’s not just about model accuracy or model error rates or AUC or whatever. It’s really about what this model is doing for your organization, and whether you can show that.
10:12 So I will introduce this new form of profit-sensitive scoring; that’s a new element from RapidMiner. We will obviously correct the little mistake we made before, when we kept the arrival time column in. And I will also show you some basic text analytics, which we can easily enable here as well.
10:36 We’ll also focus a little bit more on the deep learning model this time. So before we go back into the product, let me quickly explain this profit-sensitive scoring to you. I won’t explain the whole algorithm, but what does it do and what is it for?
10:51 In general, in machine learning there are techniques where you can define what happens for certain types of errors, for example if you predict on-time but it actually is a delayed flight. Especially for two-class problems, there are certainly a dozen or so algorithms around to solve this problem in a more or less efficient way. But for more than two classes it actually is a really hard problem; there are not many algorithms for it. And those which solve it for more than two classes typically build ensembles around your actual model, and that comes at a price, because this model itself might already be an ensemble,
11:28 like, for example, gradient boosted trees or random forests, and then you build 10 or 20 of those around those ensembles. It makes the model harder to understand and it also increases training times dramatically. That is not really acceptable, but sometimes you just have more than two classes, so what can we do? We came up with a new algorithm, we call it profit-sensitive scoring, which doesn’t have those problems: it works for two or more classes without the increase in training time. It really is a great tool in your tool belt, and you should make use of it.
11:59 So how are you using this? By looking into the business impact each of the predictions can have. I’m just quickly going through this. If I, for example, predict an on-time arrival and it actually is on-time, that may have a positive impact on customer loyalty. Let’s just put a value of 10,000 here, a positive number. But if I predict on-time and it’s actually delayed, then I have two problems: I will lose some customer loyalty, and I prepared the gate too early, which means increased airport costs. I’m not an expert on that, so I don’t really know the exact numbers, but that’s what’s happening; it’s just an example. Okay.
12:34 If I predict delayed, but it’s actually on-time, in that case I can do some airport preparation for a delay, which costs extra, but I offset this with increased loyalty. So that’s good. And if I predict delayed and it’s actually delayed, at least I don’t have an increase in airport costs, but I still have a loyalty loss.
12:55 So this is just an example, and I will show you in the product how to use those numbers. Okay, so we’re back in Auto Model. This is still the column we want to predict; let’s go through this. In addition to focusing on delayed, I can now enter those values: it was 10,000 for the first one, if I remember correctly, then minus 20,000 here, minus 5,000 here, and minus 10,000 here.
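The core idea of profit-sensitive scoring can be sketched in a few lines: instead of predicting the most likely class, predict the class with the highest expected profit. The matrix below reuses the values from the demo, but the exact mapping of values to cells is an assumption, and the helper names are not RapidMiner's.

```python
# Profit matrix: PROFIT[predicted][actual]. Values from the demo; the
# assignment of each value to a (predicted, actual) cell is assumed.
PROFIT = {
    "on-time": {"on-time": 10_000, "delayed": -20_000},
    "delayed": {"on-time": -5_000, "delayed": -10_000},
}

def profit_sensitive_prediction(confidences):
    """confidences: dict mapping actual class -> model confidence."""
    def expected_profit(predicted):
        return sum(confidences[actual] * PROFIT[predicted][actual]
                   for actual in confidences)
    return max(PROFIT, key=expected_profit)

# With a 70% delay confidence, predicting "delayed" loses less in expectation
# than predicting "on-time", so the profit-sensitive prediction flips.
pred = profit_sensitive_prediction({"on-time": 0.3, "delayed": 0.7})
```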
13:28 So that looks good. All right, now we have custom settings for the costs, and the rest is all the same. So yes, I get rid of the arrival delay here, and remember, we also need to remove the arrival time. Fine. I always like to look into the text-ness column here as well. For example, the highest text-ness is the origin name, but to be honest, I don’t think there’s much to be learned from the origin name, i.e. the name of the airport, so I’ll just kick it out of here. The second one here is the weather events, you remember: fog, thunderstorm... Its text-ness is not high enough to make the threshold to be automatically detected as text and have text handling turned on at that level,
14:09 but I can do it here by actually forcing RapidMiner to treat this as a text column. So let’s do this. Let’s not change the models this time; all the rest is the same. We will still extract information from the dates, and now we also extract information from the text. But I’m still not turning on the automatic feature engineering, both selection and generation; that is for the third part. Instead of running it again now, I have the results here from a previous run, so I’m just going to load them into RapidMiner and quickly explain the results to you. Okay, here we go.
14:48 The first thing we can see is that we have this text information, for example over here as a word cloud: red means on-time, blue means delayed. And we can see that normal is a little bit more frequent for on-time flights, while rain, for example, is bigger for delayed, fog is bigger for delayed, and even snow is bigger for delayed. So weather potentially really has some impact here. That’s one thing we can see.
15:15 There are not lots of different words, so not much more to see here, but still. Let’s have another look into the accuracies. Yes, this looks more realistic: some of the models are no longer performing that well, while some others are doing a pretty good job. Deep learning in particular is doing a great job. It’s not just getting the highest accuracy or lowest classification error; it’s also the model which delivers the most business impact.
15:40 So how do we calculate this? That’s easy; let’s look into the details of this model here. Here we see our best option without using any model: we could make 25 million, by always predicting on-time. In that case we never treat potentially delayed airplanes in a different way, because we treat everything as if it would be on-time. Of the two constant options here, always delayed or always on-time, that’s the better one: we make 25 million.
16:09 But then we can also use whatever the model predicts. If we do this, we actually come to 38 million, and that is a gain of 13 and a half million; that’s the number you’re seeing here. So you can clearly see that this model delivers the biggest business impact, even though deep learning is otherwise pretty much a black-box model.
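The gain figure described here is just the model's total profit minus the best constant prediction (the 25-million "always on-time" baseline). A small sketch with the demo's cost values; the helper names and the cell mapping of the matrix are mine, not RapidMiner's.

```python
# PROFIT[predicted][actual], reusing the demo's cost values (mapping assumed).
PROFIT = {
    "on-time": {"on-time": 10_000, "delayed": -20_000},
    "delayed": {"on-time": -5_000, "delayed": -10_000},
}

def total_profit(predictions, actuals):
    return sum(PROFIT[p][a] for p, a in zip(predictions, actuals))

def gain(predictions, actuals):
    # Best you could do with a constant prediction ("always on-time" etc.)
    best_constant = max(
        total_profit([c] * len(actuals), actuals) for c in PROFIT
    )
    return total_profit(predictions, actuals) - best_constant

# Three flights: a perfect model beats the best constant strategy by 10,000.
g = gain(["on-time", "delayed", "on-time"], ["on-time", "delayed", "on-time"])
```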
16:28 So what else can we do with this model? Well, one thing I like is those prediction tabs here, where we get the prediction together with the confidences for the different predictions, but where we can also see, for each row, what is supporting or contradicting the prediction. For example, here it’s a delayed prediction, and in fact it was delayed. First of all, it’s supported by the origin here.
16:51 But there was also some departure delay already, which supports this prediction as well, and so on. And this is also how we calculate those model-specific weights we have seen before. So we see that in general the delays are important, but so is some other information. If there are more bold colors in one of those columns, that means overall it’s a more important column.
17:16 In this next section I will show you some results which may blow your mind. This is all about automatic feature engineering, what it does to your models, and how we manage something we call feature-space overfitting. First of all, in the results you will look at in a minute, you will see that we generated more than 175,000 models.
17:38 We tried more than 46,000 different feature set combinations, and while we were doing this, we generated 2,500 new, different features. We did all of this in less than four hours, and it led to much better models. But there’s often a problem with an approach like this, where you fully automate the feature engineering, meaning you do feature selection but also generate new columns based on the existing ones: you often run into the problem of feature-space overfitting. This was in fact the topic of my own PhD thesis, many, many years ago.
18:14 How can you make sure that this feature-space overfitting doesn’t occur? We use multi-objective optimization and a special form of regularization to keep this overfitting under control. And if that doesn’t mean anything to you, that’s totally fine as well; you’ll still be impressed by the results, because you can actually learn something from them. So in this section I will show you this automatic feature engineering approach and how to explore those results.
18:36 You can actually see the trade-off between the model and feature-space complexities and the accuracies or error rates. And then we will wrap it all up with another example of how to ensure transparency, so that you understand what’s going on and can even debug some of those Auto Model processes. All right, so let’s go into the product.
18:59 I loaded the results already, so we had a total of 175,000 models and so on; you see this here at the top. Overall, our models look like they are performing a little bit better, especially the linear models. They really benefited from this feature-space transformation and the generation of new features.
19:15 Deep learning does this inherently already as part of its approach, but it can still often benefit from additional feature generation as well, because it can actually help speed up the runtime for deep learning. Talking about deep learning:
19:29 here’s another interesting thing. While deep learning was in fact the one with the lowest error rate (that’s what this little icon means), it was not the model with the best business impact. That was actually this one here. And I always love it when that happens, because you can easily get hyped and excited about the latest great techniques like deep learning.
19:50 Well, good old linear modeling, especially in combination with powerful feature engineering, often outperforms other methods, like in this case. But since deep learning was the best one by error rate, let’s focus our attention on the feature sets for deep learning. Whenever you turn on feature engineering, both feature selection and generation, in the model type screen we’ve seen before, you will get this additional result.
20:17 And what does it mean? First of all, you get this point up here in the top right corner; that is the original feature space. It came with 30-something features; you see all of them here at the bottom, the extracted text features and everything else as well. But this model wasn’t actually that great; it had a 9 or 10% error rate. You could go with a different model, like this one here, which has a much lower complexity (shown on the y-axis), actually using only one feature, the NAS_delay. It’s not just less complex, it also performs better:
20:49 it has a lower error rate. Or you can go a little bit higher in complexity, going with all the delays here, and really drop the error rates significantly. And this is a typical trade-off curve, where you go up a little bit in complexity to, for example, end up with this one here, which is the one we used: all the delays are in here, some information about the weather, and one generated feature, which is just the product of two of those delays.
21:14 We tried almost 600 other feature sets for this particular model class, but RapidMiner figured that there is no value in adding even more complexity. And this is exactly what we call feature bloat, or this overfitting problem: you could actually drive the error rates further down by increasing the complexity up here, but not significantly. That’s why we went with this point. With this exploration you can actually learn something about the interactions between the feature sets; that’s really a fantastic capability of this feature-space exploration tool.
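The trade-off view described above can be illustrated with the multi-objective idea behind it: keep only the (complexity, error) points that no other point beats on both axes. This is the spirit of the search, not RapidMiner's actual algorithm.

```python
# Illustrative Pareto-front filter over candidate feature sets.
def pareto_front(points):
    """points: (complexity, error) tuples; lower is better on both axes."""
    def dominated(p):
        c, e = p
        return any(c2 <= c and e2 <= e and (c2, e2) != (c, e)
                   for c2, e2 in points)
    return sorted(p for p in points if not dominated(p))

# Toy candidates: the full feature set (top right, 33 features at 10% error)
# is dominated by a one-feature set that is both simpler and more accurate.
front = pareto_front([(33, 0.10), (1, 0.08), (5, 0.03), (9, 0.019)])
```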
21:51 Finally, before we actually click on this nice little green button here for deploying some models, I would like to show you that, like for Turbo Prep, also for Auto Model you can click on this open process button, and that will generate the complete process for generating all those models. It’s fully annotated. You can go inside, you can have a look into the preprocessing, you can see every single thing.
22:16 You could make changes if you want to. But again, it’s really about building the necessary trust: giving you full transparency, making sure that we are not making any mistakes and everything is as it should be. All right, this now opens up the third part of this demonstration, which is all about deployments: deploying models and managing them.
00:04 Sometimes it’s easy to forget that building a model is not actually the end of the story. It’s really where it all starts, because now you need to put the model into production and integrate the predictions with other pieces of your infrastructure. And then you need to make sure that the model stays current and is not getting worse over time. This is really what the whole Model Ops solution of RapidMiner is for.
00:28 So let’s get started with the basics: deployments. I will show you how to deploy a model, what the different model types are (we call them Champions, or Active models, and Challengers), how you can see how a model was created (again, this is really important for compliance reasons), and how you really work with Model Ops overall.
00:52 Okay, so like you’ve seen before with Turbo Prep and Auto Model, here’s another perspective here at the top, which is called Deployments. And I already have one deployment in here, in this location; you can have multiple deployment locations. Let’s ignore this one for now, I’ll come back to it a little later. Let’s add a new one and call it Arrival Delays. Okay.
01:17 So we will generate this new deployment. And if we do this, we can see that we still only have one active deployment here. This new one is not active yet, nor could it be activated, because there are actually no models in this deployment. But I can click on it and see: yep, no model, not good.
02:21 That’s why it takes a few moments. But after it’s done, we will see that we have our first model in this new deployment. And since it’s the first one, it automatically became the active model. So, let’s add another one and see what happens. It doesn’t matter where you click; let’s deploy the deep learning model here. If you add it to the same deployment, you will see that the second model becomes a Challenger model.
02:50 So what is the idea behind Active models and Challenger models? Well, the Active model is the one which produces the predictions. Whenever you use the model, or the deployment, for scoring, no matter if you upload some data and do the scoring that way or if you actually use the automatically created web services, the Active model is the one which produces those predictions. But all the Challengers produce predictions as well; they are just not delivered, but they can still be used to calculate how well those Challengers perform. And you’ll see later in this demonstration that this can be very useful, because if a Challenger becomes better over time, all you need to do is click here and make it the Active model to replace the current one.
03:31 Finally, you can also see the details of each model. I can click on this model, I can see the model which has been generated, and I see a snapshot of the full input data. That is really important, because this input data here is not just a reference, it’s a full copy. And why is that important? Because otherwise the data could have been changed in the meantime. But if you needed to prove how this model has been built,
03:57 think about GDPR in Europe, for example, and the right to explanation: you’d better know all the details, without any chance that anybody could have changed anything in the meantime. All the other results have been preserved as well and you can explore them here, and that even includes the generated process.
04:13 You can load this back into the design view. That’s not very exciting, because it’s pretty much the same process as the one we have seen before, but again, you can really prove how this model has been created. When we talk about scoring, we also need to talk about how to explain the predictions a machine learning model is making. For some models it’s really easy; for a decision tree you can pretty much follow along as you read it. But for most models it’s just not, especially for the more powerful ones. And let’s be honest, I sometimes ask people if they understand how exactly a linear regression model works, and most people wouldn’t know that either.
04:50 So the whole topic of explainable AI became really important, and RapidMiner totally believes in a complete no-black-boxes policy. That covers both how the models have been generated (we saw this before: we can always open up the processes) and also the predictions which are created by the models. So let’s have a look into how those predictions are created in scoring, and how we can explain those predictions. I have already shown those model-specific, yet model-agnostic, weights before. And then there’s another thing we call the model simulator, which I personally like a lot, and many of our users do too. Okay.
05:28 So here are our two deployments again. For now, let’s actually switch over to this one, because we already have a little more data in it; it’s a bit more interesting.
05:37 So let’s go into the flight delays deployment. Ignore the dashboard for now, we will come back to it later; I will go into the scoring here. There are two ways. One is to create scores by integrating this deployment into other systems by way of web services; we will discuss this later. Or you can just upload some data, like this. For example, today is October 3rd, the day I’m producing this video, so I thought: let’s take the 2008 data. Remember, we trained the model on 2007.
06:09 Now we take the 2008 data, also from October 3rd, and we load it in. So here’s the data. We’re not using the arrival time or arrival delay, obviously, but we detected that we have the target column already, and if you have this it’s kind of useful, of course, because you can calculate error rates right away. Okay, so let’s feed this data in and do the scoring. After a couple of moments we get all the predictions, together with the confidence values and also the explanations, like the ones we have seen before.
06:39 We are using a faster variant of the so-called LIME algorithm for local explanations; it’s fast enough that it even works in real time. You’ll see this in the simulator a little later. It delivers those supporting and contradicting colors you’ve seen before. So for every prediction (most flights are on-time, here’s one that is delayed) you see exactly what was driving the prediction. And as I mentioned before, columns which are in general a little bolder, i.e. which support or contradict most of the predictions, are the columns which often have a higher model-specific weight.
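A rough sketch in the spirit of such local explanations: perturb one feature toward a baseline and see how the prediction confidence moves. The real LIME algorithm fits a local surrogate model around the instance; this simplified version and the toy model below are illustrations only.

```python
# A positive delta means the feature supports the prediction;
# a negative delta means it contradicts it.
def feature_contribution(model, row, feature, baseline):
    perturbed = dict(row, **{feature: baseline})
    return model(row) - model(perturbed)

def toy_model(row):
    """Toy confidence of 'delayed' that grows with the departure delay."""
    return min(1.0, row["departure_delay"] / 60)

delta = feature_contribution(toy_model, {"departure_delay": 30},
                             "departure_delay", 0)
```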
07:19 In this case we knew the actual value already. If that is not the case, you can also define the actual outcomes for given IDs later, which is really helpful to calculate those error rates; we’ll see this a little later as well. So let’s move to the simulator. Before we do, note that in general those delay columns are a bit more important. We don’t need these scores any longer, so let’s just throw them away. The simulator: what it does is show the input factors on the left side and how the model behaves on the right side.
07:52 And just by playing around, it really can help you understand how the model works. We saw that the delays are important; we see this here as well. If I just increase the delay here, for example, you see in real time how the model behavior changes and also what is driving those predictions; again, those local explanations are shown here as well. You can even turn this into prescriptive analytics. For example, say the flight already started late, but I want to do my best to actually bring it back to an on-time flight.
08:19 What can I do? I can run this optimization here, which then tells me how I would need to change certain input factors to bring this back to an on-time flight as well as possible. After this optimization is done, I can press the finish button, the optimal values are taken over here, and as we can see, yes, we still have a chance to bring this back to an on-time flight. How do we do this? Pretty much the only thing, comparing this to the average, is by bringing the taxi-out and taxi-in times down to pretty much zero.
08:53 So we’d better make sure that we get a gate close to the place where we are landing, because otherwise it’s getting even worse for us. That’s the only thing we really can do at that point in time; otherwise, bad news. Now let’s turn our attention over to the drift and model management aspect.
09:12 Sometimes you have built models on a certain data set, and after some time the performance gets worse and worse. This is often a result of changes in the world, and you can detect it by looking into the drift of the inputs, meaning how different the inputs are now compared to what they used to be. And then there’s another element to this: sometimes you are using information you should not have been using, or not using information you should have been using, and that leads to some bias in the selection of your input data. Those are really two sides of the same coin. And if you want to do ethical data science, you often need to avoid this form of bias, or at least be aware of it.
09:51 So this section will show you how to identify those input drifts, how to do this bias detection, and how to explore the differences in the drift tools. All right, let’s start with our deployment again. The first indicator is often already in this Models tab. If you just compare the recent error with what you expected when you trained the model, you can see that for the fast large margin, which is a linear model, it is a little bit worse, but not that much worse, while the deep learning model, which started very good, is no longer that good. So there’s probably some drift happening. And that’s very typical for deep learning models, because they’re just not very robust against changes in the world.
10:31 So this is one way to see this, but there’s also an extra tab up here, which we call Drift. Here we show you, for each input, where the biggest differences are. For example, remember we created this total flights column. Between the training data (the dark green here) and the scoring data, we can see that overall there seem to be fewer flights now in 2008 going into Boston. So there was a bit of a change.
10:58 That’s actually the biggest change. This column is not super important, but still. And you can see the same for nominal columns, like this one here. Again, we talked about bias before; well, the differences are not that big, but here, for example, are a couple of new flights coming in from a state which wasn’t in the training data before.
11:17 And that can also be an indicator for bias, because maybe you trained the model on cases, or didn’t train the model on cases, that left out a complete geography or something like that. Input drift can be a problem, but it doesn’t have to be. For example, this is the one we have seen before, the total flights; luckily it is not a very important column, so it sits here at the bottom left. And the really important one, the NAS delay, pretty much behaves the same, so it’s not a problem.
11:44 You really want to avoid any columns in the upper right corner here. So it’s good that we can highlight those drift problems, or potential bias problems, here, obviously in combination with being able to see the details of how the models have been created, which you remember from before. Together, that can really help you figure out whether there is a bias problem or just a drift problem, and retrain the model or take other measures if needed.
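One way to quantify this kind of input drift for a nominal column is the total variation distance between the training and scoring value distributions (0 means identical, 1 means completely disjoint). This is an illustrative metric, not necessarily the one RapidMiner uses.

```python
from collections import Counter

def drift_score(train_values, scoring_values):
    """Total variation distance between two nominal value distributions."""
    def dist(values):
        counts = Counter(values)
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}
    p, q = dist(train_values), dist(scoring_values)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))

# Weather events in training vs. scoring data: half the probability mass moved.
score = drift_score(["fog", "rain", "normal", "normal"],
                    ["snow", "normal", "normal", "normal"])
```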
12:15 Now let’s dive deeper into model management. I mentioned before that with RapidMiner you can actually close the feedback loop: we can upload those actual outcomes, or you can use web services for that; you will see this in a couple of minutes. Closing this feedback loop is really essential, because if you know the actual outcome at a later time, after you created the prediction, you can do backtesting and proper model comparisons.
12:41 So I will show this to you in the deployment dashboards and the leaderboards among the models (we have seen some of this already), and we will focus a little more on this performance dashboard, which can also help you understand the business impact of the models. Okay, so let’s turn to the dashboards. Here we see a couple of main KPIs: how many scores did we generate so far, what are the total gains (this is the orange line down here), what’s the error rate, are there any alerts going on right now, and so on.
13:17 Each bar shows the number of scores we created on a given day, and the purple line shows the average error rate; it looks pretty stable over time. We already saw the Models tab before, so let’s focus on the Performance tab. It offers even more details: again you see the scores for each model, but you can also see the error rates for the different models.
13:42 So here we can see that, while we generated a couple of scores every day and our deep learning model is doing a half-decent job, our fast large margin model is actually becoming better now. So it’s probably about time to switch over and make it the Active model, but that comes at a price.
14:01 It is, in fact, a little bit slower for scoring than the deep learning model. You can also see the different distributions here for the predicted classes, which again could be another indicator for a potential drift, if the ratio has changed. In this case, it’s more likely to be an outcome of the different cost matrices, the costs we defined before. And talking about costs: here, the purple area is the best option without any model.
14:26 That means always going with on-time; you saw this already in Auto Model. So here we can see what the impact really is over time. If we wouldn’t use any model, we would make roughly 20 million over the course of the last 30 days (we can always change the timeframe here). The cumulated line then shows what the model delivers on top of it.
14:45 As we can see, it’s almost 30 million in this case, and the difference between those two areas is how we calculate the gain, the orange curve here, which is almost 9 million for this timeframe. And that’s important for making the case for your model, for making sure the model is still delivering value. It’s not about accuracy, it’s not about error rates; it’s really about that and pretty much nothing else.
15:10 All right, moving on to the last part, which is about model operations. Model operations is really about automating all the maintenance around the deployed models: creating alerts, being aware when something is no longer working as it should, integrating the models into other pieces of your IT infrastructure. This is really what this next section is going to be about.
15:35 So let's just jump right into this. Yep, we have seen this before. Let's move over to the alerts tab here, and you can see that I've already created three different alerts. The first one is the drift alert: if the drift is greater than, let's say, 8% for the last week, and I check this once per day, then it sends an email to data scientist number seven. The same is true for the error alert. Or, if there are fewer than a hundred scores on any given day, then again an email is sent. Whenever an alert is triggered, we can see what has been triggered down here, until we acknowledge it.
16:08 That explains why we have the seven alerts here at this moment. If I, for example, acknowledge all of them, then this KPI would actually go back to zero and everything is green. So let's go back to our alerts and let's trigger a new one.
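The three alert rules described above (drift above 8% for the week, error rate above 4%, fewer than a hundred scores on a given day) amount to simple threshold checks run on a schedule. A minimal sketch, with a made-up metrics dictionary standing in for whatever the monitoring system actually tracks:

```python
# Hypothetical sketch of the alert rules from the demo. The metrics
# dict and its field names are invented for illustration.
def check_alerts(metrics):
    """Return the names of all alert rules that fire for these metrics."""
    alerts = []
    if metrics["weekly_drift"] > 0.08:   # drift alert, checked daily
        alerts.append("drift")
    if metrics["error_rate"] > 0.04:     # error-rate alert
        alerts.append("error")
    if metrics["scores_today"] < 100:    # low scoring volume alert
        alerts.append("low_volume")
    return alerts  # each entry would email the data scientist until acknowledged

print(check_alerts({"weekly_drift": 0.05,
                    "error_rate": 0.069,   # "almost 7%", above the 4% threshold
                    "scores_today": 250}))  # → ['error']
```

In the demo, only the error alert fires, which matches the "almost 7%" error rate against the 4% threshold.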
16:25 Check now. The error alert is checked at this moment: is the error rate higher than, what was it, 4%? Yes, it was almost 7%. Okay, so this has triggered, and I can just check my email inbox here, or rather data scientist number seven's email inbox, and I will get notified. So that's alerts; integrations are equally simple. You can see the URL for the web service here; I'm running this on a local server right now. And you can test it for some of those values, you can just test the web service right here. What is nice: I get all the results in JSON, but we actually support more than 30 different formats.
17:07 So we can change this if you like. Besides getting all this information here, we can even deactivate the deployment. That means the web service would still respond, but it gives a proper error message telling the system integrating this web service that the deployment is currently deactivated. And then there's the second web service here for defining the actuals.
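A scoring web service like this typically returns a small JSON document that the integrating system parses. The field names in this sketch are invented for illustration, not RapidMiner's actual response schema:

```python
import json

# Hypothetical JSON response from a scoring web service; field names
# are made up, not RapidMiner's real schema.
raw = '{"prediction": "delayed", "confidence_delayed": 0.81, "deployment_active": true}'

resp = json.loads(raw)

# Mirrors the deactivation behavior described in the demo: the service
# still responds, but the caller must handle the "deactivated" case.
if not resp["deployment_active"]:
    raise RuntimeError("deployment is deactivated")

print(resp["prediction"], resp["confidence_delayed"])  # → delayed 0.81
```

The second endpoint for the actuals works the same way in the other direction: the integrating system posts the observed outcomes back, so the platform can keep computing error rates and drift for the deployed model.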
17:30 And with that, I would like to wrap up this platform demonstration. We really managed to do a full project on flight delays, a full AI project. We learned a lot about our data, we created a predictive model, we put it into production, and we managed this model. As you saw, it's all very simple, but at the same time it stays very transparent.
17:50 And as I said before, RapidMiner is the only platform supporting everything from data prep and automated machine learning down to model ops in this augmented and automated way. But there is this whole other piece to the platform where you can actually see what exactly has happened with those annotated processes, which comes in very useful and is often also very important for compliance reasons. Thanks for staying with me for the whole hour here, and I hope you enjoyed it as much as I did. Thanks.