Intuitive Data Prep for Machine Learning

With the release of RapidMiner 9, we’ve introduced RapidMiner Turbo Prep—a brand-new data prep experience to help speed productivity. We’re addressing the data science skills gap many organizations face with this radically simple tool to help anyone from an analyst to a data scientist conquer time-consuming data preparation tasks.

Join RapidMiner Founder Dr. Ingo Mierswa for this 60-minute webinar, where he provides a detailed overview of the features of RapidMiner Turbo Prep by walking you through two example data sets.

00:00 Hi, this is Ingo from RapidMiner. Today, I would like to discuss something which I personally find extremely exciting, although you’ll see a little bit later that most people don’t really share the sentiment, and that is data preparation. I think it’s an incredibly important part of machine learning and of the data science process in general. But the way we’re doing it is just not good enough. It’s taking too long, we spend too much time on it, and it’s not a very smart approach. So today I would like to introduce something new to you, and I hope you find it as exciting as I do. It will change how you do data preparation in the future. So, let’s get started by setting the stage and thinking a little bit about what the underlying problem really is. Why would we need to focus on this in the first place? The problem is that most organizations still face a data scientist bottleneck. People want to do data science. They want to use machine learning models in production. But they can’t achieve this because they don’t have the necessary resources. And although colleges and everybody else try to train as many data scientists as possible, it’s just not fast enough, and it’s hard to hire them. We all know that. So, this is the problem. A potential solution comes in two different flavors, and maybe both can be pursued at the same time. One is: let’s empower more people to do a data scientist’s work. That means, obviously, we need to make this a little bit simpler, because data science in general, and machine learning in particular, is just too hard. And since data preparation, as we’ll see in a second, is such an important part of data science, we need to make that simpler as well.

01:49 And then the other element is this: even if we simplify everything, let’s also make sure that the resources we currently have, or are going to hire in the future, are more productive. So how can we actually accelerate this whole thing, big time? You can think of this almost like two sides of the same coin. But it is important to understand that it is still a problem, and we can only overcome it by empowering more people and by making the people we empower, or already have, more productive. All right. So how does data preparation actually play a role here? We all keep saying that more than 80% of our work as machine learning experts and data scientists is preparing the data. I personally feel it’s actually much higher. But here’s some data from somebody who did a survey among data scientists and analysts in general, asking how much time they spend on the different aspects of their work. The big blue part here is cleaning and organizing data, and that’s 60%. The light gray part is collecting data sets, which is another 19%. So that’s already 79%. And building training sets, the orange part at the top, is another 3%. So in total, 82% of our time is spent on data preparation, only 9% plus 4%, so 13%, on actually doing the modeling work, and the other 5% on other stuff. So my point really is that we spend more than 80% on data preparation, and that’s probably because we love it so much, right? We’re so excited that we can do data prep, and that’s why we spend so much time on it. Well, actually not. If we ask the same people what the least enjoyable part of data science is, 57% say, “Well, it’s actually cleaning and organizing data.” That was the 60% bucket we have seen before. So we spend 60% of our time in that bucket, and that’s the one area we enjoy least.
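
As a quick check of the arithmetic in this passage, the quoted shares can be added up in a few lines of Python. The percentages are the ones mentioned in the talk; the dictionary keys are shorthand labels chosen for this sketch, and the two modeling slices are not named in the talk, so they are labeled only by their size.

```python
# Time shares in percent, as quoted in the talk. Keys are shorthand labels
# made up for this sketch, not the survey's own wording.
time_share = {
    "cleaning and organizing data": 60,
    "collecting data sets": 19,
    "building training sets": 3,
    "modeling (9% slice)": 9,
    "modeling (4% slice)": 4,
    "other": 5,
}

PREP_KEYS = (
    "cleaning and organizing data",
    "collecting data sets",
    "building training sets",
)
MODELING_KEYS = ("modeling (9% slice)", "modeling (4% slice)")

prep_total = sum(time_share[key] for key in PREP_KEYS)
modeling_total = sum(time_share[key] for key in MODELING_KEYS)
grand_total = sum(time_share.values())

print(f"data preparation: {prep_total}%")
print(f"modeling: {modeling_total}%")
print(f"all categories: {grand_total}%")
```

The three preparation buckets add up to the 82% mentioned above, and all six slices together cover the full 100%.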

03:41 But then even collecting data sets and building training sets, the other two elements which make up the total 82% of where we spend our time, are in the top three of least enjoyed things. So we spend 80% of our time there, and then we don’t even enjoy it. I mean, that’s horrible, but that’s reality. So why is that? Why do we need to spend so much time? And why is it really not that much fun? I think one way to think about data preparation, and data science in general, is to have a look at the different approaches to data science. There is the code-based approach on the left side of the spectrum here. Then there is the data centric approach on the right side of the spectrum. And then there’s something in the middle, which is the process-based approach. Let’s quickly discuss all three of them. Code-based approaches to data science are pretty much the norm. And I’m not saying this is actually a good thing, even though I love coding. I think there’s always a place for coding. But if you keep solving the same problems over and over again by writing code, you’re actually doing it wrong. It’s just not the most efficient way. It’s also not something a lot of people can do. But the beauty of writing code is that it’s so powerful. You’re so flexible, and you can basically solve whatever you want to solve. That’s amazing. That’s definitely a huge plus. But then again, learning a language like R or Python is a little bit harder. It’s difficult. It’s not for everybody. Another thing on the plus side, though, is that it is repeatable. After you have written the code, you can apply it to the same data over and over again, and you’re supposed to get the same results. So that’s the one end of the spectrum.

05:12 The other end of the spectrum is the more data centric approach. And to say one word to make sure that you understand what I mean by this: it’s Excel. You look into the data, you change the data, you edit the data. You constantly work directly with the data. And that’s fantastic, because it’s a very intuitive approach. You really see the impact of your changes. Unfortunately, that also makes it a little bit more limited. For example, it’s very hard to do things like loops or branches in this data centric approach. So it’s not really solving all the problems, but it is very intuitive. A big drawback, though, is this: if you do 10 different things to your data set in Excel and then show me the final result, I couldn’t tell you, just by looking at the resulting data, what it is you did. And that’s what is problematic: those data centric approaches are not repeatable, and that often comes with problems for data governance as well, and also for making sure that you can apply the same approach to new data. So those are the two ends of the spectrum. And then it shouldn’t come as a surprise that the process-based approach, something we at RapidMiner have been doing for a long time, is kind of a balanced approach in the middle, especially if you also allow embedding code. It’s as flexible and powerful as coding, but it’s definitely a little more intuitive than coding. Maybe not as intuitive as the data centric approach, but it’s balanced, somewhere in the middle. It’s good for governance as well. If you have a process, this process can be applied to new data sets and can be shared with others. Other people can check it, and everybody can understand it. It helps you with your governance around machine learning and data science. But there is also one drawback. And although we are big believers in processes, I would be lying if I didn’t point this out. 
Designing processes is typically a little bit faster than coding, especially after you get used to it, and the reason is that every step in the process can encapsulate about 50 or 100 or 200 lines of code. So you can do more things quicker.

07:10 But it still requires you to think a little bit like a programmer. When you code, you write the code, then you interpret or compile it and apply it to some data, and only at the end do you see the result. While you’re creating the code, you need to envision what this code is going to do to your data. And that’s actually still true for processes as well. That’s a drawback. While you’re designing the process, you can’t see the result until you finally execute the process, and only then do you see if you made errors or if everything’s good. This is what makes it so hard. And even if you’re following a process-based approach instead of a code-based approach, it also makes things take longer than necessary. And it’s not for everybody, for the same reasons why coding is not for everybody. Thinking like a programmer is still not for everybody, and that’s the problem. Here is some research which is interesting for us as a data science platform vendor. The most important thing for most people is usability. It’s not validation or total cost of ownership, and not even functionality, such as how many machine learning models we support or anything like that. And that makes sense. If you think about this big data scientist bottleneck we discussed at the beginning, if you want to empower more people, then usability becomes the most important thing. But at the same time, even with a process-based approach, you can’t go the whole nine yards. You can’t get all the way to empowering basically everybody. So what could we do differently then? We have been thinking long and hard about this. And I think one way of actually combining the best of both worlds, getting the ease of working directly in the data while keeping the advantages of the code-based and process-based approaches, is by turning things around a little bit. 
So often people think we start with code, then you get to processes, then you apply the process to the data, and then you see the result. But that whole way of thinking might be wrong.

09:01 So let’s forget about code for now. It’s definitely important for certain people. I love it myself, but it’s not for everybody and not for every solution. So let’s drop this one. But then let’s turn around the data and the process. What if we actually work in the data and then, depending on what you do to the data, we build the process in the background? It’s almost like working with Excel. But all the changes you make to the data are recorded, almost like a video, and then turned into a process which can be applied to new data. And you can always follow along and make sure that you made no mistakes. This is really a paradigm change. And I think personally this is exciting, because I believe this is how people will create data science solutions and do data preparation in the future, since it combines all the advantages of the different approaches. You get the ease of use of a data centric approach. But since we build processes in the background, you keep all the advantages of governance and repeatability, and you keep the power, because you can still edit the process and make changes to it if you have to. You can even embed your own custom code if you have to. So we are not losing anything, but just by turning things around, you really get the best of both worlds. And I think personally this is absolutely exciting. So it’s an honor to introduce to you now RapidMiner Turbo Prep, which is doing exactly that. The whole vision of RapidMiner Turbo Prep was to provide this interactive, very data centric way of working with data, like sitting in Excel and changing the data directly. So we do all the blending and data cleansing directly in the data, but we record what you did in the background, and we build those processes, which you can then open up. And that also ties into something we discuss a lot, this whole notion that we can’t accept black boxes. 
It’s so important that you’re not hiding what you did, even if you automate things or if you work in the data centric view.
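
The idea of working in the data while a process is recorded in the background can be sketched in a few lines of Python. This is only an illustration of the paradigm, not how Turbo Prep is implemented; the class `RecordingSession`, the helper `rename`, and the example columns are all hypothetical names made up for this sketch.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Table = List[Row]
Step = Tuple[str, Callable[[Table], Table]]


class RecordingSession:
    """Hypothetical sketch: apply each change right away and record it as a step."""

    def __init__(self, data: Table) -> None:
        self.data = [dict(row) for row in data]  # work on a copy of the input
        self.steps: List[Step] = []

    def apply(self, name: str, func: Callable[[Table], Table]) -> None:
        self.data = func(self.data)      # the effect is visible immediately
        self.steps.append((name, func))  # and the step is recorded for later

    def as_process(self) -> Callable[[Table], Table]:
        """Turn the recorded steps into a function that can run on new data."""
        steps = list(self.steps)

        def process(new_data: Table) -> Table:
            result = [dict(row) for row in new_data]
            for _name, func in steps:
                result = func(result)
            return result

        return process


def rename(old: str, new: str) -> Callable[[Table], Table]:
    """Build a step that renames one column in every row."""
    def step(table: Table) -> Table:
        return [{(new if k == old else k): v for k, v in row.items()} for row in table]
    return step


session = RecordingSession([{"item": "apple", "qty": 3}])
session.apply("rename qty to quantity", rename("qty", "quantity"))

process = session.as_process()
replayed = process([{"item": "pear", "qty": 7}])  # same steps, new data

print([name for name, _ in session.steps])
print(replayed)
```

Because every change is kept as a named step, the list of steps doubles as a readable history, which is what makes the result inspectable rather than a black box.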

10:51 So building the process in the background is important for keeping those advantages, but it’s even more important because otherwise you can’t really trust those results. There are a couple more things. First of all, it offers a lot of automatic ways to improve the quality of your data, specifically for machine learning. It seamlessly integrates with Auto Model, which again is helpful for machine learning. It also handles really large data sets. I mean, although it is data centric and very interactive, you can actually work on pretty large data sets as well, and it still feels good and you don’t need to wait too long. And since we turned this whole thing around and have this paradigm change, it’s supposed to be usable by literally anyone. If you can use Excel, you can use RapidMiner Turbo Prep. Okay, so for the rest of this webinar, I would like to go into the product and show it to you, and go through two different demo scenarios so you can actually get a feeling for how this whole thing works. I hope that makes it much clearer than me going through hundreds of slides and trying to explain all the details. So we will have two demo sessions, and after each demo session I will summarize what we saw. And then at the end we will discuss a little bit more why this process-based, no-black-boxes approach is so incredibly important. All right. So, first demo session. I am going to use the good old Titanic survival data. It’s data about the passengers of the Titanic, 1,300 people. Typically, at the end, you would build a machine learning model predicting who’s going to survive. In the interest of time, we’re not going to do this today. But we can still analyze the data a little bit and do some data preparation towards this goal of understanding what was going on back then on that ship in 1912. All right, let’s go into RapidMiner. 
So this is what you typically see whenever you start RapidMiner. This is version 9.0.

12:39 One way to start using RapidMiner is clicking on Blank, which would start the process designer, and you could build processes from scratch. But obviously, today we are going to look a little bit more into Turbo Prep. So you could click on Turbo Prep to get started. You could also click on this button here at the top, wherever you are, to go to Turbo Prep. On the left side here, we will add a couple of data sets. In this case, we will start with only one data set, the Titanic data. You can pick data from wherever you want in your repositories. And whenever you select a data set, you see some information here on the right side. So we’re going with the Titanic data. By the way, don’t do this now. I would recommend following along with this webinar first, but you will get access to a recording later on. If you would then like to follow along with the recording and do it yourself, I highly recommend that, actually. You will find this Titanic data set in the samples folder. And I will tell you in the next demo session where you find those data sets, which are directly included in RapidMiner 9 as well. Anyway, so we go with the Titanic data, and let’s just load it in. The first thing we see is that we have those 1,300 rows here and twelve columns, and we see the complete data set. So you can directly start working with your data right here, and I will show you how this is done. Another thing we will spend a little more time with later on are those little quality measurements and distributions here at the top. Those are really helpful if you want to improve the quality of your data. For example, this blue column here is pretty much behaving like an ID, which means that almost every name occurs only once, which is characteristic of an ID. Often this type of column is not very helpful for machine learning models, and you often would like to get rid of it. So we highlight this kind of thing. 
But more about this later. If you want to learn more about the different columns, you can always show the details for each column and actually browse through them. So here, for example, 20% of the age information is missing. So you get some statistics, distributions and everything else here as well.

14:37 All right. So now let’s actually change the data. All the changes we make fall into one of those five groups: transform, cleanse, generate, pivot, and merge. In this first session, we will start with transforming, then a bit of generating and pivoting, and let’s see what we can learn from this data set. The first thing I would like to do is start a new transformation session. So you just click on Transform, and now what you do is always the same thing: you select a column you would like to work on. For example, I would like to work on this sex column here. I’m certainly not a prude, but the name sex, I don’t know, has a bit of a connotation to me. Maybe we should rename this whole thing. If you want to use this later on in some report, let’s call this gender. That sounds so much more sophisticated. All right. So you select a column like this one here, and then on the left side you can see all the things you can do to this column. The first one is renaming, and that’s exactly what we’re going to do. You could also change the type, remove the column, or copy it. You can filter, let’s say, to keep only the males or only the females here, sort it, and all kinds of other things. All right. So here, let’s go to rename. Let’s call this whole thing gender. And if I think this is good, I can apply this change. And we can see now this column has been renamed to gender. All good. You can also undo this change. Obviously, we’re not going to do this. But one thing is important to understand: while you’re working with your data, nothing is locked in immediately. You work on it, and at some point you need to lock in the results by committing this transformation session. That’s a very useful approach, because it allows you to work with your data very quickly, make sure that everything is exactly how you want it to be, and only then, at the end, commit it. 
If you press cancel, it just goes back to the original state: the data set is in the state it was in when you started the transformation session. In this case, I obviously want to commit. I like this result. It’s good.
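
The commit-or-cancel behavior of a transformation session can be illustrated with a small Python sketch. It assumes the data is a list of row dictionaries; the class `TransformSession` and the sample rows are hypothetical and only mirror the behavior described in the demo, not the product’s internals.

```python
import copy


class TransformSession:
    """Hypothetical preview-then-commit session over a list of row dicts."""

    def __init__(self, table):
        self._original = table
        self.preview = copy.deepcopy(table)  # every edit touches only this copy

    def rename(self, old, new):
        # Rebuild each row with the renamed key, keeping the column order.
        self.preview = [
            {(new if key == old else key): value for key, value in row.items()}
            for row in self.preview
        ]

    def commit(self):
        return self.preview  # lock in the changes

    def cancel(self):
        return self._original  # fall back to the state at session start


# Toy rows, not the real Titanic data.
passengers = [
    {"Name": "Passenger A", "Sex": "female"},
    {"Name": "Passenger B", "Sex": "male"},
]

session = TransformSession(passengers)
session.rename("Sex", "Gender")
untouched = "Sex" in passengers[0]  # the original is unchanged before commit
committed = session.commit()

second = TransformSession(passengers)
second.rename("Sex", "Gender")
restored = second.cancel()

print(untouched, list(committed[0]), list(restored[0]))
```

Working on a copy until commit is one simple way to get the behavior shown here: a preview that updates as you work, while the original data stays intact until you decide to keep the result.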

16:30 All right. So the next thing I would like to do is have a look at these two columns here, number of siblings or spouses and number of parents or children. Apparently those are about family size, but somebody divided the family into those two subgroups and created two columns. I actually would like to bring it back into one column. For example, for Miss Elizabeth Walton Allen here, I would like to create her family size on board, which is small. She was apparently traveling alone. The family size is basically the sum of this column plus this column, plus one for herself. And the next four people here are a family of four. Again, for all four people I would like to build the sum of those two columns plus one, so I would like to end up with four for all four people here. This is done by generating a new column, and we have a whole group for this. So let’s click on Generate here. We can now use this drag-and-drop interface to create a formula for this computation. The first thing we do is give this whole thing a new name, family size. And now you can simply drag the columns in, for example this one here, then I add a plus and this one, and add another plus. Also note that we constantly check the inputs. For example, right now it ends with a plus, but what am I adding here? I didn’t add anything, so there’s a mistake. I need to make this formula correct, and only if it’s correct, and if I also defined a name, can I actually update the preview. This is the final step to make sure that the formula is not just syntactically correct, but also delivers the desired results. So, for example, for this first person here, we have the family size of one. Yup, we checked that before. For the next four, we have the family size of four. That all makes sense. It looks correct. It was pretty simple. 
Even if we use a lot of columns, in green you see all the columns which have been used by the calculation, and in purple you see the generated column. So this all looks good, and like before, in order to lock in this result, we need to commit this generate session. And here we go.
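
The family size formula from this step is simple enough to write down directly. The sketch below assumes row dictionaries with two hypothetical column names, `Siblings or Spouses` and `Parents or Children`; the two sample rows are toy values chosen to reproduce the cases from the demo, a passenger traveling alone and a member of a family of four.

```python
# Toy rows; the column names are assumptions made for this sketch.
passengers = [
    {"Name": "Passenger A", "Siblings or Spouses": 0, "Parents or Children": 0},
    {"Name": "Passenger B", "Siblings or Spouses": 1, "Parents or Children": 2},
]


def add_family_size(rows,
                    siblings_col="Siblings or Spouses",
                    parents_col="Parents or Children",
                    new_col="Family Size"):
    """Return new rows with family size = siblings/spouses + parents/children + 1."""
    return [
        {**row, new_col: row[siblings_col] + row[parents_col] + 1}
        for row in rows
    ]


with_family_size = add_family_size(passengers)
family_sizes = [row["Family Size"] for row in with_family_size]
print(family_sizes)
```

The `+ 1` counts the passenger themselves, which is why someone with no relatives on board ends up with a family size of one.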

18:24 Now we have this new family size column, so we no longer really need those two columns here. One way of getting rid of them: you might remember that in the Transform group we had one section called Remove in order to drop columns. So one way of getting this started is to first go into the Transform group and then do this. Another way would be to select the columns you would like to work on, then do a right click and just select Remove here, and they would be removed. This is what I mean by a data centric way. You can work directly in the data, either by starting those sessions manually or by directly changing the data. Even nicer, actually, there’s another way: you just select them and press the Delete key. And what happens now? We automatically start a transformation session for you. We have the two columns preselected, and the Remove section is preselected as well. So all you need to do is press the Enter key and those two columns are gone. So you still have one last chance to check if this is really what you want to do, and you could press cancel otherwise. But in this case, yeah, that’s the desired result. Let’s lock it in. So that data set looks a little bit nicer already. We have a new column here, and we renamed something. Before we go to the next step, let’s actually make sure that this all makes sense. There’s one way of doing this. You can always open up this history here to see the different steps, and you could even roll back. So even if you later say, after a transformation session has been committed, “I would like to roll back to before this step,” you can select a step and click on this button here. But it’s also just a nice overview, so you can see the different steps. All right. So now let’s actually do some simple analysis here on this data. And one really powerful tool for this kind of thing is the pivot group. 
You might be familiar with pivoting or aggregation tables from Excel or from other tools, such as data visualization tools. So let’s go into the pivot group here. Again, it’s a nice drag-and-drop interface to define what you would like to see.

20:18 And let’s answer a couple of questions. For example, the first question could be: what was the average family size across all the passengers? And maybe later on we also wonder what the largest family on the Titanic was. So this is actually an aggregation. We would like to calculate an average function. One way of doing this, or actually the only way of doing this, is to take the family size here and drop it down in this aggregates area. In this particular case, since this is a number, average is preselected. And we immediately see the result we’ve been looking for. The average size of the families is 1.884. Well, what about the maximum? We could drag in the family size a second time. We can also just click on this little box down here and change the function, for example to maximum. And if I do that, we see maximum. That’s good. And the maximum family size was 11. I mean, keep in mind, in 1912 I guess family sizes were in general a bit larger, although a 1.884 average isn’t really that different from what I guess it would be today. Anyway, that’s not the topic for this webinar. So let’s change it back to average and let’s actually do some more analysis. Another thing we could do, for example, is try to figure out if the family size is different for the men and women on the Titanic. So you could easily drop in the gender, the renamed column from before, just into this group-by area. You could also do it in the column grouping. It wouldn’t really matter. But let’s just drop it here. And now we get a new data table with two rows and two columns. The two rows are for the females and the males, and then we get the average family size for those two groups. And we can indeed see that while the overall average was 1.884, women were travelling with larger family sizes than men. And again, I’m not sure if that would be that much different today, but this was more than 100 years ago. 
And I think it was less likely that women were travelling alone from Europe to the United States, except maybe if they had family over there or whatever.
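
The aggregations in this step, an overall average, a maximum, and an average per group, can be sketched with the Python standard library. The rows below are toy values, not the Titanic data, so the numbers will not match the 1.884 and 11 from the demo; the function `aggregate` and the column names are assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Toy rows for illustration only.
rows = [
    {"Gender": "female", "Family Size": 1},
    {"Gender": "female", "Family Size": 4},
    {"Gender": "female", "Family Size": 4},
    {"Gender": "male", "Family Size": 1},
    {"Gender": "male", "Family Size": 2},
]


def aggregate(rows, value_col, func, group_col=None):
    """Apply func to value_col, either over all rows or once per group."""
    if group_col is None:
        return func([row[value_col] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    return {key: func(values) for key, values in groups.items()}


overall_average = aggregate(rows, "Family Size", mean)
largest_family = aggregate(rows, "Family Size", max)
average_by_gender = aggregate(rows, "Family Size", mean, group_col="Gender")

print(overall_average, largest_family, average_by_gender)
```

Swapping `mean` for `max` corresponds to changing the aggregation function in the pivot view, and passing `group_col` corresponds to dropping a column into the group-by area.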

22:18 And we definitely saw one data point already in the table before, Miss Allen, who traveled alone. But I think it was a little bit less likely. Okay, anyway. So you can see that. Now another thing: since this is about the Titanic, obviously we have the survival column here, telling us if the person survived or not. We can also see whether there is any difference between the men and women and their family sizes depending on the survival status. You could drop the survival column here into the group-by as well. Another nice way is actually creating a kind of cross table by dropping it here, into the column headers. That’s actually a real pivot, if you do something like that and build this kind of cross table here. Now we get the two rows from before, but now we have two columns, for people who did survive and people who didn’t survive. And interestingly, if you look at the second row, the difference in family size is not that large for men. It was pretty much the same. But for women it was indeed different. Women who did not survive the Titanic accident had in fact been traveling with a larger family. So this could be a bit of a first hint at what was going on there. We are not going to model this data set. We will model the other data set a little bit more, but for this data set, let’s skip it this time. But if you did this and built some machine learning model on this, you would indeed see that family size just didn’t matter as much for men. Gender is one of the most important things for survival, and if you were a man, you were pretty much doomed back then. And by the way, women and children first is still a nautical rule even today. But for the women, the family size actually had a huge impact. If you had a larger family as a woman, you in fact had a lower likelihood of survival.
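
A cross table like the one described here, with one grouping column as rows and another as column headers, can be sketched as follows. The data is made up purely to show the shape of the result and does not reproduce the Titanic figures; the function `cross_table` and the column names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only.
rows = [
    {"Gender": "female", "Survived": "Yes", "Family Size": 1},
    {"Gender": "female", "Survived": "Yes", "Family Size": 2},
    {"Gender": "female", "Survived": "No", "Family Size": 4},
    {"Gender": "female", "Survived": "No", "Family Size": 6},
    {"Gender": "male", "Survived": "Yes", "Family Size": 2},
    {"Gender": "male", "Survived": "No", "Family Size": 1},
    {"Gender": "male", "Survived": "No", "Family Size": 3},
]


def cross_table(rows, row_col, header_col, value_col):
    """Average value_col for every combination of row_col and header_col."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row[row_col], row[header_col])].append(row[value_col])
    table = defaultdict(dict)
    for (row_key, header_key), values in cells.items():
        table[row_key][header_key] = mean(values)
    return {row_key: dict(columns) for row_key, columns in table.items()}


table = cross_table(rows, "Gender", "Survived", "Family Size")
for gender, columns in table.items():
    print(gender, columns)
```

Each outer key is a row of the cross table and each inner key is a column header, so reading across one row compares survivors with non-survivors within the same gender.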

24:09 All right. We don’t need those results, so I could cancel. I could also just commit. If I do commit, this is the final outcome. But there are a couple of things left before we summarize the first demonstration and go to the other one, which is a little bit more complex and more exciting. What do we do with results like that? We already saw the history of the things we did to the data. Now we can do two things in particular. One thing is that we can just export the data. You can write it back into a repository, or write it into some Excel file or something, and you just follow those steps here to do that. So you can export the data, the final result, which is often very useful. But what is probably most exciting, and very similar to the history we have seen before, is that you can now also create the process that does exactly the same things you just did to this data. You can do this by just clicking this little process icon here. What this does is move you over to the design perspective of RapidMiner, and you get a fully annotated process with all the different steps, like the renaming, and it explains exactly what’s happening. Here you see the parameters that are set for this rename, or for removing those columns. There’s actually even a subprocess for the creation of the pivot here. So it’s the full process. If I execute this process now, you see I get exactly the same result I got in Turbo Prep. And that is beautiful, because it makes it transparent. You can fully understand what is going on. You can check if everything is done correctly. You can also store this process on RapidMiner Server or in the rest of the RapidMiner platform, and deploy it or schedule it and run it on a nightly basis. Or you can share it with your colleagues, or you can turn it into a building block and reuse it for future processes.

26:02 It’s really important, this connection between working in the data and still being able to create those processes. That is exactly the paradigm change I was talking about before. Okay, so that is the first demonstration. Let’s summarize what we’ve seen so far. It’s very data centric, obviously, and there are those five main groups: transform, cleanse, generate, pivot, and merge. We have seen transform, generate, and pivot already. We will see the other two in the next demonstration. Pivot, obviously, is for those pivot tables. Merge is for everything around joins and appends. Generate is for adding new columns to your data. Cleanse is for everything related to improving your data quality, especially for machine learning. And Transform is pretty much all the rest. So those are the five main groups. You work in the data. You really see how the data is impacted. You get the feedback and the preview to make sure that the results are correct. It’s easy to roll back. If you made a mistake, just cancel, no problem there. And in addition to this, if you realize later, “No, that’s not what I wanted to do,” you can always roll back. And all the changes are recorded while you’re making them, and a process is built in the background, which can then be opened, shared, scheduled, and anything else. So that was the first demonstration.

27:19 The other one is a little bit more complex, and we will actually end up modelling some data as well. This data is about predicting flight delays. Let’s go back into RapidMiner and let’s load the data. You can just start a new session here by removing all the data sets. Okay, so now we start over again, and we have this data set here about domestic flights in the United States. I have this locally here. But by the way, as I promised, if you want to follow along, in RapidMiner 9 there’s this great new community sample repository here, which is preinstalled. It’s actually one of our cloud-based servers. There are the community data sets, then transportation, and then there are the flights and the other information as well. So you can pick it up from there. It’s, I don’t know, maybe 50 megabytes or something. I’m not loading it from there now because I want to save some bandwidth here, so I take my local copy. But it’s the same data. All right. Let’s start with the flights data. This is much larger. This is 230,000 rows. I mean, still not massive, but it’s a much larger data set, with about 30 columns. You see this information at the bottom. These are all domestic flights from New England airports, so Massachusetts, Connecticut, Vermont, New Hampshire, and Rhode Island, from the many airports in those five states, in the year 2007. So all the domestic flights, 230,000 in total. So here’s the airport code, called Origin. And there is this one column here, departure delay. That is the one we’re most interested in because, as you can see, many flights are actually delayed. This is the delay in minutes. At the end, you would like to predict if a flight is going to be delayed or not. But before we even get there, let’s do some very basic analysis. And again, we will start with some exploration in the pivot group.

29:15 The first thing I would like to know is what the airports actually are. So I can just drop in the airport code of the origin airports here. Well, that’s not even that helpful. I mean, ACK, I don’t know what that is. BOS is Boston, that one I do know. So let’s also drop in the origin name. ACK is Nantucket. Okay, good. So now, as I said, we’re interested in figuring out what the flight delays are. So let’s actually create an aggregate now with the delays, and let’s calculate the average delays for the different airports. Let’s sort this in descending order. And you can see that, in fact, the airport ACK, which is the Nantucket airport, has by far the biggest delays, on average 51 minutes. That’s crazy. I’m not sure if you’re familiar with the area here, but Nantucket is a pretty small island here in Massachusetts, and there are probably not many flights going from Nantucket. There’s a handful per week or something. And if they are delayed or not, who cares, really? So you could now try to build models for the different airports and focus on something like Nantucket first, but we probably won’t even have enough data. But the main point is nobody would really care about a predictive model for Nantucket. So let’s actually see how many flights are going from the different airports. Let’s drop in the origin again. And since this is a nominal, categorical column here, you will get the count. Let’s sort this again. And indeed, it’s only 314 flights from Nantucket. I mean, nobody is going to build a model for that. Boston Logan, no surprise, is the one with the most flights, 128,000 flights in total, and still more than 12 minutes average delay. So I would suggest let’s just focus on Boston Logan for the rest of this analysis here because, well, this is a pretty high delay for that many flights, so we should definitely build a model for Boston Logan here. I don’t need this table in this form, so I can just cancel it, as I described before.
So I go back to the original data.
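For readers who prefer to see the pivot step as code, here is a minimal sketch of the same idea: group by origin, compute the average delay and the flight count, and sort in descending order. It uses only the Python standard library. The column names `origin` and `dep_delay` and the five sample rows are made up for illustration; they are not the real data set and this is not how Turbo Prep implements it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical miniature of the flights table (illustrative values only).
flights = [
    {"origin": "BOS", "dep_delay": 20},
    {"origin": "BOS", "dep_delay": 4},
    {"origin": "ACK", "dep_delay": 60},
    {"origin": "ACK", "dep_delay": 42},
    {"origin": "BOS", "dep_delay": 12},
]

def pivot_by_origin(rows):
    """Group rows by origin and return (origin, average delay, flight count),
    sorted by average delay in descending order."""
    delays = defaultdict(list)
    for row in rows:
        delays[row["origin"]].append(row["dep_delay"])
    summary = [(origin, mean(values), len(values)) for origin, values in delays.items()]
    return sorted(summary, key=lambda item: item[1], reverse=True)

for origin, avg_delay, count in pivot_by_origin(flights):
    print(f"{origin}: average delay {avg_delay:.1f} min over {count} flights")
```

Looking at the count next to the average is the point of the exercise: a high average over very few flights is not a good basis for a model.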

31:09 All right. Let’s focus on Boston Logan. How can we do that? You can select the Origin column here and again, right click, for example, and filter it. So now the column is selected and we can say, I would like to only keep those flights from Boston. Okay, let’s apply this. So we bring it down to 128,000 rows, and now all the statistics are recalculated. So this all looks good. Well, I like that result. Let’s commit it. Now before we even get towards building machine learning models, there is one thing I know from my own experience, and I fly a lot, or at least I used to fly a lot, but I still fly quite a bit: weather is by far the number one influence on flight delays. I mean, if there’s a storm event or if there’s, I don’t know, heavy rain or lots of snow or freezing temperatures, typically it’s more likely that flights are going to be delayed. So I’ve got some good news, because I actually do have a second data set with this weather information here. Let’s load this as well. And this data set covers all the airports for every day in the year 2007. So it’s the same year. I have the weather information for the day, things like maximum temperature or visibility or humidity and so on. Did it rain? What was going on that day? So I have all this weather information for all of the airports and all the flights. So I would like to join those two data sets together now, or here, merge. Merge basically either joins, like you might know from databases, where you typically bring different columns together, or it appends rows of one data set to another data set. So in this case I’m more focused on the columns. But here’s the thing. In order to make sure that we are joining the right rows to each other, we need to make sure that we’re matching the locations of both data sets, but also the dates. So location is going to be easy because the weather information has the airport code, and here we already know that we have the airport code as well. So that’s easy.

33:06 But the problem here is that in the weather data table I have the date as a proper date column, like one column with the proper date information. And here I actually have the year, the month, and the day of the month. So it’s separated into three columns. So we need to make this match somehow. Either you turn those three columns into one date column, that’s definitely possible. Or I think it’s a bit simpler to just turn this column here into two columns with the day and the month. We don’t need the year because it’s 2007 in all cases. All right. So that is another transformation session. Let’s start this one. And the way we’ll be doing this is we are going to copy this column. Let’s give it a new name, weather day. And also we’re going to rename this one here, just for consistency reasons, weather month. All right. So now we already have two columns with the full date information. But here I only would like to see the day, and here I would only like to see the month. And that’s pretty simple. You can just change the type. You can change it to a number, and I would like to see the month. And here we’ll do the same thing, only that I say I would like to see the day. Well done. So now we have two columns here. Let’s lock this result in by committing it. So we have two columns here, and they have the same format. So now we can, in fact, merge those two data sets. And here’s another thing I really like about Turbo Prep, which helps you a lot if you have many data sets or lots of columns in your data. We actually use machine learning to predict how well those data sets will match and what the best matching columns are. So here we only have one other data set, but it also has a really high score of 94.
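The date transformation described above, turning one proper date column into a numeric month column and a numeric day column, can be sketched in plain Python as follows. The column names (`date`, `weather_month`, `weather_day`) and the sample rows are assumptions chosen for illustration, not the actual names in the data set.

```python
from datetime import date

# Hypothetical weather rows (illustrative values only).
weather = [
    {"airport_code": "BOS", "date": date(2007, 1, 15), "max_temp": 3},
    {"airport_code": "BOS", "date": date(2007, 2, 3), "max_temp": -2},
]

def add_month_and_day(rows):
    """Return copies of the rows where the date column is replaced by
    two numeric columns: weather_month and weather_day."""
    result = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != "date"}
        new_row["weather_month"] = row["date"].month
        new_row["weather_day"] = row["date"].day
        result.append(new_row)
    return result

weather_split = add_month_and_day(weather)
print(weather_split[0])
```

The year is dropped on purpose, mirroring the reasoning in the talk: it is 2007 in every row, so it adds nothing to the join key.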

34:40 So we start with the weather data set, and the flight data set is actually a good match. So let’s join here. And first, let’s focus on the location. So here we have the airport code selected for the first data set, the weather information. And now look at this. Now again, we predict that the origin column is by far the best matching column to the airport code here, and that’s exactly the column we’ve been looking for. Now, consider if you had hundreds of columns and the one you’re looking for is at the top. That’s really, really nice. Same here. If we go with the month, even if it had some cryptic name, we could still pick the month column here just because it shows a very high match here at the top. And then finally, we need to do the same thing for the day. Here we have it. And again, the day of the month is here at the top. All right. The final thing is then to create a preview here. The join keys, they look good. In blue, we have the whole weather information, that looks good as well. Then in green, we have the whole flight information for all the 128,000 rows. All is correct. Let’s commit this. Okay, so now two last steps before we actually do some modeling. One thing is, you have seen in the flight information we had this departure delay column, and it’s the number of minutes. But I actually don’t really care about predicting whether it’s exactly 130 minutes or 67 minutes. All I care about is whether this flight is going to be delayed or not. So I would like to transform this column here away from this exact numerical minute value to something which is roughly like, yeah, it’s a delay or it’s not a delay. And the way we do this is by doing another generate session here. And you can basically look for this column here, and you can drag in functions like, for example, this if function, then we add the column. Then we say, for example, well, if it’s larger than 15 minutes, then it is a delay. True. And if it’s smaller, that’s the else part.
It’s not a delay. That looks good. Oh, let’s call this column delay class. Let’s update our preview.
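The two steps just described, joining weather and flights on airport, month, and day, and then generating a delay class with an if function, could look roughly like this in plain Python. This is a sketch under assumptions: all column names and rows are hypothetical, the join is an inner join (flights without a weather row are dropped), and the 15-minute threshold is the one mentioned in the talk.

```python
# Hypothetical rows; column names are illustrative assumptions.
weather = [
    {"airport_code": "BOS", "weather_month": 1, "weather_day": 15, "visibility": 2},
    {"airport_code": "BOS", "weather_month": 1, "weather_day": 16, "visibility": 10},
]
flights = [
    {"origin": "BOS", "month": 1, "day_of_month": 15, "dep_delay": 38},
    {"origin": "BOS", "month": 1, "day_of_month": 15, "dep_delay": 0},
    {"origin": "BOS", "month": 1, "day_of_month": 16, "dep_delay": -3},
    {"origin": "BOS", "month": 1, "day_of_month": 16, "dep_delay": 31},
    {"origin": "BOS", "month": 1, "day_of_month": 17, "dep_delay": 5},  # no weather row
]

def inner_join(weather_rows, flight_rows):
    """Attach the weather of the same airport and day to every flight.
    Flights without a matching weather row are dropped (inner join)."""
    weather_by_key = {
        (w["airport_code"], w["weather_month"], w["weather_day"]): w
        for w in weather_rows
    }
    joined = []
    for flight in flight_rows:
        key = (flight["origin"], flight["month"], flight["day_of_month"])
        match = weather_by_key.get(key)
        if match is not None:
            joined.append({**match, **flight})
    return joined

def add_delay_class(rows, threshold=15):
    """Generate a boolean column: True if the delay is larger than the threshold."""
    return [{**row, "delay_class": row["dep_delay"] > threshold} for row in rows]

joined = add_delay_class(inner_join(weather, flights))
for row in joined:
    print(row["dep_delay"], row["visibility"], row["delay_class"])
```

Note that negative delays (early departures) simply end up in the not-delayed class, since they are not larger than the threshold.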

36:40 Okay, zero is false. 38 is bigger than 15. That’s a true. Yeah, that looks about right. So even for the negative ones, it’s false, false, false. True again, 31. That looks correct. So if you like the result, let’s commit it and let’s lock it in. So this is the column now, this one here, that I’m actually going to predict later on, but we still have some data quality issues here. So for example, we have a couple of columns which are pretty stable, indicated by this gray bar here, like the year, which is 100% stable because it’s 2007 for all rows. I mean, there is nothing to learn from that. There are some columns which have many missing values here, as indicated by this red missing bar. So we can now fix all those problems column by column by removing them. Or you can go into the cleanse group. And for example, for the one with many missings here, there’s functionality for replacing the missings, removing columns with low quality, duplicate handling, and everything else. You could do this. But another super exciting thing about Turbo Prep is what we call the Turbo Prep auto cleansing, which brings up a dialog here. And you can now specify, well, this is the column I actually would like to predict, and that’s important. So Turbo Prep is not accidentally taking it out if the quality is not very high, or if maybe there are some missings, the missings won’t be replaced, because those are typically the rows you would like to create predictions for. So you can define those. If you want to do some clustering or if you don’t care, then you can also click this button here. But often it’s a good idea to define the column you would like to predict later on. Now, RapidMiner is suggesting columns which should be taken out. So, for example, because of high stability, I mean, everything is Boston, that’s not helpful. Or even here for the numericals, although not everything is 16.
It’s really, by far, I mean, it’s almost stable here. Or let’s see, this one here. This has way too many missings. So take it out.
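To make the quality measures concrete, here is a small sketch of how stability and the share of missing values could be computed per column, and how low-quality columns could be flagged while protecting the column to be predicted. The definitions and the thresholds (`max_stability=0.9`, `max_missing=0.5`) are assumptions for illustration; they are not taken from RapidMiner’s documentation, and the table below is made up.

```python
from collections import Counter

def stability(values):
    """Share of the non-missing values that equal the most frequent value."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return Counter(present).most_common(1)[0][1] / len(present)

def missing_ratio(values):
    """Share of values that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)

def columns_to_remove(table, target, max_stability=0.9, max_missing=0.5):
    """Suggest low-quality columns. The target column is never suggested."""
    suggestions = {}
    for name, values in table.items():
        if name == target:
            continue  # protect the column we want to predict
        if missing_ratio(values) > max_missing:
            suggestions[name] = "too many missing values"
        elif stability(values) > max_stability:
            suggestions[name] = "too stable"
    return suggestions

# Hypothetical table stored column-wise (illustrative values only).
table = {
    "year": [2007, 2007, 2007, 2007, 2007],
    "origin": ["BOS", "BOS", "BOS", "BOS", "BOS"],
    "cancellation_code": [None, None, None, "B", None],
    "visibility": [10, 8, 2, 10, 6],
    "delay_class": [True, False, False, True, False],
}

print(columns_to_remove(table, target="delay_class"))
```

A constant column such as the year carries nothing a model can learn from, which is why a stability of 100% is treated as a reason to drop it.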

38:32 So typically, that’s a good idea. Just go with this because those are really low-quality columns. So we should probably get rid of them. The next two steps, I’m not going into the details, but they are useful, especially if you’re not going to use RapidMiner Auto Model. I mean, we will do this later, so you wouldn’t need to care. But sometimes certain machine learning models can only work, let’s say, on numerical data or only on categorical data. So you can make this choice here, and then Turbo Prep’s auto cleansing will do things like dummy encoding or discretization or binning on your data. If you’re not really sure, just leave things as they are, especially if you’re going to use Auto Model, because Auto Model will take care of this anyway. All right, so I’m not going to do anything here. Same is true for the numerical columns. You could do dimensionality reduction using a PCA, or normalization to bring things roughly to the same scale. Again, if you don’t really know or if you’re going to use Auto Model anyway, just keep things as they are. So we show you a little summary of what’s going to happen, like removing those lower-quality columns and replacing missing values. So now you can apply those changes. And if you do this, well, before, after we joined the weather information and the flight information, we had a little more than 40 columns. So we lost some of them. You still have all the rows. But look at the quality measures here at the top. It all looks really good. There’s still a bit of gray, but that’s totally fine. And now there’s no red any longer. Everything is perfectly fine. So we like the result. Let’s lock it in. And here we have it. The data set is ready for modeling. And just as a reminder, again, you can export it or model it. So maybe that is just the result. I didn’t mention it and I’m not going to, but you can start plotting things here. You can obviously see all the changes here and roll back things.
And in this case, the process we can create is actually much more complex. Look at that. There are so many different steps.
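Two of the optional conversions mentioned in the auto cleansing dialog, dummy encoding of a categorical column and normalization of a numerical column, are simple enough to sketch directly. The version below is a generic illustration in plain Python with min-max scaling as the normalization; which exact variants Turbo Prep applies is not stated in the talk, so treat the details as assumptions.

```python
def dummy_encode(values):
    """Turn one categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {
        f"is_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }

def min_max_normalize(values):
    """Rescale a numerical column to the range 0..1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(value - low) / (high - low) for value in values]

# Hypothetical columns (illustrative values only).
conditions = ["rain", "clear", "rain"]
visibility = [0, 5, 10]

print(dummy_encode(conditions))
print(min_max_normalize(visibility))
```

Dummy encoding is what lets a numeric-only learner use a categorical column, and normalization keeps columns with large value ranges from dominating distance-based or gradient-based methods.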

40:22 And if you think about this, I was explaining this whole thing, and it has subprocesses and everything. I could have been building this process by hand, obviously, and it might have taken me an hour or maybe, whatever, a couple of hours. But I did it in a couple of minutes and I explained everything. I’m pretty sure if I had not been talking, I could have built this whole thing in a minute, max. And that’s why we call it Turbo Prep, it’s so fast. And you see immediate results, and you can immediately make sure it’s all doing the right things. But still, in the end, we built the processes. So that’s exciting. But, well, let’s now do the final step and actually do some modeling. And when I click on Model, we automatically switch over to RapidMiner Auto Model. The data is preloaded. We obviously want to predict this particular column here. So let’s click on this column. You can ignore this tab, we actually care more about the delay. So let’s keep it as it is. And now Auto Model is making some final quality checks, and it’s good that it’s doing this. And this is another reason why I think you cannot accept any black box approach for machine learning, because here what it did is say, hey, wait. This departure delay column, that’s super helpful to predict this delay class column. Yeah, guess what? That’s because one column is directly based on the other. If you just forgot to remove it and if you would just go for the models with the highest accuracy, this column would stay in because it’s super helpful. But the models would be trivial. They’re not helpful. And if you weren’t able to see that, that’s a sure recipe for disaster because, I mean, you would use information which you don’t even have at the moment in time you’re doing the prediction. So long story short, it’s important that RapidMiner has pointed out this is a very suspicious column, and that’s why you should probably take it out.
But sometimes you’re just lucky and it’s just a helpful column and you can keep it in. But in this particular case, it has to go.
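One simple way to spot such a suspicious column is to check whether a single threshold on it separates the two classes without any error. The sketch below does exactly that. It is an illustrative heuristic only, not the check Auto Model actually performs, and the sample values are made up.

```python
def perfectly_separates(values, labels):
    """Return True if some threshold on `values` splits the True and False
    labels without a single error, which hints at target leakage."""
    positives = [v for v, label in zip(values, labels) if label]
    negatives = [v for v, label in zip(values, labels) if not label]
    if not positives or not negatives:
        return False  # only one class present, nothing to separate
    return min(positives) > max(negatives) or max(positives) < min(negatives)

# Hypothetical columns (illustrative values only).
dep_delay = [0, 38, -5, 31, 12]
delay_class = [delay > 15 for delay in dep_delay]  # label derived from dep_delay
visibility = [10, 2, 9, 10, 3]

print("dep_delay suspicious:", perfectly_separates(dep_delay, delay_class))
print("visibility suspicious:", perfectly_separates(visibility, delay_class))
```

A flagged column still needs a human decision, as the talk points out: sometimes a column is just very helpful, and sometimes, as here, it is information you would not have at prediction time.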

42:13 So Auto Model points this out. Thanks for that. And while it suggests using those two models, I’ll add another one. Just why not? At your discretion. Well, it won’t take too long, I mean, it’s only 130,000 rows. That’s not particularly big, but it might take a bit. Maybe you should not do gradient boosted trees with automatic optimization because that could run for an hour or two, too long for this webinar. Anyway, let’s run this whole thing. Now, RapidMiner is building machine learning models and tries to optimize them. The first model is already done, Naive Bayes. You can go to the models and actually visualize the impact of the different attributes here. You can see which model performs best. Naive Bayes was okay, but GLM, the generalized linear model, is better, and logistic regression is also at 90%. Looking at the ROC curve, indeed, where you want to be at the top left, logistic regression is the best one, the green curve. So let’s go into this model here. One thing which is really nice is this simulator. I mean, this is about Turbo Prep, so I’m not spending a lot of time here, but to give you an idea how this all works together: you can see that with those settings here, right now it is more likely that the flight is not delayed, 75% likelihood. But I could, for example, let’s do something. Let’s reduce the visibility. And you can see that actually brings the likelihood for a delay up. Let’s also change the humidity a little bit. I don’t know, more wind, oh yeah, add a storm, what happens if we add more wind? Good, good, good. Temperature. Oh, look at that. Let’s make it freezing cold. And this has a huge impact, actually, on the predictions. Anyway. It shouldn’t come as a surprise, but you can simulate it. You can optimize it and everything else. So if you like one of those, again, like with Turbo Prep, you can always open the process which has been creating the full model. And it, again, is fully annotated.
You get all the preprocessing, and you get how the model was built and validated.

44:10 And that’s the beautiful thing about this whole approach. It’s not just making you fast and it’s automated, but it guides you and it tells you what you should do. It always allows you to override those decisions, but you can always see exactly what was happening and can use this as a starting point, or deploy it, or apply it on new data sets. So, as I said, it’s a paradigm change, but it really brings together the best of all the worlds. All right. Let’s go back to the slides briefly. And first of all, let’s summarize what we’ve seen in the second session here. So we talked a bit more about those machine learning-oriented quality measurements like stability, ID-ness, and missing values. They’re constantly updated so we can work towards improving the data quality for the learning models. One of the little things I personally like a lot is those matching scores for the merges or joins, basically pointing out, well, those are data sets which match very well, or within the columns, if you have hundreds of columns, here’s the column which actually best matches the column you already have selected for the other data set. Super useful little thing, actually, and I just love using machine learning to also make our lives a little bit simpler here. It’s really easy to work with your data. I mean, you saw, for example, it’s a pretty complex problem working with dates and everything, but it actually is kind of simple here in Turbo Prep. You select the column, you turn it into two columns, then extract the month and the day, and all the type transformations are done automatically. So it’s pretty cool. Auto cleansing is my go-to method at this point. I’m no longer cleaning data myself at all. Why would I? Whenever I build a model, I always let Auto Model or Turbo Prep do the preparation for me. I mean, I can always see what happens. I can always change things if I want to.
But as a benchmark, as a starting point, it’s fantastic to get to much better data sets and often much better models in a couple of clicks, really.

46:03 And as we have seen before, the results can be exported, not just the process, but also the data itself. And it’s seamlessly integrated with the whole modeling. So you saw the flight data, we merged more data sets, we cleaned it up, we created a model, and that was all done in less than five minutes. That’s pretty cool, actually. Let’s wrap up this whole discussion, one last time, to get rid of one misunderstanding. When I talk about black boxes and other people talk about black boxes, unfortunately, there’s some misunderstanding out there. And I would like you to fully understand and embrace what this misunderstanding is and make sure that you’re avoiding the right thing here. When I talk about black boxes, I’m not so much concerned about the black box which is the model itself, but I’m concerned that in many solutions out there, the way the model was created or how the data was changed is opaque to you. You don’t know. You just can’t see it. You can’t see the details, you can’t control this process. And that’s a sure recipe for disaster. And that’s in particular true if you separate the data preparation from the model building. So if you have a couple of data engineers doing some data prep in one tool and then you do the modeling in another tool, what you can no longer do is actually measure the impact of the data preparation on the model quality. If you’re not measuring this, the model might look too good, really look too good, and you get an overoptimistic estimation of how well the model works. Now you put it into production and there’s a really, really negative surprise down the road for you. So that is just not acceptable. So just to make sure that we all talk about the same thing, there are actually two types of black boxes.
And most people, when they think about machine learning and black boxes, only think about type one of black boxes, which is the machine learning model itself, and that is not always the case.

47:53 So, for example, if your machine learning model is a decision tree, that’s typically not a black box, it’s easy to understand it. It’s also easy to explain why a certain prediction was made. But things are different, let’s say, if you’re looking into an SVM model, a support vector machine model with a radial basis function kernel, or a neural network. Or with some of the other methods, even if they’re decision-tree based like gradient boosted trees, you can no longer easily understand how a model is coming to its decision by just looking at the model and deriving some explanations. And that, yes, that is one type of black box. And that doesn’t mean it’s always a problem. In fact, in many cases, if you can’t explain why a prediction was made, but you can still prove that the prediction is good and that the model is correct and has been correctly built, then this is totally acceptable in many, many use cases. You don’t always need to get rid of this black box, okay? Which is the machine learning model itself. But then there is the model creation process, and that includes the data preparation. This is what we call black box type two. And here’s the thing: black box type one sometimes is not acceptable, but then there are tools around that. So, for example, if you have a deep learning network, there’s something called LIME. And by the way, a pretty cool implementation of this LIME algorithm, or a variant of the algorithm itself, can then take the model and create the predictions and also explain how this prediction was made and what would have been the most important influence factors for this particular prediction. There are all kinds of tools around to overcome the hurdles of black box type number one. And as I said, sometimes it’s not even a problem. But for black box type number two, people somehow believe, “Oh, it’s automated. Well, let’s just take the model. It looks like it’s 98% accurate.
So it has to be good.” And they just accept this black box. And this, in my world, is never acceptable, because you don’t know how exactly the model was created. How was the data prepared? Was the data preparation correctly validated? Was the model actually correctly validated?
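One common way a hidden preparation step makes a model look too good is when the preparation itself "sees" the test data. The sketch below illustrates this with a standardization step: the correct variant learns its parameters from the training rows only, while the leaky variant learns them from training and test rows together. This is a generic illustration under made-up numbers, not a description of how any particular tool validates its preparation.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Learn mean and standard deviation from the given values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid division by zero for constant columns
    return mu, sigma

def apply_standardizer(values, params):
    """Apply previously learned parameters to any values (training or test)."""
    mu, sigma = params
    return [(value - mu) / sigma for value in values]

# Hypothetical split of one numerical column (illustrative values only).
train = [10.0, 12.0, 14.0]
test = [30.0]

params = fit_standardizer(train)               # correct: training data only
leaky_params = fit_standardizer(train + test)  # leaky: test data influences the preparation

print("validated preparation:", apply_standardizer(test, params))
print("leaky preparation:    ", apply_standardizer(test, leaky_params))
```

If the preparation is a black box, there is no way to tell which of the two variants was used, and therefore no way to trust the reported accuracy.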

49:49 And it doesn’t matter if people say, “Oh, yeah, well, we’re grandmasters and our tool is doing this right.” Or, “Yeah, we have 35 publications in machine learning journals.” So do I, and I still make mistakes from time to time. So it’s important that you do not accept this black box type number two, that you have access to the full model creation process, including all the data preparation, to make sure that what is happening is actually the right thing and the desired thing, and that no mistake has been made, and for governance reasons as well. So it’s never acceptable to have a black box of type number two. Black box type number one? Sometimes it is. Black box type number two? Never. So I think this is really important, and I wanted to spend a minute on actually going into a little bit more detail there. So a couple of key takeaways. Turbo Prep is a super fast way of working with your data, of preparing data. That’s why we call it Turbo Prep. I also think it’s kind of a fun way because you see the impact right away. It’s so much more fun than actually building a process first or even coding first and then seeing the impact afterwards. And because of both, I think it’s truly usable by anyone. I mean, if you can use Excel, you can use Turbo Prep. So you work with and clean the data, and you’re getting real-time previews. There are a lot of automatic data transformations, like, let’s think about the auto cleansing specifically for machine learning. So that’s a huge time-saver as well and typically improves the model quality big time. And whatever you do, all the processes are built in the background while you’re working there. You can always open them, and you get full transparency. There are no black boxes because, keep in mind, black box type number two is never acceptable. It’s a sure recipe for disaster. Don’t do this. And with that, I think that’s it. Thank you very much.

Related Resources