Real-world data is dirty. In order to get it ready for modeling, you need to deal with missing values, correct erroneous values, select relevant attributes and adapt dataset format to the model type – which consumes a lot of your valuable time!
Luckily, the hundreds of blending, cleansing, filtering, feature generation and munging functions in RapidMiner Studio, allow you to expertly format your data for predictive modeling – in a very fast and efficient way.
Watch this webinar to learn how to dramatically reduce the time spent on basic and advanced data prep tasks such as Data Exploration, Replacing/Imputing Missing Values, Replacing Nominal/String Values, Feature Generation, Handling Outliers, and more!
Hello, everyone. And thank you for joining us for today’s webinar: Expertly Prepared Data Produces Better Models Faster. I’m Hayley Matusow with RapidMiner and I’ll be your moderator for today’s session. I’m joined today by Tom Ott, our RapidMiner marketing data scientist. Tom will get started in just a few minutes. But first, a few quick housekeeping items for those on the line. Today’s webinar is being recorded and you will receive a link to the on-demand version via email within one to two business days. You’re free to share that link with colleagues who are not able to attend today’s live session. Second, if you have any trouble with audio or video today, your best bet is to try logging out and logging back in which should resolve the issue in most cases. Finally, we’ll have a question and answer session at the end of today’s presentation. Please feel free to ask questions at anytime via the questions panel on the right-hand side of your screen. We’ll leave time at the end to get to everyone’s questions. I’ll now go ahead and pass it over to Tom.
Thank you very much, Hayley. Good morning, everybody. And I’m glad to be here to talk to you about a problem that we as data scientists, data engineers, state analysts have to always deal with; it’s messy data, right? What we want to do is we want to take this data and we want to make the dataset more robust to build more accurate models and so forth. So give me a second here and I will go into today’s agenda.
So there are always challenges with data dirty. We’re always going to run into this. They say something along the line of 60 to 80 percent of the time, we’re going to be dealing with data cleansing, replacing missing values, maybe imputing values, changing string values, and doing a whole host of other different things like feature generation and measuring or detecting outliers. So this is what we’re going to have to do. And with RapidMiner, we always call ourselves an advanced analytics platforms because we have some really great algorithms under the hood. We can model. We can validate. We can test them. But we also have a very, very rich data set of tools available. Mostly going from data exploration where we first look at the data to data blending and data cleansing. And all those different things are operators that we have stuck with RapidMiner. When you download RapidMiner Studio, they all come with it right off the bat. And the reason why we have them is because some algorithms require the data to be transformed in such a way that it can read it and it can make sense of it. For example, a k-nearest neighbor algorithm can handle missing values. But if you want to use say a support vector machine algorithm, it would blow up right away once you load data in with missing values. Other algorithms can only handle certain data types. Maybe to go back to support vector machine, they can only handle numerical input types. So if you have maybe some string values in your data, maybe like male or female, you might have to convert them to zeros and ones to be able to use that algorithm.
Okay. So we’re going to really focus on the first part of this time-to-value dataflow, right? Data prep models invalidate and operationalize. We’re going to start at the beginning. We’re going to start right here at the data prep. Because, as I said before, to have really robust models– to have really accurate models, you need to really spend time on the data quality aspect side of your training data. To build awesome models, we need an awesome training data set. And quite honestly, I do spend all my time pretty much right here. At RapidMiner, we joke and say, “Well, they say outside that 60 to 80 percent of the time, their time is spent building data sets. And then the rest of the time is building and coding all the algorithms.” RapidMiner, it’s 90% of the time is data prep because the modeling, the evaluation is so quick because of our reusable visual programming blocks called operators.
All right. I touched a little bit upon this before just in my introduction that the data in the real world is dirty. It’s incomplete. You could have a column for occupation; and 90% of that column is filled with everybody’s occupation in your data set, and the other 10% is missing. How do you handle that? You could have bad data. If you have an age column, somebody could have entered 222. Reality is that we don’t have anybody age 222, it is–
–maybe we were using – Yes?
Oh, looks like we lost you for just one second if you don’t mind repeating what you had just said.
Sure, sure. So in the real world, data is dirty. We sometimes have incomplete data, missing values in our data set. And sometimes, it’s incredibly noisy. Meaning that perhaps somebody put the wrong age into an age column or maybe there was a negative or maybe the age doesn’t really fit in our scheme of things. Maybe the minimum age for our workforce is 16, and somebody put in 10. And also, some cases, it’s inconsistent. We may have been using a rating scale from 1 to 3 or 1 to 5 and somebody changed it to A, B, C, and D. How do you handle that type of data? How do you transform that? And how do you make this reusable? It’s ultimately we may be doing this data prep work to feed something downstream. It may not be modeling, okay? It may not be modeling. It may be to a dashboard BIQ like Tableau or Qlik Sense. How do you build processes that create these catch-all systems so that if some dirty data does slip through, how do you fix it on the fly and how do you prevent your downstream systems from breaking? So right. We have to deal with missing values. We have to correct those values. We have to generate new features. We have to find outliers, right. We need to do all that with RapidMiner and you can. Okay. Ultimately, as I said before, bad data, bad data quality, garbage in, garbage out– we all know this. This is pretty straightforward. So we know how serious this type of situation is.
With that, I’m going to switch it over to my demo in a second. But I want to just highlight a couple of things that I’m going to go through in this demo is we begin at the beginning. We begin with data exploration. The first thing we do is we use RapidMiner to visualize our data, to read the statistics, to understand what we have so that we can maybe generate some questions. Perhaps you received your data dump from a database administrator, maybe in a spreadsheet. And you start exploring it. And you realize that “Oh my God. There’s a lot of missing data.” Did he or she give you the right data dump? Okay, maybe they did. So then you have to come up with a strategy on how to replace those missing values. Other things like data blending. You may then work on replacing your missing values and then maybe you want to generate a new feature. Maybe you want to extract the date where somebody joined your company and the date where they left and calculate the elapsed time. You could do that with RapidMiner, generate new features. Likewise, we could normalize. We could detect outliers. Maybe there was some bad data entry, and it doesn’t seem to fit with everything else. What are those outliers? All those three – exploration, blending, and cleansing – are all done within RapidMiner Studio in a one-platform way. Okay.
Now, let’s switch over to RapidMiner. Give me a second here while I pull it up. Okay. For those of you who are new to RapidMiner and you first load it up, you should see this type of development interface. We have repositories over here. I’ll just touch on these briefly. This is where you save your processes, local data stores, where you make database connections so forth, and where you can get your data and build your workflows. Down here, which we’re going to talk about right now, are operators. And these are these visual programming blocks– these operators that you chain together to create a process. That can be reusable. It can be operationalized. It could cleanse your data, blend your data. And it also model your data and gets you to your results. So we’re going to focus on these two directories right here: blending and cleansing. See right over here, blending has 77 operators. That’s that number 77. And cleansing has 26. So if I were to come over here and expand this, you’ll see that it’s subfolders. There are four subfolders here: one about attributes, one about examples, table, and values.
I’m going to pause here for a moment, and I’m going to talk a little bit about RapidMiner’s nomenclature. What normally people call columns in the dataset, we call in RapidMiner attributes. What people call rows in a dataset, we typically call them examples. So when you want to say filter rows, you would actually call it use a filter example operator– filter examples. If you want to select columns, you would then use a select attribute operator. So this is something handy to understand right off the back. If you’re brand new to RapidMiner, you just download it, and you’re wondering what the heck is an example, what the heck is as an attribute, that’s what those things mean. Okay. Let’s just go back here and take a look at what we got.
Okay. Under blending, blending is really about how you blend together different attributes or how you actually blend together data sets– like we have these fantastic data joins similar to SQL Joins but done in a visual way like joining, appending, and so forth. Things like grouping, aggregating your data, all important things that you need to do maybe to slice and dice and prepare your data to things like, as I mentioned before, filtering. Filtering your data. Sampling your data. You might want to sample a percentage of your data. Or you might want to bootstrap or even balance your data. You could do that here as well too. And then as we continue through the blending, we move to cleansing. These are some of my favorite operators. These are where we come up with strategies on how to replace missing values. How do you replace integers? How do you replace nominal values or string values? What about taking care of duplicates or binning? All these different operators are all here to prep your data and prepare them in a way so that the model can be run quickly and easily. Okay. That’s it.
We’re going to start with a data set, a customer churn data set. This is a data set that we use in RapidMiner for some training. And I’ll make all these processes and this data set available to download later on after the webinar. And this data set is something that we typically will see. Maybe your database administrator gave you this dump, and it looks something like this. We drag it into the canvas. We wire it out. And we’re going to start with the data exploration. Okay. Very important. Come over here and we hit run. And when you execute this process, RapidMiner will create a results view. This is the first thing that you see. There are in this case 1000 rows. And I can scroll down through here– scroll all the way to the end. And I can see that I have 1000 rows. And also over here, up at the top, RapidMiner will say, “Hey, you have 1000 examples.” There’s that word again: examples. For rows. And it says you have 8 regular attributes. There’s that word attributes again for columns. Okay. Now, this data set’s very small. I mean, I could actually sit here and look through it. I see question marks. I could see ages, row number. And I see all these different items in here. And I could probably scan through these pretty easily. But the reality is is that we deal with probably millions of rows of data, several thousand columns or attributes. So it gets very hard to really eyeball all this data and look it over.
So how do we actually try to make more sense of it? If you’re an R user or a Python user, we’re familiar with the head command or the summary command, which will give you a high-level overview of your data. We have that here as well. We have that as a visual way– a visual tab to do it in. And that’s the statistics tab. Okay. Let’s click on that over here. Now, I’ve got a higher level overview of the data. I actually have the column name. I have the data type – we’ll talk about that in a second – and we have missing columns or missing values, and something called statistics. Okay. Now, in RapidMiner, we have different data types, which also need a little bit of explaining. The integer is self-explanatory. We know what integers are. We know what real numbers are. But we have something called polynominal or nominal value types. Those are actually string values. When you see a binominal value type, that means it’s either one or two– it’s two strings like male, female. Yes, no. If you see something called a polynominal, it means that the values have multiple strings. That could be in this case like here, payment method. Click on this. You’ll see that if I click on details here, you’ll see that I have three nominal values: credit card, cash, check. And that means they’re polynominals, more than two or multiple.
And what I also did here too is part of this data exploration that I’m talking about. How do I actually start looking at this data at a high level– in a high-level way? You could see here I already did this. I click on this payment method. I got this little chart here. It’s a little bar chart. And I also see, if I pull this out of the way, I see something called least, most. So I could see that the least entry are check. The most entry are credit card. And if I click on details over here, I got another window. I got a pop-up here. I could see what fraction of the data set each one of those entries are and what the absolute count for this. Okay. Very, very important. Very, very handy. Let’s close this for a second.
And here’s another neat thing here. This is just a little visual clue, a really quick way to inspect your data, but there’s a link here called open chart. Okay. Open chart will take you to another visualization part, workbench of RapidMiner, which is the charting function. I can either click on charts here or I can just click on open chart here. Once I do that, I’m now presented with RapidMiner’s visualization workbench. Okay. I’m going to make a distinction here. This is not Tableau or Qlik or some other BI tool. This is for the data engineer, the data scientist, the data analyst to to explore and start working with their data. See here. We have some bar charts. We could see how many entries of the credit cards, check, and cash. I could quickly see that credit card tends to be the predominant way to pay for whatever this product that we’re selling. I could see over here in my chart style, if I pull this down, I can also visualize the data in different ways: scatter charts, series charts, histograms. I could even make pie charts. So not every data type or not every data set will lend itself to visualize say maybe as a time series or a scatter chart. But there’s multiple different ways to look at it. So you can explore over here. And what I will say – I’ll make a quick little say in here – is that you can actually build your own custom charts by using the advanced charts tab. We’re not going to go into that today here, but this is just something for our viewers to remember. You can actually drag and drop and build your own charts. You could relabel the chart title. You can change line types from line to dots. You could do different things like that. So this is also available to you. And there is also on our docs.rapidminer.com website a PDF that you can download where you can see how all this works. So that’s also something to think about.
Okay. Let’s just go back to our data set here– our statistics data set. Okay. Let’s take a look at what we got here. So we just looked at payment method. Fine. But I’m looking at over here. I’m looking at my age. I have one missing age value. And if I click on this over here, I’ll see that, “Oh, okay. This is roughly my histogram distribution.” I see that I have a minimum age of 2. That doesn’t make sense. And I have a maximum age of 234. Okay, that doesn’t make sense. Okay. Let’s look at gender over here. Now, normally I should have only male and female, but I have the German word for female here. I click on this. Let’s take a look at it. Oh, I see we also have, back over to details, some other entries. So I see here that I have male, female, which is what I want. And 5% of the data set, somebody entered männlich, which is male in German. And somebody entered weiblich, which is female in German about close to 5% of the data set. So I have to do some transformations. I have to remap or I have to replace männlich and weiblich with the word male and female appropriately. All right. Good. What else we can take a look at over here?
I also have another data type called date time. In RapidMiner, when you import a date, 90% of the time, RapidMiner is smart enough to recognize, “Oh, this is a date. It’s a date format.” It’ll automatically create a special data type which will allow you to do date differencing, calculations using dates all based on the epoch or something relevant relative to weeks or months or something like this. So for those who do not understand what the epoch is, this is a time in computer science. I think it’s January 1st, 1970, where a computer programmer said, “This is the reference point.” And from there, you can calculate the elapsed milliseconds to say today’s date, November, 24th, here in this example, 2009. So you can do a lot of different date calculations over here.
Let’s take a look at this one over here. Churn date. 497 values are missing here. Okay, let’s think about this. So if I look at my data set – let’s come back to my data set here – we sell some sort of product. And this is a churn use case. So we want to understand and build a model to try to predict churn. So if I have here two columns: last transaction date – that’s the last time the customer interacted with my website or my company – and I have a date here for churn– so obviously, we know that if we have a date that they left, so we know they’ve churned. The missing ones are people who have not churned. They’re loyal. So what I’d like to do is, as I prepare my data, I want to generate a new feature. I want to generate a new feature, create a new column called churn date and make a label for: have they churned or are they loyal. And I’m going to use that and feed that into my model later on. If we get to it, we’ll build a very quick model; we could see how you do all this.
Okay. All right. Now, let’s get started. Let’s build our first process. Back to design. And I’m going to open this up because I’ve already created it. We’re going to replace the values. First thing we’re going to do is we’re going to replace that männlich word with male. And we’re going to replace the word weiblich with female. And what I’ve done is I’ve dragged in my customer data. Let’s just do a breakpoint here just to make sure it’s the same. Okay. Let’s see. Okay, we got 1,000 rows. Excellent. All right. Everything looks similar. Let’s go back to design. And I have two operators called replace. Now, let’s go see where those replace operators are by coming in here. And I search. I do replace. I could see that they’re under the blending directory, under values.
Okay. In RapidMiner, each cell holds a value. So here, we’re going to replace some value. Whatever it is. Okay. And the replace operators are actually very, very powerful. What they do is is they have a way to filter. I could either say, “Hey, I want to search everywhere in the data set or I want to search specifically in one column.” In which case, we’re going to select the gender column. And what I want to do is I want to replace everything with the word W. Okay. Here, I wrote a regular expression. They’re very powerful. You can actually type in the word. I think it’s weiblich, something like that, and replace it by female. But sometimes, you need to have a better selection criteria. So this can handle anything you put into it, and it’s also regular expression family. So for me, this is a very powerful operator. All I did was select anything that begins with W and replace it with female. Likewise, I did the same thing for the male one. Come down over here and I can select do it all. So in this case, anything that starts with the letter M throughout my entire dataset will be replaced with male. And I might not want to do that. You might not want to do that. So I come over here, and I select a single column or a single attribute. Right. There’s that word again, attribute. Column. Attribute. I select the column gender.
Okay. So before I run this process, let’s just take a look at what the data looked like before. We had here the gender column. Let’s visit our statistics just to make sure that we have four different values, right? Male, female, männlich, weiblich. And when we run this process, it should only present us with male and female. Let’s close this. Go back to design. And I’m going to run it because it’s paused right now. Right here, it’s paused. Let’s run it. And it’ll process it. And now, it’ll present me with a new data set. Let’s go back to statistics and verify this. Go back to gender. Aha! Now, we have 550 counts of male and 449 counts of female. So we did a nominal conversion. We took a string value of weiblich or männlich and mapped them or replaced them to the proper name or the proper value, which was male and female.
Excellent. Good first step. Let’s talk about doing some more cleansing. Let’s actually talk about gender. You see here I have one missing value of gender. In my customer data set, replacing missing values, there’s a whole art and science in my mind to dealing with missing values. Sometimes, you need to actually come up with a good strategy. If maybe a percentage of your data set– a small percentage, maybe less than 5% or so, are missing, you might want to delete them. You might want to filter them out. In some cases, you might not want to do that. In some cases, you might want to replace them actually with an average value or a minimum value or a tag word called missing or not applicable or something like that. It’s really going to depend on your domain. It’s going to really depend upon your industry. For instance, if I looked at the age, I have one missing age. Maybe I want to replace it with the average age or maybe a minimum. Maybe I know that all my workers start at age 16. And there’s no way that they’re going to be starting at age 2. Or if I don’t know it, since it’s only one, maybe I’d filter it out. I don’t know.
So what we’re going to do now is we’re going to talk about filtering out the missing value of gender. Okay. I’m going to open up the next process here. I’m going to build– just so you know, I’m going to build on this process here. I’m going to build it into one long process that does everything. Okay. Okay. So looks familiar. My customer data, replacing my male and replacing my weiblich. And now, I have a new operator called filter examples. And just so you know, right? We showed it before in the beginning. Type in the word filter or examples. You’ll see here it’s under blending, under examples, and under filtering. Right? Once again, that word, examples, for rows. So we’re going to now filter out the row that’s missing for gender. Another powerful operator. I love this operator. I especially love the custom filters. Once you drag and drop this in and connect the wires, you’ll see that you have a custom filter: condition class. And there’s other different really simple ways you can just filter out missing attributes, which are columns, or missing labels and so forth. We’re not going to talk about these today, but they’re there and available for you.
We’re going to look at custom filters. And if I click on add filter here, custom filters gives you like a wizard. Really, really cool. It allows you to add multiple rows. You could select payment method. You can select age. You can do different things like that. And you can do all these different types of filtering automatically. Now, you’ll see here that age comes up with an equal sign, but payment method is equals, hash codes equals. Well, when this thing is toggled on, preselect comparators, RapidMiner, based on your data type, if it’s a number, it’ll give you a comparator that’s related to numbers like greater than, equal, not equal, so forth. If it’s something that’s a nominal value like male or female or payment method, which is credit card, check, cash, it’ll give you a comparator that’s related to nominal, to strings. So like does it equal that? Does it contain it? Does it start with, end with, and so forth? You can of course override all this by just toggling this off. And then you should be presented with all the options here. So just so you know, that’s a handy thing– that if you see that this pops up and it’s not the right comparator, you could just toggle this off and do that. RapidMiner’s pretty smart. It propagates the metadata and it makes some assumptions. But sometimes, the assumptions are not 100% correct. So you have that flexibility and that ability to override that as you may need.
Here, what we’re going to do is we’re going to skip these guys here because we’re just focusing on gender. One other thing here. This match all, match any, this is really your end and your or. So you can match gender and age specifically or gender or age or something along the lines. You could toggle between these. These are really, really handy. So let’s hit okay. I’m going to come over here, and I’m going to put a breakpoint here. This stops the process, right, as we did the replacement. And now, we’ll see here we have 1,000 examples. I’m going to filter out that one missing example. So when I filter out that example, I should have only 999 examples. Let’s run it again. Now, here we go. Now, I have 999 examples. Go to my statistics page, double check that I don’t have a missing gender. There we go. Zero missing genders. Awesome. Awesome. We’re getting closer. We’re getting closer to what we need to do here.
Next thing I’m going to do is I’m going to actually filter out my age. And I want to normalize the age because I want to start prepping it. I’m thinking of modeling– maybe put into a k-nearest neighbor. Something along that line. So let’s open up the next process here. And you’ll see I’ll keep building on the same process here–
Tom, it looks like we lost you for just a second there as well. Sorry.
Oh, okay. Okay. So I open up my process, my new one, where I’m going to filter the age and normalize the age. This process looks like what we saw before. I replace. I replace. Filter. And now, I’m going to filter age. Here’s what I’m going to do. I go back to my add filters. I actually have two entries. Age and age greater than 16 and less or equal to 100. And I want to match this. So this is an N function. So filter out and give me only the data that matches age greater than or equal to 16 and age less or equal to 100. Okay. And finally, at the very end, take age and normalize it. And see over here, once again, I drag the normalize operator just so you know where the normalize operator is. Normalize is under cleansing and under normalization. So you could drag that in there. You could normalize, denormalize, scale by weights– many, many different ways on how to handle your data there.
So if I drag this in and I connect the wires and present it here in my parameter window with a way to filter the columns– attributes. Attributes or columns. Here, I build a subset. As opposed to single, I build a subset. So I click on select attributes. And I can then do a couple of things. I can select which ones I want, which is also handy. This case, I’m only selecting age. I could do something like this like. I could drag and drop– push these over and do another cool way of just selecting what I need when I need it. Okay. And what I’m going to do over here is I have multiple methods, range transformation from 0 to 1, proportion transformation– but we will default with this transformation. In this case, a very popular normalization technique to remove your means or to make zero means so to speak.
Okay. So let’s go put a breakpoint here. Make sure we filter out the ages. Actually, we’ll stop right before. So we’ll inspect the data here, make sure that the age is still 2 and 234. Let’s take a look. Go back to statistics. Look at age. Aha! I have age 2 and age 234. And hopefully, when I’m done, I should have 16 to 100. So let’s go back into design. It’s going to now stop here when I press play. Okay. Let’s go back and inspect it. Aha! We could see we went from 999 down to 996. Okay. So it definitely did do a filtering here. Let’s look at the age. Oh, I see. And also, now, it gives me from age 17. So there must have been nothing that’s 16 and nothing that’s 100 but 99. So now, my age range is 17 to 99. And you see that it also filter out the missing one. Because the missing one was missing. It did not meet any of the criteria for greater or equal to 16 or less or equal than 100. So that’s gone. Okay. So the ages begin to look pretty good. If I wanted to, I can now explore it better. And I can see, aha! From my histogram over here, a lot of youngish type of people, some people in their mid-40s to 50s. And then we trail off into some people over here that are in the 80s and 90s. So those are probably great-grandmas and great-grandpas that are loving technology. Let’s go back to design. The last step now is to normalize the age. We hit run. And there we have it. Now, our age is completely normalized. Let’s go to statistics. We could take a look at it now. Click on here. And we could see that we have a minimum and a maximum, zero average, and one standard deviation. Exactly what the transformation should be doing. Perfect.
All right. Now, let’s build on top of this process. What I’m going to share with you next, actually, is – don’t get scared – is we’re going to start doing some housekeeping items, right? I mean, as we start adding these operators, it can get messy. So what we’re going to do is we’re going to take these operators here and we’re going to put them into some process because we want you to think that when you’re using RapidMiner– you start at the beginning, right, with exploration, cleansing, and so forth for data prep. We also want you to think about the end in mind. For those who are FranklinCovey fans, begin with the end in mind. So we want you to start thinking about how you’re going to operationalize, how you’re going to create processes that are easy for your colleagues to use or somebody to pick up if you want to leave your firm or somebody needs to do some troubleshooting. We want you to start thinking about how you’re going to logically organize all these wonderful operators, all these wonderful visual programming blocks here, into something that can be easily understood– makes sense for you.
Okay. So now, what we’re going to do is we’re going to open up the next process. And you see here the two filters and the two replaces are gone. You don’t see them visually here, but they’re here. They’re in a new operator called subprocess. And subprocesses are just ways for you to group different operators. They’re ways for you to make sense of things. And you can come here and you can create a tag by creating a note that say something like, “This subprocess is for replacement and filtering.” Let’s get rid of this. Delete this here. And then take it and drag it on to here. You can create a note and so forth. And then if I double click on this– because it’s a subprocess. This little tag here means that there’s something going on underneath the hood. You could see here there’s my replace. My two replaces, my two filters, and so forth. So a really great way to start creating a little subprocess or a little building blocks that you can reuse over and over again. And even though we’re not going to talk about building blocks in this webinar, go check out the post I wrote about building blocks on our RapidMiner blog. Always go to our blog. There’s some really great stuff there. Building blocks are really handy, fast ways to insert bits of processes or bits of subprocesses that you use over and over again. And I’ll just digress here for a second. I have a bunch of building blocks that I use over and over again. And that I love to use influence factors. I love to read S&P 500 data. I love to do different transformations that I personally use to make my daily work faster. Okay. So you can also just create a building block. You just click on this, and then you could save it as a building block. And it’ll become available to you. Just like this.
Okay. So once again, here’s a subprocess that does the filtering and replacing. I normalize. And now, what I’m going to do is I’m going to do– I’m introducing two new operators: generate attributes, select attributes. Okay. Go back to the beginning. Attributes are columns. So generate attributes means you’re going to generate new columns. We want to take this data – so I’ll put a breakpoint here – and then do a few things with this data. We’re going to create a new column, a new attribute, that will say to return based on this information. And what we’re going to do is also we’re going to modify the postal code here. We’re going to extract the first digit because then we’re going to feed that into a model later on. We’re going to just extract the first digit because that’s important. In Germany – I believe these are German postal codes – depending on what your first digit is depends what region of the country you’re located in. So in this churn data process, we want to maybe see if there’s any type of statistical significance to the location of where they’re at.
Okay. So let’s come over here. Back to design. And let’s investigate this generate attributes operated. Before I do, generate attributes is located under your blending directory, under attributes, under generation. Right here. Another one of my favorite operators. I use this one all the time. I call this the Swiss army knife of RapidMiner. You can do in-process calculations. You can convert. You can replace nominal values. You can do dates calculations. You can do so, so much with this operator. It’s amazing. Let’s dive into it now. I highlight it over here. And you’ll see my parameter window. All I’m left with is something called function descriptions. I click on this. So it’s edit list. I can come in here and I can create my new attributes, my new columns. I’m going to create one called column label. And I’m going to, in this case, overwrite the postal code one. If you name an attribute in here with the same name of an existing attribute, it’ll overwrite it. This comes in handy. Sometimes, you don’t want to generate hundreds of new columns. You just want to just do a transformation and keep the same name. Just change it.
Okay. Let’s take a look at the first one here, churn label. Here’s my function expression. It appears to have an if then type of statement. If missing churn date, then loyal, else churn. Okay. What happens if you don’t know how to write this expression? Where do you get any clues from that? How do you learn to write expressions in there? Well, you click on the calculator here. And this will take you into the function expression writer. This has been revamped in version six point something. And this is a fantastic, fantastic new way on writing expressions and doing in-process calculations. I love this one. You can see here that you have functions like logical functions. If then which we’re showing here. Ands, Ors. You can compare missing data, equal data. You can even transform text: cutting, starting, matching, finding, generating the length of them, indexing them. You can then replace, concatenate. You can make lower-, uppercase. Prefixes. All these different things. Trimming data. To things like mathematical functions like square roots, natural logs, in logs, and so forth. Statistical functions. Average, Max-Mins. Random numbers. Trig functions. I mean it goes on and on. I would highly recommend all of you spend some time to go and dig through this one. It’s really, really fantastic. And the ones that I use quite a bit are date calculations. Here, once the data type is date time, you can do calculations of the time before or add dates or do date differencing. All great things. And here’s the neat thing: all of these inputs, they’re right here. Here’s my regular attributes. These are all my data set. RapidMiner extracted the metadata and put it right here for you. And you could use basic constants. You could use true-false, pi, infinity, so forth. You can even use something– this is advanced functions called macros, which is also part of, I believe, a webinar that was done for us, RapidMiner, a little while ago about using loops and using advanced ETL functions and advanced data prep functions. There’s so much more that you could do with this, which is fantastic.
Let’s take a look at this. Here, what I’m using is an if then statement. I’m saying if the churn date, which is this guy right here, is missing, then the person is loyal, else, your churn. Okay. My function is done. And you’ll see here that if you wrote your function correctly, it’ll say, “Hey, your expression is syntactically correct.” If it’s not, you’ll see a little red error here, and it’ll tell you what’s going on. Okay. Cancel out of this one. Last but not least here, the postal code. So click on this. And here, what I’m doing is the postal code– I come to my regular attributes; they didn’t give you a clue. Postal code, you see this little hash sign? That’s indicative of a number, an integer. In this case, it’s an integer. But it’s saying that, “Hey, this is a numeric type.” So I can’t extract the number unless I turn it into a string and then using the prefix. So here what I can do is I’d say using the STR command, STR, which stands for string, right here– a conversion. I can convert a numeric number into a string. Then use the prefix command. And then extract the first digit or the first string value which in this case will be the first digit. And while I do that, overwrite that column. And we’re good to go.
So let’s take a look at this. So we stopped the normalize here. Let’s put a breakpoint here. And now, let’s just see the conversions. Now, I’ve generated a new feature called churn label. See here, loyal, and so forth. Churned, loyal, so forth. Right here, this is missing. So this person is loyal. And I took the postal code that originally had multiple five-digit numbers and now extracted, only the number here. All right. Great. Good. What’s the last operator here? Select attributes. Okay. I’m getting close to getting ready to model this. So let’s just take a look at the data real quick. Let’s look at statistics again. I have zero missing, but I have these guys here that are missing. But I did a conversion for churn label. So what I want to do now is I want to select the columns that I want to pass into my model. I want to take age. I want to take gender. I take payment method. I want to take my churn label, my postal code, and maybe my last transaction. There might be some sort of analysis or some sort of statistical significance on whether or not that was the last time they interacted with my website. But I don’t want to take cash code. I don’t want to take real number. I don’t really need churn data anymore because I made my feature generation that generated a feature here. So I want to be able to select those out. I want to clip them out. I want to delete them or I want to remove them.
Just so you know, RapidMiner does not delete your data unless you specifically say you want to, okay? As you build this process, RapidMiner creates metadata views, okay? So the data is not destroyed. It just gets modified and a new view is created. So even if I use this select attribute operator, which in this case is selecting the columns, the resulting dataset that gets wired out is a new view. What? you see here, that’s original. So the original data is still here. If you want to output it, you can. It’ll keep it all there. Let’s set this one up. Here, I’m going to do a subset, select the attributes. I take age, churn label, gender, so on and so forth. Hit apply and then hit run. And now, my data is almost ready to be modeled. And I do a few other things to it. I’m going to calculate outliers. And I’m going to convert this date into a number, into elapsed milliseconds. So let’s go back to design. Let’s open this next process.
Okay. First thing we’re going to do is we’re going to convert the date. And here, RapidMiner has– these are also going to become your favorite operators. Under blending, attributes types. This is where you convert numericals to nominals, numbers to strings, strings to numbers. This is where you convert your date to numbers or convert a nominal date like say month, day, year to year, month, day. Sometimes, the Europeans, they go with year, month, day. And the Americans like to go month, day, year. So you might want to do some conversions. All of these are located right here. And they’re going to become part of your best friends especially parsing numbers, formatting numbers, even set the data types. Maybe somewhere along the line, you did some major conversions, and you might want to convert things back into integers or back into nominal values or something like that. So these are also very, very powerful tools to use right here. So what we’re going to do here is we’re going to select the last transaction. We’re going to convert it to a millisecond relative to that epoch, January 1st, 1970. So let’s hop over here. Hit run. And lo and behold, based on last transaction, now, we have the elapsed milliseconds from the epoch. So we’ve converted that date, whether it was November 24th, 2009, now, to how many seconds have elapsed since that was recorded. Awesome.
Okay. So we did that conversion. Now, let’s calculate outliers. Okay. And what you see here, I’ve also started doing more housekeeping, right? Originally, we had a subprocess over here– where I click on, you’ll see the replacing the filters. Now, I created another one for the normalization. The normalization, the feature generation, the selection, and the conversion here. Just something to keep my process looking neat and tidy and something that I can use over and over again. Now, I come and drag a detect outlier operator. I do a search for detect or if I do a search for outlier– let’s do outlier. RapidMiner comes with four stock outlier detection operators. Very handy. Very good. They use different ways to detect outliers from Euclidean distances to densities to the local outlier factor and so forth. But I also want to bring your attention to a really fantastic anomaly detection suite that’s available on our extension marketplace. In that extension, they have many, many more– about 15 more different types of anomaly detection methods going from most nearest neighbor to more anomaly detection scores to clustering and measuring the performance and so forth. So if you do a lot of outlier detection, I highly recommend you go to our marketplace and you download this extension.
Over here. Another one. We’re going to leave the default values. The distance function is Euclidean. We’re going to run this process. And what it will do is it uses the local outlier factor score system. So it’s going to generate a number. The more closer or less the row or the example is to 1 the less of an outlier it is. It tends to be– as the number increases, the more of an outlier tends to be. So over here, it automatically created an outlier column. And I can sort it over here. I can go, “Okay. Click sort.” And I can see [inaudible] 3.276 is the highest. Or I can come to my statistics and I can review this as well. And I can see the minimum and the maximum and the average and so forth. So also very handy. And we’re going to now take all this data. We’re going to now put it into a process to the model.
The last process I’m going to share with you today before we open it up to questions is what all of this data processing looks like. Let’s go over here. This is the data prep. Right here. This is all data prep right here. And these guys are the actual modeling. And if I were to expand all of my operators, take them out of the subprocesses, you can kind of see that, right? RapidMiner, 90% of the time is all data prep. And here’s just your modeling. So if we run this process now, it’s going to go do all the work that we just did on this data set. And it’s now going to go and build a k-nearest neighbor model, which we can evaluate. This case, it’s not going to be so good. Roughly 60% from my recalls. And class precision overall accuracy is 59%. Not good. Now, I can come back, double click on my cross validation. And I can add my gradient boost to trees if I wanted to and quickly remodel the data and get to my – what’s the word for it – to my confusion matrix quicker. While this is going, I’m now going to hand it back over to Hayley for any questions.
Great. Thanks, Tom. So as a reminder to our audience, we’re going to be sending recording of today’s presentation within the next few business days. And like Tom said, feel free to go ahead and put your questions into the questions panel right now. It looks like we already have a few questions here right now. So I’ll go ahead and ask the first question that I see. Tom, this person says, “I’ve downloaded the anomaly detection extension from your marketplace. Are those LOF operators different than yours?”
No, the LOF algorithm is the same. One of the things about the anomaly detection suite, which I like a lot, is that they’ve actually done some very good optimization. So they actually run a little bit faster than our native operators, but it’s the same type of algorithm, the same math behind it all. Yeah.
Great. I have another question here, “At the end of your demo, you show one cross validation operator. Is there a way to automatically iterate that over multiple algorithm?”
Yes. So what I just showed you was probably the less elegant way of trying different algorithms like over here where I just copy and paste it and turn one on and turn one off. What I normally do in a production sense is we may have something called like a multiply operator, which will make multiple– so you’ll do all your data processing up here. And at the very end, you create a copy of the pre-process data. And then what you can do is you can have multiple cross validation operators to do something like this and then evaluate– and there’re different algorithms in there. Of course, more advanced stuff, as we have things called select subprocess. So you can do even more elegant solutions by doing auto modeling. I guess there’s a lot of talk about auto modeling nowadays or being able to just dump data into something in the auto model stuff and see which is the best model. We’ve been doing this for many, many years before all the latest interest in it. So yes, you could do it many different ways.
Great. Someone here is asking, “How do you apply a decision tree to this?”
How do I apply a decision tree? Very, very simple. So let’s just see here. Just delete these guys. What I would do is I would maybe just, in this case, skip reverse and put a decision tree. Just look for the algorithm decision tree if I can find it. Drag it in. And I think this will work. Yeah. The data types are good. And then just run the model. And what I could do is I can actually output the decision tree as well by outputting the model and outputting the performance. So while this is running, we can investigate how good the model is. Let’s see here. Performance tells me – just a second – it’s pretty good at predicting these, not so good at churned. And the tree, in this case, splits on last transaction as being the most important one. So you can see here that if you went to go back, calculate this, which you can, you could see what was that last state of interaction, which should tell you who’s churned and who’s loyal.
Great. Thanks. This person is asking, “Are there any common building blocks you or other RapidMiner folks have created that are available for download?”
Oh, yes. Oh, yes. Yes. Building blocks are actually quite simple once you can actually share them. They’re very easy to share. We actually created a community board. If you go to community.rapidminer.com– also your best friend. I mean, it is chock full of all kinds of good stuff. There’s a specific– if you come to RapidMiner products help, there’s a specific subforum called building blocks where RapidMiner users can actually share and put their building blocks in there. There’s some fantastic high-level stuff, some simple stuff. Like you see here, cross validation using Python models. You can create a Python model and use in RapidMiner to generating, optimizing decision trees and support vector machine. So all kinds of stuff in there. We encourage all of our users who actually build a cool building block to share it with the community because the more you do, the better it becomes.
Great. I have another question here, “How do you include Python code to create a data set?”
Oh, very, very simple. So a couple of things with Python– you’re going to need to have pandas installed. So if you’re running Windows, pandas can only come with, I think, the Anaconda distribution. But if you’re running Linux or even I think Apple, you could install pandas separately. So the way you would do that is you would just use our execute Python operator. And you would then, in this case, it would work– but you would then wire out the data into here. Click on add a text and you would write whatever Python code that you need to do to do some transformation or whatever it is. And then you can pipe it back out into RapidMiner. If you want to learn more about that, just right-click on this. And it’ll show operator info. And then down in the bottom here, we have some tutorial processes, which let’s say for instance here clustering, which you can then inspect the data. So this RapidMiner data, pumping it in. And then I have some Python code right here to do some work and then it pumps it back out. So yeah. You can do that. Yeah.
Great. This person’s asking, “Is there any way to use a C number of library to bring input data in?”
C libraries. So yeah. We have actually a lot of the libraries. I believe you could do that. What you would do is you would probably use the execute program operator. So the execute program operator allows you to either execute a program to do something or you could actually, if you want to use a specific char for C libraries, you can actually use maybe the execute script operator, which is a little bit of Groovy script. You can actually pull that in into your process. So RapidMiner’s Java-based. So if you’re good at groovy and good at Java, I mean, it will work really well. You can do a lot with it with these two operators.
Great. So next question. This person’s asking, “I know you mentioned your blog post earlier. Will there be a similar webinar on text data?”
On text mining? I am pretty sure we will have one in the future. I have not been tasked to do that. But with text data, I mean, RapidMiner also shines very well on that. I do a lot of sentiment analysis especially around Twitter data. So there are a couple of videos on our YouTube channel which talks about text mining federal reserve meeting minutes, doing some Twitter stuff. So you can check those out as well too. That’s another resource, the RapidMiner YouTube page. Yeah. Check that out.
Great. And then someone asked about the recorded version. Like I said, we’ll definitely be sending that out within one to two business days. And then they’re also asking, “Is there any reading material that you can suggest to follow up on this webinar?”
So yes. There is one book– there’s two books I really like. One is Exploring Data with RapidMiner written by Andrew Chisolm. I believe that’s available to purchase off of Amazon. And there’s also the book written by one of our partner, Semaphore. Really great review of all the features of RapidMiner especially data prep. I believe, Hayley, we may be doing– we have occasional contests where we give the book away for free. So what I would do is I would probably try to reach out to somebody in our marketing group to see if we can get you a free copy or something like that.
Okay. Thanks, Tom. So it looks like we’re just about at the top of the hour. So for those of you who we weren’t able to address your questions, we’ll make sure to follow up with you via email. So thanks again, Tom, for your presentation. And thank you everyone for joining us for today’s presentation. Have a great day, everyone.
Thank you. Bye-bye now.
You can check out the processes and data from this presentation here: