Joining Data – How to do it in RapidMiner
We all have to do it. Joining data from different sources is a must when you work in the data economy. The typical way to do it is to ask your Data Warehouse Admin to give you a dump of data in the format you want. He or she will typically write an SQL statement that meets your criteria and then give you the output. Your next logical step is load it into RapidMiner, do your work, and live happily ever after.
While this is the optimal scenario, it isn’t reality. You might get a data dump from the DWH Admin, then your manager gives you spreadsheet with more data, and you might find a CSV file from the Internet. Your task? Merge, Mashup, and Join all those data sources. You could easily code all the mashups but that would take time. The good news is that RapidMiner allows you do all those complex SQL merge and join functions quickly with 7 operators.
Finding the Joins
RapidMiner Studio has 7 different types of data “joins;” Append, Join, Set Minus, Intersect, Union, Superset, and Cartesian Product. All of them are easy to use in the drag and drop visual programming way BUT some of them require the use of an ID role. What’s an ID role? That’s where you tell RapidMiner what column (aka attribute) you want to use as a Primary Key. This is easy to do if you use a Set Role operator and then select the column you want and make it the ID role.
To find all those 7 operators, you will have to go to Blending > Table > Joins. You can also use the recently enhanced search box for the Operators. Searching for the word “Match” comes up with the Join operator. If you search for “Add” or “Combine” the Append operator is displayed.
Pro Tip: I’ll be showing some examples using the built in RapidMiner Tutorials. If you right-click on each operator you’ll see a “Jump to Tutorial” link. That will bring you to a sample process illustrating the use of that particular operator.
The Append operator is exactly like the SQL Append operation. It will append two or more data sets together but it only works correctly if – and only if – all the data sets you’re trying to append together have the same number of columns and column names. For example, if one data set has 2 columns (ID, Label) and the second data set has 3 columns (ID, Label, Att1), then you will get an error.
Pro tip: There is a simple work around to this. use a Generate Empty Attribute operator to create the “Att1” column in the first data set, then everything should append well.
The join operator is my most used operator. I use this operator to join so many different data sources or process branches. This operator does what it’s called, it joins two data sources or branches together.
There are few parameters that you need to know in order to make this operator your favorite one. It’s the Join Type parameter (Inner/Outer/Right/Left) and the “use id as attribute as key” parameter toggle (see red arrows). The Join Type parameter is a standard SQL type of join. Inner or Outer, Right or Left, these tell RapidMiner how to join your datasets together.
The “use id attribute as key” is the easy – but not typical – way RapidMiner does the joins. For you to Join two datasets together, you must have something to match the records up with. Usually that will be the email field in dataset #1 with the email field in dataset #2. Another example would be joining up the CustID field in one dataset with ID in the the other data. The “use id attribute as key” makes the assumption that both datasets already have a ID role assigned to a column, which isn’t always the case.
I just toggle this parameter off and then click on the “key attributes” list and do a manual match up of fields. The cool thing with the key attributes list is that you can do multiple match ups. Just click on “Add Entry” and you can do another matchup of your data. Just be warned, depending on the Join Type and how many fields you match up, you could get a resulting data set with no data in it!
Pro tip: You can use multiple joins! Just connect the output join port to the right or left input of the second Join operator!
Set Minus is one of those operators that you tend to overlook at first until you need to remove data. This powerful but often misued operator generates a lot of frustration at first if you don’t have a column set as an ID role. Simply put, this operator works by scanning the whole data set and then removing rows that are contained in a “subtraction” set. For example, if my entire data set has 1,000 rows and my subtraction dataset has 20 rows (assuming those 20 rows are contained in the 1,000 row data set) the resulting data set should be 980 rows.
Just remember, in order to get this operator to work correctly, the whole and subtraction dataset must have a column with the ID role AND the ID’s in the subtraction set must be contained in the whole data set.
Pro tip: Depending on how big of a subtraction data set you have, this operation can be memory intensive.
The Intersect operator is another operator that requires you to use the ID role for both data sets. The way that this operator works is that it looks at all your ID’s in the whole data set (exa port) and looks up the ID’s in the second set (sec port) and then delivers outputs the data that is contained in both inputs. It’s very similar to an Inner Join and is opposite of the Set Minus operator. I hardly use this operator, instead I usually just use a Join operator with an Inner Join setting.
Just make sure to inspect your data to make sure it’s what you want!
Pro tip: Just remember that the ID role is critical to get this operator to work correctly. Both ID’s have to be in same data format to work (i.e. numerical, nominal, etc).
The Union operator reminds me of the Append operator, but only less finicky. Just like the Append operator, you can merge multiple datasets BUT they don’t require you to have the same number of columns or attribute names.
The nice thing about this operator is that it doesn’t care if you have an ID role or not. It just merges everything!
Pro tip: Inspect your data carefully after using the Union operator to make sure the output is what you really want.
Superset & Cartesian Product
The last two operators I want to talk about is the Superset and Cartesian Product operator. These are handy operators to use if you want to do build data sets that have the same number and named columns, just without the complete data. You might want to do this if somewhere downstream if you get a test set or a refreshed DWH extract. Although not used very often, they do have a specific purpose in data manipulation.
The Superset operator takes two sets and then outputs two data sets. The nice thing is that the two outputted data sets contains all the columns and their names in each set. The only difference is that not all the data is contained in each set.
You’ll get two data outputs (shown below) when you run the example process. You can see that both outputs have the same number of columns and names BUT the data is different.
Pro tip: Superset is really handy in building data sets “on the fly” that need to have the same number of columns and names. This comes in hand in model building and model testing where you need to have the same number and names of attributes for the training and testing set!
The Cartesian Product operator is just that, it’s the product of all your data rows for the two data sets.
There’s one snag that you need watch out for. If the data sets you want to merge each contain a label role, it will delete the label from the right side in lieu of the using the one from the left side. To avoid this, just use a Set Role operator to demote one of the label roles to a regular role.
Pro tip: This operator really “blows up” your rows. Just multiply the number of rows you have in the data sets and that’s the total output you’ll get.