The Many Tools of Data Prep, part 3: Data Types and Conversions
THIS IS THE THIRD ARTICLE IN OUR SERIES ON RAPIDMINER’S DEEP AND RICH DATA PREPARATION.
MAKE SURE TO READ PART 1 & PART 2 (and the prequel on JOINING DATA)
One of the more interesting and important subjects with respect to RapidMiner’s data prep capabilities is the topic of Data Types and Conversions. As Data Scientists, Engineers and Analysts, you routinely have to transform data from one type (e.g. date) to another (e.g. numerical). Other times, you may want to parse information from a string (nominal data type) into a date. Perhaps you want to use a Support Vector Machine and discover that it can’t handle any attributes (columns) that are nominal (strings), so you have to convert those nominal values into numerical ones.
Although converting data types is relatively easy to do in code if you’re familiar with a programming language like R or Python, it’s significantly easier to do in RapidMiner Studio. Our code-optional approach encapsulates that conversion code inside an operator. When the operator is used in a workflow design, it is automatically executed as part of the process. To further streamline this step, it can be saved as a building block for reuse when needed.
These conversion operators are found under Blending > Attributes > Types, and there are 15 of them. To use them properly, you first have to understand how RapidMiner interprets Data Types.
When you load data into RapidMiner, certain assumptions are made automatically. RapidMiner tries to be smart and save you time by scanning the data in your columns (attributes) and automatically assigning each column a “data type.” But what is a data type?
As I’ve discussed in my previous Data Prep posts, in RapidMiner there are really two main data types: Nominal and Numeric.
Nominal, Numerical, and Date-Time Values
Nominal values are typically strings or categorical values. They can be binominal (two possible values, such as true or false, or male or female) or polynominal (more than two, such as cash, credit card, check). Numeric values contain integers, real numbers, and date-time. You might wonder why date-time is part of the numeric values. Well, that’s because of Computer Scientists and something called the “Epoch”.
In Computer Science, you can convert any date and time into a numeric value if you use some reference point. That reference point is called the “epoch”, which is midnight UTC on January 1, 1970. Computers can quickly calculate the elapsed time in milliseconds since January 1, 1970 and then automatically convert those milliseconds back into a date and time. This is also why a nominal (string) value such as 11/21/1970 can become a real date value: you simply use the Nominal to Date operator and tell it which date format the string follows.
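For readers who like to see the arithmetic, here is a minimal Python sketch of the same epoch idea. It is not what RapidMiner runs internally; the MM/dd/yyyy reading of the string and the choice of midnight UTC are assumptions made for this illustration.

```python
from datetime import datetime, timedelta, timezone

# The nominal (string) value from the example above.
# Assumption: the string follows a month/day/year pattern.
raw_value = "11/21/1970"

# Parse the string into a date-time; treating it as midnight UTC is an assumption.
parsed = datetime.strptime(raw_value, "%m/%d/%Y").replace(tzinfo=timezone.utc)

# The reference point: midnight UTC on January 1, 1970.
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Elapsed milliseconds since the epoch (this is the numeric form of the date).
millis = (parsed - epoch) // timedelta(milliseconds=1)

# Going the other way: turn the milliseconds back into a date and time.
restored = epoch + timedelta(milliseconds=millis)

print(millis)    # 27993600000
print(restored)  # 1970-11-21 00:00:00+00:00
```

November 21, 1970 is 324 days after the epoch, and 324 days times 86,400,000 milliseconds per day gives the number printed above.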
Let’s take a look at this in practice.
Nominal to Numerical Conversion Example
Let’s say I want to convert the nominal values cheque, credit card, and cash to numerical values (i.e. 0’s and 1’s).
To do this, we’ll use a Nominal to Numerical operator and select the attribute that we want to convert.
Then we hit the Run button, and our attribute is converted into three new attributes, each of which contains either a 1 or a 0.
Now all of our different payment methods are represented by numerical values which can be used in a variety of different machine learning algorithms.
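If you are curious what this kind of conversion looks like in code, the following is a rough Python sketch of the same idea (often called dummy coding or one-hot encoding). The function name one_hot, the sample values, and the "payment = value" column naming are all assumptions made for illustration, not RapidMiner’s actual implementation.

```python
def one_hot(values, attribute_name):
    """Turn one nominal column into one 0/1 column per distinct value.

    Hypothetical helper for illustration only.
    """
    categories = sorted(set(values))
    return {
        # One new column per nominal value; the naming scheme is an assumption.
        f"{attribute_name} = {category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


# Made-up sample data: four rows of a nominal "payment" attribute.
payments = ["cash", "cheque", "credit card", "cash"]

encoded = one_hot(payments, "payment")
for column, flags in encoded.items():
    print(column, flags)
```

Each row ends up with exactly one 1 across the three new columns, marking which payment method it originally had.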
Pro Tip: When converting nominal to numerical values this way, you automatically add a new column for each nominal value. In our example above there were three nominal values: cash, cheque, and credit card. The conversion removed the old attribute and created three new attributes in its place. Be very careful if an attribute has hundreds or thousands of distinct nominal values. If you try to convert all of them, you will create a very wide data set, which will consume a lot of your computer’s memory and resources!
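One way to act on this tip is to count the distinct values of an attribute before converting it. The Python sketch below does that on a small made-up table; the helper dummy_column_count, the sample rows, and the max_columns threshold are hypothetical and only there to show the idea.

```python
def dummy_column_count(rows, attribute):
    """Return how many 0/1 columns a conversion of this attribute would create.

    Hypothetical helper: one new column per distinct nominal value.
    """
    return len({row[attribute] for row in rows})


# Made-up sample data: "payment" has few distinct values, "customer_id" has many.
rows = [
    {"payment": "cash", "customer_id": "C001"},
    {"payment": "cheque", "customer_id": "C002"},
    {"payment": "credit card", "customer_id": "C003"},
    {"payment": "cash", "customer_id": "C004"},
]

max_columns = 3  # arbitrary threshold chosen for this example

for attribute in ("payment", "customer_id"):
    count = dummy_column_count(rows, attribute)
    verdict = "ok to convert" if count <= max_columns else "too wide, reconsider"
    print(attribute, count, verdict)
```

An attribute like a customer ID, where nearly every row has its own value, is usually a sign that the column should be dropped or handled differently rather than converted.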
In my next post I’ll discuss Feature Generation using one of my favorite operators, Generate Attributes!