The Many Tools of Data Prep, part 4: Feature Generation and Selection
THIS IS THE FOURTH ARTICLE IN OUR RAPIDMINER’S DEEP AND RICH DATA PREPARATION SERIES.
MAKE SURE TO READ PART 1, PART 2 & PART 3 (and the prequel on JOINING DATA)
Feature Generation and Selection is the next step in transforming your data. It's a must during any Data Prep phase, and RapidMiner has some handy operators that make the process fast and easy.
To begin – many of you may be wondering: what exactly is a “feature” and why would you want to generate a new one? A feature is another term for an “Attribute” (RapidMiner’s term for a column). Feature Generation takes one or more attributes from your dataset and creates a new feature from them. Typical examples: calculating the rate of change over time, calculating the percentage of an observed value, or even a simple extraction of a prefix from a string value.
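To make those three examples concrete, here is a minimal sketch of the same ideas in Python with pandas (an analogy only – RapidMiner does this through its operators, not code; the column names and data here are hypothetical):

```python
import pandas as pd

# Toy dataset -- column names and values are invented for illustration.
df = pd.DataFrame({
    "month":   [1, 2, 3, 4],
    "revenue": [100.0, 110.0, 121.0, 133.1],
    "sku":     ["US-1001", "US-1002", "EU-2001", "EU-2002"],
})

# Rate of change over time: percent change from the previous row.
df["revenue_pct_change"] = df["revenue"].pct_change()

# Percentage of an observed value: each row's share of the total.
df["revenue_share"] = df["revenue"] / df["revenue"].sum()

# Prefix extraction from a string attribute.
df["region"] = df["sku"].str.split("-").str[0]
```

Each new column is a generated feature derived from the existing attributes, which is exactly what the Generation operators below do inside RapidMiner.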
There are many Feature Generation operators in RapidMiner which are found under Blending > Attributes > Generation.
The Generate Attributes operator is a fairly popular operator, and one that I rely on a lot. I call it the Swiss Army Knife of RapidMiner Studio. The Generate Attributes operator can do simple mathematical calculations (e.g., addition, subtraction, multiplication, etc.) as well as really advanced logical, string, and date calculations. It’s quite simple to use – just drag a Generate Attributes operator into your Process and click on the Function Descriptions. A dialog box will appear, where you type in the name of the new Attribute (or feature) you want to create and then the function expression.
The function expression dialog box can be a bit intimidating at first, and you may find yourself thinking – what kind of function does RapidMiner expect and in what format? Is it a programming language? Relax! It’s a very simple expression writer. If you can do “IF-THEN” expressions in Excel, you can easily use this! All you have to do is click on the calculator icon at the right of the function expression box and you will find the new Function Expression helper.
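As a taste of what an expression looks like, here is a simple IF-THEN example in the style of RapidMiner's expression language (the attribute name Age appears in the example dataset later in this article; the category labels are invented for illustration):

```text
if(Age >= 65, "senior", "adult")
```

If you can write the Excel equivalent, `=IF(A2>=65, "senior", "adult")`, you can write this.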
On the mid-to-lower left of the dialog box you will find an expand (double arrows) icon. Click on it to view the impressive list of available functions. These functions range from the standard Logical functions (If-Then, Or, And, etc.) to more advanced ones like Date Calculations (date_difference, date_now, etc.).
On the mid-to-lower right of the dialog box is the Inputs window. This is an extremely helpful window because it lists all the attributes in your Process workflow. In the example above, you can see that I have Age, ChurnDate, Gender, etc., and I can insert these Attributes into the Expression Box above with a click of the mouse.
The Expression Box is where you write your expression, which will automatically be checked to make sure it’s syntactically correct. If it’s not, you’ll get a warning message telling you what’s wrong. If it’s correct, you get a green check mark. When you’re done, hit Apply and your new feature-generating function is ready to go.
The Generate Attributes operator is just one of many Feature Generation operators available in RapidMiner Studio. Others include the Generate ID, Generate Aggregation (Aggregate across examples), Generate Empty Attribute (make an empty attribute), and Generate Concatenation operators.
After you’ve generated new features, you might want to tidy up some of your data. Perhaps you calculated a rate of change and want to remove the Attributes you used to create the calculation, keeping only the new Feature/Attribute. Sometimes you might want to remove correlated Attributes or randomly select Attributes. Whatever the case, selecting Attributes for your downstream process is easily achieved: just go to your operator list and look under Blending > Attributes > Selection.
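The "remove correlated Attributes" case is worth a quick sketch. Here is one common recipe in Python with pandas (again an analogy, not RapidMiner's implementation – RapidMiner has operators for this; the columns, values, and 0.95 threshold are all hypothetical): compute pairwise absolute correlations and drop one column from each highly correlated pair.

```python
import pandas as pd

# Toy numeric dataset; "b" is perfectly correlated with "a"
# (column names and the 0.95 threshold are invented for illustration).
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * a, so correlation is 1.0
    "c": [4.0, 1.0, 3.0, 2.0],
})

corr = df.corr().abs()           # pairwise absolute Pearson correlations
threshold = 0.95
to_drop = set()
cols = list(corr.columns)
for i, col_i in enumerate(cols):
    for col_j in cols[i + 1:]:
        # Drop the later column of any pair above the threshold,
        # skipping columns already marked for removal.
        if col_i not in to_drop and col_j not in to_drop \
                and corr.loc[col_i, col_j] > threshold:
            to_drop.add(col_j)

reduced = df.drop(columns=sorted(to_drop))
```

Here `b` is dropped and `a` and `c` survive, shrinking the attribute set without losing information.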
I use Select Attributes all the time and find Select by Weights particularly useful when I’m doing Feature Optimization – which we will cover next.
Feature Optimization is on the periphery of Data Prep and uses more machine learning techniques than standard Data Prep techniques. We’re including it in this data prep discussion, because model building is an iterative process and in RapidMiner it’s very easy to flow from Data Prep to Modeling and back again.
Feature Optimization uses machine learning to test and measure the performance of your features. This helps you identify the optimal group of features required to build the best model. Often you don’t need all of them and you want to remove the unnecessary ones. There are two main reasons for doing this – one, it reduces a very wide dataset down to the essential features, and two, it speeds up run time. This means it’s not necessary to throw your data at the wall and see what sticks!
There are several ways to do Feature Optimization in RapidMiner. The two standard ones are Backward Elimination and Forward Selection. Backward Elimination starts with all your attributes and then drops out attributes that aren’t as robust as you thought. This is usually done by embedding a machine learning algorithm (e.g., a Decision Tree) and iterating over the features, measuring their performance and keeping only the ones that add to the performance of the model. Forward Selection is the same concept, except it starts with an empty dataset and adds features one at a time, measuring their performance and keeping the features that improve it.
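The greedy loop behind Forward Selection is easy to see in a hand-rolled sketch. The NumPy example below is not RapidMiner's implementation, and for brevity it scores each candidate subset with training R² from a least-squares fit rather than the cross-validated performance RapidMiner would use; the data and the stopping threshold are invented for illustration.

```python
import numpy as np

def score(X, y):
    """Training R^2 of a least-squares fit -- a stand-in for the
    cross-validated performance measure RapidMiner would compute."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_selection(X, y, names, min_gain=1e-3):
    selected = []                      # indices of kept features
    best = -np.inf
    while True:
        candidates = [(score(X[:, selected + [j]], y), j)
                      for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            break
        s, j = max(candidates)         # best feature to add this round
        if s - best < min_gain:
            break                      # no candidate improves performance
        best, selected = s, selected + [j]
    return [names[j] for j in selected]

# Synthetic data: y depends on x1 and x2; "noise" is an irrelevant attribute.
rng = np.random.default_rng(0)
n = 200
x1, x2, noise = (rng.normal(size=n) for _ in range(3))
y = 3.0 * x1 - 2.0 * x2 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, noise])

kept = forward_selection(X, y, ["x1", "x2", "noise"])
```

The loop keeps `x1` and `x2` and stops before adding the irrelevant `noise` column, because it no longer improves the score by more than the threshold. Backward Elimination is the mirror image: start with all columns selected and greedily remove the one whose removal hurts least.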
Let’s check out a Forward Selection example from the built-in RapidMiner operator tutorials. How do you get there? Just search for Forward Selection, right-click on the operator, select Operator Reference, and click on the Tutorial link.
When you load up the tutorial process, you should see this:
The dataset is the Polynomial sample dataset included with RapidMiner; it contains a “label” attribute (RapidMiner’s term for a target variable) and 5 regular attributes. Our goal is to shrink the number of attributes down to only the essential ones that help us best predict the label.
The Forward Selection operator is a nested operator. This means we need to embed something inside it to make it work. But what? Think back to our description of what Forward Selection does: it starts with an empty dataset, takes one attribute, and measures its performance at predicting the Label. Then it adds a second attribute, measures the performance, and keeps the two attributes if the performance increases. It then iterates over combinations of the attributes until it comes up with the optimal selection of attributes.
The way we measure performance for our Label in RapidMiner is through Cross Validation. The tutorial process has a Cross Validation operator already embedded, and inside that nested operator we embed our machine learning algorithm (k-NN) plus Apply Model and Performance operators. The Cross Validation operator does all the hard work and delivers the performance results back to the Forward Selection operator.
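What the nested Cross Validation operator is doing can be sketched by hand. The Python example below is an analogy only (not RapidMiner's code): a minimal k-nearest-neighbour regressor scored by k-fold cross-validation, where each fold is held out once and the held-out error is averaged. The dataset and parameter values are invented for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, n_neighbors=3):
    # Plain k-NN regression: average the targets of the nearest
    # training examples (Euclidean distance).
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        idx = np.argsort(d)[:n_neighbors]
        preds.append(y_train[idx].mean())
    return np.array(preds)

def cross_val_rmse(X, y, k=5, n_neighbors=3):
    # k-fold cross-validation: each fold is held out for testing once,
    # and the root-mean-squared errors are averaged.
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for f in folds:
        train = np.setdiff1d(np.arange(len(y)), f)
        p = knn_predict(X[train], y[train], X[f], n_neighbors)
        errs.append(np.sqrt(np.mean((p - y[f]) ** 2)))
    return float(np.mean(errs))

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 2))
y = X[:, 0] ** 2 + X[:, 1]      # a smooth synthetic target
rmse = cross_val_rmse(X, y)
```

A feature-selection wrapper like Forward Selection calls this kind of cross-validated score once per candidate subset, which is why the operator needs a learner and a Performance operator nested inside it.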
When all the iterations are done, the Forward Selection operator will output the dataset with the optimal attributes!
Note: Any Feature Optimization process can take a long time if you have a wide dataset with 100s of attributes.
The Forward Selection operator also outputs a series of weights (1 or 0) which you can use with a Select by Weights operator. Why is this handy? Most of the time we’ll do Feature Optimization as part of model training. Since model training can take a long time, we want to do all our scoring as a separate set of processes, but we certainly don’t want to hand-select attributes in our scoring process from the results in our training process.
The solution is to save the weight results from the training process and then use a Select by Weights operator in the scoring process. You load in the saved weight results and let the Select by Weights operator auto-select the attributes for scoring. This saves time and allows you to move quickly onto the next step in the process.
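The weight-based selection itself is just a column mask. Here is the idea in Python with pandas (an analogy to Select by Weights, not RapidMiner's implementation; Age, Gender, and ChurnDate echo the attribute names from the earlier screenshot, while Tenure and all the values are invented):

```python
import pandas as pd

# Weights saved from the training process: 1 = keep, 0 = drop.
weights = {"Age": 1, "Gender": 0, "ChurnDate": 0, "Tenure": 1}

# Fresh data arriving in the scoring process (values hypothetical).
scoring_df = pd.DataFrame({
    "Age": [34, 51],
    "Gender": ["F", "M"],
    "ChurnDate": ["2020-01-01", "2020-02-01"],
    "Tenure": [12, 48],
})

# Apply the saved weights: keep only the attributes weighted 1.
keep = [col for col, w in weights.items() if w == 1]
scoring_df = scoring_df[keep]
```

Because the mask is saved once at training time and reapplied automatically, the scoring process always uses exactly the attributes the optimization chose, with no hand-selection.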
I hope you have enjoyed learning about how you can use Feature Generation and Selection in RapidMiner to help you transform your data, thus reducing model complexity and improving your results.
In my next post I’ll discuss how to handle Outliers.