RapidMiner & Kaggle: A Measurable Difference!
This post is Part 2 of a two-part series on how I used RapidMiner in Kaggle’s Shelter Animal Outcomes competition. Read my first post here.
How do I use RapidMiner for Imputation?
Recall that we have missing values in the training and testing data. We can nest a supervised learner (K-NN, Naïve Bayes, or any other preferred algorithm) inside the ‘Impute Missing Values’ operator to estimate them. I used a K-NN (k=5) scheme to impute the missing values of AgeInDays, Neutered, and Sex. The attribute filter type parameter is set to ‘subset’, so only the selected attributes (AgeInDays, Neutered, and Sex) are imputed.
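For readers who want to see the idea outside RapidMiner, here is a minimal sketch in plain Python: for each row with a missing value, find the k nearest complete rows using the other attributes, then fill in the neighbours' mean (numeric) or most common value (nominal). The function name `knn_impute`, the toy rows, and the Euclidean distance are my own assumptions for illustration. This is not what the operator does internally.

```python
import math
from collections import Counter

def knn_impute(rows, target, features, k=5):
    """Fill missing (None) values of `target` from the k nearest complete rows.

    Distance is Euclidean over the numeric `features`. Numeric targets get the
    neighbours' mean; nominal (string) targets get the most common value.
    The input rows are not modified; a new list of dicts is returned.
    """
    complete = [r for r in rows if r[target] is not None]
    imputed = []
    for row in rows:
        if row[target] is not None:
            imputed.append(dict(row))
            continue
        # Sort the complete rows by distance to the current row, keep the k closest.
        neighbours = sorted(
            complete,
            key=lambda c: math.dist([row[f] for f in features],
                                    [c[f] for f in features]),
        )[:k]
        values = [n[target] for n in neighbours]
        if all(isinstance(v, (int, float)) for v in values):
            fill = sum(values) / len(values)
        else:
            fill = Counter(values).most_common(1)[0][0]
        imputed.append({**row, target: fill})
    return imputed

# Toy data (made up for illustration): Hour is known, AgeInDays is sometimes missing.
animals = [
    {"Hour": 9,  "AgeInDays": 30},
    {"Hour": 10, "AgeInDays": 60},
    {"Hour": 11, "AgeInDays": 90},
    {"Hour": 17, "AgeInDays": 700},
    {"Hour": 18, "AgeInDays": 800},
    {"Hour": 10, "AgeInDays": None},
]
filled = knn_impute(animals, target="AgeInDays", features=["Hour"], k=3)
print(filled[-1]["AgeInDays"])
```

Passing a nominal attribute such as Sex as `target` would take the majority-vote branch instead of the mean.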
A quick and dirty Random Forest model can be built inside a 5-fold cross-validation within one minute in RapidMiner! Because Random Forest is highly flexible, there is no need to convert nominal attributes to dummy codes. The easy-to-interpret tree-structured results make Random Forest my number one go-to learner. Besides, RapidMiner reports variable importance for a Random Forest model through ‘Weight by Tree Importance’.
As you may have guessed, Age, Hour, Neutered or not, and Weekday (DateTime_day) are important attributes for predicting outcomes of shelter animals. Using the quick and dirty Random Forest model, my first submission landed a log loss score of 2.48665. Next, let’s build an XGBoost model in the same pipeline using RapidMiner’s seamless integration of R/Python scripting.
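If the log loss metric is new to you, it is the average negative log of the probability the model assigned to the true class. The sketch below is my own plain-Python illustration with made-up predictions; the function name and the clipping constant `eps` are assumptions, and Kaggle's scorer may differ in details.

```python
import math

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-probability assigned to the true class.

    y_true: list of true class labels.
    y_prob: list of dicts mapping class label -> predicted probability.
    Probabilities are clipped to [eps, 1 - eps] so log(0) never occurs.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1 - eps)
        total += -math.log(p)
    return total / len(y_true)

# Made-up predictions for three animals.
y_true = ["Adoption", "Transfer", "Adoption"]
y_prob = [
    {"Adoption": 0.7, "Transfer": 0.3},
    {"Adoption": 0.4, "Transfer": 0.6},
    {"Adoption": 0.5, "Transfer": 0.5},
]
print(round(multiclass_log_loss(y_true, y_prob), 4))
```

Lower is better. A confident wrong prediction is punished heavily, while a uniform guess over two classes scores ln 2, about 0.693.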
How do I use RapidMiner for parameter tuning?
Want to find the best set of parameters for your models? There is a way to do this with Optimize (Grid) in RapidMiner. It optimizes the parameters of its inner operators. You can optimize more than one parameter of the inner learner and specify the range and steps of each target parameter yourself.
Here is the parameter setting for the inner Random Forest in my process.
And in the parameter file I get the best setting for the number of trees and the confidence level for pruning.
Random Forest.number_of_trees= 152
Random Forest.confidence= 0.30000003999999997
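Conceptually, a grid optimization just tries every combination of the candidate values and keeps the best one. Here is a minimal plain-Python sketch of that loop. The names `grid_search`, `linspace`, and `fake_cv_score` are hypothetical, the scoring function is a stand-in for "train the learner and return its cross-validated performance", and treating "steps" as producing steps + 1 evenly spaced values is my assumption.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate score_fn on every combination in grid; return the best one.

    grid: dict mapping parameter name -> list of candidate values.
    score_fn: callable taking keyword parameters and returning a score
              where higher is better.
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def linspace(start, stop, steps):
    """Return steps + 1 evenly spaced values from start to stop inclusive."""
    return [start + (stop - start) * i / steps for i in range(steps + 1)]

# Hypothetical stand-in for a real cross-validated score;
# it simply peaks at 150 trees and confidence 0.3.
def fake_cv_score(number_of_trees, confidence):
    return -abs(number_of_trees - 150) / 100 - abs(confidence - 0.3)

grid = {
    "number_of_trees": [int(v) for v in linspace(50, 250, 4)],
    "confidence": linspace(0.1, 0.5, 4),
}
best_params, best_score = grid_search(fake_cv_score, grid)
print(best_params)
```

The number of runs is the product of the grid sizes (5 x 5 = 25 here), so keep the steps modest when each run is a full cross-validation.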
What if you have called R/Python libraries, like EARTH, XGBoost, or sklearn, and want to optimize their parameters in RapidMiner? This is very easy: just set up macros for the parameters and reference them later in your R/Python code. Alternatively, add a deactivated dummy operator, optimize its parameter, and use that value as a macro inside Execute R or Execute Python.
How do I use RapidMiner for Ensemble Methods?
In RapidMiner, you can leverage all the ensemble methods (Vote, Stacking, Bagging, AdaBoost, etc.) to combine the benefits of different learners. I used Vote for my final ensemble model. The ‘Vote’ operator uses a majority vote (for classification) or the average (for regression) on top of the predictions of the inner learners. Of course, you can also combine the probability predictions from several learners manually with ‘Join’ followed by ‘Generate Attribute’.
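Both ways of combining learners fit in a few lines of plain Python. This is only an illustrative sketch with made-up predictions; the function names are my own and do not correspond to RapidMiner operators.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the learners' predictions.

    On a tie, the label that appears first in `predictions` wins.
    """
    return Counter(predictions).most_common(1)[0][0]

def average_probabilities(prob_dicts):
    """Average class probabilities from several learners (soft voting)."""
    classes = prob_dicts[0].keys()
    return {c: sum(p[c] for p in prob_dicts) / len(prob_dicts) for c in classes}

# Made-up outputs of three learners for one animal.
labels = ["Adoption", "Transfer", "Adoption"]
probs = [
    {"Adoption": 0.6, "Transfer": 0.4},
    {"Adoption": 0.3, "Transfer": 0.7},
    {"Adoption": 0.9, "Transfer": 0.1},
]
print(majority_vote(labels))
print(average_probabilities(probs))
```

For a competition scored on log loss, averaging the probabilities is usually the more useful of the two, because the submission needs probabilities rather than hard labels.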
This is just the start of my Kaggle journey. As of June 21, 2016, my best ranking on the leaderboard is 65 out of 1112 (Top 10%). My take-home message: R and Python are not the only tools for Kaggle competitions, and you’re missing out on a lot by not incorporating RapidMiner into your analysis. Contact me if you want to team up using RapidMiner as the platform for Kaggle competitions!
Update: RapidMiner 7.2 was released after I wrote this post, and it now contains Gradient Boosted Trees and Generalized Linear Models. While I love working in R & Python, these two new operators (based on the H2O.ai algorithms) make my life way easier.