Using RapidMiner for Kaggle Competitions – Part 2

RapidMiner & Kaggle: A Measurable Difference!

This post is Part 2 of a two-part series on how I used RapidMiner in the Shelter Animal Outcomes Kaggle competition. Read my first post here.

How do I use RapidMiner for Imputation?

Recall that we have missing values in both the training and testing data. We can nest a supervised learner (K-NN, Naïve Bayes, or any preferred algorithm) inside the ‘Impute Missing Values’ operator to estimate them. I used a K-NN (k=5) scheme. The attribute filter type parameter is set to ‘subset’, so only the missing values of AgeInDays, Neutered, and Sex are estimated.
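For readers who want to see the idea outside of RapidMiner, here is a minimal sketch of K-NN imputation in plain Python. It is not what the operator does internally. It assumes the features used for the distance are numeric and complete, and the toy rows and the numeric encoding of Hour and Weekday are hypothetical.

```python
import math

def knn_impute(rows, target, features, k=5):
    """Fill missing (None) values of `target` from the k nearest complete rows.

    Numeric targets get the neighbours' mean; other targets get the most
    common neighbour value. Distance is Euclidean over numeric `features`
    (an assumption of this sketch).
    """
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        # Rank the complete rows by distance to the incomplete row.
        nearest = sorted(
            donors,
            key=lambda d: math.dist([row[f] for f in features],
                                    [d[f] for f in features]),
        )[:k]
        values = [d[target] for d in nearest]
        if all(isinstance(v, (int, float)) for v in values):
            estimate = sum(values) / len(values)
        else:
            estimate = max(set(values), key=values.count)
        filled.append({**row, target: estimate})
    return filled

# Hypothetical toy data; only the attribute names come from the post.
animals = [
    {"Hour": 9,  "Weekday": 1, "AgeInDays": 30},
    {"Hour": 10, "Weekday": 1, "AgeInDays": 60},
    {"Hour": 17, "Weekday": 5, "AgeInDays": 3650},
    {"Hour": 9,  "Weekday": 2, "AgeInDays": None},
]
imputed = knn_impute(animals, "AgeInDays", ["Hour", "Weekday"], k=2)
```

With k=2 the last row borrows its age from the two morning intakes rather than from the much older animal that arrived in the evening.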

YYKaggle07



How do I use RapidMiner for modeling and validation?

YYKaggle08

A quick and dirty Random Forest model can be built inside a 5-fold cross-validation within one minute in RapidMiner! Because Random Forest is so flexible, there is no need to convert nominal attributes to dummy codes. The easy-to-interpret, tree-structured results make it my number one go-to learner. RapidMiner also reports variable importance for a Random Forest model through ‘Weight by Tree Importance’.
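To make the 5-fold idea concrete, here is a small sketch of how rows can be split into folds. It uses contiguous folds without shuffling or stratification, which is a simplification and may differ from the sampling RapidMiner applies; the function name is hypothetical.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(12, k=5))
```

Each row lands in exactly one test fold, so every row is used for validation once and for training four times.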

YYKaggle09

As you may have guessed, Age, Hour, Neutered or not, and Weekday (DateTime_day) are important attributes for predicting the outcomes of shelter animals. Using the quick and dirty Random Forest model, my first submission landed a log loss score of 2.48665. Next, let’s build an XGBoost model in the same pipeline, using the seamless integration of R/Python scripting in RapidMiner.
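For readers unfamiliar with the metric, multi-class log loss is the average negative log of the probability assigned to the true class, so lower is better. The sketch below is a plain Python version; the clipping constant and the row renormalisation are assumptions about common practice, and the toy labels and probabilities are hypothetical.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class indices; probs: list of per-row probability lists.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0), then renormalise so the row sums to 1.
        clipped = [min(max(p, eps), 1 - eps) for p in row]
        norm = sum(clipped)
        total += -math.log(clipped[label] / norm)
    return total / len(y_true)

# Hypothetical toy data: three rows, five classes, uniform predictions.
labels = [0, 3, 4]
uniform = [[0.2] * 5 for _ in labels]
baseline = multiclass_log_loss(labels, uniform)  # equals ln(5)

# One confident but wrong prediction: true class 1 was given probability 0.01.
confident_miss = multiclass_log_loss([1], [[0.97, 0.01, 0.01, 0.005, 0.005]])
```

A uniform guess over five classes scores ln(5), roughly 1.61, while a single confident miss is penalised far more heavily. This is why well-calibrated probabilities matter as much as accuracy under this metric.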

YYKaggle10

How do I use RapidMiner for parameter tuning?

Want to find the best set of parameters for your models? You can do this with Optimize (Grid) in RapidMiner, which optimizes the parameters of its inner operators. You can optimize more than one parameter of the inner learner and specify the range and steps for each target parameter yourself.

YYKaggle11

Here is the parameter setting for the inner Random Forest in my process.

YYKaggle12

In the parameter file, I get the best setting for the number of trees and the confidence level for pruning.

Random Forest.number_of_trees= 152

Random Forest.confidence= 0.30000003999999997
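The exhaustive search that a grid optimization performs can be sketched in a few lines of Python. This is an illustration of the technique, not RapidMiner's implementation. The scoring function below is a hypothetical stand-in for a cross-validated performance value, and the grid values are made up for the example.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination; return the best (params, score).

    `evaluate` maps a dict of parameters to a score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for training and validating a learner.
def toy_score(params):
    return (-abs(params["number_of_trees"] - 150)
            - 100 * abs(params["confidence"] - 0.3))

grid = {
    "number_of_trees": [50, 100, 150, 200],
    "confidence": [0.1, 0.2, 0.3, 0.4],
}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows multiplicatively with each parameter: this grid already needs 4 × 4 = 16 evaluations, each of which would be a full cross-validation in a real process.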

What if you have called R/Python libraries, like EARTH, XGBoost, or sklearn, and want to optimize their parameters in RapidMiner? This is very easy: set up a macro for each parameter and reference it later in your R/Python code. Alternatively, add a deactivated dummy operator, optimize one of its parameters, and use that value as a macro inside Execute R or Execute Python.

How do I use RapidMiner for Ensemble Methods?

In RapidMiner, you can leverage all ensemble methods (Vote, Stacking, Bagging, AdaBoost, etc.) to combine the strengths of different learners. I used Vote for my final ensemble models. The ‘Vote’ operator applies a majority vote (for classification) or the average (for regression) on top of the predictions of the inner learners. Of course, you can also combine the probability predictions from several learners manually with ‘Join’ followed by ‘Generate Attribute’.
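Both ways of combining learners described above can be sketched in plain Python. This is only an illustration of the two techniques. The learner names, labels, and probabilities are hypothetical, and the tie-breaking behaviour of `Counter.most_common` is a property of this sketch, not of the ‘Vote’ operator.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-learner label lists into one label per row."""
    return [Counter(row).most_common(1)[0][0] for row in zip(*predictions)]

def average_probabilities(prob_sets):
    """Average per-class probabilities across learners, row by row."""
    n = len(prob_sets)
    return [
        [sum(p) / n for p in zip(*rows)]  # one average per class
        for rows in zip(*prob_sets)       # one group of rows per example
    ]

# Hypothetical predicted labels from three learners on three animals.
rf_labels = ["Adoption", "Transfer", "Adoption"]
xgb_labels = ["Adoption", "Adoption", "Transfer"]
knn_labels = ["Transfer", "Transfer", "Transfer"]
voted = majority_vote([rf_labels, xgb_labels, knn_labels])

# Hypothetical two-class probabilities from two learners on two animals.
rf_probs = [[0.6, 0.4], [0.2, 0.8]]
xgb_probs = [[0.8, 0.2], [0.4, 0.6]]
blended = average_probabilities([rf_probs, xgb_probs])
```

For a competition scored on log loss, averaging probabilities is usually the more relevant of the two, since a hard vote throws away the confidence information that the metric rewards.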

This is just the start of my Kaggle journey. As of June 21, 2016, my best ranking on the leaderboard is 65 out of 1112 (Top 10%). My take-home message: R and Python are not the only tools for Kaggle competitions, and you’re missing out on a lot by not incorporating RapidMiner into your analysis. Contact me if you want to team up using RapidMiner as the platform for Kaggle competitions!

Update: RapidMiner 7.2 was released after I wrote this post, and it now contains Gradient Boosted Trees and Generalized Linear Models. While I love working in R & Python, these two new operators (based on the H2O.ai algorithms) make my life way easier.

Yuanyuan Huang

Yuanyuan Huang, PhD, RapidMiner. Yuanyuan (YY) majored in mathematics as an undergraduate at the University of Science and Technology of China, went on to study and do research in bioinformatics and statistics at Iowa State University, and joined RapidMiner in May 2015. During her PhD research, YY first-authored 5 and co-authored 2 other research papers, focusing on statistical assessment and prediction of protein 3D structures and on evolutionary game dynamic models. She currently works as a data scientist on the Customer Success Team at RapidMiner.