RapidMiner & Kaggle: A Measurable Difference!

This post is Part 2 of a two-part series on how I used RapidMiner in the Kaggle competition Shelter Animal Outcomes. Read my first post here.

How do I use RapidMiner for Imputation?

Recall that we have missing values in both the training and the testing data. We can nest a supervised learner (K-NN, Naïve Bayes, or any other preferred algorithm) inside the ‘Impute Missing Values’ operator to estimate them. I used a K-NN (k=5) scheme. The attribute filter type parameter is set to ‘subset’ and restricted to AgeInDays, Neutered, and Sex, so only the missing values of those three attributes are estimated.
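For readers who think in code, here is a minimal pure-Python sketch of the idea behind K-NN imputation. It is not RapidMiner's implementation: the `knn_impute` helper, the toy rows, and the choice of Hour and Weekday as distance features are all assumptions made for illustration. Numeric targets get the mean of the neighbours' values, nominal targets get the majority value.

```python
from collections import Counter
from math import sqrt


def knn_impute(rows, target, features, k=5):
    """Return copies of `rows` with missing (None) `target` values filled
    from the k nearest rows that do have a value for `target`."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        new_row = dict(row)
        if row[target] is None:
            # Euclidean distance over the (numeric, non-missing) feature columns
            nearest = sorted(
                complete,
                key=lambda r: sqrt(sum((r[f] - row[f]) ** 2 for f in features)),
            )[:k]
            values = [r[target] for r in nearest]
            numeric = all(
                isinstance(v, (int, float)) and not isinstance(v, bool)
                for v in values
            )
            if numeric:
                new_row[target] = sum(values) / len(values)  # mean of neighbours
            else:
                new_row[target] = Counter(values).most_common(1)[0][0]  # majority
        filled.append(new_row)
    return filled


# Toy data (hypothetical values, not taken from the competition files)
rows = [
    {"Hour": 9, "Weekday": 1, "AgeInDays": 30, "Sex": "Male"},
    {"Hour": 10, "Weekday": 1, "AgeInDays": 60, "Sex": "Male"},
    {"Hour": 11, "Weekday": 2, "AgeInDays": 90, "Sex": "Female"},
    {"Hour": 17, "Weekday": 5, "AgeInDays": 720, "Sex": "Female"},
    {"Hour": 18, "Weekday": 6, "AgeInDays": 1095, "Sex": "Female"},
    {"Hour": 10, "Weekday": 2, "AgeInDays": None, "Sex": "Male"},
    {"Hour": 17, "Weekday": 6, "AgeInDays": 900, "Sex": None},
]

# Impute one attribute at a time, as the 'subset' filter does for a list of attributes
filled = knn_impute(rows, "AgeInDays", ["Hour", "Weekday"], k=3)
filled = knn_impute(filled, "Sex", ["Hour", "Weekday"], k=3)
print(filled[5]["AgeInDays"], filled[6]["Sex"])
```

In practice you would scale the features before computing distances, since an attribute with a large range would otherwise dominate the neighbour search.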

 

YYKaggle07



How do I use RapidMiner for modeling and validation?

 

YYKaggle08

 

 

A quick-and-dirty Random Forest model can be built inside a 5-fold cross-validation within one minute in RapidMiner! Because Random Forest handles nominal attributes directly, there is no need to convert them to dummy codes. The easy-to-interpret, tree-structured results make Random Forest my number one go-to learner. In addition, RapidMiner reports variable importance for a Random Forest model through the ‘Weight by Tree Importance’ operator.

YYKaggle09

 

As you may have guessed, Age, Hour, Neutered or not, and Weekday (DateTime_day) are important attributes for predicting the outcomes of shelter animals. Using the quick-and-dirty Random Forest model, my first submission landed a log loss score of 2.48665. Next, let’s build an XGBoost model in the same pipeline, using RapidMiner’s seamless integration of R/Python scripting.

YYKaggle10
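As a reminder of what the log loss score measures, here is a minimal sketch of multi-class log loss in plain Python. It assumes the usual definition: each row of predicted probabilities is rescaled to sum to one, probabilities are clipped away from 0 and 1, and the score is the negative mean log probability assigned to the true class. The function name and the example labels are illustrative only.

```python
from math import log


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """y_true: list of true class labels.
    y_prob: list of {class_label: predicted_probability} dicts, one per row."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = probs[label] / sum(probs.values())  # rescale the row to sum to 1
        p = min(max(p, eps), 1 - eps)           # clip to avoid log(0)
        total += log(p)
    return -total / len(y_true)


# A model that knows nothing and predicts 1/5 for each of five classes
classes = ["Adoption", "Died", "Euthanasia", "Return_to_owner", "Transfer"]
uniform = {c: 0.2 for c in classes}
uniform_loss = multiclass_log_loss(["Adoption", "Transfer"], [uniform, uniform])
print(uniform_loss)  # equals log(5) up to floating-point rounding
```

Lower is better, and a confident wrong prediction is punished much more heavily than an uncertain one.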

How do I use RapidMiner for parameter tuning?

Want to find the best set of parameters for your models? You can do this with Optimize Parameters (Grid) in RapidMiner, which optimizes the parameters of its inner operators. You can optimize more than one parameter of the inner learner at once and specify the range and number of steps for each target parameter yourself.

 

YYKaggle11

Here is the parameter setting for the inner Random Forest in my process.

 

YYKaggle12

 

The resulting parameter file gives the best setting for the number of trees and the confidence level for pruning:

Random Forest.number_of_trees= 152

Random Forest.confidence= 0.30000003999999997
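Conceptually, a grid optimization just evaluates every combination of the candidate values and keeps the best one. The sketch below shows that loop in plain Python. The `grid_search` and `linear_steps` helpers and the toy scoring function are hypothetical; in a real process the score would come from the inner cross-validation, and I am assuming that a linear grid with n steps yields n + 1 evenly spaced values.

```python
from itertools import product


def linear_steps(lo, hi, steps):
    """Evenly spaced candidates from lo to hi inclusive (steps + 1 values)."""
    return [lo + (hi - lo) * i / steps for i in range(steps + 1)]


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid, return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for a cross-validated performance (higher is better).
    A made-up surface, NOT the real behaviour of a Random Forest."""
    trees_term = (params["number_of_trees"] - 150) ** 2 / 10_000
    conf_term = (params["confidence"] - 0.3) ** 2
    return -(trees_term + conf_term)


param_grid = {
    "number_of_trees": [round(v) for v in linear_steps(50, 250, 4)],
    "confidence": linear_steps(0.1, 0.5, 4),
}
best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

Generating the grid by floating-point arithmetic is also why a best value can show up as something slightly off a round number rather than exactly 0.3.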

What if you have called R/Python libraries such as EARTH, XGBoost, or sklearn, and want to optimize their parameters in RapidMiner? This is very easy: set up a macro for each parameter and reference it later in your R/Python code. Alternatively, add a deactivated dummy operator, optimize one of its parameters, and use that value as a macro inside Execute R or Execute Python.

How do I use RapidMiner for Ensemble Methods?

In RapidMiner, you can use any of the ensemble methods (Vote, Stacking, Bagging, AdaBoost, etc.) to combine the strengths of different learners. I used Vote for my final ensemble model. The ‘Vote’ operator applies a majority vote (for classification) or an average (for regression) to the predictions of the inner learners. Of course, you can also combine the probability predictions from several learners manually with ‘Join’ → ‘Generate Attributes’.
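Both ways of combining learners are easy to sketch in plain Python. The helper names, learner names, and predictions below are invented for illustration and are not the output of my actual process: `majority_vote` mirrors what a voting ensemble does for classification, and `average_probabilities` mirrors the manual route of joining the learners' confidence columns and averaging them.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: one list of predicted labels per learner (same row order).
    Returns the most common label per row (ties go to the first label seen)."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


def average_probabilities(prob_tables):
    """prob_tables: one list of {class: probability} dicts per learner.
    Returns the per-row average probability for each class."""
    averaged = []
    for rows in zip(*prob_tables):
        averaged.append(
            {c: sum(r[c] for r in rows) / len(rows) for c in rows[0]}
        )
    return averaged


# Hypothetical label predictions from three learners for three animals
rf_labels = ["Adoption", "Transfer", "Adoption"]
xgb_labels = ["Adoption", "Adoption", "Transfer"]
knn_labels = ["Transfer", "Transfer", "Transfer"]
voted = majority_vote([rf_labels, xgb_labels, knn_labels])
print(voted)

# Hypothetical probability predictions from two learners for one animal
rf_probs = [{"Adoption": 0.6, "Transfer": 0.4}]
xgb_probs = [{"Adoption": 0.8, "Transfer": 0.2}]
blended = average_probabilities([rf_probs, xgb_probs])
print(blended)
```

For a competition scored on log loss, averaging probabilities is usually the more useful of the two, because a majority vote throws away the confidence information the metric rewards.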

This is just the start of my Kaggle journey. As of June 21, 2016, my best ranking on the leaderboard is 65 out of 1112 (top 10%). My take-home message: R and Python are not the only tools for Kaggle competitions, and you’re missing out on a lot by not incorporating RapidMiner into your analysis. Contact me if you want to team up using RapidMiner as the platform for Kaggle competitions!

Update: RapidMiner 7.2 was released after I wrote this post, and it now contains Gradient Boosted Trees and Generalized Linear Models. While I love working in R and Python, these two new operators (based on the H2O.ai algorithms) make my life much easier.

  • Brian Tvenstrup

    Thanks for the great article on using RapidMiner for Kaggle! I have used RapidMiner for several Kaggle competitions myself and I agree that it is a very flexible tool for these types of data mining problems. You did not describe this part in detail here, but you can also use RapidMiner for scoring the test data and preparing the upload files in the necessary formats (which are often quite specific in nature, so you may need to apply number formats and reorder attributes when doing so).

    Also, it would be terrific if you could post your process file here as a learning exercise for others who want to follow along with your steps in more detail.

  • Rushikesh

    Hello Ma’am, thank you for sharing the process. I am new to RapidMiner and tried to run the process you provided. However, the ‘Build XGBoost Model’ operator gives the error “X must be atomic for ‘sort.list’”. I believe it is not able to interpret the line ‘require(xgboost)’. I have installed XGBoost via the R command prompt. I wonder why this error keeps popping up.

    Thanks

  • Malcolm Haynes

    How do you get the tabs (like the log tab) to appear next to the process tab?

    • Yuanyuan Huang

      You can click a tab and drag and drop it to the appropriate place to rearrange the views. This allows you to customize the RapidMiner interface to your needs and make your work more efficient. With a fresh install you might not have all the panels you want. You can activate new panels under View->Show Panel. My must-haves are:
      Context
      Macros
      XML
      Server Monitor