25 August 2016

Using RapidMiner for Kaggle Competitions – Part 1

Use RapidMiner, Make a Kaggle Impact!

RapidMiner is a widely used open-source platform for data mining. It is available both as a stand-alone application for data analysis and as a data mining engine that can be integrated into your own products. Lately, I’ve been trying some machine learning “challenges” at Kaggle.com. I wanted to use RapidMiner to tackle Kaggle competitions and see if I could get into the top 10% of the machine learning challenge called “Shelter Animal Outcomes”. This was my first time taking part in such a challenge!

If you visit the link (the competition is over), you'll see that the data set is a standard CSV file containing information on many shelter dogs and cats. The challenge provides two CSV files: a training file that contains all attributes as well as the outcome label, and a testing file that contains only the attributes (no outcome label). The goal is to predict the outcome label for the testing set based on the training set.
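To make the train/test layout concrete, here is a minimal Python sketch that uses only the standard library. The rows are made-up sample data, and the column names (`AnimalID`, `ID`, `OutcomeType`) are assumptions for illustration rather than a copy of the competition files.

```python
import csv
import io

# Made-up sample rows; the column names are assumptions for illustration.
TRAIN_CSV = """AnimalID,AnimalType,SexuponOutcome,AgeuponOutcome,OutcomeType
A001,Dog,Neutered Male,1 year,Return_to_owner
A002,Cat,Spayed Female,2 years,Adoption
A003,Cat,Intact Male,3 weeks,Transfer
"""

TEST_CSV = """ID,AnimalType,SexuponOutcome,AgeuponOutcome
1,Dog,Intact Female,10 months
2,Cat,Neutered Male,2 years
"""

LABEL = "OutcomeType"


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def split_label(rows, label):
    """Separate the label column from the remaining attributes."""
    features = [{k: v for k, v in row.items() if k != label} for row in rows]
    labels = [row[label] for row in rows]
    return features, labels


train_rows = read_rows(TRAIN_CSV)
test_rows = read_rows(TEST_CSV)
train_features, train_labels = split_label(train_rows, LABEL)

print(train_labels)            # labels are known only for the training rows
print(LABEL in test_rows[0])   # the test rows carry no label column
```

The point of the sketch is only the shape of the problem: a model is fitted on `train_features` and `train_labels`, then asked to fill in the missing label for every row in `test_rows`.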

My environment is as follows: RapidMiner Studio 7.1 on Windows 10, with R and/or Python installed for any extra libraries.

YYKaggle01

How do I use RapidMiner for EDA (exploratory data analysis) and ETL?

RapidMiner Studio offers a versatile and intuitive interface compared to other advanced analytics tools, and it is well known for its easy-to-use data visualizations. To load the original CSV files, the data import wizard guides you through several steps, after which you can store both the train and test data in your repository for later retrieval.

YYKaggle02

You can assign special roles to attributes of the loaded data, such as ID, label, or weight, or a user-specified role like ‘subtype’. The date-time format is a little tricky, so it is wise to define your date-time attribute later in the data prep with the ‘Nominal to Date’ operator.

YYKaggle03

For exploratory data analysis, visualizations in the Results view are just one click away. For instance, a stacked bar chart of outcomes by AnimalType is shown below.

YYKaggle04

Both cats and dogs are commonly adopted or transferred, but dogs are much more likely to be returned to their owners than cats. It also appears that cats are more likely to have died compared to dogs. Fortunately, it appears very few animals die or get euthanized overall.
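The numbers behind a stacked bar chart like this are simply outcome shares per animal type. Outside RapidMiner, a rough Python equivalent could look like the sketch below. The records are made-up sample pairs, not the competition data, and `outcome_shares` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Made-up (AnimalType, OutcomeType) pairs; not the real competition data.
records = [
    ("Dog", "Adoption"), ("Dog", "Return_to_owner"), ("Dog", "Transfer"),
    ("Dog", "Return_to_owner"), ("Cat", "Adoption"), ("Cat", "Transfer"),
    ("Cat", "Transfer"), ("Cat", "Died"),
]


def outcome_shares(pairs):
    """Return, per animal type, the fraction of records with each outcome."""
    counts = defaultdict(Counter)
    for animal, outcome in pairs:
        counts[animal][outcome] += 1
    shares = {}
    for animal, counter in counts.items():
        total = sum(counter.values())
        shares[animal] = {o: n / total for o, n in counter.items()}
    return shares


shares = outcome_shares(records)
for animal, dist in sorted(shares.items()):
    print(animal, {o: round(s, 2) for o, s in sorted(dist.items())})
```

Comparing shares rather than raw counts matters here because the data set does not necessarily contain the same number of cats and dogs.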

Tip: Sometimes it's best to generate an EDA report, and you can do just that with the reporting extension. You can find it on the RapidMiner Marketplace: just open the Marketplace pull-down menu and click Marketplace.

YYKaggle05

Look at the variables ‘AgeuponOutcome’ and ‘SexuponOutcome’, which are not in a format that is easily usable for efficient modelling. The information we want to extract from them is the sex, whether the animal is neutered, and the age. We can use the ‘Generate Attributes’ operator to create these new attributes. Note that complicated expressions can be built from multiple operations and functions, and parentheses can be used to nest operations. The logic used for massaging the train data can be re-used on the test data.
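For readers who prefer to see the extraction logic spelled out, here is a minimal Python sketch of the same idea. It is not the RapidMiner expression syntax. It assumes values shaped like "2 years" or "Neutered Male", uses approximate day counts per unit, and the function names `age_in_days` and `split_sex` are hypothetical.

```python
# Approximate number of days per unit; an assumption of this sketch.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

UNKNOWN = ("unknown", "unknown")


def age_in_days(age_text):
    """Convert text such as '2 years' or '3 weeks' to days; None if unparseable."""
    parts = age_text.strip().lower().split()
    if len(parts) != 2 or not parts[0].isdigit():
        return None
    unit = parts[1].rstrip("s")  # 'years' -> 'year'
    if unit not in DAYS_PER_UNIT:
        return None
    return int(parts[0]) * DAYS_PER_UNIT[unit]


def split_sex(sex_text):
    """Split text such as 'Neutered Male' into (sex, neutered)."""
    parts = sex_text.strip().lower().split()
    if len(parts) != 2:
        return UNKNOWN
    status, sex = parts
    if sex not in ("male", "female"):
        return UNKNOWN
    if status not in ("neutered", "spayed", "intact"):
        return UNKNOWN
    neutered = "yes" if status in ("neutered", "spayed") else "no"
    return (sex, neutered)


print(age_in_days("2 years"), split_sex("Spayed Female"))
```

Because both helpers are plain functions of a single value, the same code can be applied unchanged to the train and the test data, which mirrors re-using the pre-processing logic in RapidMiner.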

YYKaggle06

As you can see from the snapshots, in the data pre-processing subprocess I am adding a list of new attributes:

There is more than one way to extract time variables from a date-time value. To avoid coding, you can use the two operators ‘Nominal to Date’ and ‘Date to Numerical’ to extract the weekday (day relative to week), the week relative to the year, the month, the year, and so on.
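If you do want the coding route, the same time variables can be derived in a few lines of Python. This is a sketch under assumptions: the timestamp format string is a guess at how the date-time column is written, `time_features` is a hypothetical helper, and the ISO week and weekday numbering used here may differ from what the RapidMiner operators return.

```python
from datetime import datetime


def time_features(stamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string and return calendar features as a dict."""
    moment = datetime.strptime(stamp, fmt)
    _iso_year, iso_week, iso_weekday = moment.isocalendar()
    return {
        "year": moment.year,
        "month": moment.month,
        "week_of_year": iso_week,   # ISO 8601 week number
        "weekday": iso_weekday,     # 1 = Monday ... 7 = Sunday
        "hour": moment.hour,
    }


features = time_features("2016-08-25 14:30:00")
print(features)
```

Parsing the string first and extracting numeric parts afterwards is the same two-step pattern as chaining ‘Nominal to Date’ and ‘Date to Numerical’.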

In my next post (Part 2) I’ll go over imputation, model building and validation, and parameter tuning. Stay tuned for that!
