04 August 2016


Creative Example of Cluster Analysis in RapidMiner

If you’re looking for a creative example of cluster analysis, you’ve come to the right place. As a passionate data scientist, I try to use analytics not only at work but also in my everyday interests. One of my hobbies besides data science is Hearthstone.

For those of you who aren’t already familiar, Hearthstone is a trading card game designed by Blizzard, and one of the most viewed and played games on the internet. You might compare it to Magic: The Gathering or even the Pokemon trading card game. Basically, you build a deck of 30 cards and play against an opponent in a turn-based fashion. In this post, I will analyze those decks to identify patterns.

Example of Cluster Analysis: Hearthstone

Now that we all know how amazing the Hearthstone trading card game is, let’s dive into an example of cluster analysis using RapidMiner.

Data Crawling

As usual, the most important thing for data analysis is the data. I found a very cool Python script here which allows you to crawl Hearthpwn.com for decks. It also enables you to use hearthstoneapi.com to get information about each card. In our analysis we will only use the decks themselves, not the added card information.

The data is stored in a SQLite database, which can easily be used in RapidMiner. The SQLite driver is not shipped with RapidMiner directly, but you can download it and add it to your RapidMiner installation.
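If you want to peek at the crawled database outside of RapidMiner, Python's built-in sqlite3 module is enough. The sketch below is only an illustration: the table and column names (deck_cards, deck_id, card_name, amount) are my assumptions, not the crawler's actual schema, and it uses an in-memory database with three made-up rows so that it runs on its own.

```python
import sqlite3

# Hypothetical schema: the crawler's real table and column names may differ.
# Replace ":memory:" with the path to the crawled database file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE deck_cards (deck_id INTEGER, card_name TEXT, amount INTEGER);
    INSERT INTO deck_cards VALUES
        (1, 'Brawl', 2),
        (1, 'Shield Slam', 2),
        (2, 'Upgrade', 2);
""")

# List the tables in the database, then read one of them.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
rows = conn.execute(
    "SELECT deck_id, card_name, amount FROM deck_cards ORDER BY deck_id, card_name"
).fetchall()

print(tables)
print(rows)
conn.close()
```

Listing sqlite_master first is a handy way to discover what the crawler actually created before you write any further queries.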

Data Overview

My crawling process gave me two different tables. The first one contains the decks themselves, that is, the information about which cards make up each deck.


The second table I used contains some metadata about the deck – its class, rating, a timestamp and cost in dust (the in-game currency).


About Classes: In Hearthstone you can pick between 9 different classes. These classes have unique abilities and unique cards they can use.

Class Distribution

After joining our metadata to the deck data, we can do a simple aggregate to get the distribution of classes. The result can be seen in the chart below.
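For readers who prefer SQL to RapidMiner operators, the join-and-aggregate step could look roughly like this. Again, the schema (deck_cards, deck_meta, hero_class) and the sample rows are assumptions for illustration only, so adapt the names to whatever your crawl produced.

```python
import sqlite3

# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE deck_cards (deck_id INTEGER, card_name TEXT, amount INTEGER);
    CREATE TABLE deck_meta (deck_id INTEGER, hero_class TEXT);
    INSERT INTO deck_cards VALUES
        (1, 'Brawl', 2), (1, 'Shield Slam', 2),
        (2, 'Upgrade', 2),
        (3, 'Kill Command', 2);
    INSERT INTO deck_meta VALUES (1, 'Warrior'), (2, 'Warrior'), (3, 'Hunter');
""")

# Join the metadata onto the card rows and count each deck once per class.
class_counts = conn.execute("""
    SELECT m.hero_class, COUNT(DISTINCT c.deck_id) AS decks
    FROM deck_cards AS c
    JOIN deck_meta AS m ON m.deck_id = c.deck_id
    GROUP BY m.hero_class
    ORDER BY decks DESC, m.hero_class
""").fetchall()

print(class_counts)
conn.close()
```

COUNT(DISTINCT ...) matters here because the card table has many rows per deck; a plain COUNT(*) would count cards, not decks.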

To be honest, I am a bit surprised that the Hunter is that far down. I am not sure how to interpret this, but maybe there aren’t that many archetypes available to the Hunter, so there aren’t that many decks around?

About Archetypes: Archetypes are the fundamental types of decks. For example, there is a difference between playing aggressively (called aggro) and playing passively (called control). There are also other types of decks around, for example Midrange, One-Turn-Kill (OTK) or Tempo.

In my analysis I would like to find the fundamental archetypes in the dataset. I will perform the analysis only for the Warrior. If you are interested in other classes, you can run it yourself with the attached processes and data.


As usual, for most kinds of advanced analytics we need to convert the table into a one-line-per-deck format. Usually this is called a customer profile, visitor profile or machine profile. In our case it is a deck profile. At this stage we have many options for which attributes (columns) the profile should contain.

I’ve decided to go for a card-based profile. We count how often each card occurs in each deck, with one attribute per card. The result looks like this:

As you can see, this yields a dataset with 290 examples and 169 attributes. We’ll do a cluster analysis on this data.
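To make the card-based profile concrete, here is a small Python sketch of the pivot from long format (one row per deck and card) to one row per deck. The input rows are made up and the variable names are mine; this is not the RapidMiner process itself.

```python
from collections import defaultdict

# Made-up long-format rows: (deck_id, card_name, amount).
rows = [
    (1, "Brawl", 2),
    (1, "Shield Slam", 2),
    (2, "Upgrade", 2),
    (2, "Brawl", 1),
]

# One attribute (column) per card that appears anywhere in the data.
cards = sorted({card for _, card, _ in rows})

# Sum the amounts per deck and card.
counts = defaultdict(dict)
for deck_id, card, amount in rows:
    counts[deck_id][card] = counts[deck_id].get(card, 0) + amount

# One line per deck; cards a deck does not run get a 0.
profiles = {
    deck_id: [per_card.get(card, 0) for card in cards]
    for deck_id, per_card in counts.items()
}

print(cards)
for deck_id, profile in profiles.items():
    print(deck_id, profile)
```

Filling missing cards with 0 rather than leaving them empty is what makes the profile usable for distance-based methods such as clustering.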

Cluster Analysis: Decks

Clustering finds groups of data points that are similar to each other. For this analysis, I’m using the K-Means algorithm. It searches for the k groups whose members have the smallest average distance to their cluster centroid (= the smallest within-cluster variance).
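If you have never seen K-Means spelled out, the following is a minimal pure-Python sketch of the classic assign-then-update loop (Lloyd's algorithm). It is not RapidMiner's implementation: the initial centroids are simply the first k points, which is a simplification (real implementations use random or smarter initialization), and the six tiny "decks" are invented.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    # Naive initialization (an assumption for this sketch): first k points.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Invented card-count vectors: the first three decks share cards 1 and 2,
# the last three share cards 3 and 4.
decks = [
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
labels, centroids = kmeans(decks, k=2)
print(labels)
print(centroids)
```

Each centroid coordinate is the average count of one card within the cluster, which is exactly what makes the centroids readable as "average decks".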

The big question for all clustering methods is: how many groups do I want to find? To figure this out, we will build clustering models for a range of cluster counts. I’ll calculate the Davies-Bouldin index to evaluate the clustering quality; it compares how compact each cluster is with how far apart the clusters are.
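As a rough illustration of what this index measures, here is a pure-Python sketch of the textbook Davies-Bouldin definition: for each cluster, take the worst ratio of (scatter of cluster i + scatter of cluster j) to the distance between their centroids, then average over clusters. In this standard form, lower is better. The data and the two labelings are invented, and the sign or scaling that a particular tool reports may differ from this sketch.

```python
import math


def centroid(members):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(members) for col in zip(*members)]


def davies_bouldin(points, labels):
    clusters = sorted(set(labels))
    groups = {c: [p for p, l in zip(points, labels) if l == c] for c in clusters}
    centers = {c: centroid(groups[c]) for c in clusters}
    # Scatter: average Euclidean distance of the members to their centroid.
    scatter = {
        c: sum(math.dist(p, centers[c]) for p in groups[c]) / len(groups[c])
        for c in clusters
    }
    # For every cluster, find the most similar (worst separated) other cluster.
    # Assumes at least two clusters with distinct centroids.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Invented card-count vectors with two obvious groups.
decks = [
    [2, 2, 0, 0], [2, 1, 0, 0], [1, 2, 0, 0],
    [0, 0, 2, 2], [0, 0, 2, 1], [0, 0, 1, 2],
]
db_good = davies_bouldin(decks, [0, 0, 0, 1, 1, 1])  # sensible grouping
db_bad = davies_bouldin(decks, [0, 1, 0, 1, 0, 1])   # groups mixed together
print(db_good, db_bad)
```

In a real run you would compute this score once per candidate k, using the labels from the clustering model for that k, and compare the scores across k.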


It is very clear that the best value for the number of clusters is 4. Now, let’s take a closer look at the four clusters. To do this, we will build the K-Means model with k=4 and have a look at it.

Analysis of the Cluster Model

To figure out what our four clusters are, we will do two things. First we will analyze the centroid table. The centroid table shows us the central point of each cluster. We can interpret a 0.53 for amount_C’Thun as “An average deck of this cluster runs 0.5 C’Thuns” or “Every second deck runs a C’Thun”.

As mentioned earlier, we have about 170 attributes. Visualizing all of them is not that easy in this blog post, but you can do it yourself by downloading this analysis.

We will reduce the number of attributes to a more reasonable amount. We could use a feature selection technique with a 1-vs-All strategy to find the most distinguishing attributes, but I decided to go for a simpler way, using this calculation:

Difference = max(value – average(value)).

For each card, this is the maximum difference between its usage in one cluster and its average usage across all clusters. This difference needs to be at least 0.6 for us to consider the attribute relevant. This is a quick filter method which yields reasonable results. We end up with a total of 29 cards.
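The filter is simple enough to sketch in a few lines of Python. I am assuming here that "average" means the mean of the centroid values across the clusters; using the mean over all decks instead would be an equally plausible reading. The centroid numbers below are invented for illustration and are not taken from the actual model.

```python
# Invented centroid table: average count of each card per cluster.
centroids = {
    "cluster_0": {"Brawl": 1.8, "Upgrade": 0.1, "Execute": 1.9},
    "cluster_1": {"Brawl": 0.1, "Upgrade": 1.7, "Execute": 1.8},
    "cluster_2": {"Brawl": 0.2, "Upgrade": 0.0, "Execute": 2.0},
}
THRESHOLD = 0.6  # minimum difference taken from the text

cards = sorted(next(iter(centroids.values())))
differences = {}
for card in cards:
    values = [centroids[cluster][card] for cluster in centroids]
    average = sum(values) / len(values)
    # Difference = max(value - average(value)) over the clusters.
    differences[card] = max(value - average for value in values)

relevant = [card for card in cards if differences[card] >= THRESHOLD]
print(differences)
print(relevant)
```

In this toy table, Execute is dropped because every cluster runs it about equally often, while Brawl and Upgrade survive because one cluster uses each of them far more than the others.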

The interesting chart resulting from this filter is the average usage for the remaining cards for the different clusters. This is depicted below. We will discuss this chart in detail in the next section.

The other thing we will do with the clustering is find the most prototypical deck of each cluster. The most prototypical deck is defined as the deck with the smallest Euclidean distance to the cluster centroid.
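In code, picking the prototypical deck boils down to a single minimum over distances. A minimal sketch, assuming the decks of one cluster are available as card-count vectors (the deck names and numbers are made up):

```python
import math

# Made-up decks of one cluster, as card-count vectors.
cluster_decks = {
    "deck_a": [2, 2, 0, 0],
    "deck_b": [2, 1, 0, 0],
    "deck_c": [1, 2, 0, 0],
}

# The centroid is the component-wise mean of all decks in the cluster.
vectors = list(cluster_decks.values())
centroid = [sum(col) / len(vectors) for col in zip(*vectors)]

# The prototypical deck has the smallest Euclidean distance to the centroid.
prototype = min(cluster_decks, key=lambda name: math.dist(cluster_decks[name], centroid))
print(prototype, math.dist(cluster_decks[prototype], centroid))
```

Unlike the centroid itself, which usually contains fractional card counts, the prototype is a real deck that someone actually built.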

Cluster_0 – Control & C’Thun

The first cluster is straightforward to interpret. It uses Brawl, Shield Slam and Shield Block as unique cards. It also runs C’Thun about 50% of the time, and sometimes Justicar. This is definitely the Control & C’Thun cluster. Here’s the Prototypical Deck.

Cluster_1 – Pirates and Weapons

Again, something very straightforward. It uses Southsea Deckhand, Dread Corsair and Bloodsail Corsair as pirate cards. Additionally, it runs Upgrade and Arcanite Reaper as well as Heroic Strike. This is the famous Pirate or Weapon Warrior! Here’s the Prototypical Deck.

Cluster_2 – Dragons

Again, something very easy. It runs Azure Drake, Twilight Guardian, Faerie Dragon and Drakonid Crusher as well as Blackwing Corruptor. Yep, it’s Dragon Warrior! Here’s the Prototypical Deck.

Cluster_3 – Tempo?

The last cluster is the hardest for me to interpret. The signature cards are Battle Rage, Whirlwind and Armorsmith. I think that this group is most likely Tempo Warrior, though there may be some other decks in there as well. Here’s the Prototypical Deck.

Here is a screenshot of my entire process.

Further Predictive Analytics Options

Of course, there are more things to do than finding archetypes. One thing which came to mind is building your own decks or finding replacements for cards using machine learning. While those options are pretty nice in general, we lack enough data. Having only ~200 decks per class limits what we can do.

The other idea would be a time-dependent analysis of the meta game. You might classify all decks into aggressive, midrange or control decks and look at how they evolve over time. This might be a very nice future analysis of the data set. Feel free to join the RapidMiner Community and do this together with other analysts.

Cluster Analysis Example in RapidMiner

Feel free to download the repository and add it to your very own RapidMiner. If you need help adding the repository to your RapidMiner Studio, have a look at this Knowledge Base entry. I am curious to hear your results. Happy Clustering!

If you haven’t done so already, download RapidMiner Studio for all of the capabilities to support the full data science lifecycle.
