Hearthstone Cluster Analysis

As a passionate data scientist, I try to use analytics not only at work but also in my everyday interests. One of my hobbies besides data science is Hearthstone. This blog post focuses on a cluster analysis of Hearthstone decks.

Hearthstone is a digital collectible card game by Blizzard and one of the most viewed and played games on the internet. You might compare it to Magic: The Gathering or even the Pokémon Trading Card Game. You build a deck of 30 cards and play against an opponent in a turn-based fashion. In this post I will analyze those decks to identify patterns.

Data Crawling

As usual, the most important thing for data analysis is the data. I found a very cool Python script here which allows you to crawl Hearthpwn.com for decks. It can also use hearthstoneapi.com to get information about each card. In our analysis we will only use the decks themselves, not the added card information.

The data is stored in a SQLite database, which can easily be used in RapidMiner. The SQLite driver is not shipped with RapidMiner, but you can download it and add it to your RapidMiner installation.
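If you want to peek at the crawled database outside of RapidMiner, Python's built-in sqlite3 module is enough. The following is only a minimal sketch and not part of my RapidMiner process. The table names deck_cards and deck_meta are made up for the demo, because the real layout depends on the crawler script.

```python
import sqlite3


def list_tables(conn):
    """Return {table_name: row_count} for every table in the database."""
    cur = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    names = [row[0] for row in cur.fetchall()]
    counts = {}
    for name in names:
        # The names come from sqlite_master itself, so quoting them is enough here.
        counts[name] = conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    return counts


# Demo on an in-memory database so the sketch runs without the real file.
# For the crawled data you would call sqlite3.connect() with the path to your file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deck_cards (deck_id INTEGER, card_name TEXT, amount INTEGER)")
conn.execute("CREATE TABLE deck_meta (deck_id INTEGER, class TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO deck_cards VALUES (?, ?, ?)",
    [(1, "Brawl", 2), (1, "Shield Slam", 2), (2, "Upgrade", 2)],
)
conn.executemany(
    "INSERT INTO deck_meta VALUES (?, ?, ?)",
    [(1, "Warrior", 12), (2, "Warrior", 7)],
)
print(list_tables(conn))
```

Listing the tables and their row counts first is a cheap way to check that the crawl actually produced data before you connect RapidMiner to it.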

Data Overview

My crawling process gave me two different tables. The first one contains the decks themselves, that is, which cards make up each deck.


The second table I used contains some metadata about the deck – its class, rating, a timestamp and cost in dust (the in-game currency).


About Classes: In Hearthstone you can pick between 9 different classes. These classes have unique abilities and unique cards they can use.

Class Distribution

After joining our metadata to the deck data, we can do a simple aggregate to get the distribution of classes. The result can be seen in the chart below.
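For readers who prefer code over operators, the join and aggregate can be sketched in a few lines of Python. This is an illustration under assumptions, not the RapidMiner process itself: the field names deck_id and class are hypothetical, and the rows are toy data. Because the card table holds one row per deck and card, the sketch counts every deck only once.

```python
from collections import Counter


def class_distribution(card_rows, meta_rows):
    """Count decks per class, counting every deck exactly once."""
    # Lookup built from the metadata table: deck id -> class.
    class_by_deck = {m["deck_id"]: m["class"] for m in meta_rows}
    # The card table has one row per (deck, card), so collapse to distinct decks.
    deck_ids = {r["deck_id"] for r in card_rows}
    # Inner join: decks without metadata are dropped.
    return Counter(class_by_deck[d] for d in deck_ids if d in class_by_deck)


# Toy data with hypothetical field names.
card_rows = [
    {"deck_id": 1, "card_name": "Brawl", "amount": 2},
    {"deck_id": 1, "card_name": "Shield Slam", "amount": 2},
    {"deck_id": 2, "card_name": "Upgrade", "amount": 2},
    {"deck_id": 3, "card_name": "Faerie Dragon", "amount": 2},
]
meta_rows = [
    {"deck_id": 1, "class": "Warrior"},
    {"deck_id": 2, "class": "Warrior"},
    {"deck_id": 3, "class": "Hunter"},
]
print(class_distribution(card_rows, meta_rows))
```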

To be honest, I am a bit surprised that the Hunter is that far down. I am not sure how to interpret this, but maybe there aren’t that many archetypes available to the Hunter, so there aren’t that many decks around?

About Archetypes: Archetypes are the fundamental types of decks. For example, there is a difference between playing aggressively (called aggro) and playing passively (called control). There are also other types of decks, for example Midrange, One-Turn-Kill (OTK) or Tempo.

In my analysis I would like to find the fundamental archetypes in the dataset. I will perform the analysis only for the Warrior. If you are interested in other classes, you can do it yourself with the attached processes and data.


As usual, for most kinds of advanced analytics we need to convert the table into a *one line per deck* format. Usually this is called a customer profile, visitor profile or machine profile. In our case it is a deck profile. At this stage we can do very different things. We could build a profile of attributes (columns) like:

  • Number of Legendaries
  • Number of 1,2,3,4,5,6,7 Mana Cards
  • Number of Battlecry Minions
  • Number of Spells

I’ve decided to go for a card-based profile. We count how often each card occurs in a deck. The result looks like this:



As you can see, this yields a dataset with 290 examples and 169 attributes. We’ll do a cluster analysis on this data.
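The pivot from one row per deck and card to one row per deck can be sketched as follows. This is a minimal illustration, assuming the card table has the hypothetical layout (deck_id, card_name, amount). The amount_ prefix mirrors the attribute names you will see in the centroid table later on.

```python
from collections import defaultdict


def build_deck_profiles(card_rows):
    """Pivot (deck_id, card_name, amount) rows into one row per deck."""
    counts = defaultdict(lambda: defaultdict(int))
    for deck_id, card_name, amount in card_rows:
        counts[deck_id][card_name] += amount
    # One column per card that appears anywhere in the data, in a fixed order.
    cards = sorted({card for per_deck in counts.values() for card in per_deck})
    columns = [f"amount_{card}" for card in cards]
    deck_ids = sorted(counts)
    # Cards a deck does not run get a 0, so every row has the same length.
    matrix = [[counts[d].get(card, 0) for card in cards] for d in deck_ids]
    return deck_ids, columns, matrix


# Toy rows, not taken from the crawled data.
rows = [
    (1, "Brawl", 2),
    (1, "C'Thun", 1),
    (2, "Brawl", 1),
    (2, "Upgrade", 2),
]
deck_ids, columns, matrix = build_deck_profiles(rows)
print(columns)
for deck_id, row in zip(deck_ids, matrix):
    print(deck_id, row)
```

Most entries of such a table are zero, because a deck only holds 30 of the many available cards. That is fine for k-Means, but it is worth keeping in mind when you interpret distances.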

Cluster Analysis: Decks

Clustering finds groups of examples that are similar to each other. For this analysis, I’m using the k-Means algorithm. It searches for the k groups with the smallest average distance to their cluster centroids (= the smallest within-cluster variance).
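To make the idea concrete, here is a small pure-Python sketch of the classic k-Means loop (assign every example to its nearest centroid, then move each centroid to the mean of its members). It is not the RapidMiner operator. In particular, the deterministic farthest-first start is a simplification I chose so that the example is reproducible; real tools usually try several random starts.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain k-Means (Lloyd's algorithm). Returns (labels, centroids)."""
    # Simplified, deterministic start: first point, then always the point
    # that is farthest away from the centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged: the assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ran empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy deck profiles with two columns: copies of Brawl, copies of Upgrade.
profiles = [[2, 0], [2, 1], [0, 2], [0, 2]]
labels, centroids = kmeans(profiles, k=2)
print(labels)
print(centroids)
```

Each centroid is simply the average deck of its cluster, which is why its entries can be read as "average number of copies per deck".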

The big question for all cluster methods is: How many groups do I want to find? To figure this out, we will build clustering models for various numbers of clusters. I’ll calculate the Davies-Bouldin index to evaluate the clustering quality.
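For reference, the textbook definition of the Davies-Bouldin index can be sketched in pure Python as shown here. For every cluster it takes the worst ratio of (own scatter + other scatter) to the distance between the two centroids, and then averages over all clusters, so lower values mean compact, well-separated clusters. This is my own illustration of the definition and not the code RapidMiner runs; tools may scale or sign the value differently.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a clustering; lower values are better."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    centroids, scatter = {}, {}
    for label, members in clusters.items():
        c = [sum(col) / len(members) for col in zip(*members)]
        centroids[label] = c
        # Scatter: average distance of the members to their own centroid.
        scatter[label] = sum(math.dist(p, c) for p in members) / len(members)

    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster.
        # Note: this divides by zero if two centroids coincide.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Two tight, well-separated toy clusters.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
print(davies_bouldin(points, [0, 0, 1, 1]))
```

To choose k, you would cluster the deck profiles once per candidate k, compute the index for each result and compare the values.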



It is very clear that the best value for the number of clusters is 4. Now let’s take a closer look at the four clusters. To do this we will build the k-Means model with k=4 and have a look at it.

Analysis of the Cluster Model

To figure out what our four clusters are, we will do two things. First we will analyze the centroid table. The centroid table shows us the central point of each cluster. We can interpret a 0.53 for amount_C’Thun as “An average deck of this cluster runs 0.5 C’Thuns” or “Every second deck runs a C’Thun”.

As mentioned earlier, we have 169 attributes. Visualizing all of them is not that easy in this blog post, but you can do it on your own by downloading this analysis.

We will reduce the number of attributes to a more reasonable amount. We could use a feature selection technique with a 1-vs-All strategy to find the most distinguishing attributes, but I decided to go for a simpler way, using this calculation:

Difference = max(value – average(value)).

For each card, this is the largest amount by which its usage in a single cluster exceeds its average usage across all clusters. The difference needs to be at least 0.6 for us to consider the attribute relevant. This is a quick filter method which yields reasonable results. We end up with a total of 29 cards. The interesting chart resulting from this filter shows the average usage of the remaining cards in the different clusters. It is depicted below. We will discuss this chart in detail in the next section.
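The filter is easy to reproduce on a centroid table. The sketch below follows my reading of the formula: for each card, take the centroid value of every cluster, subtract the mean of those values, and keep the card if the largest result reaches the threshold. The input layout (a dictionary per cluster) and the toy numbers are assumptions for illustration only.

```python
def distinguishing_cards(centroids, threshold=0.6):
    """centroids: {cluster_name: {card: average copies per deck}}."""
    cards = sorted({card for row in centroids.values() for card in row})
    selected = {}
    for card in cards:
        values = [row.get(card, 0.0) for row in centroids.values()]
        mean = sum(values) / len(values)
        # Largest amount by which any single cluster exceeds the overall mean.
        difference = max(v - mean for v in values)
        if difference >= threshold:
            selected[card] = difference
    return selected


# Toy centroid table, not the values from the real model.
toy_centroids = {
    "cluster_0": {"Brawl": 2.0, "Heroic Strike": 0.5},
    "cluster_1": {"Brawl": 0.0, "Heroic Strike": 0.5},
    "cluster_2": {"Brawl": 0.0, "Heroic Strike": 0.5},
    "cluster_3": {"Brawl": 0.0, "Heroic Strike": 0.5},
}
print(distinguishing_cards(toy_centroids))
```

A card that every cluster runs equally often gets a difference of 0 and is dropped, while a card concentrated in one cluster survives. Cards that are only notable for being absent from one cluster are not picked up, because the formula has no absolute value.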



The other thing we will do with the clustering is find the most prototypical deck of each cluster. The most prototypical deck is defined as the deck with the smallest Euclidean distance to the cluster centroid.
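Finding that deck is a nearest-neighbour lookup against the centroid. Here is a minimal sketch with toy data. The function name and the argument layout are my own choices for illustration, not something taken from the RapidMiner process.

```python
import math


def most_prototypical(deck_ids, matrix, labels, centroids, cluster):
    """Return the id of the deck in `cluster` closest to that cluster's centroid."""
    candidates = [
        (math.dist(row, centroids[cluster]), deck_id)
        for deck_id, row, label in zip(deck_ids, matrix, labels)
        if label == cluster
    ]
    if not candidates:
        raise ValueError(f"cluster {cluster} has no members")
    # min() compares the distance first; ties fall back to the smaller deck id.
    return min(candidates)[1]


# Toy deck profiles (two card columns) with made-up deck ids.
deck_ids = [101, 102, 103]
matrix = [[2, 0], [2, 1], [0, 2]]
labels = [0, 0, 1]
centroids = [[2.0, 0.75], [0.0, 2.0]]
print(most_prototypical(deck_ids, matrix, labels, centroids, cluster=0))
```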

Cluster_0 – Control & C’Thun

The first cluster is straightforward to interpret. It uses Brawl, Shield Slam and Shield Block as unique cards. It also runs C’Thun about 50% of the time and sometimes Justicar. This is definitely the Control & C’Thun cluster: Prototypical Deck

Cluster_1 – Pirates and Weapons

Again something very straightforward. It uses Southsea Deckhand, Dread Corsair and Bloodsail Corsair as pirate cards. Additionally it runs Upgrade and Arcanite Reaper as well as Heroic Strike. This is the famous Pirate or Weapon Warrior! Here’s the Prototypical Deck.

Cluster_2 – Dragons

Again something very easy. It runs Azure Drake, Twilight Guardian, Faerie Dragon and Drakonid Crusher as well as Blackwing Corruptor. Yep, it’s Dragon Warrior! Here’s a Prototypical Deck.

Cluster_3 – Tempo?

The last cluster is the hardest for me to interpret. The signature cards are Battle Rage, Whirlwind and Armorsmith. I think that this group is most likely a Tempo Warrior. It might be that there are some other decks in there: Prototypical Deck.

Here is a screenshot of my entire process.


Further Predictive Analytics Options

Of course there are more things to do than finding archetypes. One thing which came to mind is building your own decks or finding replacements for cards using machine learning. While those options are in general pretty nice, we do not have enough data for them. Having only ~200 decks per class limits the options of what we can do.

The other idea would be a time-dependent analysis of the meta game. You might classify all decks into aggressive, midrange or control decks and have a look at the evolution over time. This might be a very nice future analysis of the data set. Feel free to join the RapidMiner Community and do this together with other analysts.

Do your Own Cluster Analysis

The repository with a dump of the data can be found HERE. Feel free to download the repository and add it to your very own RapidMiner. If you need help adding the repository to your RapidMiner Studio, have a look at this Knowledge Base entry.

I am curious to hear your results.

Happy Mining!

Showing 5 comments
  • Antal Sofalvy

    Hello Martin,
    now it has just been revealed why you win so often… 🙂


  • Bethaney Peterson

    This is very interesting! I have been working at deckbuilding and wondered how to sift through data without reading deck guides from players in forums. I am not a software engineer or programmer however I can see the conclusions you’ve drawn from the data while I play. I also noted that Hunter does not have nearly the integrated archetypes aside from beast synergy, and now a little murloc adaptation and trade off, but there have not been heal, c’thun, or massive spells to buff generally tied into the class. While beast synergy is great, the most interesting mechanic still available in wild is “inspire” which tied in well with Mukla’s Champion, and Lowly Squire, but didn’t allow for good development while part of the standard meta.

  • Sangeet

    Hello Martin,

    When I do clustering on text data, most of my data falls into a single cluster. Can you help me with this?

  • Carlos

    Hi Martin,
    Thanks for the data and processes. I downloaded and opened the process, but it does not let me execute it. I get the following error in Exceute_Python: Parsing failed. The script could not be parsed.

  • Martin Schmitz

    Hi Carlos,

    thanks for your feedback. Can you run Python scripts in general? Or is just this one the problem?

    Feel free to open a thread on community so we can discuss this.