Whether you’re a plant manager focused on minimizing product defects or a marketer who wants to predict the results of an upcoming campaign, there’s a good chance that the data you need won’t be easy to work with.
In many cases, you’ll need to pull in a large volume of unstructured data to create a useful predictive model, which means that there’s quite a bit of work that you’ll need to do to convert that data into a usable format. The problem is that datasets often have so many columns that it can be difficult to determine where to start.
That’s where clustering comes into play.
What Is Cluster Analysis?
Clustering is a form of unsupervised machine learning that groups data with similar characteristics without a specific outcome in mind. A typical cluster analysis places data points into groups based on similarity: items within a group resemble each other, while the groups themselves are distinct.
It’s worth noting that clustering can take different forms, and there are multiple algorithms to choose from (Mean-Shift, DBSCAN, etc.) depending on the nature of your dataset. The best-known is k-means clustering, which creates groups by randomly selecting central data points and then optimizing their positions through iteration.
It’s also important to know that you likely won’t apply clustering to every data science project. Instead, there are specific instances where it can save significant time and energy.
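To make the k-means idea concrete, here is a minimal sketch in plain Python. It assumes numeric data stored as tuples and uses squared Euclidean distance; the function name, parameters, and sample values are made up for illustration, and production libraries add smarter initialization and other refinements that are left out here.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means sketch: returns (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Start from k randomly chosen data points as the initial centers
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # an empty cluster keeps its old center
        if new_centroids == centroids:
            break  # nothing moved, so the grouping has settled
        centroids = new_centroids
    return centroids, labels

# Made-up (acidity, tannin) measurements, purely for illustration
samples = [(0.9, 0.8), (1.0, 0.9), (0.2, 0.9), (0.3, 1.0), (0.5, 0.2), (0.6, 0.1)]
centers, groups = kmeans(samples, k=3)
print(centers)
print(groups)
```

Because the starting centers are random, different seeds can produce different groupings, which is one reason it is common to run k-means several times and compare the results.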
Why Use Cluster Analysis?
As you may have guessed while reading the beginning of this post, the most significant benefit of clustering is the ability to take a large, seemingly unwieldy dataset and turn it into something that’s easier to use for machine learning. Here’s how.
When you’re dealing with a high volume of unstructured data, relying on a human to sort through it often doesn’t make sense. In these scenarios, manually organizing and categorizing data isn’t an efficient (or particularly effective) use of time given that real-world datasets can have hundreds of columns and perhaps millions of rows.
Clustering can help to drastically reduce the amount of time that’s spent on initial analysis by breaking large datasets into a form that’s easier to work with. By sorting through columns and finding common characteristics, clustering algorithms quickly organize data and help to identify meaningful patterns that are worth exploring further.
When Is Cluster Analysis Used?
Because clustering is largely a grouping and pattern-recognition exercise, it can help to address a wide range of business challenges. Here are some of the ways it’s most commonly used today.
Marketing teams can leverage clustering to develop customer segments based on common traits, which allows them to create tailored messaging and unique offers for groups with similar interests and behaviors.
Clustering can also be used for anomaly detection, which identifies items in a dataset that don’t share characteristics with other data points or adhere to general patterns. This can be helpful in a variety of ways, such as identifying fraudulent credit card transactions or determining whether a piece of machinery needs repairs.
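One simple way to spot such items, sketched below in plain Python, is to measure how far each point sits from its nearest cluster center and flag the ones that are unusually far away. The function name and the cutoff rule (a multiple of the median distance) are assumptions chosen for illustration, not a standard; real systems tune this threshold to the data.

```python
import math
import statistics

def flag_outliers(points, centroids, factor=3.0):
    """Return the indices of points that are unusually far from every cluster center."""
    # Distance from each point to its closest centroid
    nearest = [min(math.dist(p, c) for c in centroids) for p in points]
    # Assumed rule of thumb: "far" means more than `factor` times the median distance
    cutoff = factor * statistics.median(nearest)
    return [i for i, d in enumerate(nearest) if d > cutoff]

# Hypothetical cluster centers and observations, for illustration only
centers = [(0.0, 0.0), (10.0, 10.0)]
observations = [(0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0), (5.0, 5.0)]
print(flag_outliers(observations, centers))
```

The point halfway between the two centers belongs to neither group, which is exactly the kind of item this check is meant to surface.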
Recommender systems aim to make highly relevant suggestions to groups of users based on common traits. If you’ve ever binged a show based on Netflix’s recommendations or made a purchase based on Amazon’s, you’re already familiar with how this works.
How Clustering Works in RapidMiner Go
Now that we’ve established what cluster analysis is and when to use it, let’s further explore how it works. We’ve been talking a lot about beer recently, so to keep things interesting, let’s focus on wine.
Whether you consider yourself a connoisseur or just enjoy a glass from time to time, you likely know that wine has a long list of properties that contribute to its taste. In this section, we’ll use RapidMiner Go to run a basic cluster analysis. The goal is to find patterns in our dataset that can help us distinguish between different wine types.
The two most common challenges in unsupervised learning are:
- Finding clusters that separate well
- Understanding the main characteristics of those clusters
When it comes to reviewing the results of clustering algorithms, the first step is usually to plot the clusters. In the real world, datasets can have hundreds of columns, which makes choosing the columns that show the best separation quite challenging and time-consuming.
A potential solution for this problem is calculating and comparing the average occurrence of unique values in each cluster. When finding patterns in RapidMiner Go, you’re provided with a table of the most important driving factors for each cluster group, which allows you to pick the right columns and see if there are promising clusters.
In the case of our sample dataset, the “driving factors” table shows each column value’s positive or negative contribution to each group. In the case of categorical values, a green or red bar shows that the given value occurs more or less often than the global average.
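The idea of comparing each cluster against the global average can be sketched in a few lines of Python. This is not how RapidMiner Go computes its table (the source does not say); it is an assumed, simplified version for numeric columns that scores each column by how far the cluster mean sits from the global mean, measured in global standard deviations. The function and column names are hypothetical.

```python
import statistics
from collections import defaultdict

def driving_factors(rows, labels):
    """For each cluster, rank columns by how far the cluster mean deviates
    from the global mean, in units of the global standard deviation."""
    columns = list(rows[0].keys())
    global_mean = {c: statistics.fmean([r[c] for r in rows]) for c in columns}
    global_std = {c: statistics.pstdev([r[c] for r in rows]) for c in columns}

    # Collect the rows that belong to each cluster
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    factors = {}
    for label, members in grouped.items():
        scores = {}
        for c in columns:
            cluster_mean = statistics.fmean([r[c] for r in members])
            # Positive: above the global average; negative: below it
            if global_std[c]:
                scores[c] = (cluster_mean - global_mean[c]) / global_std[c]
            else:
                scores[c] = 0.0  # a constant column cannot distinguish clusters
        # Strongest factors (by absolute size) come first
        factors[label] = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return factors

# Made-up measurements and cluster labels, for illustration only
wines = [
    {"acidity": 1.0, "tannins": 5.0},
    {"acidity": 1.0, "tannins": 5.0},
    {"acidity": 3.0, "tannins": 5.0},
    {"acidity": 3.0, "tannins": 5.0},
]
print(driving_factors(wines, [0, 0, 1, 1]))
```

A positive score plays the role of a green bar and a negative score the role of a red one: the column value is above or below what you would expect from the dataset as a whole.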
Now, let’s analyze the results. As the “driving factors” table indicates, wines in Group 1 have high acidity and high tannins, which suggests that these are younger Syrah types. By contrast, Group 2 has close-to-average acidity with low tannins. These could be Pinot Noir, or maybe Zinfandel. Group 3 has low acidity and high tannins, which points to Cabernets.
If we select the two most defining columns (citric acid and total sulfur dioxide) as axis values, we can nicely plot the three distinct clusters (it’s worth noting that some overlap is to be expected).
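If you had to pick the plot axes yourself, one possible approach is to rank columns by how much of their variation is explained by cluster membership and take the top two. The sketch below is an assumption for illustration, not the method RapidMiner Go uses; the function name, column names, and values are all made up.

```python
import statistics
from collections import defaultdict

def rank_columns_by_separation(rows, labels):
    """Rank columns by the share of their variance explained by cluster membership."""
    columns = list(rows[0].keys())
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    scores = {}
    for c in columns:
        values = [r[c] for r in rows]
        overall = statistics.fmean(values)
        # Total sum of squares around the overall mean
        total = sum((v - overall) ** 2 for v in values)
        # Between-cluster sum of squares: cluster size times squared offset of its mean
        between = sum(
            len(members) * (statistics.fmean([r[c] for r in members]) - overall) ** 2
            for members in grouped.values()
        )
        scores[c] = between / total if total else 0.0
    # Columns whose values differ most between clusters come first
    return sorted(columns, key=lambda c: scores[c], reverse=True)

# Made-up measurements and cluster labels, for illustration only
wines = [
    {"ph": 3.0, "citric_acid": 1.0, "total_sulfur_dioxide": 10.0},
    {"ph": 4.0, "citric_acid": 1.0, "total_sulfur_dioxide": 12.0},
    {"ph": 4.0, "citric_acid": 3.0, "total_sulfur_dioxide": 20.0},
    {"ph": 3.0, "citric_acid": 3.0, "total_sulfur_dioxide": 22.0},
]
x_axis, y_axis = rank_columns_by_separation(wines, [0, 0, 1, 1])[:2]
print(x_axis, y_axis)
```

A score near 1 means the clusters barely overlap on that column, while a score near 0 means the column looks the same in every cluster and would make a poor axis.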
If we want to take this a step further, we can actually use this dataset to build a predictive model (with “Groups” as the target column) and use the Model Simulator to automatically predict future measurements.
With just a few clicks, we get nine models with varying performance. The Deep Learning model seems to be the most accurate (93%), so let’s use that in the simulator.
Cluster analysis is a technique that can help you save a significant amount of time and effort when working with large, unstructured datasets by finding the structures implicit in the data. RapidMiner Go helps you to take this a step further by quickly identifying the driving factors that positively or negatively contribute to each group, allowing you to explore the right parts of your dataset to see if there are promising clusters.
By finding common characteristics and transforming data into a more usable format, clustering algorithms identify patterns you may have otherwise missed and help get to insight faster.
If you’d like to apply what you’ve learned about cluster analysis in this post, try RapidMiner Go for free.