Machine learning can sometimes seem confusing, with algorithm names and model types seemingly proliferating without end. But we know for a fact that anyone can understand and employ machine learning, no matter their skill level.
With recent advancements in AI and ML, it’s easier than ever for beginners to start leveraging machine learning as a powerful tool to drive business impact. That’s why we wanted to take a step back and draw up some explainers about the core concepts in machine learning for newcomers.
In that spirit, we’ll be looking at two of the most common categories of machine learning in this post: supervised and unsupervised machine learning.
Supervised learning vs. unsupervised learning
The key difference between supervised and unsupervised learning is whether or not you tell your model what you want it to predict.
In supervised learning, the data you use to train your model contains historical data points along with the known outcomes of those data points. In unsupervised learning there is no known outcome, and it’s the model’s job to figure out what patterns exist in the data on its own.
While both types of machine learning are vital to predictive analytics, they are useful in different situations and for different datasets.
Supervised machine learning
In order to train a supervised model, we first need a historical dataset that’s labeled with the outcomes of the data. This data maps the inputs that the model will have access to during production to the known outputs: what the model should predict, given those inputs.
For example, in a model to predict churn, the data would be various historical facts about customers (the inputs at production), paired with whether they churned or not (the outcome we expect the model to predict).
The dataset is broken into two parts: the training set and the test set.
The training set is used, as the name implies, to train the model to map certain patterns in the data to the historical outcomes. Once the model is created, the test set is used to verify the accuracy of the model by comparing the model’s predictions to the known outputs.
You can imagine this scenario as being something like a textbook with an answer key. After studying, you can try to do the exercises in the textbook, and then compare those answers to the answer key to see how you did.
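To make the split concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the synthetic customer data, the single `logins` input, the 80/20 split, and the very simple "threshold" model. A real churn model would use many more inputs and a proper learning algorithm.

```python
import random

def make_customers(n, seed=0):
    """Generate synthetic (monthly_logins, churned) pairs; purely illustrative."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        logins = rng.randint(0, 30)
        # Assumed pattern: customers with few logins churn, with about 10% label noise.
        churned = (logins < 10) != (rng.random() < 0.1)
        rows.append((logins, churned))
    return rows

def accuracy(threshold, rows):
    """Share of rows where the prediction (logins < threshold) matches the known outcome."""
    hits = sum((logins < threshold) == churned for logins, churned in rows)
    return hits / len(rows)

def train_threshold(train_rows):
    """'Training': pick the login threshold that fits the training set best."""
    return max(range(0, 32), key=lambda t: accuracy(t, train_rows))

data = make_customers(200)
random.Random(1).shuffle(data)

# Hold out 20% of the labeled data as the test set (the "answer key").
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]

threshold = train_threshold(train_set)
print("learned threshold:", threshold)
print("accuracy on training set:", round(accuracy(threshold, train_set), 2))
print("accuracy on test set:", round(accuracy(threshold, test_set), 2))
```

The model only ever sees `train_set` while it is being fitted, so the test-set accuracy is an estimate of how it might do on customers it has not seen.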
Typically in data science, a model trained through supervised learning is considered successful if it can make predictions that match the known outcomes at an acceptable level of accuracy.
However, in the business world, it is better to consider value and return on investment, rather than model accuracy alone, when deciding whether a model is successful.
The importance of supervised learning
The goal of a predictive model is not simply to understand the patterns in its training data, but to apply what it has learned to novel input data that it hasn’t seen before, allowing it to make predictions on datapoints where the outcome isn’t known. It’s this ability that makes a predictive model valuable in real-world scenarios.
A common problem during the model training process is overfitting. Overfitting is when a model is too closely matched to the training data. It then reproduces the outcomes it was trained on with high accuracy, but becomes less accurate on data it hasn’t seen before: first the test set, and ultimately the real-world data you actually want your model to predict on.
For example, if you’re training a model to distinguish between dogs and cats, but only include Great Danes and Rottweilers as examples of dogs, you can easily tune your model to correctly distinguish the two based solely on size. However, when this model is exposed to the real world, it will likely classify Chihuahuas and Corgis as cats.
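The dog and cat scenario can be sketched in a few lines of Python. The weights below are made-up numbers, and the "model" is just a weight cutoff, which is an assumption chosen to keep the example small.

```python
# Hypothetical training data: (weight_kg, label). Only large dog breeds are included.
training = [
    (60.0, "dog"),  # Great Dane
    (50.0, "dog"),  # Rottweiler
    (55.0, "dog"),  # Great Dane
    (4.0, "cat"),
    (5.0, "cat"),
    (3.5, "cat"),
]

def fit_weight_threshold(rows):
    """Place the cutoff halfway between the heaviest cat and the lightest dog."""
    heaviest_cat = max(w for w, label in rows if label == "cat")
    lightest_dog = min(w for w, label in rows if label == "dog")
    return (heaviest_cat + lightest_dog) / 2

def predict(weight_kg, threshold):
    """Anything at or above the cutoff is called a dog."""
    return "dog" if weight_kg >= threshold else "cat"

threshold = fit_weight_threshold(training)

# The model is perfect on the examples it was trained on...
train_hits = sum(predict(w, threshold) == label for w, label in training)
print(f"{train_hits} of {len(training)} training examples correct")

# ...but small breeds it never saw fall on the wrong side of the cutoff.
print("Chihuahua (2.5 kg):", predict(2.5, threshold))
print("Corgi (12 kg):", predict(12.0, threshold))
```

Adding small breeds to the training data would show that weight alone cannot separate the two classes, which is the signal that the model needs other inputs.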
Overfitting can also occur if the training data contains errors in the output values, which the model may learn as if they were real patterns, skewing its future predictions. This highlights the importance of data preparation and validation as a key step in the model-building process.
A common way to reduce overfitting is to use a simpler, less specialized model that can accommodate a wider variety of data points. And, of course, you should verify the integrity of your training data before model training.
Uses of supervised learning
Uses of supervised machine learning tend to fall into one of two categories: classification and regression.
In the case of classification, the model will predict which groups your data falls into—for example, loyal customers versus those likely to churn. For regression, the model will predict a number—for example, predicting how long a mechanical part in a factory will last before needing to be replaced.
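The difference is easiest to see in what each kind of model returns. The sketch below uses invented numbers and two deliberately simple techniques (a least-squares line for regression, and a nearest-class-average rule for classification); both are stand-ins we chose for illustration, not the only options.

```python
from statistics import mean

# Regression: predict a number.
# Hypothetical data: (machine load in %, hours until the part was replaced).
parts = [(10, 920), (20, 840), (30, 760), (40, 680), (50, 600)]

def fit_line(rows):
    """Ordinary least squares for a single input: y = slope * x + intercept."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x, _ in rows))
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(parts)
predicted_hours = slope * 60 + intercept  # a number

# Classification: predict a group.
# Hypothetical data: (monthly logins, known outcome).
customers = [(20, "loyal"), (25, "loyal"), (30, "loyal"),
             (2, "churn"), (5, "churn"), (8, "churn")]

def fit_class_means(rows):
    """Average the input value for each known class."""
    labels = {label for _, label in rows}
    return {label: mean(x for x, l in rows if l == label) for label in labels}

def classify(x, class_means):
    """Assign the class whose average is closest to x."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))

class_means = fit_class_means(customers)
predicted_group = classify(12, class_means)  # a category

print("regression output:", predicted_hours)
print("classification output:", predicted_group)
```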
Example of supervised learning
A good example of supervised learning is a classification decision tree. Decision trees are easy to use and visualize. As their name suggests, they use multiple conditional statements to arrive at a final decision. Decision trees are often selected because they are very easy to understand and explain—a key component of implementing machine learning in a business environment.
Decision trees use a recursive top-down strategy. An initial attribute (or column in a spreadsheet) is selected from the dataset to be the top of the tree, splitting the data into two categories. In this example, courtesy of our founder Ingo Mierswa, we can factor in different attributes of a dog and make a prediction classifying it as either adopted or not adopted.
We start with the cuteness factor and split the data between whether the cuteness is high or low. If the cuteness is high the dog is always adopted, meaning we have a pure category and the branch ends here.
If the cuteness is low, the size of the dog becomes the deciding factor, making size the next attribute to split on. From here, we see that dogs with low cuteness and large size are never adopted, while dogs with low cuteness and small size are, giving us a complete decision tree.
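Because a finished decision tree is just a set of nested conditions, the tree described above can be written out by hand. This is a sketch of the resulting tree only; the function name and the string values are our own, and it does not show how a learning algorithm would choose the splits from data.

```python
def predict_adoption(cuteness, size):
    """Walk the tree from the top: cuteness first, then size."""
    if cuteness == "high":
        # Pure branch: every highly cute dog in the example was adopted.
        return "adopted"
    # Low cuteness: size becomes the deciding attribute.
    if size == "small":
        return "adopted"
    return "not adopted"

for cuteness in ("high", "low"):
    for size in ("small", "large"):
        print(cuteness, size, "->", predict_adoption(cuteness, size))
```

Reading the function top to bottom gives the same explanation you would give a colleague, which is a large part of why decision trees are popular in business settings.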
Unsupervised machine learning
Unlike supervised learning, unsupervised learning uses data that doesn’t contain ‘right answers’. Instead, these models are built to discern structure in the data on their own—for example, figuring out how different data points might be grouped together into categories.
Importance of unsupervised learning
These features make unsupervised machine learning especially useful for transactional data, such as sorting potential customers into categories based on shared attributes for more efficient marketing, or identifying the qualities that separate one group of customers from another.
Uses of unsupervised learning
There are several common types of unsupervised machine learning models:
- Clustering (or grouping) attempts to find items in your data that are similar to each other, for example, identifying customers who are similar to one another.
- Topic detection is a subset of clustering, which identifies the topic of a written text.
- Anomaly detection (sometimes called outlier detection) finds items in your data that are different from the rest of the data.
- Frequent itemset mining determines if a person is likely to buy X, given that they purchased Y.
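As a small illustration of the last item, the "likely to buy X, given that they purchased Y" idea is usually expressed as a confidence value: of the baskets that contain Y, what share also contain X? The shopping baskets below are invented, and the helper names are our own.

```python
# Hypothetical purchase history: one set of items per shopping basket.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "milk"},
    {"milk", "jam"},
    {"bread", "butter", "milk"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets)

def confidence(given, then, baskets):
    """Share of baskets containing `given` that also contain `then`.

    Assumes `given` appears in at least one basket.
    """
    return count(given | then, baskets) / count(given, baskets)

print("P(butter | bread):", confidence({"bread"}, {"butter"}, baskets))
print("P(jam | milk):", round(confidence({"milk"}, {"jam"}, baskets), 2))
```

Real itemset mining algorithms search over many item combinations efficiently; this sketch only scores one pair at a time.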
Example of unsupervised learning
k-means clustering is one of the easier unsupervised machine learning algorithms to understand. The algorithm starts with k centers, called centroids, and assigns each datapoint to its nearest centroid. It then moves each centroid to the average of the datapoints assigned to it and repeats the assignment, until the groupings stop changing and each datapoint belongs to one of k clusters.
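Here is a minimal sketch of that loop in plain Python, assuming two-dimensional points and hand-picked starting centroids so the result is reproducible. Library implementations typically choose the starting centroids randomly or with a smarter seeding strategy, and the sample points are made up.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Alternate between assigning points and moving centroids until stable."""
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # nothing moved, so the assignment will not change either
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
start = [(1.0, 1.0), (8.0, 8.0)]  # k = 2, chosen by hand for this example

centroids, clusters = kmeans(points, start)
for center, members in zip(centroids, clusters):
    print("centroid", center, "has", len(members), "points")
```

Note that the algorithm never sees a label. It only groups points that sit close together, and it is up to us to decide what each group means.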
We can see this in action here, using our iris training dataset (available in RapidMiner Studio). The data contains measurements on sepal length, sepal width, petal length, and petal width of different iris flowers.
Although we know there are three different species of irises in our dataset, we can ignore those labels and see if an unsupervised model can accurately identify the species of the various flowers in the dataset, based on these measures.
Our algorithm chooses three centroids around which it can create clusters. The results put 50 flowers in our first cluster, 39 in the next, and 61 in the last. Our scatter plot shows the data clustered distinctly, allowing us to label each flower with its most likely species.
Semi-supervised machine learning
Not every use case falls neatly into the category of supervised or unsupervised learning. Semi-supervised machine learning methods are sometimes used, particularly when only some of the datapoints have labels, or output data.
For example, you could use unsupervised learning to sort a collection of emails into two groups, spam and not spam. From there, you could analyze the word frequencies of each of your two groups, and then use that information in a supervised technique to classify incoming emails as spam or not spam.
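The second half of that workflow might look something like the sketch below. It assumes the unsupervised step has already produced two groups and that someone has named them; the example emails are invented, and the scoring rule (word frequencies with add-one smoothing, in the style of naive Bayes) is one possible supervised technique we picked for illustration.

```python
import math
from collections import Counter

# Assumption: an earlier clustering step split the emails into two groups,
# and a person looked at them and named the groups "spam" and "not spam".
groups = {
    "spam": ["win free prize now", "free money click now", "claim your free prize"],
    "not spam": ["meeting moved to monday", "please review the report", "lunch on monday"],
}

# Word frequencies per group.
word_counts = {name: Counter(" ".join(emails).split()) for name, emails in groups.items()}
vocabulary = set().union(*word_counts.values())

def score(email, name):
    """Log-likelihood of the email's words under one group's word frequencies."""
    counts = word_counts[name]
    total = sum(counts.values())
    # Add-one smoothing so a word unseen in a group does not zero out the score.
    return sum(math.log((counts[word] + 1) / (total + len(vocabulary)))
               for word in email.split())

def classify(email):
    """Pick the group whose word frequencies explain the email best."""
    return max(word_counts, key=lambda name: score(email, name))

print(classify("free prize inside"))
print(classify("report for monday meeting"))
```

The labels here came from the clustering step rather than from hand-labeling every email, which is what makes the overall approach semi-supervised.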
So, with all the differences and similarities between supervised and unsupervised machine learning, you may be wondering which is better.
While each method has its strengths in specific circumstances, our Head of Data Science Services, Martin Schmitz, is firmly in camp supervised.
As he writes in A Human’s Guide to Machine Learning, “If you can go supervised, go supervised.”
This is because, without known outcomes to compare against, it is difficult to measure which clustering is better in an unsupervised problem. Making an unsupervised problem into a supervised one can often be the key to developing the best-performing model, even if it requires more work to add labels to the initial data values.
Supervised and unsupervised learning methods are powerful tools for data scientists and have more uses and examples than we could possibly explain in a single article. But having a clear understanding of both is the first step in figuring out what’s best for you.
If you’d like to see how your business can benefit from the power of machine learning, request a free AI assessment and we’ll walk you through potential use cases and explore the impact they can have on your business.