What is active learning in machine learning?
Active learning is a form of semi-supervised machine learning that facilitates faster algorithm training by proactively identifying high-value data points in unlabeled datasets.
Here’s why that’s important: As we’ve discussed elsewhere, machine learning algorithms are able to train themselves to recognize patterns and make predictions by first analyzing large volumes of sample data points that have been labeled with the “answer” the algorithm is trying to guess.
Image recognition programs are a good example of this. If you want to develop a program that can distinguish between cars and trucks, you need to train it on a dataset filled with pictures of vehicles labeled “car” or “truck.” The same is true for fraud detection programs in finance. They need to train on a dataset filled with transactions that have been labeled “normal” and “fraudulent” so they can understand the differences between the two.
So having some amount of training data is a prerequisite for machine learning, but what happens if you don’t have enough of it? Labeling a dataset is, after all, a time-intensive process that may require extensive human labor before you can even begin training.
That’s where active learning comes into play! With active learning, a machine learning algorithm can scan unlabeled training data and identify only the most useful data points for analysis. The program can then actively query either a labeled dataset (like one that has all the “car or truck” answers) or a human annotator (who simply tells the computer if the image in question shows a car or truck). And since the algorithm is only selecting the data points it needs for training purposes, the total number of data points required for analysis can often be much lower than in normal supervised learning.
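The loop described above can be sketched in a few lines of Python. Everything here is illustrative rather than a real implementation: the `oracle` function stands in for the human annotator, the “model” is just a one-dimensional threshold between two classes, and uncertainty sampling (querying the point the model is least sure about) is one common way of deciding which data points are most useful.

```python
import random

random.seed(0)

# Hypothetical annotator: the true boundary between the classes is 0.5
# (think "car" below, "truck" above, reduced to a single feature).
def oracle(x):
    return 1 if x > 0.5 else 0

unlabeled = [random.random() for _ in range(200)]
labeled = [(0.05, 0), (0.95, 1)]   # a tiny seed set of labeled points

def fit_threshold(pairs):
    """Train the 'model': the midpoint between the two class means."""
    zeros = [x for x, y in pairs if y == 0]
    ones = [x for x, y in pairs if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

# The active learning loop: retrain, pick the most informative point
# (the one closest to the current decision boundary), ask for its label.
for _ in range(10):                # budget of 10 label queries
    t = fit_threshold(labeled)
    query = min(unlabeled, key=lambda x: abs(x - t))
    unlabeled.remove(query)
    labeled.append((query, oracle(query)))   # "ask the annotator"

final_t = fit_threshold(labeled)
print(f"learned boundary after {len(labeled)} labels: {final_t:.3f}")
```

Note that the model ends up with only 12 labeled points out of 200 candidates, which is the whole appeal: most of the dataset never needs a label at all.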
When is active learning useful?
Active learning has proven especially useful for natural language processing, as building NLP models requires training datasets that have been tagged to indicate parts of speech, named entities, etc. Getting datasets that both feature this tagging and contain enough unique data points can be a challenge.
Active learning has also been useful for medical imaging and other cases where there’s a limited amount of data that a human annotator can label as needed to help the algorithm. Although it can sometimes be a slow process, since the model needs to constantly readjust and retrain based on incremental labeling updates, it can still save time compared with traditional data-collection methods.
How is active learning used?
Active learning can be implemented through three main approaches:
- A stream-based selective sampling approach, in which incoming data points are assessed one by one, and every time the algorithm identifies a sufficiently beneficial data point, it requests a label for it. This technique can require considerable human labor.
- A pool-based sampling approach, in which the entire dataset (or some fraction of it) is evaluated first so the algorithm can see which data points will be most beneficial for model development. This approach is more efficient than stream-based selective sampling but does require a lot of computational power and memory.
- A membership query synthesis approach, where the algorithm essentially generates its own hypothetical data points. This approach only applies to limited scenarios where generating accurate data points is plausible.
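To make the first approach concrete, here’s a minimal sketch of stream-based selective sampling, with the same illustrative assumptions as before: a hypothetical `oracle` stands in for the annotator, the model is a one-dimensional threshold, and a point only triggers a label request when it falls within an uncertainty margin of the current boundary.

```python
import random

random.seed(1)

# Hypothetical annotator: the true boundary between the classes is 0.5.
def oracle(x):
    return 1 if x > 0.5 else 0

threshold = 0.3   # deliberately poor starting boundary
margin = 0.1      # how close to the boundary counts as "uncertain"
queries = 0

# Stream-based selective sampling: each point arrives once, and the
# model only requests a label when it is uncertain about that point.
for _ in range(500):
    x = random.random()
    if abs(x - threshold) < margin:   # uncertain -> ask for a label
        y = oracle(x)
        queries += 1
        if y == 0 and x > threshold:
            threshold += 0.05         # boundary was too low; nudge it up
        elif y == 1 and x < threshold:
            threshold -= 0.05         # boundary was too high; nudge it down

print(f"labels requested: {queries} of 500, learned boundary: {threshold:.2f}")
```

The key contrast with pool-based sampling shows up in the loop structure: the stream version sees each point exactly once and decides on the spot whether it’s worth a label, whereas a pool-based version would score the whole candidate set before choosing.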
Active learning is one of the most exciting topics in data science today. If you’re interested in data science and not sure how to get started, check out RapidMiner Studio, which lets you play around with building your own machine learning models, and takes you from data to insights in just a few minutes.