What is anomaly detection?
In data science, anomaly detection is the process of identifying events and observations that deviate from a dataset’s usual pattern. In an enterprise setting, datasets form patterns that represent business activities; an unexpected change within these patterns, or an event that does not conform to the expected pattern, is considered an anomaly.
For example, at a large e-commerce website like Amazon, there is nothing unusual about seeing a large revenue spike in a single day—if that day is a major shopping holiday like Cyber Monday or Prime Day.
It might be considered an anomaly if Amazon did not have high sales volumes on Cyber Monday for a specific year, especially if sales volumes for previous years were exceedingly high.
It’s worth noting that anomalies are not inherently positive or negative, so it’s more useful to treat outliers as deviations from the expected values for a set of metrics before investigating the reasons behind them.
Successful anomaly detection analyzes time series data in real time. A time series is a set of data points collected in sequence at regular intervals. By detecting irregularities within time series data, anomaly detection can help you make more timely decisions.
Three main types of anomaly
To gain valuable business insights through anomaly detection, you first need to understand the ways it can be applied.
Generally speaking, anomalies fall into three main categories: point anomalies, contextual anomalies, and collective anomalies.
1. Point Anomaly
A point anomaly is defined as a single data point that is unusual compared to the rest of the data. A single balmy day in an otherwise chilly winter would be a good example of this.
On that day, the weather is considered anomalous because the temperature is extreme compared to the rest of the season. Point anomalies often occur in this way, as a singular extreme value on a single attribute of the data.
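A common way to flag this kind of singular extreme value is a z-score check: measure how many standard deviations each point sits from the dataset’s mean. The sketch below uses hypothetical winter temperature readings (the numbers are invented for illustration), with one balmy 15 °C day standing out.

```python
# Hypothetical daily winter temperatures (in °C); 15.0 is the balmy outlier.
temps = [2.1, 0.5, -1.3, 1.8, 0.9, 15.0, -0.4, 1.2, 0.0, 2.5]

def point_anomalies(values, threshold=2.0):
    """Return values whose z-score (distance from the mean in
    standard deviations) exceeds the given threshold."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

print(point_anomalies(temps))  # only the 15.0 reading is flagged
```

The threshold of 2 standard deviations is a common rule of thumb, not a fixed standard; a stricter threshold flags fewer points.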
2. Contextual Anomaly
Also called conditional outliers, contextual anomalies contain data points that significantly deviate from the other data points that exist in the same context. An anomaly in the context of one dataset may not be an anomaly in another.
For instance, one of your customers may double their usual spending in mid-December for the holiday season. These outliers are common in time series data because those datasets are records of specific quantities for given periods.
3. Collective Anomaly
A collective anomaly is a collection of similar data points that can be considered abnormal together when compared to the rest of the data.
For example, a consecutive 10-day period of hot temperatures could be considered a collective anomaly.
These temperatures are unusual because they occur together and are likely caused by the same underlying weather event.
It’s worth mentioning that the data points in a collective anomaly may also be point anomalies individually, but that’s not always the case. In a heatwave, a single hot day may be completely normal for the season, but several such days occurring consecutively are enough for the period to be considered an anomaly.
Collective anomalies are particularly important in time-series analysis, where underlying events can cause several data points to appear anomalous at the same time.
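One simple way to capture this is to look for runs of consecutive values above a limit, where a lone exceedance is tolerated but a sustained streak is flagged. The sketch below uses invented daily high temperatures; the limit and minimum run length are illustrative choices, not standard values.

```python
def collective_anomalies(values, limit, min_run=5):
    """Return (start, end) index ranges of runs of at least min_run
    consecutive values above the limit; shorter runs are ignored."""
    runs, start = [], None
    for i, v in enumerate(values):
        if v > limit:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(values) - start >= min_run:
        runs.append((start, len(values) - 1))
    return runs

# Hypothetical daily highs (°C): one isolated hot day, then a 6-day heatwave.
highs = [28, 33, 27, 26, 34, 35, 36, 34, 35, 33, 27, 28]
print(collective_anomalies(highs, limit=30))  # [(4, 9)]: only the streak
```

Note that the single hot day at index 1 is not flagged: individually it is unremarkable, which is exactly the point of a collective anomaly.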
Anomaly detection techniques
Before machine learning, anomaly detection was a manual effort: systems were built by hand by people who had deep business domain expertise.
The major challenge with this approach is that it doesn’t scale, especially when considering the amount of data that modern organizations manage.
The good news is that machine learning has allowed companies to automate anomaly detection, which in turn lets them analyze massive quantities of real-time data. This has opened the door for data science to help companies make impactful decisions in a timely fashion.
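To make the automation concrete, here is a minimal sketch of a streaming detector that maintains a rolling baseline and flags readings that fall far outside it. This is a toy rolling z-score model, not a production ML system, and the readings are invented; a real deployment would typically use a trained model such as an isolation forest.

```python
from collections import deque

class StreamingDetector:
    """Toy real-time detector: flag a new reading that sits far
    from the rolling window of recent values."""

    def __init__(self, window=20, threshold=3.0):
        self.history = deque(maxlen=window)  # recent readings only
        self.threshold = threshold

    def observe(self, value):
        anomalous = False
        if len(self.history) >= 5:  # wait until a small baseline exists
            mean = sum(self.history) / len(self.history)
            var = sum((v - mean) ** 2 for v in self.history) / len(self.history)
            std = var ** 0.5 or 1e-9  # guard against a flat window
            anomalous = abs(value - mean) / std > self.threshold
        self.history.append(value)
        return anomalous

detector = StreamingDetector()
readings = [10, 11, 10, 9, 10, 11, 10, 50, 10]
flags = [detector.observe(r) for r in readings]
print(flags)  # only the 50 reading is flagged
```

Because the detector only keeps a bounded window of history, it can process an unbounded stream, which is what makes this approach scale where manual review cannot.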
Why do enterprises need anomaly detection?
Anomaly detection is commonly used in business settings, and it’s not hard to see why. The signals in time series data can have serious implications in an enterprise setting—just picture a world in which your bank did not alert you about fraudulent credit card transactions.
Here are a few examples of how anomaly detection is commonly used:
Predictive maintenance helps manufacturers monitor the condition of their equipment and accurately estimate when it will require maintenance or replacement. Models monitor real-time data that’s being produced by sensors that are installed on the equipment and can alert production teams when that data is anomalous. Without anomaly detection, key signals can go ignored, which in some cases can lead to costly equipment failure and production downtime.
For financial institutions, few things are more important than protecting customers from fraud. Today, this is largely done using automated anomaly detection—credit card issuers can compare the location, category, and amount of a specific transaction against a customer’s past spending to either approve or decline that transaction.
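A heavily simplified sketch of that comparison appears below. The transaction history, categories, and the decline rule (several times the customer’s largest past purchase in the same category) are all invented for illustration; real issuers use far richer models.

```python
# Hypothetical past transactions for one cardholder: (category, amount).
history = [("grocery", 82.0), ("grocery", 95.5), ("fuel", 40.0),
           ("grocery", 77.3), ("fuel", 45.0)]

def review(category, amount, multiplier=3.0):
    """Decline when the amount is several times the customer's largest
    past purchase in that category; unseen categories get flagged."""
    past = [a for c, a in history if c == category]
    if not past:
        return "flag"  # no baseline for this category yet
    return "decline" if amount > multiplier * max(past) else "approve"

print(review("grocery", 90.0))      # approve: in line with past spending
print(review("grocery", 900.0))     # decline: far beyond the baseline
print(review("electronics", 60.0))  # flag: no history in this category
```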
Anomaly detection is commonly used in medical imaging analysis, which can accurately detect the occurrences of certain diseases in real time.
Anomaly detection can also help make predictions for a specific patient and identify patterns that are highly unusual compared to patients with the same or similar condition.
As you can see, anomaly detection is best applied in situations that require timely action: while batch processing is appropriate for predicting the results of a marketing campaign, the scenarios above call for more urgency.
Key challenges of anomaly detection
Challenges in anomaly detection can include extracting useful features appropriately, defining what is considered “normal,” and dealing with the situations where there are significantly more normal values than anomalies.
These difficulties can be compounded by both high dimensionality and the sheer volume of data involved.
High dimensionality creates challenges for anomaly detection. When the quantity of features or attributes within a dataset increases, the amount of information required to generate valuable business insights increases with it.
This brings about information sparsity: as the number of metrics grows, the volume of data needed to cover them grows with it, leaving the data sparser and introducing more missing values.
These types of datasets are challenging to analyze due to the sheer volume they contain, and the extensive effort required to clean them for further analysis.
Using RapidMiner for anomaly detection
Interested in using RapidMiner for anomaly detection? We offer a user-friendly interface with a step-by-step approach for each process, as well as a rich operator library with expansive time series analysis capabilities for anomaly detection.