The Many Tools of Data Prep: Outliers
THIS IS THE FINAL ARTICLE IN OUR SERIES ON RAPIDMINER’S DEEP AND RICH DATA PREPARATION.
MAKE SURE TO READ PART 1, PART 2, PART 3 & PART 4 (AND THE PREQUEL ON JOINING DATA).
The last topic of discussion in our Data Prep series is outliers. A simple definition of an outlier is a person or thing situated away or detached from a main body or system. In Data Science, it refers to a data point or value that is far from most others in a set of data. This type of data point is commonly found in dimensions such as gender, age, or income and, when compared to the entire data set, appears as though it doesn’t fall within the “neighborhood” we assume it should be in.
Sometimes outliers are important and sometimes they’re flat out wrong. Understanding the difference between the two is really important.
Is it an Outlier or Incorrect Data?
I like outliers because they are interesting. Sometimes they tell us there might be a problem with a particular data point or that the data point is just flat out wrong. Other times that outlier could help us extract more meaning from our dataset and be the impetus for new discoveries. But how do you figure out which is which?
Well, let’s take a look at the example below. We have loaded in our customer data and switched to our Statistics tab to explore the data (as we learned in Part 1 of this series). If we drill down on the Age column and expand it, we see a small histogram. From this view, we can see that the minimum value in the Age column is 2 and the maximum is 234.
Right off the bat, we know that the Age value of 234 is wrong, and it stands out even more if we click on the Open Chart tab for a more granular view of our data.
You could argue that the Age of 234 fits the general definition of an outlier as it is detached from the main body of data, but hopefully common sense would prevail! When was the last time you met someone who was 234 years old? So thinking critically about your data is very important when dealing with outliers.
What about the entry with an Age of 2? Well, that’s probably a mistake IF and only IF our customers must have a minimum age to sign up (e.g., 16). Knowing that this is in fact the case in this example, we can conclude with confidence that both of these entries were mistakes.
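The same common-sense check can be written down as a simple range rule. Here is a minimal sketch in Python, not a RapidMiner process: the ages list is made up for illustration, the minimum of 16 comes from the example above, and the upper bound of 120 is my own assumption about a plausible human age.

```python
# Hypothetical customer ages; 2 and 234 mirror the values from the Statistics tab.
ages = [34, 2, 45, 29, 234, 51, 38, 16]

MIN_AGE = 16   # sign-up minimum used in the example above
MAX_AGE = 120  # assumed upper bound for a plausible human age

def split_by_valid_range(values, low, high):
    """Return (valid, invalid) lists, keeping the original order."""
    valid = [v for v in values if low <= v <= high]
    invalid = [v for v in values if not (low <= v <= high)]
    return valid, invalid

valid_ages, invalid_ages = split_by_valid_range(ages, MIN_AGE, MAX_AGE)
print("Entries to investigate:", invalid_ages)
```

A rule like this only catches values that break a known business constraint. It says nothing about values that are legal but unusual, which is where the outlier detection methods come in.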
There are a few methods to find outliers in your dataset. You can measure the distances between the points relative to each other. You can measure the data densities and compare how densely the points are grouped. You can use Local Outlier Factor (LOF). Or you can use Class Outlier Factor (COF).
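To make the distance idea concrete, here is a rough sketch in plain Python. It is not the code behind the RapidMiner operator; it simply scores each point by the distance to its k-th nearest neighbor and returns the n highest-scoring points. The function names, the customer values, and the choices of k and n are all hypothetical.

```python
import math

def kth_neighbor_distance(points, k):
    """For each point, return the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_n_outliers(points, k, n):
    """Return the indices of the n points with the largest k-th neighbor distance."""
    scores = kth_neighbor_distance(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# Hypothetical (age, income in thousands) pairs; the last one sits far from the rest.
customers = [(25, 40), (27, 42), (30, 45), (32, 47), (29, 44), (70, 200)]
outlier_indices = top_n_outliers(customers, k=2, n=1)
print("Most distant customer:", [customers[i] for i in outlier_indices])
```

On real data you would normally put the columns on a comparable scale first, otherwise a wide-ranging column such as income dominates the distance.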
RapidMiner has these four methods built into the platform. They’re located under the Cleansing operator directory.
Pro Tip: There’s a fantastic extension called Anomaly Detection available on our Product Extension Marketplace. I highly recommend it if you do a lot of work with Outliers.
I’m quite partial to using the Detect Outlier (LOF) operator (LOF stands for “Local Outlier Factor”) because of the resulting output. When you use the Detect Outlier (LOF) operator, it calculates an “Outlier Score” that helps you quickly find and rank the outliers in your dataset. When it finishes analyzing your dataset, it outputs that score as a number in an Outlier column.
The larger the number in the Outlier column, the more isolated that data point is relative to its neighbors in your dataset. From here you can quickly sort and inspect the data point, or you could filter it out of your analysis by using a Filter Examples operator.
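If you are curious what goes into a score like this, below is a simplified sketch of the textbook Local Outlier Factor calculation in plain Python, followed by a filter step in the spirit of Filter Examples. It is an illustration only, not RapidMiner's implementation: it assumes there are no duplicate points, it uses exactly k neighbors and ignores ties, and the ages, k = 3, and the threshold of 2.0 are values I made up.

```python
import math

def local_outlier_factors(points, k):
    """Return one LOF score per point (simplified: exactly k neighbors, no duplicates)."""
    n = len(points)
    # For every point, keep its k nearest neighbors as (distance, index) pairs.
    neighbors = []
    for i in range(n):
        dists = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbors.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbor.
    k_distance = [nbrs[-1][0] for nbrs in neighbors]

    # Local reachability density: inverse of the mean reachability distance,
    # where reachability distance to neighbor j is max(k-distance of j, actual distance).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbors divided by the point's own density.
    return [sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i]) for i in range(n)]

# Hypothetical ages, with one obviously suspicious entry at the end.
ages = [22, 25, 27, 30, 31, 33, 36, 40, 234]
scores = local_outlier_factors([(a,) for a in ages], k=3)

THRESHOLD = 2.0  # assumed cut-off; scores near 1 mean "about as dense as the neighbors"
flagged = [age for age, score in zip(ages, scores) if score > THRESHOLD]
kept = [age for age, score in zip(ages, scores) if score <= THRESHOLD]
print("Flagged:", flagged)
```

Scores close to 1 mean a point sits in a region about as dense as its neighbors' regions, while much larger scores mean it sits somewhere far sparser. There is no universal threshold, so sorting by score and inspecting the top rows is usually a safer first step than filtering blindly.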
Pro Tip: Just what is Local Outlier Factor? To learn more check out this great explanation of Local outlier factors.
Significance of Outliers
Identifying outliers is critical to ensuring the accuracy of your results. These unusual observations can have a disproportionate effect on your analysis, which can lead to misleading results. HOWEVER, outliers can also provide useful information about your data or process, which could lead to new insights, so it’s important to investigate outliers before deciding on the best course of action.
Where do you go from here?
Although this post concludes my Data Prep blog series, it is only a sample of the data science functionality that RapidMiner provides. Be sure to look for more blog post series in the New Year!
You should also take a moment to visit our Community. It’s a great place to ask questions and find solutions to your problems. Whether they’re related to RapidMiner Studio, RapidMiner Server, or RapidMiner Radoop, the Community is always my first stop.
If you are interested in some of the business applications RapidMiner has been used in, please visit our resource page! It’s filled with case studies and demos.
Want to share something cool that you did with RapidMiner? Share your comment below or on Facebook.