01 December 2016

Blog

Data Quality: The Many Tools of Data Prep [Part 2]

Data Quality: The Many Tools of Data Prep

Data quality refers to the right type of data being in the right place. Sometimes you have missing values or wrong values in your dataset. So what happens when you have the wrong type of data? What are some of the ways you can make your dataset more robust?

RapidMiner Studio has many different operators to help you increase your data quality but for this post, we’ll focus on replacing missing values and replacing values.

To explore this, let’s go back to our original dataset we talked about in the first post of this data prep series.

data-quality-1

When we first loaded the Labor Negotiations dataset and viewed the Statistics tab, we saw that there were many missing values. For example, in the duration column we have only 1 missing value and the pension column has 22 missing values.

What is the reason for all these missing values? That’s a difficult question to answer because each dataset is different.For example, logging how many times someone visited your website is pretty easy and the data is consistent – a hit is a hit, but data collected from an anonymously completed survey might be more challenging if respondents don’t answer every question.Other times it may just be poor data entry or the data wasn’t available. In some cases, it could be a poor join or appending of database records.

Whatever the reason is, it’s always a good idea to ask the originator of the data if they could give you a better dataset or find out what happened! Always ask first, this could save you many gray hairs later.

If you’ve asked and there’s nothing you can do to fill in the missing values, then it’s up to you as the Data Engineer or Data Scientist to make some assumptions and decisions. If the number of missing values is relatively low compared to the overall dataset, you could possibly filter them out or just delete them.

If that’s not possible, then you must devise a replacement strategy. Since our example dataset is related to Human Resources, you might want to devise a strategy to replace any missing values so that your data makes sense for Human Resources.

For example, if there are a lot of missing age values, perhaps you could replace those missing values with an average age of your workforce. You might want to replace the Pension missing values with the mode of values for that column, etc.

How you attack missing values will always be specific to your industry and use case. The short answer is that you must spend time thinking about how you want to replace your missing values if your dataset is of poor quality.

Missing Values

RapidMiner lets you handle missing values with two main operators: Replace Missing Values & Impute Missing Values.

Let’s look at the Replace Missing Values operator first. This extremely powerful operator lets you replace a single column, multiple columns, or specific data types. There are ways to replace missing values based on matching regular expressions too!

mv-process

In the above process we use two Replace Missing Values operators and set them to the Value Type parameters. This gives us complete flexibility on how to handle missing numerical values (here we replace them with zeros) and how to handle missing nominal values (we replace them with MISSING).

mv-parameteers2

This parameter window is quite powerful and lets you select your data based on columns (attributes), value types, or even with regular expressions. How you replace the missing values is up to you, you can provide an entry like “MISSING” or “9999” and even automatically calculate an average value or just use zero. Very flexible and powerful indeed!

When you run the process you should see the following results:

dq-replace-missing

Success – and from using only two operators!

BUT, what if you need or want to use machine learning to make a better statistical “guess” for your missing values. To do that we’ll need the Impute Missing Values operator.

Imputing Missing Values

Imputation of missing values is a method that uses an algorithm to pick the replacement values. It’ll scan your raw data and make a statistical guess on what value it should replace based on how similar this particular row is.  Usually we use a K Nearest Neighbor algorithm for that because it works really well for both numerical and nominal data values. Just search for Impute Missing and the operator will pop up in the Operator List.

impute-process

Note: The Impute Missing Values operator is a subprocess, so you have to double click on it and then drag and drop a K-NN operator inside.

impute-knn

You execute the process and you should see all your missing values replaced, whether they are numeric or nominal!

impute-results

Great! The data set has all its missing values replaced, but what happens when there are wrong values in your columns and rows instead of missing values. How do you handle that? To fix wrong values, you’ll need the Replace operator.

Replace Operator

Through our data exploration of the dataset, we can see that all other values are of many different types. Some are called Integers, Reals and Nominals. RapidMiner scans the attributes and automatically sets them to a data type IF all of the entries are the same. However, if a data entry error occurred or someone accidently recorded a strong value like “male” or “female” in the duration column, then RapidMiner would assign a “polynomial” value to the column as a result of that errant string value amongst all the integer real numbers.

Note: A string value is called a “nominal” value in RapidMiner and many different string values are called “polynomials.”

RapidMiner Studio offers a host of operators to address modifying nominal values. We can Replace, Cut, Remap and Split nominal values. You can find them under the Blending > Values operator folder.

From personal experience, I typically use the Map, Replace, and Split operators quite a bit in my day to day work. For this discussion, we just want to modify some nominal values and replace them with something else so I will focus on the Replace Operator.

dq-nominals

The Replace operator lets you replace nominal values (strings) in your dataset. Let’s try it with the raw Labor Negotiations dataset again. Our task is to replace the word “generous” in the Vacation column with “awesome.”

Note: We haven’t done any missing value replacements in this example.

Our process is very simple, just the Labor Negotiations dataset and the Replace operator.

dq-replace-process

When you click on the Replace operator, you should see the available parameters that you can adjust for the operator (see screenshot below).

You can select a single column or subset of columns by selecting the “Attribute Filter” parameter (you can even select specific value types just like the Replace Missing Values operator).

The “Replace What” parameter is the word that you want to replace. Here you can also write a regular expression to match something very specific too. The “Replace By” parameter is the nominal value replacement.

Pro Tip: If you click on the little magnifying glass and document symbol next to the “Replace What” parameter, a regular expression editor will pop up. There you can write your regular expressions and test them to see if you get the match you want.

dq-replace-awesome

dq-replace-results

Most of the time, the replacements are very simple like “male” for “m” or in our example “generous” for “awesome.”  There are many more nominal replacement operators you can use but this is one of my favorites and I hope it will become yours too!

So today in the data prep Series, I’ve showed you a few examples of how you can improve your data quality by replacing missing values.  Check back next week as I continue to discuss RapidMiner’s Data Prep functionality with Data Type Conversions.

This article is a part of our Many Tools of Data Preparation Series. If you haven’t done so already, make sure to read the following:

Download RapidMiner Studio, which offers all of the capabilities to support the full data science lifecycle for the enterprise.

Related Resources