The Many Tools of Data Prep, part 1: Data Exploration
Following our well-received post on “Different Ways to Join Data,” we kick off this new series of posts about RapidMiner’s deep and rich data preparation tools.
Data Science Starts with the Data
Without a solid understanding of the data you are looking at, the entire endeavor of building an awesome model becomes a moot point. The reality of data science and data munging (wrangling, mashing, etc.) is that you’ll spend 70 to 80% of your time building your training set and then the remaining time building the models. Domain expertise is definitely important here, but an understanding of what you’re trying to accomplish is equally important. You must spend a good amount of time on data exploration; you must think about the problem you’re trying to solve, bring the right data together and then inspect it. If you just “slap” data together and call it a training set, you might find that your models will be sorely lacking in robustness. Above all, we want awesome data to build awesome models.
For example, if you extract sales data from Salesforce and enrich it with customer demographic data from your relational database (RDBMS), you will likely have missing data and erroneous entries in the various cells. While some organizations work hard to store clean datasets, pristine datasets are extremely rare.
While cleaning your dataset you might find yourself asking the following questions: If you have missing values in that dataset, do you remove them or do you seek to replace them? If you replace them, do you use an average value? A minimum value? Something else? Some algorithms don’t really care if you have missing values and can function just fine with them. Other algorithms will crash and complain. What happens if all your data is uniform except for one or two data points that appear to be outliers? How would you handle them? Would you normalize them or delete them?
Tough questions indeed.
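To make the remove-or-replace choice concrete, here is a minimal Python sketch, separate from RapidMiner itself. The column name, the values, and the helper functions are all made up for illustration, and missing entries are assumed to be represented as None.

```python
from statistics import mean

# Hypothetical column; None stands for a missing entry (values are made up)
wages = [2.0, None, 4.5, 3.0, None, 2.5]


def drop_missing(values):
    """Option 1: remove the missing entries entirely."""
    return [v for v in values if v is not None]


def fill_missing(values, strategy="mean"):
    """Option 2: replace missing entries with a statistic of the observed values."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("no observed values to compute a replacement from")
    if strategy == "mean":
        replacement = mean(observed)
    elif strategy == "min":
        replacement = min(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [replacement if v is None else v for v in values]


dropped = drop_missing(wages)
filled_mean = fill_missing(wages, "mean")
filled_min = fill_missing(wages, "min")

print(dropped)
print(filled_mean)
print(filled_min)
```

Dropping shrinks the dataset, while filling keeps every row but pulls the column toward whichever statistic you pick. Which trade-off is acceptable depends on the problem you are trying to solve.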
A Typical Example
Before you can even decide what to do with messy or poor-quality data, you have to inspect the data to see what you’ve got. We’ll use the Labor-Negotiations sample dataset from RapidMiner Studio. Find it under Samples > data and drag it onto the design canvas. Make sure to wire its output port to the process result port and hit Run.
When we press play, this simple process executes and loads the sample data into memory. This is the first step you need to take to explore your data.
This is the raw data loaded into RapidMiner and we’ll start with this view as we inspect the data.
Scrolling through this data, we see a few things. We have a “class” column (RapidMiner calls a column an “attribute”) that’s in green. This column is called a “special attribute,” meaning that RapidMiner interprets it differently than all the columns in white. In this case, the green color denotes that this column is a “label.” “Label” is RapidMiner’s fancy name for your target variable. This is the column the model learns to predict during training and later predicts, or “scores,” for unseen data.
Just by visually inspecting this dataset, you’ll notice a lot of “?” marks. Whenever RapidMiner encounters a missing value, it displays a “?”. RapidMiner isn’t actually editing your raw dataset; it’s just giving you a visual cue that you have a missing value.
In the sample data file, we have a total of 40 rows (“example” is RapidMiner’s fancy word for “row”), so scrolling up and down and left to right isn’t very hard to do here, but the reality is that you might have one million rows of data to look at. There is no way you can remember everything you see and note every missing value.
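Counting missing values by eye does not scale, but it is easy to automate. The sketch below is plain Python rather than anything RapidMiner runs internally. It assumes a small made-up CSV in which “?” marks a missing value, mirroring how the Data view displays them. The column names and rows are invented and are not the real Labor-Negotiations data.

```python
import csv
import io

# Made-up sample in which "?" marks a missing value (not the real dataset)
RAW = """duration,wage_increase,class
1,2.0,bad
?,4.5,good
3,?,good
2,?,?
"""


def count_missing(text, marker="?"):
    """Return a dict mapping each column name to its number of missing cells."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = reader.fieldnames or []
    counts = {name: 0 for name in fieldnames}
    for row in reader:
        for name in fieldnames:
            value = row.get(name)
            # A short row yields None; treat that as missing too
            if value is None or value.strip() == marker:
                counts[name] += 1
    return counts


missing = count_missing(RAW)
for column, n in missing.items():
    print(f"{column}: {n} missing")
```

The same loop works whether the file has four rows or a million, which is the point: let the machine do the counting.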
RapidMiner helps with this problem by providing a Statistics view. From personal experience, when I build a training set I spend most of my time in the Data, Statistics, and Chart views. I use these to look at the raw dataset (Data), to spot strange string entries in my data (Statistics), and to see if there are any visual patterns (Charts).
The Statistics view is very powerful. It is reminiscent of the “summary” and “head” commands that an R or Python user might run to get a first look at raw data. RapidMiner makes this part very visual and very fast.
In the Statistics view we see all the columns listed under the Name column, along with Type, Missing, and Statistics columns. This view lets us see the data from a completely different perspective. We can see that the duration attribute is an Integer data type, has 1 missing entry, and has a minimum value of 1 and a maximum value of 3. If you click on that row you get a small histogram visualization that gives you a quick idea of how that attribute is distributed across your dataset.
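For readers who think in code, the kind of per-attribute summary described above can be approximated in a few lines of Python. This is only a rough sketch, not how RapidMiner computes its statistics. The values in the list are hypothetical and merely chosen to resemble an integer attribute with one missing entry, and the summarize helper is invented for illustration.

```python
from collections import Counter

# Hypothetical integer column; None stands for a missing entry
duration = [1, 2, 3, 2, None, 3, 3, 1, 2, 2]


def summarize(values):
    """Collect type, missing count, min, max and value counts for one column."""
    observed = [v for v in values if v is not None]
    return {
        "type": type(observed[0]).__name__ if observed else "unknown",
        "missing": len(values) - len(observed),
        "min": min(observed) if observed else None,
        "max": max(observed) if observed else None,
        "histogram": dict(sorted(Counter(observed).items())),
    }


summary = summarize(duration)
print(summary["type"], summary["missing"], summary["min"], summary["max"])

# A crude text histogram: one "#" per occurrence of each value
for value, count in summary["histogram"].items():
    print(f"{value}: {'#' * count}")
```

Counting occurrences per value works here because the column has only a handful of distinct integers. A continuous attribute would need binning first.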
Now if you click on the Open Chart link, you are transported to RapidMiner’s Charting feature, which provides you with additional details during your Data Exploration journey.
If you toggle the Chart Style, you can select many other chart types and visualize each and every attribute of your data for fast insight generation and problem identification.