17 October 2018


Data Preparation: Time consuming and tedious?

Today, I would like to discuss something which I personally find extremely exciting, although most people don’t really share this sentiment. It’s data preparation. It’s an incredibly important part of the machine learning and the general data science process. But, in fact, the way we are doing it at the moment is not good enough; it’s not smart enough and it’s taking too long.

So, I would like to introduce you to Turbo Prep, RapidMiner’s intuitive data prep for machine learning and I hope you find it as exciting as we do. It will change the way you do data preparation in the future.

But first, let’s get started by setting the stage to think a little bit about what the underlying problem in data science is.

The Problem with Data Science

Most organizations still face a data scientist bottleneck. People want to do data science, they want to use machine learning models in production, but they can’t achieve this because they don’t have the required resources. And while colleges and others try to create as many data scientists as possible, it’s just not happening fast enough and it’s hard to hire them.

The potential solution comes in two different flavors, and perhaps both can be pursued at the same time. The first flavor – let’s empower more people to do the work data scientists do. That means we need to make this a little bit simpler, because data science in general, and machine learning in particular, are just too hard.

If data preparation is such an important part of data science, we need to make that simpler as well. The other flavor – let’s make sure that the resources we currently have, or we are going to hire in the future, are more productive.

So now, how does data preparation play an important role here?

The Data Preparation Problem

What do data scientists spend the most time doing?

We all keep saying that 80% of our work as machine learning experts and data scientists is preparing the data. I personally feel it’s actually much higher. But here are some statistics from a survey of data scientists and general analysts.

How much time do they spend on the different aspects of data science?

In fact, the big blue part here is cleaning and organizing data, and that’s 60%. Then the light gray part is collecting datasets, it’s another 19%. So, that’s already 79%. And even building training sets, the orange part at the top, it’s another 3%. So, in total, 82% of our time, we spend on data preparation.

Only around 13% of the time is spent actually doing the modeling work and then other stuff, the other 5%. So, we spend more than 80% on data preparation. That’s probably because we love it so much, right? Wrong.
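The arithmetic behind those totals can be checked with a few lines of Python. This is only a sketch: the percentages are the ones quoted above, while the dictionary keys and the `prep_share` helper are names made up for illustration.

```python
# Time shares in percent, as quoted in the survey figures above.
time_spent = {
    "cleaning and organizing data": 60,
    "collecting datasets": 19,
    "building training sets": 3,
    "modeling": 13,
    "other": 5,
}

# The three buckets this post counts as data preparation.
PREP_TASKS = (
    "cleaning and organizing data",
    "collecting datasets",
    "building training sets",
)


def prep_share(shares):
    """Sum the shares of the buckets counted as data preparation."""
    return sum(shares[task] for task in PREP_TASKS)


print(prep_share(time_spent))    # 82
print(sum(time_spent.values()))  # 100
```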

What’s the least enjoyable part of data science?

If we ask the same data scientists and general analysts: well, what was the least enjoyable part of data science?

57% said it’s cleaning and organizing data. That was the 60% bucket we had seen before. So, we spend 60% of our time on that bucket, and that’s the one area we enjoy the least.

Then even collecting datasets and building training sets – the other two elements that form the total 82% of where we spend our time – are in the top three of least enjoyable parts of data science. So, we spend 80% of our time there, and we don’t even enjoy it that much. I mean, that’s horrible, but it’s reality.

What makes data preparation so difficult?

So, why do we need to spend so much time on preparing data and why is it really not that much fun? I think one way to think about data preparation – but also generally about data science – is to have a look into the different approaches to data science.

One is the code-based approach on the left side of the spectrum here, then there is the data-centric approach on the right side of the spectrum, and in the middle there is the process-based approach. Let’s quickly discuss all three of them.

Three Data Science Approaches

Code-based approach to data science

Code-based approaches to data science are pretty much the norm. I’m not saying this is necessarily a good thing. I love coding and I think there’s always a place for coding, but if you keep solving the same problems over and over again by writing code, you’re actually doing it wrong. It’s just not the most efficient way, and it’s not something everyone can do.

The beauty of writing code is: it’s powerful. Writing code is so flexible. You can solve whatever you want to solve. It is also repeatable. After you write the code, you can apply this code on the same data over and over again, and you’re supposed to get the same results.
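To make the repeatability point concrete, here is a minimal Python sketch. The rows, the column names, and the `clean_rows` function are all made up for illustration; the point is only that the same code applied to the same data yields the same result, and that loops and branches come for free.

```python
def clean_rows(rows):
    """Return a cleaned copy: tidy the name, drop rows with a missing age."""
    cleaned = []
    for row in rows:                 # a loop over the data
        if row["age"] is None:       # a branch on the data
            continue
        cleaned.append({
            "name": row["name"].strip().title(),
            "age": int(row["age"]),
        })
    return cleaned


raw = [
    {"name": "  alice ", "age": "34"},
    {"name": "BOB", "age": None},
    {"name": "carol", "age": "29"},
]

first = clean_rows(raw)
second = clean_rows(raw)
print(first == second)  # True: same code, same data, same result
```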

So that’s one end of the spectrum and it’s amazing but, then again, learning programming languages like R or Python is difficult and not for everybody.

Data-centric approach to data science

The other end of the spectrum is a more data-centric approach, which is basically Excel. You look at the data, you change the data, you edit the data, and you constantly work directly with the data. This is fantastic because it’s a very intuitive approach, you immediately see the impact of your changes.

Unfortunately, that also makes it a little bit more limited. For example, it’s very hard to do things like loops or branches on the data with this data-centric approach. So it doesn’t solve every problem, but it is very intuitive.

Another big drawback: if you do 10 different things to your dataset in Excel and then show me the final result, I would not be able to tell exactly what you did just by looking at the resulting data. And that’s a bit problematic.

Often, these data-centric approaches are not repeatable, which causes problems for data governance and makes it hard to apply the same approach to new data. So, those are the two ends of the spectrum.

Process-based approach to data science

It shouldn’t come as a surprise that the process-based approach – something we at RapidMiner have been doing for a long time – is a balanced approach in the middle.

Especially if you love to embed code, it’s as flexible and powerful as coding, but it’s definitely a little bit easier and a little bit more intuitive than coding. Maybe not as intuitive as the data-centric approach, but it’s balanced.

If you design processes, it is typically a little bit faster than coding, especially after you get used to it. The reason is that every step in the process encapsulates something like 50, 100, or 200 lines of code. So, you can do more, faster.

This approach is good for governance, as well. Your process can be applied to new datasets, it can also be shared with, checked, and understood by others. It really helps you with your governance around machine learning and data science.
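The idea of a process as a list of reusable, inspectable steps can be sketched in plain Python. This is not how RapidMiner implements processes; the step builders `drop_missing` and `rename`, and the helpers `run_process` and `describe`, are hypothetical names chosen for this illustration.

```python
def drop_missing(column):
    """Build a step that removes rows where `column` is missing."""
    def step(rows):
        return [r for r in rows if r.get(column) is not None]
    step.__name__ = f"drop_missing({column!r})"
    return step


def rename(old, new):
    """Build a step that renames column `old` to `new`."""
    def step(rows):
        return [{(new if k == old else k): v for k, v in r.items()} for r in rows]
    step.__name__ = f"rename({old!r}, {new!r})"
    return step


def run_process(process, rows):
    """Apply every step in order and return the transformed rows."""
    for step in process:
        rows = step(rows)
    return rows


def describe(process):
    """List the steps so that others can check and understand them."""
    return [step.__name__ for step in process]


process = [drop_missing("income"), rename("income", "annual_income")]

january = [{"id": 1, "income": 52000}, {"id": 2, "income": None}]
february = [{"id": 3, "income": 61000}]

print(run_process(process, january))   # [{'id': 1, 'annual_income': 52000}]
print(run_process(process, february))  # [{'id': 3, 'annual_income': 61000}]
print(describe(process))
```

Because the process is a value of its own, separate from any one dataset, it can be rerun on new data and reviewed step by step, which is the governance benefit described above.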

There is also one drawback. And, although we are big believers in this process-based approach, I would be lying if I didn’t point this out. It still requires you to think a little bit like a programmer.

This is because when you code, you write code, you interpret or compile it, and then apply this code onto some data. At the end, you see the result.

When you’re creating the code, you need to envision what this code is going to do to your data. This is true for processes as well. So, while you’re designing the process, you can’t see the result until you finally execute the process and, only then, will you see if you made errors or whether it is good. And this is what makes it harder than the data-centric approach.

Although building processes can, in general, be faster than coding, it can still take longer than necessary. This approach is not for everybody, for the same reason coding is not for everybody – thinking like a programmer can be difficult.

The Solution

For us, as a data science platform vendor, obviously the most important thing is usability for most people, and not validation or total cost of ownership, not even the functionality of how many machine learning models we support or anything like that. And that makes sense if you think about the data scientist bottleneck we discussed in the beginning.

If you want to empower more people, then usability becomes the most important thing. But, at the same time, even a process-based approach can’t go the whole nine yards. It can’t get you all the way to empowering everybody. So, what could we do differently, then? Well, we have been thinking long and hard about this.

How can we get the ease of working in data while keeping the advantages?

One way to combine the best of both worlds, getting the ease of working directly in the data while keeping the advantages of code- and process-based approaches, is to turn things around a little bit. The traditional approach is to start with the code or the process, apply it to the data, and then see the results.

But, that whole way of thinking might be wrong. So, let’s forget about code. It’s definitely important for certain people, I love it myself, but it’s not for everybody and not for every solution.

So, let’s drop the code and turn around the data and the process. What if we actually work in the data, and then, depending on what we do to the data, we build the process in the background?

It’s like working with Excel, but, all the changes you make to the data are being recorded, almost like a video. Then it’s turned into a process which can be applied to new data and can always be followed along with to make sure there are no mistakes, and that everything looks good.
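A minimal Python sketch of this "record while you edit, replay later" idea follows. It illustrates the principle only and is not Turbo Prep’s actual implementation; the `RecordingTable` class, the two operations, and the `replay` function are hypothetical names invented for this example.

```python
def _fill_missing(rows, column, value):
    """Replace missing entries in `column` with `value`."""
    return [{**r, column: value if r[column] is None else r[column]} for r in rows]


def _filter_min(rows, column, minimum):
    """Keep only rows whose `column` is at least `minimum`."""
    return [r for r in rows if r[column] >= minimum]


OPERATIONS = {"fill_missing": _fill_missing, "filter_min": _filter_min}


class RecordingTable:
    """Applies each edit to the data right away and records it for replay."""

    def __init__(self, rows):
        self.rows = [dict(r) for r in rows]
        self.log = []  # the recorded process: (operation name, arguments)

    def _apply(self, name, **kwargs):
        self.rows = OPERATIONS[name](self.rows, **kwargs)  # visible at once
        self.log.append((name, kwargs))                    # recorded for later
        return self

    def fill_missing(self, column, value):
        return self._apply("fill_missing", column=column, value=value)

    def filter_min(self, column, minimum):
        return self._apply("filter_min", column=column, minimum=minimum)


def replay(log, rows):
    """Run a recorded process against a new dataset."""
    for name, kwargs in log:
        rows = OPERATIONS[name](rows, **kwargs)
    return rows


table = RecordingTable([
    {"id": 1, "score": None},
    {"id": 2, "score": 40},
    {"id": 3, "score": 90},
])
table.fill_missing("score", 50).filter_min("score", 50)

print(table.rows)  # [{'id': 1, 'score': 50}, {'id': 3, 'score': 90}]
print(table.log)   # the recorded process, readable step by step

new_data = [{"id": 10, "score": None}, {"id": 11, "score": 10}]
print(replay(table.log, new_data))  # [{'id': 10, 'score': 50}]
```

The user only ever touches the data, yet the log doubles as a process that can be inspected for mistakes and applied to new data.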

This is really a paradigm change. I think this is exciting because I believe, in the future, this is how people will create data science solutions and will work with data preparation.

It combines all the advantages of the different approaches. You get the ease of use of a data-centric approach, but since we build processes in the background, you keep all the advantages of governance and repeatability.

You keep the power because you can still edit the processes if you have to. You can even embed your own custom code if you have to so that you’re not losing anything. Just by turning things around, you really get the best of both worlds. And I think personally, this is absolutely exciting.

To quickly recap, in this new approach, you work directly in the data while the process is built for you in the background.

So, it’s an honor to introduce to you now RapidMiner Turbo Prep, which is doing exactly that.

Introducing RapidMiner Turbo Prep

The whole vision of RapidMiner Turbo Prep was to provide this interactive and very data-centric way of working with data, like sitting in Excel and changing the data directly. So, you do all the blending and data cleansing directly in your data, but then we record what you did in the background and we build those processes which you can then view and have control over.

We can’t accept black boxes. It’s so important that you’re not hiding what you did, even if you automate things or work in this data-centric view. Building the process in the background is important for keeping those advantages, but it’s even more important because otherwise you wouldn’t be able to trust those results entirely.

Turbo Prep is doing a couple more things as well. Firstly, it offers a lot of automatic ways to improve the quality of your data, specifically for machine learning. It also seamlessly integrates with Auto Model for machine learning. Turbo Prep is able to scan through large datasets.

Although it is data-centric and very interactive, you can work on pretty large datasets and you don’t need to wait long. Finally, since we turned this whole thing around and have this paradigm change, Turbo Prep is supposed to be usable by anyone – if you can use Excel, you can use RapidMiner Turbo Prep.

To see a demonstration of RapidMiner Turbo Prep, check out the “Intuitive Data Prep for Machine Learning” webinar. I recommend fast-forwarding to the 12-minute mark, since this post already covers the introduction.

If you haven’t done so already, be sure to download RapidMiner Studio for all of the capabilities to support the full data science lifecycle.
