11 November 2020

In-Database Processing: Preprocessing Data Like a Pro

RapidMiner has always been passionate about our open-source heritage. It’s led us to build a platform that’s extensible and flexible so that it can be easily integrated with existing products in the enterprise landscape.

Continuing in that vein, this blog post will introduce our new In-Database Processing Extension. We’ll walk you through what in-database processing is, and then explain how the extension can be a powerful tool to dramatically simplify the process of integrating with the data that already exist in your enterprise.

What exactly is in-database processing?

Before we share details about the extension, let’s get a clear understanding of what exactly in-database processing is. In short, in-database processing involves doing data preparation and preprocessing directly in the database system. This is opposed to reading the entire raw dataset into your local machine and doing your data processing work there.
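To make the contrast concrete, here is a minimal sketch in Python. It uses the standard library's sqlite3 module as a stand-in for an enterprise database. The `orders` table, its columns, and both helper functions are made up for illustration and have nothing to do with the extension's internals. The first function pulls every raw row to the client and aggregates there. The second lets the database do the aggregation and fetches only the summary.

```python
import sqlite3

# Hypothetical in-memory table standing in for a large enterprise dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 10.0), ("north", 15.0), ("south", 7.5), ("south", 2.5), ("west", 4.0)],
)


def totals_local(conn):
    """Pull every raw row to the client, then aggregate in Python."""
    rows = conn.execute("SELECT region, amount FROM orders").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return totals, len(rows)


def totals_in_database(conn):
    """Let the database aggregate; only the summary rows reach the client."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


local_totals, local_rows = totals_local(conn)
db_totals, db_rows = totals_in_database(conn)

print(local_totals == db_totals)  # both approaches produce the same totals
print(local_rows, db_rows)        # rows handed to the client in each case
```

With five rows the difference is trivial, but the pattern is the same at scale: the first approach moves the whole table to the client, while the second moves one row per region.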

Benefits of in-database processing

If you’ve ever worked with enterprise-sized datasets, you know how challenging they can be: they can take forever to download, and even longer to process locally.

In-database processing is about shifting that workload from your local machine, which is inefficient at heavy-duty data prep tasks, to somewhere it can be executed more efficiently: the server architecture built to run these kinds of processes. This shift improves speed and efficiency, and it can lower costs.

Database systems are heavily optimized for quick and efficient SQL query execution. By using the extension, you move the heavy lifting of data processing from your own computer (where RapidMiner Studio would otherwise execute it) to the database system. This will typically get you results much faster than trying to do everything locally.

Additionally, as all of the processing happens remotely on the database server, only the result of the process needs to be transferred over the network. Depending on the use case, that result might be orders of magnitude smaller than the raw dataset. When working on larger datasets over a slower network connection (e.g., when working on a project from home, which is quite common these days), letting the processes run remotely instead of locally can save a substantial amount of time.

In-database processing can also directly save you money. For example, if you’re using Google BigQuery—a large cloud data warehouse backend—you get charged for the amount of data retrieved. By using in-database processing, you can ensure that the processing happens on Google’s servers, so that you only pay for the (potentially much smaller) resultant dataset that gets returned at the end.

Meet RapidMiner’s In-Database Processing Extension

Because of the strong benefits of doing in-database processing wherever possible, we wanted to make sure that RapidMiner users could easily run in-database processes as part of their workflows. The In-Database Processing Extension for Studio unites the world of SQL with the world of RapidMiner operators, unlocking the processing power of your database system using the simplicity of a visual workflow.

The extension currently supports Postgres, MySQL, MSSQL and Google BigQuery as database backends, and Oracle support is already on the roadmap. We’re constantly evaluating adding further database systems to our list of supported ones based on popular demand, so if you have a specific need, feel free to let us know by posting on our community.

Whether you’re a SQL novice who spends a lot of time on Stack Overflow trying to figure out how to write a query, or an expert who’s intimately familiar with SQL queries, this extension can make your work faster and easier.

Let’s dive deeper into what the extension can do for you.

Processing data from databases without using SQL

If you’re familiar with RapidMiner’s data preprocessing operators such as Filter, Join, and Aggregate, you can use these operators to build a data processing workflow in RapidMiner. Then, using the In-Database Processing Extension, you can connect that workflow to your database server. The extension will translate the RapidMiner operators into a well-formed SQL query—automatically—which will then be executed directly on the server.

Aside from not requiring you to know SQL, this approach also lets you leverage the likely significant processing power of your database system, as discussed above, so you’ll get a result more quickly and easily. The result of the translated SQL query is returned to RapidMiner and made available as an ExampleSet. From there, you can initiate additional transformations, enrich your set with other data, or use it to start creating a model.
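The idea of turning workflow steps into a single query can be pictured with a toy sketch in Python. This is not how the extension is implemented, and `build_query` is a hypothetical helper rather than part of any RapidMiner API. It simply composes one SQL statement from a filter step and an aggregate step, then runs it against an in-memory SQLite table (again a made-up `orders` table).

```python
import sqlite3


def build_query(table, filter_condition, group_by, aggregate_column):
    """Compose one SQL statement from a filter step and an aggregate step.

    Hypothetical helper for illustration only. The arguments are interpolated
    directly into the SQL text, so it must only be called with trusted,
    hard-coded values.
    """
    return (
        f"SELECT {group_by}, SUM({aggregate_column}) AS total "
        f"FROM {table} "
        f"WHERE {filter_condition} "
        f"GROUP BY {group_by} "
        f"ORDER BY {group_by}"
    )


# Hypothetical in-memory table standing in for a table on a database server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 10.0), ("north", 15.0), ("south", 7.5), ("south", 2.5), ("west", 4.0)],
)

# "Filter" keeps orders above 5; "Aggregate" sums the amount per region.
query = build_query("orders", "amount > 5", "region", "amount")
result = conn.execute(query).fetchall()

print(query)
print(result)
```

The point of the sketch is that the person describing the steps never writes SQL by hand, yet the work still runs as one query inside the database, and only the aggregated rows come back.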

Something for SQL power users, too

If you’re comfortable writing SQL queries, you should still consider using the operators provided by the extension. The main benefit is that the resulting RapidMiner process can be easily understood, and thus reused, by others in your organization, including people who lack your SQL-composing skills. This way, your team members can leverage your work and turn it into insight and a clear business impact for your enterprise.

Wrapping up

As we’ve seen, the In-Database Processing Extension can help you improve the speed and efficiency of your work. If you’re already a RapidMiner user, try it now by downloading it from the Marketplace. You’re also welcome to start a discussion on the community to share your experience, or ask any questions about how to use it.

If you’re not a RapidMiner user yet, start a free trial now. This extension is just one example of the ways RapidMiner can streamline your work and integrate with your current enterprise landscape in an easy, efficient way.
