The Data Core Project
The Beginning of a Journey
I have never read “Journey to the Center of the Earth,” but when we first kicked off a project to revise the internal core data management of RapidMiner, our discussions sounded enough like science fiction to remind me of that Jules Verne title.
Customers love RapidMiner, not only for its advanced analytics capabilities, but also for its vast functionality for complex data prep. In conversations with customers, we received feedback that when running complex, computationally intensive data prep, RapidMiner would sometimes consume more memory and CPU cycles than we would like. So we decided to investigate ways to enhance performance.
We started by assembling a highly skilled task force of RapidMiner engineers and eventually kicked off a special-purpose project to revise RapidMiner's core data management and processing. Our aim was to lower memory consumption and boost performance by speeding up execution.
Now, the time has come to reveal some first results!
Going in, we knew that one of the keys to success would be our ability to actually measure performance. So we set up a test lab that allowed us to benchmark process execution. We looked at various key performance indicators like runtime and memory consumption. We ran a broad set of test processes and identified a few potential areas to improve performance.
The “Data Core”
We saw an opportunity to improve performance in a couple of situations related to the way RapidMiner manages data in memory and how individual operators access, transform and analyze the data. This part of RapidMiner is what we commonly refer to as the “data core.” When data is loaded, a data table is typically kept in memory, with views, such as row filters or column selections, wrapped around it.
The Status Quo
Investigating further, we found the most opportunity for improvement in how the data was kept in the data table (the blue “data” rectangle in the picture above). Since its inception, RapidMiner has featured row-based data management, where a certain amount of memory is allocated to each row. This is neither good nor bad in general. Row-oriented data management is typically good for transactional data processing with lots of operations that insert new rows or filter rows. It's typically not as good, though, when you have many operations that create new columns or deselect columns.
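A toy sketch in Python (not RapidMiner's actual data structures; the table and attribute names are invented) makes the trade-off concrete: the same table can be held as a list of rows or as a dictionary of columns, and each layout makes a different class of operation cheap.

```python
# Row-oriented layout: one record per row.
rows = [
    {"age": 34, "income": 52000, "label": "yes"},
    {"age": 29, "income": 48000, "label": "no"},
]

# Column-oriented layout: one list per column.
columns = {
    "age": [34, 29],
    "income": [52000, 48000],
    "label": ["yes", "no"],
}

new_row = {"age": 41, "income": 61000, "label": "yes"}

# Inserting a row is a single append in the row layout...
rows.append(new_row)
# ...but touches every column in the column layout.
for name, value in new_row.items():
    columns[name].append(value)

# Conversely, dropping a column is a single delete in the column layout...
del columns["income"]
# ...but requires rewriting every row in the row layout.
rows = [{k: v for k, v in r.items() if k != "income"} for r in rows]
```

This is the same asymmetry the paragraph above describes: row stores favor row inserts and filters, column stores favor creating and removing columns.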
A Prototypical Issue
In the hypothetical situation depicted below, deselecting some columns from the dataset displayed on the left, e.g., attributes 2, 3 and 5, would leave the corresponding memory (the yellow areas on the right) unused rather than freeing it up.
When we looked at the RapidMiner processes that consumed more memory than others, we found exactly this situation: these processes often performed a large number of column operations, typically creating many temporary columns in loops, e.g., for model validations or optimizations. So while row-based data management is certainly fine where data is rather static, this gave us an opportunity to improve performance in, e.g., high-end data prep processes where data takes various forms within one process.
Changing Perspectives: Towards Column-Oriented Data Management
To tackle this, we started to explore column-oriented data management for RapidMiner. Column-based data management has gained a lot of traction in analytical databases, since column-oriented storage has several advantages for aggregating and analyzing data quickly. It addresses the issue explained above directly: in a column-based layout, the memory used by temporarily created or deselected columns can simply be freed after use – and in our tests, memory consumption stayed low as expected.
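Continuing the earlier sketch (again hypothetical, with invented names, not RapidMiner's implementation), the pattern from the loop-heavy processes looks like this in a column layout: a temporary column is one entry in the table, so dropping it releases its memory in one step instead of leaving a dead slot in every row.

```python
# Hypothetical column-oriented table.
columns = {
    "att1": [1.0, 2.0, 3.0],
    "att2": [4.0, 5.0, 6.0],
}

# A loop iteration (e.g. one fold of a model validation) creates a
# temporary column derived from an existing one...
columns["tmp_prediction"] = [2.0 * x for x in columns["att1"]]

# ...and once the iteration is done, the whole column object is dropped,
# making its memory reclaimable immediately.
del columns["tmp_prediction"]
```

In a row layout, by contrast, the slot reserved for the temporary value would persist inside every row record until the entire table was rebuilt.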
A Bright Outlook
Column-based data management has advantages beyond the rather advanced use case mentioned above. In fact, it offers many more benefits for organizing and analyzing data in-memory, ranging from very compact representations of data (further reducing memory consumption) to organizing data in chunks that can be easily moved in memory or to disk and back (for even higher scalability). As such, a project that started as an investigation of software behavior in very specific situations quickly broadened into the implementation of a new, next-generation data core for the RapidMiner platform – with enormous possibilities!
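The chunking idea mentioned above can be sketched in a few lines (a toy illustration under assumed chunk sizes, not the actual data core): each column is split into fixed-size chunks, and each chunk is then a self-contained unit that could be kept in memory or spilled to disk independently.

```python
CHUNK_SIZE = 4  # assumed chunk size for illustration

def to_chunks(values, size=CHUNK_SIZE):
    """Split one column's values into fixed-size chunks."""
    return [values[i:i + size] for i in range(0, len(values), size)]

column = list(range(10))
chunks = to_chunks(column)
# Two full chunks plus a remainder chunk of length 2.
```

Because a chunk is much smaller than the whole column, moving individual chunks between memory and disk lets a table scale past available RAM without copying the entire column at once.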
There have been some major advancements to the RapidMiner platform since this article was originally published. We’re on a mission to make machine learning more accessible to anyone. For more details, check out our latest release.