RapidMiner’s New Parallel Cross-Validation
Time is money
Time is money, people say. While simplifying a lot, this catchy phrase is certainly valid when summarizing the challenges for data scientists. In today’s fast-paced businesses, the companies that move most quickly have the competitive edge in their markets. Data science is becoming the key guide showing companies where to move and how, and this translates directly into the central challenge for data scientists: produce as much valuable insight and knowledge in as little time as possible. Speed is of the essence, not results alone. Time is money.
Accelerating Analytics: Building it and running it
At RapidMiner, one of our key objectives is to accelerate analytics and to help data scientists deliver accurate and valuable results faster. For quite some time now, we have focused on helping data scientists build analytics faster, e.g. by recommending next steps when building processes or by letting them re-use work done previously. While this provides a decent lever to accelerate analytics, it is only half of the game. The other half is execution. As such, we have now broadened our focus to speed up running analytics as well:
A new parallel execution framework
With RapidMiner 7.3, we introduce a new parallel execution framework under the hood of RapidMiner Studio and RapidMiner Server. This allows you to run calculations in parallel on multiple CPU cores, making full use of the available compute resources. In the next couple of releases, we plan to migrate many of our operators to this framework, resulting in a considerable speed-up through the parallelization of computations. As a first step demonstrating the value, we have parallelized one of the most important operators in RapidMiner: Cross-Validation.
Cross-validation? An excursus*
For those who don’t know (yet), cross-validation is the de facto standard approach for evaluating how well predictive models predict. It works by repeatedly splitting a finite dataset into non-overlapping training and test sets, building a model on a training set, applying it to the corresponding test set, and finally calculating how well it predicts what you already knew. Each iteration of training a model, applying it and evaluating its predictive quality is called a fold. Cross-validation is not only the core step in verifying whether a predictive model could be used for a particular use case; it is also a central step in comparing different models, in identifying and selecting the best one, and in tuning the parameters of a model. In short: cross-validation is used all over the place when it comes to modeling and model optimization. I have built processes in which the cross-validation operator was executed hundreds of times to figure out the best model.
*Excursus? An excursus.
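As a rough illustration of the procedure described above, here is a minimal sketch in plain Python. It is not how the RapidMiner operator is implemented: the function names `k_fold_indices` and `cross_validate` are made up for this example, the "model" is just a majority-class baseline that ignores any features, and the folds are assigned by simple striping rather than by shuffled or stratified sampling.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k non-overlapping test sets (striped)."""
    return [list(range(start, n_rows, k)) for start in range(k)]


def cross_validate(labels, k):
    """Return the accuracy of each fold for a majority-class baseline model."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        test_set = set(test_idx)
        # Training set: every row that is not in the current test set.
        train_labels = [y for i, y in enumerate(labels) if i not in test_set]
        # "Training": remember the most frequent label in the training set.
        majority = Counter(train_labels).most_common(1)[0][0]
        # "Applying" and "evaluating": share of test rows predicted correctly.
        correct = sum(1 for i in test_idx if labels[i] == majority)
        accuracies.append(correct / len(test_idx))
    return accuracies


# Hypothetical toy data: ten rows with a known label each.
labels = ["yes"] * 7 + ["no"] * 3
fold_accuracies = cross_validate(labels, k=5)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```

Each pass through the loop is one fold, and the folds do not depend on each other's results, which is what makes the procedure a natural candidate for parallel execution.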
Now that we have ported the cross-validation operator to make use of parallel execution, all such modeling processes speed up. In the best case, the speed-up equals the number of folds of your cross-validation. But even in what we would consider a standard case, a ten-fold cross-validation on a quad-core CPU, we can easily cut process runtimes by 50%. The benefit is clear: significantly less time is needed to run model processes. In effect, you get results way faster than before, can explore more models, variants and parameters in less time, and ultimately produce better results faster.
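To put the speed-up figures above into perspective, the following back-of-the-envelope sketch computes an idealized upper bound. It assumes that every fold takes the same time, that each core runs one fold at a time, and that there is no scheduling or memory overhead. None of this describes RapidMiner's actual scheduler, and `ideal_speedup` is a hypothetical helper written for this example.

```python
import math


def ideal_speedup(folds, cores):
    """Upper bound on the speed-up when equal-length folds run in parallel."""
    # Folds are processed in batches of at most `cores` folds at a time.
    rounds = math.ceil(folds / cores)
    return folds / rounds


quad_core = ideal_speedup(folds=10, cores=4)   # 3 rounds instead of 10
best_case = ideal_speedup(folds=10, cores=10)  # 1 round instead of 10
```

Under these assumptions, ten folds on four cores finish in three rounds, a bound of roughly 3.3x, while one core per fold gives a speed-up equal to the number of folds. Cutting runtimes by 50% corresponds to a 2x speed-up, which sits below that bound and leaves room for real-world overhead.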
While modeling is a big part of the work of data scientists, there is more to it, and there is much more to do to speed up other parts of the process as well. With the new parallel execution framework, we have laid the foundation for further improvements that considerably speed up the execution of core, computationally intensive tasks in RapidMiner. Stay tuned for related improvements in the next releases. Again, we want to help you be faster when building and running analytics.
On a final note, we could not make a purely performance-related improvement without continuing to think about user experience: to simplify usage, we have consolidated three operators related to cross-validation into a single one. Where you previously could choose from the X-Validation, Batch-X-Validation or X-Prediction operators, all of their functionality is now covered by the single new Cross-Validation operator, making it easier to adapt to various use case requirements (see image below). Just another small improvement to accelerate analytics a little bit further. Time is money, after all.
To find out what else is new in RapidMiner 7.3, check out the release notes.