RapidMiner & Big Data – How Big is Big?

Some would say the advent of big data will change the analytical world and that Hadoop has become the de facto platform for the distributed storage and handling of structured and unstructured big data sets. At the same time, enterprises are realizing that the mere storage of data offers no inherent value: it is the analysis of the data that transforms it into actionable insights. Data mining and predictive analytics are the most advanced form of analysis: they identify not only what has happened in the past, but also why it happened and what will happen next.

The RapidMiner platform is an excellent solution for handling unstructured data like text files, web traffic logs, and even images. Given this, the variety aspect of big data does not pose new challenges to the platform. Here we discuss how the volume of big data can also be handled easily, without writing a single line of code.

 

Analytical Engines in RapidMiner

RapidMiner offers flexible approaches to remove limitations in data set size. The most frequently used engine in RapidMiner is the In-Memory engine, where the data is loaded completely into memory and analyzed there. This and the other engines are outlined below.

  • In-Memory:  The natural storage mechanism of RapidMiner is in-memory data storage, highly optimized for the data access patterns typical of analytical tasks.
    • In-memory analytics is always the fastest way to build analytical models.
    • Data set size is restricted by hardware (memory): The more memory is available, the larger the data sets that can be analyzed.
    • Data set size: On decent hardware, up to ca. 100 million data points.
  • In-Database: The enterprise edition of RapidMiner offers a set of operators where the data stays in the database and the analysis is performed there.  This allows for essentially unlimited data set sizes since the data is not extracted from the database.
    • Not applicable for all analysis tasks.
    • Runtime depends on the power of the database server.
    • Data set size: Unlimited (limit is the external storage capacity).
  • In-Hadoop:  The advantage of Hadoop is that it offers a distributed storage engine and, in addition, lets the Hadoop cluster serve as a distributed analytical engine for certain analytical and preprocessing tasks.
    • Not applicable for all analysis tasks.
    • Runtime depends on the power of the Hadoop cluster.
    • Due to massive overhead introduced by Hadoop, the usage of Hadoop is not recommended for smaller data set sizes.
    • Data set size: Unlimited (limit is the external storage capacity).
  • Loop-based workflow design:  All the storage types above can be combined with loop-based workflows, where data processing or modeling is performed on partitions of the data and the results are combined afterwards.  Whether a loop-based approach using the in-memory storage or an in-database analytics approach is faster depends on the data set as well as the analysis.
    • Not applicable for all analysis tasks.
    • Memory is no longer the limitation but runtime becomes more important.
    • Data set size: Unlimited (limit is the external storage capacity).
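
The loop-based idea can be sketched outside of RapidMiner in a few lines of Python. This is only an illustration under stated assumptions, not RapidMiner's implementation: the function names, the toy data, and the choice of a categorical Naïve Bayes model with add-one smoothing are all hypothetical. It shows why Naïve Bayes suits partitioned processing: each partition only contributes counts, and the counts can simply be summed.

```python
from collections import Counter
from itertools import islice
import math


def partitions(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def count_partition(chunk):
    """Collect the Naive Bayes counts for a single partition."""
    class_counts = Counter()
    feature_counts = Counter()  # key: (label, feature index, value)
    for features, label in chunk:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(label, i, value)] += 1
    return class_counts, feature_counts


def train_in_partitions(rows, size):
    """Loop over partitions and combine their counts by summation."""
    class_counts, feature_counts = Counter(), Counter()
    for chunk in partitions(rows, size):
        c, f = count_partition(chunk)
        class_counts.update(c)    # Counter.update adds counts
        feature_counts.update(f)
    return class_counts, feature_counts


def predict(model, features, n_values=2):
    """Return the most probable label (add-one smoothing, assumed
    `n_values` distinct values per feature)."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        for i, value in enumerate(features):
            count = feature_counts[(label, i, value)]
            score += math.log((count + 1) / (n + n_values))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: ((outlook, temperature), label)
DATA = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("sunny", "hot"), "no"),
    (("rainy", "mild"), "yes"),
]

model_chunked = train_in_partitions(DATA, size=2)      # three partitions
model_full = train_in_partitions(DATA, size=len(DATA))  # one partition
```

Because only one partition is held in memory at a time, memory stops being the limit, while the total runtime grows with the number of partitions that have to be read. Models whose training cannot be expressed as combinable partial results do not fit this pattern, which is why the approach is not applicable to all analysis tasks.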

Runtime Comparison of Analytical Engines within RapidMiner

Below, you can find a runtime comparison for the creation of a Naïve Bayes model with:

  • The In-Memory engine,
  • The In-Database engine, and
  • The In-Hadoop engine.

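[Figure: runtime_engines_nb. Runtime comparison of the In-Memory, In-Database, and In-Hadoop engines for Naïve Bayes model creation.]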

The comparison shows that the default In-Memory engine of RapidMiner is the fastest approach in general, but it will of course fail as soon as the data set size hits the memory limit of the machine. Until then, training a model on millions of data points takes only seconds or minutes with the In-Memory approach. On decent hardware, Rapid-I recommends that enterprises use this fast engine as the default for data set sizes up to 100 million data points.

The In-Database engine is slower overall but scales up to unlimited data set sizes.  It is almost as fast as the In-Memory engine for smaller data sets but becomes relatively slow beyond 20 million data points. Still, for data sets which can no longer be handled with the In-Memory engine (usually 100 million data points or more), this engine might be the best option if a Hadoop cluster is not available.  The In-Database engine is certainly a good compromise between scalability and ease of setup and use.
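
To make concrete what "the data stays in the database" can mean for a Naïve Bayes model, here is a minimal sketch. It is not how RapidMiner's In-Database operators are implemented; it assumes a single numeric attribute, a Gaussian Naïve Bayes model, and uses Python's built-in sqlite3 module as a stand-in for an enterprise database. The table name `measurements`, its columns, and the sample values are hypothetical. The point is that only one small row of statistics per class leaves the database, never the raw data.

```python
import sqlite3


def class_statistics(conn):
    """Let the database compute per-class count, mean, and variance.

    These are the statistics a Gaussian Naive Bayes model needs for
    one numeric attribute; the raw rows are never fetched.
    """
    rows = conn.execute(
        """
        SELECT label,
               COUNT(*)                     AS n,
               AVG(x)                       AS mean,
               AVG(x * x) - AVG(x) * AVG(x) AS variance
        FROM measurements
        GROUP BY label
        ORDER BY label
        """
    ).fetchall()
    return {
        label: {"n": n, "mean": mean, "variance": variance}
        for label, n, mean, variance in rows
    }


# Hypothetical example data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (label TEXT, x REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0), ("b", 12.0)],
)

stats = class_statistics(conn)
```

In such a setup the runtime is determined by how quickly the database server can scan and aggregate the table, which is consistent with the observation that the approach scales in size but slows down as the data grows.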

The In-Hadoop engine, which ran on a Hadoop cluster of only three nodes in the experiments above, is prohibitively slow on small data sets but scales up to larger data sets much better than the In-Database approach.  Its runtime can be further improved by adding more computation nodes to the Hadoop cluster. Since the overhead for most data sets of common sizes is so large, Hadoop is only recommended as the underlying engine for data set sizes of 500 million data points or more where runtime is also an important issue.  The major drawbacks of Hadoop clusters are higher setup and infrastructure costs and the fact that Hadoop is still an immature technology in many regards.
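
The sizing recommendations in this section can be condensed into a simple rule of thumb. The sketch below is a hypothetical helper, not part of RapidMiner; the function name and parameters are assumptions, and the two thresholds are the figures quoted in this section for "decent hardware", so they should be adjusted to the actual machine and cluster.

```python
# Thresholds in data points, taken from the recommendations in the text.
IN_MEMORY_LIMIT = 100_000_000
HADOOP_THRESHOLD = 500_000_000


def suggest_engine(data_points, hadoop_available=False, runtime_critical=False):
    """Suggest an analytical engine for a given data set size.

    Hypothetical rule of thumb: In-Memory up to the memory limit,
    In-Hadoop only for very large data sets where runtime matters and
    a cluster exists, In-Database otherwise.
    """
    if data_points <= IN_MEMORY_LIMIT:
        return "In-Memory"
    if (
        hadoop_available
        and runtime_critical
        and data_points >= HADOOP_THRESHOLD
    ):
        return "In-Hadoop"
    return "In-Database"
```

For example, `suggest_engine(2_000_000_000, hadoop_available=True, runtime_critical=True)` returns `"In-Hadoop"`, while the same data set without a cluster falls back to `"In-Database"`. The rule ignores whether the required algorithm is supported by the chosen engine, which has to be checked separately.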

Conclusion

The results above show that Hadoop and Hive can be used not only as an effective distributed storage system for large amounts of data, but also as the computing engine for analytical tasks. A platform like RapidMiner provides efficient access to data stored in files, databases, and Hadoop, along with the ability to analyze it, without the hassle of writing code.

Looking at the runtimes for analytical algorithms, it is clear that limitations in terms of data set sizes have vanished today, but at the price of longer runtimes. These runtimes are prohibitive for interactive reports in all cases, and likely also for predictive analytics if the model has to be created quickly or in real time.  In those cases, the In-Memory engine is still the fastest option. The In-Database engine offers the best compromise between scalability and initial setup costs. The In-Hadoop engine is prohibitively slow for smaller data sets but is the fastest option when data sets are really big in terms of volume. In the latter case, however, not all analytical algorithms are supported.

Depending on the application at hand, one engine will always be superior to the others, and hence RapidMiner will continue to support all three engine types to give users full flexibility in solving their analytical problems. The company’s goal is that every RapidMiner user can select the best engine for their specific application field and get optimal results in minimal time. Since runtime is often an issue, RapidMiner continues to improve its own In-Memory storage as a powerful default.  At the same time, the company will support more functions for the two alternative engines, In-Database and In-Hadoop analytics, to offer users who have to analyze billions of data points the same powerful yet easy-to-use platform they are already used to.
