

Hadoop Big Data Analytics – How Big is Big?
Hadoop big data analytics has the power to change the world. Hadoop, the de facto platform for distributed big data storage and processing, also plays an important role in big data analytics. Organizations now realize the inherent value of transforming this big data into actionable insights. Data science is the most advanced form of big data analytics, producing the most accurate actionable insights: identifying what will happen next and what to do about it.
The RapidMiner platform is an excellent solution for handling unstructured data like text files, web traffic logs, and even images. In this article, however, we will discuss how the volume of big data can be easily handled – without writing a single line of code (unless you want to, of course).
Analytical Engines in RapidMiner
RapidMiner offers flexible approaches to remove any limitations on data set size. The most commonly used is the in-memory engine, where the data is loaded completely into memory and analyzed there. This and the other engines are outlined below.
- In-Memory: The natural storage mechanism of RapidMiner is in-memory data storage, highly optimized for the data access patterns typical of analytical tasks.
  - In-memory analytics is generally the fastest way to build analytical models
  - Data set size is restricted by hardware (memory): the more memory available, the larger the data sets that can be analyzed
  - Data set size: on decent hardware, up to ca. 100 million data points
- In-Hadoop: The advantage of Hadoop is that it offers both distributed storage and the possibility to use the Hadoop cluster as a distributed analytical engine for big data analytics.
  - Not suitable for quick, interactive analysis
  - Runtime depends on the power of the Hadoop cluster, but scalability is virtually unlimited
  - Due to the overhead introduced by Hadoop, it is not recommended for smaller data sets
  - Data set size: unlimited (the limit is the external storage capacity)
- Loop-based workflow design: Both engines above can be combined with loop-based workflows, where data processing / modeling is performed on partitions of the data and the results are combined afterwards; a sketch of this pattern follows the list. Whether a loop-based approach using, say, the in-memory engine is faster depends on the data set as well as the analysis.
  - Not applicable to all analysis tasks
  - Memory is no longer the limitation; runtime becomes the more important concern
  - Data set size: unlimited (the limit is the external storage capacity)
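In RapidMiner itself, this pattern is built with loop operators rather than code. As a rough illustration of the same idea outside the platform, here is a minimal Python sketch (pandas and scikit-learn assumed) that reads a large CSV partition by partition and updates a Naïve Bayes model incrementally, so the full data set never has to fit into memory. The file name, chunk size, and column names are hypothetical placeholders.

```python
# Loop-based processing sketch: the full data set never has to fit
# into memory at once. File name, chunk size, and column names are
# hypothetical placeholders, not RapidMiner internals.
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

CHUNK_ROWS = 1_000_000        # partition size; tune to available memory
CLASSES = np.array([0, 1])    # all class labels must be known up front

model = GaussianNB()
for chunk in pd.read_csv("transactions.csv", chunksize=CHUNK_ROWS):
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    # partial_fit updates the model with one partition at a time and
    # combines the partial statistics into a single model
    model.partial_fit(X, y, classes=CLASSES)
```

Memory use is bounded by the partition size instead of the full data set, which mirrors the trade-off above: runtime grows with the number of partitions, but memory stops being the limiting factor.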
Runtime Comparison of Analytical Engines within RapidMiner
Below, you can find a runtime comparison for the creation of a Naïve Bayes model with:
- the in-memory engine
- the in-Hadoop engine
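The exact numbers depend on hardware, but the in-memory side of such a comparison is easy to reproduce yourself. Below is a hypothetical timing sketch using scikit-learn's Gaussian Naïve Bayes on synthetic data; the sizes and feature count are arbitrary choices, not the setup of the original experiments.

```python
# Hypothetical timing sketch for in-memory Naive Bayes training.
# Data sizes and feature count are arbitrary, chosen so that the
# largest run has 10M rows x 10 features = 100M data points.
import time
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)
n_features = 10

for n_rows in (100_000, 1_000_000, 10_000_000):
    X = rng.normal(size=(n_rows, n_features))
    y = rng.integers(0, 2, size=n_rows)
    start = time.perf_counter()
    GaussianNB().fit(X, y)
    print(f"{n_rows:>10,} rows: {time.perf_counter() - start:.2f}s")
```

At 10 features per row, the largest run corresponds to the roughly 100 million data points cited above as the practical in-memory limit; expect it to need several GB of RAM.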
It can easily be seen that RapidMiner's default in-memory engine is in general the fastest approach, but it will of course fail as soon as the data set size exceeds the memory of the machine. Within that limit, however, training a model on millions of data points takes only seconds or minutes. On decent hardware, RapidMiner recommends this fast engine as the default for data sets of up to 100 million data points.
The in-Hadoop engine, which ran on a cluster of only three nodes in the experiments above, is prohibitively slow on small data sets but scales up nicely to very large ones. Runtime can be further improved by adding more computation nodes to the Hadoop cluster. Since the overhead is so large for data sets of common sizes, Hadoop is only recommended as the underlying engine when the data set contains 500 million data points or more and runtime matters at the same time.
Conclusion
Hadoop is not just an effective distributed storage system for large amounts of data; importantly, it is also a distributed computing environment that can execute analyses where the data resides. RapidMiner makes use of all the possibilities offered by Hadoop by allowing users to perform distributed advanced analytics on the data inside Hadoop.
Looking at the runtimes of the analytical algorithms, it is easy to see that limitations on data set size have effectively vanished today, but at the price of longer runtimes. This is prohibitive for interactive reports, and likely also for predictive analytics whenever a model has to be created quickly or in real time. In those cases, the in-memory engine is still the fastest option. The in-Hadoop engine is slow for smaller data sets but is the fastest, and sometimes the only, option when data sets are really big in terms of volume.
Depending on the application at hand, one engine will be superior to the other, and therefore RapidMiner supports both the in-memory and the in-Hadoop engine, giving users the flexibility to solve all their analytical problems. The company's goal is that RapidMiner users can always select the best engine for their specific application and get optimal results in minimal time.
There have been some major advancements to the RapidMiner platform since this article was originally published. We’re on a mission to make machine learning more accessible to anyone. For more details, check out our latest release.