07 September 2016

Hadoop Big Data Analytics – In Hadoop, In-Memory or Both?

Hadoop Big Data Analytics – How Big is Big?

Hadoop big data analytics has the power to change the world. Hadoop, the de facto platform for distributed big data storage and processing, also plays an important role in big data analytics. Organizations now realize the inherent value of transforming this big data into actionable insights. Data science is the highest form of big data analytics: it produces the most accurate actionable insights, identifying what will happen next and what to do about it.

The RapidMiner platform is an excellent solution for handling unstructured data like text files, web traffic logs, and even images. Here, we will discuss how the volume of big data can be easily handled as well, without writing a single line of code (unless you want to, of course).

Analytical Engines in RapidMiner

RapidMiner offers flexible approaches to remove any limitations on data set size. The most commonly used is the in-memory engine, where the data is loaded completely into memory and analyzed there. This and the other engines are outlined below.
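
To make the idea concrete, the sketch below shows what the in-memory pattern looks like in plain Python, with pandas and scikit-learn standing in for RapidMiner's engine (in RapidMiner itself this is all configured visually, with no code). The file name and column names are hypothetical placeholders.

```python
# Minimal sketch of the in-memory pattern: the entire data set is loaded
# into RAM first, and the model is trained against that in-memory copy.
# pandas and scikit-learn stand in for RapidMiner's engine here;
# "customers.csv" and the column names are hypothetical placeholders.
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Load the complete data set into memory; this is the step that fails
# once the data no longer fits into the machine's RAM.
data = pd.read_csv("customers.csv")
X = data.drop(columns=["label"])
y = data["label"]

# Train a Naive Bayes model entirely in memory.
model = GaussianNB().fit(X, y)
print(model.predict(X.head()))
```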

Runtime Comparison of Analytical Engines within RapidMiner

Below is a runtime comparison for the creation of a Naïve Bayes model with the different engines:

[Figure: runtime comparison of Naïve Bayes model creation with the in-memory and in-Hadoop engines across increasing data set sizes]

It can easily be seen that RapidMiner's default in-memory engine is the fastest approach in general, but it will of course fail as soon as the data set size hits the memory limit of the machine. Still, training a model on millions of data points takes only seconds or minutes with the in-memory approach. On decent hardware, RapidMiner recommends this fast engine as the default for data set sizes of up to 100 million data points.
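
To get a rough feel for that claim, the sketch below times an in-memory Naïve Bayes fit on synthetic data with scikit-learn. It is an illustrative analogue rather than RapidMiner's actual engine, the data shape is arbitrary, and absolute numbers will vary with hardware.

```python
# Rough illustration of in-memory training speed on millions of data points.
# Synthetic data and scikit-learn's GaussianNB stand in for the real
# benchmark; absolute timings depend heavily on the machine.
import time
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(42)
n_rows, n_cols = 2_000_000, 10          # ~20 million data points
X = rng.normal(size=(n_rows, n_cols))
y = rng.integers(0, 2, size=n_rows)

start = time.perf_counter()
GaussianNB().fit(X, y)
print(f"Trained on {n_rows * n_cols:,} data points "
      f"in {time.perf_counter() - start:.1f}s")
```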

The in-Hadoop engine, which ran on a cluster of only three nodes in the experiments above, is prohibitively slow on small data sets but scales nicely to very large ones. Its runtimes can be further improved by adding more computation nodes to the Hadoop cluster. Since the overhead is so large for data sets of common sizes, Hadoop is recommended as the underlying engine only for data sets of 500 million data points or more, and only where runtime is an important issue at the same time.
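
For contrast, the sketch below shows the in-Hadoop pattern of shipping the computation to the data, using PySpark's MLlib (Spark 3.0+) as one common analogue of a distributed engine running on a Hadoop cluster. The HDFS path and column names are assumptions for illustration, and RapidMiner users would set this up visually rather than in code.

```python
# Sketch of the in-Hadoop idea: the computation is shipped to the cluster
# and runs next to the data in HDFS. PySpark's MLlib is used here as an
# analogue of RapidMiner's in-Hadoop engine; the HDFS path, column names,
# and cluster setup are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import NaiveBayes

spark = SparkSession.builder.appName("nb-on-hadoop").getOrCreate()

# Read directly from HDFS; the data never has to fit on a single machine.
df = spark.read.csv("hdfs:///data/customers.csv",
                    header=True, inferSchema=True)

# Assemble the feature columns into the vector column MLlib expects.
features = [c for c in df.columns if c != "label"]
assembled = VectorAssembler(inputCols=features,
                            outputCol="features").transform(df)

# Train Naive Bayes as a distributed job across the cluster's nodes
# ("gaussian" model type requires Spark 3.0 or later).
model = NaiveBayes(labelCol="label", featuresCol="features",
                   modelType="gaussian").fit(assembled)
```

Because the job runs as distributed tasks across the cluster's nodes, adding nodes shortens runtimes on large data, which is exactly the scaling behavior described above.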

Conclusion

Hadoop is not just an effective distributed storage system for large amounts of data; importantly, it is also a distributed computing environment that can execute analyses where the data lives. RapidMiner makes use of everything Hadoop offers by allowing users to run distributed advanced analytics directly on data stored in Hadoop.

Looking at the runtimes for analytical algorithms, it is clear that limitations on data set size have effectively vanished, but at the price of longer runtimes. Those runtimes are prohibitive for interactive reports, and often also for predictive analytics when models must be created quickly or in real time. In those cases, the in-memory engine is still the fastest option. The in-Hadoop engine is slow for smaller data sets but is the fastest, and sometimes the only, option when data sets are truly big in terms of volume.

Depending on the application at hand, one engine will be superior to the others, which is why RapidMiner supports both in-memory and in-Hadoop engines: it gives users the flexibility to solve all their analytical problems. The goal is that RapidMiner users can always select the best engine for their specific application and get optimal results in minimal time.

There have been some major advancements to the RapidMiner platform since this article was originally published. We're on a mission to make machine learning accessible to everyone. For more details, check out our latest release.
