Those of you who visited RCOMM 2011 already know about Radoop, the powerful combination of RapidMiner with Hadoop. This makes big data analytics easier than ever. I missed the talk myself (shame on me!) but we had a lot of fruitful discussions afterwards, and from my point of view this will become the next RapidMiner revolution. Below you will find some information about the project.
What is Hadoop?
Hadoop is a software framework that supports data-intensive distributed applications. It is based on Google's now well-known map & reduce paradigm, which makes it an excellent tool for analyzing large data sets. In principle, Hadoop is able to work with thousands of computing nodes on petabytes of data.
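To make the map & reduce idea a bit more concrete, here is a minimal single-machine sketch in Python that counts words. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this illustration. On a real cluster the map and reduce steps would run in parallel on many nodes, and the framework would take care of the grouping step in between.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key (a framework would do this between the phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread over many machines, which is what lets this style of computation scale to very large data sets.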
What about Hive and Mahout?
Hive is a data warehouse infrastructure built on top of Hadoop, i.e. it uses Hadoop's distributed file system and its efficient data access technologies. Hive was initially developed by Facebook and is now used and developed by many other companies for their distributed data warehouses.
Mahout is a machine learning library that already offers many scalable machine learning algorithms, likewise implemented on top of Hadoop and its map & reduce paradigm. Hence, Mahout is one of the first distributed data analytics frameworks to make use of the power of Hadoop.
You will see below that both frameworks will be tightly integrated with RapidMiner.
What can RapidMiner bring into the game?
Hadoop is great for large-scale analytics, but it lacks an easy-to-use graphical interface. RapidMiner is an excellent tool for data analytics, but unless the analyst performs some nasty tricks, the data size is limited by the available memory. So on the RapidMiner side we have the algorithms, the support for analytical process design, the user interface, and of course the community with a demand for large-scale analytics.
RapidMiner + Hadoop = Radoop
Radoop combines the strengths of RapidMiner and Hadoop. The result is a RapidMiner extension for editing and running ETL, data analytics and machine learning processes over Hadoop. The developers have closely integrated the highly optimized data analytics capabilities of Hive and Mahout, and the user-friendly interface of RapidMiner to form a powerful and easy-to-use data analytics solution for Hadoop.
Here is the presentation that Zoltán Prekopcsák gave at RCOMM 2011: