Taming the complexity of Hadoop Analytics: RapidMiner Radoop makes it easy to do hard things
Predictive Analytics in Hadoop: big AND complex
Anyone who has delved into Hadoop knows how hard it is to handle everything that surrounds the data and the analytics themselves. Big Data environments entail a huge number of configuration options that each client has to be aware of (High Availability, HDFS encryption, Kerberos configuration, etc.). There are also enterprise firewalls to deal with, plus the difficulty of reaching DataNodes that may sit in different data centers whose locations you may not even know. Just thinking about it can give you a headache!
RapidMiner is here to help! Let’s examine this latter issue first.
Getting into the cluster
In general, a Hadoop cluster is a closed environment that administrators will only reluctantly allow users to connect to through a very limited set of ports and access points.
Figure 1 – Security considerations constrain access to the nodes
However, to get value from that data, a comprehensive Big Data Analytics tool like RapidMiner Radoop needs access to most of the services running on all the Hadoop nodes, potentially from any user's laptop in the company. How can we achieve this without compromising the environment’s security?
RapidMiner Radoop’s new proxy connect
In RapidMiner 7.3, the RapidMiner Radoop Proxy solves the problem of navigating complex Hadoop infrastructures: you install RapidMiner Server as another component within the Hadoop cluster and configure it as a proxy. All communication between RapidMiner Studio or RapidMiner Radoop and any Hadoop component then goes through a single machine (RapidMiner Server) and a single port. Even the JDBC connection to Hive can use this proxy.
Figure 2 – The RapidMiner Radoop Proxy as a single entry point
To adhere to industry-accepted security standards, the RapidMiner Radoop Proxy can, of course, be configured to use SSL.
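To see why a single entry point matters for firewall configuration, here is a small conceptual model in Python. It is only a sketch and not RapidMiner code: the host names and the proxy address are made up, and the service ports are merely common defaults on Hadoop 2.x clusters. It counts how many distinct host and port pairs a laptop would need to reach with and without a proxy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


# Hypothetical cluster services. Host names are invented for illustration;
# ports are common Hadoop 2.x defaults and may differ on a real cluster.
DIRECT_ENDPOINTS = {
    "hdfs-namenode": Endpoint("nn1.cluster.internal", 8020),
    "yarn-resourcemanager": Endpoint("rm1.cluster.internal", 8032),
    "hiveserver2": Endpoint("hive1.cluster.internal", 10000),
    "hdfs-datanode-1": Endpoint("dn1.cluster.internal", 50010),
    "hdfs-datanode-2": Endpoint("dn2.cluster.internal", 50010),
}

# Hypothetical proxy address: both host and port are assumptions.
PROXY = Endpoint("rmserver.example.com", 8443)


def firewall_openings(endpoints):
    """Return the distinct (host, port) pairs a client must be able to reach."""
    return {(e.host, e.port) for e in endpoints}


def route_via_proxy(services, proxy):
    """Map every service name to the same proxy endpoint."""
    return {name: proxy for name in services}


direct = firewall_openings(DIRECT_ENDPOINTS.values())
proxied = firewall_openings(route_via_proxy(DIRECT_ENDPOINTS, PROXY).values())
print(len(direct), "openings without a proxy,", len(proxied), "with one")
```

In this toy model, every DataNode added to the cluster adds another opening in the direct case, while the proxied case stays at one.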
By selecting your newly configured RapidMiner Radoop Proxy, you can work from RapidMiner Studio’s visual design interface exactly as if your laptop were in the middle of the Hadoop cluster, which makes accessing and leveraging your data fast and easy.
Figure 3 – Easy proxy configuration
Let’s now turn our attention to the second issue: what if you have dozens of variables (many of them difficult to interpret) that need to be set in your client?
Let’s ask the experts
The most widespread Hadoop distributions include industry-leading tools such as Cloudera Manager or Apache Ambari, which provide ways to easily configure and monitor clusters. So why not leverage their preconfigured connections to make your life easier?
Figure 4 – Cloudera Manager and Apache Ambari: the main Hadoop admin consoles
In RapidMiner Radoop 7.3, we have made it even easier to create connections to Hadoop clusters. You can now quickly import a connection from Cloudera Manager or Ambari by providing just the Hadoop manager’s URL, your user name, and your password.
Figure 5 – Retrieving the configuration is as simple as this
If your environment consists of several clusters, you will be able to select which one you want to connect to.
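For readers curious about what such an import involves, both managers expose a REST API that lists the clusters they oversee. The Python sketch below is an illustration under stated assumptions, not a description of how RapidMiner Radoop does it internally: the manager URL and credentials are placeholders, the `/api/v1/clusters` path assumes API version 1 (Cloudera Manager's supported version depends on its release), and the sample payloads are hand-written to mirror the general shape of each manager's response.

```python
import base64
import urllib.request


def build_cluster_request(manager_url, user, password):
    """Build (but do not send) a request that lists the manager's clusters."""
    base = manager_url.rstrip("/")
    # Assumption: API version 1; adjust the prefix for your manager release.
    url = base + "/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def cluster_names(payload, manager):
    """Extract cluster names from a decoded JSON response."""
    items = payload.get("items", [])
    if manager == "ambari":
        # Ambari nests cluster attributes under a "Clusters" key.
        return [item["Clusters"]["cluster_name"] for item in items]
    # Cloudera Manager puts the name directly on each item.
    return [item["name"] for item in items]


# Hand-written sample payloads (not captured from a real cluster).
AMBARI_SAMPLE = {"items": [{"Clusters": {"cluster_name": "prod"}},
                           {"Clusters": {"cluster_name": "dev"}}]}
CLOUDERA_SAMPLE = {"items": [{"name": "Cluster 1"}]}

request = build_cluster_request("http://manager.example.com:8080/", "admin", "secret")
print(request.full_url)
print(cluster_names(AMBARI_SAMPLE, "ambari"))
print(cluster_names(CLOUDERA_SAMPLE, "cloudera"))
```

Against a real manager, you would pass the request to `urllib.request.urlopen` and decode the body with `json.load` before calling `cluster_names`. Use HTTPS there, since basic authentication sends the credentials in an easily decoded form.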
RapidMiner Radoop then retrieves all the configuration variables it needs. If the environment is configured with High Availability, HDFS encryption, Kerberos, or any other advanced option, RapidMiner Radoop will detect it and automatically fill in all the details for you. In most cases, you will only need to update your user information or the particular Hive database you want to use as the default.
That is MUCH easier than configuring all of this by hand!
Figure 6 – Some of the variables in our demo environment
Short & Sweet
You already knew that RapidMiner Radoop was a powerful tool for simplifying Big Data Hadoop Analytics – NOW you know that configuring it just got even EASIER!