Machine Learning on Hadoop

Hadoop offers great promise to organizations looking to gain a competitive advantage from data science. It lets them collect massive amounts of data that can later be mined for insights of real business value, in use cases such as fraud detection, sentiment analysis, risk assessment, predictive maintenance, churn analysis and user segmentation. But deploying Hadoop can be extraordinarily complex and time-consuming, which makes those insights difficult to reach.

Hadoop is a collection of technologies and open source projects that form an ecosystem for storage and processing, and working with it requires a host of specialized IT and analytics skills. Integrating these technologies is often so involved that organizations spend their time on the architecture instead of on generating business value. Data scientists, in turn, spend most of their time learning the many skills required to extract value from the Hadoop stack instead of doing actual data science.

Hadoop is a set of loosely coupled software packages such as HDFS (distributed file system), Hive (SQL-based data access and manipulation), Spark (distributed data processing), YARN (resource management and job scheduling), and others. All major Hadoop distributions integrate more than 20 packages like these, and dozens more are available. Simply keeping track of them is a huge effort, and understanding how they fit together takes time and specialized knowledge.

As a consequence, mastering analytics on Hadoop through programming not only requires a very broad yet specialized skill set, but also means repeatedly solving tasks that the technology demands rather than tasks that belong to the actual analytics initiative. For these reasons, developing predictive analytics on Hadoop can be a complex and costly endeavor.
