11 October 2016

Data Science Map For Your Predictive Analytics Journey

Exploring a new discipline is always a difficult task – that’s why we are proud to provide you with a Data Science Map to help you with this journey. The Map segments common machine learning techniques into logical regions and helps you chart a course through the complex ecosystem.

As you can see at first glance, the most prominent feature on the map is the Sea of Data Preparation. This sea is full of shoals, and the waters around the islands are dangerous. To navigate them properly, you will need ETL operations such as joining and appending tables, removing duplicates and creating new columns, to name but a few. With these tools you can make short work of the tedious sea of data prep and spend your time on more critical endeavors.
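To make the four operations named above concrete, here is a minimal sketch in plain Python. The tables, column names and helper functions are hypothetical and exist only for illustration; in practice a data prep tool or a dataframe library would do this work for you.

```python
# Hypothetical toy tables; all names and values are made up for illustration.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Ben"},
]
orders_jan = [
    {"customer_id": 1, "amount": 20.0, "quantity": 2},
    {"customer_id": 2, "amount": 15.0, "quantity": 3},
]
orders_feb = [
    {"customer_id": 2, "amount": 15.0, "quantity": 3},  # exact copy of a January row
    {"customer_id": 1, "amount": 30.0, "quantity": 1},
]

def append_tables(*tables):
    """Append: stack tables that share the same columns."""
    return [dict(row) for table in tables for row in table]

def remove_duplicates(rows):
    """Keep only the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def join(left, right, key):
    """Inner join: pair each left row with the right rows sharing the key."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def add_column(rows, name, func):
    """Create a new column computed from the existing ones."""
    return [{**row, name: func(row)} for row in rows]

orders = remove_duplicates(append_tables(orders_jan, orders_feb))
prepared = add_column(
    join(orders, customers, "customer_id"),
    "unit_price",
    lambda row: row["amount"] / row["quantity"],
)

for row in prepared:
    print(row)
```

Whether identical rows really are duplicates depends on your data, so treat the de-duplication step as an assumption to check rather than a rule.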

As you continue your journey, one of the largest land masses that you will encounter is Supervised Learning. Supervised Learning is the category of algorithms where you have a label (or target variable) as ground truth and want to find the patterns that predict what is likely to happen in the future. This ground truth usually comes either from historical data or from experts.

This land is divided into two territories. The first is the Classification Territory, located up in the high North West. Here you will find the “tree-based” algorithms, namely the Decision Tree itself, Random Forest and Gradient Boosted Trees. These algorithms extract general rules from your data set to predict “classes” such as Churn/Loyal or Fraud/No Fraud.
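As a rough illustration of how a tree-based algorithm extracts a rule, the sketch below fits a decision stump, that is, a decision tree with a single split, by minimizing Gini impurity. It is a simplified stand-in rather than a full Decision Tree or Random Forest, and the customer data and feature meanings are invented for the example.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def majority(labels):
    """Most frequent label; ties are broken alphabetically."""
    return max(sorted(set(labels)), key=labels.count)

def fit_stump(rows, labels):
    """Find the (feature, threshold) split with the lowest weighted Gini impurity."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must put data on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold, majority(left), majority(right))
    return best[1:]

def predict(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

# Hypothetical features: [support_calls_last_month, months_as_customer]
rows = [[0, 24], [1, 36], [5, 3], [4, 6], [0, 12], [6, 2]]
labels = ["Loyal", "Loyal", "Churn", "Churn", "Loyal", "Churn"]

stump = fit_stump(rows, labels)
feature, threshold, left_label, right_label = stump
print(f"if feature[{feature}] <= {threshold}: {left_label}, else: {right_label}")
print(predict(stump, [3, 10]))
```

A full decision tree repeats this search inside each resulting branch, and ensembles such as Random Forest combine many such trees.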

Moving to the South, you will find the second territory, the Regression Territory. Here you will discover the area of General Linear Models, including “good old Linear Regression” as well as its neighbors Ridge, Lasso and Polynomial Regression. These also learn from a ground truth, but they predict continuous numeric values, such as forecasts or customer lifetime values, rather than classes.
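For a feel of what Linear Regression and its neighbor Ridge actually compute, here is a minimal one-feature sketch using the closed-form least-squares solution. The `ridge_penalty` parameter and the lifetime-value numbers are assumptions made for this example; the penalty shrinks the slope and leaves the intercept unpenalized.

```python
from statistics import mean

def fit_line(xs, ys, ridge_penalty=0.0):
    """Fit y = slope * x + intercept by least squares.

    With ridge_penalty > 0 the slope is shrunk toward zero (ridge regression
    on a single feature); with 0.0 this is ordinary linear regression.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + ridge_penalty)
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: months as a customer -> customer lifetime value
months = [1, 2, 3, 4, 5]
value = [30.0, 50.0, 70.0, 90.0, 110.0]

slope, intercept = fit_line(months, value)
ridge_slope, ridge_intercept = fit_line(months, value, ridge_penalty=10.0)

print(f"linear: value = {slope:.1f} * months + {intercept:.1f}")
print(f"ridge:  value = {ridge_slope:.1f} * months + {ridge_intercept:.1f}")
print(f"forecast for month 6: {slope * 6 + intercept:.1f}")
```

The output is a floating point number rather than a class label, which is the practical difference from the Classification Territory.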

Navigating towards the East, you will come across the confederation of Unsupervised Learning. Unsupervised Learning features algorithms that do not need a ground truth in the data. Close to the coastline you will come upon the first of the two Unsupervised Learning dominions, the Clustering Archipelago, with its various algorithms to group and segment your data points. Moving further south you can identify the Peninsula of Ensemble Methods and, right next to it, the Peninsula of Frequent Item Sets. Frequent item set algorithms extract items that are bought together, so you can predict what to cross-sell. Other common algorithms are scattered around the Central Plains, where the Neural Net guards the entrance to the temple of Deep Learning.
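The idea behind frequent item sets can be sketched by counting how often pairs of items appear in the same basket. This is a deliberately small illustration limited to pairs, not a full algorithm such as Apriori or FP-Growth, and the baskets and the `min_support` threshold are invented for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs whose support (share of baskets containing both) is at least min_support."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) ignores repeated items and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in counts.items() if count / n >= min_support}

# Hypothetical shopping baskets
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

pairs = frequent_pairs(baskets, min_support=0.5)
for (item_a, item_b), support in pairs.items():
    print(f"{item_a} + {item_b}: bought together in {support:.0%} of baskets")
```

A pair with high support is a candidate for cross-selling: a customer who buys one item may be offered the other.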

Moving even further Southward, your journey will take you to the distant shores of Outlier Island. If you successfully navigate to this isolated island, you will be rewarded with the secret to identifying outliers: data points that appear to be inconsistent with the rest of your data set. Spotting them is particularly useful for preventing model misspecification, biased parameter estimation and incorrect results.
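One common and simple way to flag such points is the interquartile range (IQR) rule, sketched below. It is only one of many outlier detection techniques, and the order values and the conventional factor of 1.5 are assumptions chosen for the example.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Flag values lying more than factor * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical order values with one suspicious entry
order_values = [10, 12, 11, 13, 12, 11, 10, 95]

outliers = iqr_outliers(order_values)
print(outliers)
```

Whether a flagged point is an error or a genuinely interesting case is a judgment the rule cannot make for you.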

And finally, to the South West you will discover the Wilderness of Feature Selection. Luckily, our map provides you with some guidance as you traverse this untamed land. Feature Selection algorithms are usually “helper” algorithms that determine which columns are important for predicting something. Typically, they are used in supervised analysis, but they can also be used to analyze your segmentations. It is important to note that feature selection algorithms can also be core algorithms, because they answer the question: which characteristics make my customers do that?
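A very simple way to see which columns matter is to rank them by the strength of their correlation with the target. The sketch below uses absolute Pearson correlation as the score; this is just one basic approach among many, and the columns and churn labels are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def rank_features(columns, target):
    """Sort column names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical customer columns and a churn label (1 = churned)
columns = {
    "support_calls": [0, 1, 5, 4, 0, 6],
    "months_as_customer": [24, 36, 3, 6, 12, 2],
    "shoe_size": [42, 38, 40, 41, 39, 40],
}
churned = [0, 0, 1, 1, 0, 1]

ranking = rank_features(columns, churned)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Correlation only captures linear, one-column-at-a-time relationships, so a low score here does not prove a column is useless.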

We hope that you have found our Data Science Map helpful and that, armed with this knowledge, you are ready to conquer any predictive analytics project that comes your way.

Want to start seeing the benefits of data science first-hand? Request a demo to get started with RapidMiner today!