22 September 2016


Unlock Behavioral Insight from MongoDB

Recently Tom (@neuralmarket) and I had the chance to work with Amanda Shiga (@AmandaShiga) from Nonlinear Digital to build a web analytics process using RapidMiner. Amanda has an ongoing pilot project that applies data mining techniques to clickstream and user behavior data collected from her client’s website. The website has a number of value-weighted micro-conversions, such as signing up for a newsletter, downloading a whitepaper, or registering for an event. For online retailers, seeing visitors convert to paying customers is the ultimate goal. The focus of web analytics has shifted from getting visitors to a website to turning those visitors into high-value customers. With advanced data collection methods and machine learning approaches, RapidMiner can help website owners measure the success of their online business goals.

Environment: RapidMiner 7.2, MongoDB

Collecting the Data from MongoDB

Amanda’s web analytics project presented us with a classic “big data” problem. The granular details of the visitors’ activities are originally stored in a NoSQL database. MongoDB captured many different attributes of user behavior, so the downstream analytics ETL process needs to handle a wide variety of data structures. The data was coming in at high velocity from thousands of concurrent users, so we had to deal with the large volumes of data generated by tracking user activity: how content is consumed and which interactions occur on the website.

Snapshot of the raw data in MongoDB:

As you can see from the Robomongo interface, features such as the visitor’s browser type, referring site (Google, Bing, etc.), clickstream, and viewed pages are all recorded in an unstructured format. We first have to set up a connection between the MongoDB instance and RapidMiner Studio and then massage the data into a well-defined structure. Using ‘Manage Connections’ from the menu bar of RapidMiner Studio, users can add a new connection to their own MongoDB.


Among the many cloud and database connectors, the ‘Read MongoDB’ operator can be used to pull raw data from a predefined MongoDB instance into RapidMiner Studio. Just make sure the free NoSQL Connectors extension is downloaded and installed from the RapidMiner Marketplace.

Ok. After the connections and preparations are all set, let’s load the data from MongoDB, and apply ‘Loop Collection’ to the retrieved collection of documents.

The output from MongoDB will be a collection of JSON documents (example of JSON document). Inside the loop, we can add a ‘JSON to Data’ converter right after the JSON document input to flatten each JSON document and transform it into one structured example set.
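
To make the flattening step concrete, here is a minimal Python sketch of the idea. It is not the ‘JSON to Data’ operator itself; the sample document, its field names, and the dotted column naming (such as pages.0.url) are assumptions made for illustration.

```python
import json


def flatten(doc, parent_key="", sep="."):
    """Flatten nested dicts and lists into one flat dict of column -> value."""
    if isinstance(doc, dict):
        items = doc.items()
    elif isinstance(doc, list):
        # List positions become part of the column name (assumed naming scheme).
        items = ((str(i), v) for i, v in enumerate(doc))
    else:
        return {parent_key: doc}

    flat = {}
    for key, value in items:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Hypothetical visitor document; the real documents have many more fields.
raw = """
{
  "visitor": {"browser": "Chrome", "referrer": "google"},
  "pages": [
    {"url": "/home", "seconds": 12},
    {"url": "/whitepaper", "seconds": 45}
  ]
}
"""

row = flatten(json.loads(raw))
for column, value in row.items():
    print(column, "=", value)
```

Each document becomes one flat row, and a visitor with more page views simply produces more columns. That is why the resulting example sets do not all have the same width.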

If the JSON document contains content that cannot be easily transformed, for example data stored via GridFS as ‘’, we can use ‘Remove Document Parts’ to strip unexpected special characters from the document.

So far we have successfully loaded the visitors’ activities from MongoDB and transformed them into a nicely structured format.

But it is still early: the example sets generated by ‘JSON to Data’ are not ready for appending or modelling, because they do not share the same format. We need some data preprocessing to standardize the visitors’ profiles.

Data Preprocessing

Recall that in the last two snapshots, the number of attributes in the two visitors’ profiles is not the same (72 vs 81). Some visitors may visit more web pages, and there could be more child information for each detailed page view. It is also possible that some visitors have a valid ‘Campaign ID’ while others have no such attribute at all. How do we add the columns back if important features are missing from the original visitor profile? Here we used a trick: we performed a ‘Transpose’ of the data and created a set of flags indicating whether each target attribute was found. We then used a set of macros to pass the indicator values downstream. Here is a quick view of the sub-process built to check whether a high-value attribute is available.

The next step in the data preprocessing is to add the important attributes where they are missing. We use several branches in the following sub-process. Inside each ‘Branch’ we simply apply ‘Generate Attributes’ if the macro value indicates that the target attribute is missing.
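
The check-then-add logic can be sketched in a few lines of Python. This is only an analogy for the Transpose, macro, and Branch steps, not a reproduction of the RapidMiner process; the attribute names in REQUIRED and the sample profiles are hypothetical.

```python
# Hypothetical list of high-value attributes every profile must carry.
REQUIRED = ["campaign_id", "referrer"]


def standardize(rows, required=REQUIRED, fill=None):
    """Give every row the same required columns, plus a found/missing flag."""
    standardized_rows = []
    for row in rows:
        fixed = dict(row)
        for name in required:
            present = name in row
            # The flag plays the role of the macro that records the check result.
            fixed[f"has_{name}"] = present
            if not present:
                # The branch: generate the attribute only when it is missing.
                fixed[name] = fill
        standardized_rows.append(fixed)
    return standardized_rows


profiles = [
    {"visitor_id": "a", "campaign_id": "c1", "referrer": "google"},
    {"visitor_id": "b", "referrer": "bing"},  # no campaign_id at all
]

standardized = standardize(profiles)
for profile in standardized:
    print(profile)
```

After this step every profile has the same set of columns, so the rows can be appended into one table. Filling with a missing value (None here) is an assumption; a default such as "none" may suit some models better.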

After that, we can match keywords against the URL path and extract how much time was spent on each page view. ‘De-Pivot’ is a suitable solution here, because the exact total number of page views for an individual visitor is unknown. We create one row for each clicked URL path and later match it against a list of high-value keywords.
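
As a rough illustration of the de-pivot and keyword match, the Python sketch below turns one wide visitor row into one row per page view. The column pattern (pages.N.url, pages.N.seconds), the keyword list, and the sample row are all assumptions and not taken from the original process.

```python
import re

# Hypothetical list of high-value keywords to look for in the URL path.
HIGH_VALUE_KEYWORDS = ["whitepaper", "newsletter", "event"]

# Assumed naming of the flattened page-view columns.
PAGE_COLUMN = re.compile(r"^pages\.(\d+)\.(url|seconds)$")


def depivot(visitor_id, wide_row):
    """Turn one wide row into one long row per page view."""
    pages = {}
    for column, value in wide_row.items():
        match = PAGE_COLUMN.match(column)
        if match:
            index, field = int(match.group(1)), match.group(2)
            pages.setdefault(index, {})[field] = value

    long_rows = []
    for index in sorted(pages):
        url = pages[index].get("url", "")
        long_rows.append({
            "visitor_id": visitor_id,
            "page_index": index,
            "url": url,
            "seconds": pages[index].get("seconds"),
            "high_value": any(k in url.lower() for k in HIGH_VALUE_KEYWORDS),
        })
    return long_rows


wide = {
    "visitor.browser": "Chrome",
    "pages.0.url": "/home",
    "pages.0.seconds": 12,
    "pages.1.url": "/resources/whitepaper-download",
    "pages.1.seconds": 45,
}

page_views = depivot("v1", wide)
for view in page_views:
    print(view)
```

Because the long format has a fixed set of columns, it does not matter whether a visitor viewed two pages or two hundred.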

Some data blending tricks to convert data types may be needed (e.g. converting numerical to binomial, parsing strings to numbers, or applying ‘Nominal to Numerical’ to dummy-code categorical variables). Be aware that in RapidMiner some supervised learning algorithms can handle both nominal and numerical variables, while others are only compatible with numerical variables. Always double-check the data types of all variables before fitting any model!
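
For readers who have not met dummy coding before, here is a small Python sketch of two of these conversions: parsing strings to numbers and dummy-coding a categorical column. The sample rows and the "column = value" naming of the generated columns are assumptions for illustration, not the operators' exact behavior.

```python
def to_number(text):
    """Parse a string to int or float; return None if it is not numeric."""
    try:
        return int(text)
    except ValueError:
        try:
            return float(text)
        except ValueError:
            return None


def dummy_code(rows, column):
    """Replace one categorical column with a 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    coded_rows = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column} = {category}"] = 1 if row[column] == category else 0
        coded_rows.append(new_row)
    return coded_rows


# Hypothetical rows where the time on page arrived as text.
visitors = [
    {"browser": "Chrome", "seconds": "12"},
    {"browser": "Firefox", "seconds": "45.5"},
    {"browser": "Chrome", "seconds": "7"},
]

parsed = [{**v, "seconds": to_number(v["seconds"])} for v in visitors]
coded = dummy_code(parsed, "browser")
for row in coded:
    print(row)
```

After these two steps every column is numeric, which is what the numerical-only learners require.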

Classification or Regression?

If you have a numerical target for prediction, e.g. the exact value of the profile, you will build regression models. If you want to predict a binomial target, e.g. a high- or low-value customer, or a converted or not-converted visitor, it becomes a binary classification problem.
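
The two kinds of target can be derived from the same data. The Python sketch below assumes, purely for illustration, a set of weights for the micro-conversions and a cut-off for "High" value; neither the weights nor the threshold come from the project.

```python
# Hypothetical weights for the value-weighted micro-conversions.
CONVERSION_WEIGHTS = {
    "newsletter_signup": 1.0,
    "whitepaper_download": 3.0,
    "event_registration": 5.0,
}


def profile_value(events):
    """Numerical target (regression): sum of the weights of a visitor's conversions."""
    return sum(CONVERSION_WEIGHTS.get(event, 0.0) for event in events)


def value_label(value, threshold=4.0):
    """Binomial target (classification): High or Low, using an assumed cut-off."""
    return "High" if value >= threshold else "Low"


events = ["newsletter_signup", "whitepaper_download"]
value = profile_value(events)
print(value, value_label(value))
```

Predicting `value` directly is a regression problem; predicting the output of `value_label` is a classification problem.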

The newly added operators in RapidMiner 7.2, Generalized Linear Model and Gradient Boosted Trees, are powerful: they can handle both numerical and nominal variables and can be used to build either regression or classification models. We built a quick-and-dirty GLM inside a 5-fold cross-validation within one minute in RapidMiner, and easily achieved more than 76% average accuracy and 0.85 AUC on the testing set.
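
For reference, this is how the two metrics are computed from a model's confidence scores. The Python sketch uses six made-up labels and scores, so its numbers are toy values and not the results of the project; in a cross-validation the metrics would be computed per fold and then averaged.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of examples whose thresholded score matches the true label."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise form is quadratic in the number
    of examples, which is fine for a sketch but slow on large data.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = converted, 0 = not converted; scores are confidences of conversion.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

print("accuracy:", round(accuracy(labels, scores), 3))
print("AUC:", round(auc(labels, scores), 3))
```

Accuracy depends on the chosen threshold, while AUC only depends on how well the scores rank converters above non-converters.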

Reincorporate the Prediction Back into Marketing

The scoring results prompt the question of how to direct marketing efforts at non-converted visitors who have a very high predicted confidence of conversion. Are there any leaks after the first clicks that stopped a visitor from signing up for a newsletter? Which search engines and search keywords drive the most traffic and create the most conversions? Online efforts such as email campaigns, banner ads, and personalized interfaces are optimization strategies to get more conversions and improve the online business. By understanding visitor behavior, website owners can have more than just a presence on the web. They can have a successful, growing business.

Download RapidMiner Studio

The free version of RapidMiner Studio 7.2 now includes functionality that was previously only available in the commercial version, for example connectors to commercial databases. All the features of the RapidMiner platform are now available to EVERYONE. Get started from the download page to unlock business value from your big data.

There have been some major advancements to the RapidMiner platform since this article was originally published. We’re on a mission to make machine learning more accessible to anyone. For more details, check out our latest release.
