Over the summer the Data Science teams at RapidMiner were hard at work updating several of our extensions. We are proud to announce that 6 new operators have been added across the Operator Toolbox, Smile, and Converters extensions. Here’s a quick overview of these extensions and what’s new.
The Operator Toolbox extension adds a variety of operators to RapidMiner. They range from utility operators that improve the flexibility and usability of process design, additional outlier detection algorithms, and additional performance criteria, to advanced analysis methods like Local Interpretation or the SMOTE algorithm.
What’s new with Operator Toolbox
Understand your Models with Partial Dependency Plots (PDP)
When building predictive models, you often want to better understand the dependency the model is exploiting. RapidMiner already provides a lot of features for this, most importantly ‘Explain Prediction’ and various feature weight methods. In version 2.2 of the ‘Operator Toolbox’ extension we provide you with yet another way to understand your models: Partial Dependency Plots!
Partial dependency plots are a univariate method for understanding the dependency between your model’s prediction and a numerical input attribute. The fundamental idea is to score the same examples many times, each time with a different value for the attribute of interest, and then take the average prediction value or confidence to see the impact.
The algorithm to generate PDP data is the following:
- Take a value x between the minimum and maximum of attribute k.
- Set the attribute value of all examples in the ExampleSet to this value.
- Score the ExampleSet and calculate the average response. The response is either the predicted regression value or the average confidence of the positive class.
We repeat this for various values between the minimum and maximum and for all chosen attributes.
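The loop described above can be sketched in a few lines of Python. This is an illustrative stand-in, not the operator’s implementation: a plain callable `model` and dicts take the place of a RapidMiner model and ExampleSet.

```python
import statistics

def partial_dependence(examples, attribute, model, grid_points=5):
    """Sketch of the PDP algorithm: for each grid value x between the
    attribute's minimum and maximum, set the attribute to x for every
    example, score the modified examples, and record the average response."""
    observed = [ex[attribute] for ex in examples]
    lo, hi = min(observed), max(observed)
    step = (hi - lo) / (grid_points - 1)
    curve = []
    for i in range(grid_points):
        x = lo + i * step
        # Overwrite the attribute with the grid value for every example
        modified = [{**ex, attribute: x} for ex in examples]
        avg_response = statistics.mean(model(ex) for ex in modified)
        curve.append((x, avg_response))
    return curve

# Toy model whose response grows linearly with a hypothetical 'fare' attribute:
curve = partial_dependence(
    [{"fare": 0}, {"fare": 100}], "fare",
    model=lambda ex: ex["fare"] / 100, grid_points=3)
# curve: [(0.0, 0.0), (50.0, 0.5), (100.0, 1.0)]
```

With a real model the curve is of course not a straight line; plotting these (value, average response) pairs gives exactly the kind of chart shown below.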
Here is an example of a PDP. We first built a deep learning model on the Titanic data set and then used the new ‘Generate Partial Dependency Plot Data’ operator to generate PDP data for the ‘Passenger Fare’ attribute.
We can observe that the model found a pattern: the likelihood of survival generally increases with the passenger fare, but the model also ‘saw’ an elevated likelihood of survival around a fare of 100.
One of the tutorial processes in the ‘Generate Partial Dependency Plot Data’ operator shows how to generate this chart.
Understand your Linear Model Contributions
Linear models are often used because they are relatively easy to understand. Linear models fit an equation of the following form:
a*x1 + b*x2 + c*x3 …
You may want to know how much each of these summands contributed to the overall prediction for each example. This is now easy with the ‘GLM Contribution’ operator in the ‘Operator Toolbox’ extension: it creates one attribute per summand, containing the product of the coefficient and the example’s attribute value.
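In essence the operator multiplies each coefficient by the corresponding attribute value. A minimal Python sketch of that idea (the `glm_contributions` helper and the coefficient values are illustrative, not the operator’s API):

```python
def glm_contributions(coefficients, example):
    """Return one contribution per summand: coefficient * attribute value."""
    return {name: coef * example[name] for name, coef in coefficients.items()}

# Hypothetical coefficients of a fitted linear model and one month of spend:
coefficients = {"TV": 2.0, "Online": 3.0, "Print": 0.5, "Cinema": 0.1}
spend = {"TV": 10, "Online": 5, "Print": 4, "Cinema": 2}
contributions = glm_contributions(coefficients, spend)
# contributions: {'TV': 20.0, 'Online': 15.0, 'Print': 2.0, 'Cinema': 0.2}
```

Summing the contributions (plus the model’s intercept) recovers the prediction, which is what makes the per-channel shares easy to visualize.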
A common application for this is marketing analytics: every month we invest a certain amount of money into our 4 marketing channels: TV, Online, Print and Cinema. Each month we also measure our demand. The data may look like this:
We now fit a linear model which predicts the demand from the investments into our marketing channels. The ‘GLM Contribution’ operator can show us the individual contributions to the overall prediction.
This can easily be visualized to show the share per month.
Check out the tutorial process in this new operator to see how to do this analysis yourself!
Easier handling of objects: Sample Collections & Extract Tokens!
We also added a few operators to make your life easier while working with RapidMiner objects.
The first operator is ‘Sample (Collection)’. This operator allows you to draw a random subset of items from a given collection. This can be very useful if you want to speed up your development process by working on a sample of your collection. The operator supports the usual sampling methods ‘Linear’, ‘Shuffled’ and ‘Bootstrap’ and is also part of the ‘Operator Toolbox’ extension.
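A rough Python sketch of what the three sampling methods do (the `sample_collection` helper is illustrative; in RapidMiner you pick the method via an operator parameter):

```python
import random

def sample_collection(items, k, method="shuffled", seed=None):
    """Draw k items from a collection.

    'linear'    - every n-th item, preserving order
    'shuffled'  - random subset without replacement
    'bootstrap' - random sample with replacement
    """
    rng = random.Random(seed)
    if method == "linear":
        step = max(len(items) // k, 1)
        return items[::step][:k]
    if method == "shuffled":
        return rng.sample(items, k)
    if method == "bootstrap":
        return [rng.choice(items) for _ in range(k)]
    raise ValueError(f"unknown sampling method: {method}")

sample = sample_collection(list(range(100)), 10, method="linear")
# sample: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]
```

Working on a sample like this while designing a process, then switching back to the full collection for the final run, is the speed-up mentioned above.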
Another utility we added is ‘Extract Tokens’. This operator is part of the ‘Converters’ extension, which contains a number of operators designed for converting different IOObjects into other useful representations. For example, there is a ‘PCA to ExampleSet’ operator which takes a PCA model and converts it into an ExampleSet that can be processed further or written to a file. The operators are grouped by the type of object to be converted.
When using ‘Tokenize’ on a Document object you end up with a segmented document, which might look like this:
Previously, there was no way to access the individual tokens in this Document with operators. This is now solved by the new ‘Extract Tokens’ operator. It returns the tokens either as an ExampleSet, where each example corresponds to one token, or as a collection of documents, where each item is a single token.
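The idea can be sketched in Python (the `extract_tokens` helper and the whitespace tokenization are simplifying assumptions; the real operator works on RapidMiner Document objects produced by ‘Tokenize’):

```python
def extract_tokens(document, as_collection=False):
    """Return tokens either as rows of a table (list of dicts, one token
    per 'example') or as a collection of one-token documents."""
    tokens = document.split()  # stand-in for a real tokenizer
    if as_collection:
        return tokens  # each item plays the role of a one-token document
    return [{"token": t} for t in tokens]

rows = extract_tokens("the quick brown fox")
# rows: [{'token': 'the'}, {'token': 'quick'}, {'token': 'brown'}, {'token': 'fox'}]
```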
Smile is a fast and comprehensive machine learning engine. Its focus areas are speed, ease of use, comprehensiveness, natural language processing, and mathematics and statistics. This extension wraps functionality from the Smile library (http://haifengl.github.io/smile/) and provides it as operators.
What’s New with Smile
Understand your Data with Hypothesis Tests
A common question to ask about your data sets may be: What is the likelihood that the age attribute in my training data set is compatible with the age attribute in my testing data set? This question can be answered with the new ‘Compare Distributions’ operator, which is part of the Smile extension.
The operator takes two ExampleSets and compares the attributes you selected. The result is a weight vector with the probability that the two attributes are drawn from the same underlying distribution.
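One standard way to compare two samples like this is the two-sample Kolmogorov-Smirnov test, whose statistic is the largest gap between the two empirical distribution functions. Whether ‘Compare Distributions’ uses exactly this test is an assumption on our part; the sketch below only computes the statistic, not the resulting probability:

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    distance between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = sum(1 for x in a if x <= v) / len(a)
        cdf_b = sum(1 for x in b if x <= v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Identical samples are maximally compatible, disjoint ones maximally different:
# ks_statistic([1, 2, 3], [1, 2, 3])    -> 0.0
# ks_statistic([1, 2, 3], [10, 20, 30]) -> 1.0
```

A small statistic means the two attributes could plausibly come from the same underlying distribution; turning it into a probability requires the corresponding significance calculation, which the operator does for you.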
One major use case for this is concept drift detection. In many cases your data changes over time; this can, for example, be caused by wear on your machines. In a deployment scenario you want to check whether new data is still compatible with the data the model was trained on, and this new operator gives you an additional option to do that.
More Machine Learning options – Random Forest and GBT from Smile for Regression
Many of you may know the ‘no free lunch’ theorem, which implies that no single algorithm works best on every problem: you always need to try several models to find the best one for your data. That’s why it is important to have a wide range of algorithms, and different implementations of them, available.
We’ve added two new machine learning algorithms to the ‘Smile’ extension: Gradient Boosted Trees and Random Forests. Both are implementations from the Smile library, which is very fast and competes well with our existing implementations, so you now have the choice between two different implementations of these popular algorithms. Note that both algorithms currently only support regression problems; classification versions will follow in a future release.
Take a look at our extension library on the RapidMiner Marketplace. Please keep in mind that these extensions are not officially supported and we, as a team, may sometimes make changes that are not backwards compatible!