Tips and Tricks: Using the Execute Script Operator
RapidMiner Studio has a fantastic hidden gem for users who want to “hack” the platform. If you’re familiar with the Groovy scripting language, the Execute Script operator will quickly become a favorite.
What is Groovy? It’s a dynamic scripting language built on Java. If you’re familiar with Ruby, Python, or Perl, you’ll quickly grasp Groovy. The neat thing about this language is that most Java syntax is also valid Groovy, it compiles to bytecode that runs on the Java Virtual Machine (JVM), and it lets you use existing Java libraries. Since RapidMiner Studio is built on Java, using the Execute Script operator with some Groovy script lets you interact with the libraries and operators RapidMiner Studio has to offer.
Let’s take an example process from the built-in product tutorials. Find the Execute Script operator in your Operator window, right-click on it, select “Show Operator Info”, and select Description. There will be a hyperlink titled “Jump to Tutorials.” Click on that and find the “Subtracting mean of the numerical attributes from attribute values” tutorial.
When you load the tutorial, you should see the process below.
This process loads the sample Golf dataset from the Samples directory.
The Groovy script is embedded inside the Execute Script operator. Just click on the Edit Text parameter in the Parameter window.
When you do, you’re presented with a text editor that contains a sample Groovy script. The following example code calculates the mean of each attribute and then subtracts that mean from each of the attribute’s values. Note: It skips string values, because you can’t subtract a mean from a string.
Tip: You should read the two “How to Extend RapidMiner…” PDFs. The first one, titled How to Extend RapidMiner 5, is the older version of the updated How to Extend RapidMiner PDF. The older PDF has a chapter on using Groovy, and the newer one covers how to build your own extensions in RapidMiner. Both are great ways to understand how RapidMiner handles its “guts.”
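To make the logic concrete outside of RapidMiner, here is a minimal standalone sketch of the same idea. It is not the tutorial’s script and does not use the RapidMiner API. It is written in plain Java (which Groovy largely accepts), and the class name MeanCentering, the method centerNumericColumns, and the sample rows are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone illustration; not part of RapidMiner.
public class MeanCentering {

    /**
     * Subtracts each numeric column's mean from its values, in place.
     * A column is treated as numeric only if every cell is a Number;
     * other columns (for example strings) are left untouched.
     */
    public static void centerNumericColumns(List<Object[]> rows) {
        if (rows.isEmpty()) {
            return;
        }
        int columnCount = rows.get(0).length;
        for (int col = 0; col < columnCount; col++) {
            boolean numeric = true;
            double sum = 0.0;
            for (Object[] row : rows) {
                if (row[col] instanceof Number) {
                    sum += ((Number) row[col]).doubleValue();
                } else {
                    numeric = false; // a string column: skip it entirely
                    break;
                }
            }
            if (!numeric) {
                continue;
            }
            double mean = sum / rows.size();
            for (Object[] row : rows) {
                row[col] = ((Number) row[col]).doubleValue() - mean;
            }
        }
    }

    public static void main(String[] args) {
        // Made-up rows loosely shaped like a weather table: one text column, two numeric columns.
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {"sunny", 85.0, 85.0});
        rows.add(new Object[] {"rain", 70.0, 96.0});
        rows.add(new Object[] {"overcast", 64.0, 65.0});

        centerNumericColumns(rows);
        for (Object[] row : rows) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

After the call, each numeric column sums to zero, while the text column is unchanged. Inside RapidMiner, the script works on the Example Set passed into the operator rather than on a plain list like this one.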
Real World Application
My colleague Martin Schmitz (@mschmitz_) recently had to write a “hack” for the PCA operator. He wanted an easier way to output the eigenvalues from the operator and then use them downstream in the process. The PCA operator calculates the eigenvalues automatically, but it doesn’t output them to an Example Set the way most data is typically displayed. He rolled up his sleeves and wrote the following Groovy script.
Martin takes the sample Sonar data set and wishes to extract the eigenvalues and eigenvectors from it. His sample process looks like this:
You’ll note that he exported the PCA model (the “pre” port) directly into the Execute Script operator. His code is as follows:
I added a red arrow to alert you to the import of various RapidMiner classes and libraries. In these 4 lines you see Martin importing the RapidMiner Ontology library (this defines the value types of the attributes), logging in Java and RapidMiner, and the PCA model operator. He essentially passes the PCA model object to the script, extracts the model values, and then exports them back into RapidMiner. If you put a breakpoint right after the Execute Script operator, you’ll see two results: Eigen Values and Eigen Vectors.
Tip: RapidMiner allows you to use your own libraries, BUT there are two catches. The most important catch is that with RapidMiner version 7.2, several Java-related security features were enhanced. I recently tried to recreate a personal project using Groovy in RapidMiner, only to run into this problem.
The second catch is that you need to put your libraries in your ..\RapidMiner Studio\lib folder. RapidMiner should pick up those libraries automatically when you launch it (unless they are a security risk).
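The reshaping step described above, turning model values into table-like results, can be sketched in isolation. This is not Martin’s script and it does not touch the RapidMiner PCA model classes. It is a plain Java illustration that assumes the eigenvalues and eigenvectors have already been read out as arrays. The class EigenTables, both method names, the “PC n” labels, and the sample numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone illustration; not the RapidMiner PCA model API.
public class EigenTables {

    /** Builds one row per principal component: a label such as "PC 1" and its eigenvalue. */
    public static List<Object[]> eigenvalueTable(double[] eigenvalues) {
        List<Object[]> rows = new ArrayList<>();
        for (int i = 0; i < eigenvalues.length; i++) {
            rows.add(new Object[] {"PC " + (i + 1), eigenvalues[i]});
        }
        return rows;
    }

    /**
     * Builds one row per original attribute: the attribute name followed by
     * its weight in each component. Assumes eigenvectors[c][a] holds the
     * weight of attribute a in component c.
     */
    public static List<Object[]> eigenvectorTable(String[] attributeNames, double[][] eigenvectors) {
        List<Object[]> rows = new ArrayList<>();
        for (int a = 0; a < attributeNames.length; a++) {
            Object[] row = new Object[eigenvectors.length + 1];
            row[0] = attributeNames[a];
            for (int c = 0; c < eigenvectors.length; c++) {
                row[c + 1] = eigenvectors[c][a];
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up values for a two-attribute example.
        double[] eigenvalues = {2.5, 0.5};
        double[][] eigenvectors = {{0.6, 0.8}, {-0.8, 0.6}};
        String[] names = {"attribute_1", "attribute_2"};

        for (Object[] row : eigenvalueTable(eigenvalues)) {
            System.out.println(Arrays.toString(row));
        }
        for (Object[] row : eigenvectorTable(names, eigenvectors)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In the real process, each of these row lists would correspond to one of the two results the script hands back to RapidMiner.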
Check out the RapidMiner Community for more Groovy discussion.