We’ve put together a list of the top ten most useful tips and tricks from RapidMiner’s data science team, community and super-users.
Watch this webinar for a demonstration on the shortcuts and best practices used by our experts to improve efficiency, shorten design times, and maximize value.
Hello, everyone. And thank you for joining us for today’s webinar: top 10 tips every RapidMiner Studio user should know. I’m Hayley Matusow with RapidMiner. And I’ll be your moderator for today’s session. I’m joined today by Jeff Chowaniec, our RapidMiner data scientist. Jeff, we’ll get started in just a few minutes. But first, a few quick housekeeping items for those on the line. Today’s webinar is being recorded. And you’ll receive a link to the on-demand version via email within one to two business days. You’re free to share that link with colleagues who are not able to attend today’s session. Second, if you have any trouble with audio or video today, your best bet is to try logging out and logging back in, which should resolve the issue in most cases. Finally, we’ll have a Q and A session at the end of today’s presentation. Please feel free to ask questions at any time via the questions panel on the right-hand side of your screen. We’ll leave time at the end to get to everyone’s questions. I’ll now go ahead and pass it over to Jeff.
Hello, good morning, everybody. I'm Jeff Chowaniec. I'm a data scientist and solutions consultant here at RapidMiner. And I'll be taking care of 10 tips every RapidMiner Studio user should know. I want to just cover briefly today's agenda. I'll run through a brief introduction of what RapidMiner is and what it means to be doing data science in RapidMiner. And then we'll take a deep dive into the top 10 tips for RapidMiner Studio users. And then I'll take any questions that you guys have after I finish that. So just to give you our mission statement: data science behind every decision. The idea behind RapidMiner is that we are the number one open-source data science platform, and we are the number one marketplace for data science experts. So this means you'll be getting help from communities of users. There's tons of material, options, and examples generated by our community as an open-source data science platform, as well as an extensive marketplace full of data science experts that can help you see your project through. The idea is that key innovation and competitive advantage lie in and around data science. If you invest in the right tools and the right skills, you'll be able to uncover new opportunities and change the way you interact with the business world.
So let’s get started with tips here. So we’ll start. Our first tip is to enhance productivity using your results overview tab. And what I’m going to do is I’m actually going to pull up Studio here. So this is Studio 7.3. So if you were to open Studio for the first time, you would get a very similar view to what I have on my screen except for this process here. I’ve just pulled some sample data from RapidMiner itself. So this retrieve titanic is actually in the samples folder. So if you’re watching this or want to attempt some of these tips and tricks, the data that I’m going to be using is going to be built into the platform. And I’ve just done a few data prep steps. But none of this is particularly important at this very moment in time because what I want to do is I want to jump to my results view here. And there’s quite a few things. Normally, if I run a process, it will take me into the results view and I can see what the operators that I’ve been using have accomplished. In this case, I’ve gone from raw data to a prepared data set for modeling. This data set, in particular, is the titanic data set. And there are actually plenty of demos offered on our website as well as the community site that cover the titanic demo in RapidMiner Studio. And it’s a very common demo for data science. But the idea here is that I’ve got this results view as such. And then I can also dash over to my statistics view. And I have a bunch of information about my data set.
But there's another layer to the results view, which is the results overview. So I can move between my current output results here and my result history. And in this result history, I'm able to see what jobs I've run. And then maybe I want to save different iterations. So if I open up each of these, they're very similar options. So if I scroll down here to the first one that I ran, I can see what came out of it. If I want to open this, I can save the changes here. And I'll just put this in my local repository, in the top 10 tips folder. So it opens up the process that I had actually run. Now, this was an earlier version of the process that we were working on. So now, I can come in here and see what my operators were and see what those operators were doing. And if I had other results in here, I'd be able to store them or decide what I want to do with them. In this case, all I have are example sets. I don't have any models or anything in here. But I can go through and I can see what each of my data points are. I can see that I have age, passenger class, gender, family size, passenger fare. I can see over here that there are no missing values. And if I put any comments on any of those attributes, I'd be able to read those comments here. So there's quite a bit I can get out of viewing previous runs of processes. Maybe I tweaked things and want to compare results; I'd be able to pull up performance vectors from previously run models and things like that. So there's quite a bit you can do within the result history to drag up old runs of processes that you already did.
Now, the results overview tab also has a lot of other options. So the first tab will give me my data view so I can see my example set. I've got 1,308 examples with two special attributes and six regular attributes. And I can scroll through as such. In the next view, I have statistics. So there's quite a bit that happens in here. I've got data types as well as a missing value count. Here, I've got cleaned data, so I've taken care of my missing values already. And then for each data type, I get different statistics. So if I open up age, which is a numerical value, I'll get a min, max, average, and standard deviation. But for my polynominal attribute, I get the least frequent value, the most frequent value, and then it runs through the values; that changes depending on the data type that I'm loading into RapidMiner. The next tab is this open chart option here. And this brings me to a chart view. Now, there are two versions of the chart view. I have standard charts, which will allow me to just select different plot types. And it'll basically automatically plot and automatically fit for me. So I can just take a look at a histogram so I can see what the distribution of each of my data points looks like, and expand my bin size if maybe I want a very generalized bucket size, so on and so forth. So I see that there are a few expensive tickets, but for the most part, people haven't paid a whole lot to board the Titanic, relatively speaking. I don't know how this money compares from 1910 to now, but the majority of my data falls within this bucket. So I can do a lot of data discovery through this avenue.
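For anyone who wants to see the per-type statistics outside of Studio, the same idea can be sketched in Python with pandas. This is only an illustration of the concept, not RapidMiner's implementation; the column names and the mini data set are made up here.

```python
import pandas as pd

# Hypothetical mini Titanic-style data set for illustration only
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 28.0],
    "passenger_class": ["Third", "First", "Third", "First", "Third"],
})

# Numerical attribute: min, max, average, and standard deviation,
# like the statistics view shows for "age"
age_stats = df["age"].agg(["min", "max", "mean", "std"])

# Nominal (polynominal) attribute: least and most frequent values,
# like the statistics view shows for "passenger_class"
counts = df["passenger_class"].value_counts()
least, most = counts.idxmin(), counts.idxmax()
```

The per-type split is the point: numeric columns get moments, nominal columns get frequency summaries, which mirrors how the statistics tab changes with the data type.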
So RapidMiner very much works on a CRISP-DM cycle. And this area, this data visualization is predominantly built for data discovery, and specifically that part of the CRISP-DM cycle. As I move down, I can open up advanced charts here. And I can now just build out charts as I go. I’m not actually going to try to build out anything in real-time here, but the idea is that the functionality is there and available for you to take a deeper dive into your data. Before you actually start prepping it or even modeling it, you’d have the availability to take a deep dive and say, “Okay. What’s happening with my data before I’m even trying to predict anything?” And for just about everything, that’s the results overview in a nutshell. There are quite a few different functionalities offered here.
Jeff, just one question that we have that came in. This person is asking what makes an attribute special versus other attributes?
Okay. Yeah. So if you've noticed, I have two special attributes; they're my two colored attributes. In this case, RapidMiner has a role system. So if I jump back to my design process, this Set Role operator is what's generating my special attributes. In particular, the green one is usually a label. Green in RapidMiner always has to do with modeling. So if you have a green output port or a green connector, it's usually generating a model or involved with modeling in some way. In our case, survived is our label, or rather our target variable in other nomenclature. So this is what RapidMiner knows to automatically predict. So that's why it's a special attribute. Our passenger name is set to an ID, which is blue. And the ID basically works as an ignore-this-attribute function. Maybe we join this back with a list or something like that; RapidMiner will know not to model on this name, which makes it a special attribute. For the most part, a lot of our operators will ignore any ID attributes that you have in your data.
Great. Thanks. One more question here. And you might be covering this later on in the top 10 tips. But this person is asking can you give an example using an advanced chart?
I mean, I can build one. I don't have an example prepared for this. But if I jump into advanced charts, I have the ability to see quite a bit, though I'm not going to do any transformation. Maybe I just want to track something; let's see what I can chart. So I have a lot of options here: domain, dimension, the axes. Let's take age, and I want to look at passenger fare. So now, I can drag in my attributes here. And basically, I wanted to see, based off of passenger fare, what's the age distribution for that set? And I'd be able to alter my dimensions here. So I can also color different dimensions and size them and shape them. So maybe I want to plot multiple things based off of passenger fare. So if I drop in whether or not they survived, I can do a color. So here, I can see what the age is based off of passenger fare, and the color denotes whether or not they survived. So maybe I'm trying to look for some pattern between passenger fare and whether or not they survived, because my first hypothesis is that women and children, we know, survived. But also, if you were really rich, you also survived. So maybe I want to plot this. And I can go through and extensively add more and more detail to my plot as such, just to get a further idea of what's actually happening with my data. Right. So that covers enhancing productivity using the results overview tab.
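The gist of that advanced chart, whether fare tracks survival, can also be sketched outside of Studio with pandas. The rows below are hypothetical stand-ins for the Titanic sample, purely to show the shape of the question.

```python
import pandas as pd

# Hypothetical rows standing in for the Titanic sample data
df = pd.DataFrame({
    "passenger_fare": [7.25, 71.28, 7.92, 53.10],
    "age": [22, 38, 26, 35],
    "survived": ["no", "yes", "no", "yes"],
})

# Average fare per survival outcome -- a quick numeric check of the
# "the rich were more likely to survive" hypothesis from the chart
fare_by_outcome = df.groupby("survived")["passenger_fare"].mean()
```

A scatter of age against fare, colored by the survived column, is the plotting equivalent of what the advanced chart builds interactively.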
Now, we'll take a look at our second tip here, which is going to be: save time by using pre-built building blocks. So if I jump back into Studio, I'm actually at a point where I want to generate a model here. And I might not have any idea how to do that, but I know I need to validate that model. And I know I'll need a decision tree or, at least, some sort of classification model. Here's where I'll introduce building blocks. So building blocks are something that you can configure, but there are also building blocks that come pre-installed. What they are are sets of operators that come pre-configured. So usually, it's either data prep operators in a subprocess, or an operator that has a subprocess associated with it, so you can go inside of that operator and place operators in there. So in this case, we would want to generate a cross-validation. And you grab that operator as such. That operator, much like the subprocess we have here, has a blue subprocess icon, which means I can go inside of it and place more operators. Here, I have to generate a model on the left and then apply that model and generate a confusion matrix. I might be able to sling this pretty fast and get it set up. But RapidMiner comes with pre-built building blocks, which you can access by right-clicking and saying Insert Building Block. I have built my own set, so I've got a couple in here. But the basic ones, nominal cross-validation and numerical cross-validation, come loaded in. And I think there are a couple of other ones in here as well; they're intertwined with some of the other data prep building blocks that I've built. For our case, if you place in the nominal cross-validation, it's described as a cross-validation evaluating a decision tree model, which is perfect because I wanted to run a decision tree.
So now, if I come inside of here, the decision tree is already hooked up. I've got all of the other operators associated with my testing data already preconfigured. In the training phase, the model is built on the current training data set, 90% of the data by default, 10 times. And that tells me the model is created. Then the trained model is applied to the current test set, the remaining 10%. And then performance is evaluated and sent to the operator results. So these are some pretty handy tools. If you need to generate a model quickly, you can drop this in just by right-clicking, going to Insert Building Block, and grabbing the pre-built models. And then there are some things you can do. If the decision tree isn't the one that you want, you can easily just say, "Okay. Let me replace this." We'll go and right-click on it and replace the operator. I'll drop down to modeling, predictive. Maybe I want to run a neural net. So I can switch this to a neural net as such. And then, if I'm going to use neural nets all the time, I can now make my own building block by just saying Save Building Block As. And I can call it cross-validation with a neural net. And I can add a description if I like and save it as such. And then I'd be able to call on that building block whenever I need it, and it would come preconfigured with the neural net instead of the decision tree.
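The validation scheme the building block sets up, train on 90%, test on the held-out 10%, repeated ten times, is plain 10-fold cross-validation. As a rough sketch of the same idea outside RapidMiner, here it is with scikit-learn on synthetic data (the data set and parameters are stand-ins, not the Titanic sample):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic data: 200 rows, 6 attributes
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# 10-fold cross-validation: train a decision tree on 90% of the data,
# evaluate on the held-out 10%, ten times over
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
mean_accuracy = scores.mean()
```

Each of the ten scores corresponds to one fold's performance; the mean is what the cross-validation operator reports as the overall estimate.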
For our case, I actually need the decision tree. So I'm going to go back to the original one and actually output this. And I can go ahead and run it like such. And now, I've generated a model from that decision tree that we put in there, which ends up being not too bad. So it splits on male and female, which is pretty obvious. And then for the most part, if you're male and you paid a lot of money for your ticket, then you survived. Otherwise, if you're in first or second class and you're female, it's most likely that you survived. And then third class is just a jungle of attributes, if you're curious as to the result of that model. But that was just with the building block that comes with RapidMiner Studio, so not too bad. The model ended up performing pretty well, and all I had to do was right-click and find the building block to generate a model. Now, if I jump back here, so that's tip two, the pre-built building blocks, and we covered how to build your own building block as well. For tip three, we'll stay on top of things using notes and folder organization. But first, I see I have a question here about building blocks: how do you share building blocks with the rest of the team? You can actually package up your building blocks and just email them to a teammate. Or if you're working with RapidMiner Server, you can share building blocks through RapidMiner Server. I'll also mention there is a community portal for shared building blocks; I'm actually going to be putting up some content later today on a few building blocks that I'm sharing with the community. So if you go to community.rapidminer.com, there's a specific portal where you can access building blocks that people have built and shared with the community.
So to start covering using notes and folder organization, we'll start with the notes option. With the cross-validation, there were notes attached to the operator and then attached below. So these notes are accessed by this little sticky note button here. And if I don't have anything selected, it will just drop a new note here. So I can say RapidMiner's top 10 tips are awesome. And now, I can just leave this here, or I can highlight an area. So this is probably more useful if I do something like this; I'll just click there. And now, I'm just creating chaos. So now, I can do stuff like labeling this as data import, with my data import operators sitting inside the highlighted area. So I can say, okay, my data import's happening here. And then maybe I can highlight some of my other options. So these are my data prep steps, and I can drag this into that area as well. And I can also add more notes as I go. And I can completely format these notes. So if I enter in here, I can set where I want the text to line up, or I can differentiate by color, and I've also got the option to delete. So I can organize as such, or I can just drag this onto my subprocess here. And now, it tells me that these are my data prep steps. And if I have an operator selected, I can go ahead and delete this data import note, or I can just hit the note button and it will drop a comment down there. And I have the same options; I can color-code it as well. So it works as a good bookkeeping option and makes information available for anyone you're sharing this process with. Whether you're sending them a process or you're collaborating via RapidMiner Studio, you have the ability to outline what you're doing without actually having to sit down and talk to them.
The next thing is folder construction. So this is probably more of a best-practices tip. If you wanted to create a new repository, in the repository view up top, we can do Create Repository. And when you just want a new local repository, you can hit next. And I can just say top 10 tips and hit finish. So this will just create a new repository entry for me. Now, if I open this up, our repository is empty; there's nothing in there. So what I can do is create subfolders. A very common folder organization, at least the one that I use and that the team members I work with use, is to generate a data folder as well as a process folder. And then I can generate a results folder. And within here, I can add extra subfolder constructions for specific projects, or I can just store a process here. And usually, if we're prototyping processes, we start with a number so that we keep them in the order in which we've worked on them. And we'll just call this titanic decision tree. So now, as I alter this process and keep going and adding, I can add new iterations of the process as we go. Again, it's mostly to keep the RapidMiner user organized, so that everything can be accessed as such. And you can actually import any data that you're using and store it as RapidMiner objects in the data folder. So maybe I'm working with a database file, but I sample down that file so that I don't have to call the database and bring it into RapidMiner every time I want to run it; I can store the sample there. Maybe I have text files that I'm using to do some sort of stop word filtering or pre-processing, or I'm doing some sort of replace operation using a dictionary. I can keep those text files stored in my data folder. My results folder can hold any of my models, any of my weight tables, word lists.
Anything that RapidMiner operators generate, I’ll be able to store as objects in my results folder and then open them if I need to view them or utilize them in RapidMiner processes. And that’s kind of why we break down the folders as such, so we can keep it organized. And then you can also add in more and more layers of sub-organization. But that’s usually a best practice for folder construction.
Okay. I'll take a look at a question here: can you give an example of packaging data and the process to transfer it to another computer or user? I can walk you through that, actually. That's pretty easy. So if I wanted to package this up, what I can do is just say Export Process. And I can export it to a location. Maybe I want to drop this on my desktop or wherever; it'll save it as a RapidMiner process, and then I can just email that process to somebody. Another option is to actually navigate to your RapidMiner folder. So let me open up my folder here and navigate over to my RapidMiner folder. Usually, it'll be found under your users. So if you were to go to your hard drive and then go to users, your name, there's a .RapidMiner folder. So in my case, my drive user is Jeff Chowaniec, and then there is a .RapidMiner folder. Inside of that .RapidMiner folder, there is a repository folder. And this carries all of your repositories and the objects that are stored in them. So if you just navigate, I think we just stored into top 10 tips. And if I go to the process folder, you'll see two files. This is a properties file, and this is a .rmp file. RMP stands for RapidMiner process. All you have to do to send this process to a colleague is actually just email this file. And then they can import it via the Import Process option, by going File, Import Process. Or what they can also do, and another pretty handy trick, is to take the emailed attachment and drop it into a repository folder that they already have, and it will automatically import into RapidMiner Studio.
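That drop-the-file-in trick is just a file copy into the right directory. Here is a sketch of it in Python, using a temporary directory as a stand-in for the real repository location (the actual path, typically under your user's .RapidMiner folder, and the file names here are assumptions for illustration):

```python
import pathlib
import shutil
import tempfile

# Stand-in for the real locations; the actual repository usually lives
# under the user's .RapidMiner folder, e.g. ~/.RapidMiner/repositories/...
with tempfile.TemporaryDirectory() as tmp:
    repo_folder = pathlib.Path(tmp) / "repositories" / "Top 10 Tips" / "process"
    repo_folder.mkdir(parents=True)

    # A received .rmp attachment (here just a placeholder file)
    attachment = pathlib.Path(tmp) / "01_titanic_decision_tree.rmp"
    attachment.write_text("<process></process>")

    # Copying it into the repository folder is what makes Studio pick it up
    shutil.copy(attachment, repo_folder)
    imported = (repo_folder / attachment.name).exists()
```

The same copy works for the data-side .ioo files mentioned next; the receiver's Studio sees whatever lands in its repository directory.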
Data files work similarly. I don't have any data stored in this folder. But if I run over to, let me see here, I should have a local repository with a bunch of data in it. Data files are the same thing. They can be found in here. You want to look for the .ioo file. That is the actual data file. So you can just send that, and they can import it the same way.
Right. I'm going to continue. I do have a few more questions I want to answer in here, but I'm actually going to save them for the end so I can cover all of my tips and then go over any remaining questions that you guys have. If any questions come in that are pertinent to the tip that I'm working with, I'll cover them. So if we take a look, number four is: quickly access tutorials from within the design process by clicking on an operator. So what I can do is, if I didn't know how to use one of these operators, say, for example, I've dropped in this cross-validation and I have no idea what's going on, I can right-click and do Show Operator Info. This view, in particular, will tell me everything I need to know about the cross-validation and whatever is in it. So it tells me what options it has or requires, what its input ports are looking for, and then what's coming out of it, and then I can take a look at the other options available. In the description, I actually have an extensive overview of what the operator does. But if I wanted to jump into a further tutorial process, there's this little blue jump-to-tutorial-process link. It'll give me a description of the tutorial, and then I can jump straight to it. So now, if I open it, it opens this process. I can double-click inside. And the text that they gave me to read will follow along with what's happening here. I'm going to jump back to our titanic decision tree. But all of the tutorial processes are accessed the same way. So if I check out the subprocess, I can go into the description. And it says using subprocesses to structure a process. And I can hit close. And that would bring me into another process that shows me how to use that operator. So the tutorials are made readily available through each operator itself.
So five is: fast-track navigation using camel case searching. So this is something you might have already been using. Sometimes, I'll type in here; I know if I wanted cross-validation, I could just type out cross-validation. Something about the way I use and teach RapidMiner is that I always say it's good to be a time-efficient, lazy data scientist when it comes to RapidMiner. And I'll back up that claim by saying I like to use the fewest keystrokes and the fewest clicks possible. So for example, there are two ways to save: the smart way and the not-so-smart way. The not-so-smart way is doing the File, Save As and all of that jazz, whereas earlier, we did the shortcut smart way, which is just storing a process here; that works as a save-as function straight into that folder. So that's an example of the fewest clicks. Here, it's the fewest keystrokes. So instead of typing out cross-validation, if I just type capital C, capital V, the camel case, I'll grab anything that has a capital C followed by a capital V in it. Here, cross-validation happens to be the only operator that does, which is perfect because it's the one I would have needed had I not used this building block. And it also works for something like a decision tree. So if I type capital D, capital T, there are multiple options for decision trees in RapidMiner. So it gives me all of them, not just the regular decision tree. And then I can also see that I get distance transformation. And then I've got a data transformation folder. So it'll open that folder for me even though the operators inside of it don't follow the camel case. So I can either quickly search folders or quickly search for the operators that I need. It saves time. It's just quicker. If you know what you're looking for, you can just use camel case searching.
I know sometimes I think if people are watching me do RapidMiner and all of a sudden I’m typing really fast in camel case, they’re like, “How did you get to that operator?” And so that’s what that function is in the search bar for the operators.
Six: leverage the regular expression helper for greater flexibility. So there are a couple of expression generators; I'm actually going to cover two of them. One of which is the regular expression generator. There are a few operators that will utilize regular expressions. I know, in particular, the Replace operator does, and there are quite a few other operators that will as well. So if I drop in a Replace here, what I can do is set the attribute filter type; I want to grab a regular expression. And I don't know what that regular expression needs to be; I'm not a master of writing regular expressions for RapidMiner. So what I'll do is open up this little regular expression option. I've got some shortcuts here. But the advantage is I can hit this button here, and it'll give me options for building my regular expressions. So I can easily say, okay, I'm probably going to need any letter, and, let's see, maybe followed by a non-word character. So I can drop that in as such. And that'll give me a regular expression that does something. And then I'll just test it. So maybe we try a word like test. So here, I pulled out any letter in combination with that. So I can add in more text there. And I can continue testing this. So maybe I want to expand this or figure out how I want to set up this regular expression. But I also have the ability to select extra options here and pull in more options.
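For readers who want to see what a pattern like that actually matches, here is the same kind of expression, any letter followed by a non-word character, tried out in Python's re module. The exact pattern RapidMiner's helper emits may differ; this is just the concept.

```python
import re

# "Any letter" followed by "a non-word character", as composed in the helper
pattern = re.compile(r"[a-zA-Z]\W")

# Try the pattern against a sample string, like the dialog's test field does
matches = pattern.findall("test, then more text!")
```

Each match is a letter plus the punctuation or space after it, which is exactly the kind of feedback the helper's test area gives you while you debug the expression.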
So here, I've got any letter combined with any non-letter character, so pieces of my test text get identified by the pattern. And now I can go ahead and keep debugging what I'm writing if I'm not a master at regular expressions. So I can back out of here; I actually don't need this Replace, I just wanted to show off the expression generator. The other expression generator that I show off quite a bit is associated with Generate Attributes. So if I wanted to generate a new attribute, I can hit this edit list here. And now, I'll give an attribute name; I'll just call it test. And I can go ahead and type out some function if I know the functions well enough. But if I don't, there's a calculator here that I can open up. And this will allow me to actually utilize my data set with clickable functions. So if I need any of my attributes, I don't even have to type them; I can just click them as such and it drops them in. Anything that's orange, note, if I start typing other things, it's black, the orange will always denote an attribute in my data set. The other thing in here is there are tons of different functions available as a user. So here, I can just grab the if function, and it tells me it wants a condition, the value if the condition is true, and the value otherwise. So I can say something like: if survived equals yes, output test one, and output test two if they did not survive. And the cool part here is I've satisfied the conditions that the if statement is telling me to use. So it told me that my expression is, in fact, correct, and I can apply here and apply. And now I can add a breakpoint after to see if this actually does what I want it to do.
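That if-expression maps straightforwardly onto a conditional column in code. Here is a pandas/numpy sketch of the same derived attribute, with a hypothetical three-row data set standing in for the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the survived label
df = pd.DataFrame({"survived": ["yes", "no", "yes"]})

# Equivalent of Generate Attributes' if(survived == "yes", "test one", "test two")
df["test"] = np.where(df["survived"] == "yes", "test one", "test two")
```

The breakpoint trick in Studio plays the same role as printing the frame here: you confirm each row got test one or test two before moving on.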
And now, I get test one and test two for the different class values. And I can also expand this further; I can add more and more entries to my data set as such. But what's cool is there are tons of different functions available to you. So if you needed to do some sort of date transformation, there are date calculation options where you can just grab different dates and get the difference between them. If you have a last transaction date, you can set it up so you have days, weeks, hours, or minutes since the last transaction. And you can utilize these functions to further transform your data beyond just the operators. So there's plenty of capability in both the regular expression generator and the function expression generator.
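The days-since-last-transaction idea is simple date arithmetic; here it is sketched with Python's datetime module (the dates themselves are made up for the example):

```python
from datetime import date

# Hypothetical transaction dates
last_transaction = date(2016, 11, 1)
today = date(2016, 11, 15)

# Difference between two dates, like the date functions in the
# expression helper would compute
days_since = (today - last_transaction).days
weeks_since = days_since // 7
```

In a real recency feature you would compute this per row against each customer's last transaction date, which is exactly what the date functions inside Generate Attributes let you do.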
Seven. So the next couple of tips, well, seven and eight, I'll couple together. They're fairly straightforward. So I've got alt-click to delete connections, and then shift-click to move result ports. I'll couple these together because they're fairly straightforward and really simple. So maybe I have a lot of connections set up, and I want them to come over here. Now that I have these set up like so, maybe I don't need to output some of the outputs that I'm sending into my result nodes. I can hover over them and it gives me an X, and I can click the X. But if I quickly wanted to do it, depending on where my mouse is, I can just alt-click and close out my connections as such. The other cool thing is I can actually organize my result outputs here. So if I hold shift and then click on these result ports, I can adjust them as such. Maybe I wanted to add another performance option coming out of here, but I wanted to join it with another output and send that output down here, so that I'm not crowding up my results and I know which results are going to which result ports. So these tips are just small functionalities that, if you didn't know they were there, you wouldn't know to use them. So if you ever wondered how you could space out your result ports, it's shift-click. And if you want to quickly close out connections, you can alt-click on the connections and close them.
So nine is going to be embedding R and Python scripts more efficiently. Maybe we want to compare our decision tree to an R- or Python-generated decision tree. I can go ahead and grab a Multiply operator from the recommended operators bar down below, which comes from RapidMiner Wisdom of Crowds. If you've not activated this yet, or never heard of it, it's an opt-in program for our users. It tracks usage metadata: which operators you place onto the canvas and what you connect them to. From that data, it makes predictions for users utilizing Wisdom of Crowds, suggesting what the community is using at the time. Multiply was in my recommended operators bar, which is perfect because I was going to search for it anyway, and I could grab it right from there. The advantage for the experienced user is the lazy data scientist angle I mentioned earlier: fewer keystrokes and clicks, and most of the time the operators I'm looking for are right there down below. And the new user benefits from the experience of people who have been dedicated to RapidMiner for a long time and have really hammered out their design process. If I don't really know what the next step is, I get suggestions based on the operators I'm placing, and I can try them out and see whether they do what I need.
For our next step, I'm going to grab the Execute R and Execute Python operators. They work the same way, so I'll just grab R here and embed it. Maybe I want to compare my decision tree to a decision tree I can build in R, if I can sling R well enough. But instead of building this out in R itself, I'll just run my R script inside of RapidMiner. So now I can hook up my data, hit Edit Text, and my data gets loaded in through this rm_main function. In there I can script away: load whatever packages I need for my modeling, pull whatever I need to edit my data, anything like that, all embedded in RapidMiner. Then, using the same tips we used before, maybe I need to run multiple R scripts, and maybe there's a Python package I like better that I can drop in as well. So now I'm using R and Python in the same workflow. But now I need to know what each script is doing, so let me tag that with the comments we talked about earlier. Now I know this one generates a decision tree, this one runs a separate validation, and my Python script is doing an interesting pivot or making a visualization.
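As a sketch of the contract the speaker is describing: RapidMiner's scripting operators call a function named `rm_main`, pass each input port's example set in (normally as a pandas DataFrame), and treat the return values as output ports. To keep this example dependency-free, the data is mimicked here with a list of dicts, and the column names are invented for illustration.

```python
def rm_main(rows):
    """Toy script step: derive a fare-per-person column from two inputs.

    In RapidMiner, `rows` would arrive as a pandas DataFrame from the
    operator's input port; returning it sends it out the output port.
    """
    for row in rows:
        row["fare_per_person"] = row["fare"] / (row["family_size"] + 1)
    return rows

# Standalone check with a small made-up example set
out = rm_main([
    {"fare": 100.0, "family_size": 1},
    {"fare": 30.0, "family_size": 0},
])
print([r["fare_per_person"] for r in out])  # [50.0, 30.0]
```

The useful part of the pattern is that each script stays a small, named unit: one `rm_main` per Execute operator, which is exactly what makes the multi-script workflow below easy to debug.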
So you guys get the idea. I can parse this up and tell Studio, or tell other users, what each of my scripts is doing, so I don't have one big conglomerate of an R script. The other advantage is that if one of the scripts fails because of some error, like a missing period, this script might run and that script might fail, so I only have to look inside the failing one instead of hunting through one long script to figure out where it's goofing up. And there's a question here pertaining to R: will the packages be downloaded automatically from the R environment? Yes. There's an installation setup between RapidMiner and R, and between RapidMiner and Python. That installation happens in R to establish the connection to RapidMiner, and then RapidMiner installs the packages it needs. It basically has its own instance of R that it runs through the connection you set up, with its own area where it installs packages. One last note on that: R does not have to be open during this process; RapidMiner runs the R script in the RapidMiner engine.
So I'm going to cover our final tip for today and then leave some time at the end for questions from you guys. The final tip covers Tree to Rules. Now that I've generated the decision tree in RapidMiner for my survivors, maybe I actually want to export the rules so I know what's happening. There's a Tree to Rules operator that I can drag in like such. Then I can hit control-C and copy my validation, send my training data in, and send my model out. Let me take a look at what my example set looks like and see whether it's going to break this process if we go ahead and run it. Now, instead of my decision tree output, I have the actual rule sets. I can share these rules with my business team, or whatever team I'm working with, so they can see which customers to target specifically. For example, we can ignore the first two rules because I know they're small. This class is pretty large, 139 to 5. So we can say: if sex is female, and no siblings and spouses, and no parents on board, and the passenger class is first, then predict survived. That's a good indicator. You can actually improve this model's performance by omitting these family-size attributes, but that's a whole other webinar on optimizing decision trees for the Titanic data set.
But the idea is now I can tell that, okay, if you're female and in first class, you're most likely to survive. And if you're female, in second class, and younger than 56, you're very likely to survive. And then it breaks down further; it can get a little more granular too. For instance, it says if you paid a ridiculous amount of money for your ticket and you're male, then you definitely survived. The numbers in parentheses refer to the predictions: survived is on the left, did not survive on the right. So this tells me in this bin the prediction is no: zero of these examples were predicted to survive and six were predicted not to survive. And it tells us that 1,050 out of 1,308 of our training examples were correct. You can see here there are people who paid at the higher end of our spectrum, just not as crazy high as those others, so they end up in this prediction. And I do know, from trying to increase performance on the Titanic data, that a lot of these are actually children: anyone under the age of 11 falls into this prediction as well, because they didn't pay for their own tickets. So that's Tree to Rules: an easy way to take a decision tree and turn it into a set of rules you can hand off to your BI folks or your sales team, or what have you, to make a business decision on those users.
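A rough sketch of what a tree-to-rules conversion does (this is an illustration, not RapidMiner's implementation): walk the decision tree and emit one "if ... then predict ..." rule per leaf, accumulating the split conditions along the path. The tiny tree below is a made-up fragment loosely modeled on the Titanic example.

```python
# Hypothetical two-level tree, represented as nested dicts
tree = {
    "split": ("sex", "female"),
    "yes": {
        "split": ("passenger_class", "first"),
        "yes": {"leaf": "survived"},
        "no": {"leaf": "did not survive"},
    },
    "no": {"leaf": "did not survive"},
}

def tree_to_rules(node, conds=()):
    """Collect one human-readable rule per leaf of the tree."""
    if "leaf" in node:
        cond = " and ".join(conds) or "true"
        return [f"if {cond} then predict {node['leaf']}"]
    attr, val = node["split"]
    rules = tree_to_rules(node["yes"], conds + (f"{attr} = {val}",))
    rules += tree_to_rules(node["no"], conds + (f"{attr} != {val}",))
    return rules

for rule in tree_to_rules(tree):
    print(rule)
# if sex = female and passenger_class = first then predict survived
# if sex = female and passenger_class != first then predict did not survive
# if sex != female then predict did not survive
```

The appeal for handing rules to a business team is exactly this flat, readable form: each rule stands alone, with no need to read the tree.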
So quickly, I just want to cover next steps. There are quite a few resources available to you as a RapidMiner user. First, there's the RapidMiner blog at rapidminer.com/resources/blog, with posts from users and from data scientists like myself and others here at RapidMiner. There's the RapidMiner community, with tons of posts from users including Ingo Mierswa, the founder of RapidMiner, other data scientists from both the R&D realm and the consulting side, and users who just like answering other people's questions. For feature lists, if you go to rapidminer.com, then to our products, and navigate to Studio, it'll give you a breakdown of everything included in Studio so far and the most powerful tools available in RapidMiner Studio. For training videos, go to rapidminer.com/training/videos and get your hands on a bunch of videos that walk you through bringing your first data set into RapidMiner, generating your first model, validating that model, and deploying it from Studio. So there are plenty of resources to get started as a new RapidMiner user. Now Hayley and I will go through the questions here and see which ones we can get answered for you. So feel free: if you have a question, go ahead and ask away, and we'll get to as many as we have time for.
Great. Thanks, Jeff. Just as a reminder to those on the line, we will be sending a recording of today's presentation. A couple of you have asked via the questions panel, so yes, we'll send that within one to two business days. And like I said, it looks like we have a bunch of questions that came in during the presentation, but if you have any additional questions, feel free to ask those via the questions panel. And Jeff, I don't know if you've taken a look at the questions, but one here is asking, "Is there a way to export a chart?"
So there is, unfortunately, no functionality for exporting charts. You can screenshot them, which is currently the only way. But I know this is something that's been brought to the attention of the development team.
Thanks. Another person is asking, “Can I directly email the whole local repository?”
Yeah. If you want to package up the whole repository and send it over, you can. All the recipient has to do is drop it into their repositories folder, which lives under their user directory in the .RapidMiner folder. If they drop that repository in there, or just copy its contents into their local repository, it'll automatically import into their Studio environment.
And then, I think, this might be a follow-up question to that, “But will it be downloaded automatically from the R environment? And does R have to be open during this process?”
I think I covered both of those. So that was for R scripting. So you don’t actually have to have R open. The installation between R and RapidMiner will allow RapidMiner to just run R inside of its environment.
Great. Another person here. “What are some common tips for your text processing extension?”
Okay. Yeah. So I will show that off. There's an Extensions option up at the top of Studio, and you can navigate to the Marketplace to get your updates and extensions. I can update RapidMiner Studio from here, or if I navigate to Top Downloads, I see a bunch of extra packages for RapidMiner, a lot of which are official extensions supported by our development team. The most commonly used one is Text Processing. It doesn't come pre-installed, just to save installation space, but you can get it in there. So now I can do something like Process Documents from Data and send it out; I'd need some sort of data set coming in here. It is statistical-based text processing, not natural language processing, which means there's a little bit of configuration to put into this operator. However, its power and capabilities far exceed NLP in terms of how much you can actually accomplish, and because it's statistical, it can handle any language; it just requires that you set it up. For example, inside it we'll need a Tokenize operator, which actually selects the words for us. You have different tokenizing options: non-letters is what I use when I'm feeling lazy, or I can write a regular expression or specify the characters I want to use to parse up words. Non-letters works like this: it only keeps runs of letter characters between non-letter characters. So if there's a space, then some text, then a period, it takes all the letters between that space and that period as one token.
The next step: you can also do things like filter stop words. There are pre-built dictionaries. This one is English, so it takes out English stop words, and there are also German, French, Czech, and Arabic. I know people produce other dictionaries you can get your hands on through the community. Or, if you want to generate your own dictionary, it just requires a text file, which you can click and drag into RapidMiner Studio right from your desktop. Once that file is in, you can attach it here and use it as your own stop words dictionary. I have one I set up for working with Twitter. I'll use my text processing to search Twitter for certain hashtags and then do some sort of data science, or just clustering, based on the words people are using in their tweets. My own stop words list takes out things like "retweet" or "follow", all those things people use to get others to subscribe to their social media, plus the URLs that come along with them. You can also stem words, and there are plenty of options around frequent itemsets, whether on terms or actually building out frequent itemsets with FP-Growth and so on. Those are some of the commonly used transformation operators. I think Transform Cases is another good one: I can jump in here, drop in a Transform Cases, and make everything lowercase, so that "THE" and "the" don't get pulled out as two separate words. So there are plenty of operators made available with this extension that make text processing available and easy for the new user.
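The pipeline just described, tokenize on non-letters, lowercase everything, then filter stop words, can be sketched in a few lines of plain Python. This is only an illustration of the logic, not the extension's code, and the stop-word list is a tiny stand-in (with "rt" included the way a Twitter-specific dictionary might).

```python
import re

# Tiny stand-in for a stop-word dictionary; real lists are much longer
STOP_WORDS = {"the", "a", "an", "and", "rt"}

def process_document(text):
    """Tokenize on non-letters, lowercase, then drop stop words."""
    tokens = re.split(r"[^A-Za-z]+", text)               # tokenize: non-letters
    tokens = [t.lower() for t in tokens if t]            # transform cases
    return [t for t in tokens if t not in STOP_WORDS]    # filter stop words

print(process_document("RT The QUICK fox, and a dog!"))
# ['quick', 'fox', 'dog']
```

Lowercasing before the stop-word check is what makes "THE" and "the" collapse into one token, which is exactly the Transform Cases point above.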
Thanks, Jeff. Another person here. “Can I execute programs from RapidMiner?”
Yeah. There are some other options. You can run R and Python like we did. But if there's a program you want to run, maybe you're just old-school and have some C++ code lying around doing something you can't replicate in RapidMiner for some reason, you can actually use Execute Program and just call an exe file, or maybe a script, or what have you. You set the command up here, you can add a working directory, and there's an environment variables piece. So you can have a program run through this Execute Program option: it'll call that program, run it, and then maybe you place the output file somewhere, or the output gets hooked in here. The idea is that you can run other programs and then just import their outputs into RapidMiner. I can have this program execute first and then look for a file it generates, and so on. So there's quite a bit of extra capability if you play around with executing other programs. They might be Java-based programs, which work easily with RapidMiner because RapidMiner sits on Java, or any non-Java-based program as well.
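A hedged sketch of what an Execute Program step amounts to: launch an external command, wait for it to finish, and pick up its output. To keep the example self-contained it calls the Python interpreter itself rather than a real exe.

```python
import subprocess
import sys

# Run an external program (here: python itself, so the example is
# self-contained), wait for it, and capture whatever it prints.
result = subprocess.run(
    [sys.executable, "-c", "print(2 + 2)"],
    capture_output=True,  # collect stdout/stderr instead of inheriting them
    text=True,            # decode bytes to str
    check=True,           # raise if the program exits non-zero
)
print(result.stdout.strip())  # 4
```

In the RapidMiner case the command would be your exe or script, and downstream operators would pick up the file or stream it produces.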
Great. Another question here. “Can I establish a web service from RapidMiner?”
Yes, if you are a RapidMiner Server user. I actually have a web service set up, so let me log into my server here. Once I'm logged in, I can say Browse, which should connect me. There you go, connected to the Dortmund server. Then I can say Log In, which brings me into the actual server portal. That's the server my Studio is connected to; I'm in Texas and the server is in Dortmund, so there's a little latency there. I can go into my Services tab and quickly build out a new web service. If I edit this one, which I have access to because I built it, all it requires is that you build a process in RapidMiner that you want to establish as a web service and point the service at it. So I give it a source: I can either type in the file path normally, or use the folder option and select it from a menu. Once it's up and running, the web service itself is the RapidMiner process you built in Studio; this page just gives you an output format and ways to utilize the web service specifically.
So I can test this web service to see if it's running. Of course, my output's going to look funky here, so I'm going to go over to View Services, because it usually works when I do it that way. I am getting an undefined error, but it is actually running, so ignore the top part; the bottom here is what's important. This is my actual output. All this is trying to do is predict whether an email is fraudulent or not. This is a positive hit; the input is some scam-sounding sample text, just a canned product example. I can hit Test, and this web service reruns on that sentence, which I'm sending as an input, as if I were calling a REST API or something like that. And now it says, okay, it hit on these words, which is positive for fraud. So it's comparing the text against a model, and it's established as a web service. I can access that web service and embed it into other programs such as QlikView and Qlik Sense. And there's a new two-way Tableau API connection being worked on as well, where I'll be able to call RapidMiner web services right from Tableau: I can send Tableau data over, rerun a model, and get the results back in Tableau. I can do the same thing in Qlik. So you can utilize these web services from other applications, or from applications you set up yourself.
Thanks, Jeff. One question while you’re on this topic. This person’s asking, “How do they create a web service which uses a post call instead of a get call?”
A POST call instead of a GET call, so you can send a list of inputs in the POST body. This is set up through RapidMiner Studio: you define what inputs the process expects, and the web service is configured to look for those inputs in the RapidMiner process itself in order to send results out. So if your web service does data prep and then passes its results to something else, that's just how you set up the RapidMiner process. The process does the data prep or scores the model, and the last operator you add writes to a database, so whenever anybody runs your web service, it appends those results into a database table. Most of what a web service takes in or sends out is defined by the process you build in RapidMiner Studio.
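From the caller's side, the GET-versus-POST distinction looks like this (a stdlib-only sketch; the endpoint URL and parameter name are invented, and the requests are built but deliberately never sent):

```python
import json
import urllib.parse
import urllib.request

# Hypothetical RapidMiner Server web service endpoint (made-up URL)
BASE = "https://server.example.com/api/rest/process/fraud_check"

# GET: the inputs travel in the query string
get_req = urllib.request.Request(
    BASE + "?" + urllib.parse.urlencode({"text": "secret offer"})
)

# POST: the same inputs travel in the request body instead
post_req = urllib.request.Request(
    BASE,
    data=json.dumps({"text": "secret offer"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

print(get_req.get_method(), post_req.get_method())  # GET POST
```

A POST body is the natural fit when the input is large or structured, like a list of records to score, where a query string would be unwieldy.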
Great. Thanks. So it looks like we're just about out of time. If we weren't able to address your question here on the line, we'll make sure to follow up with you via email within the next few business days. So thanks again, Jeff, and thanks, everyone, for joining us for today's presentation. We hope you have a great day.