Martin Schmitz, Head of Data Science Services, RapidMiner
Python is the dominant programming language for Data Science. RapidMiner’s Python Scripting extension already provided a way to integrate Python into RapidMiner. With the new Python Operator Framework (PoF) there is a new option for working with Python within RapidMiner. PoF allows anybody to write new operators – in Python! No need for Java anymore. You can just package your favorite Python scripts into an operator and share it with anybody.
00:04 [music] Okay, guys. Hello. Good afternoon, everybody. The work I’m presenting is mostly work by Bhupinder who built that extension, and I worked with him to do that. So Python Operator Framework– but first a bit about Python. So why does RapidMiner need a Python integration? And there are actually two reasons, not just one. The first reason is data scientists are using Python a lot. If you remember what Scott did in the morning, that was a topic– well, they’re using a lot of Python and you want to give Python coders an ability to join the RapidMiner platform. But actually, from a development standpoint, we’re facing a problem. RapidMiner’s a Java software, but a lot of new methods are not developed in Java but actually in C, C++. A lot of the Python is actually Fortran, so good old Fortran, and it’s kind of hard – Agiza can tell you more about that, for example – to get C code into a Java platform. That’s a nightmare. So giving people the ability to add Python to the RapidMiner platform means we can integrate algorithms which, before, we had a hard time integrating. For example, XGBoost, which is an alternative to the gradient boosted algorithm which is already in. And that’s what we’re talking about.
01:19 And if you remember, also this morning, our statement was we want to reinvent enterprise AI so that anyone has the power to positively shape the future. And anyone, of course, includes Python coders. So we invented the Python Operator Framework, and the Python Operator Framework allows you to create, use and share operators across your enterprise so that anyone can use them. Again, anyone means in this case somebody who has no experience in Python should be able to use your Python operators. We’ll do this talk in two parts. And you also know these two people, Sarah and Scott, from earlier. We’ll do a Scott perspective: how is a RapidMiner power user, or also a business analyst, seeing the Python Operator Framework? That’s the first part. And then I’m sorry, but I’ll bother you with some code – which is something new for me, to talk about code – how to actually write what we use in a second. So Scott forced me to do that as a live demo, but I also got some slides.
02:34 So what do we have? We have this new object, the XGBoost object which is what the Python coder created and which I will show how to build this thing. And what you can do is we can get some data in like Sonar and then we can use this new operator called build Python model, connect it like this and–
02:57 We can’t see anything.
02:59 Oh, that’s perfect. That’s perfect. [laughter] Why don’t you see anything? Okay. So what I dragged in is this XGBoost object, the Sonar dataset, and this new operator which is called build Python model. And you see that the port out of that, the mod port, is already green. What comes out of this is a RapidMiner model, and for you as a user, there’s no difference between this model and any other model you may know from RapidMiner. So I can use this super standard Apply Model operator like this, just put it like this. Do what you should never ever do and apply the model on the training data. And you see there it’s populated, already, the list. This behaves like the GBT operator straight off. You know that. And if I run this– I’ll just use 10 [3?] so it only takes 10 seconds. I am applying that model, and I get predictions, confidences. It’s the same way you know it already. So RapidMiner user, no difference to before. There’s one difference. So we can actually now not just build a simple model and apply it on the data we train on, but we can also use this in a cross-validation. So again, you use this build Python model operator which works like a GBT. Only slight difference, you need to input this XGBoost operator, this XGBoost object here, which holds all the Python scripts and everything for you.
04:37 And then I can run this and this is a cross-validation. You all know that, because what comes out of the build Python model operator is a RapidMiner model backed by Python. And no surprise, you can also optimize on this; there’s an Optimize Parameters (Grid) operator. Inside there is a cross-validation and inside, again, we have our build Python model operator which does everything for you. And the cool part is, of course, if I go here to add a parameter setting and look at my build Python model, I can optimize objective, boosting functions. All the good stuff from XGBoost is exposed to this model. Kind of straightforward. Let’s move back here. Now I need to press this button again. Hyperparameter tuning. Of course, since this is a RapidMiner model, you can use it like any other RapidMiner model. You can put it into the model simulator and slide around. You can put it into Explain Predictions and figure out what this XGBoost model was doing. And you can use it in the model ops, in the new deploy custom model, which is not just using Auto Model models but your own models. Sure. We saw that earlier. You can use these Python models within. So for you as the Scott guy, who’s not a Python guy, you just use it and you’re happy. The only thing you need to get is this one object, this XGBoost object, and actually, we can now build not just XGBoost, but whatever you want, we can build as an operator.
06:18 Let me talk you through how that works. Let’s go into the Sarah perspective and be a coder and code this operator. How does it work? So there are three files in the backend which you need to write. There’s a train.py. You can actually name it whatever you want to. This is a script which holds a function which is doing the training when you build the model. There’s an apply.py which is the code which is run when you run Apply Model. And then there is a parameter XML which holds all the parameters you saw on the right-hand side. Okay. Be aware, [dragons?], I’ll show you code. So what is the most minimal implementation for us? This is the train.py file which you can use. For those of you who know Python, this is a function. We build a classifier, fit the classifier on the data, return the classifier and a string, and we’re done. That’s the most minimal implementation for the train.py. Apply.py doesn’t really look that hard either. You, again, find our main function and predictions are just clf.predict(). This is what we got from up here, and then we return it again. Minimal implementation. And then you just use this create learner operator, point it here to the train, to the apply. Actually, the parameters can be empty in this case – this is the most minimal implementation – and store it. That’s already it. That’s basically what you build to provide this XGBoost object which can then be used by anybody else to build XGBoost models.
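The train/apply pair described here can be sketched roughly as follows. This is a minimal sketch, not the framework's actual interface: the function names and signatures are hypothetical, the data is a plain list of row dicts rather than the data frame the framework passes in, and a tiny majority-class learner stands in for XGBoost so that the example has no dependencies.

```python
# Minimal sketch of a train/apply pair (hypothetical names and signatures).
from collections import Counter


class MajorityClassifier:
    """Stand-in for a real learner such as XGBoost: always predicts the most common label."""

    def fit(self, rows, labels):
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, rows):
        return [self.majority_ for _ in rows]


def train(data):
    """Role of train.py: build a classifier, fit it, return it plus a description string."""
    labels = [row["label"] for row in data]
    features = [{k: v for k, v in row.items() if k != "label"} for row in data]
    clf = MajorityClassifier().fit(features, labels)
    return clf, "majority-class stand-in model"


def apply_model(clf, data):
    """Role of apply.py: the minimal version returns only the predictions."""
    return clf.predict(data)


training = [
    {"a": 0.1, "label": "Rock"},
    {"a": 0.4, "label": "Mine"},
    {"a": 0.9, "label": "Mine"},
]
model, description = train(training)
predictions = apply_model(model, [{"a": 0.2}, {"a": 0.7}])
```

Note that this version hard-codes a column named `label` and returns bare predictions, which are exactly the shortcuts the talk goes on to criticize.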
08:05 Cool. Sounds very easy. Now there comes the tedious part. So as you probably know, Fabian and I, we built Operator Toolbox, and Fabian’s always pushing me to build nice operators and not just crappy operators. And Scott is always pushing to write proper English and not my German version of English, so you want to build a nice thing. And actually, 90% of the time, it’s not building the minimal prototype, but building something which is nice and easy to use. For example, it should show error messages when something goes wrong. So what are we missing here? Well, what happens if there are nominals and XGBoost can’t really handle nominals? It would just crash. What’s happening with special attributes? Special attributes should be ignored in RapidMiner. [Currently?] we just fit on everything. What about if the label is not called label? I look here for a column called label and not the role called label. What about regression? What if the label is regression and not classification? The classifier actually has no settings. We did not set any settings. Apply model is only returning the predictions and not the dataset with the predictions. And what else do I have? What if the schema in application is different than in training, or if there are additional columns, or columns missing which should be there because they were in the training? And also there are no roles in Apply, so the column prediction doesn’t get the role prediction. It’s not [green?] in RapidMiner.
09:44 If you think about it, probably most of you haven’t thought about it, how much work there is in the backend. If you write an operator properly, you need to handle a few cases, so what if the user adds a date? What if a user does X, Y, Z? And if you do it correctly, then you also write test processes which are testing for these cases and all these things. Okay. Now let’s have a look. So this will create quite some more code. So you will write code like this. So the data frame we pass into Python also has the metadata information available. So we actually do know what column has the role label. So you’ll end up with a lot of these Python functions you’ll write yourself where you check in the data frame, “How’s my label named?” And use it over and over again. So we actually wrote a new library called RM utility, so you don’t need to do this anymore, but you can say, “RM utilities get label for my data frame,” and then have the label and can work with it, because otherwise, everybody would need to actually write that. Small feature. You’ll also see that there’s an exception. So you can throw exceptions or error messages from within Python which turn up in RapidMiner like these small, red error bubbles when you do something wrong. That’s what you get if you throw an exception from Python.
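The kind of helper described here, looking up the label by role instead of by column name and raising a readable error when it is missing, might look like the sketch below. The metadata layout (column name mapped to a type and a role), the exception class, and the function names are assumptions for illustration; the actual utility library's API is not shown in the talk.

```python
# Hypothetical metadata layout: column name -> (value type, role).
# Regular attributes have role None. The real metadata format may differ.


class OperatorError(Exception):
    """Raised so the user gets a readable message instead of a raw crash."""


def get_label(metadata):
    """Return the name of the single column whose role is 'label'."""
    labels = [name for name, (_type, role) in metadata.items() if role == "label"]
    if len(labels) != 1:
        raise OperatorError(
            f"Expected exactly one label column, found {len(labels)}."
        )
    return labels[0]


def get_regular(metadata):
    """Return the columns without a special role, in their original order."""
    return [name for name, (_type, role) in metadata.items() if role is None]


metadata = {
    "id": ("integer", "id"),
    "attribute_1": ("real", None),
    "attribute_2": ("real", None),
    "class": ("nominal", "label"),
}
label_name = get_label(metadata)
regular_names = get_regular(metadata)
```

With helpers like these, the training code no longer depends on the label being literally named `label`, and special attributes such as the id are left out of the fit.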
11:12 Okay. Long script. This is the train.py, of course. What do we need to do to really build the model I just told you about? We need to get what is our label name and what is our [X?] names. And you see I’m using this new RM library so that I just say, “Get label. Get regular,” to not write the code again. Then we actually need to check, is this a classification problem or is this a regression problem? Most Python libraries actually do it the way that the algorithm can either do regression or classification. And you, as a data scientist, you’re smart enough, you choose the right version whatever you do. RapidMiner is different. RapidMiner just checks: it’s a numerical column, I do regression. It’s a nominal column, I do classification. So it makes it easier for the user, and what we want to build is an operator which is as easy to use for the RapidMiner user as any other operator he already knows. There should be no additional learning involved. So that’s why we check for it. We raise an exception because there can be that weird thing that people want to predict dates – I’m not sure if that’s a good idea – and then you throw an exception. If it’s classification, then we need to do label encoding because XGBoost wants to actually [actually bring?] 0 and 1 and not true and false or survived, not survived in Titanic’s case, again, something RapidMiner is usually doing for you, so we need to write it here. And then we build the classifier or regressor depending on the task, and then we just fit it. And actually what we return is not just the plain model itself but an object which also has all the parameters, the name of the label, the name of the columns we were using during training, so a bit more complex object which knows more about itself.
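The decisions described for the fuller train.py (pick the task from the label type, reject unsupported label types, encode nominal labels as integers, and return a bundle that remembers how it was trained) can be sketched like this. The type names, function names, and bundle keys are hypothetical, and the actual learner is left as a placeholder.

```python
# Sketch of the training-side checks; type names and bundle keys are assumptions.

NOMINAL_TYPES = {"nominal", "binominal", "polynominal"}
NUMERIC_TYPES = {"real", "integer", "numeric"}


def detect_task(label_type):
    """Nominal label means classification, numeric label means regression."""
    if label_type in NOMINAL_TYPES:
        return "classification"
    if label_type in NUMERIC_TYPES:
        return "regression"
    # For example a date label: fail with a clear message instead of crashing later.
    raise ValueError(f"Cannot learn on a label of type '{label_type}'.")


def encode_labels(values):
    """Map class names to integer codes 0..n-1; also return the class list for decoding."""
    classes = sorted(set(values))
    mapping = {name: code for code, name in enumerate(classes)}
    return [mapping[value] for value in values], classes


def build_bundle(model, label_name, feature_names, params, classes=None):
    """Return the model together with what apply needs to know about training."""
    return {
        "model": model,
        "label": label_name,
        "features": list(feature_names),
        "params": dict(params),
        "classes": classes,
    }


task = detect_task("nominal")
codes, classes = encode_labels(["survived", "not survived", "survived"])
bundle = build_bundle(None, "Survived", ["Age", "Fare"], {"max_depth": 10}, classes)
```

Keeping the class list in the bundle is what lets the apply side translate integer predictions back into the original class names.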
13:12 Also, I hadn’t shown you that earlier. You can have your own HTML representation for the model. So if you open the model, like if you have a decision tree in RapidMiner you see the tree visualization, you can actually write your own HTML displaying whatever you want to, whatever you like, and we just display it. By default, this metadata to string just gives you the attributes which are used for training so you know the training header. We can do way more here. Okay. That’s the train.py. Apply.py, well again, this time, we take the name of the label column. We get the names of the regulars and get the classifier. Then we actually need to throw an error if the schema’s different, so we need to check for it and raise an exception. And then we predict and set the roles here. So we need to do set roles, similar to how you do it with an operator. We want all the roles of the prediction to be the same as in RapidMiner. So the prediction columns have the role prediction, in green on the left-hand side, and the confidence columns should have the role confidence, so that they are like any RapidMiner user expects. And you see there’s again a bit of code that you need to add to make it easy for the user, and I really, really would advise you to go that extra step to make it easy. I learnt that the hard way actually from [Ingo?], and we both wrote something like local interpretations. I was quicker, but I built something which had 20 settings and [its own process?], and it can build super great things if you know what you’re doing. You’ll use [Ingo’s?] operator, not my operator. And there’s a good reason. Because [Ingo?] made it easy to use. I didn’t. That’s a lesson I learned. So if you write an operator, go this extra step. This will make adoption way easier. More complexity doesn’t help you here.
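The two apply-side steps mentioned here, checking the incoming schema against the training schema and assigning roles to the new columns, could be sketched as below. The function names are hypothetical. Ignoring extra columns is a design choice made for this sketch, and the `prediction(...)` / `confidence(...)` column names and role strings follow the pattern RapidMiner users are used to seeing; whether the framework expects exactly these strings is an assumption.

```python
# Sketch of apply-side schema checking and role assignment (hypothetical helpers).


def check_schema(trained_columns, incoming_columns):
    """Fail if a training column is missing; extra columns are ignored (an assumption)."""
    missing = [col for col in trained_columns if col not in incoming_columns]
    if missing:
        raise ValueError(
            "Columns used in training are missing: " + ", ".join(missing)
        )
    # Return the columns to feed the model, in training order.
    return list(trained_columns)


def prediction_roles(label_name, classes):
    """Map each new output column to the role a RapidMiner user would expect."""
    roles = {f"prediction({label_name})": "prediction"}
    for class_name in classes:
        roles[f"confidence({class_name})"] = f"confidence_{class_name}"
    return roles


columns_to_use = check_schema(["a", "b"], ["b", "a", "extra"])
roles = prediction_roles("class", ["Mine", "Rock"])
```

Selecting the columns in training order also guards against a data set whose columns arrive in a different order than during training.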
15:20 Last step. There’s this XML having the parameters which we use. It’s just straight up. You have your parameter, the objective, and it can have several values here and the default is auto, and for example, we have the max depth. It’s an integer between 3 and max value; the default is 100. [Makes no sense here?] but okay. And then you build your Python object from there and you’re happy. That was the code I really used when showing it in the beginning. I just dragged and dropped the XGBoost object and it works.
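To make the idea of a parameter file concrete, here is a sketch that reads parameter definitions from XML and collects their defaults. The element and attribute names are invented for illustration and are not the framework's actual schema; only the two parameters (an objective with several values defaulting to auto, and an integer max depth with minimum 3 and default 100) come from the talk. The objective values other than `auto` are ordinary XGBoost objective names added as examples.

```python
# Hypothetical parameter XML; the real schema used by the framework is not shown in the talk.
import xml.etree.ElementTree as ET

PARAMETERS_XML = """
<parameters>
  <parameter name="objective" type="category" default="auto">
    <value>auto</value>
    <value>binary:logistic</value>
    <value>reg:squarederror</value>
  </parameter>
  <parameter name="max_depth" type="integer" min="3" default="100"/>
</parameters>
"""


def load_defaults(xml_text):
    """Return {parameter name: default}, converting integer parameters to int."""
    root = ET.fromstring(xml_text.strip())
    defaults = {}
    for param in root.findall("parameter"):
        value = param.get("default")
        if param.get("type") == "integer":
            value = int(value)
        defaults[param.get("name")] = value
    return defaults


def allowed_values(xml_text, name):
    """Return the list of permitted values for a category parameter."""
    root = ET.fromstring(xml_text.strip())
    for param in root.findall("parameter"):
        if param.get("name") == name:
            return [value.text for value in param.findall("value")]
    return []


defaults = load_defaults(PARAMETERS_XML)
objectives = allowed_values(PARAMETERS_XML, "objective")
```

A dictionary like `defaults` is the sort of thing a training function could merge with the user's settings before constructing the learner.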
15:54 So do you define your parameters here?
15:56 Yes. Yes. The parameters which are shown in the build Python model operator. So those parameters which you see here.
16:10 How do you [inaudible]?
16:11 Well, these are now– like any other parameter, there’s no difference between these parameters. And if I search here for the GBT operator, you see there’s no difference between those I got here, max depth, 10, and what the H2O GBT has, max depth 10. And I can vary it and use it with the whole RapidMiner ecosystem and the whole surrounding operators. That’s somehow the beauty of it, that after building this, you can use all other operators you are already using with it and it doesn’t make a difference [inaudible] Python.
16:47 [inaudible]. This is a more general question. How do you do parameter optimization?
16:52 Let’s have a chat later. We have this Optimize Parameters (Grid) operator. There is Optimize Parameters (Evolutionary). There are a few operators you can just use for it. Good. So we can do models. Great. We can build awesome models. We already had like 200 of them, right? Now we can add another 100 of them. Cool. This makes my life easier if once again one of my clients comes and says, “But I love XGBoost way more than H2O or GBM or [Number 24?],” because I can just add it. But what else do we get? There are actually three more operators with the Python Operator Framework and those cover basically all the actions you do. You can not only create models with it. You can create read operators to read whatever you want to. Web services, files where you have a Python library to read them, like Avro or something, Parquet. If you have something where we don’t have a read operator for, use this create read data object to do the very same thing to have readers. You can do the same thing for write operations, of course. If you want to push data to some web services, you can do this in Python. I mean, you could always do that in Java, but that ties in to the first point, that data scientists are using Python. We want to give them the opportunity to do that in their language and not in Java. You could still do that.
18:22 But with a write operator, you can now push data anywhere, whenever you want to. And then there’s a transform operator which transforms data. Data goes in. Data goes out. In between, something happens. Generate attributes, filter examples, pivot. You know all of these. So if by any chance you want to create your own transformation because you think it’s not already present in RapidMiner which would transform [inaudible] interesting which one I have, you can do this now, and if you write it properly, in a way that any other RapidMiner user can use it. That’s really the punchline here. Even though I showed a lot of code, for you as a user, it should be easy again. And that’s it. I’m 30 seconds faster than I thought. [applause] [music]