Hi, everyone, and thank you for joining us for today’s webinar- Amplifying Predictive Analytics with Data Visualization. I’m Hayley Matusow with RapidMiner, and I’ll be your moderator for today’s session. I’m joined today by Founder and CTO of RapidMiner, Dr Ingo Mierswa. Welcome, Ingo.
Hey, good morning everybody. Thanks, Hayley, for the introduction. Yeah, today is really a topic I am very interested in. This is, I think, one of the three key problems we, as data scientists, often encounter; the other two are really about ease of use and the skill gap around data science. We always, of course, have data quality problems, but the third problem really is about how we can actually get the greatness we are creating, all the goodness of the models, into the hands of more people. How can we actually make sure that all the good patterns we find are also exploited and used for, yeah, improving our business? So this is really what this webinar today is about: the amplification of all our results with the help of data visualization. Before we talk about how all of you can do this, I actually would like to motivate the problem a little bit better. And in order to do so, I would like to introduce a new character to you, a data scientist, and his name is Joe. All right, so let’s have a look into the story of Joe.
Joe is a data scientist. So we will see what Joe is doing every single day, and I hope you can actually relate to this a little bit at least. So many of the people here in the webinar today will probably see similar problems to Joe’s. All right, so Joe’s sitting, like we all do, in front of his computer, working with products like RapidMiner, and sifts through this huge amount of data Joe has accessible to him. So he can really look for new patterns, try to optimize the business by figuring out, “Okay, how can we really do the right things?” That is Joe’s daily job and he’s doing this every single day. Of course, he’s not always finding good stuff, but hey, let’s say, six months later, Joe finally found the churn pattern in the data. Let’s say you have some churn problem and you’re looking for who are the people who are shifting over to a competitor, for example, who are stopping contracts. So you found some pattern, and you believe you can actually exploit this pattern. Well, in Joe’s world, this is really a pot of gold. So this is not a potato but a pot of gold, and I want to make sure that people don’t think my drawing skills are bad. So this pot of gold is important for Joe and not just for Joe, also, for his fellow data scientist colleagues, Fred and Peter.
So Fred and Peter are totally excited. They congratulate Joe. He gets the first prize as the best data scientist ever. The three of them are celebrating the whole night. Everybody’s happy. But about a week later, something happens. A week later, and Joe suddenly became very sad. So why is that? And that’s a typical pattern I, as a data scientist, see in all the other data scientists. You’re first totally excited about the results, but then a little later, you would like to take the results and transform your business, but you figure out actually nobody else really cares. It’s only Fred and Peter or your fellow data scientist colleagues who really like what you did, but everybody else, not so much. So why is that? In many cases, Joe’s models are not used or not even recognized – or the goodness is not recognized – because those predictive models, they tend to be really hard to understand. All those formula, all the mathematics, strange patterns which really take some time to understand fully. That’s why, also, unfortunately, poor Joe, Fred, and Peter often sit somewhere in the basements. Well, other people don’t really recognize the great value they’re creating because they simply do not understand the great value. And this disconnect, of course, is a problem. But Joe thought about, “Okay, what can I do to amplify my voice? What can I do actually, now, to take this great pot of gold I found and give this into the hands of more people so that actually our business can improve?”
Okay. I hope you can relate to the small story of Joe, the data scientist, because I certainly can. I’ve been in this field for 15 years now and I’ve been running into this problem over and over and over again. I found a great model with the potential to change the business but unfortunately, I didn’t really manage very well to get this model into production use. Well, often I did, but not always. So how can I do this in a way that actually works much better? How can I really, well, increase the number of times my models are actually used? And in general, I would like to present a framework to you, and this framework we call the Operationalization of Models. Because think about this, if you created a predictive model, many people believe, “Yes, this is about generating some insights.” Well, it is. Or it’s about just predicting what’s going to happen. Well, yes, it is as well. But if you’re really honest with yourself, does it really matter that much that you just know what’s going to happen? Not that much, because if I just know what happens but I’m not doing anything about this, then nothing really changes. So the only thing that really matters to your organization is that you take some action, that you do something today so that you will get to the best outcome and the best possible result tomorrow. And that’s exactly what we call Operationalization of Models. You take the model, yes, you do all the scoring, create all the predictions, but actually, then there’s something else which needs to happen. And this something else is what we call the operationalization. Often it’s about turning those predictive insights you create into actions and then performing those actions. So that is the whole vision, actually, here for predictive analytics.
If you think about this, how can analytics, then, help us to find the right business action? The right business action in the sense of getting to a better outcome tomorrow. And I would like to motivate this, in general, for those of you who haven’t been there yet, who are new to predictive analytics. Well, even for the experienced people, I would like to help you to understand this whole framework around predictive analytics a little bit better before we see how we can connect this to data visualization products, for example, like Qlik, to show you one particular way of operationalization. But let’s spend a minute or two on the framework first. Okay, in order to motivate this, let’s have a look at a very, very simple example. I think this is the simplest example I could think of, and it’s about the weather forecast. If you know that it’s going to rain tomorrow, should you bring your umbrella? Yes or no? Think about this for a second. The immediate answer– well, you can’t give me the answer right now, but if you like, write it down or write it in the chat, or whatever. Probably the majority of people would just say like, “Well, yeah, sure. It’s going to rain. Sure, I’ll bring an umbrella.” I, on the other hand, say, “Well, it depends.” And what does it depend on? Well, first of all, I now have the weather forecast: it’s going to rain. But there might be more– there might be more– sorry, I’m getting a call at the worst point in time, ever. There might be more information which is available to you. So you know it’s going to rain but maybe the weather forecast also tells you, “Well, it’s windy,” or, “It’s going to be windy.” So in that case, bringing an umbrella is not a good idea because if you, for example, commute by walking, then actually, your umbrella is going to be blown away. So that’s not good.
So if it’s windy and rainy, so maybe you should better take the car then for your commute to work. Well, but if you take the car and it’s windy and rainy, many other people will do the same, so there’s more traffic so you should plan for a longer commute.
So you see that actually a very simple prediction like, “It’s going to rain,” and then the question, “Should you bring your umbrella?” is actually leading to a fairly complex decision framework already, depending on so much other information which you might have available. So that is exactly the whole point: you go from just gathering information and making the decision yourself, to adding forecasts to that information– and maybe you can do even a little bit more with this collected information and optimize, automatically, to find the right course of business action. So this is really coming in four different layers. And the four different layers really start with business intelligence, more traditional data visualization, something we will definitely see today a little bit more, and it ends with something we call prescriptive analytics.
So let’s go through this weather forecast here on the left and also the churn example we’ve already touched upon on the right. So with a pure business intelligence approach, you could look into the data from the past and say, for example, “Well, in the last year, it has rained on 231 days.” That is really an interesting piece of information. Unfortunately, it’s not going to help you at all with creating a weather forecast for the next day. So this piece of information alone is not helpful to you. Well, not particularly helpful at least. For the customer churn use case, you are in a similar situation, actually. You could say, for example, “Well, we lost five million customers last year. That is 23 percent of the overall customer base you have.” And that is good to know, but at the same time, it’s also shocking, and more specifically, it’s too late. You can’t do anything about this because now, all those customers have already moved on. They’re gone. You can’t keep them any longer. So the pure BI approach, as helpful as it is for so many aspects, is unfortunately not really helping you find the right course of business action; at least not an optimal one. So you can get a little bit better insight and build a better gut-based feeling, but you’re most certainly not finding the optimal course of action to get to the best outcome in the future. Well, maybe you can add a little bit more predictive style, at least, into the mix by saying, “Well, maybe I’m not just looking at the last year, I’m looking at the previous three years,” for example, “and I know that,” for example, “it has rained 231 days in the last year and 217 days in the year before and 253 days in the year before that.” Well, it’s probably a pretty safe bet now to say, “Well, it will rain at least 230 days in the next year.”
So that’s what I call a BI-style prediction. You basically take aggregated information from the past and then, on this aggregated information, you find some trend curves or trend lines or some thresholds you think will be reached. Well, that’s definitely better than nothing, but that’s exactly what I meant before: in order to turn insights from the past into predictions for the future, if all you have is this aggregate information, well, it’s a little bit better, but still, it’s kind of like a gut feeling. How do you know if it’s now 200 days or 210 days? You don’t really know. And especially, you don’t know anything about the present. Is it going to rain tomorrow? So it’s really difficult now to find a course of action. The same is true for the customer base. If you lost 23 percent last year, 21 percent the year before, 25 percent in the year before that, then let’s say it’s a safe bet to say you’re going to lose at least 20 percent of the customers.
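As a rough illustration of the BI-style prediction described here, the following Python sketch works only on the yearly aggregates quoted in the talk. The function name and the choice of "average plus lowest observed year" are assumptions made for illustration, not something shown in the webinar.

```python
# Yearly aggregates quoted in the talk, most recent year first.
rain_days = [231, 217, 253]   # rainy days per year
churn_pct = [23, 21, 25]      # churn rate per year, in percent

def bi_style_forecast(history):
    """Summarize yearly aggregates into (average, lowest observed value).

    This is only a trend-style summary: it cannot say anything about
    one specific day or one specific customer.
    """
    average = sum(history) / len(history)
    lowest = min(history)
    return average, lowest

rain_avg, rain_low = bi_style_forecast(rain_days)
churn_avg, churn_low = bi_style_forecast(churn_pct)

print(f"rain: about {rain_avg:.0f} days on average, lowest year {rain_low}")
print(f"churn: about {churn_avg:.0f}% on average, lowest year {churn_low}%")
```

The summary gives a rough expectation for next year, but it has no answer to "will it rain tomorrow?" or "will this customer churn?", which is the gap the next layer addresses.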
So yeah, it’s somewhat predictive but not really. And also, there’s typically not a lot of predictive analytics actually involved. And that’s exactly the next phase now or the next approach. Let’s say, for example, you could train a predictive model now on all the particular days and also the information you had about those days from the past, then you can actually take this model now to create a probability for rain for tomorrow. So now you could, for example, say, “Tomorrow, it’s going to rain,” with a likelihood of 95 percent. All right. So that is really where most data scientists feel very much at home. So this is exactly what we are doing every day: finding those models so that we’re actually able to make great predictions for every single situation. In the churn case we can, for example, create a prediction now which tells us, “You will lose John Smith tomorrow with a likelihood of 92 percent.” So this is really the norm. But if you actually add the predictions to other information and run optimization methods on top of this, that is now really what we call prescriptive analytics because now, you, for example, take other information into account and figure out, “Well, with the other information, I actually should go by car and plan for a longer commute.” Or for example, in the case of our Mr Smith here in the churn case, you could figure out that you should give Mr Smith a call, tell him about all your service improvement initiatives and also offer a 3 percent discount on renewal. 3 percent, not 10 percent. 3 percent is enough to keep John Smith. So finding the right course of action, that’s really the realm of prescriptive analytics.
So that is interesting because if you think about the value, really, then funnily enough, often, the value of the BI-like predictions is actually relatively small. You can’t really act on this. It’s really more insight. And yes, you can make some business changes based on that and this will definitely deliver some value, but the interesting thing is that going down to the detailed level and automatically doing the right thing for every single situation creates even more value. So there’s really an operationalization spectrum then. And this spectrum really is interesting because for some of the elements, I would not even go down that route and actually create those millions of smaller predictions. Those are the cases which are more, well, infrequent and often are a little more, yeah, strategic, a little bigger decisions you need to make. So this is why we start on the bottom left here. For example, a strategic decision like, “Well, should we create a new product line? And what kind of product line? Should we acquire a company?” And often, the best way to achieve this really– well, you definitely can use analytics, in general, but also predictive analytics to figure out what’s most likely going to happen. But really, those are kind of one-off decisions and they take a long time, and automating them probably doesn’t make a lot of sense. For tactical decisions, for example, defining pricing policies or underwriting policies, full automation doesn’t make a lot of sense either. But if you go further down here, the number of decisions typically grows and the duration for the decision gets shorter. Like for single decisions, for example, making a specific hire, defining a price for a specific product on a specific day, those are decisions where actually you can operationalize a little bit more. Scale plays an important role here because of just the number of decisions. And then, you go further up here into operational and fully automated.
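To make the step from prediction to prescription concrete, here is a small Python sketch of the umbrella example. It uses hand-written rules rather than a real optimization method, and the 0.5 threshold, the function name, and the action strings are all assumptions for illustration.

```python
def choose_commute(rain_probability, windy, rain_threshold=0.5):
    """Combine a rain prediction with extra context to pick an action.

    Toy rules only: a real prescriptive system would optimize over
    costs and outcomes instead of using a fixed threshold.
    """
    if rain_probability < rain_threshold:
        return "walk, no umbrella"
    if windy:
        # An umbrella would be blown away; others will drive too.
        return "take the car and plan for a longer commute"
    return "walk with an umbrella"

# The 95 percent rain likelihood from the talk, on a windy day.
print(choose_commute(0.95, windy=True))
```

The same prediction leads to different actions once the wind forecast is taken into account, which is the difference between the predictive and the prescriptive layer.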
Examples for this would be, for example, making a cross-selling offer, approving a credit, or even stopping such actions, which, of course, need to happen extremely fast.
So our churn case definitely falls into this operational bucket. So if you think, for example, about a telco with 50 million customers and they want to predict who is really loyal and who is about to churn, those are cases where really you can’t just have a single person have a look in many cases. But the big, big question here is– I’m actually skipping the next slide because we’ve talked about this already. It’s often the most relevant thing. So you can save a lot of money by operationalizing this and fully automating this. But the big question really is– that is good. So if you have millions of decisions, in this particular case, if you have 50 million customers and you make predictions for every customer every day, basically, you end up with billions of decisions you make. Why not just fully automate this then? So why not take the predictive model you’ve created and put this into some, well, automation system – and RapidMiner can actually do this, as many of you might know – and just automate this and forget about this? And that’s exactly the reason why Joe often has some problems, sits in the basement, and is just upset about this.
There are so many different reasons– sometimes people call them political or whatever they are. There are so many different reasons, actually, why people can’t just automate this whole process and why you want to have a human being actually in the loop. I’ll give you a couple of examples. Let’s say you created a model which is totally disrupting your business process and you go to your boss and tell your boss, “Hey, look, I found this predictive model, if you would just change our business process, then we can reduce our churn rate by 10 percent.” Of course, people are excited to hear something like that, but at the same time, well, somebody needs to make the decision that the whole business should be changed. And in order to be able to make this decision, you’re not just trusting some kind of crystal ball, because that is really what a predictive model, despite all the math and statistics, is for many people. So if the trust is not there, this decision won’t be made. You’re not just automating this whole thing. So you first need to understand what’s going on and if this is really working, and just creating a cross-validation and accuracy estimation is just not good enough for that because that’s another thing people don’t understand. So they need to see it for some time first, build some trust.
Fear is another interesting aspect. So just imagine you’re going to your doctor – and the doctors by now are robots – and some pictures are taken automatically and the decision to perform surgery is automatically triggered and done. So you go into the doctor’s office, and based on what the machine learning algorithm is saying, some robot is performing a surgery on you. Well, although the actual decision might be better than the decision any human being is making, I personally – and I am a data scientist, I am kind of Joe – have some problems with this myself, so I can’t imagine, really, any time soon that I would just accept the fact that there is no human being any longer involved in life-or-death situations or decisions about myself. And that’s kind of funny. Although I actually trust machines and algorithms a lot, I always like to have some human element in decisions like that. So this is often the case. I still believe that machine learning can support the decision-making process by, for example, pointing out the most likely course of action or supporting, for example, doctors by pointing out regions on X-rays, as an example, where most likely there’s some cancer. But we, as human beings, often prefer to still have a human being in the loop confirming what the machine is saying.
So then there’s rarity. Let’s say you’re landing on the moon. That’s not happening every single day. So probably, you still want to have some human being stay in the loop, although most of those algorithms are completely flying that spaceship alone. But a probably better example would be thinking about acquiring another company. This is not happening very frequently and it’s also not just based on the question, “Well, will this company we are going to acquire develop positively?” Let’s say, make some revenue forecast. Well, that’s one thing, and the predictive model can help you there maybe, but there are so many other factors. Are the company cultures a fit? How is the market developing? Etc., etc. There are so many other factors. This will very unlikely be fully automated any time soon.
And then last but not least, there is just familiarity. People are not very familiar with machine learning models yet, but for example, they are very familiar with data visualization products like a Tableau or a Qlik or whatever. And it’s interesting that this might be a very good channel for us as data scientists now for sharing information, for sharing models, for sharing the predictions or prescriptions with other people. And I think I will focus now mostly on this bottom right bucket because it can also help you actually with all the other three. So if you actually present your predictions in an environment which other people are comfortable with, then it’s much more likely that people will accept those results. Okay. So that is really the whole framework, or the motivation really, also, for why you should really care about bringing together, let’s say, machine learning, predictive and prescriptive models, and data visualization. Well, you can amplify your voice, you can bring this in front of many more people, they are familiar with this, and they can at least support us in finding the right course of action then. So how cool would it be then, well, to just do it? Deliver predictions and recommendations for actions into a dashboard for you. And that’s exactly what we are looking at next.
So I’m now shifting gears a little bit and opening a couple of products. And by the way, just an invite, please feel free to ask questions at any point in time. Those are monitored, and Hayley will, from time to time, throw them over to me, but definitely, at the end, we will also spend some time answering as many questions as possible. Okay. So let’s have a look into Qlik first then maybe. So let’s see, where do I have it. Here we go. So this is Qlik Sense, a product which was recently released by Qlik in addition to their QlikView product line. It’s a data visualization product, a product many of you might be familiar with. And I created here a very simple dashboard, actually, around churn, so we stay on the churn use case here. We have some data for the United States– and this is now a typical result, actually, in a more BI-like fashion. So you took data from the past. So for example, we have aggregate information here for the different states of the United States, Montana and so on, and we calculate some properties. For example, how much churn do we have in Montana here? 21 percent. You should see this on the table here on the right. Or in Delaware, 14 percent. Or in Colorado, which is 13 percent. So this information, of course, you can visualize like in this map here. We can definitely see that Montana has the highest churn rate here. We have – what’s that – New Hampshire here. Well, I just moved here a couple of years ago. Don’t ask me for every single state of the United States, I would probably fail terribly. So let’s not go into more details here for now. So that is a typical situation. And of course, I could drill down here and have a look, for example, if I am the Regional Manager, I’m most interested in Maine and New Hampshire and Vermont and Massachusetts and Rhode Island and Connecticut. And now, I’m actually almost impressed that I found all of them right away. Well, this is really your region. Let’s say, you’re responsible for New England.
So this would be your region here. You can make a selection and you can get the updated information here. Well, we have all those states here– New Hampshire actually has highest churn rate actually followed by Massachusetts and so on. The average churn rate is 9.78 percent. Those are some typical applications. So you know about your region now. Well, I have too high churn rate, maybe I should do something about this. But what now? What can you do really?
And that’s exactly where, now, the combination of a data visualization product like Qlik and RapidMiner can kick in. So this is the first dashboard I created here, and then maybe I’ll extend that. Make another selection. So we have now selected those 6 out of the 58 states here, as you can see at the top, and move on to this prediction tab. And in fact, I do something now right away because the last time I did this, I did this for Montana. I have to reload, now, the data calculation for the new selection. So what’s happening now in the background is I take the selection, those six states, the New England states, and deliver this information about the selection to RapidMiner. RapidMiner takes this information, creates a model for churn for those six states, makes a prediction for all the cases we are most interested in, and delivers those predicted cases back into Qlik. That’s exactly what happened in the background. And additionally, we see the most important influence factors here and how they differ for the New England states. So if I scroll down here, you can actually see now a whole lot of other predictions for your region. And you can see, well, also, in the future, New Hampshire will probably get the most churn, and the rest is pretty much equal, compared to New Hampshire at least. So you now get the predictions for your own region on where the most churn is going to be. So that’s what you maybe focus on. And this is now in contrast to the dashboard.
Well, in this particular case, it was New Hampshire. It was very strong in both cases. Not a huge surprise, but at least you got this confirmation here as well. Well, the other states actually changed a little bit. But also, you can see typical reasons for churn or– maybe not obvious reasons, but also like, “Okay, are there some patterns which are sticking out here?” And indeed for your region, New England, you can see that actually this age bracket between 20 and 30 here is much more likely to churn. It’s, in fact, around 2.5x more likely than what you see in all the states. So in this red curve or the red bar charts, this is the information you have for all the states, but in your particular region, this age bracket actually is even more important. Also, the number of males is higher. So if you, for example, now want to work with your marketing team on a couple of churn campaigns, it might be a good idea to focus on this group first, like males between 20 and 30. So that is good. Now we know what’s going to happen, we get some prediction about your region, you make the selection like you’re used to in Qlik Sense, you get the prediction delivered– and I’ll show you later how you actually integrate both products. You get the prediction delivered into Qlik.
But can we go a step further? And of course, we can, because RapidMiner is not just creating the prediction on top of the aggregated information, we can also get the prediction for, really, who are the customers? How many customers per state? Again, most here in New Hampshire, seven, I think, followed by Massachusetts, and so on. And who are the customers with the highest likelihood to churn? And this is exactly what this table here shows, sorted by churn confidence. Those are the people you should focus on first. Those are the people who are most likely to churn. Then if I scroll down a little bit, you are getting to a region where people are actually not very likely to churn, but I also created a so-called upselling model. I didn’t talk a lot about this but you’re probably familiar with the concept. And this upselling model here is created, actually, for all those people who are not likely to churn and who are currently using one of our cheaper product packages, which are P2 and P3. You figure out who is most willing to purchase the more expensive package, P1, as well. So these are now very, very actionable insights. You know exactly who’s going to churn, we can focus on those people. You now have some opportunity to increase your revenues. So if I were responsible for the New England region, I now would know exactly what I need to do and can actually act on this. And this is exactly the difference to the dashboard side, where, as we all know as data scientists, it’s, “Great. I know what has happened, I would focus maybe on Montana and that or whatever.” But here I see, well, yeah, New Hampshire, Massachusetts. But Massachusetts actually becomes less of a problem for whatever reason – maybe other activities you already did – so let’s focus on New Hampshire here. And those are the people who are churning in New Hampshire we can see here and can focus on them. So it’s actionable but based on the predictions of the future.
And this is, of course, what, now, RapidMiner can bring to the table. But instead of having the need to understand every single model, people can stay in their usual working environments, which in this particular sample here is Qlik Sense.
Okay. So how are we doing this? Let’s switch over to RapidMiner. I hope you can all see my screen here. This is the process I have created. A very simple one. In the beginning, I just lost a load the raw data. Then, actually, I take the parameters from Qlik. And the parameters I’m taking from Qlik actually adjust the states. You’re selecting, in the beginning– you could also select the year we’re using for modelling, but actually I didn’t explain this, so let’s skip this for now. Smaller here. So here, now we have basically a filter defined which is using the states which are coming from Qlik. Well, some of you might not be familiar with this format and some of you might not be very familiar RapidMiner at all right now, so this is a so-called macro. It can be filled from outside of the process. In fact, here, you can see that I pre-filled this with all the states of the United States. So as a default, I just deliver everything basically. So I’m not filtering out anything. So that’s what’s happening here. But Qlik Sense can deliver the inputs. Well, those are the states I clicked on and so those are the states I, as the Regional manager for New England, am most interested in to deliver this. And then the rest is, well, kind of typical data science stuff. So if I show you the data– actually, let’s see if I find this right away. I’ll take it from here in RapidMiner. Why not? I have here this churn status column and you can see that in many cases, we don’t know the churn status, meaning we don’t know if those people are going to churn, yes or no. And in other cases, we actually have this information about loyal people, and some of you also have the churn cases here. Okay. So of course, as always, the task is to build a predictive model, taking the information we have into account, and predicting those other people we know about and filtering those. And create the churn model and then we apply this model on the people that we don’t know about. 
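The steps of this process (filter by the selected states, separate known from unknown churn status, build a model on the known cases, apply it to the unknown ones) can be sketched in plain Python. This is not the RapidMiner process itself: the rows, the column names, and the frequency-per-age-bracket "model" are hypothetical stand-ins chosen so the sketch runs without extra libraries.

```python
from collections import defaultdict

# Hypothetical rows; churn is None where the status is not known yet.
customers = [
    {"id": 1, "state": "NH", "age_bracket": "20-30", "churn": "churn"},
    {"id": 2, "state": "NH", "age_bracket": "20-30", "churn": "loyal"},
    {"id": 3, "state": "NH", "age_bracket": "20-30", "churn": "churn"},
    {"id": 4, "state": "MA", "age_bracket": "30-40", "churn": "loyal"},
    {"id": 5, "state": "MA", "age_bracket": "30-40", "churn": "loyal"},
    {"id": 6, "state": "NY", "age_bracket": "20-30", "churn": "loyal"},
    {"id": 7, "state": "NH", "age_bracket": "20-30", "churn": None},
    {"id": 8, "state": "MA", "age_bracket": "30-40", "churn": None},
]

def predict_churn(rows, selected_states):
    # 1. Keep only the states passed in from outside (the macro's role).
    in_scope = [r for r in rows if r["state"] in selected_states]
    # 2. Split into rows with a known churn status and rows without one.
    known = [r for r in in_scope if r["churn"] is not None]
    unknown = [r for r in in_scope if r["churn"] is None]
    # 3. "Train": churn frequency per age bracket (stand-in for a learner).
    counts = defaultdict(lambda: [0, 0])  # bracket -> [churned, total]
    for r in known:
        counts[r["age_bracket"]][0] += int(r["churn"] == "churn")
        counts[r["age_bracket"]][1] += 1
    # 4. Apply the "model" to unknown rows, highest confidence first.
    scored = []
    for r in unknown:
        churned, total = counts[r["age_bracket"]]
        confidence = churned / total if total else 0.0
        scored.append({"id": r["id"], "state": r["state"],
                       "churn_confidence": confidence})
    return sorted(scored, key=lambda s: s["churn_confidence"], reverse=True)

predictions = predict_churn(customers, {"NH", "MA"})
for p in predictions:
    print(p)
```

Passing a different set of states changes both the training rows and the scored rows, which mirrors how the selection made in the dashboard drives the model that is built.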
And those are the predictions we have seen in the second and third dashboards. Now, we do something very similar here on the upselling opportunities. So we know what packages are used and we can try to find out, are there opportunities for people who are actually low profile which is closer to the higher value package, which is P1 here in the end? All right. So other information in the data are not very tricky. So we have information about how much they spent in the previous three years on the mobile billing and the landlines, the age, the package they’re using, the general value– we saw this already in the dashboard. Okay.
So typically, data finds workflow. So what can you do next then? Well, the next step is actually that you can easily store this on RapidMiner Servers. So you install a new server repository– you can do this here and create a repository and take the server here. And after you’ve created a new server, you can save. Just really, that’s all you need to do, that how deployment works. It’s pretty simple, actually. You just save the process on the server. And after you did this, let’s now shift to the server. It only takes a few clicks, actually, to turn this process now into a so-called Web service. Look in here. If I can type my password correctly. All right. So all you need to do is, really, you go to the process we just deployed on the server here, which is this churn and upselling process, you take this process and you can click here on Export a Service on the right side. And after you did this, you’ll get a new service. Here, we have this. And this new web service is really easy to be configured. So this is the process which was going to be executed. You can select what kind of output you want to create, you can create charts or maps or XML or JSON files or whatever else which can be created with RapidMiner. But in this particular case, I just go with a very simple table here. And now, the last thing I can configure is– well, I have two macros for the year used for modeling – I called it M2 – and the states I’m most interested in – that’s PM1 – and I bind URL carry parameters to those two macros. What does that mean?
Well, if I test this web service here on RapidMiner Server, you can see now that I get some URL here and, for example, let’s go with New York– let’s test this thing. Sorry. Now, I’m only getting the information for New York instead of the full dataset. So I’m getting, now, all the information – for example, what the prediction is, the cases where we knew it already and don’t need a prediction, and all other cases where we have a prediction, including the confidence and so on. So every single piece of information is delivered here now from this web service. Now, you can actually take this URL and feed it into Qlik. So let’s move over here. And how are you doing this? Well, in Qlik, there’s this data load editor. And I’m not really a true Qlik expert myself, I have to admit. But I really only spent a couple of hours learning how to use it and I have to say, yeah, I managed to set it up pretty fast. All I’m doing here is, well, I define a new data set- I call it Churn. And I’m loading all different kinds of columns here– and I renamed some of them– from this URL which has just been specified, and I can use those parameters here– those are the states which are selected– as a default, I’ve selected all of them, but later, you will see how you can also– here, you can actually see already how we can select specific, yeah, states or also the year. This is the way you can actually access the information that you’ve clicked on in Qlik. So you use the dollar symbol and then round brackets and the variable name – in this case, it’s the year selected – scroll down a little bit– you see the same also here. It’s here. So I actually do the same for both, selecting the states and selecting the years. So this is the format, that’s all you need to do. 
You define the name, you define what columns you’re interested in from this table which is delivered by the web service, and then just the URL from the web service, which you can simply copy here from this direct link. And that’s kind of it. Then you click on Load Data and that’s it. It takes a while.
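To make the URL mechanics concrete, here is a minimal Python sketch of what binding query parameters to the two macros amounts to: the current selection is appended to the web service URL, and the service returns only the matching rows. The base URL, the parameter names `states` and `year`, and the comma-separated encoding of several states are all assumptions made for illustration; the real values come from the direct link on the server and from how the macros were configured.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URL; in practice, copy the direct link shown for the
# web service on RapidMiner Server.
BASE_URL = "http://rapidminer-server.example/api/rest/process/churn_upselling"


def build_service_url(base_url, states, year):
    """Return the web service URL with the current selection as query parameters.

    The parameter names "states" and "year" and the comma-separated list of
    states are assumptions for this sketch, not the configuration from the demo.
    """
    query = urlencode({"states": ",".join(states), "year": year})
    return f"{base_url}?{query}"


url = build_service_url(BASE_URL, ["New York", "Texas"], 2015)
print(url)

# Decoding the query string again shows what the service would receive.
print(parse_qs(urlsplit(url).query))
```

In the Qlik load script, the same substitution is done with the dollar-sign expansion of the selection variables described above instead of a Python function.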
All right. So now, I loaded the data and I can now go into our app here in Qlik and can look at the data as we have seen this before. So this is the overall workflow. There’s only one small caveat for those of you who use both products and want to try it out: after you’ve made a selection, you need to tell Qlik that it actually needs to reload this data. If you search on Google for a Qlik reload button, then you will actually find a nice way to do this from within the dashboard, which is a little bit more elegant than making you go back to the data load editor. Both of these would work. You can reload the data through the editor, that’s the standard Qlik Sense way, or you add this reload button here to your dashboard, which I find a bit more comfortable than going to the editor. Okay.
So let’s go back into the presentation and recap what we just saw. So how do you amplify predictive analytics with Qlik? In general, probably with all these visualization tools, we make use of this deployment mechanism of RapidMiner Server. So you design a process first – and that’s a regular process in RapidMiner – then you just save it in the RapidMiner Server repository, and on the server, you go into this web interface and turn this process into a web service. So that’s basically one single click and we return a table. Now, on the Qlik side, you add this RapidMiner service as a data source, you design a dashboard like you always would do this in QlikView or in Qlik Sense, and then, whenever you make a selection, you can also feed it through this mechanism into RapidMiner processes. So basically, you react to user selections- whatever the user’s most interested in. You do all kinds of other processing and then deliver the results back to Qlik. And just a little side note, since we can build all kinds of processes, of course, you can, quote-unquote, only use this for creating predictions, but you can, of course, also do all other kinds of data preparation work, especially when it’s data prep around, let’s say, more advanced statistics, for example, finding and removing outliers or– well, what else? Like normalizing data. All those kinds of things. You can do this kind of data prep, of course, with RapidMiner, as many of you know. Then run those RapidMiner processes to prepare the data for Qlik as well. So we, of course, also have the whole data blending and cleansing functionality as part of RapidMiner; you can do this. Or since you have RapidMiner processes with 1,500 operators in total, there are a lot of operators, for example, for sending out emails or triggering other web services. So whatever you can do in a RapidMiner process, you can also embed this course of action into Qlik through this web service integration. 
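As a rough illustration of the two data prep steps named here, removing outliers and normalizing, the following Python sketch uses only the standard library. It is not how the RapidMiner operators are implemented; the z-score rule, the threshold of 2.0, and the sample values are assumptions chosen to keep the example small.

```python
from statistics import mean, pstdev


def remove_outliers(values, z_threshold=2.0):
    """Drop values whose z-score exceeds the threshold (threshold is an assumption)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs(v - mu) / sigma <= z_threshold]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical monthly spend values with one obvious outlier.
monthly_spend = [10, 12, 11, 13, 12, 100]
cleaned = remove_outliers(monthly_spend)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```

A process doing this kind of preparation can be exposed through the same web service mechanism, so the dashboard receives already cleaned data.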
And in general, for those of you who are a little less familiar with RapidMiner – and this might be the first touchpoint for you – in general, this is really an important point for the whole RapidMiner platform.
So we really go through all three very important, well, phases of the data science workflow or the analytical lifecycle. So the first phase is– most often, the first phase is really ingesting, and more importantly, preparing the data. So really connecting all the dots, connecting all the data points or data sources, making sure that you blend them together in the right way, cleaning the data. So really, there are hundreds of operators in RapidMiner for doing the data prep parts. No matter where the data is coming from, no matter what the scale is, you can do the data prep in Hadoop clusters, in memory, wherever you want. The same is true for the modeling and validation phase, which is, of course, key to figuring out what’s going to happen and finding the right course of action. So finding the right machine learning model is the central point. There are 250 models in total in RapidMiner; you can try them all out. You can also do a lot of model optimization and automation, like automating this whole process, doing the right parameter selection, and figuring out what the right features are to build the model on. So most of you will know this, but just for the few of you who have never seen RapidMiner in action before, it’s really very powerful there. And of course, the validation is equally important so we know exactly how well those models will work in practice and in the future. And then the last bucket, the one we focused on most today, is really the operationalization bucket.
And in general, of course, you can deliver all those results into all kinds of business applications. You can also automatically trigger the execution of certain actions. Just imagine, for example, instead of just visualizing the churn cases, why not just create a campaign for those people automatically, let’s say, in your marketing automation software – HubSpot, you name it – and send out this campaign? So you could, in theory, automate this whole process. But as we have discussed before, that’s not always the best and most appropriate approach. So sometimes, really, the best approach is just to embed the predictive insights and actions into a data visualization platform like Qlik. So this is one particular example of operationalization. And then, before we wrap up and open this up for questions and answers, I think, really, you should give it a try. Also, this integration with RapidMiner in general. We talked a lot about the operationalization bucket, which is definitely one of the things which makes us very unique: how many systems you can connect to and how well the predictive modeling is then integrated into the usual business workflows. You know that RapidMiner really is about the self-service predictive analytics aspect. So yeah, it’s code-optional; you can code if you want to, but you don’t have to. So really, it’s very easy, effortless, as we call this. We guide you with things like our Wisdom of Crowds so you can actually get recommendations for what’s the next best step. Very important for us. This speeds you up in finding the right model. We saw in the beginning, why is Joe still taking six months? Well, because sometimes, there is nothing in the data. But in those six months, he was running through so many different data sets and tried so many different use cases. And that’s only possible because we were accelerating this whole model finding process. 
And of course, as an open-source leader here, we really embrace a lot of innovative solutions coming from ourselves, but more importantly, actually, also from our community and embrace also other modern solutions, especially in the big data space.
Yeah, I think I will wrap it up at this point. So it’s a great platform. Qlik Sense is a great platform as well. To bring together two great platforms seems to be a very good idea. For those of you who have Qlik Sense already or QlikView – it works, practically, in the same way – you should definitely give it a try and move there because this is really the key, as we discussed in the beginning. The absolute key to really get your voice heard and make sure that your models are, well, recognized and then, hopefully, also fully operationalized to optimize your business processes. So at this point, I would like to open it up for questions and answers. Hayley.
Thanks, Ingo. Thank you for a great presentation today. You covered a lot of really great information and we appreciate your time. So as Ingo mentioned, we’ll take any questions that you might have now. You could submit your questions through the questions panel. We’ll take those questions now. So it looks like we have a question here for you, Ingo. Is this integration possible with QlikView Enterprise Server or is it only possible with Sense?
No, it’s also possible with QlikView Enterprise Server. And it works practically, really, in the same way. In fact, the first– I’m sorry, I don’t have QlikView right now, otherwise, I would show it to you. The first demonstration of this, we actually created it with QlikView. And it works, practically, in the same way. On our documentation server – docs.rapidminer.com – either it is already published or it will be published in the next couple of days: there’s a complete set of documentation on QlikView and Qlik Sense and how to integrate this. So it’s possible with both, and then there’s also the documentation.
Great. Thanks. I have another question here for you. This person says, “I use QlikView mainly. How does one integrate QV and RapidMiner? Is there a QV connector for RapidMiner? And do you have an example app for this integration?”
Unfortunately, I don’t have an app I could show you right now. Too bad. Yeah. Good point. So there are multiple ways. First thing, I should also mention, for RapidMiner, we also have the possibility to write a QVD file– I think that is the extension, I don’t know exactly. But we can actually export all kinds of data into the QlikView format right away. And from there, you can load it again. That is always possible and it’s a good way. Or in addition, similar to the answer before, you can also make the integration via the web service, and that looks practically the same as we saw just now for Qlik Sense; no significant differences. Different UI, but that’s about it. The concept stays exactly the same.
Great. Thanks. We have another question. Can we get a trial software?
You’re asking for Qlik Sense or for RapidMiner? The answer, actually, is in both cases, yes. So for Qlik Sense, you can definitely download a version from the Qlik website. For QlikView, I think it’s the same, not 100 percent sure though. For RapidMiner, that’s also the case. Yes, there is, first of all, our community version, which is freely available to you, also without any time limit. And then, for our commercial offering, there is a trial version as well.
Great. Looks like we have another question. Can RapidMiner integrate with any other applications like Tableau?
Yes, indeed. Since Tableau was used as an example, the integration with Tableau is not, let’s say, completely bi-directional. So we can, for example, deliver data into Tableau through the OData format. We can also write to Tableau or into the Tableau format out of RapidMiner processes. What I mean by not completely bi-directional: right now, to the best of my knowledge, we can’t take, for example, user selections out of Tableau and deliver them into RapidMiner web services. Well, our web services can do this, obviously, as we saw here for Qlik, but there is no integration on Tableau’s side for this, what I would call, bi-directional communication. Besides that, since Tableau was just an example, there are many, many other data visualization products out there and almost all of them by now support this web service-based integration. And so, the integration, in theory, always works in more or less the same way. But in addition to this, we also have connectors to a lot of other business applications, let’s say, Salesforce, Marketo, HubSpot. We have a lot of connectors, really, for on-premise software, but we have even more – in fact, by now, more than 500 – connectors to cloud-based business applications. So I would be so bold as to say, you name the application and the answer is very likely going to be yes, we support that as well.
Great. Thanks. Another question. Are you able to embed R and Python code in RapidMiner processes?
Of course, we are. Yeah, that is actually a feature which is relatively frequently used because so many data science teams have R and Python coders as part of the team. Let me just create a new process here. Not even sure if I actually installed it. Yeah, here, for example, I have the Python scripting installed here. So there’s an extension on our marketplace. So in general, whenever you are missing something in RapidMiner, it is always a good idea to click here and get more operators. This will bring up the RapidMiner marketplace. And on the marketplace, there are additional operators which are created by RapidMiner or by third parties or community members. So the Python and R extension– here’s the R one. You can see that one up here. Execute R. I think that’s actually– yeah, no, it’s the right one. So the R and Python extensions are both RapidMiner extensions, so we are hosting those. And really, all you do is you put them in as part of your workflow. So let’s say you have some data, you can even feed the data into the Python script here and then you can do whatever you want to do inside of the script. So you can add your own homegrown data prep scripts or your most preferred model, in case you have something in Python or R, and can then deliver – as we do here – the results at the end, which, then, can be fed into the RapidMiner process again.
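For readers who have not used the Python scripting extension, here is a minimal sketch of what such an embedded script can look like. To the best of my knowledge, the Execute Python operator calls a function named `rm_main` and hands each connected input to it as a pandas DataFrame; the column names `mobile_spend`, `landline_spend`, and `total_spend` are hypothetical and only loosely modeled on the spending attributes mentioned in the demo.

```python
import pandas as pd


def rm_main(data):
    """Entry point the scripting operator is expected to call with the input table.

    `data` is assumed to be a pandas DataFrame; the column names used
    below are hypothetical.
    """
    result = data.copy()
    # A homegrown data prep step: combine two spend columns into one.
    result["total_spend"] = result["mobile_spend"] + result["landline_spend"]
    # The returned DataFrame is passed on to the rest of the process.
    return result


# Standalone check outside of RapidMiner, using made-up values.
sample = pd.DataFrame({"mobile_spend": [20.0, 35.5], "landline_spend": [10.0, 0.0]})
print(rm_main(sample))
```

Inside a process, the trailing standalone check would be dropped and only the function kept in the script.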
And I think this is a really important question because sometimes, you have such a special data prep step, as an example– often, you can build a RapidMiner process for that, but if you already have an R script or a Python script which is solving a particular problem, why not just embed this? And I think the key point I would like to make around this is, if one of you spent the time to code this, wouldn’t it actually be a nice idea that you can share this with so many other people? Maybe even non-coders as well? That’s exactly the idea. You can now create your favorite scripts, manage them in RapidMiner, and can even save a predefined operator with those scripts as a building block so that you and other people can reuse them again. This accelerates the work, supports the collaboration as a team and then, also, really, lets you get the best out of both worlds: the R and Python scripting world as well as the non-coding world, which is often faster even for those who can code. Like you don’t need to code whatever, how to do multiple nested cross-validation including feature selection, parameter optimization, etc. This is something you could actually click together with a couple of clicks in RapidMiner, but you would need multiple hundreds of lines of code to actually implement this in Python.
So I think it’s all about combining the best of both worlds and that’s what I called code-optional in the beginning. You don’t have to code for practically all use cases, but if you want to, you can always do it. Last quick comment on that one: since we also have a Radoop offering which is pushing down computations into Hadoop clusters, we also support PySpark and SparkR. So even if you want to push down the execution of R or Python scripts into Hadoop clusters, you can use RapidMiner for actually governing even this process. And that’s really what we see often in practice, that people really use RapidMiner as the governing platform, also taking care of version control, scheduling, the operationalization aspect, integration through web services and data visualization tools, doing all the standard analytical tasks. And then, somewhere in the middle, they use a very, very specialized R script or Python script solving a problem only they have. There, you actually have to code because there’s no other way you could solve it. And this is really a very efficient way to work in a situation we encounter quite frequently.
Great. Thanks. I have another question here. Can the integration with Qlik be used in the community version of RapidMiner?
Yeah, well, there is one way, at least, to do this, which would be, basically, by writing out the model- I think that’s actually a part of the community version, I think so. So that would be one way, but it’s, of course, not as elegant as the integration via a web service because that’s a truly bi-directional approach where you can also react directly to the user selections in Qlik. For that, you would need the RapidMiner Server product, which is not freely available as a community product. This is one of our commercial offerings.
Great. Thanks, Ingo. I have another question here. Can you do market basket analysis in RapidMiner?
Yes, you can. There’s actually, again, on the marketplace– I don’t think that I’ve installed this right now. Let me see quickly. No, I don’t have this installed. So there is a complete extension around this. I don’t remember the name, otherwise, I would look it up right now. But that’s an extension on our marketplace for exactly that. You also have a couple of standard algorithms like FP-Growth and association rule learning as part of RapidMiner itself. So if you look for FP-Growth here, you can already see the small shopping cart here. Those are the operators which are available to you as part of the RapidMiner core, and then there are extensions which further optimize this functionality.
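For readers new to market basket analysis, the two quantities behind association rules, support and confidence, can be sketched in a few lines of Python. This is a brute-force illustration of the definitions, not the FP-Growth algorithm and not RapidMiner’s implementation; the baskets are made-up data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate how often the consequent appears in baskets containing the antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        # The antecedent never occurs, so the rule has no evidence.
        return 0.0
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


# Made-up shopping baskets for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support({"bread", "milk"}, baskets))
print(confidence({"bread"}, {"milk"}, baskets))
```

FP-Growth finds the frequent itemsets without scanning every candidate like this, which is what makes it practical on real transaction data.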
Thanks, Ingo. I have another question for you. How does the complexity of a model affect the actionability of the insight?
Was the question about how does the accuracy of the model affect this or–? Sorry, I just didn’t get it.
Sorry. It’s how does the complexity affect the actionability of the insight.
Well, in at least two different ways. Great question, actually. In at least two different ways. The first way: a more complex model, if you’re not doing your validation of the model right, tends to be less robust and tends to produce more overfitting. So what that means is really that you’re very good on the data you trained it on, but as the data changes a little bit – data always changes because the world changes – the model is not very robust. So I often prefer to really, even if it’s not as accurate, go with a more robust model, which is maybe coming with a little less complexity, because I just would expect that for small data changes or smaller changes in the world, my model is not immediately outdated. That matters in both kinds of operationalization: if you automate the processes, you want to be sure that everything is working as expected, but also if you visualize the results to the user. If those results change too frequently, people, again, they’re in this process of building trust, they wonder a little bit, like “Well, what’s going on? Why is this model now– it created some prediction yesterday, we’re getting a different prediction today.” That is not wrong, but it’s just another result of the fact that the model is not very robust, so it reacts faster to those little changes, and that’s not really good. Well, that’s probably the biggest effect.
What was the second one I had in mind? Of course, a higher complexity also matters if you actually need somebody to sign off, somebody who needs to say – well, especially for the fully automated process – “Well, I trust this model. This looks good.” Often, what I’ve found in practice is you can prove – of course, with the right validation techniques – well, this model is going to perform well in 98 percent of the cases in the future, I properly cross-validated this, etc., etc. You do all your work as a data scientist, and then sooner or later, somebody asks, “Yeah, but can you explain the model a little bit to me?” And of course, with a decision tree with a depth of three, that’s easy; people can typically follow and understand it. But then, explain, let’s say, a deep neural network or explain a support vector machine with, I don’t know, a radial basis function or another kernel function. And all of a sudden, things get a little bit difficult and nobody can really follow what this model is doing. So even if people get this abstract number – well, it’s 98 percent correct – then, they look into the model, they don’t get it and they really have a problem trusting it. This is, again, a little bit more important when it comes to the automated version of operationalization because when the human being is still in the loop, what you can always do is, of course, you create the predictions and often, you can also create the predictions plus why this prediction is how it is. So basically, it shows, at least, the influence factors which led to the prediction as it is. And by doing that, you can build enough trust in the human being. So it’s really those two effects: understandability is one thing, and then the other one is just the normal overfitting problem and missing robustness. That’s why I, personally, in practice, often prefer a slightly less complex model over a super accurate one coming with a higher complexity. 
But that’s just based on my own experience.
Great. Thanks, Ingo. So it looks like we’re just about at the top of the hour, so I’m going to go ahead and wrap up the question and answer. If your question was not addressed on today’s session, we’ll have someone follow up with you with an answer to your question. So as a reminder, the recorded version of this presentation is going to be sent to all registrants within the next few business days and RapidMiner will also be at Qlik Connections next week. We’re a sponsor of Qlik Connections down in Orlando. So if you’re there, make sure you stop by our booth. And thanks again, for everyone, for joining today’s presentation and have a great day.
Amplify Predictive Analytics with Data Visualization
Data science teams are often frustrated at the length of time it takes to get their expert models into the hands of business users. With Qlik and RapidMiner, those days are over. Organizations who have invested in data visualization can now easily use predictive analytics to uncover hidden insights within big and disparate data. Dr. Ingo Mierswa, RapidMiner President and Founder, demonstrates how to put predictive and prescriptive analytics directly into the hands of every Qlik user. Make more meaningful decisions, drive more actions and significantly improve the likelihood of achieving desired outcomes.