In this Gartner Data Science and Machine Learning Bake-Off session, we had the opportunity to show off our data science capabilities. Watch the video below to see the RapidMiner platform in action.
See RapidMiner in Action
The end-to-end analysis used a machine learning model created in RapidMiner and discovered that some expected factors had a strong influence on life expectancy, including economic factors like GDP and unemployment, as well as personal lifestyle factors like activity level and alcohol consumption. But there were some surprising factors that were uncovered as well.
In the US, maternal and infant mortality are a major factor in the country’s reduced life expectancy when compared to other developed countries. With the global COVID pandemic causing economic turbulence, lowering GDP, and worsening unemployment, it’s clear that the US needs to remain focused on, and continue to improve, the quality of care during and immediately after childbirth by ensuring that resources from the country’s high health care expenditures are properly routed, thereby creating a healthy future for American mothers and their babies.
Hello, everyone. I am Martin and I’m running the data science team here at RapidMiner. What does running the data science team mean? Well, I help customers to reinvent enterprise AI so that anyone has the power to positively shape the future. And with anyone I really mean the full spectrum of people. On the one side, we have data scientists. I also work with subject matter experts and with business analysts, and all of them want to build sound models, and not just sound models but models which you can put into business processes and which are actually viable and valuable. So that’s what I’m helping to do. And we do this by overcoming three big problems customers are facing. Well, one is that people don’t find the right starting point, and what we do is run an AI assessment to find the valuable and feasible use cases. Within our software, we have extensive support through our community and the academy, and we have guided in-product exploration for use cases. And with our intuitive software, you can just run many use cases, test them quickly, and see if you can solve them or not. Other customers struggle to bridge the expertise gap because, let’s face it, machine learning is hard, and with our center of excellence methodology we are directly approaching it. And with our easy end-to-end platform, there are more people who can actually contribute to data science projects. And other customers find it hard to sustain the value over time. With the model ops feature we will see later in the AI Hub, we can overcome this because we are embedded into the business processes. And with these features, we make our customers successful, and we will show more of this later.
Okay, let’s get started with RapidMiner, load some data, do some feature preparation, and also some feature selection. To do this, we first need to, of course, load data into RapidMiner. We can do this with this big green “load data” button here on the left-hand side. I would like to import data from my computer. Let’s start with some economic indicators for the United States. It’s a CSV file, and RapidMiner is already doing some autoparsing, so it detects everything correctly. I can click Next, Next, and Finish and get the data into RapidMiner. What do I have here? So for the United States in 2007 we had around 4.3 million tractors. In 2006, 4.4 million tractors. And if I scroll down here I’ll also see the fertilizer consumption and so on and so on. So economic indicators for the United States for different years. But what I need for machine learning is one line per country per year. So in order to do this, I need to do a pivot operation. I can do this very easily by clicking on “pivot” and then I want, for every country name, for every year, for every indicator, the value. And you see it’s a very easy point-and-click interface to do kind of complex reshaping of tables. After committing my pivot I come back to this view here and you see on top that there are red bars. If I mouse over it you see that the red bar is caused by 70%, 75% actually, missing values. And, of course, I need to do some cleansing to get good columns or good features to do machine learning later on. So I can click on “cleanse” and I can say, “Okay, please remove low-quality columns.” Clicking on apply, RapidMiner automatically does some feature cleansing here. You see that the fertilizer consumption is gone because it has 75% missing values and that’s not really helpful.
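The pivot and cleansing steps described above are done by point and click in the demo. As a rough illustration of what they compute, here is a minimal plain-Python sketch. The function names, the column names, the 70% threshold, and every number except the two tractor counts are assumptions made for illustration, not RapidMiner's actual implementation or data.

```python
from collections import defaultdict

# Long-format rows as they might appear in the CSV: (country, year, indicator, value).
# Only the tractor counts come from the talk; all other values are placeholders.
long_rows = [
    ("United States", 2006, "tractors", 4.4e6),
    ("United States", 2007, "tractors", 4.3e6),
    ("United States", 2006, "gdp_per_capita", 46000.0),
    ("United States", 2007, "gdp_per_capita", 48000.0),
    ("United States", 2008, "gdp_per_capita", 48500.0),
    ("United States", 2009, "gdp_per_capita", 47000.0),
    ("United States", 2007, "fertilizer", 120.0),
]


def pivot(rows):
    """Reshape to one record per (country, year) with one column per indicator."""
    table = defaultdict(dict)
    for country, year, indicator, value in rows:
        table[(country, year)][indicator] = value
    return dict(table)


def drop_sparse_columns(table, max_missing=0.7):
    """Remove indicators that are missing in more than `max_missing` of the rows."""
    indicators = {name for record in table.values() for name in record}
    n_rows = len(table)
    keep = set()
    for name in indicators:
        missing = sum(1 for record in table.values() if name not in record)
        if missing / n_rows <= max_missing:
            keep.add(name)
    return {
        key: {name: value for name, value in record.items() if name in keep}
        for key, record in table.items()
    }


wide = pivot(long_rows)            # one row per country and year
clean = drop_sparse_columns(wide)  # fertilizer is missing in 3 of 4 rows, so it is dropped
```

In this toy data the fertilizer column is missing in 75% of the rows and is removed, while the tractor and GDP columns survive the cleansing step.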
I want to do this not just for the United States but actually for many countries so that I can later do machine learning on more data. Of course, what I do not want to do is this point and click over and over and over again. What I can do is create a process out of this by clicking here on “create process”, and you see that we automatically get a process which automates the same operations we did before. And this is now reusable and can be operationalized or reused in another workflow by other users. So different personas can now collaborate using this point-and-click concept or the process concept. What I can do is put this into a loop files operator to loop over my disk and do this operation for many countries. Afterwards, I can combine the tables into one big table, and actually what I want to do is add quite a few more indicators around tobacco and alcohol and so on from the World Health Organization. I do the same trick over again. As a next step, I want to add some more feature engineering. We use some evolutionary algorithms here to detect automatically which features are helpful for me and which are not. This is actually where we lose our tractors, because tractors are not really helpful to predict life expectancy later on. And then we split into two parts, one for training, one for testing. I’d like to show you something which we want to discuss later on. Let’s have a look at the data set we use and let’s again do a pivot operation. So let’s look at every country group. And country group means in this case the United States, Japan, or the rest of the G7. So France, Germany, the UK, and so on.
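The loop-over-files, combine, and train/test split steps from this part of the demo can be sketched in plain Python as follows. This is only an illustration: the in-memory "files", the column names, the placeholder values, and the 25% test fraction are all assumptions, and the evolutionary feature selection step is not reproduced here.

```python
import csv
import io
import random

# Stand-ins for per-country CSV files; in practice these would be read from disk.
# File names and values are hypothetical placeholders.
fake_files = {
    "usa.csv": "country,year,gdp_per_capita\nUnited States,2006,1.0\nUnited States,2007,2.0\n",
    "japan.csv": "country,year,gdp_per_capita\nJapan,2006,3.0\nJapan,2007,4.0\n",
}


def load_all(files):
    """Parse every file the same way and append the rows to one combined table."""
    combined = []
    for name in sorted(files):
        reader = csv.DictReader(io.StringIO(files[name]))
        combined.extend(reader)
    return combined


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = load_all(fake_files)
train, test = train_test_split(rows)
```

Fixing the seed keeps the split reproducible, which matters when the same held-out rows are later used to compare several models.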
And let’s have a look at their life expectancy. There you see already that the United States is roughly two or two and a half years behind the peer group performance of the rest of the G7 and even four years behind Japan, the best in class. I can even look at this on a yearly basis. Let’s also look at it per year. Let’s actually commit the pivot and have a look at a chart here. A line chart of the year against life expectancy. The colors here are the different countries. The blue line at the bottom is the United States, and you see that it is below the others. What we want to do is figure out why. We want to know what’s going on here. Okay, let’s get started with some machine learning in RapidMiner. What I’ve already done before is I’ve loaded in the data set which we want to analyze to predict life expectancy. That’s the data set we prepared before, you know, the one where we dropped the fertilizer and where, later, the feature selection also dropped our tractors. And now we want to predict life expectancy from these other columns. You see there’s quite some interesting stuff in here: GDP, income share of the lowest 20%, unemployment rate. But also some health indicators. So let’s actually predict life expectancy here and select it as the column to predict. We see the histogram here of different life expectancies. Clicking on next will lead us to the input selection where RapidMiner, of course, runs the data quality tests. It reminds us that there’s a high correlation between GDP and life expectancy, which is a bit of what we, of course, expect. If we scroll down here, what I want to exclude is the country name and the year, because we do not want to build a model which says the US is doing great. We want to build a generalizable model. After clicking on “next” I can choose what kind of models I would like to take.
Let’s focus on high accuracy and let’s get this going. So this actually takes 20 or 30 seconds on the cloud. So let’s move into the results here, where we first of all see different performance metrics for the different machine learning methods outlined here so that we can compare them. So we have the root mean squared error, but we also have the average absolute error, relative error, R squared, and so on. Let’s have a look at the average absolute error. So what do we see here? We see that the support vector machine is the best model, with an average error of 0.22 years. Remember, we predict life expectancy in years. And the gradient boosted tree is a bit worse with 0.291, kind of like the challenger model. What’s also nice to see is that a relatively simple model like the generalized linear model is actually performing fairly well with an absolute error of 0.366. But the winning one is the support vector machine. Let’s have a look into our support vector machine. What do we get? So if I go here to the support vector machine, then we first of all see again the different quality metrics showing how good our model is. But then you also see column weights. So how important was each column to actually predict life expectancy? We see that the top one is the prevalence of insufficient physical activity, then GDP, then raised fasting glucose levels, so something like diabetes is important to predict. These are model-agnostic weights, so we can use the very same methodology for the GLM, for the SVM, and also for the gradient boosted trees. We also see here predicted versus actual value. So if in reality the life expectancy was 80, we also predict roughly 80 here. So the model seems to be relatively good.
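For reference, the average absolute error used above to rank the models is the mean of |actual − predicted| over all test rows, expressed in years of life expectancy. A minimal Python sketch of that comparison follows. The model names and all numbers are made-up placeholders chosen for easy arithmetic, not the results from the demo.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out life expectancies (years) and two hypothetical models.
actual = [80.0, 78.0, 82.0, 79.0]
predictions = {
    "model_a": [80.5, 77.5, 82.5, 78.5],
    "model_b": [81.0, 77.0, 83.0, 78.0],
}

errors = {name: mean_absolute_error(actual, preds) for name, preds in predictions.items()}
best = min(errors, key=errors.get)  # model with the smallest average absolute error
```

Because the error is in the same unit as the target, a value such as 0.22 reads directly as "off by about 0.22 years on average".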
So now what we, of course, want to do is apply our model. You can actually export this model and push it back into the process view so that somebody who prefers processes can look at it, edit it, use it, and so on. But actually there’s no need to do this for this application, because I can just click on “Apply model, apply on a new dataset”. I can select the data I would like to apply it on and just get it going.
So this then leads to this screen here where I can see my predictions: 78.07, 78.74, and so on. But I also get indicators of what drove the model to this prediction. So the key influence factors are the GDP and the unemployment rate for this specific transaction. These are not the global weights. These are really specific to a single row. And this way we can easily see what is driving our models. What I can, of course, also do is deploy the model. Instead of going here in my support vector machine again to “apply model, apply on new data set”, I can just deploy it, and if I click on “deploy model,” I have already exposed it, and there is my link which I can use and an example JSON which I can send.
In our next session together, we’ll talk about how to monitor this deployment so you can be sure that it’s still a good deployment.
Welcome back. Let’s do some model management within RapidMiner. So where did we leave off last time? We had our support vector machine, the best model we found in the modeling section, and we deployed it. In between I did the same thing for the GLM and the gradient boosted trees, you know, the GLM being the model which is simple and kind of okay, and the GBT being our runner-up model. Both of them are deployed in challenger mode, which means the primary model which answers is the support vector machine, but we also use the GLM and GBT to score and to get notifications on whether one of these is better or something like that. We see that before deployment our error of the SVM was 0.3%. What we see now is that our most recent error of the last month is actually 0.7%. So what’s going on? Let’s investigate this together. Let’s first have a look at the performance over time. So what we see here on the top left is the number of scores we had over time. And if I have a look at this, this is relatively flat. We can, of course, check for spikes in scores: is there a day where there were way more scores, or is the number of scores rising over time, or something like that, which would of course be a bit weird.
But for now, for us, the scores are perfectly fine. Also, maybe even more important for us, we can check the error distribution over time, and you see that this is relatively flat. All the time it was above 0.5%. And remember, our pre-deployment error was 0.3%. It’s not like we have a spike in errors or an upward trend. Neither is there. So, okay, neither of those is the problem. So what is the problem? Scoring times: well, the support vector machine is slower, but that’s also not a big thing. And then we also see the distribution of predictions versus actuals. Also pretty much the same. So we don’t see a problem here. So let’s investigate further. Let’s have a look at drift. Drift is really where the training data distribution and the application data distribution differ. And we see here that we have quite a few drifting columns like, for example, the tuberculosis treatment coverage. We see that the training and the scoring distributions are vastly different. So, apparently, our data changes, but is the tuberculosis treatment coverage really the problem for our misbehaving model? Likely not, because if I go here to drift versus importance we see, on the one axis, how much a column is drifting, but on the other axis, how important this column was for the prediction. And that’s really again the weights we’ve seen earlier. And there we see that it is the raised fasting blood glucose level, the GDP per capita, and most importantly the prevalence of insufficient physical activity creating this issue. So we need to fix that and we need to check our data. So first of all, let’s set up an alert so that we know what’s going on. So let’s add an additional alert: if the average error is greater than 0.2, then I get an email. Done. I add an alert. The other thing you want to do in these cases is some auditing. Figure out what went wrong with the model. What RapidMiner gives you is a complete audit of the model.
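The drift-versus-importance idea and the alert rule described above can be illustrated with a small Python sketch. The talk does not say how RapidMiner measures drift, so the standardized mean shift used here is a stand-in assumption; the column names, sample values, and weights are likewise hypothetical. Only the 0.2 alert threshold comes from the talk.

```python
from statistics import mean, pstdev


def drift_score(training, scoring):
    """Absolute shift in the mean, measured in training standard deviations.

    This is a simple stand-in metric, not the one used by the product.
    """
    spread = pstdev(training)
    if spread == 0:
        return 0.0 if mean(training) == mean(scoring) else float("inf")
    return abs(mean(scoring) - mean(training)) / spread


def should_alert(recent_errors, threshold=0.2):
    """Alert when the average recent error exceeds the threshold."""
    return mean(recent_errors) > threshold


# Hypothetical columns: training sample, scoring sample, and global model weight.
columns = {
    "physical_inactivity": {
        "train": [30.0, 32.0, 34.0, 36.0],
        "score": [40.0, 42.0, 44.0, 46.0],
        "weight": 0.9,
    },
    "tb_treatment_coverage": {
        "train": [80.0, 82.0, 84.0, 86.0],
        "score": [60.0, 62.0, 64.0, 66.0],
        "weight": 0.05,
    },
}

drift = {name: drift_score(c["train"], c["score"]) for name, c in columns.items()}

# Rank columns by drift multiplied by importance: a strongly drifting but
# unimportant column ends up below a moderately drifting, important one.
ranked = sorted(columns, key=lambda name: drift[name] * columns[name]["weight"], reverse=True)
```

In this toy data the tuberculosis column drifts more in absolute terms, but the physical activity column ranks first once the weight is taken into account, which mirrors the reasoning in the demo.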
So I can go back here to models and click on my support vector machine and first of all I see the complete model itself. But I can also dig deeper here. I can have a look, what is the data this model learned on? Because maybe I’ve used outdated data or data which is not nice or something. I can dig into all the other results we had. Again, the weights, the training data, single row of the original data, the run times, everything is available here to check for you what is going wrong. We even go one step further. You remember that you could download the process when we’ve built the model.
Well, we actually also stored this process for you. So you can go to “process” and say, “Load in Design view.” And then you see here the process which defines the training. And you can now dig down and really see, “Okay, where is something going wrong? Is there something going wrong?” And check it and also show it to other people. So with this auditability and the scoring and the checks you have a very, very strong method to check your models.
Hello, welcome back to RapidMiner. Let’s talk about the lessons we learned while doing our analysis. If you remember, earlier we created a chart like this one here. We created the life expectancy comparison of the United States against peer group performance, the rest of the G7, and top performer Japan. And what we learned was that the United States is way below peer group performance. And what we wanted to do is figure out why. During our modeling phase, we looked at these global weights. We looked at what is driving life expectancy, and you can roughly put this into three different categories. First, general economic factors: GDP per capita, unemployment rates, and so on. Then personal preferences or personal lifestyle: the prevalence of insufficient physical activity, the raised fasting blood glucose level, beer consumption, something like that. And then, in general, how much a country is spending on health: private prepaid plans, total expenditure on health, these kinds of things. Those three are in general driving life expectancy. Then what we can do is also look at the United States specifically. What are these things for the United States? So here we have two new factors coming in. The maternal mortality ratio, which is how many women are dying in the process of giving birth. And the infant mortality rate, which is how many children are dying up to the age of 1. And that was suspicious. So we asked ourselves, “Okay, what is special for the United States here?” So let’s have a look together at those charts. So we see here on top the maternal mortality ratio, and you see that the United States is way higher than peer group performance. So there are way more women dying in the process of giving birth in the United States compared to peer group performance. The same story is true for the infant mortality rate.
So way more children up to the age of 1 are dying per 1,000 live births in the United States compared to peer group performance. And that’s shocking. That is a big problem. It’s a big problem which we can obviously see and which we identified in our analysis. So one may say that’s because there’s not enough money in the system. That’s not true. The United States is a very wealthy country and spends 16% of its GDP on health care, while the rest of the G7 is actually spending less as a percentage of their GDP on health care. So there is money, and we need to make sure that the money gets to the points where it is really needed. And it is needed in this case. With the current situation, with COVID still being present everywhere, problems will just get harder. GDP is going down. Unemployment is going up. We still have lockdowns and curfews in place in some countries. That will just make this problem worse. There will be more problems like that. So please, let’s stand together, and if we can change something about this, let’s do that together. Thank you.