RapidMiner and Tableau are recognized as best-in-class platforms for data science and data visualization, and using them together provides a complete solution for analytics teams comprised of both data scientists and business analysts.
In this webinar, we demonstrate how RapidMiner and Tableau work together by considering a challenge faced by all manufacturers – minimizing or preventing machine failure. We show how RapidMiner and Tableau work together for predictive modeling and visualization.
This 30-minute presentation includes:
- A RapidMiner Studio walkthrough: preparing and modeling your data
- Tips and tricks for how data scientists and business analysts collaborate using Tableau and RapidMiner together
- Q and A
Hello everyone, and thank you for joining us for today’s webinar Minimizing Machine Failure with RapidMiner and Tableau. I’m Hayley Matusow with RapidMiner and I’ll be your moderator for today’s session. We’re joined today by Michael Martin, Managing Partner of Business Information Arts Incorporated in Toronto. Business Information Arts participates in the RapidMiner and Tableau partner programs. And Michael is a certified RapidMiner analyst and Tableau professional. Michael worked internationally in a variety of business sectors that include market research, consumer packaged goods, retail, banking, manufacturing, telecommunications, hospitality, governmental, and non-profit. His project deliverables include business performance forecasts, strategic and operational case studies, recommendation engines, and operational reporting and such.
We’ll get started in just a few minutes but first, a few quick housekeeping items for those on the line. Today’s webinar is being recorded and you’ll receive a link to the on-demand version via email within one to two business days. You’re free to share that link with colleagues who were not able to attend today’s live session. Second, if you have any trouble with audio or video today, your best bet is to try logging out and logging back in, which will resolve the issue in most cases. Finally, we’ll have a Q and A session at the end of today’s presentation. So feel free to ask questions at any time via the questions panel on the right-hand side of your screen. We’ll leave time at the end to get to everyone’s questions. I’ll now go ahead and pass it over to Michael.
So, good morning, good afternoon, good evening, wherever you may be. RapidMiner and Tableau are leading platforms for data science and visualization, but better yet, they work very well together. RapidMiner delivers insightful machine learning outputs and Tableau communicates them in a clear and visual way. Both platforms have rich feature sets, are well-documented, and have large and active user and developer communities. To see Tableau and RapidMiner in action, we’ll use a manufacturing use case: Minimizing Machine Failure. We’ll look at the data inputs, a RapidMiner predictive model, and how the model outputs can be visualized in Tableau.
So, imagine that a company is manufacturing a new line of espresso makers in a pilot factory with 136 machines. Each machine in this factory has sensors that monitor environmental and mechanical conditions. But in the first month of production, there were many lost product units due to machine failures and a limited understanding of why. To meet demand, another factory modeled after this pilot factory is about to go online. So this company needs to very quickly understand why machines in the pilot factory failed and identify machines at risk of failing in the future. So the next steps for this company would be to use RapidMiner to build and validate a predictive model for machine failure, generate failure predictions for machines in the new factory, develop and distribute role-based reporting, and take action to minimize and prevent at-risk machines from failing.
So let’s take a look at how the data is organized, starting with factories. We see that each factory has a location ID, a name, and some other information. We see the machines in each one of these factories. Of course, each machine is indexed to a location ID. Each machine has its own identification number but, very importantly, each machine has a position ID which corresponds to its physical location in the factory. For example, the machine at position 10 will always be a shaping machine in bank T, and a bank is simply a group of machines. Every machine at position two will be a center lathe at bank M in the factory. Then we have the sensors in each machine, 25 in all, that measure various mechanical and environmental factors. Last but not least, we have the actual sensor readings, and you can notice here that the data has been coded with this failure data field to indicate whether or not a given machine failed. This means that the RapidMiner model will be an example of what is called supervised learning, wherein RapidMiner looks at the sensor data in each row and learns why a given machine failed or didn’t fail.
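As a rough illustration of the data layout just described, the four tables could be sketched as plain Python records. All field names and sample values below are hypothetical; the talk only states that factories have a location ID and a name, that machines carry a machine ID, a location ID, and a position ID, that there are 25 sensors, and that each row of readings is coded with a failure field.

```python
from dataclasses import dataclass, field

@dataclass
class Factory:
    location_id: int
    name: str

@dataclass
class Machine:
    machine_id: int
    location_id: int   # which factory the machine belongs to
    position_id: int   # physical slot; the same position implies the same type and bank
    machine_type: str
    bank: str

@dataclass
class ReadingRow:
    machine_id: int
    failed: bool                                  # the label for supervised learning
    sensors: dict = field(default_factory=dict)   # sensor name -> reading

# Hypothetical sample records
pilot = Factory(location_id=1, name="Pilot factory")
shaper = Machine(machine_id=1010, location_id=1, position_id=10,
                 machine_type="Shaping machine", bank="T")
row = ReadingRow(machine_id=1010, failed=True,
                 sensors={"outside_temperature": 14.7, "humidity": 10.25})

# Join a reading back to its machine and factory through the shared keys
assert row.machine_id == shaper.machine_id
assert shaper.location_id == pilot.location_id
```

Keeping the failure flag on the same row as the sensor values mirrors the "one row per machine, with a label" shape that a supervised learner expects.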
So let’s see RapidMiner and Tableau in action. We’ll switch to a demo. And to start off with, this report here for an analyst could be quite useful to understand and summarize learnings to date about the machines that failed in the pilot factory. So here are different machine types. We see 136 in all. We see that 60 did not fail; 76 did fail. So that’s a failure rate of 56% or a pass rate of 44%. Here we are leveraging Tableau’s ability to use a background image to map data on top of. Here we see all of the sensor readings for the 136 machines. Here are the sensors. Each line here represents a given machine. And then we can scroll down and see a distribution of values for each of the sensors for machines that either failed or didn’t fail. Right away we notice something rather interesting. If we look at outside temperature, we see the average value for machines that failed was 14.7 as opposed to 9.04 for machines that didn’t fail. Likewise, we can click on any given machine type we like. We see exactly where they are in the factory. We can scroll down a bit and see the readings for that individual machine and hopefully, we get a little bit of a better understanding about what may be going on. Here, an example for humidity, we see an average reading of 10.25 as opposed to 4.58. So an analyst, he or she could dive into the numbers here and start to get an idea of what’s going on.
Conversely here, this report more for a manager or a shop foreman, you simply highlight a given machine, you see the machines that meet that description, you see the failure rate, and again, you see average readings for various sensors. For example, air circulation had an average value of 5.5 for machines that failed but for machines that didn’t fail it’s 10.3. That’s a much higher reading which is desirable in this case. Then this report for a shop steward or shop foreman, very detailed. It lays out all the sensors, it lays out all the readings for each individual machine. So these different role-based deliverables can start to give us a sense of what went on in that pilot factory. And we’re now in the position of being able to create a RapidMiner predictive model.
And here we are now in RapidMiner. We’re seeing a workflow. This workflow has 20 steps and RapidMiner makes it very simple for you to see what order everything will happen in. And this workflow uses a variety of what are called operators. Each one of these little rectangles is a RapidMiner operator. Each operator encapsulates a specific functionality. So, for example, I could drag this operator, any operator, onto the canvas and all you basically do is then specify how they’re supposed to function, what their properties and parameters are, and then you connect them together into a complete workflow that RapidMiner calls a process. So let’s quickly step through this process.
We first read in the machine data from a SQL Server database, in my case. We then use the Set Role operator to tell RapidMiner that we want to predict failure. That’s the data field we want to predict. And in RapidMiner that’s a very special type of field; it’s called a label. That’s the value we want to predict. So it has a special role of a label but it’s a special label because it only has two values: Yes or No. So it’s what’s called a binomial label. We then use the next couple of operators to introduce some weighting into the data, to balance the data so that when the predictive model is being built the algorithm is seeing more or less equal instances of cases where machines fail or don’t fail. Here’s a really interesting and useful operator in RapidMiner. It’s called Optimize Parameters and what it allows you to do is experiment with the learning algorithm that you’re using. For example, I’m using k nearest networks– k nearest neighbors, pardon me. And so I’m able to click on the various parameters of the algorithm such as K, which is the number of nearest neighbors. So I’m going to experiment with between 2 and 19. I’m going to try using a weighted voting, True or False. I’m going to experiment with all the different kernel types that this algorithm offers, and then for my cross-validation, I can segment my data into between 2 and 19 segments, and different segmentation of the data may help the model be a little bit better. But if we look at that briefly again, I should mention that we have 5,776 combinations using these parameters. And what’s great about RapidMiner is that RapidMiner tests them all one after another in quick sequence, finds the best combination of parameters, and then we use the save model operator to save the model to disk.
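To make the idea of an exhaustive parameter search concrete, here is a minimal Python sketch. It is not RapidMiner's implementation. The talk gives only the total of 5,776 combinations, so the per-parameter counts below are an assumption chosen because their product matches that total, and the scoring function is made up.

```python
from itertools import product
from math import prod

# Number of candidate values per parameter. Only the total (5,776) comes from
# the talk; this particular split is an assumption that happens to match it.
assumed_counts = {"k": 19, "weighted_vote": 2, "kernel_type": 8, "cv_folds": 19}
total_combinations = prod(assumed_counts.values())
print(total_combinations)

def grid_search(grid, score):
    """Try every combination in `grid` and return the one with the best score."""
    names = list(grid)
    candidates = (dict(zip(names, values)) for values in product(*grid.values()))
    return max(candidates, key=score)

# Tiny illustrative grid with a made-up scoring function standing in for
# cross-validated model performance.
small_grid = {"k": [2, 5, 9], "weighted_vote": [True, False]}

def toy_score(params):
    return -abs(params["k"] - 5) + (0.5 if params["weighted_vote"] else 0.0)

best = grid_search(small_grid, toy_score)
print(best)
```

The grid grows multiplicatively with each added parameter, which is why thousands of combinations appear from only four settings.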
RapidMiner then generates the predictions, writes the predictions back to the SQL database, and then right within RapidMiner you have the Write Tableau Extract operator, which writes a Tableau extract to disk with the predictions so that you can natively connect to them in Tableau.
We then do one last step. What we do is we throw a variety of statistical operators at the data that determine which sensors seem to be the most indicative of a machine failing or not failing. So this is extremely useful information to have. These are written to disk and now we have a complete process. Once a process is finished running, immediately you’re taken to the results. So in RapidMiner, here are our failure predictions. Here are the confidences of those failure predictions. Here is some of the weighting we introduced into the data. And here is the normalized data, because k nearest neighbors tends to do a bit better with normalized data. So now, we’re in a position to use this model and to better understand what happened in the factory.
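The remark that k nearest neighbors does better with normalized data can be illustrated with a small, self-contained Python sketch. The sensor values are made up and the two columns (an outside temperature and a much larger-scaled reading) are hypothetical; this is a textbook z-score plus nearest-neighbor vote, not the RapidMiner process itself.

```python
from collections import Counter
from math import dist
from statistics import mean, pstdev

def fit_scaler(rows):
    """Return per-column means and standard deviations of the training rows."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [pstdev(c) or 1.0 for c in columns]

def scale(row, means, stdevs):
    """Z-score one row using previously fitted means and standard deviations."""
    return [(v - m) / s for v, m, s in zip(row, means, stdevs)]

def knn_predict(train_rows, train_labels, query, k=1):
    """Majority vote among the k training rows closest to the query."""
    nearest = sorted(zip(train_rows, train_labels),
                     key=lambda pair: dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up readings: [outside_temperature, large_scale_sensor]
train = [[9.0, 1000.0], [9.5, 3000.0], [14.5, 1100.0], [15.0, 2900.0]]
labels = ["No", "No", "Yes", "Yes"]
query = [14.8, 1000.0]

# Without scaling, the large-scale column dominates the distance.
raw_prediction = knn_predict(train, labels, query)

means, stdevs = fit_scaler(train)
scaled_train = [scale(r, means, stdevs) for r in train]
scaled_prediction = knn_predict(scaled_train, labels, scale(query, means, stdevs))
print(raw_prediction, scaled_prediction)
```

In this toy data the unscaled search matches the query to a machine with an identical large-scale reading even though its temperature is very different; after scaling, both columns contribute comparably.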
But the best predictive models are the models that learn the most important influential causal factors driving outcomes. They learn the signal in the input data but ignore most or all of the noise. Better models are said to generalize very well and make more accurate predictions when fed wider varieties of input data they’ve never seen before. This is preferable to models that learn the signal and the noise of the input data but, when confronted with data they’ve never seen before, make predictions that are much, much less accurate. And there’s a very interesting article on the web called Understanding the Bias-Variance Tradeoff. It’s at this URL here. It discusses this very important point in more detail and I found it very useful. We’re now in a position to use this model very quickly to take one last look at our pilot factory to get an understanding of what may have happened. And remember, we had RapidMiner generate a table of key influencers, and we see in the pilot factory, if we blend all the influencers together and sort our sensor readings by these influencers, we see right away that a key influencer is outside temperature, next humidity, then rotation, vibration, and then air circulation. For rotation and air circulation, higher readings tend to favor machines not failing. Outside temperature, humidity, and vibration tend to favor machines failing. So we can look here and we can see very different profiles based just on these sensors. And, of course, what we can do is click on any given machine type as we did before. Tableau uses these great sensor influence indexes to sort the sensors, and we can see that for this type of machine, center lathes, the humidity index is at 82, followed by vibration and air circulation, and you can immediately see the profiles between machines that fail and don’t fail. And we can immediately see very wide variances in the distribution of values here for machines that failed, here for machines that didn’t fail.
So right away, we’re able to leverage this information from RapidMiner to help us better understand what happened in the pilot factory. And we’re now in a position to switch back to RapidMiner and actually generate predictions for machines in the new factory.
So here’s a very simple way to do it. We can see it’s just a five-step process. We read in the model that we recently saved to disk. We then read in the new data from our SQL database. We then apply the model using the Apply Model operator. We immediately then save these predictions out to our SQL database, and we can write a Tableau extract that a Tableau user could directly connect to. You notice that with RapidMiner – I should have mentioned this before – you can add little notes and document your work as you’re going along, which is very helpful. So this would be one way we could generate predictions. Here’s one other way that’s a little more complicated. It has more steps but we get a few extra potentially interesting pieces of information for doing so. Of course, this process reads in our model. It reads in our data. We generate our predictions. We write them to disk. Then we write the predictions as a Tableau extract, but then we can feed our predictions through a clustering algorithm, because it could be very interesting to see if there are any other things we could learn by clustering our predictions into different clusters based on different reasons why machines failed; after all, there are going to be a variety of reasons. Once we’ve done that, we read in our predictions, and once again, we go and try to find key sensor influencers like we did before. We throw a variety of mathematical operators such as Weight by Correlation, Weight by Gini Index, Weight by Information Gain, Information Gain Ratio, etc., and we’re hoping to get some extra insight into some of the factors that influence these new machines to either fail or not fail. And we write them to our SQL Server database and again, we write them out as a Tableau extract.
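For readers unfamiliar with weighting attributes by information gain, the following Python sketch shows the textbook calculation for a single sensor split at a hand-picked threshold. The readings, labels, and thresholds are made up, and this is not a description of how the RapidMiner operators compute their weights internally.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the labels at `threshold` on one sensor."""
    low = [lab for v, lab in zip(values, labels) if v <= threshold]
    high = [lab for v, lab in zip(values, labels) if v > threshold]
    weighted = sum(len(part) / len(labels) * entropy(part)
                   for part in (low, high) if part)
    return entropy(labels) - weighted

# Made-up readings for six machines
failed = ["No", "No", "No", "Yes", "Yes", "Yes"]
outside_temperature = [8.0, 9.0, 10.0, 14.0, 15.0, 16.0]
dust = [3.0, 7.0, 4.0, 6.0, 5.0, 8.0]

gain_temperature = information_gain(outside_temperature, failed, threshold=12.0)
gain_dust = information_gain(dust, failed, threshold=5.5)

# Rank sensors by how well they separate failed from non-failed machines
ranking = sorted({"outside_temperature": gain_temperature, "dust": gain_dust}.items(),
                 key=lambda item: item[1], reverse=True)
print(ranking)
```

A sensor that splits the machines cleanly into failed and not failed scores near 1 bit, while one whose readings overlap heavily scores near 0, which is the intuition behind a sorted influencer list.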
So now we’re in a position to get an idea about what is going on in the new factory. This is what we’ve been building toward. We generated these predictions. So, unfortunately, we’re in for a little bit of an unpleasant surprise. We see here in the pilot factory we had a 56% failure rate or a 44% success rate. We see here, however, “Wow,” 32% pass rate, 68% potential fail rate. That’s serious. So we’ve got to get to the bottom of that as quickly as we can. If we go down here, we see something interesting; a little bit different. We have, remember, readings and failure information from two different factories, the pilot factory and the new factory. And what’s really concerning to us is this group of machines right here in these positions. For example, position one in the factory. We will see this very low reading of air circulation, high reading of outside temperature. These are machines that did not fail in the pilot factory but RapidMiner thinks are at rather severe potential risk to fail in the new factory. That’s 33 machines that flipped. This is particularly concerning. There’s a little bit of good news in the sense that here are 16 machines that failed in the pilot factory but RapidMiner thinks these are going to be okay in the new factory. And we can see higher readings of air circulation, lower readings of outside temperature, etc. Then these are the machines, a big group of 60, that failed in the pilot factory and RapidMiner says are at risk for failing in the new factory. And last but not least, here are 27 machines that did not fail in the pilot factory and RapidMiner thinks we’re in the clear in the new factory as well. And we can see these machines have higher readings of air circulation and lower readings of outside temperature, etc.
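The four groups quoted above (33, 16, 60, and 27 machine positions) can be cross-checked against the stated failure rates with a few lines of Python. The counts come from the talk; the variable names are illustrative.

```python
# (outcome in pilot factory, predicted outcome in new factory) -> machine positions
transitions = {
    ("no", "yes"): 33,   # did not fail in the pilot, predicted to fail ("flipped")
    ("yes", "no"): 16,   # failed in the pilot, predicted to be okay
    ("yes", "yes"): 60,  # failed in the pilot, predicted to fail again
    ("no", "no"): 27,    # did not fail in either
}

total = sum(transitions.values())
pilot_failed = sum(n for (pilot, _), n in transitions.items() if pilot == "yes")
new_failed = sum(n for (_, new), n in transitions.items() if new == "yes")

pilot_failure_rate = pilot_failed / total
new_failure_rate = new_failed / total
print(f"pilot: {pilot_failure_rate:.0%}, new: {new_failure_rate:.0%}")
```

The groups add up to the 136 positions, with 76 pilot failures (about 56%) and 93 predicted failures in the new factory (about 68%), consistent with the figures on the dashboard.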
Just like we did in the earlier analyst dashboard, we can click on any given machine in the new factory, see where they are, and RapidMiner, if we look at the center lathe machine, shows this index of 65 followed by an index of 43 for outside temperature, an index of 37 for exhausts. So for each different machine type, we get this sorted list and we can once again immediately recognize profiles between machines that fail and don’t fail. We can see this distribution of values. For example, for internal tension, we see a real big difference for this class of machines, for the measurements for machines that fail and don’t fail. So for an analyst again, he or she can dive into this and get quite a bit of understanding about what went on in the new factory.
For a shop steward or a manager, it’s as simple as just clicking on a machine type. You see the sorted list of influencers. So we see a blended average of 47 for all of our lathes saying that air circulation, followed by rotation, followed by outside temperature and vibration are key sensors for this type of machine. We see the machines that failed. We see their sensor readings in a pop-up. We see here are the machines that flipped from not failing to failing. And if we take a look at another type of machine, Turret milling, it’s a small number of machines but we see five machines flipped from not failing to failing. And if we go back to the original factory, we see that in our pilot factory only 23% failed, and here we see 54% failed. That’s a huge difference. Now could this be an example of an algorithm run amok? It would be good to really test that a bit more if we could and fortunately, we can. So RapidMiner provides these predictions. It provides the indexes that point us in the right direction. And in Tableau, we can simply select a machine type like Turret milling that we just looked at. We select a failure state comparison of no to yes and we can then check out these key sensors and see: is the data really that much different in the new factory to justify those failures? Well, we see decibel volume, a leading indicator for this type of machine, is much higher. What about internal pressure? Much higher on average. What about air circulation? Air circulation is lower. What about outside temperature? A little bit higher. What about voltage? Voltage is higher. What about power drain? Power drain is higher. What about smoke? Smoke is a bit higher. What about air intake? Air intake, we’re bringing in more dust, and smoke is certainly not a good thing for a machine. So we can see there’s ample justification for why the RapidMiner model could flip those five machines from not failing to failing.
And even if we were to blend in all 33 positions with 66 machines, 33 in the pilot and 33 in the new, here all of the machines that flipped. And is it justified by the data? Well, we could see that dust on average for all of those machines is higher. Outside temperature, more or less the same. Air circulation, a little bit less. What about voltage? Voltage higher. What about power drain? Power drain higher. What about internal tension? Internal tension, about the same. What about exhaust? About the same. What about decibel volume? Fair amount higher. What about dust? Fair amount higher. What about humidity? A little bit higher.
So we saw that many sensor values had higher readings for sensors that indicated failure, and we saw something like air circulation which is indicative of a machine not failing was actually a little bit lower. So there appears to be some real ample justification as to why these machines failed that didn’t fail before, and this is all leading to a management report. Very simple. Straight up and down. It’s showing our 48 machines that RapidMiner has told us are at elevated risk for failing in the new factory, and we can see it’s machines of all types. Very easy to filter. It’s a real call to action. We see that the machines that failed in both factories, they’re particularly troublesome. We see the machines that didn’t fail in the original factory but failed in the new factory. And this is a call to action for an analyst, for example, to then look at a more advanced sensor analysis report here. And we’re looking now just at shaping machines. And if we look here at these cells we see these green cells have below-average values. For example, for this particular machine, the average value for air circulation is 6.97. It indexes at 58 being a very important factor, and it’s no accident I think that these machines didn’t fail and they have these basically below-average readings for humidity. But once we get into the red zone, these are machines that are failing and they have much, much higher readings. So what you can do here as an analyst, as a user, you can highlight a couple machines here. You see their sensor readings. You see lots of red here. These are machines that really have lots of sensors with out-of-bounds values and you can come right down here and see exactly where they are in the factory, right? So you see what the problem is. You see where the problem is. You could also, for example, highlight a machine here that didn’t fail. Let’s say, we unhighlight this machine. Each of these are kind of poster children for their failure status. 
We see a machine that didn’t fail, very low humidity, higher circulation. We see a machine that failed with higher humidity, much higher readings for, for example, internal tension dust, etc. So RapidMiner and Tableau working together make it rather obvious where the problem is and what needs to be done about it.
Now, I want to leave you with one other last dashboard here. Remember we generated some clusters when we created our predictions? I’ve isolated three clusters here that have 52 machines with various elevated and moderate risks of failure. And here they are. What we can do though is we can color these by cluster. So we see the readings for these three different clusters and we see that we get three very, very different profiles. If we look at this very first cluster, we see that what distinguishes it are very high readings for internal tension. If we look at the second cluster, we see high readings for a variety of reasons: humidity, dust, internal pressure, and outside temperature. If we look at this third cluster, it’s totally different in the sense that – pardon the pun – the internal temperature readings are through the roof for this group of machines. And if we were particularly concerned about that, all we have to do is filter and we immediately can apply this filter, and we can come down here and we see exactly where these machines are. We can look at those really high readings for internal temperature. So RapidMiner and Tableau working together makes it really simple to see what’s going on.
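The talk does not say which RapidMiner clustering operator produced these groups. As a stand-in, here is a deliberately naive k-means sketch in Python showing how machines with different sensor profiles separate into clusters. The two columns and all readings are made up, and seeding the centroids with the first k points is a simplification that real implementations avoid.

```python
from math import dist
from statistics import mean

def k_means(points, k, iterations=10):
    """Naive k-means: seed centroids with the first k points, then alternate
    between assigning points to the nearest centroid and recomputing centroids."""
    centroids = [list(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        assignments = [min(range(k), key=lambda c: dist(p, centroids[c]))
                       for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [mean(column) for column in zip(*members)]
    return assignments, centroids

# Made-up readings: (internal_tension, internal_temperature)
machines = [(9.0, 2.0), (2.0, 9.5), (2.0, 2.0),
            (8.5, 2.5), (2.5, 9.0), (2.5, 2.5)]

assignments, centroids = k_means(machines, k=3)
print(assignments)
print(centroids)
```

Each resulting centroid summarizes one profile, for example high internal tension versus high internal temperature, which is the kind of contrast the colored dashboard makes visible.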
We saw a lot of features in this demo. It’s really the tip of the iceberg of what Tableau and RapidMiner can do together. But we saw with RapidMiner drag and drop operation, no coding required, steps in the workflow are highlighted. You always know in what order everything will happen. You can read and write from databases, Excel, text files. You can even write Tableau extracts. You can mash up and blend data. You can optimize parameters. You can find influencers, normalize data, use cross-validation, see process results, reuse models. You can layer data over a background image in Tableau. In Tableau, we saw lots of filtering. We used actions to drill down. We used parameters to color measures by different dimensional attributes. We used a range of calculation functions to aggregate attributes and calculate ratios. We used highlighters to see subsets of data within greater sets of data. We sized dashboards. We could apply lots of different formatting options to text filters, parameters, and legends in Tableau. And I’m hoping there’ll be a lot more collaboration between RapidMiner and Tableau team members. Aligning to deliverables and knowing the data is always key, and good deliverables are one thing but you’ve really got to be clear on how they will be used, by who, to do what. A little curiosity goes a long way with these great platforms. You’re always rewarded for being curious but not always as you might expect. It’s good to have an awareness and be interested in the capabilities of both platforms. Show your work in progress, validate your outputs with stakeholders, and even there, there’s so many possibilities. Staying on message, of course, is key. Barriers of entry to getting started with RapidMiner and Tableau are really low. If you are a Tableau user you can download the free RapidMiner community edition, which has a full feature set: many examples, many tutorials. There are books out there on RapidMiner. There are helpful support forums.
RapidMiner users can download trial versions of Tableau Desktop or the community edition of Tableau which is called Tableau Public. It has a full feature set. There are many solution examples to learn from online. Lots of videos on YouTube. Very helpful support forums and books.
If I can leave you with one last thought. If your company uses Tableau, I suggest that your company seriously check out RapidMiner. If your company uses RapidMiner, I suggest that your company really check out Tableau. Because simply put, these two great platforms are even better together. That’s it. And thank you for listening.
Great. Thanks, Michael. Thanks again for your great presentation. Another reminder to those on the line, we will be sending a recorded version of today’s presentation within the next few days via email to everyone who registered. So now some of our Q and A as I mentioned before. I see a lot of questions coming in, so we’ll go ahead and address those now.
So the first question here is: we use RapidMiner and I think we should try Tableau. What’s the best way to get started?
Well, I think that what’s great is, as I mentioned in the presentation, there’s lots of great material online. Taking a class, a good structured class, is always good. And if you’re getting started with Tableau, use data you know really well, so that you can put your full focus on learning the features of the software. Those are the two biggest suggestions I would give in the short amount of time I have to answer that question.
Great. Thanks, Michael. Another question here is, did you deploy any feature selection to get a final list of sensors?
Yes. Essentially, in this process I used what I demonstrated here, which was to actually throw the different algorithms– throw those different learners at the sensor readings to rank them. I did also try principal component analysis, which is quite good for capturing variance. It didn’t make a particular difference but, of course, there’s also forward selection, backward selection. So RapidMiner gives you many different ways to address that. I found that I got very good results using the Information Gain and Gini Index operators in this particular case.
Great. Thanks. Another question from the same person. Is it possible in a tool to tie back sensors to trips without visual inspection?
Pardon me. Could you repeat that again, please?
Yes. So is it possible in the tool to tie back sensors to trips without visual inspection?
I suppose it would be. You would just have to be able to integrate that into your data model so that you have the right metadata to make the differentiation. Yes. It’s going to really depend on your data model.
Great. Thanks. Another question here is, this person uses RapidMiner for market basket analysis. Can Tableau visualize the network graph, they’re asking?
Yes. Actually, if you go on the Tableau website and look for the Toronto Tableau users group, you will see a presentation I gave from September of 2012 where I illustrate how to make Tableau do network graphics, network scene graphs. There’s a little bit of work you have to do in terms of getting that data model ready, but by doing certain types of dual-axis graphs in Tableau you can actually create a network map within Tableau. Yes.
Great. Thanks for that. This person’s saying they’re using– or they’ve used Tableau but they haven’t used RapidMiner. They’re saying when you showed the clustering stuff, was that clustering in Tableau or was that in the RapidMiner?
Right. The clustering all came from RapidMiner in this case. Tableau does have some built-in clustering but in this case, I used a clustering operator from RapidMiner as its capabilities are a bit more advanced than what you get natively within Tableau.
Great. Another question here. This person is asking what stats can be taken to validate the predictive models, for example, over time.
The best way to do it over time is obviously, to continue to code your data. In other words, in this use case, we had a single factory with coded data. Then, of course, we generated our predictions. So, of course, part of any project like this is maintaining your data model going forward which would involve continuing to score how well the model predicts and obviously, to retrain the model periodically so that a greater range of data can be incorporated into the model and increase the model’s capability to generalize well in the future.
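One simple way to keep scoring a model over time, assuming predictions are logged and later matched with the coded outcome, is to track accuracy per period and flag periods that fall below a chosen threshold. The Python sketch below uses made-up periods, records, and a hypothetical 0.8 threshold; it is only one of many possible monitoring schemes.

```python
from collections import defaultdict

def accuracy_by_period(records):
    """records: iterable of (period, predicted_label, actual_label) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for period, predicted, actual in records:
        totals[period] += 1
        hits[period] += int(predicted == actual)
    return {period: hits[period] / totals[period] for period in sorted(totals)}

def periods_needing_retraining(accuracies, threshold=0.8):
    """Return the periods whose accuracy dropped below the threshold."""
    return [period for period, acc in accuracies.items() if acc < threshold]

# Made-up log of predictions that were later coded with the real outcome
log = [
    ("2024-01", "yes", "yes"),
    ("2024-01", "no", "no"),
    ("2024-02", "yes", "no"),
    ("2024-02", "no", "no"),
]

accuracies = accuracy_by_period(log)
flagged = periods_needing_retraining(accuracies)
print(accuracies, flagged)
```

A drop in a given period is a prompt to retrain on the newly coded data so the model sees a wider range of conditions.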
Great. Thanks. Another question; this person is asking what are their options for exporting the data source?
Oh, they’re exporting data out of RapidMiner, you mean? I’m assuming that’s what you mean.
I believe so.
Yeah. RapidMiner has been great on this, and this is one reason why I like RapidMiner. It used to be really hard to get data out of opaque data mining or data science platforms. RapidMiner allows you to export and import data from a variety of data sources just like Tableau does, actually. And RapidMiner has also come out recently with a converters extension which allows you to convert many, many RapidMiner objects – internal RapidMiner objects – such as clusters, such as various arrays of data that have been always internal to RapidMiner into objects that can then go into other programs. For example, in RapidMiner you can log the operation of every single step and you can now convert that log into a data set so that you can actually analyze how RapidMiner has built your predictive model. So RapidMiner’s been really great in that area in terms of opening up results to go into a lot of different platforms.
Great. Another question here. Is there a possibility to trigger RapidMiner with Tableau? Or is the data shown in Tableau pre-calculated? So this person’s asking can they choose a specific time frame in which they want to see that analysis or prediction in Tableau, and then RapidMiner is doing the calculation for this in any given time frame? Does that make sense?
Absolutely. In other words, RapidMiner– and I’ve helped RapidMiner actually, with some feedback on this. RapidMiner is developing a component that absolutely allows you to send a filtered data set to RapidMiner Server. Assume that RapidMiner Server has a predictive model aimed at that data set; a Tableau user could then send a filtered data set over to RapidMiner Server with the predictive model. The model is run and then the results are sent right back live, in real time, to your Tableau worksheet. So absolutely, if I’m understanding your question correctly, this type of real-time dialog between RapidMiner and Tableau can hopefully be made generally available before too long, because it’s a wonderful feature.
Great. Thanks, Michael. One more question. This person’s asking: what would be the main reason to convince their supervisor to start using RapidMiner and Tableau?
Return on investment. If someone’s a real widget man or widget woman and they need to be convinced, I would say return on investment. The outputs you can produce with this tool, with these combinations of platforms, turn into extra dollars frankly, is the best reason. Of course, you have to use the argument that your management will really respond to. If they respond more to detailed use cases, then the ball’s in your court to configure a use case that’s relevant business-wise to what your company does and demonstrate an effective solution that can actually be implemented within your company. And again, that’s always a risk in machine learning. We have to be able to, not only come up with some great outputs, but we have to be able to explain them in a structured narrative with a beginning, middle, and an end. And we also have to take care that whatever we’re producing can actually be implemented with the people we have onboard. Will they know how to use the outputs in order to drive the results, so at the end of the day, everyone agrees there’s been return on investment.
Great. Thanks for that. It looks like another question here. This person is asking about the reports in Tableau. They’re asking were those manually added on the interactive ones that you showed at the end of your presentation?
Yeah. The version of Tableau I used is Tableau Desktop. That’s Tableau’s authoring tool that allows you to create visualizations and/or dashboards. So using the RapidMiner outputs, all the great stuff that RapidMiner did for us in the background, I then built a few dashboards using Tableau as an authoring tool in order to highlight the findings from RapidMiner.
Great. Thanks again, Michael. So it looks like we’re just about at time here. If we weren’t able to address your questions – we had a lot of questions come in online – we’ll make sure to follow up with you via email in the next few business days. So thanks again, Michael. Thanks, everyone, for joining us for today’s presentation and I hope everyone has a great day.