Using Data Science for Predictive Maintenance

Organizations often face the challenge of ensuring maximum availability of critical manufacturing systems, while simultaneously minimizing the cost of maintenance and repairs. Early identification of potential concerns helps organizations deploy limited resources more cost effectively and maximize equipment uptime.

Watch this webcast and product demonstration where we share how to leverage machine learning and data science on your available manufacturing or operations data to help you:

  • Minimize maintenance costs
  • Reduce unplanned downtime
  • Avoid failure recovery costs
00:00 Hello everyone, and thank you for joining us for today’s webinar, Using Data Science for Predictive Maintenance. I’m Hayley Matusow with RapidMiner, and I’ll be your moderator for today’s session. I’m joined today by Leslie Miller, our industry solutions expert, and Jeff Chowaniec, our RapidMiner data science expert. Leslie and Jeff will get started in just one minute, but first, a few quick housekeeping items for those on the line. Today’s webinar is being recorded, and you will receive a link to the on-demand version via email within one to two business days. You’re free to share that link with colleagues who were not able to attend today’s session. Second, if you have any trouble with the audio or video, your best bet is to try logging out and logging back in; that should resolve the issue in most cases. Finally, we’ll have a question and answer session at the end of today’s presentation. Please feel free to ask questions at any time via the questions panel on the right-hand side of your screen. We’ll leave time at the end to get to everyone’s questions. That’s all for me. Now I’ll pass it over to Leslie.
 
00:57 Thank you, Hayley. Hello everyone. So today we’re going to talk about predictive maintenance. That’s what we’re all here for. We’ll go quickly over the maintenance journey and some of its challenges; why predictive maintenance can help; some items you want to make sure you’re getting when you’re looking at a data science solution; a little bit, just a little bit, about RapidMiner; and three customer use cases that we have. Then, Jeff will go right into the predictive maintenance demonstration. And as Hayley mentioned, we’ll follow up with some Q&A. So if we look at maintenance engineering goals, we see that the Internet of Things and the ability to add sensors to everything is continuing to permeate every sector of manufacturing, from transportation and logistics, to automotive and utilities. Our maintenance engineering goals don’t change through all of that: they are to reduce downtime and to maximize efficiency.
 
02:10 So when we look at a few of the things that are impacting organizations right now, certainly there’s global competition. Manufacturers across industries are grappling with faster product cycles, increasingly complex global supply chains, rapidly rising offshore labor costs, mass customization, and regulatory demands for traceability. And then, there is certainly Industry 4.0. It encompasses much more than predictive maintenance: it means trying to coordinate that global supply chain all the way across, gathering data, and optimizing across all of those things. And then, of course, there’s the Internet of Things. Machines and equipment are increasingly connected with ever more sophisticated data gathering, generating billions of data points every year. And sensors are critical to monitoring the health of our equipment. Think of some of the things that we have out there: infrared thermography, sonic and ultrasonic analysis, motor current analysis, vibration analysis, oil analysis; all types of sensor analysis are out there. And in our global economy, even minor differences in efficiency and productivity can determine which companies thrive and which ones fail. So maintenance engineering is one area that can make a huge difference to both top-line productivity and bottom-line efficiencies.
 
03:52 Equipment maintenance has come a long way in the past decade. Many businesses have adopted smarter strategies to improve their efficiency. When we began, we were just doing reactive maintenance: you know, fix it after it’s broken, where little or no maintenance is conducted until after the fact. This has segued into preventative maintenance, or scheduled maintenance, based on the repair or replacement of items on a fixed calendar schedule, regardless of the condition of the equipment, the piece, or the component at the time. This approach has obvious benefits over a reactive approach, but it can lead to excessive replacement of components that may still be in good working condition, as well as an increased amount of downtime due to servicing the equipment on a fixed schedule.
 
04:50 Some of the other challenges we’re beginning to encounter with preventative maintenance are that we’re not using the existing data that we’re collecting. You can see I have two different quotes here, one from Gartner, one from McKinsey. Gartner says that 72% of the manufacturing industry’s data is unused due to complexities involved with variables such as pressure, temperature, and time. When we look at McKinsey, they also talk about all of the data not being used. For example, on an oil rig that has 30,000 sensors, only 1% of those data are examined. They’re not optimizing with prediction, which does provide a greater value. So as we discussed, we have too much data; it becomes increasingly difficult for humans to bring forth answers from these mountains of data, and to find the best solutions in our ongoing quest to keep our operations up and running at an optimal level.
 
05:59 The next challenge is also very data related. We have these huge amounts of data to sift through, and they’re being generated across massive distributed operations, so all different types of data from across the supply chain. And those data tend to live in different systems or data silos. Frequently, this data is not yet intelligently evaluated across all of the various machines and processes, and the useful findings that we may be able to uncover are not being exploited. So this makes it difficult for the maintenance engineering team to find the correlations across all of these different pieces and parts, and to preemptively get to the anomalies that can cause a breakdown or an outage.
 
06:54 Next, it’s just complicated. And I thought that this image really brought that home. On this one jet engine, think of all the sensors that are on this particular engine. So, with regular operations, there are numerous variations in sensor output. And while this data will fall into normal ranges, pinpointing significant anomalies is extremely complex for a human team; there are way too many pieces and parts in that mountain of data again. And industrial machines don’t just stop working. Failure is almost always the result of a chain of events. As one problem leads to another, a digital signal is created. One of my colleagues says that it’s similar to the symptoms of an illness. For complex machines and systems, these symptoms are scattered over millions of data points and come from various sensors at different times, and of course, as we just discussed, are stored in separate silos. So finding the critical signal amidst all of this noise becomes humanly impossible.
 
08:14 So the first companies that can figure out how to automatically convert their vast data into actionable information are going to gain a competitive advantage. Think about Google, and how it gained a huge advantage over traditional advertising platforms by applying big data techniques and machine learning to consumer mouse clicks. They turned this into high-value information and a very healthy revenue stream for themselves. So in a world of too much information and complexity, you need to find something that’s going to help point you towards what’s relevant, interesting, and valuable in your mountains of data. Detecting machine faults early drives efficiency in your maintenance process and is going to open up completely new possibilities for your company. One of the most important is maximizing operational productivity. Fewer downtimes maximize the availability of your critical equipment, and that means that your organization has additional capacity or productivity to grow revenue. And I’ll talk about a use case later on that showcases that exactly.
 
09:34 Also, using data science can help you to build highly efficient maintenance services. Say that three times fast [chuckles]. Improved efficiencies allow for integrated, smooth processes throughout your operation and allow your team to deliver on even more stringent SLAs. And then, another of the items that we talk about a lot is optimizing costs. When you have an effectively planned maintenance process you can save millions of dollars a year in repair costs and expendable materials, and also help extend machine lifetimes. Those are just a few of the benefits; you can also see things such as increased health and safety for your organization’s workers. These are some of the other things that can come out of a predictive maintenance process. Every day you’re facing the challenge of ensuring the maximum availability of your critical manufacturing systems while trying to minimize the cost of your maintenance and repairs. Applying machine learning and data science means that there’s no more guesswork involved. Your engineers can say with certainty which pieces and parts need replacing and when.
 
11:07 It is predictive maintenance that is going to help you address all of these many challenges and give you that competitive advantage. As every efficiency and element of productivity counts, we can apply data science to all of the vast data that we’re collecting to predict and prevent equipment failure and to fix problems right on time. Machine learning can find the critical signals hidden in all of the noise of your operational maintenance and inspection data, and can automatically pinpoint deviations that indicate the possibility of damage or wear and tear that can lead to partial or complete machine failures. You can predict when, where, and why failures are likely to occur. And again, that means no more guesswork for your engineers. They can say with certainty which parts will need replacing and when.
 
12:07 What should you look for in your data science solution? Well, number one, you want to ensure that your solution will address the whole data science life cycle, from data prep to the modeling and the validation of those models, to the operationalization and the model management of your data science endeavors. Next, you want to make sure that you’re handling all types of data: big data, where you’re conditioning, collating, and analyzing months, even years, worth of data. You want to be able to ensure that your solution can manage wide data problems. Since each piece and part of your equipment and process has sensors, you want to ensure that your data science solution can help you analyze a vast number of variables. And then you want to make sure that you’re using all types of data: not just structured data, but also semi-structured data from logs and unstructured data from inspection or maintenance reports. You want to incorporate all of these data into your models so that the result is a more accurate, all-encompassing prediction.
 
13:27 Along the same lines, you want to make sure that you’re connecting and joining data from these multiple sources across those data silos so that you, again, can make a far more accurate prediction when you’re taking into consideration all of the different pieces and parts that impact the health of your equipment. Next, you want to ensure that your data science solution has the full breadth of machine learning algorithms. You want to be building thousands of effective data models that are going to run in parallel in order to deliver the optimal outputs. A mere handful of the most often used algorithms is not going to give you what you need in order to solve your predictive maintenance problem. The next item you want to make sure you’re including is automatic model retraining. It’s so important with machine data, since patterns are constantly changing and prediction models can become obsolete quickly. Therefore, it’s essential that your solution account for this change and variability by automatically retraining the model on a regular basis. And then, finally, it’s super important that your data science solution gives you the ability to effect change, whether that’s effecting human action, for example, sending predictive analytics results to the maintenance engineering team that show where to go next, or an automated action: something that will interact with your event processing system in order to create an automatic action within your equipment and materials.
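Automatic retraining on a rolling window of recent data is straightforward to sketch outside of any particular platform. The Python snippet below is a minimal illustration of the idea only: the sensor stream, window size, and retraining cadence are hypothetical stand-ins, and the model fit itself is mocked out.

```python
from collections import deque

# Hypothetical stream of sensor rows; a real deployment would read these
# from the shop-floor systems on a schedule.
stream = [{"sensor_1": i * 0.1, "failure": i % 7 == 0} for i in range(25)]

window = deque(maxlen=10)   # keep only the most recent rows
models = []

for i, row in enumerate(stream, start=1):
    window.append(row)
    if i % 10 == 0:         # retrain on a fixed cadence, e.g. weekly
        # Stand-in for a real model fit: record what the model saw.
        models.append({"retrain_at": i, "rows": len(window)})

print(models)
```

The `maxlen` on the deque gives the drift handling: each retrain sees only recent behavior, so a model trained months ago on stale patterns is never left in production.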
 
15:29 How can RapidMiner help? Number one, it’s our mission to put data science behind every decision. Early on in the webcast we talked about Industry 4.0, and certainly data science presents a huge opportunity for organizations to put data science behind every different piece and part of Industry 4.0. We’re talking about predictive maintenance here, but just so you understand who we are and where we’re coming from, we want you to be able to use data science wherever it applies within your organization. We do that in two different ways: by offering the right tools, and also by offering people who don’t have data science experts access to data science experts through a marketplace platform. From a software perspective, from a data science platform perspective, RapidMiner is the number one open source platform out there. We offer self-service machine learning; it is a visual tool, which certainly accelerates the process of creating your predictive models. Simply because it’s a visual tool, some people say, “Well, that wouldn’t have all of the tools that I could possibly want as a data science expert.” Certainly, within the product we have a very deep array of over 100– I’m sorry, over 1,500 machine learning algorithms and data prep functions. So that type of broad and deep capability within our product is very important to us as a company made up of data scientists. Then, the third piece and part of the platform is ensuring frictionless operationalization. We feel that insight without action is pointless. The purpose of gaining insight is to ensure that we do take action, and we want to enable both in an easy fashion without months of integration. We want to ensure that there is a quick path to getting action, whether human-generated actions or machine-generated, automated actions.
 
18:07 But the other side of the coin is we have the number one marketplace for data science experts. And for those companies that don’t have a team of data scientists but would like to create a pilot project to prove out the value of data science, we provide access to data scientists, consultants, who do these pilot projects for, I would say on average, between $5,000 and $10,000. But what we’re trying to do, again, is provide domain expertise in every industry, access to those data scientists, and have that access on a global scale. So just quickly, because I talked about how, when you’re looking for a data science solution, you want to make sure you are taking into consideration the full data science life cycle: certainly our platform is a unified platform and it does accelerate that time-to-value, looking at the data prep, helping you address the modeling and validation, and helping you to operationalize.
 
19:25 So now I’m going to discuss, very quickly, three cases of who we’ve helped. We’ve helped a lot of folks with predictive maintenance. We have airplane manufacturers, automotive manufacturers, telecom, energy, and I’m going to talk with you in particular about an airline, a cement producer, and a transportation company. Our first one is in the transportation area. A global shipping organization was looking to use predictive maintenance in order to lower their costs associated with time in the shipyard and also to increase the efficiency of their spare parts storage. So they used RapidMiner to build models from a diverse group of data across many different systems, including their error logs and messages, their on-board sensor data, their route schedules, the weather history across those routes, and also their maintenance reports. So as you can see, they had data that was coming from multiple different sources, and also different types of data: again, structured data, unstructured data, and semi-structured data from the log messages. And they were able to achieve their goals. They reduced the time that their ships were spending in the dockyard. They optimized their spare parts storage worldwide so that they had the right spare parts in the right places. And they also were able to really develop a proactive communication system for their maintenance and overhaul.
 
21:18 The next case was one of the two largest cement producers in the world, which deployed RapidMiner to predict and prevent machine failures and damage to their drilling equipment. One of the things that they wanted to do was increase the mill pressure without breaking the gears: they wanted to increase their production without putting undue stress on their equipment. So while trying to maximize the throughput of their machines and their revenue, they deployed predictive models with RapidMiner, and they did simulation and optimization, and looked at machine failure prevention. And they trained about 40 different folks around the world to use this. And they did find that they were able to increase revenue by increasing their throughput without unduly impacting their equipment. So when we look at all of the predictions involved in such a complex system – two million sensor values being analyzed on a continual basis – they actually were able to generate $250 million in revenue through this. Excuse me.
 
22:50 The next is a major global airline which uses RapidMiner to do predictive maintenance on its planes, to predict and prevent failures of components and devices. And they use a range of different data sources in their model: sensor data from the operation of planes; log entries on issues, failures, services, and repairs; error and failure messages; and also repair and maintenance service reports. Some of their goals were to predict the remaining life of components and devices and to optimize the use of their maintenance service crew. What they found through this was that they were able to reduce the out-of-service times of their planes. They could reduce the lost income, or lost revenue, and also reduce the costs associated with parking their airplanes at airports. So those are just a few examples of how predictive maintenance with RapidMiner has been used in different types of industries. With that I am going to turn it over to my good friend Jeff. Jeff is an expert data scientist who is going to show us how to use RapidMiner to do predictive maintenance modeling.
 
24:27 Hello. Good morning, everybody. I am the resident data science expert for this webinar and one of the data science guys here at RapidMiner. Let’s get started this morning. I just have RapidMiner Studio open here; if you were to download Studio and open it for the first time, you’d get the same view that we’re looking at here. Just to give you an idea, it’s a visual programming platform. By that I mean code optional. A lot of what we’re going to do today is just going to be dragging and dropping some operators into the canvas, building out a visual workflow, and showing off some predictive maintenance inside of RapidMiner. So, what I’ll quickly do is grab some data and then talk about it in the realm of predictive maintenance and some of the examples that were talked about by Leslie. So quickly, I’m just going to grab this machine sensor reference data here. Anything that I grab from this repository window is either pieces of data that I’ve stored, or models that I’ve stored that I want to use or deploy, or simply workflows that I’ve already built and want to embed into another process.
 
25:44 Most of what I’m going to do from here on out is actively pulling operators from my operators tab, or pre-built building blocks that I’ve built specifically for doing predictive maintenance inside of RapidMiner. So, what we’ll do is we’ll go ahead and run this data very quickly. If I hit run here, it should bring me to my results tab. With everything hooked up, you’ll see I’m working with just an array of sensor data; I think I’ve got sensors one through 25 here. At the end, I’ve got a machine ID so I know what machine’s being pulled up here, and I’ve got whether or not this machine has failed. So this is my historical data, coming off the floor from machines that have failed, and I’m being tasked with generating a model that’ll allow me to do predictive maintenance in RapidMiner. This way I can say, “Okay, these sensors are showing characteristics of a failure. Let’s make a maintenance decision here to keep this machine up and running.” And then, ultimately, these decisions result in cost savings for the business.
 
28:37 The same thing happens with the label. The reason why failure is green is because we’re telling RapidMiner this is the attribute you want to predict. So we already know off the bat that RapidMiner knows what to do with our ID, and it knows what to do with our label, or target variable, if you will. If we jump into statistics here, I can take a look. I’ve got different data types. Most of my sensor data’s numerical, which looks great. I don’t have any missing values, so this is a very well-structured data set. But these are all problems that I could identify here and handle, if I needed to, in some meaningful way. And I also get some statistics, so I can get a relative snapshot of what sensor one is doing, and maybe I have some background into what type of machine this is and where the sensor is. So I can see right off the bat: does it have some dangerous values? I can open up my chart here. I can say, “This is my average distribution; everything out here might be a little warning or telltale sign.” So I can already check and look at my sensor data and see, is there anything that right off the bat I can say might be a little dangerous?
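The kind of sanity check described here – counting missing values and eyeballing ranges per sensor – takes only a few lines of plain Python. The sensor readings below are made-up illustrative values, not data from the demo:

```python
# Hypothetical sensor readings: each list is one sensor's column of values.
readings = {
    "sensor_1": [0.2, 0.4, 0.3, 9.5],   # 9.5 looks like an outlier
    "sensor_2": [1.1, 1.0, None, 1.2],  # one missing value
}

def summarize(column):
    """Return the count of missing values plus min/mean/max of the rest."""
    present = [v for v in column if v is not None]
    return {
        "missing": len(column) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / len(present),
    }

for name, column in readings.items():
    print(name, summarize(column))
```

A reading far above the historical max (like the 9.5 here) is exactly the "little warning or telltale sign" worth flagging before any modeling.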
 
29:54 A lot of this data visualization is just primarily data discovery. So I can switch between chart types here, and I can take a deeper dive into my data. I can plot multiple aspects. Maybe I’ve got time-dependent data or temperature-dependent data, so I’d be able to take a data discovery dive into those and actually track what’s happening on my sensors when I filter for specific temperature types or specific pressure data. I have the ability to look at that in my platform. The fact is I’m working with very generic data. So what I can do is jump back into design. I know I don’t have anything in terms of data prep, such as missing values or anything like that. Step one, then, is loading the data and taking a look at it; take a dive into our data and discover it. The next step would be to say, “Okay, let me try to identify key sensors that would probably be the dominant predictors of any sort of maintenance on my machines.” What I actually have here, and I’ll walk us through it, is a building block that’s useful. This is a set of operators that I’ve already built in RapidMiner for this purpose. The reason why I’m inserting it here is that the rest of my demonstration will be fairly straightforward and I’ll only need to drag and drop a few operators; this step alone would probably take me upwards of an hour to drag and drop and configure and talk about each operator.
 
31:29 What I’ve done is I’ve gone ahead and built this building block, so I’ll go ahead and grab “determine influence factors.” It is a block of operators, so I’m going to drag this onto my workflow here. It is a sub-process, so I can come inside of it. Once inside, you’ll see I’ve got an array of operators going on here, all of which I’ve managed to drag from my operator tab onto the canvas and configure. But what I’m going to do is add a couple of breakpoints here, and I’ll walk through what’s going on in this data set to give you an idea of some of the things that you can be doing with your predictive maintenance data inside of RapidMiner.
 
32:15 What I’ll talk about here is that I’m individually weighting all of my attributes via multiple different weighting methods, and then generating an attribute based off of that weight. I’ll then append all of these together right here. So what I’ll do is go ahead and run this so you can get an idea of what this looks like. Once I’ve grabbed this, I’ve got individual weights for each sensor and the type of weight method that’s there. So if I sort on attribute here, you’ll see that for sensor one, I should have four different weights. All of my weights are located here: a correlation, a Gini index, an information gain, and an information gain ratio. So I’ve got four different weights generated here. They’ve all been appended together, so I’ve got all of my data. The next step, I’m just going to quickly pivot that data to condense down my data set. I’ve generated each of the weights as its own attribute, and the next step is to actually pivot around this data. So what I do is I just sum over the different methods. So if I continue running this, we should get an aggregation where I get a new variable – there it is – of importance. It takes the total contributions of all the different weights, and so I can sort based off of importance here. So I’ve identified that sensors seven, six, eight, and five tend to be the sensors of most importance to machines when it comes to predictive maintenance. So I have a good idea of what sensors are useful, and I can go down the list and see what sensors are not so useful, too. From there what I do is I just normalize that data. And ultimately, I just sort it so I can get an importance ranking here. After the normalize, sensor seven has an importance of one, and sensor 24, which has the lowest weight importance, is now the zero of my range. So from there I can get a final list of ordered attributes.
So I know that sensor seven is of most importance to me, and I can see how that drops off via each sensor.
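The weight-sum-normalize pipeline just described can be sketched in plain Python. The sensor names and weight values below are invented for illustration; in the demo, RapidMiner computes the four weights (correlation, Gini index, information gain, information gain ratio) from the actual sensor data:

```python
# Toy per-sensor weights from four hypothetical weighting methods.
weights = {
    "sensor_7": [0.9, 0.8, 0.85, 0.9],
    "sensor_6": [0.7, 0.6, 0.65, 0.7],
    "sensor_24": [0.1, 0.05, 0.1, 0.05],
}

# Step 1: sum each sensor's weights into a single importance score
# (the "pivot and aggregate" step in the demo).
importance = {s: sum(w) for s, w in weights.items()}

# Step 2: min-max normalize so the top sensor scores 1 and the bottom 0.
lo, hi = min(importance.values()), max(importance.values())
normalized = {s: (v - lo) / (hi - lo) for s, v in importance.items()}

# Step 3: rank sensors by importance, highest first.
ranked = sorted(normalized, key=normalized.get, reverse=True)
print(ranked)  # sensor_7 first, sensor_24 last
```

Summing across several weighting methods is a simple ensemble of importance measures: a sensor only ranks high if multiple criteria agree it carries signal.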
 
34:49 So here I’ve just taken a look, trying to find out what areas of my machine are most important to our predictive maintenance case. What I’m going to do is remove those breakpoints, because it’s great to see what sensors are providing useful information for those predictions, but now I actually have to make a prediction. So step three: what I’m going to do is just quickly add a multiply from the operators window here. And all this is going to do is allow me to send this data set out to multiple things. So I’ve just copied my data set here; each output port is a new copy of that data set. So what I can quickly do is say, “Generate a k-NN.” I’ll use a k-NN here just for simplicity’s sake. But if I take a look in here, inside of my predictive folder – these would be anything that I want to do any sort of predictive analytics around – I do have k-NN, … phase. There’s a whole slew of decision trees in here, including random forest, as well as the H2O gradient boosted decision trees. I’ve got plenty of neural nets, including the H2O deep learning. Both of those H2O operators are a brand new addition to our library. I’ve got plenty of regression analysis, both linear and logistic, and a couple of other regression options, as well as support vector machines – a vast array of those. And then, recently, there was an article saying that ensemble models are the hottest models in use by data scientists. So there’s an extensive ensemble models folder which, again, will allow you to utilize all the other machine learning algorithms we have available in RapidMiner. So you can see you have an extensive library of tools here for machine learning and the ability to actually make these predictions as good as possible.
 
36:51 So with that, I’m just going to run with the simple model of a k-NN, and I can easily just go ahead and hook this up and generate a model based off a k of 5 or something like that. So I can go ahead and run this. All this tells me is I’ve generated a nearest-neighbor model with two classes, yes or no. It just tells me what the model does; I have no idea if this model is great or not. So the next step would be: let’s validate this model. I will go ahead and copy out my k-NN and I’m going to grab a cross-validation for this case. I’m going to output a couple of things; I think I really only have to take my performance vector. So if I come inside my validation, I can quickly say, “Okay, grab me that k-NN that we had.” Ultimately, I can grab training, I can grab my model. So on the left-hand side of my cross-validation, I’m parsing up my data set into training and testing chunks. The cross-validation will allow me to utilize 100% of my data for training and testing, which is fantastic. So what I can do is keep my k of 5. I’ll say I need to apply a model. What I’m going to do here is just grab from my RapidMiner Wisdom of Crowds. It’s an opt-in tool for RapidMiner users: we track what operators you’re placing onto the canvas and what operators you are connecting those operators to, and sometimes we track specific parameters.
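For readers who want the mechanics behind this step outside of RapidMiner, here is a minimal pure-Python sketch of k-fold cross-validation around a k-NN classifier. The toy data set is invented (two cleanly separated clusters); real sensor data would replace it:

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """Classify `query` by majority vote among its k nearest training rows.
    `train` is a list of (features, label) pairs."""
    nearest = sorted(train, key=lambda fl: math.dist(fl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def cross_validate(data, k_folds=5, k=5):
    """Average accuracy over k_folds splits; every row is used for both
    training and testing, as in RapidMiner's cross-validation."""
    accs = []
    for fold in range(k_folds):
        test = data[fold::k_folds]
        train = [row for i, row in enumerate(data) if i % k_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        accs.append(hits / len(test))
    return sum(accs) / len(accs)

# Toy two-sensor data: low readings never fail, high readings always do.
data = [([i, i], "no") for i in range(10)] + \
       [([i, i], "yes") for i in range(20, 30)]
print(cross_validate(data))  # 1.0 on this cleanly separated toy set
```

Because each fold holds out a different slice, the averaged accuracy is an estimate of how the model will behave on machines it has never seen, which is the point of validating before deploying.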
 
38:21 In this case, Apply Model was down there, since 100% of users who have a k-NN inside of a cross-validation use an apply model next. So I was able to click and drag an operator instead of searching for it. Same thing with the performance operators. So it’s a really good tool for getting up and started as a new user, and I, myself, as an experienced user, find myself using the Wisdom of Crowds often, because what I need is always down there, so I’m clicking and dragging it. It’s as easy as that. All these operators do is apply this k-NN to my testing data so that I can actually get a confusion matrix out and see how my model is actually performing on this data set. I can go ahead and run this. It is a k-NN, so I get about a 64.67% accuracy. I have no idea if my k selection is proper. There is a bit of a swing on there, so I could possibly be doing up to 75%. But on the flipside, I could also only be doing just better than a coin flip. The next step, I could easily say, is, “Okay, I don’t know what the best option for a k-NN is, so I’ll have RapidMiner optimize this for me.” So I can easily drop in an optimize parameters and say, “Okay, validate this model for me based off of performance.” I’ll say, “Okay, give me a k. Let’s start at a k of 1, we’ll go to a k of 50, and we’ll do that in 50 steps.” So RapidMiner actually picks the best possible k for me, so I don’t have to guess and check. And this is really handy when you’re ultimately trying to automate this entire process. So as you’re getting new sensor data – maybe you get updates weekly and want to retrain a model off of the new sensor data that week, or that month – you have the ability to know that that process is always automated, and that automation encompasses the optimization of your models.
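Under the hood, the optimize-parameters step is essentially a grid search: score every candidate k and keep the best. A minimal sketch, where a hypothetical accuracy curve stands in for the full cross-validated evaluation that a real run would perform:

```python
def accuracy_for_k(k):
    """Stand-in for a cross-validated accuracy estimate; in practice this
    would run the whole validation with the given k. The curve below is
    hypothetical and simply peaks at k = 7."""
    return 0.70 - 0.002 * abs(k - 7)

# Sweep k from 1 to 50 in steps of 1, keeping the best-scoring value.
best_k = max(range(1, 51), key=accuracy_for_k)
print(best_k)  # 7
```

Because the sweep is just a loop over candidate values, it folds naturally into an automated retraining job: each retrain re-runs the sweep, so the deployed model always uses the k that fits the latest data.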
 
40:27 So, if I send this out, I can go ahead and run this. I should see a jump to about a 70% accuracy, which is a lot better than what I was doing. So 70% of the time, I’m making a correct prediction of whether or not this machine should get maintenance, which is great. Or rather, whether or not this machine is showing the characteristics of a machine that should be maintained. So I’d be right at least 70% of the time that this machine is showing sensor characteristics similar to what a machine that has broken down in the past was showing when it broke down. And I can easily log this information as well. Now that I’ve shown you how to go about building up this optimization, the next step, if I’m satisfied with this model, is that I can build this model and deploy it in the same process. I already have a set of data that I want to deploy over, which is my machine sensor data. So I can easily grab an Apply Model here, send in the k-NN that I’ve just generated, and send in my machine sensor data. And then, lastly, I think I need a sort, so I can send in my example set here and, based off of the specific attribute name, sort off of the confidence of yes. We’ll do increasing. And then lastly I want to remove columns, because I only want my scored data. So I’ll say, “Select Attributes,” and then I’ll invert the selection. What this will do is keep only my prediction attributes for me. So I can go ahead and run this.
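The deployment tail of the process (Apply Model, Sort on the confidence of "yes", then Select Attributes with an inverted selection to keep only the scored columns) can be sketched in plain Python. Everything here is invented for illustration: the training rows, machine IDs, and sensor values are not the webinar's data.

```python
import math
from collections import Counter

def score(train, query, k=3):
    """Return the predicted label and the vote-share confidence for 'yes'."""
    nearest = sorted(train, key=lambda r: math.dist(r[0], query))[:k]
    votes = Counter(lbl for _, lbl in nearest)
    label = votes.most_common(1)[0][0]
    return label, votes["yes"] / k

# Historical labelled rows and unlabelled current machines (all invented).
train = [((70, 0.20), "no"), ((72, 0.25), "no"), ((71, 0.22), "no"),
         ((86, 0.50), "yes"), ((88, 0.55), "yes"), ((84, 0.48), "yes")]
machines = {137: (71, 0.21), 138: (87, 0.52), 139: (73, 0.26)}

# Apply Model + Sort (confidence of "yes", increasing) + keep scored columns:
scored = [(mid, *score(train, feats)) for mid, feats in machines.items()]
scored.sort(key=lambda row: row[2])
for mid, label, conf in scored:
    print(f"machine {mid}: prediction={label} confidence(yes)={conf:.2f}")
```

Sorting by confidence puts the machines most likely to need maintenance at one end of the table, which is what makes the output immediately actionable on the floor.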
 
42:39 I’ve now deployed that model we just generated. So I can see I’ve got plenty of machines, machines 137 through 272, and then I’ve got a prediction, and they’re sorted based off of the confidence of that prediction. I can see I’ve got a ton of machines here that have been predicted no, and I’ve got a handful of machines predicted yes; they’re showing characteristics of machines that should get maintenance. These kinds of small, deliberate processes are really useful and really analogous to all types of predictive maintenance use cases inside of RapidMiner. Hopefully this shows off how quickly I can go from loading in a data set to doing some sort of discovery, especially if I’ve done this kind of discovery before; I can be utilizing sets of operators that I use for identifying key attributes all the time, as I did with those building blocks. Next, I can automate my model generation and model deployment inside of a single process. So maybe every couple of months I want to retrain a model and deploy it over the floor. I could easily do that, or I could easily store this model and deploy it weekly, or daily, or hourly, or what have you, based off the sensor information coming in from the machines. And I’d have the availability of scheduling those processes to run if I’m utilizing RapidMiner Server.
 
44:06 Lastly, I’ll actually open up a whole processed version of this data. I don’t need to rebuild this, because I have everything here: I’ve got that influence factor that I’ve generated. What I can also do is create a log so I can track the different k-NNs that were tried, and I can save that data set so I can see model performance based off of the automated k-NN. That’s what’s going on here, and this is processing those scores in a very similar way to what I did. Again, ultimately I’m pulling in equipment failure data based off of observations of past machines, and I want to generate a model on that and apply it to current machines in order to anticipate machine failures. I look through the data to find influence factors. I then train a model, and I automate that model generation process. And then, finally, I deploy that model, either on my machine or in my manufacturing center or what have you. And then, ultimately, the outputs: I get a set of influence factors for that model, an optimal k-NN, an optimization log, and failure predictions for my current machines. So if I go ahead and run this, you’ll notice I get the same set of results. My log tells me, if I sort on accuracy here, that I had a K of 37. So I can get all this information out on any models that I’m optimizing, whether it’s a decision tree or a neural net or a support vector machine. I can always track what’s getting optimized, and why, and how well those optimizations are performing.
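The Log operator's role here, recording each K tried along with its accuracy so the winning trial surfaces when you sort on accuracy, can be sketched as below. The accuracy values are randomly generated placeholders, not real optimization results; only the shape of the log mirrors the demo.

```python
import csv
import io
import random

random.seed(2)

# Hypothetical optimization log: one row per K tried, as a Log operator
# would record inside an Optimize Parameters loop (values invented here).
log = [{"K": k, "accuracy": round(random.uniform(0.55, 0.72), 4)}
       for k in range(1, 51)]

# Sorting the log on accuracy surfaces the winning K, the same way the
# demo's sorted log revealed K = 37.
log.sort(key=lambda row: row["accuracy"], reverse=True)
print("best trial:", log[0])

# Persisting the log lets you audit any optimized model later.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["K", "accuracy"])
writer.writeheader()
writer.writerows(log)
print(buf.getvalue().splitlines()[1])  # first data row = best trial
```

The same pattern applies whatever the learner is; swap k-NN's K for a decision tree's depth or an SVM's C and the log's structure is unchanged.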
 
45:50 With that, I’m going to pass this over to the question and answer section. Go ahead and visit RapidMiner.com. You can download the number one open source data science platform and get started with your data science project today.
 
46:09 Great. Thanks, Jeff, and thanks, Leslie, for your presentation today. As a reminder to the audience, we will be sending a recorded version of today’s presentation via email within the next few business days. And like Jeff just said, we are now happy to take any questions that you have. I see there are already a ton of questions, so we’ll try to get through those now. So let me ask you, Leslie, the first question: “How can I take advantage of this without sensors?”
 
46:40 Well, that’s a good question. You must have some older equipment. My significant other just bought a 1948 Ford, and it has zero sensors on it. I think what you have to rely on is the data that you have, whether those data are maintenance reports that you’ve been gathering or any other type of information on your equipment. You use what you have. If you don’t have sensor information, then you’re not going to get as precise a prediction, certainly. If we were looking at a car, for example, think of all the different pieces and parts of the car, and how many different sensors might be in there. But you have to use the data that you have. You’re not going to get as precise and accurate a prediction with data that isn’t giving you all of that minutiae and minute insight.
 
47:50 Great. Thanks, Leslie. I have a question here for you, Jeff. For the sensor source data, will RapidMiner collect directly through Spark?
 
48:00 Yeah. We actually have RapidMiner Radoop, which is a connection into Hadoop. So if you’ve got anything sitting on any of the Hadoop infrastructure, such as Spark, we would be able to connect to it. And this is definitely a question about scalability. I think Leslie mentioned in the presentation that there’s so much untapped data, specifically around that sensor source data. You can have petabytes of sensor data just sitting inside of your Hadoop cluster, and RapidMiner would have the availability to not only tap into that data source but also actually run your machine learning on top of the cluster. So you wouldn’t be running that in memory in RapidMiner; instead, it’s a native process pushed down onto Hadoop.
 
48:51 Thanks, Jeff. I have a question here for you, Leslie. This person says, “Before anything, I believe there’s a need to know what the goal or possible achievement is that we’re trying to get by using machine learning.”
 
49:02 Oh, absolutely. When you saw me present the Predictive Analytics Life Cycle, I was talking about the actual platform and what you can do in the platform. Certainly, a data scientist is going to want to sit with their subject matter experts to understand the goal, to understand the operation, and potentially to understand all the different sources of information that they’re going to have. Certainly, that’s the first step if you look at CRISP-DM, the predictive analytics life cycle. I just had an equipment failure here, right in this room. Absolutely, you’re correct. You do need to understand the goal. You can’t develop a measure for it if you don’t know what it is you’re trying to accomplish. So yes, excellent point.
 
50:05 Great. Thank you, Leslie. Question here for you, Jeff. What if the data is in a log file and not nicely organized? Will RapidMiner still be able to handle that?
 
50:15 Yeah. RapidMiner has an extension that you can access inside of RapidMiner Studio which handles text processing, as well as ingesting unstructured data. So RapidMiner has no issue working with unstructured data. Obviously, if you have dense log data from your sensors, no situation is as perfect as what I showed, but there are plenty of ways of grabbing that data out of a log file, structuring it in ways similar to what I used here, and then eventually pushing that through machine learning.
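The kind of log-to-table restructuring described above, pulling typed fields out of semi-structured log lines before modeling, might look like this in plain Python. The log format and field names are entirely hypothetical:

```python
import re
from datetime import datetime

# Hypothetical raw sensor log lines; the format is invented for illustration.
raw = """\
2017-03-01 08:00:12 machine=137 temp=71.2 vib=0.21 status=OK
2017-03-01 08:00:13 machine=138 temp=87.9 vib=0.55 status=WARN
garbage line the parser should skip
2017-03-01 08:00:14 machine=139 temp=72.5 vib=0.24 status=OK
"""

pattern = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"machine=(?P<machine>\d+) temp=(?P<temp>[\d.]+) "
    r"vib=(?P<vib>[\d.]+) status=(?P<status>\w+)"
)

rows = []
for line in raw.splitlines():
    m = pattern.match(line)
    if not m:  # tolerate malformed lines instead of failing
        continue
    rows.append({
        "timestamp": datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S"),
        "machine": int(m["machine"]),
        "temp": float(m["temp"]),
        "vib": float(m["vib"]),
        "status": m["status"],
    })

print(f"parsed {len(rows)} of {len(raw.splitlines())} lines")
```

Once the rows are in a regular tabular shape like this, they can feed the same kind of training and scoring pipeline shown in the demo.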
 
50:51 Great. Thanks, Jeff. Question here for you, Leslie. Are the algorithms needed to accomplish this included in the free version of RapidMiner?
 
51:00 All of the algorithms are available in all the versions of RapidMiner, including the free versions. For folks who are looking to try out and prove the tool, we recommend you do an evaluation using that free version of RapidMiner, and you can try out all of those algorithms in there. Just go to RapidMiner.com and click to download the software. Now, during your evaluation you’re not going to be able to look at your millions and millions of rows of data, but you should be able to showcase how you can look at those things. RapidMiner is actually priced by data volume, and after you get beyond 10,000 rows of data, that’s where the commercial versions kick in. But all the algorithms are included in all of the products.
 
51:58 Great, thanks. Jeff, I have a couple questions for you around the building blocks. What is inside the building blocks, and how can we get it? And then another question is, how did you create the determined influence building block?
 
52:13 I’m actually going to pass this over to RapidMiner and just show this, because it’s a very easy visual answer. I have the main process I built annotated to show off at the end, so I pre-built this before. The determine influence factors block is just a renamed sub-process. The reason why I pre-built it, first off, is that this set of operators would be impossible to build in real time, and I wanted to show off the availability for a user to go from opening your data to actually deploying a model in a single pass. So what I did to speed that up is I selected all of these operators. You have to build your own building blocks; there are building blocks that come shipped with RapidMiner, but this is not one of them. So for the actual determine influence factors block that I built, all I simply did was, with all of these highlighted, move them into a sub-process like I had before, then right-click that sub-process and say, “Save as building block.” I’ll give it the name, it was determine influence factors, and go ahead and save this. And then I’d have the availability of right-clicking (I’m actually going to undo these changes really quick) and just saying, “Insert that building block.” Now, I think there is a data prep one in there. No, that’s one of my own, but the cross-validations and the transforms here all come preloaded in RapidMiner. You have the availability of managing these building blocks yourself, and also sharing them with your team if you are utilizing RapidMiner Server.
 
54:03 Thank you, Jeff. Question here for you, Leslie. Can non-statistics IT professionals use the tools to create models?
 
54:12 That’s a really good question. I would say if you understand data science, you can certainly create models using RapidMiner. We’ve designed it as a visual tool so that you don’t have to be a programmer in order to use it. It has a lot of deep capability within it, so even data scientists love this tool; even programmers love this tool. In fact, if you do like using some of your R and Python code, we make it easy to do that as well. What I would say, however, is if you don’t understand data science and the pieces and parts in there, what I would recommend is going to RapidMiner.com, thinking about starting a pilot project, and going to the marketplace to work with a data scientist there on a prototype or a pilot project for what you’re trying to do. Because data science is a science, and it’s not like learning the alphabet. It’s more like learning a language, and it does take some skill over time. But if you understand the components of it, certainly the visual tool is going to make it very easy for you.
 
55:46 Great. Thanks, Leslie. Jeff, I have a question here for you. This person is asking, “Is it possible to extract value even if you don’t have a failure event?”
 
55:58 There are ways of extracting value in RapidMiner beyond what we built. You can do some unsupervised learning. Maybe you want to run through some segmentation so you can see what machines, and what types of sensor data, might be associated with anomalies, but not necessarily breakdowns. I think I saw a couple of other questions on how to determine outliers. You could easily run some sort of outlier detection. We have an anomaly detection extension that doesn’t ship with the package but is free for users. It’s generated by RapidMiner, and you would acquire it the same way you acquire the text mining extension; both are free. They just don’t come shipped, in order to save installation size. But the idea is you would download them. There’s a bunch of operators for doing that outlier detection, so there are ways to extract value from your sensor data without actually having failures.
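A minimal version of the unsupervised outlier detection mentioned here, flagging readings far from the bulk of the data without any failure labels, could be a simple z-score check. The readings below are invented, and RapidMiner's Anomaly Detection extension offers far richer detectors; this only illustrates the idea:

```python
import random
import statistics

random.seed(3)

# Invented vibration readings: mostly normal behaviour plus a few oddballs.
readings = [random.gauss(0.25, 0.03) for _ in range(200)]
readings += [0.6, 0.72, 0.05]  # planted anomalies

mean = statistics.fmean(readings)
sd = statistics.stdev(readings)

# Flag anything more than 3 standard deviations from the mean. No failure
# labels are needed; the data's own distribution defines "unusual".
outliers = [x for x in readings if abs(x - mean) / sd > 3]
print(f"flagged {len(outliers)} of {len(readings)} readings as anomalous")
```

Machines flagged this way are candidates for inspection even before any breakdown history exists, which is how value gets extracted without a failure event.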
 
56:59 Great, thanks. Another question here for you, Jeff. Can all RapidMiner operators be pushed down to run natively in Hadoop? 
 
57:09 Yeah! Actually, they can. We do have a set of Radoop-specific operators that will be run across all nodes of the cluster, and that machine learning library grows with every update; we’re constantly adding new additions to it. But furthermore, any of those in-memory operators that you see can be run on a single node of your cluster, through what’s called a single-process pushdown operator. So any of those operators that you want to run, you can push down. It will only run on a single node, but you’d have the availability of using the extensive library that’s built into RapidMiner.
 
57:54 All right, great. Thanks, Jeff. It looks like we’re at the top of the hour, and we still have some questions coming in. If we weren’t able to address your questions on the call today, we’ll make sure to follow up with you via email. And like I said, we’re also going to be sending the recording within the next few business days. So I wanted to thank Leslie and Jeff again. Thanks to everyone for joining us for today’s presentation, and have a great day.