Michael Gloven, Managing Partner, EIS
This session outlines how to use RapidMiner to support investment & maintenance decision-making for linear and networked assets such as pipelines, roads, electric transmission lines, water distribution systems, etc. Join as Michael demonstrates how machine learning can predict undesirable events and monetized risk for linear and networked assets.
These results may then be used to support specific risk mitigation strategies and budget plans. The objective is to put in place a more strategic data-driven approach to resource decision-making, which should improve the risk profile and profitability of the asset owner. Key to the presentation is demonstrating important considerations for the application of machine learning to these types of assets.
00:03 [music] So thanks for coming to this session. I don’t know if I can meet the bar from the last sessions here, they were just tremendous I thought. I am not a data scientist, I’m not a developer, I’m more of a domain expert and I’m going to show you a use case. It’s going to be kind of ad hoc, too, as far as showing RapidMiner. So it’ll be a little chaotic. Hopefully, you’ll get something out of this thing. To get started here the topic is machine learning-based risk management for linear and network assets. And what we’re trying to do is use RapidMiner as a tool and a set of practices to try to figure out where these things might happen, okay? This is a linear asset, a natural gas pipeline. A failure in Northern California cost this company about $3 billion, okay? I’m going to walk you through an example of how we can kind of measure these things. Here’s a faulty electrical transmission line. You may have heard about the wildfires also in California. These are the kind of things we’re looking for with machine learning.
01:06 What are pipeline failures? You may have seen this picture in the news before, this is a fire truck responding to a main failure and it just happened to fall into the sinkhole that the main created. It’s interesting to note that water pipelines, if you care, have about one failure every four miles per year. So for every four miles of water pipeline, there’s a failure every year. And one reason that the failure rate is so high is that it’s very much a fail-and-fix type of mentality. But what we’re trying to do is bring in some machine learning so we can proactively predict where these high susceptibility areas are so these things don’t happen. And then here’s a sinkhole for a wastewater line. So just a giant hole in the middle of the road, and you may have seen these before.
01:55 So I want to show– I want to frame up this use case then and talk about risk first before we dive into the hands-on here. I’ll show you what’s going on underneath it here. But if you’re not familiar with risk, what we’re doing is we’re monetizing risk by trying to determine the number of these unwanted events that may happen. So frequency of unwanted events, trying to count these events, trying to determine their susceptibility. And we use both classification and regression to come up with these numbers. Once we get those, we can multiply that times an expected range of consequences. So once you monetize the risk, now you’ve got some visibility, business visibility into what’s really happening with these assets. And you can start to nominate your proactive-type measures to mitigate the risk.
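The risk arithmetic described above (predicted event frequency multiplied by an expected range of consequences) can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function name, the use of the range midpoint as the "expected" value, and the dollar figures are all assumptions made for the example.

```python
def monetized_risk(event_rate_per_year, consequence_low, consequence_high):
    """Return (low, expected, high) annual risk in dollars.

    The expected value uses the midpoint of the consequence range,
    which is an assumption made for this sketch only.
    """
    if event_rate_per_year < 0:
        raise ValueError("event rate must be non-negative")
    midpoint = (consequence_low + consequence_high) / 2.0
    return (
        event_rate_per_year * consequence_low,
        event_rate_per_year * midpoint,
        event_rate_per_year * consequence_high,
    )


# Hypothetical segment: 0.02 predicted events per year, consequences of $1M to $5M.
low, expected, high = monetized_risk(0.02, 1_000_000, 5_000_000)
print(f"annual risk: ${low:,.0f} to ${high:,.0f} (midpoint ${expected:,.0f})")
```

In practice the event rate would come from the classification or regression model and the consequence range from the business side; the multiplication itself is the simple part.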
02:44 This is Tableau here. So this is the first kind of how-to; this is a user session, so it’s more of the technical stuff to show you. We do use Tableau. Tableau is really easy to work with if you’re not working with it already. We simply just write results out of RapidMiner into a SQL database and we show them here and then we use some free imagery from Mapbox, which is really cool. It gets down to one-to-three-meter resolution so you can really home in on what’s happening with the system. So this is showing a natural gas pipeline system, about 130 miles, and we’re going to zoom in on this little area here. And this is an output out of RapidMiner. So I thought I’d start with the end first and I’ll work through how we got to here for the time that I have.
03:29 So what this is showing is, for a linear or a distribution type network, what the risk values might look like as you walk across it. So let’s say this is a road. It could be an electrical transmission line, could be a water line, whatever you choose. And each one of the spikes here represents a net present value of risk, and risk is calculated by first determining through machine learning these unwanted event rates, then multiplying that times the consequence to come up with the risk value, a monetized risk value. Strategically, what we do is define what’s called an action threshold. It’s very difficult to figure out; that’s a whole other topic. But this level here, where you see that red line: anything above that line might tell you that that’s where we want to address our spend. How do we want to mitigate risk? This risk above that line is not tolerable. So we define that as what we call an area of interest, and this is a really cool– this is something that RapidMiner does that I’ll show you: we use the Explain Predictions operator to figure out what is driving that risk. So what you see here, in this area of interest, and you may not be able to read the writing at the bottom, is that each of these bands, each of these colors, is an influencer; it’s a level of importance to the risk values that we just looked at. So you can see how this can guide us towards what kind of mitigations we should propose, okay?
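A rough sketch of the action-threshold idea: discount each segment's annual risk to a net present value, then flag the segments that sit above the threshold as areas of interest. The talk does not say how the NPV is computed, so the constant annual risk, the 5% discount rate, the 20-year horizon, the threshold value and the segment IDs below are all hypothetical.

```python
def npv_of_risk(annual_risk, discount_rate, years):
    """Discount a constant annual risk over a planning horizon."""
    return sum(annual_risk / (1.0 + discount_rate) ** t for t in range(1, years + 1))


def areas_of_interest(segments, action_threshold):
    """Return IDs of segments whose NPV of risk exceeds the threshold."""
    return [seg_id for seg_id, npv in segments if npv > action_threshold]


# Hypothetical segments with assumed annual risk values in dollars.
segments = [
    ("seg-001", npv_of_risk(10_000, 0.05, 20)),
    ("seg-002", npv_of_risk(90_000, 0.05, 20)),
    ("seg-003", npv_of_risk(45_000, 0.05, 20)),
]
flagged = areas_of_interest(segments, action_threshold=500_000)
print(flagged)
```

Choosing the threshold itself is the hard part, as the speaker notes; the code only applies one once it has been agreed.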
05:01 All right, then the last slide here is once we’ve learned a model with all the observations of the levels of failures that we’ve had, we can then take that model and we can push scenarios through it. Kind of like a prescriptive analytics type approach, what these different colors are showing are the different types of projects like A through D, not very descriptive here, but showing you what projects can drop us below the action threshold, okay? So it’s very much around the business of we have this risk, this profile, what’s acceptable? What’s not acceptable? How do we use machine learning to drive down that risk to some level of acceptability? That was a whole mouthful of stuff here, but that’s framing up the use case. That’s where we’re at.
05:46 Okay, for linear and network assets, we have to deal with this thing called dynamic segmentation. Has anybody heard of dynamic segmentation before? Okay, so we’ve got a few. So whenever you’ve got assets out in the public domain that are spatial, the asset may lay on top of different attribute layers that are spatial layers. And we have to have a way of creating examples, okay? And we do that through something called dynamic segmentation. So you can see at the bottom each one of these layers here may have different information, but we have to figure out what information belongs to a specific piece of that asset because these examples are not like the Titanic. They’re not people, they’re not machinery, like pumps or compressors. How do you define a 100-mile piece of pipe and break it down into smaller parts to work with? That’s what we’re doing here. So we use this thing called dynamic segmentation, which happens to be a tool that we’ve developed.
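To make dynamic segmentation concrete, here is a minimal sketch that cuts a linear asset wherever any attribute layer changes value, so that each resulting segment becomes one example with a single value per attribute. It is not the speaker's tool; the layer names, the interval representation and the mile positions are assumptions for illustration.

```python
def dynamic_segmentation(layers):
    """Split a linear asset wherever any attribute layer changes value.

    layers maps an attribute name to a list of (start, end, value) intervals
    measured along the asset. Returns one record (example) per segment.
    """
    breakpoints = sorted(
        {p for intervals in layers.values() for s, e, _ in intervals for p in (s, e)}
    )
    records = []
    for start, end in zip(breakpoints, breakpoints[1:]):
        record = {"start": start, "end": end}
        for name, intervals in layers.items():
            # Value of the interval covering this segment, or None if the layer has a gap.
            record[name] = next(
                (v for s, e, v in intervals if s <= start and end <= e), None
            )
        records.append(record)
    return records


# Hypothetical layers along a 6-mile pipe.
layers = {
    "soil": [(0.0, 2.5, "clay"), (2.5, 6.0, "sand")],
    "coating": [(0.0, 4.0, "coal tar"), (4.0, 6.0, "epoxy")],
}
rows = dynamic_segmentation(layers)
for row in rows:
    print(row)
```

Two layers with one change each produce three segments here; with many layers the number of examples grows with the number of distinct breakpoints, which is how a 100-mile asset turns into hundreds of records.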
06:46 And this kind of gets us going, “Okay, now, how do we figure out what attribute layers to use?” And I put this little link down here, I think Fabian might be here, but there’s a really good couple posts on feature weighting, feature engineering. And I’m going to show you what that looks like. I borrowed your processes and kind of fixed them my way a little bit, but this is a way to now take all these different attribute layers that are somewhat driven by the domain experts and figure out, “Well, what data is really important that we need to work with?” Okay, so that is that. Let’s get into RapidMiner.
07:37 So step one, and it’s an iterative process, is we’ll sit with the domain experts regarding what kind of threat or what kind of target of interest we’re interested in modeling. And I failed to mention it, but let’s say we’re looking at degradation, like corrosion on a steel asset that’s sitting in the ground. We would sit down with the domain experts and come up with what their suggestions are on the data that needs to be collected, identify where we can get that data, and then we go through that dynamic segmentation process to create the record sets. Once we collect that data, this is an iterative process. We will create a data set and then we get to here, okay, so we’ve created an example set based on dynamic segmentation. If I run this, this is what it looks like. Okay, so we have roughly, in this case, only 800 plus examples. This is showing us our label, so the green column is saying that “Yeah, there was some observation of degradation on that piece of pipe in this case,” and then these are the selected attributes that the domain experts came up with. And the first thing that we want to find out is how do these correlate or how do they relate to the target of interest? And there’s different ways that you can figure this out. One is through a weighting type method where you’re looking at a particular feature and seeing how it relates to the target. Then there’s wrapper methods that you can use where you can integrate the performance objective into the analysis.
09:11 But if we run this thing, and then I’ll show you the process– let me show you the process first, the process is pretty straightforward. We’re taking the data, we do some preparation on the data, typically normalize it, we also put in a random variable in the process because we want to see how a random variable behaves along with the selected data, and that’s always quite interesting. And then we loop through all these different weighting operators. So if you go to the weighting operator– weighting folder, I think, you’ll see that there’s probably 12 or 20 different types of techniques to use to figure out what features are most important to the target of interest. So I ran this process pretty quickly with those 800 plus records, with the given target of interest. And we get this in front of our domain experts to say, “Okay, you’ve given us all these observations on where the problems are along with the data. Does this make sense as far as the data driving the prediction of interest?” And so we iterate through this because most of the time we’re like, “Well, why is this piece of data so high up here?” So what these bars are, these are just normalized values for the different weights, given these different techniques here.
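The idea of weighting attributes alongside an injected random variable can be sketched in a few lines. The speaker's process loops over several RapidMiner weighting operators; this sketch swaps in a single, simple stand-in (absolute Pearson correlation with the label) on synthetic data, and every attribute name and value here is hypothetical.

```python
import random
import statistics


def correlation_weight(xs, ys):
    """Absolute Pearson correlation between one attribute and the label."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def normalized_weights(columns, label):
    """Weight every attribute, then scale so the largest weight is 1."""
    raw = {name: correlation_weight(xs, label) for name, xs in columns.items()}
    top = max(raw.values()) or 1.0
    return {name: w / top for name, w in raw.items()}


rng = random.Random(0)
n = 800
age = [rng.uniform(0, 60) for _ in range(n)]       # hypothetical attribute
depth = [rng.uniform(0.5, 3.0) for _ in range(n)]  # hypothetical attribute
# Synthetic label: degradation becomes more likely with age only.
label = [1 if rng.random() < a / 80 else 0 for a in age]
columns = {
    "age_years": age,
    "depth_m": depth,
    "random_reference": [rng.random() for _ in range(n)],  # injected random variable
}
weights = normalized_weights(columns, label)
for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {w:.2f}")
```

Any real attribute that scores at or below the random reference is a candidate for the "why is this piece of data so high up here?" conversation with the domain experts.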
10:27 So you’re probably familiar with like Gini Index and information gain and et cetera, et cetera. We just picked a handful of those that we thought were important, and we want to see how they all relate together and see if the right data is popping up on top. So it’s just a process we go through. And this is something, Scott, I could share, if you want, on the repository, it’s kind of useful. This is a classification, we can also go through the same process for regression. So, again, it’s just a looping process where we’re just grabbing the data. In this case, since we’re doing regression, our target of interest is– let me run it, I’ll show you. Our target, in this case, is not true-false, it’s a numerical value. And then there’s different techniques. See, the methods here are a little bit different than you would have with classification, so the weighting just depends on whether you’re doing classification or regression. All right. So that’s that.
11:32 Now we go through this iterative process, collect some data, condition it, kind of feel good about, “Okay, we think it’s going in the right direction,” then we pop over to Auto Model. And the question that we want to answer is, “What’s the best method?” So I’ll pick one of these here, we’ll do classification, okay, you’ve already seen all this. Pick our prediction, okay, it’s unbalanced. We can deal with that later. And normally what we do is even though we have this red light, green light scoring system for the data condition, we pick everything because the domain experts have already told us that, “Hey, all this stuff’s important, don’t just arbitrarily remove stuff.” So the first run, even if it’s a red light or yellow light, we’ll still include it in the analysis. There’s a couple reference variables that need to go away. Okay, and then we just run it. And I know it’s going to be faster, too, with the RapidMiner Go. Okay.
12:53 Run that on RapidMiner Go.
12:54 Yeah. So we use this as a first cut to kind of see where we are with the data, where we are with the best method. And what’s interesting here is we don’t necessarily just go for the best accuracy, we put this in the context of what the business needs. And in the case of asset failures, sometimes we’re more concerned with sensitivity. And if you’re not familiar with sensitivity, that’s more about misses. Because if we have a model in place, and remember the first slide I showed you with the exploding natural gas pipeline that cost $3 billion? Yeah, that’s a lot of money. That’s a lot of impact. So we want to focus on that. We do use the cost matrix, but kind of outside of Auto Model right now. But that’s a consideration. A false positive doesn’t cost that much more money than– well, it doesn’t cost anything compared to a miss. So we’re always leaning the models more towards making the sensitivity high. So here you can look through and see– let’s look at accuracy. See, accuracies are all pretty much the same going across so we don’t get too excited about this. But when we go to sensitivity, we’ll see that “Oh, the generalized linear model and deep learning seem to be the best,” and our bias is to use the linear model versus deep learning because I can actually kind of explain it to somebody. Deep learning is just a little bit more challenging.
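The point that two models can share the same accuracy while differing sharply in sensitivity and in cost can be shown with a small worked example. The predictions and the per-error costs below are made up for illustration; they are not taken from the talk.

```python
def confusion_counts(actual, predicted):
    """Return (true positives, false negatives, false positives, true negatives)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fn, fp, tn


def summarize(actual, predicted, cost_miss, cost_false_alarm):
    """Accuracy, sensitivity (share of real failures caught) and total error cost."""
    tp, fn, fp, tn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / len(actual),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "cost": fn * cost_miss + fp * cost_false_alarm,
    }


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # misses two failures, no false alarms
model_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # catches every failure, two false alarms

# Hypothetical costs: a miss is assumed to be 100 times worse than a false alarm.
result_a = summarize(actual, model_a, cost_miss=1_000_000, cost_false_alarm=10_000)
result_b = summarize(actual, model_b, cost_miss=1_000_000, cost_false_alarm=10_000)
print("model A:", result_a)
print("model B:", result_b)
```

Both models score 80% accuracy, yet the one with higher sensitivity is far cheaper once misses are priced in, which is the reason for leaning the models towards sensitivity.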
14:20 So we go through this and then I’ll show the domain experts because we’re kind of working in a team here, what these models look like. And if you haven’t worked with this already here’s the model itself. The simulator is really kind of cool because now you can vary the incoming features. They really like this, by the way, domain experts like this, they want to see how the model behaves if you adjust the features, so that’s all I’m doing here. In the performance predictions, other stuff to look at. Let’s see with the time I think I’m going to– I won’t do regression. Yeah, let me skip ahead. I want to show you then that once we come up with a suggested method, so let’s say it’s GLM, then we’ve incorporated learning curves as a process. And if you’re not familiar with learning curves, they’re really good at explaining performance, so I’ll just run this and show it to you, okay.
15:33 Anybody worked with learning curves? Kind of familiar, okay. What learning curves show is, on the X-axis for a particular model, you can vary the sample size or the number of attributes, number of features, you can put hyperparameters down on the X-axis, and then you can show your performance on the Y-axis, okay? Then the blue line, in this case, is the cross-validated performance, whereas the green line is the test performance and the idea here– first, what this is telling us is that when we get to about 2000 examples, this is when the model starts to settle out, that the performance, we’re looking at R squared in this case or explainability of the model. So we’re up somewhere around 40% right here. So that’s the first piece of information we get out here is like, “Yeah, we probably need this much data to have a decent model.” Then as you go across here and you see these lines crossing up and down like this, this is more of an indication of variance, okay? So there’s two types of error, bias and variance. There are certain remedies that you have if you have high variance– I mean, what is high variance? Well, it’s kind of– yeah, that’s more of a discussion to have. But just looking at this, this doesn’t look like this is a highly variant model in this case.
17:00 Then the other question would be, “What about the bias?” And bias can be represented more as, “What is your expectation of this performance versus where you are today?” So the R squared is at 40%. If the end-user expects that the explainability, R squared of the model, needs to be like 90% or 95%, that tells you you’ve got a big gap between what your model’s doing and what your expectations are, and sometimes that’s referred to as bias. And with bias, there’s different kinds of mitigations to fix that. Then down here, we just have another indicator of performance, RMSE. So it’s another thing to look at. But we look at that right away and see, “Okay, the test and the cross-validation performance is about the same. It looks like a good model from a variance standpoint.” But then again, what about from a bias standpoint? What are the expectations versus what it’s actually doing? All right.
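A learning curve of the kind described can be sketched with a plain least-squares line on synthetic data: grow the training sample, and at each size log the cross-validated R squared, the held-out test R squared and the test RMSE. This is an illustration under stated assumptions, not the speaker's process: it uses a growing slice of the data rather than bootstrapping, a one-attribute linear model rather than a GLM, and a noise level picked so that the attainable R squared is roughly 0.4.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def r2_and_rmse(model, xs, ys):
    """Coefficient of determination and root mean squared error on (xs, ys)."""
    a, b = model
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot, (ss_res / len(ys)) ** 0.5


def cross_validated_r2(xs, ys, folds=5):
    """Mean R squared over k folds, holding out every k-th example in turn."""
    scores = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        hold = [i for i in range(len(xs)) if i % folds == k]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(r2_and_rmse(model, [xs[i] for i in hold], [ys[i] for i in hold])[0])
    return sum(scores) / folds


rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(3000)]
ys = [2.0 * x + rng.gauss(0, 7.0) for x in xs]  # synthetic target with heavy noise
test_x, test_y = xs[2500:], ys[2500:]           # last 500 examples held out for testing

curve = []
for n in (50, 100, 250, 500, 1000, 2000, 2500):
    model = fit_line(xs[:n], ys[:n])
    test_r2, test_rmse = r2_and_rmse(model, test_x, test_y)
    curve.append((n, cross_validated_r2(xs[:n], ys[:n]), test_r2, test_rmse))

for n, cv_r2, test_r2, test_rmse in curve:
    print(f"n={n:5d}  cv R2={cv_r2:.3f}  test R2={test_r2:.3f}  test RMSE={test_rmse:.2f}")
```

Plotting the logged rows with sample size on the X-axis gives the two lines discussed above: how far apart they sit speaks to variance, and how far the plateau sits below the expected R squared is the gap the speaker calls bias.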
18:01 That might be an interesting feature to put into RapidMiner, right? Learning curves? I’ll show you what it looks like for those curious. It’s a looping operator, so we’re just feeding in learning data, setting the role, conditioning the data a certain way, and then we open up the loop right here. We’re bootstrapping data to create larger sample sizes, and then you’ll see that we split the data for cross-validation and testing and then just simply do some logging of the data and write that out to an example set that we can then graph. And if you haven’t done this already, down here on the bottom left, if you can’t see that, you can save these configurations– yes?
18:50 How are you relating the R squared to the bias?
18:55 To the bias?
18:57 Like, I don’t get that.
18:57 Yeah, to the bias. Well, there’s an expectation, I can talk to you after, too, about it, but there’s an R squared expectation. What is it? Well, for our industry, linear assets, what is the R squared? Maybe for pharmaceuticals and for the airline industry, it’s set, you kind of know what those numbers are. For our business, the business I’m in, we don’t know. So it comes back–
19:21 How is that r squared every time in?
19:25 It’s the coefficient of determination. Yeah, yeah, yeah. But if you haven’t seen– if you want to save these visuals right here and not go through and set them up every time you can save these configurations. I put a little folder in RapidMiner to sort things on a kind of project by project basis. Just makes it really convenient so you don’t have to spend time making these things over and over again. I think two other things I’d like to show you, because this is kind of cool, at least I thought it was. Let’s do classification– yeah, right here, okay. I’m going to start running this thing right now. All right. So you see this Explain Predictions operator right here? What we’re interested in knowing is, given the prediction data that’s coming in, the data that we’re applying the model to, we want to see what the levels of importance are for all the features as we walk this asset. And if you’ve worked with the Explain Predictions operator you’ll see it’s kind of a tall table. What we have to do is turn this to the side and look at the data kind of as it falls linearly along the pipeline. So it’s 86%.
20:41 So there’s some work that needs to be done in relating the IDs of the importance values with the IDs of the example sets that you’re working with so that you can get something that looks like this. Okay, and I’ll zoom in. All right. So probably too much detail to look at, but this is actually kind of useful to the people we work with, which is just showing walking along the linear asset, it’s been dynamically segmented. And at the bottom here, you can see the color coding for the levels of importance of the different data, how the features actually influence the predictions. And the predictions are up here. They’re not showing right now, but the predictions would be up at the top, and would show you the levels of confidence or the regressed value. Now, you could zoom in and you can see, “Well, what is influencing that prediction?” So it’s providing some transparency into what’s really driving the results, okay? And this stuff, Jeff and I worked on it at RapidMiner. All right.
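The "turn the tall table to the side" step amounts to a pivot keyed on segment ID, followed by ordering the rows by position along the asset. The sketch below assumes a tall table of (segment ID, attribute, importance) rows and a lookup of segment start positions; that layout, the attribute names and the values are assumptions for illustration, not the actual output format of the Explain Predictions operator.

```python
from collections import defaultdict


def pivot_importances(tall_rows, segment_start):
    """Turn (segment_id, attribute, importance) rows into one row per segment.

    Rows are ordered by start position along the asset so they can be plotted
    as stacked bands beneath the predictions.
    """
    wide = defaultdict(dict)
    for segment_id, attribute, importance in tall_rows:
        wide[segment_id][attribute] = importance
    ordered_ids = sorted(wide, key=lambda seg: segment_start[seg])
    return [
        {"segment_id": seg, "start": segment_start[seg], **wide[seg]}
        for seg in ordered_ids
    ]


# Hypothetical tall table and segment positions (miles from the start of the line).
tall = [
    ("seg-002", "soil_resistivity", 0.10),
    ("seg-001", "soil_resistivity", 0.40),
    ("seg-001", "coating_age", 0.25),
    ("seg-002", "coating_age", 0.55),
]
segment_start = {"seg-001": 0.0, "seg-002": 2.5}
wide_rows = pivot_importances(tall, segment_start)
for row in wide_rows:
    print(row)
```

Each wide row then lines up with one dynamically segmented example, which is what makes it possible to show the importance bands directly under the prediction for the same stretch of pipe.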