Using RapidMiner to Support Capital & Maintenance Decision Making for Linear and Networked Assets

Michael Gloven, Managing Partner, EIS

This session outlines how to use RapidMiner to support investment & maintenance decision-making for linear and networked assets such as pipelines, roads, electric transmission lines, water distribution systems, etc. Join as Michael demonstrates how machine learning can predict undesirable events and monetized risk for linear and networked assets.

These results may then be used to support specific risk mitigation strategies and budget plans. The objective is to put in place a more strategic data-driven approach to resource decision-making, which should improve the risk profile and profitability of the asset owner. Key to the presentation is demonstrating important considerations for the application of machine learning to these types of assets.

00:04 [music] Okay. So in less than 15 minutes, I'll guide you through my work, my data, my RapidMiner solutions, and the scientific output. The face of Maria, her virtual name, speaks a thousand words. There's something not right. She's not healthy. Even a layperson notices that. But what is actually wrong? What we see on her face is a representation of an underlying disease without any label yet. We can, at best, make some good guesses. But in order to fully diagnostically investigate her, several additional tests are required. As you all know, treating chronic diseases requires multiple visits over an extended period of time. In contrast, split-second decisions are sometimes needed to differentiate life from death in situations with more uncertainty, without a diagnosis, without a treatment plan and outcome estimation. This is the real-world environment the physician-- so myself-- faces on a daily basis.

01:21 Luckily, the human mind excels at estimating the motion and interaction of objects in the physical world, and inferring cause and effect from a limited number of examples, and extrapolating those examples to determine plans of action to cover previously unencountered circumstances. Computers, on the other hand, are not good at coming to decisions. And classical approaches to artificial intelligence do not easily capture the idea of a good enough solution, which is, in the majority of decisions, needed. But the data generated and required to make a correct diagnosis for a patient at a certain point in time and dynamically adapt to treatment in each new situation in the evolution of one or more diseases in the patient, well, the amount of data exceeds the computational power and memory of a human brain. Also mine.

02:29 For most of human history, the practice of medicine-- and maybe that's a surprise, but the practice of medicine was predominantly heuristic and anecdotal. Traditionally, quantitative patient data was relatively sparse. Decision-making was based on clinical impression, and outcomes were difficult to relate with much certainty to the quality of the decision made. The transition, however, to evidence-based medicine, to the integration of AI and ML, is quite an endeavor in the medical sector. One can even question what is actually meant by evidence-based medicine, taking into consideration all the steps needed in machine learning to reduce bias, validate correctly, etc. In one of the papers I published, I looked for answers to whether RCTs, randomized clinical trials, are still the gold standard. And I put the gold, the G, between brackets.

03:34 The proportion of patients with two or more medical conditions simultaneously is, however, rising steadily. Some of these multimorbidity clusters will occur by chance alone. Many, however-- and Martin can assure that or admit that-- will be non-random because of common genetic, behavioral, or environmental pathways of disease. Identifying these clusters is a priority and will help us to more systematically approach and treat multimorbidity. Clinical trials that resulted in a common standard of good clinical practice often exclude patients with more than one clinical condition. Qualitative vertical integration exists from bench to bedside for a single condition or disease. But there's little or no horizontal integration between diseases that often coexist. Additionally, as bartenders will tell you, "Never mix, never worry." Many patients take more than one medication, and not everyone reacts the same way to the various combinations of drugs. So predicting drug interactions when you take more than two drugs is notoriously difficult. This high-dimensional context will require an intellectual shift in research, training, practice, and virtually every discipline.

05:08 In the next slides, I will be covering a variety of use cases-- complex ones, but also low-hanging fruit-- that I used RapidMiner for in the last decade. It's essential to understand that many companies mentioning biomed or health on their website frequently provide solutions only related to logistical problems, hospital admissions, or optimization of patient flows. As a simple example, we used Auto Model for our emergency department. But in contrast, only a few enterprises have the guts or the resources to really dive into medical problems and provide solutions resulting in better care and outcomes. There are several explanations for that, with data privacy-- and thus the lack of proper data-- and the complexity of medical science repeatedly reported. The latter can only be solved by reducing the gap between the data science community and the medical community. This will also be my take-home message. Physicians might be weird people, but it's essential to get them actively on board, on your team, when entering biomed-related domains.

06:33 Luckily, over the last decades, data has become available. We have algorithms to de-identify medical records, and even with the European GDPR regulations and HIPAA rules, secondary analysis of electronic health records can be done-- and can be seen, in fact, as analyzing the historical footprint of how medicine is performed. However, providing solutions able to demonstrate that a medical action is causing an effect-- thus actually integrating causality in a model-- is at any moment quite a challenge. One of the reasons is the fact that the data, the patient's state, changes when your intervention has had any effect. Moreover, the genotype/phenotype divide has limited the practical value of genomic science in treating disease, since people with the same genetic mutations experience different symptoms for the same disease and, in contrast, in some instances experience no symptoms at all.

07:52 The medical literature has been quick to embrace big data, but data privacy laws, competition based on profit motives, a culture wary of innovation and collaboration, and disparate data representations continue to hinder efforts to truly benefit from the fruits of the big data revolution. An interesting database we have been using in several papers is the MIMIC database, originally generated here in Boston in a joint effort between the Beth Israel Deaconess Hospital and MIT. It is currently available for use on two platforms, Google and Amazon. Our first access and research on the MIMIC data set dates back to 2015, using RapidMiner. I was invited to present this at MIT, and as a physician, I felt honored to present to an ICT community. And recently, we could connect to the BigQuery cloud version of MIMIC-III. Now, MIMIC-IV is coming within a few weeks, as demonstrated in the following slides. Excuse me. Without any doubt, more to follow in the near future regarding this access.

09:32 As a RapidMiner community member, I feel responsible to spread the word about what RapidMiner is capable of-- for example, by using ensemble methods on the MIMIC database. The paper was quite popular. I don't have as many points as the number one here in the community, but my paper was quite successful, with more than 8,000 views, and I find that very, very comforting-- that gave me a lot of positive influence. Let's see. Besides data from electronic health records, tons of data are becoming available. And initiatives such as the SEBI portal provide interesting tools with web APIs and interactive dashboards, where genomics can be integrated in your RapidMiner research. I must admit, the dashboards they provide are state-of-the-art and are very fast. I really invite you to take a look at them. We also tested whether we could access these web APIs from RapidMiner, and that was no problem using the Read URL operator. Next, the data could be easily extracted, and analytical processes could follow. However, the dimensionality when you add genomics-- the omics environment-- to the clinical environment is enormous. And the changes in tumor cells, resistance to chemotherapy changing over time, make the Auto Model simulator not suited for this yet, although there's a significant need to use simulators in order to predict therapeutic impact for "a patient like mine." Meaning patient similarity is a topic which will only gain importance in the near future.
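The Read URL step mentioned here boils down to fetching a web API response and extracting fields for analysis. As a rough sketch of the same pattern outside RapidMiner (the payload shape and field names below are invented stand-ins for a real genomics API response, and the fetch helper is shown but not called):

```python
import json
from urllib.request import urlopen

def fetch_json(url):
    """Fetch a JSON document from a web API; RapidMiner's Read URL
    operator performs this retrieval step inside a process."""
    with urlopen(url) as resp:
        return json.load(resp)

def to_rows(payload, fields):
    """Flatten a list-of-dicts API payload into rows, keeping only
    the chosen fields, ready for tabular analysis."""
    return [[rec.get(f) for f in fields] for rec in payload]

# Hypothetical payload shaped like a genomics API response.
sample = json.loads('[{"gene": "TP53", "mutations": 120},'
                    ' {"gene": "KRAS", "mutations": 85}]')
rows = to_rows(sample, ["gene", "mutations"])
```

Once flattened like this, the rows can feed any downstream analytical process, which mirrors the "data could be easily extracted, and analytical processes could follow" step in the talk.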

11:53 With the growing need for precision medicine, the next generation of electronic health records will need to support dynamic clinical data mining. In particular, DCDM-- so, dynamic clinical data mining-- would enable examination of any single medical encounter within the context of similar encounters, where similarity is defined by some metric for grouping, which is actually quite unknown. To illustrate the complexity of precision medicine, a recent paper suggests that the microbiome-- so the germs in your belly-- governs the cancer-immune set point for cancer-bearing individuals, and that manipulating the gut ecosystem to circumvent primary resistance to therapy may become feasible.

12:55 Explainability and interpretability form a significant barrier to implementing AI and ML tools when physicians need to be convinced that the investment in new technology leads to a higher standard of care, as we experienced using papers like these in meetings with my colleagues. There's a reason to be optimistic, however. Across the globe, governments are partnering with universities and industry to build machine learning roadmaps. Programs and events bring together clinicians, computer scientists, and engineers to create collaborative ecosystems that can leverage the power of data science. Explainability and interpretability become even more of an issue in, as just one example, a paper where they didn't use RapidMiner but an ensemble method with LSTMs, to predict and to show the superiority of a framework able to handle the diversity of clinical data. This is becoming very difficult to explain to physicians without any educational background in AI or machine learning. However, in order to encourage smooth adoption of AI/ML, I recently uploaded a webinar where I explained how Auto Model could be used on medical data. I hope to continue doing this in the future. And finally, thank you for listening, and please, contact me. Collaborate, collaborate, collaborate. Any questions? [music]


00:03 [music] So thanks for coming to this session. I don't know if I can meet the bar from the last sessions here; they were just tremendous, I thought. I am not a data scientist, I'm not a developer; I'm more of a domain expert, and I'm going to show you a use case. It's going to be kind of ad hoc, too, as far as showing RapidMiner, so it'll be a little chaotic. Hopefully, you'll get something out of this thing. To get started here, the topic is machine learning-based risk management for linear and networked assets. And what we're trying to do is use RapidMiner as a tool, and a set of practices, to try to figure out where these failures might happen, okay? This is a linear asset, a natural gas pipeline. A failure in Northern California cost this company about $3 billion, okay? I'm going to walk you through an example of how we can measure these things. Here's a faulty electrical transmission line. You may have heard about the wildfires also in California. These are the kind of things we're looking for with machine learning.

01:06 What do pipeline failures look like? You may have seen this picture in the news before: this is a fire truck that was responding to a main failure and just happened to fall into the sinkhole that the main created. It's interesting to note, if you care, that water pipelines have about one failure every four miles per year. So every four miles of water pipeline fails every year. And one reason the failure rate is so high is that it's a very reactive, fail-and-fix type of maintenance mentality. But what we're trying to do is bring in some machine learning so we can proactively predict where these high-susceptibility areas are, so these things don't happen. And then here's a sinkhole for a wastewater line. So just a giant hole in the middle of the road, and you may have seen these before.

01:55 So I want to frame up this use case and talk about risk first, before we dive into the hands-on here. I'll show you what's going on underneath it. But if you're not familiar with risk, what we're doing is we're monetizing risk by trying to determine the number of these unwanted events that may happen. So: the frequency of unwanted events-- trying to count these events, trying to determine their susceptibility. And we use both classification and regression to come up with these numbers. Once we get those, we can multiply that by an expected range of consequences. So once you monetize the risk, now you've got some visibility-- business visibility-- into what's really happening with these assets. And you can start to nominate your proactive-type measures to mitigate the risk.
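The monetization arithmetic just described is simple enough to sketch: a predicted event rate multiplied by a consequence range gives an expected annual loss range. The numbers below are hypothetical, not from the talk:

```python
def monetized_risk(event_rate, consequence_low, consequence_high):
    """Expected annual loss range: predicted unwanted events per year
    times a dollar range of consequences per event."""
    return event_rate * consequence_low, event_rate * consequence_high

# Hypothetical pipe segment: 0.02 predicted failures per year,
# with $1M to $5M of consequences per failure.
low, high = monetized_risk(0.02, 1_000_000, 5_000_000)
```

A segment like this would carry an expected loss of tens of thousands of dollars per year, which is the kind of monetized number that can be compared against an action threshold and a mitigation budget.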

02:44 This is Tableau here. So this is the first kind of how-to-- this is a user session, so it's more the technical stuff I want to show you. We do use Tableau. Tableau is really easy to work with, if you're not working with it already. We simply write results out of RapidMiner into a SQL database and show them here, and then we use some free imagery from Mapbox, which is really cool. It gets down to one- to three-meter resolution, so you can really hone in on what's happening with the system. So this is showing a natural gas pipeline system, about 130 miles, and we're going to zoom in on this little area here. And this is an output out of RapidMiner. So I thought I'd start with the end first, and I'll work through how we got here in the time that I have.

03:29 So what this is showing is, for a linear or a distribution-type network, what the risk values might look like as you walk across it. So look at this: it could be a road, it could be an electrical transmission line, could be a water line, whatever you choose. And each one of the spikes here represents a net present value of risk, and risk is calculated by first determining, through machine learning, these unwanted event rates, and then multiplying them by the consequence to come up with the risk value-- a monetized risk value. Strategically, what we do is define what's called an action threshold. It's very difficult to figure out; that's a whole other topic. But this level here, where you see that red line-- anything above that line might tell you that that's where we want to address our spend. How do we want to mitigate risk? The risk above that line is not tolerable. So we define that as what we call an area of interest. And this is a really cool thing that RapidMiner does that I'll show you: we use the Explain Predictions operator to figure out what is driving that risk. So what you see here is this area of interest, and you may not be able to read the writing at the bottom, but each of these bands, each of these colors, is an influencer-- a level of importance to the risk values that we just looked at. So you can see how this can guide us towards what kind of mitigations we should propose, okay?

05:01 All right, then the last slide here is: once we've learned a model with all the observations of the levels of failures that we've had, we can then take that model and push scenarios through it-- kind of like a prescriptive analytics approach. What these different colors are showing are the different types of projects, like A through D-- not very descriptive here-- but showing you which projects can drop us below the action threshold, okay? So it's very much around the business of: we have this risk, this profile-- what's acceptable? What's not acceptable? How do we use machine learning to drive down that risk to some level of acceptability? That was a whole mouthful of stuff, but that's framing up the use case. That's where we're at.

05:46 Okay, for linear and networked assets, we have to deal with this thing called dynamic segmentation. Has anybody heard of dynamic segmentation before? Okay, so we've got a few. So whenever you've got assets out in the public domain that are spatial, the asset may lie on top of different attribute layers that are spatial layers. And we have to have a way of creating examples, okay? And we do that through something called dynamic segmentation. So you can see at the bottom, each one of these layers here may have different information, but we have to figure out what information belongs to a specific piece of that asset, because these examples are not like the Titanic. They're not people, they're not machinery, like pumps or compressors. How do you define a 100-mile piece of pipe and break it down into smaller parts to work with? That's what we're doing here. So we use this thing called dynamic segmentation, which happens to be a tool that we've developed.
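Since the speaker's segmentation tool is proprietary, here is only a minimal sketch of the idea, with invented layer names and values: every boundary in any attribute layer becomes a segment break, and each resulting segment inherits the attribute values covering it, producing one "example" per segment.

```python
def dynamic_segmentation(length, layers):
    """Split a linear asset into segments wherever any attribute layer
    changes value.  `layers` maps a layer name to (start, end, value)
    intervals along the asset.  Returns (start, end, attrs) tuples,
    the examples that would feed a model."""
    # Every interval boundary from every layer becomes a segment break.
    cuts = {0.0, float(length)}
    for intervals in layers.values():
        for start, end, _ in intervals:
            cuts.update((float(start), float(end)))
    cuts = sorted(c for c in cuts if 0.0 <= c <= length)

    segments = []
    for a, b in zip(cuts, cuts[1:]):
        mid = (a + b) / 2  # sample each layer at the segment midpoint
        attrs = {name: next((v for s, e, v in intervals if s <= mid < e), None)
                 for name, intervals in layers.items()}
        segments.append((a, b, attrs))
    return segments

# Invented example: a 10-mile pipe with two overlapping attribute layers.
layers = {
    "coating": [(0, 6, "FBE"), (6, 10, "tape")],
    "soil":    [(0, 4, "clay"), (4, 10, "sand")],
}
segs = dynamic_segmentation(10, layers)
```

With those two layers the 10-mile pipe breaks into three segments (0-4, 4-6, 6-10 miles), each carrying a consistent combination of coating and soil, which is exactly the kind of record set the talk describes building.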

06:46 And this kind of gets us going: "Okay, now, how do we figure out what attribute layers to use?" And I put this little link down here-- I think Fabian might be here-- but there's a really good couple of posts on feature weighting and feature engineering. And I'm going to show you what that looks like. I borrowed your processes and kind of fixed them my way a little bit, but this is a way to now take all these different attribute layers, which are somewhat driven by the domain experts, and figure out, "Well, what data is really important that we need to work with?" Okay, so that is that. Let's get into RapidMiner.


07:37 So step one-- it's an iterative process-- is we'll sit with the domain experts regarding what kind of threat, or what kind of target of interest, we're interested in modeling. And I failed to mention it, but let's say we're looking at degradation, like corrosion on a steel asset that's sitting in the ground. We would sit down with the domain experts and come up with their suggestions on the data that needs to be collected, identify where we can get that data, and then go through that dynamic segmentation process to create the record sets. Once we collect that data-- this is an iterative process-- we will create a data set, and then we get to here, okay: we've created an example set based on dynamic segmentation. If I run this, this is what it looks like. Okay, so in this case we have only roughly 800-plus examples. This is showing us our label-- the green column is saying that, yeah, there was some observation of degradation on that piece of pipe-- and then these are the selected attributes that the domain experts came up with. And the first thing that we want to find out is: how do these correlate, or how do they relate, to the target of interest? And there are different ways you can figure this out. One is through a weighting-type method, where you're looking at a particular feature and seeing how it relates to the target. Then there are wrapper methods you can use, where you can integrate the performance objective into the analysis.

09:11 But if we run this thing-- let me show you the process first; the process is pretty straightforward. We're taking the data, we do some preparation on it-- typically normalize it-- and we also put a random variable into the process, because we want to see how a random variable behaves alongside the selected data, and that's always quite interesting. And then we loop through all these different weighting operators. So if you go to the weighting folder, I think, you'll see that there are probably 12 or 20 different techniques to use to figure out what features are most important to the target of interest. So I ran this process pretty quickly with those 800-plus records, with the given target of interest. And we get this in front of our domain experts to say, "Okay, you've given us all these observations on where the problems are, along with the data. Does this make sense as far as the data driving the prediction of interest?" And so we iterate through this, because most of the time we're like, "Well, why is this piece of data so high up here?" So what these bars are: these are just normalized values for the different weights, given these different techniques here.
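The random-variable trick can be sketched as follows. The feature names, the synthetic label, and the simple correlation-based weight are all invented for illustration (RapidMiner's weighting operators implement many other schemes); the point is that any real feature scoring at or below the injected noise is suspect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 800  # roughly the example-set size mentioned in the talk

# Invented segment attributes; the label loosely depends on the first two.
age = rng.uniform(0, 60, n)
soil_corrosivity = rng.uniform(0, 1, n)
wall_thickness = rng.uniform(5, 12, n)
label = (0.04 * age + 2.0 * soil_corrosivity
         + rng.normal(0, 0.5, n) > 2.2).astype(float)

features = {
    "age": age,
    "soil_corrosivity": soil_corrosivity,
    "wall_thickness": wall_thickness,
    "random_probe": rng.uniform(0, 1, n),  # deliberately injected noise
}

# One simple weighting scheme: absolute correlation with the label,
# normalized so the strongest feature gets weight 1, like the bar chart.
weights = {k: abs(np.corrcoef(v, label)[0, 1]) for k, v in features.items()}
top = max(weights.values())
weights = {k: w / top for k, w in weights.items()}
```

On this synthetic data the random probe lands near the bottom of the ranking, which is what you would show domain experts as the sanity check: any attribute layer scoring in the probe's range is probably not driving the target.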

10:27 So you’re probably familiar with like Gini Index and information gain and et cetera, et cetera. We just picked a handful of those that we thought were important, and we want to see how they all relate together and see if the right data is popping up on top. So it’s just a process we go through. And this is something, Scott, I could share, if you want, on the repository, it’s kind of useful. This is a classification, we can also go through the same process for regression. So, again, it’s just a looping process where we’re just grabbing the data. In this case, since we’re doing regression, our target of interest is– let me run it, I’ll show you. Our target, in this case, is not true-false, it’s a numerical value. And then there’s different techniques. See, the methods here are a little bit different than you would have with classification, so the weighting just depends on whether you’re doing classification or regression. All right. So that’s that.

11:32 Now we go through this iterative process-- collect some data, condition it, kind of feel good about "Okay, we think it's going in the right direction"-- then we pop over to Auto Model. And the question that we want to answer is, "What's the best method?" So I'll pick one of these here, we'll do classification-- okay, you've already seen all this. Pick our prediction-- okay, it's unbalanced. We can deal with that later. And normally what we do is, even though we have this red-light, green-light scoring system for the data condition, we pick everything, because the domain experts have already told us, "Hey, all this stuff's important, don't just arbitrarily remove stuff." So for the first run, even if it's a red light or yellow light, we'll still include it in the analysis. There are a couple of reference variables that need to go away. Okay, and then we just run it. And I know it's going to be faster, too, with RapidMiner Go. Okay.

12:53 Run that on RapidMiner Go.

12:54 Yeah. So we use this as a first cut to kind of see where we are with the data, where we are with the best method. And what's interesting here is we don't necessarily just go for the best accuracy; we put this in the context of what the business needs. And in the case of asset failures, sometimes we're more concerned with sensitivity. And if you're not familiar with sensitivity, that's more about misses. Because if we have a model in place-- and remember the first slide I showed you with the natural gas exploding pipeline that cost $3 billion? Yeah, that's a lot of money. That's a lot of impact. So we want to focus on that. We do use the cost matrix, but kind of outside of Auto Model right now. But that's a consideration. A false positive doesn't cost that much money-- well, it doesn't cost anything compared to a miss. So we're always leaning the models more towards making the sensitivity high. So here you can look through and see-- let's look at accuracy. See, the accuracies are all pretty much the same going across, so we don't get too excited about this. But when we go to sensitivity, we'll see that, oh, the generalized linear model and deep learning seem to be the best, and our bias is to use the linear model versus deep learning, because I can actually kind of explain it to somebody. Deep learning is just a little bit more challenging.
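To make the sensitivity-versus-accuracy point concrete, here is a sketch with an invented confusion matrix for an unbalanced data set: accuracy looks healthy even while a quarter of the real failures are missed, and a rough cost comparison (using the talk's $3 billion failure as the scale of a miss, with an invented false-alarm cost) shows why misses dominate.

```python
def sensitivity_and_accuracy(tp, fn, fp, tn):
    """Sensitivity (recall) tracks misses; accuracy can look good on
    unbalanced data even when many failures are missed."""
    sensitivity = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, accuracy

# Hypothetical scoring run: 40 true failures among 1,000 segments.
sens, acc = sensitivity_and_accuracy(tp=30, fn=10, fp=50, tn=910)

# Rough cost view: a miss can mean a catastrophic failure, while a
# false positive just means an unnecessary inspection or dig.
cost_of_misses = 10 * 3_000_000_000   # 10 misses at the $3B scale from the talk
cost_of_false_alarms = 50 * 50_000    # invented per-false-alarm cost
```

Here accuracy comes out at 94% while sensitivity is only 75%, and the ten misses swamp the fifty false alarms in expected cost, which is exactly why the talk leans model selection towards high sensitivity.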

14:20 So we go through this, and then I'll show the domain experts-- because we're kind of working in a team here-- what these models look like. And if you haven't worked with this already, here's the model itself. The simulator is really kind of cool, because now you can vary the incoming features. They really like this, by the way-- domain experts like this; they want to see how the model behaves if you adjust the features, so that's all I'm doing here. In the performance predictions, there's other stuff to look at. Let's see, with the time, I think I'm going to-- I won't do regression. Yeah, let me skip ahead. I want to show you that once we come up with a suggested method-- so let's say it's GLM-- then we've incorporated learning curves as a process. And if you're not familiar with learning curves, they're really good at explaining performance, so I'll just run this and show it to you, okay.

15:33 Anybody work with learning curves? Kind of familiar, okay. What learning curves show is: on the X-axis, for a particular model, you can vary the sample size or the number of attributes-- the number of features-- or you can put hyperparameters on the X-axis, and then you show your performance on the Y-axis, okay? The blue line, in this case, is the cross-validated performance, whereas the green line is the test performance. First, what this is telling us is that when we get to about 2,000 examples, the model starts to settle out-- the performance, we're looking at R squared in this case, or explainability of the model. So we're up somewhere around 40% right here. So that's the first piece of information we get out of here: "Yeah, we probably need this much data to have a decent model." Then as you go across here and you see these lines crossing up and down like this, that is more of an indication of variance, okay? So there are two types of error: bias and variance. There are certain remedies you have if you have high variance-- I mean, what is high variance? Well, that's more of a discussion to have. But just looking at this, this doesn't look like a highly variant model in this case.
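A minimal learning-curve sketch in plain numpy, not the speaker's actual RapidMiner process (which bootstraps samples inside a loop operator): fit an ordinary least-squares model on growing sample sizes and record the R squared on the training data and on a held-out set at each size. The data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

def r_squared(y, yhat):
    """Coefficient of determination."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Synthetic segment data: a target driven by two features plus noise.
N = 4000
X = rng.normal(size=(N, 2))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(0, 1.5, N)
X_test, y_test = X[3000:], y[3000:]        # held-out slice

curve = []
for n in (50, 200, 1000, 3000):            # the curve's x-axis: sample size
    Xn, yn = X[:n], y[:n]
    A = np.column_stack([Xn, np.ones(n)])  # linear model with intercept
    coef, *_ = np.linalg.lstsq(A, yn, rcond=None)
    train_r2 = r_squared(yn, A @ coef)
    test_r2 = r_squared(y_test,
                        np.column_stack([X_test, np.ones(len(X_test))]) @ coef)
    curve.append((n, train_r2, test_r2))
```

Plotting `curve` gives the picture from the slide: the two R squared lines converge as the sample grows, the settling point tells you how much data a decent model needs, and a persistent gap between the settled R squared and the R squared the end user expects is the bias question the speaker raises next.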

17:00 Then the other question would be, "What about the bias?" And bias can be represented more as, "What is your expectation of this performance versus where you are today?" So the R squared is at 40%. If the end user expects that the explainability-- the R squared of the model-- needs to be like 90% or 95%, that tells you you've got a big gap between what your model's doing and what your expectations are, and sometimes that's referred to as bias. And with bias, there are different kinds of mitigations to fix that. Then down here, we just have another indicator of performance, RMSE. So it's another thing to look at. But we look at that right away and see, "Okay, the test and the cross-validation performance are about the same. It looks like a good model from a variance standpoint." But then again, what about from a bias standpoint? What are the expectations versus what it's actually doing? All right.

18:01 That might be an interesting feature to put into RapidMiner, right? Learning curves? I'll show you what it looks like for those curious. It's a looping operator, so we're just feeding in learning data, setting the role, conditioning the data a certain way, and then we open up the loop right here. We're bootstrapping data to create larger sample sizes, and then you'll see that we split the data, do the cross-validation and testing, and then simply do some logging of the data and write that out to an example set that we can then graph. And if you haven't done this already-- down here on the bottom left, if you can't see that-- you can save these configurations-- yes?

18:50 How are you relating the R squared to the bias?

18:55 To the bias?

18:57 Like, I don’t get that.

18:57 Yeah, to the bias. Well, there’s an expectation, I can talk to you after, too, about it, but there’s an R squared expectation. What is it? Well, for our industry, linear assets, what is the R squared? Maybe for pharmaceuticals and for the airline industry, it’s set, you kind of know what those numbers are. For our business, the business I’m in, we don’t know. So it comes back–

19:21 What is that R squared, exactly?

19:25 It's the coefficient of determination. Yeah, yeah, yeah. But if you haven't seen-- if you want to save these visuals right here and not go through and set them up every time, you can save these configurations. I put a little folder in RapidMiner to sort things on a kind of project-by-project basis. It just makes it really convenient, so you don't have to spend time making these things over and over again. I think there are two other things I'd like to show you, because this is kind of cool-- at least I thought it was. Let's do classification-- yeah, right here, okay. I'm going to start running this thing right now. All right. So you see this Explain Predictions operator right here? What we're interested in knowing is, given the prediction data that's coming in-- the data that we're applying the model to-- we want to see what the levels of importance are for all the features as we walk this asset. And if you've worked with the Explain Predictions operator, you'll see it's kind of a tall table. What we have to do is turn this on its side and look at the data as it falls linearly along the pipeline. So it's 86%.

20:41 So there's some work that needs to be done in relating the IDs of the importance values with the IDs of the example sets that you're working with, so that you can get something that looks like this. Okay, and I'll zoom in. All right. So it's probably too much detail to look at, but this is actually kind of useful to the people we work with. It's just showing, walking along the linear asset-- it's been dynamically segmented-- and at the bottom here, you can see the color coding for the levels of importance of the different data, how the features actually influence the predictions. And the predictions are up here. They're not showing right now, but the predictions would be up at the top and would show you the levels of confidence or the regressed value. Now, you can zoom in and see, "Well, what is influencing that prediction?" So it's providing some transparency into what's really driving the results, okay? And this stuff, Jeff and I worked on at RapidMiner. All right.
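The ID-matching and turn-the-tall-table-on-its-side step can be sketched in a few lines of Python; the column names and values here are invented, and inside RapidMiner a pivot-style operator would play the same role.

```python
def pivot_importances(tall_rows):
    """Turn a tall (segment_id, feature, importance) table, like the
    Explain Predictions output described in the talk, into one wide
    row per segment so importances can be plotted along the asset."""
    wide = {}
    for seg_id, feature, importance in tall_rows:
        wide.setdefault(seg_id, {})[feature] = importance
    # Sort by segment id so rows follow the pipeline's stationing order.
    return [dict(segment=s, **feats) for s, feats in sorted(wide.items())]

# Invented tall table: two segments, two features each.
tall = [
    (1, "coating", 0.42), (1, "soil", 0.31),
    (2, "coating", 0.10), (2, "soil", 0.55),
]
wide = pivot_importances(tall)
```

Each wide row then lines up with one dynamically segmented piece of the asset, which is what makes the stacked-band importance chart along the pipeline possible.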