In a recent webinar, Sebastian Land of Old World Computing explores how RapidMiner’s predictive analytics platform lets you build predictive models that prevent and reduce IT disruption using text analytics and root-cause analysis.
Over the last 20 years, IT infrastructure has become very complex and has accumulated many dependencies. The problem is not the complexity itself, but that a single element of the infrastructure can cause a catastrophic failure. Thanks to built-in redundancy, most of these failures will not directly affect the quality of service, but some may. We want to detect when the quality of service is affected before a customer complains about it, not afterwards when it’s too late. Reducing IT disruption is paramount to keeping customers happy and coming back.
But how do we detect a problem in our infrastructure before customers complain? A lot of hidden details are found in unstructured sources like log files. If the number of errors in our logs grows suddenly, something is probably wrong. Unfortunately, the relevant errors are hard to spot because they are buried in the sheer volume of log entries.
Using Text Analytics to Detect Errors
Machine learning can be used to reveal very complex patterns inside the data. One way to easily distinguish patterns is by using customer feedback as the data source. It is very likely that a customer will complain if their quality of service is affected. For example, a flood of tweets may follow closely after a pattern of errors in a log file. We can estimate how long it takes a customer to complain, let’s say a day. We can then mark the stretch of the log leading up to each complaint as an area with possible errors. Afterwards, we can compare that to another time frame where there have been no complaints. This of course assumes that if there are no complaints, there are no errors.
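The marking step described above can be sketched in a few lines of Python. This is only an illustration, not the workflow shown in the webinar: the function name `label_candidates`, the one-day `lag`, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def label_candidates(log_times, complaint_times, lag=timedelta(days=1)):
    """Mark each log timestamp that falls within `lag` before any complaint.

    Hypothetical helper: `lag` is the assumed delay between a failure
    and the first customer complaint (one day in the text above).
    """
    labels = []
    for t in log_times:
        # A log entry is a candidate if some complaint follows it within `lag`.
        is_candidate = any(c - lag <= t <= c for c in complaint_times)
        labels.append(is_candidate)
    return labels

# Made-up sample data: three log entries and one complaint.
log_times = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 2, 15, 0),
    datetime(2024, 1, 5, 8, 0),
]
complaint_times = [datetime(2024, 1, 3, 10, 0)]

labels = label_candidates(log_times, complaint_times)
```

Only the second entry lies in the 24 hours before the complaint, so it becomes a candidate; the other two fall into the control group.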
However, what if the failure occurs while the customer is asleep or not paying attention? This would mean that the original control group we used is likely to contain log errors. To combat this, we have to use a slightly different approach: an iterative process. At some point in time there is a customer complaint. In the candidate zone around it, the probability of finding error log entries is higher than anywhere else.
We use these candidates and a control group to build a machine learning model that can distinguish between the two. Scoring the same log again with this model delivers predictions. If we keep only the predictions with the highest confidence as new candidates, we get a better view of the real errors. We then train another machine learning model on those, iterating and predicting until the results are stable.
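To make the loop concrete, here is a minimal sketch in plain Python. It is not the RapidMiner process from the webinar: the smoothed token log-odds scorer is a stand-in for whatever model you would actually train, and the names `train`, `score`, `refine`, and `top_k`, as well as the sample log lines, are assumptions for illustration.

```python
import math
from collections import Counter

def train(lines, labels):
    """Fit smoothed per-token log-odds (a tiny naive-Bayes-style stand-in)."""
    pos, neg = Counter(), Counter()
    for line, is_candidate in zip(lines, labels):
        (pos if is_candidate else neg).update(line.lower().split())
    vocab = set(pos) | set(neg)
    # Add-one smoothing so unseen tokens never give a zero probability.
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    return {
        tok: math.log((pos[tok] + 1) / pos_total)
        - math.log((neg[tok] + 1) / neg_total)
        for tok in vocab
    }

def score(model, line):
    """Average token log-odds; higher means more candidate-like."""
    toks = line.lower().split()
    return sum(model.get(t, 0.0) for t in toks) / max(len(toks), 1)

def refine(lines, labels, top_k, max_rounds=10):
    """Retrain and relabel until the candidate set stops changing."""
    labels = list(labels)
    for _ in range(max_rounds):
        model = train(lines, labels)
        scores = [score(model, line) for line in lines]
        # Keep only the top_k highest-scoring lines as the new candidates.
        ranked = sorted(range(len(lines)), key=lambda i: scores[i], reverse=True)
        top = set(ranked[:top_k])
        new_labels = [i in top for i in range(len(lines))]
        if new_labels == labels:  # stable: no candidate changed this round
            break
        labels = new_labels
    return labels

# Made-up log: the complaint window (lines 1 and 2) holds one real error
# and one harmless entry, while line 3 is an error nobody complained about.
lines = [
    "INFO request served",
    "ERROR db connection timeout",
    "INFO request served",
    "ERROR db connection timeout",
    "INFO cache refreshed",
    "INFO request served",
]
initial = [False, True, True, False, False, False]
result = refine(lines, initial, top_k=2)
```

In this toy run the harmless entry inside the complaint window drops out after the first round, and the unnoticed error outside the window is picked up, which is the effect the iteration is meant to achieve. On real logs, the choice of `top_k` and of the stopping rule would need tuning.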