Machine learning is one of the biggest value drivers in modern analytics. Beyond popular news stories like Google winning in Go and StarCraft, machine learning already powers a ton of use cases in our daily lives.
When you compare the value of ML in businesses, a major difference to Business Intelligence applications becomes apparent. While BI focuses on human understanding of the business and the actions derived from it, ML focuses on machine understanding. The value of ML is not generated just by an initial boost of insights, but by deploying those insights continuously. To harvest the fruits of your analytics initiatives, you need to put your model into production. Along the way you will naturally encounter a few problems. In this article, we will walk through the most important ones and how the combination of Talend and RapidMiner helps to overcome them.
From Static to Dynamic
Model fitting is usually done on a static data set. The data was captured once and put into a convenient format such as a (No)SQL database or a crude Excel export. These tables are then consumed in RapidMiner to generate models. When we put models into production, we no longer have static data. We move into a fluid world of data in motion and need to handle the stream.
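The shift from static tables to data in motion can be illustrated with a minimal sketch. The `score` function below is a hypothetical stand-in for a real model, and the in-memory list stands in for whatever actually feeds the stream in production:

```python
# Minimal sketch of batch-trained model, stream-applied scoring.
# score() is a placeholder for a real trained model; in production the
# records would arrive from a message queue or pipeline, not a list.
def score(record: dict) -> float:
    return 1.0 if record.get("amount", 0) > 100 else 0.0

def stream(records):
    for record in records:  # replace with a consumer of a live feed
        yield record, score(record)

incoming = [{"amount": 50}, {"amount": 250}]
for record, s in stream(incoming):
    print(record, s)
```

The point of the generator is that nothing assumes the data is finite or fully available up front, which is exactly what changes when a model leaves the lab.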
Is the data available at all?
The first problem we face is – do we have consistent access to the data? It's surprising to me how often the data extracts data scientists work on are created in a very manual fashion. The most extreme case is someone running across campus with a USB stick to hand over the data. This might be okay for generating a prototype, but it's not feasible in deployment. In deployment you need to ensure that all data is available in a programmatic and trusted manner.
With the RapidMiner integration into Talend Pipelines, you can reuse your existing data merging processes and natively consume the insights generated by machine learning within them.
Having access to the data is one thing, but you also need to have the data at the right time. I am currently working on a predictive maintenance problem where we want to act in near real time (<1 min). The databases, however, only get updated every six hours. That makes this data set impossible to use for the application.
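A simple guard against this kind of mismatch is to check data freshness before scoring. This is a hypothetical sketch – the timestamp field and the one-minute tolerance are assumptions drawn from the use case above, not part of any specific product:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness guard: refuse to score records whose source
# data is older than the use case tolerates (here: 1 minute).
MAX_AGE = timedelta(minutes=1)

def is_fresh(record_updated_at: datetime, now: datetime) -> bool:
    """Return True if the record is recent enough for near-real-time scoring."""
    return now - record_updated_at <= MAX_AGE

now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(is_fresh(now - timedelta(seconds=30), now))  # recent record: True
print(is_fresh(now - timedelta(hours=6), now))     # 6-hour batch update: False
```

Failing fast on stale input is usually better than silently producing a prediction the business will act on a shift too late.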
A similar problem arises when the measurement of a property is key for the prediction but the measurement itself takes hours or days – for example, when breeding bacteria. If the data is not available at prediction time, you are not allowed to use it. Data scientists can work around this tough problem, for example by predicting the measurement first, but they need to be aware of the issue. Deeply integrated platforms like RapidMiner and Talend are key to success here. They allow the whole team – business experts, data engineers and data scientists – to work with each other and understand the others' needs and thoughts. Having an easy-to-access software package like RapidMiner Studio or Talend Data Integration Studio is essential. Talking together on a level playing field allows us to identify problems early and either prevent them – by adapting the update cycle – or work around them – by using a proxy.
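One way to enforce this rule programmatically is to maintain an explicit list of columns that will not exist at prediction time and strip them before training. The column names here are hypothetical, chosen to match the bacteria example above:

```python
import pandas as pd

# Hypothetical leakage guard: features whose values only exist *after*
# the prediction moment (e.g. a lab culture that takes days to grow)
# must not be used to train a model that scores in real time.
UNAVAILABLE_AT_PREDICTION = {"bacteria_culture_result"}

def training_view(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns that will not be available when the model is applied."""
    return df.drop(columns=list(UNAVAILABLE_AT_PREDICTION & set(df.columns)))

df = pd.DataFrame({
    "temperature": [36.5, 38.2],
    "bacteria_culture_result": ["neg", "pos"],  # known only days later
    "label": [0, 1],
})
print(list(training_view(df).columns))  # ['temperature', 'label']
```

Keeping the exclusion list in one shared place means data engineers and data scientists argue about it once, not per model.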
Having data is a great first step in data science – but the data also needs to be consistent and of high quality. Data scientists are used to working with dirty data. Despite all our efforts, we will always need to cope with this nuisance. A colleague of mine recently put it like this:
Without clean data you are just another source of noise.
A lot of the dirtiness in data is common across analytics scenarios, not just machine learning, but I would like to highlight a few things here.
Dirt as a feature
Let's say we have not one entry per customer in our CRM system but five. A common response is to apply Master Data Management to find a gold-standard or merged record. In many cases this is useful for machine learning – everybody loves a unique identifier. On the other hand, knowing that there are multiple records might itself be valuable. If someone contacted our customer-facing team over various (sadly un-synced) channels, that might indicate something. In customer satisfaction scoring, the number of different records might be a proxy for how hard he or she tried to engage with us.
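The duplicate-count idea can be sketched in a few lines. The column names and sample records are hypothetical; the point is to capture the record count as a feature before deduplication throws it away:

```python
import pandas as pd

# Hypothetical sketch: before merging CRM duplicates into one golden
# record, keep the number of records per customer as a feature.
crm = pd.DataFrame({
    "customer_email": ["a@x.com", "a@x.com", "a@x.com", "b@x.com"],
    "channel": ["phone", "web", "email", "web"],
})

record_counts = (
    crm.groupby("customer_email")
       .size()
       .rename("n_crm_records")  # proxy for how hard the customer tried to reach us
       .reset_index()
)
print(record_counts)
```

This table can then be joined back onto the deduplicated master record, so the model sees both the clean identifier and the signal hidden in the dirt.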
This demonstrates something important: data scientists sometimes use data differently. We need access one step deeper than other consumers of the data, and the combination of Talend and RapidMiner enables us to do this.
Consistency in Model Deployment
Speaking of data quality, we also need to look at the small things that can break our model. When deploying a model into real life, we assume that the data we apply it to is representative of the data we trained it on. I recently worked on a project where we used a person's title as an indicator. We had two categories in our training set, Mr/Mrs and Dr. For some reason, our deployment pipeline delivered the value Dr. – with a trailing period – in the title field. In our case, this meant the algorithm was unable to score the item and threw an error. A good data pipeline takes care of such problems as early as possible.
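A defensive cleanup step for this exact failure mode might look like the following sketch. The known-title set and the "Unknown" fallback are assumptions for illustration, not the actual project's logic:

```python
# Hypothetical normalization step, run as early in the pipeline as
# possible: map "Dr." and "Dr" to the same category, and send anything
# truly unseen to a sentinel value the model was trained to handle.
KNOWN_TITLES = {"Mr", "Mrs", "Dr"}

def normalize_title(raw: str) -> str:
    title = raw.strip().rstrip(".")  # "Dr." -> "Dr"
    return title if title in KNOWN_TITLES else "Unknown"

print(normalize_title("Dr."))    # Dr
print(normalize_title("Mrs"))    # Mrs
print(normalize_title("Prof."))  # Unknown
```

Whether you fall back to a sentinel or reject the record outright is a design choice; either is better than an unhandled error at scoring time.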
So we need a working, proven pipeline to get this done. If an inconsistency does happen, we need to be able to track down the change in the source systems. Talend's Data Lineage components are crucial for preventing this from happening, and an integrated solution is an obvious choice.
In this article we covered why a good data pipeline is essential for the success of a machine learning project, along with common issues you will encounter when building and, most importantly, deploying machine learning models in production.
It is important to understand that data engineering and data science go hand in hand; they are not two separate subjects that can be split across teams and technologies. The partnership and integration between the market leaders Talend and RapidMiner lets you use the best tools in both realms while minimizing friction.
To learn more about this partnership, register for our upcoming joint webinar Tuesday, April 30.