Many manufacturing processes run in batches. Items in a batch are produced with the same production settings, so any two of them are either exact duplicates or very close to it: pseudo-duplicates. Pseudo-duplicates are items that are very similar in their attributes, and perhaps even identical in some cases.
This batching process speeds up and simplifies manufacturing workflows. However, if you’re working within such a setup and want to use data science and machine learning to inform your decision making, you need to be very careful about how you use your data.
There are two main challenges that arise when thinking about applying machine learning to batch manufacturing data. The first is making sure that you’re validating your models correctly, and the second is handling the actual model training using batched data.
Let’s take a look at each in turn and give you some ideas on how to account for these issues.
Step one — Fix your validation
The biggest problem with batched data is validation. The most common validation schemes—cross validation, split validation, and bootstrap validation—assume that each of your examples is independent from all the others.
But if you’re manufacturing in batches, this isn’t the case.
Let’s take a look at how you can still validate your model, even when your observations aren’t independent from one another. Keep in mind that validation is the main reason that data scientists believe in their analyses. That’s why it’s of paramount importance to get this right.
To do this in a batch manufacturing environment, we need to make sure that all the examples from one batch are always either on the training or the testing side of any data validation split that we do.
If we don’t do this, the model we’re training would essentially be able to cheat: it would know the correct answer to some of the test rows, because an identical (or very similar) row exists in the training data. The validation process would thus say that the model is good, but once it is in production and can’t cheat anymore, its performance would go down, possibly way down.
In RapidMiner, this can be accomplished using the “Split on batch attribute” option of the Cross Validation operator. If it is checked, the ExampleSet you provide needs to include a batch attribute for the operator to act on. This attribute manually defines what belongs to one batch (or one fold of the cross validation).
A common way to derive the BatchId attribute is depicted above. Here, the modulo function generates a BatchId from a numerical ID, which in this case is an item ID. The mod function calculates the remainder of a division, so mod(11,10) returns the remainder of dividing 11 by 10, which is the 1 you can see here.
Using this option, we prevent one occurrence of item 11 from appearing in the training set while the other appears in the test set.
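If you want to see the same idea outside RapidMiner, here is a minimal sketch in plain Python. It is an illustration, not what the operator does internally: the row layout and the `batch_id` and `batch_folds` helpers are hypothetical, and it assumes that the remainder of the item ID is a sensible batch key for your data.

```python
from collections import defaultdict

def batch_id(item_id, n_folds=10):
    """Derive a batch (fold) id from a numerical item id with the modulo function."""
    return item_id % n_folds

def batch_folds(rows, n_folds=10):
    """Yield (train, test) pairs in which a batch id never appears on both sides."""
    by_batch = defaultdict(list)
    for row in rows:
        by_batch[batch_id(row["item_id"], n_folds)].append(row)
    for held_out in sorted(by_batch):
        test = by_batch[held_out]
        train = [r for b, rs in by_batch.items() if b != held_out for r in rs]
        yield train, test

# Hypothetical data: every item id occurs twice (pseudo-duplicates).
rows = [{"item_id": i, "label": i % 3 == 0} for i in range(1, 31) for _ in range(2)]

for train, test in batch_folds(rows):
    train_ids = {r["item_id"] for r in train}
    test_ids = {r["item_id"] for r in test}
    assert not train_ids & test_ids  # no item is split across train and test
```

Both copies of item 11 get batch id 1, so they always travel together into either the training or the testing side.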
If you’re using split validation instead of cross validation, you can get around this issue by first sorting by your ID column with a Sort operator and then using the Filter Example Range operator. This splits the data set into two parts. This time, however, the split is not a random sample: because you sorted by your ID before splitting, related items are now next to each other in your data set.
After doing this, you can use the first x rows for training and the remaining y rows for testing, as the training part is now distinct from the testing part. (A small amount of spillover between the two sets at the point where you split the data can usually be ignored.)
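As a rough sketch of this sort-then-cut approach in plain Python (the `sorted_split` and `spillover_ids` helpers and the toy data are assumptions made for illustration):

```python
def sorted_split(rows, n_train, key="item_id"):
    """Sort by id, then use the first n_train rows for training and the rest for testing."""
    ordered = sorted(rows, key=lambda r: r[key])
    return ordered[:n_train], ordered[n_train:]

def spillover_ids(train, test, key="item_id"):
    """Ids that ended up on both sides of the cut."""
    return {r[key] for r in train} & {r[key] for r in test}

# Hypothetical data: 20 item ids with 5 rows each, in shuffled order.
rows = [{"item_id": i % 20} for i in range(100)]

train, test = sorted_split(rows, n_train=72)
shared = spillover_ids(train, test)  # only the id sitting right at the cut
```

Because the rows are sorted, at most one ID can straddle the cut. If even that bothers you, move the cut to the next ID boundary.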
Now that we’ve taken care of validation, we need to address the potential confusion that batched data can cause for your model.
Step two — Handling the data itself
Feeding your data to a normal learner as it is may confuse the model. Remember that, because of the batched nature of the data, your learner may encounter 10 items of which only one is “True”, even though they are all very similar to each other.
Depending on your algorithm, this may have a serious impact on how your model eventually works.
Because learners are commonly biased towards predicting the majority class in cases where they aren’t sure, putting in lots of very similar data points might bias the model in this direction.
For tree-based algorithms, you’ll have to modify your pruning strategy, as a setting like “don’t split this branch if there are fewer than four examples in it” may not be appropriate when considering batched data.
Due to this, it may make sense to treat batch data in a special fashion during your model training process. Here are three ideas for how you can overcome the challenges of batch data during model training.
1. Sampling
As is often the case in machine learning, one of the ways to tackle the problem is to use a sampling approach. We can randomly take one item from each batch and use it for model training and testing. This nicely addresses the issue of keeping the class balance in classification problems, as well as the label distribution in regression problems.
It is also simple to implement. The downside is that you lose examples and thus might also lose information. It really depends on whether or not additional information is gained by using more than one item from a given batch.
And there may be, as each individual item may have been created with the same materials while only the production settings differ. But there also might not be anything to gain from including more than one item from each batch. That’s something you’ll have to decide based on your particular use case.
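To make the sampling idea concrete, here is a minimal sketch in plain Python. The `sample_one_per_batch` helper, the `batch_id` field, and the toy rows are hypothetical; adapt them to however your batches are identified.

```python
import random
from collections import defaultdict

def sample_one_per_batch(rows, seed=0, key="batch_id"):
    """Randomly keep exactly one item from every batch."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_batch = defaultdict(list)
    for row in rows:
        by_batch[row[key]].append(row)
    return [rng.choice(by_batch[b]) for b in sorted(by_batch)]

# Hypothetical data: 5 batches with 10 items each.
rows = [{"batch_id": b, "item": i} for b in range(5) for i in range(10)]

sample = sample_one_per_batch(rows)
```

The result has one row per batch, so no batch can dominate training simply because it contains more items.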
2. Weighting and changing the problem
Just like with unbalanced classification problems, you can also use a weighting technique here. Instead of removing examples, you average the items from a single batch and use the number of rows as a weight for your learner and your performance measure.
This isn’t possible for all learners, but quite a few of the common ones support it. To figure out if your learner supports it, you can right-click on an operator and have a look at the operator information.
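Here is a rough sketch of the averaging step in plain Python. The `aggregate_batches` helper and the column names are assumptions for illustration; the resulting `weight` column is what you would hand to a learner that supports example weights.

```python
from collections import defaultdict
from statistics import fmean

def aggregate_batches(rows, features, key="batch_id"):
    """Collapse each batch into one averaged row and keep the row count as a weight."""
    by_batch = defaultdict(list)
    for row in rows:
        by_batch[row[key]].append(row)
    aggregated = []
    for b in sorted(by_batch):
        items = by_batch[b]
        row = {key: b, "weight": len(items)}
        for f in features:
            row[f] = fmean(r[f] for r in items)
        aggregated.append(row)
    return aggregated

# Hypothetical measurements from two batches.
rows = [
    {"batch_id": "A", "temperature": 200.0},
    {"batch_id": "A", "temperature": 202.0},
    {"batch_id": "B", "temperature": 210.0},
    {"batch_id": "B", "temperature": 211.0},
    {"batch_id": "B", "temperature": 212.0},
]

aggregated = aggregate_batches(rows, features=["temperature"])
```

Batch A becomes a single row with weight 2, and batch B a single row with weight 3.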
Of course, this strategy has a few downsides as well.
The first is that averaging all your attributes may not be the best thing to do. If your values differ a lot, then you again lose a lot of information when you average. It, again, boils down to these questions: How much of a duplicate are your rows? Are they just dependent pseudo-duplicates or real duplicates?
The other question at hand is what to do with the label. For regression problems (for example, a measured concentration) you can of course take the average. But is this what you need? Are you maybe more interested in the maximum? Or the 8th decile? As always, you should align your data science approach with the business needs at hand and think carefully about the choice that you’re making.
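To see how much this choice can matter, here is a small sketch that computes the three candidate labels for a single batch. The concentration values and the `label_summaries` helper are made up for illustration.

```python
from statistics import fmean, quantiles

def label_summaries(labels):
    """Candidate batch-level targets for the measured values of one batch."""
    deciles = quantiles(labels, n=10)  # nine cut points; index 7 is the 8th decile
    return {"mean": fmean(labels), "max": max(labels), "decile_8": deciles[7]}

# Hypothetical concentrations measured on the ten items of one batch.
concentrations = [0.8, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 1.3, 1.9, 2.4]

summary = label_summaries(concentrations)
```

With a skewed batch like this one, the three summaries differ noticeably, so a model trained on the mean answers a different question than one trained on the maximum.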
For classification problems, you run into the issue that the mode (the most frequent class) may be even less suitable. In this context, classification problems are often real challenges, where you ask whether something passed a quality test or not.
Here you can use an old computer science trick and change the problem. Consider moving away from your item-based problem to a batch-based problem. The question you may answer here is: What fraction of my batch will not pass the quality test?
This is a batch level regression problem, which can be turned into a classification problem if needed.
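A minimal sketch of this reformulation in plain Python follows. The field names, the 0.2 threshold, and the class names are all hypothetical; in practice the threshold should come from your quality requirements.

```python
from collections import defaultdict

def failure_fraction_per_batch(rows, key="batch_id", label="passed"):
    """Turn item-level pass/fail labels into one regression target per batch."""
    counts = defaultdict(lambda: [0, 0])  # batch -> [failed, total]
    for row in rows:
        counts[row[key]][0] += 0 if row[label] else 1
        counts[row[key]][1] += 1
    return {b: failed / total for b, (failed, total) in counts.items()}

def to_class(fraction, threshold=0.2):
    """Optionally turn the batch-level fraction back into a class label."""
    return "critical" if fraction > threshold else "ok"

# Hypothetical data: batch 1 has 1 failure in 10 items, batch 2 has 3 in 10.
rows = [{"batch_id": 1, "passed": i != 0} for i in range(10)]
rows += [{"batch_id": 2, "passed": i >= 3} for i in range(10)]

fractions = failure_fraction_per_batch(rows)
classes = {b: to_class(f) for b, f in fractions.items()}
```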
3. Sampling with an ensemble
One interesting way to tackle the batch problem is to use a sampling approach, but with an ensemble method. The idea is that you build multiple learners. Each learner sees only one item of a given batch.
This item is randomly chosen. The fact that you do this multiple times ensures that you don’t lose information. The idea can be thought of as a random forest, where the random sampling is replaced with a stratified sampling.
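Here is a toy sketch of that ensemble in plain Python. The base learner (a nearest-class-mean rule on a single feature `x`) is only a stand-in chosen to keep the example short; in practice you would plug in whatever learner you actually use. All names and data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def one_per_batch(rows, rng, key="batch_id"):
    """Draw one random item from every batch."""
    by_batch = defaultdict(list)
    for row in rows:
        by_batch[row[key]].append(row)
    return [rng.choice(by_batch[b]) for b in sorted(by_batch)]

def fit_nearest_mean(sample):
    """Toy learner (an assumption for illustration): mean of x per class."""
    sums = defaultdict(lambda: [0.0, 0])
    for r in sample:
        sums[r["label"]][0] += r["x"]
        sums[r["label"]][1] += 1
    return {label: total / n for label, (total, n) in sums.items()}

def predict_nearest_mean(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def fit_ensemble(rows, n_learners=25, seed=0):
    """Each learner is trained on its own one-item-per-batch sample."""
    rng = random.Random(seed)
    return [fit_nearest_mean(one_per_batch(rows, rng)) for _ in range(n_learners)]

def predict_ensemble(models, x):
    """Majority vote over all learners."""
    votes = Counter(predict_nearest_mean(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical data: 6 batches of 4 pseudo-duplicates each.
rows = [
    {"batch_id": b, "x": (1.0 if b < 3 else 5.0) + 0.1 * i,
     "label": "fail" if b < 3 else "pass"}
    for b in range(6) for i in range(4)
]

models = fit_ensemble(rows)
```

Across the 25 learners, most items of each batch get used at some point, which is how the ensemble avoids throwing information away.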
4. Just leave it as it is
After talking about a few advanced ideas, I would like to emphasize that just using the data “as it is” is also a valid option.
Machine learning algorithms are designed as robust systems. They are used to working with noise. Thus, they can in principle also work with those pseudo-duplicates.
Random forests even produce duplicates on purpose in their bootstrap sampling. Since random forests are very robust learners, and the bootstrapping is a random process, this does not harm the forest.
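You can convince yourself of how many duplicates bootstrapping creates with a few lines of plain Python. This is a simplified sketch of sampling with replacement, not the code of any particular random forest implementation.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, as in bootstrap sampling."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(0)
rows = list(range(1000))  # hypothetical row ids

sample = bootstrap_sample(rows, rng)
unique_fraction = len(set(sample)) / len(rows)
# In expectation, 1 - (1 - 1/n)**n of the rows show up at least once,
# which is roughly 0.632 for large n. The rest of the sample is duplicates.
```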
As usual with machine learning, there’s no free lunch and no easy answers. The ideas listed above are solid starting points. However, the only way you’ll truly know what works well for your data is if you try a few things and see what works best.
Remember that model training is inherently an iterative process, so expect to revisit these choices for your particular use case.