

Did you know that you can automatically select and extract useful features as part of RapidMiner’s Auto Model?
RapidMiner Founder Dr. Ingo Mierswa outlines how RapidMiner incorporates a novel approach to automatic feature engineering in this webinar, where he discusses:
- The trade-off between feature complexity and model performance
- The process of feature engineering for better model accuracy and model understandability
- How automatic feature selection improves models by extracting meaningful features
- How to enable automatic feature selection in RapidMiner
Welcome to this webinar on automatic feature engineering, of course, with RapidMiner Auto Model. In today’s webinar, I’m really excited to talk about this topic because it’s very close to my own previous research work, but I’m not going to bore you with all the scientific details. I would like to give you at least some introduction into the basic concepts of feature engineering and what it is. Some of you might not even be aware that you actually often do some feature engineering already, just by using certain machine learning model types. Then I’ll turn to the automatic approaches, with a focus on model-agnostic approaches, so you can combine this with pretty much all the machine learning models out there. We’ll discuss what the benefits are, see some product demonstrations of how to do that, and, yeah, I hope you will enjoy this.
So let’s just dive into the topic. As I said, automatic feature engineering is important, but why does it actually matter? Let’s first start with my own personal experience, though I’ve heard the same from dozens and dozens of data scientists all around the world. Oftentimes, the difference between a pretty good and a truly exceptional model is not the model selection itself, so what type of model you are using, or even the model parameters, the hyperparameters you might tune. Sure, those are all important things, and you will get good models in many cases. But sometimes, to really get to the next level of model quality, it requires you to transform the input representation of the data. So what does the data that goes into your machine learning model look like? This transformation is really what makes the job for the machine learning models so much easier. If you do it right, the model can actually pick up better on the underlying patterns, can better extrapolate from them, and will actually be better in a predictive sense. That is really often the nut to crack. This whole process is called feature engineering, the topic of today’s webinar. And in fact, in my own experience, but also that of many others, this is the most time-demanding part of the whole model building process. And just to underline, this is not something just I experience myself, but many other, much more famous people as well. For example, here’s a statement from Andrew Ng which I really like. He said, at one point, that coming up with features, so basically the manual approach for feature engineering, figuring out, using some domain knowledge, what to do to transform this feature space, is really a difficult, time-consuming task, and it also requires expert knowledge. And I like the next part here: applied machine learning. So not just the background stuff or the research stuff, but really what you need to do to actually make some breakthrough when applying machine learning in real-world scenarios, that is basically feature engineering, because it is such an important part of it and takes up so much time. And Pedro Domingos, a great researcher in the machine learning space with lots of practical experience as well, observed at one point that, well, of course some machine learning projects fail and some succeed. And his observation was that the difference is really defined by what features you have used. Did you make the life of the model simpler or not? And I fully agree with both of those luminaries in our field that this is such an important topic.
All right. So now we have established that it is important. Let’s now discuss, at least on a high level first (and we’ll go a little bit more into detail on some of those things), what feature engineering actually is. First of all, I would like to start with this big division here in the center. There are certain approaches to feature engineering which are model agnostic, so it doesn’t really matter what type of machine learning model you use, and you can automate most of those things. And then there are model-specific approaches. Let’s actually start with those just for a second, because that’s interesting: model-specific feature engineering is basically feature engineering which happens inherently while you’re building a machine learning model, while you train it. Right now, a very, very famous example is multi-layer neural networks or deep learning, where you can imagine the different layers in a neural network. The first layers pick up maybe very basic parts or patterns in your data, features in your data, and then you combine those very basic features into more complex features, like with this picture here with the dogs. At first, you see some shapes or colors or things like that, and then you combine those very primitive features into more complex features, let’s say a paw or a nose. And then, based on those more complex features, you can have even higher-level features, basically the classes you would like to learn. Is this a dog or is this a wolf or is this a, whatever, a raccoon or some other animal? So that is an implicit feature engineering. It happens as part of the model building. You just don’t need to do anything explicitly. And while this is fantastic and is part of the reason why deep learning is so successful for voice recognition, image recognition, and all those tasks, it also comes with a bit of a problem, and we’ll discuss some of them later. First of all, training times go up. But the models are also really hard to understand, and often they’re relatively complex in the end. That makes them much slower to apply to new data points, so it slows down the scoring, and for many use cases that is not acceptable. So while this is fantastic, sometimes it’s just not enough, and you might need to use something else.
Another thing, which I used quite a lot in my own practice, is support vector machines or other large margin methods. There’s the well-known kernel trick, and I’m not going into the details here. But again, the idea is that if your data in the original data space cannot be separated by some linear function, maybe you can use some non-linear mapping into a new data space where you actually can separate, in the case of a classification problem, those two classes. That’s shown on the bottom right. So model-specific feature engineering is very interesting. It’s the reason why these kinds of models often perform very well on a wide range of use cases and data sets. But the model agnostic ones, they’re really interesting as well. The first one which I would like to highlight is, well, feature selection. And of course, feature selection also transforms the feature space. It takes away some of the columns, for example, let’s say you have a structured data table as an input. And that can actually make the training simpler, because if you remove some noise, then it’s easier to pick up the actual patterns, so to separate the signal from the noise. It also makes the training time of the model faster later on, after you’ve made the final selection of features, because you train on fewer columns, and the same is true for the scoring time. So feature selection is really something which is always worth considering, in my opinion.
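To make the kernel-trick idea mentioned above a little more concrete, here is a minimal scikit-learn sketch (my own illustration, not part of the webinar demo; the toy two-circles dataset and all parameters are assumptions chosen purely to show the effect of the implicit non-linear mapping):

```python
# Illustrative sketch: a linear SVM cannot separate two concentric circles,
# but an RBF kernel implicitly maps the data into a space where it can.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=42)

linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

print(f"linear kernel accuracy: {linear_acc:.2f}")  # roughly chance level
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")     # close to perfect
```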
Feature generation, and we’ll see this later as well, is the technique of taking the existing features, the existing attributes in your data table, and generating new columns based on that information. Maybe summing up two of them, creating a ratio or the difference between columns, or sometimes applying even more complex functions. We will see some examples later. We won’t discuss text vectorization today, so basically transforming unstructured text documents into a structured form using tools like TF-IDF, et cetera. But technically, those would also belong to the feature engineering bucket. The same is true if you, for example, detect specific features and extract them from time series data, certain shapes in the time series, for example, to improve the robustness of forecasting methods or regression methods or just classification models, clustering, whatever it is. That is also feature engineering, but again, probably a topic for another time. So for today, we are going to focus on a fully automated approach for both feature selection and feature generation. They are model agnostic, so you can combine them with pretty much every model which is available. And the approach we’re discussing today is actually quite novel. It’s something which is less known than other, let’s say, brute-force-oriented approaches out there in the market. But it comes with a couple of huge advantages, and we’ll discuss them. And in fact, that’s why I’m a little bit excited about this topic, because it is very close to my own research.
Let me show you my own PhD title. That was Non-Convex and Multi-Objective Optimization for Statistical Learning and Numerical Feature Engineering. And if you don’t really understand anything of that right now, that’s totally fine. Even I stumble over it at this point. So let’s focus on only a couple of things here and forget about the convex stuff, et cetera. The key words here are multi-objective optimization and feature engineering. I worked on this quite some time ago, and for a long time, because I was convinced that this is really one of the most important things, more important even than model type selection or parameter optimization in many, many real-world use cases. Well, let’s dive into this then. Multi-objective optimization and feature engineering, we will approach both of those concepts.
Let’s actually start with the multi-objective part first. What does multi-objective mean? Whenever we do feature engineering, we typically do this to improve something. We either want to get fewer features or less complexity, or we would like to get better model accuracy. Let’s say, if you have a classification model or a regression model, maybe you would like to reduce the relative error of the model or something like that. So on this chart here, you see two of those dimensions. The error rate of a model, and let’s go with the error rate because that works for both classification and regression. And on the y-axis you see the number of influence factors, and that’s certainly one of the simplest ways to measure feature space complexity. What we can see here is a couple of results, five different options for feature sets. Every feature set comes with a different number of features. In the bottom right corner, we have a feature set with only one column, and in the top left it goes up to a feature set where we use five different columns. And you see this shape, this curve here, and that’s a very typical shape for these results. The reason we have a shape like that is because feature engineering is actually a multi-objective optimization problem. That means we have two different objectives: reducing the error rate of the model, but also reducing the complexity of the feature space. And unfortunately, those two objectives compete with each other. You cannot optimize one without making some sacrifice on the other. And whenever you have this situation, you need to make a trade-off between both at some point. This trade-off is exactly described by the shape of this curve. So if you really want less complexity and only go, let’s say, with one or two features, then the best error rate you can achieve is 0.4 or 0.25, in this case. But if you’re actually willing to allow for more complexity, let’s say five features, then you can bring the error rate down to 0.15. So this is a very typical situation for feature engineering. You have those trade-offs. Both things compete. You cannot get better models without accepting a little bit more complexity in your feature space, but you also need to decide how much complexity is still okay for you and what is an acceptable error rate. But the good thing is, if you do it this way with this trade-off, it has a huge advantage, because you can actually make the final selection of which model and which feature space to go with at the end, after inspecting all of this. So that’s the basic idea of this whole multi-objective feature engineering approach.
And in fact, this is not entirely new in the space of machine learning. If you’re familiar with some of the more theoretical concepts of machine learning algorithms, you’re probably also familiar with the concept of regularization. The idea there is that more complex models are somewhat penalized to reduce the risk of overfitting. And in fact, we all know this concept under a slightly different name as well, namely Occam’s Razor. The idea there is, if you have two different explanations for a phenomenon, most of the time the simpler explanation is the correct one. So a simpler solution that still explains things correctly is typically preferred, because the simpler one more frequently turns out to be correct. And in machine learning, we also get more robust models: basically, if the input data changes a little bit, the model is not immediately thrown off. So that’s regularization, and that is also a penalty for complexity while you optimize for model accuracy. And that’s exactly what we did with our trade-off here, although we don’t really penalize it. We just see all those trade-offs in our result, which I personally think is even better.
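As a small aside, the regularization idea can be illustrated in a few lines of scikit-learn. This is not from the webinar; the synthetic data and the alpha value are assumptions. An L1-penalized linear model (Lasso) drives the coefficients of uninformative noise columns to zero, which acts as a built-in complexity penalty and, in effect, an implicit feature selection:

```python
# Illustrative sketch of regularization as a complexity penalty: only 2 of the
# 10 columns carry signal, and the L1 penalty zeroes out most of the rest.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                                    # 10 candidate features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)   # only 2 matter

plain = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("non-zero coefficients, plain OLS:", np.sum(np.abs(plain.coef_) > 1e-6))
print("non-zero coefficients, Lasso:    ", np.sum(np.abs(lasso.coef_) > 1e-6))
```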
Okay, so how can we achieve that? Let’s first start with feature selection. It is a little bit simpler than feature generation, but we will get there. Some of you might have seen a previous webinar I did on these concepts of multi-objective optimization. So over the next couple of slides you might see some of them again, but please bear with me. I’m not repeating all of that; I just would like to show some of the basic concepts in the next couple of minutes. Okay, so let’s say we have a feature set with 10 features, and you’d like to figure out, and that’s the goal of feature selection, which combination of features works best. In order to do that, we can represent all our candidates for testing how well they perform with these binary vectors here. A one means we use this feature, and a zero means we don’t use it. So we have 10 of those binary values. For this first example, we only use one feature, the first one. Then we can train a model on this one and, for example, use techniques like cross-validation or other validation techniques to measure how well this model performs. For example, let’s say it is a classification model and we get a 68% accuracy rate, or a 32% error rate. So we get that result. That’s fantastic. So let’s check the next one. We still stay with only one feature, but now let’s try the second one, and we can again measure the accuracy. Let’s say this one is a little bit lower. So if I had to pick a feature set with only one feature, I probably would go with the first one, because it looks like it’s performing a little better. And so on and so on; we try only the third one, it’s even worse, and at one point we have tried all the 10 different feature sets with only one feature activated. So let’s then try to go with two features. Let’s start with the first two. Oh, look at that. It’s actually a little bit better. It shouldn’t really surprise us that it’s a little bit better, because the first feature already turned out to be pretty helpful in the previous round. Adding a little bit more information is often helpful, unless this information is noise; then it can actually hurt. But in this case, it looks like it helped a little bit. So we have 70% accuracy, and so on, until we go through all those combinations and end up with 62% accuracy for the full feature set. And there you see it already: why is it a good idea to actually do some feature selection? Because if I just go with the original space, I actually only get a 62% accurate model. But we saw, for example, the one in the middle here: if I only go with two features, I get a better model. So why would I use all of them? That doesn’t make sense. So now you can think about this for a second before I actually tell you the answer. How many combinations for 10 features do we need to go through? You saw this block of the first 10 here for only one feature, and then how many are there with two features, and so on and so on. In fact, the answer is 1,024, which is 2 to the power of 10, and to be really exact it’s actually 1,023, because the one combination where we don’t use any features, so basically only zeros, doesn’t even make sense because then there’s nothing to learn from. So obviously we don’t need to do that. So we have 2 to the power of N minus 1 combinations for N, or in this case 10, attributes. And how does it look for 100 attributes? Well, I can’t even read this number any longer, but you can already see how this whole thing blows up.
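For illustration, the brute-force idea just described could be sketched roughly as follows (this is not RapidMiner’s implementation; scikit-learn, a Naive Bayes classifier as a stand-in model, and the helper name are assumptions). Each candidate feature set is a binary mask, evaluated with cross-validation, and for n features this enumerates 2^n - 1 subsets, which is exactly why it only works for small n:

```python
# Brute-force feature selection sketch: one binary mask per candidate subset,
# each evaluated with cross-validation. Feasible for ~10 features, not for 100.
from itertools import combinations
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def brute_force_selection(X, y, cv=5):
    n = X.shape[1]
    results = []
    for k in range(1, n + 1):                       # subsets of size 1..n
        for subset in combinations(range(n), k):
            mask = np.zeros(n, dtype=bool)
            mask[list(subset)] = True               # the binary vector
            acc = cross_val_score(GaussianNB(), X[:, mask], y, cv=cv).mean()
            results.append((mask, acc))
    return results

# usage: results = brute_force_selection(X, y)  # 2**n - 1 cross-validations
```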
That’s the problem with exponential functions. They just grow too quickly. So while for a small feature set, like 10 features, you could even go with a brute-force approach and simply try out all those combinations, for anything a little bit bigger that no longer works. I can’t go through the huge number of models you see at the bottom right corner. And if I then also, let’s say, use a k-fold cross-validation, I need to train k models for each feature set. So it’s absolutely no longer feasible. That doesn’t work. The way we are solving this is by using an evolutionary algorithm for feature selection, and that has a lot of advantages over things like backward elimination or forward selection or other feature selection techniques you might be familiar with. The main point is that all those greedy techniques tend to get stuck in local optima, and evolutionary algorithms do not. And then there’s another huge advantage, which is exactly this multi-objective approach, because it delivers those trade-offs we discussed before. So the approach, and I’m not going through all the details here, tries to mimic the ideas of biological evolution. You initialize a population of different feature sets. Then you select random parents for creating some offspring by performing crossover, so you basically take some parts of one parent and cross them over with some parts of the other parent. And then, of course, there can be some mutations. For example, the mutation for feature selection would be just randomly flipping a one to a zero or the other way round. Then we can calculate the performance; we’ve seen this before. And at the end, we just pit the different feature sets against each other, and those with a higher model performance have a higher chance to survive. And in fact, here is the huge difference: if you would only go for model accuracy or lower error rates, that’s really it. But with this multi-objective approach, I actually try to select not only those which are performing well but also those which have less complexity. And then, and we’ll see this in the next couple of slides, we go round and round and round. At one point, we figure out it’s no longer getting much better, and we deliver the resulting feature sets, this set of trade-offs, as a result to the user.
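As a rough sketch of the evolutionary idea described above (greatly simplified and single-objective for brevity; RapidMiner’s actual implementation is multi-objective and keeps a whole Pareto front, and all parameters here are assumptions), the loop of selection, crossover, and bit-flip mutation over binary feature masks might look like this:

```python
# Simplified evolutionary feature selection sketch over boolean feature masks.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def evolve_features(X, y, pop_size=20, generations=30, p_mut=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    n = X.shape[1]

    def fitness(mask):
        if not mask.any():                      # empty feature set: nothing to learn from
            return 0.0
        return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()

    pop = rng.random((pop_size, n)) < 0.5       # initial random feature sets
    scores = np.array([fitness(ind) for ind in pop])

    for _ in range(generations):
        # tournament selection of two parents
        i, j = rng.integers(pop_size, size=2), rng.integers(pop_size, size=2)
        p1 = pop[i[np.argmax(scores[i])]]
        p2 = pop[j[np.argmax(scores[j])]]
        # one-point crossover: take part of one parent, part of the other
        cut = rng.integers(1, n)
        child = np.concatenate([p1[:cut], p2[cut:]])
        # mutation: randomly flip features on or off
        flip = rng.random(n) < p_mut
        child = np.where(flip, ~child, child)
        # the child replaces the worst individual if it performs better
        worst = np.argmin(scores)
        child_score = fitness(child)
        if child_score > scores[worst]:
            pop[worst], scores[worst] = child, child_score

    best = np.argmax(scores)
    return pop[best], scores[best]
```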
Okay, so those things, because we still need to try a lot of different combinations, apparently take some time. So I will now quickly show you how we enable feature selection in RapidMiner’s Auto Model. Then we will run it in the background while I explain some more concepts to you, and after five minutes or so this should be done and we can have a look into the results. If you’re familiar with Auto Model, you will be aware of this model selection step where you have, in this right column, the different model types you can pick. And then there’s this new data preparation group on the left side where you can enable different types of feature engineering. For activating feature selection, all you need to do is turn on this little blue parameter box for automatic feature selection. You can ignore this little combo box for now. Typically it’s a good idea to go with an accurate feature set in most cases. Okay, so that’s how we activate it. So let’s actually go into this, and then I’ll quickly run through Auto Model. But before I do this, let me show the data we are going to use. This is the Sonar data set, shown in Turbo Prep in RapidMiner right now. We have 60 different columns here, all numerical, and the last column is actually a class. Every column represents a certain band in a frequency spectrum, and the goal is to separate between rocks and mines for this class column, which is the one we want to predict. Let’s quickly visualize this whole thing. If I go with a parallel plot here and use the class as color, you will see it’s very hard to tell the difference between the blue and the red lines. I’m not going through all the details, but every line here represents one row in the data set, and you can already see it’s hard to spot the difference. But in fact, here with this deviation plot, you see a little bit more, at least. For example, there are certain regions, like around attributes 10, 11, and 12, which differ a little bit between the two classes. Then again in the low 20s, the mid 30s, and the mid 40s. So let’s remember this: 10 to 12, low 20s, mid 30s, mid 40s. Those are the areas which are more helpful, and feature selection should actually identify those areas for us. I mean, in this case we can just look at it, but in many cases we can’t.
So let’s actually hand this over to Auto Model. The data is already selected because I did this from within Turbo Prep, so I just clicked on Model here. Otherwise, I would need to load it in the first step. And let’s now pick the class we want to predict. Yep, that’s it. I can go quickly through the other steps; that’s actually all okay, we don’t need to take care of this. Well, in order to speed things up a little bit, maybe I’ll just disable some of the longer-running model types. I will comment a little bit on runtimes at the end of this webinar, but just for this demo, it might be faster. So I’ll go with Naive Bayes, the generalized linear model, and decision trees here. And of course, as I said before, we need to turn it on here, and now we can actually run it. So that will run for some time. Actually, Naive Bayes will probably be relatively quick, so that one might even be done already. But you see, right now Auto Model runs through all those combinations, and there are actually quite a lot. I mean, this evolutionary approach goes through a lot, a lot of different feature sets. It’s not completely random, but it tries out different things, and some of them are promising, some a little bit less. So Naive Bayes has the first results, and this is actually where you will find the results later on. But I will explain this on slides before I go back into RapidMiner in a second to discuss it. Before we do that, and while we’re waiting for the other results, let me quickly show you what’s going on in the background. First of all, I said, “Well, we need to measure the complexity.” And for feature selection, the simplest thing to do is just counting the number of features. But later on, for feature generation, we can do different things. For example, also counting how many functions we had to use, or we could even use different complexity levels depending on the function type. Let’s say, a plus function is less complex than a sine function, just as an example. So we put all those feature sets in this space here. We have the error rate on the x-axis as before and the number of influence factors on the y-axis. So the first thing, obviously, you would like to achieve is to go to the bottom left corner. We would like to get, if possible, to an error rate of zero with only one feature. That would be fantastic. Unfortunately, that’s pretty much never the case, but that’s the goal. So that is the general direction we’d like to go.
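Regarding the complexity measure mentioned a moment ago, here is a tiny illustrative sketch of how counting could be weighted by function type. This is my own assumption of what such a measure might look like, not RapidMiner’s actual formula:

```python
# Hypothetical complexity measure: each column costs 1 (as in plain feature
# selection) plus a weight per function used to generate it.
FUNCTION_COST = {"+": 1, "-": 1, "*": 2, "/": 2, "abs": 2, "sqrt": 3, "sin": 4, "log": 4}

def feature_complexity(feature_exprs):
    """feature_exprs: list of lists of function names used per generated column."""
    return sum(1 + sum(FUNCTION_COST.get(f, 1) for f in funcs)
               for funcs in feature_exprs)

# e.g. [["abs"], ["*", "abs"]] -> (1 + 2) + (1 + 2 + 2) = 8
print(feature_complexity([["abs"], ["*", "abs"]]))
```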
So how are we doing this? I talked about the selection process and the evolutionary element before. We’re not just going with the ones which are performing well in terms of error rate. We actually try to find those which are performing well according to both criteria here, the error rate and the complexity. So let’s have a look at this red point, for example. It also has only one feature but actually a higher error rate. So is this a good feature set to keep? No. Why? We already have one which also has only one feature but actually has a lower error rate. So this red one is not good. This one here is even more terrible, because we actually have a couple of feature sets which have less complexity. For example, the one with two features is better, the one with equal complexity is better, and even one with a little bit higher complexity is better. So there’s really pretty much no point keeping this red point. The same is actually still true for this one here. Yes, it does not have more complexity than the one we already have in the top right corner, but again, it’s not really improving at all. We could, for example, also have a point, let’s say on this 0.15 line, on top of this highest orange point here. Again, same error rate but more complexity. And again, we wouldn’t use this one, because why would we accept this additional complexity without getting better error rates? It doesn’t make sense. So this is true for all those red points and many, many more. They don’t make sense. They’re worse, to some degree, at least according to one of the dimensions, than the orange points. We also say that the red points are dominated by the orange points, and that’s a very important concept. So while we’re running through our evolutionary optimization, what we’re doing is focusing on those points which are dominating the others, the orange ones in this case. So we keep those. Let’s take them out for a second. And then, out of the remaining points, we can again identify some which are dominating the others, and we can keep those as well. We get rid of the rest, and you see already that by doing this type of selection, we move naturally towards the bottom left corner. And that’s exactly how this approach works, in a nutshell. The result we then call a Pareto front. It contains all the optimal trade-offs between the error rate and the feature space complexity.
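The dominance check just described is simple to write down. Here is a minimal sketch (the candidate values roughly follow the slide, with the intermediate points filled in as assumptions): a feature set dominates another if it is at least as good on both objectives and strictly better on at least one, and the Pareto front is simply the set of non-dominated points:

```python
# Pareto dominance sketch over (error_rate, n_features); lower is better for both.
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

candidates = [(0.40, 1), (0.25, 2), (0.20, 3), (0.17, 4), (0.15, 5),
              (0.45, 1), (0.30, 4)]          # the last two are dominated
print(pareto_front(candidates))              # keeps only the trade-off curve
```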
Okay, so let’s go back into RapidMiner, so I can show you the results live in the product. I guess, yeah, the others are still running, but let’s already have a quick look at Naive Bayes. This chart here on the top left is exactly the trade-off chart we have seen before on the slides. We have one feature set which is the original feature space. It has a pretty high complexity of 60, containing all the 60 input columns, but it’s actually not even performing very well. So it has all the data, but it is not really doing a good job. Whenever you click on a point here in this Pareto chart, the feature set is shown down here in this little table. So right now I clicked on this one; it’s marked with orange, and I get all the features here. And we see it has an error rate of 30-something percent. That’s not really good. So instead of doing that, you could go down to only one feature, attribute 12, in fact. And by doing just that and getting rid of all the noise, we get a much simpler, smaller feature space, obviously. But the model also performs better. So there is no reason to go with this very complex one if a much simpler feature space delivers even better results.
There are other advantages. For example, you can train the model faster on only 1 column compared to 60 columns. Often those models are also more robust, and they are definitely easier to understand, because whether the model uses 1 feature or 60 features makes a huge difference if you try to inspect the model and understand it. And model understandability, obviously, is a pretty important topic right now in our space. But if I add a little more complexity, for example by clicking on the next point here with two features, we get, well, two features. And also note that now we get attribute 11; before we got attribute 12. But the interesting thing is, you remember that I mentioned the ranges between 10 and 12 are important, plus the low 20s, mid 30s, and mid 40s. So here, with this one, we already cover two different ranges, the 10 to 12 and the mid 30s. Let’s go to the next one. Now we cover the 10 to 12, mid 30s, and mid 40s. Next one: again, 10 to 12, mid 30s, mid 40s, and so on. And the models get slightly better. Let’s go to the one which is actually used here, which has 10 features, those here. And they really cover all the relevant areas in those different feature ranges. And that is exactly the power of this multi-objective approach. Even if you hadn’t known this before, because sometimes you just can’t see that in plots, you can still build this understanding by looking into this and asking, “Huh, what are actually the important features?” or, in this case, even feature ranges, and which ones of those need to be represented? So the other models are done as well. The generalized linear model actually was running for quite a long time, but let’s have a quick look at those results as well. As you can see, for the generalized linear model at least, it also starts with attribute 11, not 12, but the same range. In this case it actually performs a little worse than the original one containing all the 60 features. But already the next one, mid 40s and 10 to 12 again, is outperforming the original space. And the final one is again a set of eight features or so, covering all the different areas here. The same is true for the decision tree as well, starting with only one feature, attribute 10 in this case. And it really ends up with a much simpler model, which still performs very well on this dataset.
Okay, so we have seen that. That’s good. So let’s now move on to feature generation. Feature generation is obviously something different from feature selection: we generate new attributes out of the existing ones. And you can combine them and do both at the same time, which is actually also one of the strengths of this novel approach I’m presenting to you today. So why is this, for example, useful? Well, let’s say you are looking for good opportunities for buying land, a very simple example here. And you have the attributes: the length of the lots you would like to build houses on or something, the width of the lot, the price, and whether you bought them in the past. So basically a class: is this a good deal for you or not? So length, width, and price. Well, those three features alone are actually not super helpful. You’re better off creating new features based on this table. First, you could create a new column called, let’s say, area, which is length times width. And then you could create another column, let’s say price per square foot or something, which is the price divided by the area, the column we already generated before. If you do that, you actually end up with a different feature space, which makes it much simpler for the model to actually make this decision. I know it’s a very simplistic example, but these kinds of things actually happen. Not always, but if they do happen, and if you can’t come up with this idea yourself, maybe because you’re lacking the domain expertise, or maybe because you have the domain expertise and still didn’t come up with this idea, then the model will have a harder time. So again, it’s not always useful for all use cases, but it’s often worth a try to see if there’s maybe another nut to crack, and then we can really get to much better results. Okay, but adding new columns again adds complexity to the feature space, and adding new columns by using even more complex functions even more so. And actually, this was a problem in the early days of machine learning. Many people tried to use genetic programming for that, and it often failed. The reason was something called feature bloat: you just generate too many things which look helpful, but they just blow up the complexity of the feature space. But with this multi-objective approach I’m describing here, that actually cannot happen.
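To spell out the land-lot example in code form, a minimal pandas sketch could look like this (the column names and values are purely illustrative):

```python
# Feature generation sketch: derive "area" and "price per square foot" from the
# raw columns so a model sees features that actually carry the signal.
import pandas as pd

lots = pd.DataFrame({
    "length": [100, 80, 120],
    "width":  [50, 90, 40],
    "price":  [250000, 380000, 230000],
})

lots["area"] = lots["length"] * lots["width"]          # length x width
lots["price_per_sqft"] = lots["price"] / lots["area"]  # price / area
print(lots)
```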
Okay, so let’s have a look at a little dataset here. And then, again, I will start it so it can run in the background while I explain a couple of concepts to you during the calculations. So this is the data. It’s basically only one dimension, attribute one here, which is shown on the x-axis. And the label, so this is a regression problem in this case, is the curve you see on the screen right now. This is the one I would like to predict. You see it’s definitely not a linear function. There is this kind of strange peak in the center, it goes up on the right, and it goes first up but then down again on the left side. So that’s the regression function, a strange peak in the center, and there also seems to be some noise. Plus, we have three more columns. You see this in the selector on the left side: random, random one, and random two, and those are actually also confusing the model. So that’s the dataset we would like to predict, and if we would just use a linear model on this dataset, without any feature engineering, then, while the linear model is doing the best it can, it just puts the best-fitting line across the data points here. But it obviously doesn’t get the shape of the curve, and it also gets confused a little bit by the added noise, and this all leads to the situation that the relative error, in this case, is 39%, which is relatively high. It shouldn’t come as too much of a surprise that a linear model can’t do that. Well, some might argue, “Well, then just go with deep learning or decision trees or something else.” Sure, we could do that, but the problem with those models is, they will perform better, but the outcome of the model is way too complex. So at the top here, that’s not a graphical error. That is actually the decision tree used on this relatively small data set at the bottom to predict this. And yeah, it’s actually doing quite a good job. It doesn’t get the peak in the center, but otherwise it is doing a good job. It’s still a little bit noisy, but it only has a 6% error rate. That’s fantastic. But look at the model at the top. Training a model like that on the final feature set takes longer. Every scoring of every data point takes longer. And good luck with understanding this model. And if you think, “Oh, well, then let’s even go with gradient boosted trees to bring the error down from 6% to 5%,” fine. Then you have 200 of those models there at the top and understandability is completely going down the drain. And the same is pretty much true for deep learning and some other complex model types. So if runtimes and understandability are not an issue for you, of course, by all means, try those more complex model types. But sometimes you’re better off actually trying to do feature engineering with a simpler model like a linear model and getting to better results.
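For readers who want to reproduce the flavor of this comparison, here is a rough stand-in sketch. The webinar’s exact dataset is not reproduced here; the generating function, the noise, and the three random columns are my assumptions, chosen only to mimic the described shape:

```python
# Stand-in 1-D non-linear regression problem: a plain linear model versus a
# decision tree, both trained on the raw attribute plus three noise columns.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
x = rng.uniform(-10, 10, size=500)
noise_cols = rng.normal(size=(500, 3))                   # three random columns
y = np.abs(x) + 0.5 * x * np.abs(x) + rng.normal(scale=2, size=500)

X = np.column_stack([x, noise_cols])                     # attribute one + noise

for name, model in [("linear model", LinearRegression()),
                    ("decision tree", DecisionTreeRegressor(max_depth=8))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: cross-validated R^2 = {r2:.2f}")
```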
So here, I had run this before; I’m not running it live in the webinar now, but you see the different outcomes for the different model types. You see some are actually doing a pretty good job. The decision tree does surprisingly well, and gradient boosted trees a little bit better. Deep learning you can maybe also get down into the 5 to 6% range by tweaking the network architecture, and the same is true for the support vector machine with some kernel parameters. So I’m sure we can get some of those models even better, but my main point is: some models perform well, but then you can no longer understand them, and they get really complex. And the ones you could understand are just not performing well on a data set like that, and that’s without feature engineering. So how do we enable feature generation in Auto Model? Well, after you have enabled feature selection with this checkbox on the left, there will be another checkbox enabled, which is called automatic feature generation, which you can turn on. And again, for now, just ignore the little combo box below it. Typically, I would recommend going with the medium function complexity. That delivers quite good results in most cases without going completely crazy with long chains of functions, like calculating the sine of the logarithm of whatever. I mean, you can do that and try, but often medium is a good balance. Okay, so let’s do this in RapidMiner. Let’s restart Auto Model here. I have the data set here, the non-linear function; that’s exactly the data set we’ve seen before. So we have this attribute one, the three random ones, and then the label to predict. And you see there are also a couple of cases where we don’t know the label yet. So we would like to know what the true value, the true label, is; that’s where you would like to apply this model later on. All right, let us quickly go through those settings. Okay, so first you need to enable automatic feature selection and then also automatic feature generation, because, well, it’s not always necessary. Just again, to speed things up here even more, let’s just go with the GLM and decision trees for now, and let’s run this. That will take, again, a couple of minutes. So let’s go to the slides in the meantime, so I can explain to you what we can expect, but also how we are doing this.
So frankly, there’s only one change to the whole approach. We have seen the evolutionary algorithm for feature selection, and the only change we are making is for the mutations. Before, for feature selection, the only mutation we had was turning features on and off, but now we add another form of mutation: generating new features. So we basically end up with two different forms of mutation: turning features on and off, but also generating new ones, typically based on the currently selected ones. That, of course, increases the feature set sizes, but then the selection, mutation, crossover, and everything else, plus the multi-objective approach, will bring them down again to the right levels. Okay, the result will, by the way, look exactly the same as in the pure feature selection case. We’ll get this Pareto front, those trade-offs between complexity and error, but if you click on the different solutions now, and we will do this in the product in a second, you will see that we are no longer only getting the original features but also some functions applied to those features. For example, this is the result for the generalized linear model on the data we are currently running the demo on. And here, the most complex feature set actually has only two columns, but since we apply a couple of functions here, each column has a higher complexity. This is, again, to prevent the feature bloat known from genetic programming, so that we are not just creating really crazy function chains here; we increase the complexity count for those things as well. And as you can see, the original feature space is actually no longer part of this. All the random features are gone. That’s good. Attribute one on its own wasn’t super helpful anyway; we saw that. But we calculate two new ones, and then, obviously, the simpler feature sets will also have simpler functions. So let me actually see if this one is done. No, it’s still running. Well then, I’ll show it on the slide first, and then we’ll go back into the result. So as we have seen, for the best result we got the absolute function of attribute one, and also attribute one times the absolute function of attribute one. So here’s the absolute function of attribute one times attribute one, and here’s the absolute function of attribute one. So why are those helpful? Well, here I plotted the shapes of those two functions so that we can get a feeling for how they look. I mean, that shouldn’t surprise you, but yes, that’s how they look. So why are they helpful? Because now the linear model can actually take those two new functions, or this new view on the data, and overlay them, almost like a superposition, giving each of those new functions a certain weight, and generate what you see here at the bottom. And that’s fantastic, because that should already remind you a little bit of the function we wanted to learn. And if I actually use this prediction, the red curve is now the prediction coming out of the linear model. You see it’s not confused by the noise, since the noise no longer went into the model, and it’s roughly hitting the shape of the curve. Not perfectly, obviously, but it’s getting pretty close. The error rate dropped big time. I mean, it didn’t get the peak in the center. Maybe if you let this run a little bit longer, it would also get the peak, and probably the shape of the curve would be a little bit better.
I’ve seen some examples with longer runtimes which get there. But this is already good enough in my mind. It’s now at 10%, but the huge advantage is that this is actually a model I can understand, versus, let’s say, the decision trees we have seen before. And even the decision tree gets a little bit better. So in this case, we see the error rate for the GLM dropped from 39% to 10%, and the decision tree was at about 6% before; now it’s 5%. Well, the runtimes are apparently longer, but especially in the case of the GLM, you get a much simpler model, it’s more robust, it’s faster to train, and it’s faster to apply and use for scoring. And the most important thing, it still stays understandable. So here are the results with the unknowns. That’s what we just saw: 10% and 5% error rates, plus the runtimes. The feature sets, as I said before: this is the original one, including the noise. Well, if you only go with one feature, then just go with attribute one. That’s better than using the noise. Okay, good. That’s already important to know. But then again, and that’s why I love this trade-off navigator so much: typically your columns are not called “random” or “noise”, so you wouldn’t necessarily know that they are just noise. So you can just navigate through this and see what is important. Okay, that’s the most important one. Okay, fine, but then I can actually calculate, let’s say, the square root of it, and that already helps a little bit. Or I go with the absolute function plus the original one, and that helps a little bit as well. Or then the final result we have seen here. And then if you have a look at the model, again, it’s a much simpler linear model people can actually still understand, which would not be the case for the more complex model types.
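To make the superposition idea concrete, here is a minimal sketch of why the two generated features help a plain linear model. It reuses the illustrative stand-in data from the earlier sketch, not the webinar’s dataset, so the exact numbers will differ:

```python
# Once abs(x) and x*abs(x) exist as columns, the linear model only has to find
# two weights to superpose them, which recovers most of the curve's shape.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
x = rng.uniform(-10, 10, size=500)
y = np.abs(x) + 0.5 * x * np.abs(x) + rng.normal(scale=2, size=500)

X_raw = x.reshape(-1, 1)                                  # original attribute only
X_gen = np.column_stack([np.abs(x), x * np.abs(x)])       # generated features

for name, X in [("raw attribute", X_raw), ("generated features", X_gen)]:
    r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
    print(f"linear model on {name}: R^2 = {r2:.2f}")
```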
You see the data here. Let’s quickly go over to Turbo Prep again just so I can create a chart of this. I mean, you already saw the result before. So there’s the function we want to predict. This is the original attribute, which is actually no longer used. That’s the absolute function, and that’s the absolute function times the original attribute. And that’s the prediction, the prediction in red, with the label we want to predict in blue. So that’s exactly the result we wanted to get.
So before I wrap up: you saw a couple of cool things already, I guess. There’s the trade-off navigator, the fact that you get those trade-offs, the newly generated features; that’s all fantastic. But I would like to summarize some of the things which are happening under the hood and make this whole approach really special. The first thing is, yes, it takes some time. I mean, feature engineering is, as Andrew Ng and Pedro Domingos have said, and we saw this on the slides, the most time-consuming part, and it can easily take months of your personal time if you do it yourself. So yes, it also takes time if you use an automated approach, but it’s still much faster than doing it yourself. But here’s the thing: this novel approach I have presented to you here actually delivers all those trade-offs in only one single optimization run. You could obviously also just say, “Hey, give me the best result for one feature, the best result for two features, for three features, and so on.” But then you would run multiple of those optimization runs, and it would take even longer. And you wouldn’t even know if one is a good number, or five. Or maybe you should stop at 10, or maybe at 50. You can’t know that. But there’s a side effect of this multi-objective approach: you get all the results, and you can inspect them after this one single run at the end, so you don’t even need to decide upfront what level of complexity is the right one for you. You actually get this as a result of the approach. And that’s truly exciting and game changing here for automatic feature engineering. Well, sometimes, oftentimes actually, you will see that the models are not much better than without using any feature engineering at all. We saw this a little bit for the generation example here, where the GLM plus feature generation was not much better than, let’s say, a decision tree or deep learning or others without feature engineering. And that’s okay. But after you go through this exercise, you are often in the situation where you say, “Well, look, my linear model is now much simpler and easier to understand with just a couple of features. Maybe it’s not much better, maybe also not much worse, but it’s roughly equal to a more complex model.” And after you figure that out, training that model and using it for scoring will all use less computation time. It also often makes your model more robust, because the feature space is simpler. And that is another really, really nice side effect of feature selection, or sometimes even feature generation. Even if the model accuracy or the error rate doesn’t change a lot, the lower complexity of the model is still beneficial. So if time allows, if you want to give it a try and have an hour to spend, it’s often good to do this just to see whether you can get equally performing models, sometimes better performing models, but at the same time simpler models. It will pay off in terms of computation time.
The thing I didn’t show here, and it’s actually not in Auto Model yet, but I did some research on it in my past and it will be part of Auto Model very soon as well, is that you can also use this approach for unsupervised learning. And that surely blows many people’s minds, because this is generally considered kind of an unsolved problem, but we could actually solve that as well. It will be part of Auto Model very soon, so stay tuned. So while we saw it takes longer, I would like to make some comments on the runtimes. I said before, “Look, if you do this manually, that can easily be a month of your personal time.” So doing it automatically and waiting an hour, for example, is still worth it, because during that whole time you can do something else. But I would like to explain to you why that is the case. Normally, if you do an Auto Model run without feature engineering, we still calculate hundreds of different models for you and show you the best performing models and model variants, and then we go through parameter optimization. We do all of that for you. But the second you turn on automatic feature engineering, we easily go through 100,000 and more models in one Auto Model run. And that’s the difference. Whether I go through 500 models or through hundreds of thousands of models, it just takes longer. All those models need to be trained and validated, and that simply takes time. Again, still faster than you doing it manually, but compared to, let’s say, an Auto Model run without feature engineering, it obviously takes longer. So we actually put in some time limits, because at a certain point it’s just not worth running it much longer, since the results don’t change much anymore. So typically, it will never run more than approximately one hour longer than the Auto Model run without feature engineering turned on. So if you get the results, let’s say, in 10 minutes without feature engineering and you turn it on, it should not run much longer than an hour and 10 minutes, typically. That’s what you can expect in additional runtime, but you will always see much longer runtimes the second you turn on automatic feature engineering. We still believe this additional hour is absolutely acceptable compared to the manual time it would otherwise need. And also keep in mind that, from our usage data, we can see that the average Auto Model run is only 6.5 minutes. So it is already a very, very fast solution to get to good models anyway. Now, if you think automatic feature engineering might do the trick for actually getting simpler models or better models for you, after you did some prototypes figuring out which model types work at all in those first six and a half minutes on average, it might be worth adding this additional hour. So that’s what you can expect and should plan for.
So yeah, as I said, especially for smaller data sets, that can be significant. I mean, if the normal Auto Model run without feature engineering already takes, let’s say, 10 hours and now it takes 11, that’s not much of a difference. But for smaller ones, you will notice the difference. We saw that in the demos before. So be prepared for that, get some coffee. But at the same time, you get additional insights, all those interactions between the features, which is great. You will get more robust models and faster scoring times in the end. I mean, there are a lot of things worth looking forward to. But yeah, that’s what it is. Okay, I hope you enjoyed this webinar on automatic feature engineering. It’s definitely a topic which I think is very, very important. I think we’ve created a fantastic solution for this, based on a lot of research work from our own team members in the past. It’s tightly integrated with RapidMiner Auto Model. As always with RapidMiner Auto Model, you can actually open the underlying process, so you can see how you can also use this automatic feature engineering approach in your own processes. If you prefer that over using Auto Model, that’s possible as well; there’s a new operator available to you. So I hope you enjoy it. Give it a try, and I hope that all your models will be much better in the future, or at least simpler, or, in the best case, both. Thank you.