Better Machine Learning Models with Multi-Objective Optimization

The search for great machine learning models is about overcoming conflicts. We want accurate models, but we don’t want them to overfit. We also want more features to improve accuracy, but not too many to avoid the curse of dimensionality. So simultaneously optimizing multiple conflicting criteria seems like it should be a standard solution in the data science toolbox.

Join RapidMiner Founder and President, Dr. Ingo Mierswa for this webinar where he discusses:

  • Multi-objective optimization: the secret to great modeling
  • Methods for applying it in machine learning and feature engineering
  • How to apply these methods in RapidMiner

Hello, everyone, and thank you for joining us for today’s webinar, Better Machine Learning Models with Multi-Objective Optimization. I’m Hayley Matusow with RapidMiner and I’ll be your moderator for today’s session. I’m joined today by RapidMiner founder and president, Dr. Ingo Mierswa. We’ll get started in just a few minutes. But first, a few quick housekeeping items for those on the line. Today’s webinar is being recorded, and you’ll receive a link to the on-demand version via email within one to two business days. You’re free to share that link with colleagues who are not able to attend today’s live session. Second, if you have any trouble with audio or video today, your best bet is to try logging out and logging back in which should resolve the issue in most cases. Finally, we’ll have a question and answer session at the end of today’s presentation. Please, feel free to ask questions at any time via the question panel on the right-hand side of your screen. We’ll leave time at the end to get to everyone’s questions. I’ll now go ahead and pass it over to Ingo.

Excellent. Thanks, Hayley. All right. So today's topic really is about how we can actually get to better models. And the idea is not really just to figure out how we can make a better model selection; it's not about whether, let's say, a gradient boosted tree is better than deep learning, or whether regression is maybe the model of choice here. That is an important topic as well, but it's not really what we're going to solve. It's really more about: if you already have some good model candidates and some good results, how can you really enhance them? How can you actually get better models? And one of my go-to techniques for that is using multi-objective optimization. I spent quite a bit of time on this topic when I was still a researcher, so I'm really, really happy to share some of the things I figured out back then with you, and I hope you can make good use of this tool to improve your machine learning results. Before I actually get into the topic, just one or two sentences about myself. Yeah, as Hayley said, I founded RapidMiner many, many years ago. I'm still a data scientist at heart; I still really try to solve most of the things we can solve by using data and data science methods, to really get to the bottom of things, figure out what needs to be done, and how we can improve our own business, and of course I've done this with a lot of organizations in the past. I was working as a consultant for many years, actually before it was even called data science and before people even talked about machine learning, so that was a little bit ahead of the curve. My main gig was probably for about five years in the pharmaceutical industry, although I'm by no means an expert for that industry even after those five years. But anyway, so yeah, a little bit of practical experience, research experience, and then, of course, since I founded RapidMiner, I've been hearing a lot about how our users and customers are solving their data science problems.

So that's about me. And then maybe let's just dive into this. As I said, I spent a lot of my own research time on this topic, and in the end it became my PhD. So when I was still in Germany, I was working on this whole thing, and the title of the PhD is on the screen right now: Non-Convex and Multi-Objective Optimization for Statistical Learning and Numerical Feature Engineering. I totally get it if you really don't have any clue what this whole thing is about at this point. That's just the typical thing we researchers are always doing, coming up with a very fancy-sounding PhD title to impress people and show off a little bit. So forget about the title. But you saw it: there is this multi-objective topic in there, and that is really something I realized quite early when I started using machine learning. This is an ongoing thing when you create machine learning models, and it's not just for the machine learning itself, it's actually also for most of the preprocessing tasks we have and all the things we do around machine learning. So if you have a look at the screen right now, you have two major topics in machine learning. I mean, of course, we have different model types, like for regression or association rules, but hey, let's stay simple here for a bit and talk about classification models first, on the right. And one of the things you often want to do is to figure out what the best input vectors for your machine learning model are. By reducing those input vectors, hopefully, you can get more accurate models by focusing on the signals and a little less on the noise, but also, typically, as another advantage, the models are simpler, they're more robust, and they can be trained faster. So those are two important topics, and I guess every data scientist is dealing with them kind of on a daily basis.

But as I said before, if you do feature selection, or if you learn a classification model, in this example here, let's say a linear large margin method like a support vector machine, where we try to find this linear separating hyperplane, this white line in the middle, to distinguish between the green and the orange points: what do those two things really have in common? And that was maybe the first big, whatever, eureka or aha moment I really had, to realize that they all have one thing in common, and that is that you never only have one goal, you always have multiple goals. And unfortunately, and that is what makes this problem in machine learning so hard, those goals always compete with each other. It's not so simple that while you're improving on one of your targets, one of the goals, it's easy to also improve on all the others. Typically, there's some form of trade-off. So for example, for feature selection, if you have more features, you give more information to the model you do the feature selection for, and that often leads to more accurate models. But as I said before, the models are getting more complex and they take longer to train, and sometimes you are at more risk of overfitting as well. So it's not that easy to then say, "Well, then let's give everything to the model." You really need to make a good trade-off between the simplicity, robustness, and understandability of the model on one hand, and fewer features help you there, and the power, the accuracy of the model on the other hand. That is the trade-off you have there.

And for classification models, things are really similar. And that's true, basically, for all the model types as well. So it's not as simple as just optimizing for the training error of the model; in fact, that's the last thing you should optimize for, but of course, most machine learning models need to do this to some degree. So of course, you don't want to find a linear separating hyperplane which just cuts across all those classes and doesn't separate anything; that is not the goal. You want to have a good model which correctly separates most of the orange and the green points. But it's okay to make some errors. Let's for example say this would not be a straight line but some curve which really carves out the green point which is part of the orange group and the orange point which is part of the green group. Yes, sure, that would be more correct in terms of training error, but at the same time, the model gets more complex, and it's more likely that you will make more errors on the test data. Again, it's really about this trade-off, and it's really about figuring out what the right trade-off is to get to the best results in the future. So it's all about trade-offs, it's always about multiple objectives, it's always about those conflicts. And that is true for basically all machine learning models, and funnily enough it's even true for lots of the meta-modeling we do around the machine learning, like feature selection, parameter optimization, and so on. So that's the problem we have. And are we doing a good job of solving it? Well, sometimes.

So you're probably familiar with the term regularization. In case you're not, you really should become familiar with it, because this is one of the main concepts in machine learning which helps us build better models that generalize better to unseen data points. The basic idea is based on what we also know as Occam's Razor: if you have multiple solutions for the same problem, it is often the simplest solution which is the correct one. It's just something you probably have seen in life as well: sometimes you really need to bend over backwards to get to some point or explain something, and if you then take a step back and think about it again, you find a simpler solution, and often that is really the correct one and the best way forward. That has long been known as Occam's Razor, and regularization in machine learning takes this concept and formalizes it in a mathematical way. The whole idea is, for example, if I go back here to the classification model, that having this curve which wobbles around all your data points just to make it all perfect and reduce the training error means you get a more complex model, and often this is just not the correct one, so you should avoid that in most cases. So regularization is what allows machine learning models to, well, for lack of a better word, optimize for both goals at the same time. And that's fine. We have known this for decades now, and most machine learning models have some flavor of regularization built in, but most of those meta-modeling heuristics like feature selection or parameter optimization don't. And even in machine learning, it often requires the user to define the importance of those conflicting goals ahead of time, and you can't really do that. So what I did in the PhD, and that was the whole topic, was making this trade-off, this conflict, explicit, and trying to optimize for both goals, or multiple goals, at the same time, getting all the possible solutions for all reasonable trade-offs in the same time that one optimization run of a model or a feature selection would take.

So I did this for like 10 or 12 different topics, including the machine learning part itself, which didn't work that well, I have to be honest here, so I won't spend a lot of time on that. But if you're interested in learning more, feel free to download my PhD and read a little bit about it. The short summary is that doing this for the machine learning itself is, most of the time, not super helpful, especially not for large margin methods, and not really super helpful for decision trees either, to be really honest. The whole neural network, deep learning craze wasn't that big when I was doing my research; that might be a topic which is interesting again, so if you want to take what you learn today and apply it to machine learning directly, doing multi-objective optimization for optimizing network architectures might be something worth looking into. But that's not the topic for today. For today, because of time, I focused on one major problem which is still a major problem even in the time of deep learning and gradient boosted trees, methods which are very powerful and do not need the same level of feature engineering as many other methods, and that is feature selection.

So what we're going to do now is discuss feature selection and the current state of the art, and I'm going to show you how you can use multi-objective optimization to get a much better result and many more insights in the same amount of optimization time, so that's cool. But almost as a side effect, we can then also solve a problem which people typically consider to be not solvable, or unsolvable, whatever the right word is. The point is, for unsupervised learning like clustering, there are people who try to do feature selection, and it doesn't really work very well. And I would like to take you a little bit on the journey there, so you can make the same experiences I did and see why it's just not working in general. But after we've gone on this journey, at the end, I will also show you how you can still solve it. And that is pretty awesome because, frankly, most of the time, whenever you do unsupervised learning, you can't do feature selection at all, and you will get all kinds of clusters which, unfortunately, have zero meaning, just because they kind of model the noise in your data. And it's really hard to figure that out.

So that is the topic for today. Let's go into this, let's start with the state of the art, and then we take it from there and improve on the state of the art, and you can then transfer your learnings to other areas of machine learning as well. All right, so it's super simple. Let's start with a feature set of 10 features, or attributes, as we call them in RapidMiner, or columns if you think more in terms of tables in Excel. So let's say there are 10 features, and of course you could just use all of them, but as I said before, even for more modern machine learning methods, that is often not a good idea, because if you hand over too much noise to the machine learning method, it starts modeling the noise, the risk of overfitting rises, and, yeah, it just takes longer, and it's harder to understand the model if it can use more features. So let's find the subset of those 10 features which works best in terms of delivering the best accuracy. Can we figure that out? Well, how can we do that? Let's, for example, start with only using one feature. To make things simple, we start with the first one, okay? If we use only the first one, I can train a model using only this feature and I can also validate this model, let's say with a 10-fold cross-validation, and I can actually see how well this model performs. And since we started talking about classification, let's just pick some performance criterion, in this case, for example, accuracy. So if we only use this one feature and feed this data set into a cross-validation using any machine learning method, we could, for example, measure that we get a 68% accuracy. All right, that's good. Let's move on.

Now we can, for example, try a different feature, let's say number two. We validate this model again using only this feature, and we see, for example, that the accuracy is only 64%, which means it's a little bit less good than the one using only the first feature. And you can go on like that. Only using the third feature, for example, delivers a model with 59% accuracy, so boohoo, that's not really good, so let's forget about this one. The first one so far was definitely the best. And I can go on and on until I've tried all 10 different options, and after I've done this, well, those are only the options using one feature. What about using two features? Sure. Let's start with the first two, let's measure that one. Oh, look at that, 70% accuracy, a little bit better, so far our best candidate, so that's good. Let's then try the other combinations of two features, and so on, and with three and four features, until we finally have tried everything, including using all 10 features. And as you can see, the accuracy for that model would only be 62%. And that tells you right there that it would be a good idea to only use a subset, for example, the second from the bottom using two features, which has a much higher accuracy and uses fewer features. Again, model training is faster and the models are simpler, so it's easier to understand them.
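For readers who prefer code, here is a minimal sketch of this "score one candidate feature subset with cross-validation" step, assuming scikit-learn and a Naive Bayes learner rather than RapidMiner's actual implementation; the function name and bit-mask convention are just illustrative, and the accuracy values in the comments simply mirror the hypothetical numbers from the walkthrough.

```python
# Minimal sketch: score a candidate feature subset by cross-validated accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def score_subset(X, y, mask, cv=10):
    """Mean cross-validated accuracy using only the features where mask is 1."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():          # the all-zero subset cannot produce a model
        return 0.0
    scores = cross_val_score(GaussianNB(), X[:, mask], y, cv=cv, scoring="accuracy")
    return scores.mean()

# Example with 10 features:
# score_subset(X, y, [1,0,0,0,0,0,0,0,0,0])  -> e.g. ~0.68 (feature 1 only)
# score_subset(X, y, [1,1,0,0,0,0,0,0,0,0])  -> e.g. ~0.70 (features 1 and 2)
```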

Okay, so far so good. But as you can probably guess already, that might take some time, because I need to go through all those combinations of, well, different feature set sizes. And based on the 10 features I have, how many combinations are there? Well, a data scientist should know this answer right away. If you have 10 features, and you can turn each of them on or off, so you have 2 options for every feature, that gives you 2 to the power of 10 combinations, which sums up to 1,024 different combinations for 10 attributes. And frankly, that's not exactly true, because the one combination where you don't use any features, basically all zeros, is not really an option since you can't build any model there. So actually, it's 1,023 combinations, but yeah, whatever. The point is, 2 to the power of 10 combinations, and that's a lot. And think about it: for every combination, we need to do a 10-fold cross-validation. That means we need to train 10 different models per combination, so we end up with more than 10,000 models in this case, and that's for only 10 features. So what about 100 features? The easy answer is, well, 100 features gives 2 to the power of 100 combinations. But that's a number which is not easy to grasp if you write it out. For 100 features, the total number is 1,267,650,600,228,229,401,496,703,205,376, and of course minus one again, which doesn't really help here because this number is just too large. And then times 10 again for the 10-fold cross-validation. There is no way that for a reasonably sized attribute set, like with 100 attributes or features, you could ever get to the result before the end of time.
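As a quick sanity check of those numbers, a few lines of Python (an aside, not part of the webinar demo) reproduce them:

```python
# Number of non-empty feature subsets for n features: 2^n - 1.
def subset_count(n_features):
    return 2 ** n_features - 1

print(subset_count(10))   # 1023
print(subset_count(100))  # 1267650600228229401496703205375
# With a 10-fold cross-validation per subset, 10 features already mean
# roughly 10,230 trained models; 100 features are hopeless by brute force.
```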

Okay, so apparently going through all the combinations is not an option for any realistically sized data set. We can't do that. That's why smart people came up with a smart idea. Many people think, "Well, maybe we can use some heuristics to take shortcuts, so we don't need to go through all the combinations, because this kind of brute force approach doesn't work; maybe we can get to results faster." Good idea. The two most common shortcuts people are using are called forward selection and backward elimination. They are both greedy heuristics, basically always trying to get from one good position into the promising direction of other, potentially better solutions. The catch is that they are greedy, so they never go down on this fitness landscape we see here on the screen. Whenever they feel like, "Well, here's some hill I can climb," they are going to do this. The problem with this approach, though, is that typically we have what we call a multi-modal fitness landscape. It's not like there is just one extremum, one maximum, and the rest is kind of flat and everything leads to this one hill, so that those greedy hill-climbing approaches like forward selection and backward elimination have to end up there. That's just not how reality looks. It could be a uni-modal optimization problem, but feature selection is not like that. In reality, you have this one global maximum you would like to reach and then multiple local extrema, local maxima. And if you happen to start somewhere, let's say at the position of this red dot here, and you just climb the nearest hill, you will end up on one of those local maxima, but you will never reach the global maximum, because this greedy hill-climbing approach is then sitting on top of some hill with no way out; it would need to climb down again, and it just doesn't do that. I will show you some experiments on a data set so you get a feeling for how problematic that really can be in practice. So brute force does not work because the number of combinations is way too large, but unfortunately, greedy optimization schemes like forward selection and backward elimination don't work either.
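To make the greedy behavior concrete, here is a minimal sketch of forward selection, not RapidMiner's operator, just the idea, reusing a subset scorer like the hypothetical score_subset above. It only ever adds features and stops at the first plateau, which is exactly how it gets stuck on a local maximum.

```python
import numpy as np

def forward_selection(X, y, score_subset):
    """Greedy forward selection: keep adding the single best feature
    until no addition improves the cross-validated score."""
    n = X.shape[1]
    selected = np.zeros(n, dtype=bool)
    best_score = -np.inf
    improved = True
    while improved:
        improved = False
        best_candidate = None
        for j in range(n):
            if selected[j]:
                continue
            trial = selected.copy()
            trial[j] = True
            score = score_subset(X, y, trial)
            if score > best_score:
                best_score, best_candidate = score, j
        if best_candidate is not None:   # only moves uphill, never back down
            selected[best_candidate] = True
            improved = True
    return selected, best_score
```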

And I'm actually getting a little bit tired that this is still the state of the art taught by many, many colleges to their students, because these are really among the weakest standard methods for feature selection. To be honest, you should stop using them, because there are better options out there, and I will show you one in a couple of minutes. So forward selection and backward elimination, yes, sure, they're certainly better than doing nothing, but they won't deliver anything near the optimal results in practically 99% of the cases. So really, you shouldn't do this. Then what can you do? I would like to suggest one solution. I mean, there's been a lot of research in this area, but this solution has a couple of advantages. I would suggest using evolutionary algorithms for feature selection, and you will see that you can do this really, really simply; it's not that hard to do in practice. It comes with a couple of advantages, and one of the biggest, which we're going to discuss today, is that you can turn this into a multi-objective optimization problem quite easily, which is not really possible with the other solutions.

Since many of you might not know how they work, I will give you a super brief introduction to evolutionary algorithms, and then I will show you on a data set what the results look like if you don't do any feature selection, what you can achieve with forward selection and backward elimination and why there is a problem, and then the results you can achieve with evolutionary algorithms, and then we move to multi-objective from there. So how do they work? In general, as many things in machine learning claim to mimic some natural process, evolutionary algorithms do the same, with this whole survival of the fittest idea, having a population of individuals, and so on. They create offspring, and then there are maybe some mutations, so there are a lot of those ideas you probably remember from your biology classes. But again, it's a metaphor; it's not really the same thing. I'm not an expert on that part, but it's the same idea. So you start by creating a population of random individuals, and each individual is really this kind of binary bit vector we have seen before: a one basically means, well, I use this feature, and a zero means I don't use it. And then, if let's say you start with a population of 10 individuals, you start creating new ones. Let's not go into the biology aspect here, but yeah, technically you do crossover: some part of the DNA of this bit vector is taken from one parent and other parts are taken from the other parent, and the offspring share DNA from both parents, I guess.

Okay, good. So the easiest thing you could do here is to randomly select one cutting point, like after the fourth feature in this example, and then you create the offspring by taking both yellow parts and creating a new one, a new child, and if you want, you can create the second child while you're at it by also combining the two green parts. That's one way; there are many ways to do crossover, but let's not go into the details here. So now you have generated a couple of children, that's good, but in biology, what often happens is that those children also have some mutations. Here, what that really means is just randomly flipping single spots in this DNA, in this bit vector, okay? Now, let me explain to you why both steps are important. Crossover helps you if you have two feature bit vectors in different parts of this universe, different spots in the fitness landscape if I go back to this one. Crossover helps you to basically jump to completely different areas, which hopefully are promising again, in this fitness landscape. So that helps you to get out of those local minima, or sorry, local maxima, local extrema in general, and jump to other places in the fitness landscape. Mutations, on the other hand, if you already are in some good area, help you to slowly climb further up the hill, hoping that, if you're close to the global maximum, some of the mutations move you further towards the top. So, I mean, it's a bit of a rough explanation, but hopefully you get the idea behind both points. So after you did this, you now have the parents and a couple of children, some of them mutated, and often you use parameters like a mutation probability of one divided by the number of features, so that, on average, you can expect one mutation for every individual. Okay?
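A minimal sketch of these two operators on feature bit vectors might look like this (illustrative Python, not the exact RapidMiner internals); the default mutation rate of 1/n gives one flipped bit per child on average, as described:

```python
import random

def crossover(parent_a, parent_b):
    """One-point crossover: the children swap tails after a random cut point."""
    cut = random.randrange(1, len(parent_a))
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2

def mutate(individual, p=None):
    """Flip each bit independently with probability p (default 1/n)."""
    p = 1.0 / len(individual) if p is None else p
    return [1 - bit if random.random() < p else bit for bit in individual]

# Example with 10 features:
# a = [1,0,1,1,0,0,1,0,0,1]; b = [0,1,0,0,1,1,0,1,1,0]
# c1, c2 = crossover(a, b); c1 = mutate(c1)
```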

So then, after you have this new population, you can evaluate them. And you can do this exactly the same way we did for the brute force discussion we just had: take the data set, keep only the features which have a one, feed this data set into a cross-validation, and measure, for example, the accuracy. You do this for all the individuals in the population, which brings me to the next step, the selection. This really is about the whole survival of the fittest aspect. One of the most frequently used selection schemes is called tournament selection, and later on, you will see that this is in fact the only thing we need to change to turn this into a multi-objective optimization approach. So tournament selection: the idea here is you pick a specified number of individuals, let's say five, and then you just throw them all into one pit, let them fight against each other, and the winner takes it all. So let's, for example, say your original population is 40 individuals; then you just pick five random individuals over and over again, and for every sample you take, you take the winner and put it into the next generation's population. The others can still stay in the old one and can be picked again. Of course, if you make a big tournament with a lot of individuals, you increase the selection pressure, so only the fittest ones will survive. But if you make a smaller tournament, then by chance you might also pick a couple of weaker individuals, and then even the weaker individuals have a chance of survival. And that sometimes is a good idea, because a weaker individual might, after a crossover, actually jump into some very promising area of the fitness landscape as well. So don't set the tournament size equal to the population size, because that would mean only the fittest individual would survive, and that is typically not a good idea; you actually want, well, some form of evolution happening over time. Okay, anyway, so that's how this whole thing works. And after you did this for a specified number of generations, for example, you just deliver the best solution you found, and that is then the feature set which is the winner, okay? Awesome.
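Tournament selection itself fits in a few lines; again, this is a hedged sketch rather than the exact operator implementation. Note how a smaller tournament size lowers the selection pressure and lets weaker individuals survive:

```python
import random

def tournament_selection(population, fitnesses, tournament_size=5):
    """Fill the next generation by repeatedly drawing `tournament_size`
    random individuals and copying the fittest of each draw."""
    next_generation = []
    while len(next_generation) < len(population):
        contenders = random.sample(range(len(population)), tournament_size)
        winner = max(contenders, key=lambda i: fitnesses[i])
        next_generation.append(population[winner])
    return next_generation
```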

So then let's actually do all of this: no feature selection, forward selection, backward elimination, and evolutionary feature selection on a data set which I think is a pretty good one for feature selection. Some of you might know it already; it's the so-called Sonar data set. The idea of this data set is that you have a frequency spectrum for each data point, so a value for 60 different frequency bands, and then for each spectrum you have a label, a target, which tells you if this spectrum of an object (a sonar spectrum, that's why the data set is called Sonar) represents the spectrum of a rock or the spectrum of a mine. And of course, that's an important question; think about military boats or whoever wants to find those mines for whatever reason, so that is, of course, important for them. I mean, I'm a sailor, so I wouldn't care; as long as I see something which resembles a rock or a mine, I just wouldn't go there, but I guess it's an important application nonetheless. So let's actually go into RapidMiner, and I would like to show you how this whole thing works in practice. Many of you might know RapidMiner already, so I'm not going to explain a lot here. But for those of you who see it for the very first time: the whole idea of RapidMiner is to build analytical processes or workflows in this big white area here in the center, using data sets coming from those repositories here on the top left, and using operators here from the bottom left to transform the data, build models, validate the models, and do everything you need to do for machine learning and data science in general, or just data prep, and even ETL for that matter.

So, well, that's the basic idea, and you will probably get it just by looking at this. I mean, you can add your own custom-built code with R and Python and everything else, but typically, the major working mode in RapidMiner is actually not to code, it's to build those visual workflows, which is a much faster way of getting to some good results; it is also easier to share with others, to collaborate with others, and to maintain those models and workflows in the future. So I prepared one of those workflows for you here. What this workflow does is load the Sonar data set. I added this little red breakpoint here, which means I stop after the data set is loaded so we can inspect it quickly before we move on; then this data set is delivered to this cross-validation you can see here on the right. In all the experiments we do today, I will use a 5-fold cross-validation; typically you would do a 10-fold, which, of course, just takes double the amount of time, and since I don't want us to waste our time waiting for cross-validation results, I just go with a 5-fold cross-validation today. Another important thing for those validations is that I always make the data split in exactly the same way, because I want to show you the effects of the feature selection and the model building; I don't want to show you the effects different random splits of the data can have. So again, you can change all of this if you like, but I want to prove some points here, so that's why I go with these fixed data splits using what we call a local random seed, okay, just to make it, well, repeatable, but also to make it fair, so that all the different methods use exactly the same data splits.

Okay, and then inside of this cross-validation: one of the specialties of RapidMiner is that you can always dive into those operators, so I made a double click here on this one; whenever you see this icon, it means you can dive into it. Then you can see what happens in the training phase and in the application, the testing phase. So here, I just train a Naive Bayes model. The model type doesn't really matter because, as I said, this works for all model types, and Naive Bayes is a good baseline model, so let's get started with this one. I train the model here, apply it here on the test data which I get delivered from the parent operator, and calculate the performance, which really just means making a couple of smart choices here; in this case, one of the things we calculate is the accuracy. This modular approach has a lot of advantages. We could, for example, also put some preprocessing methods inside of the cross-validation; that's a very important thing to do, and not doing it is one of the major mistakes I see data scientists making over and over again, mostly because it's so hard to do with other solutions, and even if you program it yourself, it's a lot of work. If you're interested in how to correctly validate those models, I've actually written a couple of blog posts on this, and there is another webinar, I guess, on the topic. Did I do one? Well, definitely I did, yeah: one thing you could check out is the How to Ruin Your Business with Data Science webinar I did last year; it's also a fun one, so if you're interested in correct validation, check it out. It's really important to do this right. Anyway, so that's the whole idea of RapidMiner. So let's get started. I can execute the process here, and since I have a breakpoint, let's have a look at the data first. It's the 60 columns here, and every column, every attribute, every feature represents a certain frequency band of our frequency spectrum. And then we have the two classes, rock up here and mine down there. And obviously, I mean, these are all numbers; there's no way that I could see any patterns here just by looking at them. And even if I start visualizing this, for example with a standard scatter plot, like, "Well, yeah, how can I distinguish rocks from mines?", it's all overlapping, so there's not a lot of pattern I can see here.

If I use different visualizations, for example this parallel plot of the same data set, now each line represents a rock or a mine, and the mines are red. I see basically my 60 columns here at the bottom, and still, I don't really see any patterns; it's really hard to see. Maybe here on the left, around attribute 11, the red looks like it's a little bit higher than the blue, and the same here in the mid-40s, same story there, red looks a little bit higher than blue, but it's kind of chaotic. And different visualizations give me the idea that my first impression was probably right: around attributes 11 and 12, red is a little higher; in the mid-40s, red is higher; in the low-20s, red is higher; in the mid-30s, blue is higher. So I see certain areas, certain regions, which are probably more important than the other areas in the frequency spectrum, which don't really help me to distinguish between those two groups. All right, so we know certain frequency bands are more important than others. So let's first just build the model on all of those frequency bands to get a benchmark, basically. And if I do this with the 5-fold cross-validation now, I get 67.8%, so almost 68% accuracy. Let's remember this number, because that's going to be our benchmark: no feature selection at all, 68%. Good. All right, so we know that one. So now let's move on to forward selection, if I can find my mouse at least. So, similar setup: I retrieve the data set and use the forward selection operator, which is just one of the operators you will find here; if I type in forward selection, you will find it here, okay? And inside of the forward selection, you see this little icon again. Again, I need to somehow calculate the performance, and I do this exactly the same way, with our 5-fold cross-validation and the same local random seed I mentioned before, and it just looks the same. So I just reuse the building block I had before, feed it into this forward selection, and let's just run this one here.

So this one takes a little bit, but it's not as bad as doing a brute force approach; as we calculated, that would be 2 to the power of 60 different combinations, so there's no way that I could do brute force here. If I do forward selection, it definitely delivers better results: instead of the 68%, I get 77.4%, so that's much better. And most of the features actually get a zero here, so they have been deselected, and only four features have been selected. In fact, I can see this in the data here as well: all the features which have been selected are in this attribute 11, 12 region. But we didn't select any features from the low-20s, mid-30s, or mid-40s, even though those other areas in the frequency spectrum also seemed to be important. Here you really see forward selection at work: it just focused on this one region, didn't get any better by adding a fifth feature, and stopped there. And yeah, granted, it's already much better, almost 10% more accuracy, great, a much smaller feature set, so that's good, but is it good enough? Well, let's see. So let's move on to backward elimination. Now we turn it around and basically start with all the features, same setup as before inside the cross-validation, all the same, and we remove features until we no longer get any better. So I run this one, and it takes a couple of seconds, but then I see, "Oh, well, it's again a little bit better than using all the features, you remember the benchmark of 68%, but not as good as the forward selection, so I'm somewhere in the middle." And frankly, look at that: only, what, eight features have been removed. All the rest is still in there, so the data set is still huge. So it's not really good; I didn't get rid of a lot of features, and yeah, I certainly covered all four regions of the spectrum, but I also covered all the noise everywhere else, which is exactly the reason why it's just not good.

All right, so the next one is the evolutionary selection. So again, you would basically just look for the operator here, oh, sorry, that's the parameter one, Optimize Selection (Evolutionary). And it's the same setup with the cross-validation inside. The only thing I changed is that I also added some logging, so we can see how we get better over time while we do the optimization, but everything else here is the same. So let's run it, and let's go quickly to the results, because then you can see that we basically log all those results here. So, oh, look at that, it's already getting pretty good, almost 83%. So now we are done, and we, in fact, end up with over 80% accuracy. This is the data we've been logging, like the accuracy for each generation, and I can also plot it like this; let's just switch the color here. So you can see that when we started the optimization run, we were hardly any better than doing no feature selection at all, you remember the 68% benchmark, but then we quickly became much, much better, and then, kind of asymptotically, we get closer to the optimum here towards the right. So after like 30 generations, we end up with the best feature set. But now let's have a look at the features. This one here selected roughly, what, 10 or 11 features, and look at that: 11, 12, the low-20s, the mid-30s, and also the mid-40s. So in fact, it covers all the relevant areas of our data set. And if I look into the chart now, you already see how everything is distinguished in a much better way, and that's why this model is so much better. Okay, that's great. So evolutionary algorithms are awesome for getting much more accurate models. But that's still not the end of the story, because we didn't really focus on reducing complexity as well. And that is, in fact, what brings in the true conflict, because so far we have only been optimizing for the accuracy. That means we often tend to take more features, because then the model gets more information, but we don't really optimize for using fewer features, which would reduce model complexity.

And those two goals, again, are conflicting, at least they start to be conflicting at some point. And this, as I said in the beginning, is one of the major problems in machine learning. Here is a formula which every one of you should know; it's kind of one of the base formulas in statistical learning. It is the formula, or one of them, for how to calculate what we call the regularized risk, and regularization is something I mentioned at the beginning. The whole idea is that you take the empirical risk, which is just a fancy term for the training error in most cases, and you take the structural risk; for an SVM, for example, the wider the margin is to your data points, the smaller the structural risk, while here, in our case of feature selection, structural risk is again just a fancy term and simply means the number of features: the more features you use, the higher the structural risk. Then you turn this into a regularized risk by introducing some trade-off factor C. But how on earth? You don't even know how well your model is going to perform, nor do you know whether 5 out of those 60 features or 50 features is the better range. So how could you ever define a trade-off factor C before you know anything about your data or your models? In fact, you can't. That's why, first of all, if you do this with an SVM, you need to run a parameter optimization for C. And by the way, this was the one part of my PhD which also worked; it's just not worth the effort, because you can avoid it by doing this as an explicit multi-objective optimization as well. But for feature selection, since it's a multi-modal fitness landscape, you can actually do this: instead of defining a trade-off factor, you optimize for both things at the same time. And here is how.
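Written out in standard notation (a sketch matching the description above, not necessarily the exact formula on the slide), the regularized risk is the empirical risk plus a C-weighted structural term:

```latex
R_{\mathrm{reg}}(f) \;=\; R_{\mathrm{emp}}(f) \;+\; C \cdot \Omega(f)
```

Here R_emp(f) is the training error, Omega(f) is the structural risk (the inverse margin width for an SVM, the number of selected features in the feature-selection case), and C is the trade-off factor you would have to guess up front.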

And that is actually something an Italian economist, Vilfredo Pareto, was very well known for. Most people probably know him for the very well-known 80/20 rule, okay? But he came up with some other concepts, and one of them is called a Pareto front, and here we have such a Pareto front. So you see, for example, the number of features on the x-axis here and the accuracy of your models on the y-axis; now you can take every feature set you're evaluating, for example as part of the population of the evolutionary algorithm, and plot it in this two-dimensional space of your two conflicting criteria. So this point here on the left means, for example, I only use one feature and I get a 60% accuracy. The next one means I use two features and get roughly 75%, then three features with 80%, and so on. All those data points are kind of equally good, because you can't really say the one with three features and 80% is better than the one with two features and 75%. Yeah, it's more accurate, but it also uses more features, and those are two conflicting goals. There is not really any way to say one is better than the other. But what we do know is that we want to move this whole front towards the bottom, or sorry, the top left corner. We want to use as few features as possible and get as high an accuracy as possible, okay? So the whole goal is, while we are optimizing, to move the whole Pareto front towards the top left in this chart. If you use different criteria, of course, you might move in a different direction, but the point is optimizing both criteria at the same time; I hope that's clear. So how can you do this? This is the one change you need to make in the evolutionary algorithm: you change the selection scheme into something we call non-dominated sorting.

And the basic idea is really simple. If you, for example, look at another data point like this blue one here: is this a good data point? I mean, really think about it. It's three features with like 53% accuracy. Well, no, it's not really good, because with three features I already have another feature set which reaches 80% accuracy, and I can even reach 60% using fewer features, like the one with only one feature on the left. So that is really a bad point; let's get rid of it. And the one here on the right, it's not as clear, but it's the same story: yeah, it has an 85% accuracy, but you can reach 85% already with one feature less, so why not go with the one with fewer features and 85% accuracy instead. And this one: yeah, it's only one feature, that's nice, but actually I have a better attribute set with only one feature, so again, those three points are no good. And that is actually true for all of the blue points you see on the screen right now. There's actually a term for that: we say that the orange points are dominating the blue points, okay, or the blue points are dominated. And this whole idea of non-domination, or non-dominated sorting, is: well, maybe we should first focus on the points which dominate all the others and put them into the next generation's population, and then we find the next rank of dominating data points and put them into the next generation, and so on, until our desired population size is reached. So those five orange points are the first rank, so we put them into the next generation, and then we find the next set of dominating data points, those five, for example. Let's say our target population size is 12, so we add those five points as well; now we have only two spots left, and those go to points from the next rank. And that is the whole idea of this new selection scheme, okay?
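In code, the dominance check and the rank-by-rank grouping are short; here is a minimal sketch for the two criteria used here (maximize accuracy, minimize the number of features), illustrative rather than the exact operator:

```python
def dominates(a, b):
    """a and b are (accuracy, n_features) pairs. a dominates b if it is at
    least as good on both criteria and strictly better on at least one."""
    acc_a, feat_a = a
    acc_b, feat_b = b
    return (acc_a >= acc_b and feat_a <= feat_b) and (acc_a > acc_b or feat_a < feat_b)

def non_dominated_sort(points):
    """Group points into ranks: rank 0 is dominated by nobody, and so on."""
    remaining = list(points)
    ranks = []
    while remaining:
        front = [p for p in remaining
                 if not any(dominates(q, p) for q in remaining if q is not p)]
        ranks.append(front)
        remaining = [p for p in remaining if p not in front]
    return ranks

# The selection then fills the next generation rank by rank until the
# target population size is reached, exactly as described above.
```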

So that's the only change we make in our feature selection, and the result will be this: the front will naturally move towards the top left. All right, let's do it then; let's go back into RapidMiner and have a look at how this whole thing looks. In order to turn this single-objective optimization into a multi-objective optimization, all you need to do is make three minor changes. The first change is in the selection scheme of the operator itself: somewhere down here, you have this selection scheme, which typically is tournament selection, the one I explained first, and you need to change it to non-dominated sorting. That's the first change. Then I go inside here, inside of the cross-validation; sometimes you don't need to make this change, but you should make sure that your performance operator is only delivering exactly one performance criterion, because otherwise the multi-objective optimization might get confused. If you, for example, set two here, let's say accuracy and the relative error, should it now do the multi-objective optimization for accuracy and the error, or for accuracy and the number of features? So make sure you only pick one here. And then you add the second criterion, which is the third little change: just another extra operator here, which is called performance attribute count. So you add this one here, and that means, if I make a breakpoint here again and just run it until there, you will end up with a performance vector which has the accuracy first and then the number of attributes, in this case six attributes, okay? So those are the changes you have to make. And then I did a little bit more: I also show this population here, so we can see how the Pareto front is moving over time. So here we have it: as you see, while the optimization is running, we move this front towards the top right, and that's just because we optimize the negative number of attributes; if you want to minimize the attributes, you can also maximize the negative number of attributes, it's the same thing, okay?

So we want to move to the top right, and we did, after hundreds of generations. This is the Pareto front, and I see that if I only have one attribute, I can get like 73% accuracy, and this goes up to 82, 83 percent if you use nine attributes. But there's no point in using more than nine attributes, because we won't get any better. And I can also, for example, have a look into the data. And this is, since the runtime is pretty much the same as doing the single-objective run, the big advantage and the reason why I'm saying you're not just getting better models, you also get more insights. Because if you only go with one feature, go with attribute 12; but if you can allow yourself two features, you see that attribute 12 is no longer in the best set, it actually changes over to attributes 11 and 36. And that is exactly the reason why forward selection fails in many cases: it's the nature of this multi-modal fitness landscape that a bigger feature set is not always just a superset of the previous one, okay? And you also learn something from this; you now have an understanding like, "Okay, well, between smaller feature set sizes and bigger feature set sizes there's a shift, and I can see which attributes contribute at the bigger sizes." We actually have a second shift somewhere down here, but in the interest of time, I'm not going there. And at the end, you can pick whatever you like; if you want to go for pure accuracy, you surely can do that. But again, you get a diverse optimization run, you get a lot of different results, you get very accurate models, you learn more about your data, and you also get to ask: is it really worth adding three or four more features and making your modeling, let's say, twice as slow by doubling the number of features, if you only gain 0.5% accuracy? Often the answer will be no. So that's why this is such an important thing.

All right. So in the remaining couple of minutes, I would like to take you on this little journey of unsupervised feature selection, because I personally think it is a very, very exciting side effect. And instead of discussing all the theory now, I will just show it to you right away and show you what the problem actually is. So I generated a data set, let's have a look at this data quickly, which is random data; well, not exactly random, to be honest, somewhat random. It's only two columns, attribute one and attribute two, and if I plot those two columns, you clearly see, I hope, that we have four different clusters. So any clustering scheme, let's say we use k-means and say I would like to find four clusters, should come up with those four clusters. That's exactly what this process is doing. Of course, since it's a distance-based method, we normalize the data first, and then I say, "Yeah, I would like to find four clusters, let's run this whole thing," and I see, yes, here are the results. If I visualize the clusters, okay, the colors change a little bit, but that's not the point; the point is, yes, we found four clusters. Brilliant. Now, let's add some noise.

Let me quickly show you the data again. After adding some noise, I still have the same two columns as before, but I added five totally random columns. Again, if I plot the original two columns, I have my four clusters. But if I swap one of the plotted columns for a random one, things look a little bit ugly already, and if I go with two random columns, there is simply no pattern. So of course, our hope would be that if we cluster this data, the clustering scheme itself will figure out what the important attributes are. But in fact, it doesn't. If I look at the data after clustering, showing the clusters again in the two original attributes, you can see, yeah, red and blue all overlapping, green and yellow all overlapping, and things get even worse if I go into the random columns here: yeah, it found some "clusters", but they're just not good. So without feature selection, there's no way that we can find them.

Well, let's just add feature selection then; what's the worst that could happen? Okay, let's jump straight to the evolutionary feature selection, and let's do a single-objective run first and try to find the best feature set. So, I would ask you now: what do you expect? Most people would say, "Well, I want to see my two features, attribute one and attribute two, and all the noise features gone." But that's actually not what is going to happen. When we see the results in a second, while I optimize for something called the Davies-Bouldin index, which measures the average distances of the data points within the clusters, I in fact end up with only one column, attribute two, and all the rest gets deselected. Well, why? I mean, why is it omitting the other one? Frankly, that's kind of hard to visualize, but because all the data is condensed down to only one dimension, the points are even more dense than if you go with two dimensions. It's like you have a sponge and you just press it: the density of the sponge will increase. So, for the clustering criterion, it looks better to go with as few features as possible.
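You can reproduce this effect outside RapidMiner with a few lines of scikit-learn; this is a hedged sketch under the assumption of k-means plus the Davies-Bouldin index, with illustrative column indices. A single dense column typically scores a lower, "better-looking" index than the two truly informative columns, which is exactly why the single-objective search collapses to one feature.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import davies_bouldin_score

def db_index_for_subset(X, columns, k=4):
    """Davies-Bouldin index of a k-means clustering on the given columns
    (lower means denser, 'better-looking' clusters)."""
    Xs = StandardScaler().fit_transform(X[:, columns])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Xs)
    return davies_bouldin_score(Xs, labels)

# Comparing db_index_for_subset(X, [1]) with db_index_for_subset(X, [0, 1])
# on data like the four-cluster example tends to favor the single column,
# even though the clusters are only meaningful in two dimensions.
```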

And that actually is a problem. Okay, so that means whenever you do feature selection for clustering like this, it will always use the smallest number of features, which is one, and it will always pick the one feature which provides the highest level of density. Well, okay, so single-objective optimization apparently doesn't work. Let's go with multi-objective optimization. So, same setup, non-dominated sorting, let's run this one here. Okay, now I have a Pareto front here, let's have a look. Oh, wait, what happened here? Why do I only have one data point? It's the one-feature problem again; shoot, my whole Pareto front has collapsed down to one point, and again it only has one feature. And the reason is that those two goals are not conflicting: minimizing the number of features and maximizing the density of your clusters are not conflicting. That's why it's not going to work. So what you have to do is one small trick, and that's the only thing I changed here: I no longer minimize the number of features, I maximize the number of features. And I let this whole thing run again.

And while it is running now, we do see a Pareto front. We see that we get results ranging from using only one or two features here on the right (now I'm actually maximizing the features, that's why it's not negative any longer) up to using all the features here on the left, or almost all, like six out of the seven, so I get the full Pareto front. And if I look at this now, you see: well, sure, if I only use one feature, then I actually should go with attribute two, but then I have my desired solution here, the two real attributes, which you wouldn't find otherwise, and then you observe this huge jump in the Davies-Bouldin index between those two solutions and the next one, which adds a random column. That's a clear indicator that something is going on, so you should have a look at the results around this huge jump you see here in the Pareto front: up to there, the features you found are right, but then you get this huge jump when you start adding the noise features. So you would now inspect the different feature sets and pick the right result based on that. And that is the true power of multi-objective optimization: you are actually even able to solve unsupervised feature selection as well. But keep in mind, in this case, you're not minimizing but maximizing the number of features, because otherwise the goals are not conflicting.

So that concludes the whole topic of how important multi-objective optimization is, how much you can learn from it, and how much you can actually improve your models with it. And I also hope, as a side effect, you really got an idea of how great it actually is to do all this real data science work in a fast, simple way. I mean, of course, I love coding, I actually still code quite often, and I find that's part of many data scientists' work. But it doesn't have to be. You can also go with the visual approach, have some fun while doing it, and get some good results, which is why so many people love RapidMiner. I would like to get to some questions here, so I won't spend much more time on this, but if you're not RapidMiner users already, please check it out. It's really a very powerful platform for solving your real data science tasks. So, a couple of key takeaways: multi-objective optimization, in my opinion, really should become a standard tool, especially for feature selection, both supervised and even unsupervised. And you will not only get better models, you will also understand more about the relationships of the features in your data, which I think is maybe even more important. And then, of course, please use RapidMiner to do all of this.

Before we get to the last questions, I wrote a couple of blog posts on this topic. They’re all online, you will find them on our blog, so check it out at RapidMiner.com/blog. It’s a series of four blog posts, and you can also download the data and the processes and run those processes yourself in RapidMiner if you like. And that’s it. I think we have a couple of minutes for questions.

Yes, so thanks, Ingo. That was a great presentation. As a reminder to those on the line – I have been getting a couple of questions about this – we will be sending the recording of today’s presentation within the next few business days to everyone who registered. So, like Ingo said, now it’s time to go ahead and answer your questions. I’ve seen a lot of questions come through. If you have questions, please go ahead and input those in the question panel on the right-hand side of your screen. I’ll take a look through those now. First question here is: what are some good parameter values for the evolutionary algorithm?

Yeah, oh, that’s an excellent one. I would say, well, it doesn’t matter that much to be honest, but a good best practice for the population size is to go with something like 10% of your feature set size, up to around 30%. So, for example, for the Sonar data set, where we had about 60 features, I often go with something between 10 and 20 individuals. That way you can sample a lot of different areas of your fitness landscape. For the number of generations, it’s really more or less: as long as you have time. I typically stop at a small number, sometimes even only 5 or 10, to see how far I can get with only a few generations. If that runs quickly and I have some time to spend, well, let’s double it and run it longer. If I still have time left after that, well, we’ll double it again. So: 10 to 30 percent of the feature set size for the population size, then start with 10 generations or something in that range, and then, depending on how long your modeling takes, increase from there until the improvements get so small that it’s probably not worth running it much longer. That’s the rule of thumb there.
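As a rough sketch of that rule of thumb in code (the percentages and the doubling schedule are just the heuristics mentioned above, not hard constants):

```python
# A tiny sketch of the rule of thumb: population size ~10-30% of the number of
# features, generations starting small and doubling while time allows.
def evolutionary_defaults(n_features, budget_rounds=3):
    pop_low = max(2, round(0.10 * n_features))          # ~10% of the feature set size
    pop_high = max(pop_low, round(0.30 * n_features))   # ...up to ~30%
    start_generations = 10                              # start small, then double if time allows
    schedule = [start_generations * 2 ** i for i in range(budget_rounds)]
    return pop_low, pop_high, schedule

# Sonar-like case with 60 features: roughly 6-18 individuals, run for 10, then 20, then 40 generations.
print(evolutionary_defaults(60))
```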

Okay. Next question here: this person is asking, once the evolutionary algorithm feature selection is finished, how do you operationalize it?

Oh, good question. Well, you saw that you get this attribute weights object as a result. What you can do then – let’s assume you have some data prep processes here in RapidMiner – is run your application set, the data you actually want to score with the model, through the same kind of data preprocessing you did for the training set. There are things called preprocessing models in RapidMiner to make sure you really do exactly the same transformations on your application set as well. The next step is to apply those weights, and there’s an operator for that called Select by Weights. You take those attribute weights and the data set that comes out of the preprocessing, and it will select the same feature subset. Afterwards, you take the model – which I didn’t deliver here, but you can deliver it, or even retrain it on the same feature set – and apply it. And there are a lot of other helper operators in RapidMiner for what you can do with this model as well. I’ll leave it at that, but there are plenty of operators and functions in RapidMiner to put the model into production and do the right things in an efficient way.
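For readers who want a code analogue of that deployment flow, here is a hedged sketch using scikit-learn rather than RapidMiner’s operators; the data, the scaler, and the selected column indices are all invented for illustration:

```python
# A rough code analogue of the deployment flow described above (not RapidMiner's
# API): reuse the fitted preprocessing, reuse the selected feature subset, then
# apply the trained model to new "application" data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 7))
y_train = (X_train[:, 0] + X_train[:, 3] > 0).astype(int)
X_new = rng.normal(size=(50, 7))                   # the "application set" to score

selected = np.array([0, 3, 5])                     # feature subset found by the optimization

scaler = StandardScaler().fit(X_train)             # fit preprocessing on training data only
model = RandomForestClassifier(random_state=0).fit(
    scaler.transform(X_train)[:, selected], y_train
)

# Deployment: identical preprocessing, identical feature selection, then apply the model.
predictions = model.predict(scaler.transform(X_new)[:, selected])
print(predictions[:10])
```

The key point mirrors the answer above: the preprocessing is fitted once on the training data and then reused unchanged, and the same feature subset is applied before scoring.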

Great. Thanks for answering that. So this person is asking: is it possible to get a copy of the RapidMiner processes as well as the videos? I did mention we’re going to be sending the recordings, but I believe the processes are also included in the blog posts as well.

Absolutely. So you can check out this URL here; you will find a couple of posts on multi-objective optimization, four in total, and at the end of each of those blog posts you will find a link to a zip file which comes with all the processes, so you can import them into RapidMiner and run them there.

Great. Next question here is, does the exact implementation of the evolutionary algorithm matter?

No, not at all. I mean, I’ve built at least 10 different implementations with different types of crossover, and it really doesn’t matter that much. If you want to do hardcore optimization, of course you can try to improve on that front, but feature selection is a relatively simple optimization task compared to many other, much more complex tasks, and it’s just not worth the effort to tune those implementation details all that much. So no.

Great one. Another question here: this person says that they normally deal with categorical data. Is this approach good for dealing with that type of data as well?

Yeah. It doesn’t matter at all if the data is categorical or not. Of course, it depends a bit on the model you’re using. Let’s say your model doesn’t work with categorical data directly, so you need some preprocessing first, like dummy encoding and these kinds of things. Then the big question is: do you do the encoding before you feed the data into the feature selection, or do you do it inside the feature selection? I tend to do it inside the feature selection, but both are possible, and the approach definitely works in both cases.
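As an illustration of “encoding inside the feature selection”, here is a small sketch where each candidate subset refers to the original categorical columns and the dummy encoding happens inside the evaluation of that candidate; the data frame and column names are invented for the example:

```python
# A minimal sketch of "encoding inside the feature selection": the candidate
# subset refers to the original (categorical) columns, and one-hot encoding
# happens inside the evaluation of each candidate.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue", "green"] * 20,
    "size":  ["S", "M", "L", "L", "M", "S"] * 20,
    "shape": ["round", "square"] * 60,
})
y = (df["color"] == "red").astype(int)    # toy target for illustration

def evaluate(subset):
    # One-hot encode only the columns in this candidate subset, then cross-validate.
    encoder = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), list(subset))]
    )
    pipeline = make_pipeline(encoder, LogisticRegression(max_iter=1000))
    return cross_val_score(pipeline, df[list(subset)], y, cv=5).mean()

# The optimizer would call evaluate() for every candidate subset it generates.
print(evaluate(("color",)))
print(evaluate(("color", "size")))
```

Doing the encoding outside the selection would also work; the selection then just operates on the already-encoded dummy columns instead of the original features.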

Great. Another question here: this person is asking, do the different models misclassify the same items, and can we get insights from analyzing the misclassified sets?

Yeah, I guess you can, although that’s not directly connected to the actual feature selection problem. Actually, sorry, I take that back, it is somewhat connected: you could run multiple different optimization methods, get different feature bit vectors, train the different models, and basically see how they behave. One of the things – I’m not going to spend a lot of time on this, just super quickly – that is part of the upcoming RapidMiner 8.1 release, which is coming out in a couple of weeks and might be interesting here, is this new operator called Explain Predictions. After you’ve done all the training and feature selection and everything else, you can apply the model to a new data set, or even the same one if you want to, and then you get more insight into how those predictions are actually being created by the model. So for example, here is a data set where this row gets a “yes” prediction despite the fact that this is a man, and for this data set men have a smaller likelihood of survival; the prediction is mainly supported – that’s why it’s shown in green – by the passenger class. So you get these kinds of insights, and you can compare different models and see whether their predictions actually make sense. There’s something else in there called the Model Simulator, which I’m not showing now because it would take some time, but it lets you play around and see how the model behaves. That’s definitely worth checking out if you’re interested in this kind of thing. The beta release is open, so you can download the beta version at this point.
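On the question itself, here is a small sketch of how one might compare the misclassified sets of two models; the synthetic data and the two scikit-learn models are stand-ins for whatever models come out of the different feature selections:

```python
# A small sketch: train two different models on the same data and check whether
# they misclassify the same rows of the test set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000).fit(X_train, y_train),
    "boosting": GradientBoostingClassifier(random_state=0).fit(X_train, y_train),
}
errors = {name: set(np.flatnonzero(m.predict(X_test) != y_test))
          for name, m in models.items()}

shared = errors["logistic"] & errors["boosting"]
print("misclassified by both:", len(shared))
print("only logistic:", len(errors["logistic"] - shared))
print("only boosting:", len(errors["boosting"] - shared))
# Rows that every model gets wrong are good candidates for a closer look:
# they often point to label noise or missing features.
```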

Great. Thanks. So it looks like we’re just about out of time here, and I don’t want to take up anyone else’s time. If we weren’t able to address your question here on the line, we will make sure to follow up with you via email within the next few business days. So thanks again, Ingo, for a great presentation, and thanks again, everyone, for joining us.

Absolutely. Thanks. Bye.