## Feature Selection for Unsupervised Learning

In part 1 of this blog series, we established that feature selection is a computationally hard problem. We then saw that evolutionary algorithms can tackle this problem in part 2. Finally, we discussed and that multi-objective optimization delivers additional insights into your data and machine learning model.

There’s one thing we haven’t discussed yet which is multi-objective feature selection. It can also be used for unsupervised learning. This means that you can now also identify the best feature spaces in which to find your clusters. Let’s discuss the problem in more detail and see how we can now solve it.

## Unsupervised Feature Selection and the Density Problem

We will focus on clustering problems for this post. Everything below is valid for most other unsupervised learning techniques as well.

Ok, let’s discuss k-means clustering for now. The idea of the algorithm is to identify centroids for a given number of clusters. Those centroids are the average of all features of the data points of their respective cluster. We then assign all data points to those centroids which – after some points have been re-assigned – will be recalculated. This procedure stops after either a maximal number of iterations or after the clusters stay stable and no point has been re-assigned.

So how can we measure how well the algorithm segmented our data? There is a common technique for this, namely the Davis-Bouldin index (or short: DB index). It can be calculated using the following formula:

with *n* as the number of clusters, *c _{i} *as centroid of cluster

*i*, σ

*as the average distance of all points of cluster*

_{i}*i*to their centroid, and

*d(c*as the distance between the centroids of clusters

_{i},c_{j})*i*and

*j*.

As we can see, the result of a clustering is better if points within a cluster are close to each other. That of course is exactly what we wanted, but this also means that the DB index prefers clustering results where the clusters have a higher density.

But this preference for higher densities in the clusters is posing a problem for feature selection. If we do feature selection, we reduce the number of features. Less features also means higher densities in the remaining dimensions. This makes intuitive sense, because we map the data points from the higher-dimensional space into a smaller number of dimensions, bringing the points closer to each other.

Traditional feature selection cannot be used for clustering. Imagine that we have one feature as part of our data that is pure noise, but completely random. Say we flip a coin and use 0 for head and 1 for tails. We want to use k-means clustering with k=2 to find two clusters in our data and then decide to use forward selection to find the best feature set for this task. It first tries to find the best clusters using only one feature. Of course, our random feature described above will win this race: it will produce two clusters with infinite density! One cluster will have all the points at 0 and the other one all the points at 1, but keep in mind that this was a complete random feature. As a result, clustering only using this dimension would be completely meaningless for the original problem and data space.

Even in cases where we don’t have a pseudo-categorical random feature, we will always pick only one feature. Namely the one which delivers the k densest clusters. Adding more features would only reduce the density again so we will not go there.

## The Solution: Multi-objective Feature Selection

Can multi-objective optimization help with this problem? Let’s do some experiments and see how this works. The data set below has four clusters in two dimensions: *att1* and *att2*. There are also four additional random features called *random1*, *random2 etc*. The random features use a gaussian distribution of values around 0. Please note: the data is normalized so all data columns have mean 0 and standard deviation 1. This process is called standardization.

Links to each of these processes can be found at the bottom of this post if you want to follow along. Here is how the data set looks if you only look at the dimensions *att1* and *att2*: