Feature Engineering

Whether a model succeeds or fails isn’t always about the type of model you choose. Sometimes, it’s not even about the raw data used to create the model. Often, it’s about how you transform the data before you input it into your ML model. Which variables are the most useful in the predictive model? Which will have little to no impact on the outcome?

This process of extracting, transforming, selecting, and creating variables from raw data to ensure you get the best results from the algorithms is known as feature engineering.

In this post, we’ll break down what feature engineering is, why it’s important, and how to use it in the business world.

What Is Feature Engineering?

Feature engineering refers to the pre-processing steps that convert raw data into optimal features for machine learning models. It can produce features for both supervised and unsupervised learning to speed up and simplify data transformation while enhancing the accuracy of the model.

During feature engineering, data scientists can create new features, transform existing ones, and rank the order of variable importance. Creating new (and prioritizing existing) features helps you understand your data better and use it to create the best models possible.

Why Is Feature Engineering Important?

On average, data scientists spend about 80 percent of their time on data preparation, cleaning, formatting, etc. This tedious work is essential as it sets data scientists (and the models they create) up for success.

However, raw data often contains missing values, outliers, and incorrect inputs. If data scientists and analysts use this unprocessed data, they’ll likely end up with a less efficient model. Feature engineering helps them extract useful features from raw data using math, statistics, and domain knowledge, resulting in models that are more accurate and easier to understand.

Feature engineering is an essential piece of a good data prep process. By ensuring you’re highlighting the most important features, you create more flexible and easy-to-interpret models that require less maintenance in the long term.

Feature Engineering Approaches

In machine learning, feature engineering consists of various processes that include:

Feature generation

Feature generation involves identifying the variables that will be most helpful for predictive modeling, creating them from the original data, and feeding them into the model. Feature generation can also mean eliminating features that serve no purpose for the model.

An example of feature generation would be adding a feature for price per square foot of an apartment, given that you had the price of the apartments on the market and their square footage.
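As a rough sketch, here is how that derived feature might be computed with pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical listings data: price and square footage per apartment.
apartments = pd.DataFrame({
    "price": [350_000, 525_000, 410_000],
    "sqft": [700, 1_050, 950],
})

# Generate a new feature from two existing ones.
apartments["price_per_sqft"] = apartments["price"] / apartments["sqft"]
print(apartments)
```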

Feature selection

Feature selection involves analyzing and ranking features to determine which irrelevant or redundant ones should be excluded and which should be prioritized. You can also compare against a benchmark model to see whether your new model outperforms or underperforms a recognized baseline, giving you a better idea of whether you need to perform more feature selection. Because feature selection reduces the number of features, it leads to faster training times and more robust models that perform better on new data points, unlike models trained on noisy features, which are prone to overfitting.

Say you have a dataset with a list of products, their prices, total past monthly sales, and where each product was made. If you’re trying to determine someone’s likelihood to buy a given product, you’d likely value past sales and prices over the location where it was made.
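A hedged sketch of automated feature selection using scikit-learn’s SelectKBest; the feature names, values, and target here are invented purely for illustration:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical product data: price, past sales, and an encoded region.
X = pd.DataFrame({
    "price": [9.99, 24.99, 4.99, 14.99],
    "past_monthly_sales": [120, 45, 300, 80],
    "made_in_region": [0, 1, 1, 0],  # already encoded as a number
})
y = [1, 0, 1, 0]  # 1 = purchased, 0 = not purchased

# Keep the two features with the strongest statistical link to the target.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(X.columns[selector.get_support()].tolist())
```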

Feature extraction

In feature extraction, analysts essentially shrink the dataset to make it more concise, which makes model creation quicker and more straightforward. Reducing the number of features makes the data more manageable and lowers the risk of overfitting.

Some popular feature extraction techniques are Bag-of-Words and TF-IDF, both of which are used in text analytics to convert raw text into numerical values so the computer can understand and interpret them.
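For instance, here’s a brief sketch of TF-IDF feature extraction with scikit-learn’s TfidfVectorizer; the sample documents are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "great product, fast shipping",
    "product arrived late",
    "fast delivery, great price",
]

# Convert raw text into a documents x vocabulary matrix of TF-IDF weights.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
```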

Transformations

Not all features are created equal, and when the model doesn’t come out quite right, transformations step in. Transforming features helps improve accuracy, makes the model more flexible, and helps avoid computational errors.

For example, if a model isn’t performing to expectations, and you realize it’s because some features are in centimeters and others are in inches, you can transform those variables to be more uniform.
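A minimal sketch of that kind of unit-harmonizing transformation; the column names and the “unit” flag are assumptions:

```python
import pandas as pd

# Hypothetical measurements recorded in mixed units.
df = pd.DataFrame({
    "length": [180.0, 71.0, 165.0],
    "unit": ["cm", "in", "cm"],
})

# Convert every measurement to centimeters so the feature is uniform.
df["length_cm"] = df.apply(
    lambda row: row["length"] * 2.54 if row["unit"] == "in" else row["length"],
    axis=1,
)
print(df)
```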

Examples of Feature Engineering

Now that we have a clear understanding of feature engineering and how it works, let’s look at some examples:

Continuous data

Continuous data refers to any value within a given range. For example, it can be the temperature on a given day, the price of a product, or the age of a person. In this case, feature generation relies on domain knowledge: new attributes are derived from existing ones and then combined.

If you have continuous data in the form of the weather forecast for the next 10 days, you can use feature generation to determine the average daily temperature.
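A small sketch of that calculation with pandas; the 10-day forecast values are invented:

```python
import pandas as pd

# Hypothetical 10-day forecast with daily highs and lows in Celsius.
forecast = pd.DataFrame({
    "high_c": [21, 23, 19, 18, 22, 25, 24, 20, 17, 19],
    "low_c":  [12, 14, 11, 10, 13, 16, 15, 12,  9, 10],
})

# Generate a new feature from the daily high and low.
forecast["avg_temp_c"] = (forecast["high_c"] + forecast["low_c"]) / 2
print(forecast["avg_temp_c"].mean())  # mean over the 10-day window
```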

Categorical features

As opposed to continuous data, categorical data can only come from a limited set of predetermined values, rather than a range. For instance, if you have a categorical feature like “Color of Item,” it would be limited to “blue,” “green,” or “red.”

One-hot encoding is often used for categorical features. Rather than mapping each text option to a single number (“blue” = 0, “green” = 1, “red” = 2, an approach known as label encoding that implies an ordering which doesn’t exist), one-hot encoding creates a separate binary column for each category: an item gets a 1 in the column matching its color and a 0 in the others. These new features can help create a more efficient, easier-to-visualize model.
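Here’s a minimal sketch of one-hot encoding with pandas; the “color” column stands in for the “Color of Item” feature above:

```python
import pandas as pd

items = pd.DataFrame({"color": ["blue", "green", "red", "blue"]})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(items, columns=["color"])
print(encoded)
```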

Date and time

Date and time values can vary by format and can be difficult for a model to use in their native form. Analysts can reframe the date and time to help models uncover potential relationships between date and time and the rest of the dataset.

For instance, a date can be written as DD.MM.YYYY or MM.DD.YYYY; it can also be written out in words, contain only the day and month, and so on. You might start by creating separate columns (aka features) to represent the day, month, and year. Beyond that, you might create a feature indicating whether a given day was a Monday or a Friday.

This makes it easier for algorithms to recognize and interpret links within the data.
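A short sketch of those date-derived features using pandas; the timestamp strings are placeholders:

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2023-01-06", "2023-01-09", "2023-02-14"]})
df["order_date"] = pd.to_datetime(df["order_date"])

# Split one hard-to-use column into several model-friendly features.
df["day"] = df["order_date"].dt.day
df["month"] = df["order_date"].dt.month
df["year"] = df["order_date"].dt.year
df["is_friday"] = df["order_date"].dt.dayofweek == 4  # Monday = 0
print(df)
```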

Level Up Your Feature Engineering Efforts

There’s no denying it: feature engineering is a complex process. However, when done right, it can have a significant impact on both the speed and accuracy of your machine learning models.

Automatic feature engineering with RapidMiner helps extract, select, and generate meaningful features for you so that you can generate actionable insights faster. With the push of a button, you can run feature engineering processes of various complexity and compare which features (if any) yield the best results for your model.

Want to know how feature engineering can increase the performance of your models? Request a demo with our team of experts to get started!