Automating feature engineering with Featuretools

This article explains what Feature Engineering is, why automating this step is worthwhile, and how several libraries that address the subject compare.

Introduction

In predictive modeling, raw data must be cleaned and transformed before an algorithm can use it correctly. This is where Feature Engineering, an often overlooked but crucial step in any machine learning project, comes in. Feature Engineering consists of creating new features from raw data, which improves how information is represented and increases model performance.

In this context, Featuretools is a framework that automates Feature Engineering. It excels at transforming temporal and relational data into feature matrices for Machine Learning.

Feature Engineering

Feature Engineering is the process of selecting, manipulating, and transforming raw data into features that can be used by a predictive model. It relies on exploiting existing data and applying domain knowledge to create relevant new variables that did not exist initially.

Feature Engineering is a fairly broad subject that includes several stages, in particular:

Feature Extraction, which takes raw data as input and transforms it into a set of features that can be fed to a model, for example computing a moving average on time data
Feature Selection, which determines which features are most relevant to keep for training the model
Feature Construction, which creates new variables from existing ones using mathematical functions or other transformations, for example taking the ratio of two variables
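As a rough illustration of the three stages above, the sketch below uses pandas on a small invented table. The column names (sales, visits, noise, target), the window of 3 and the correlation threshold of 0.9 are assumptions chosen for the example. They do not come from any of the libraries discussed in this article.

```python
import pandas as pd

# Toy daily data; values and column names are invented for illustration.
df = pd.DataFrame({
    "sales":  [10.0, 12.0, 14.0, 16.0, 18.0, 20.0],
    "visits": [100, 110, 140, 150, 170, 200],
    "noise":  [3, 1, 4, 1, 5, 9],
    "target": [1.0, 1.2, 1.4, 1.6, 1.8, 2.0],
})

# Feature Extraction: 3-row moving average of sales
# (the first two rows are NaN because the window is incomplete).
df["sales_ma3"] = df["sales"].rolling(window=3).mean()

# Feature Construction: ratio of two existing variables.
df["sales_per_visit"] = df["sales"] / df["visits"]

# Feature Selection (deliberately naive): keep the candidates whose
# absolute correlation with the target exceeds an arbitrary threshold.
candidates = ["sales_ma3", "sales_per_visit", "noise"]
correlations = {name: abs(df[name].corr(df["target"])) for name in candidates}
selected = [name for name in candidates if correlations[name] > 0.9]

print(selected)
```

A correlation filter is only one of many possible selection strategies; it is used here because it fits in a few lines.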

Feature Engineering is a very important step in a Machine Learning project. Careful preparation and selection of features improve the results of the model. Simple, appropriate features also make it possible to use models that are less complex, faster to run, and easier to understand and maintain.

Automating Feature Engineering

Automating Feature Engineering has several advantages. First, it saves time: cleaning and preparing data represent 60% of a Data Scientist's work, so automating the creation of features reduces the time spent on this task. It also helps avoid manual errors, and the code can be reused across several problems, whereas a manual approach must be redone for each data set.

Figure: Breakdown of a Data Scientist's time by activity

Why use Featuretools

Featuretools is an open-source Python library for automated Feature Engineering. It provides functions that can be stacked to create features, using a method called Deep Feature Synthesis. The library works well with the other tools in the pipeline, including pandas DataFrames and Scikit-Learn models.

How Featuretools works

Concepts

Featuretools is based on a few key concepts:

Entities and Entity Sets: tables, and a data structure that keeps track of them
Relationships: how tables are connected to each other
Feature primitives: aggregations and transformations that are stacked to build features
Deep Feature Synthesis: the method that uses feature primitives to generate thousands of new features
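To make the first two concepts more concrete, here is a minimal plain-Python sketch of a container that holds tables and the relationships between them. The class names Relationship and EntitySetSketch, their methods and the sample tables are hypothetical and written only for this example. This is not the Featuretools API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    """A one-to-many link: one parent row can have many child rows."""
    parent_table: str
    parent_key: str
    child_table: str
    child_key: str


@dataclass
class EntitySetSketch:
    """Hypothetical container: table name -> list of row dicts, plus links."""
    tables: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_table(self, name, rows):
        self.tables[name] = rows

    def add_relationship(self, rel):
        # Every child row must point at an existing parent row.
        parent_keys = {row[rel.parent_key] for row in self.tables[rel.parent_table]}
        orphans = [row for row in self.tables[rel.child_table]
                   if row[rel.child_key] not in parent_keys]
        if orphans:
            raise ValueError(f"{len(orphans)} child row(s) have no parent")
        self.relationships.append(rel)

    def children_of(self, table_name):
        return [r.child_table for r in self.relationships
                if r.parent_table == table_name]


entity_set = EntitySetSketch()
entity_set.add_table("customers", [{"customer_id": 1}, {"customer_id": 2}])
entity_set.add_table("orders", [{"order_id": 10, "customer_id": 1},
                                {"order_id": 11, "customer_id": 2}])
entity_set.add_relationship(
    Relationship("customers", "customer_id", "orders", "customer_id"))

print(entity_set.children_of("customers"))
```

Keeping the relationships explicit is what later allows an algorithm to walk from a child table up to its parent and aggregate along the way.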

Deep Feature Synthesis

Deep Feature Synthesis is an algorithm that automatically generates features for relational data sets. Essentially, the algorithm follows the relationships in the data to a base field, then sequentially applies mathematical functions along that path to create the final feature. These functions can be standard aggregations (the minimum of A, the maximum of B) or combinations of features (A * B).
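The stacking idea can be sketched by hand with pandas. The example below assumes three hypothetical tables (customers, orders, items) and applies one aggregation per relationship level. The nested feature names are only a naming convention chosen for readability here. This is a manual illustration of the principle, not how Featuretools implements it.

```python
import pandas as pd

# Three invented tables linked as customers -> orders -> items.
customers = pd.DataFrame({"customer_id": [1, 2]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 1, 2]})
items = pd.DataFrame({"order_id": [10, 10, 11, 12],
                      "price": [5.0, 15.0, 30.0, 8.0]})

# Depth 1: aggregate the deepest table (items) up to its parent (orders).
order_totals = (items.groupby("order_id")["price"].sum()
                .rename("SUM(items.price)").reset_index())
orders = orders.merge(order_totals, on="order_id", how="left")

# Depth 2: stack a second aggregation on top of the depth-1 feature,
# this time from orders up to customers.
stacked = orders.groupby("customer_id")["SUM(items.price)"].agg(["mean", "max"])
stacked.columns = [f"{name.upper()}(orders.SUM(items.price))"
                   for name in stacked.columns]

# One row per customer, one column per generated feature.
feature_matrix = (customers.merge(stacked.reset_index(),
                                  on="customer_id", how="left")
                  .set_index("customer_id"))
print(feature_matrix)
```

With more tables, more aggregation functions and a larger depth, the number of generated columns grows very quickly, which is why the process is worth automating.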

Stages

Featuretools is used in three main steps:

Reading the data: it consists of one or more tables, which may or may not be linked together.
Creating the features: this is the step where Featuretools comes in. Default options exist, but the step can be customized to varying degrees. For example, it is possible to ignore certain tables or variables, to choose which aggregation or transformation functions to use, to set the maximum depth allowed when functions are stacked, and to create Seed Features, which are manually defined features that encode specific domain knowledge (for example, treating a variable as high when its value is greater than 100).
Training the model on the matrix of new features.
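The Seed Feature idea from the second step can be sketched in plain pandas. The table, the column names and the way the flag is aggregated are assumptions made for this example; the threshold of 100 is the one quoted above. The code shows the principle by hand rather than through the Featuretools API.

```python
import pandas as pd

# Invented transaction table: several rows per customer.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [50.0, 120.0, 300.0, 20.0, 99.0],
})

# Seed feature: a hand-written domain rule, here "amount greater than 100".
transactions["is_high_amount"] = transactions["amount"] > 100

# Generic aggregations are then applied to raw columns and to the seed
# feature alike, so the domain knowledge propagates into the matrix.
per_customer = transactions.groupby("customer_id").agg(
    mean_amount=("amount", "mean"),
    share_high_amount=("is_high_amount", "mean"),
)
print(per_customer)
```

The resulting table has one row per customer and could be passed to a model in the third step.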

Figure: Data processing steps with Featuretools

Other tools to automate feature engineering

AutoFeat

AutoFeat is another open-source feature engineering library, specialized in linear regression and classification models. It automates the creation of new features, the conversion of categorical variables, the selection of features, and the training of a classification or linear regression model on the selected features. New features are created in several steps: first, non-linear transformations are applied to the input variables (1/x, log(x)...), then the results are combined with various operators (+, -, *).
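The two-step expansion described above can be sketched in plain Python. The function name expand, the choice of transformations and the use of sums, differences and products as combination operators are assumptions made for this illustration. This is not AutoFeat's API, and the sketch assumes strictly positive inputs so that 1/x and log(x) are defined.

```python
import math
from itertools import combinations

# Non-linear transformations, keyed by a naming pattern for the new feature.
TRANSFORMS = {
    "1/{}": lambda value: 1.0 / value,
    "log({})": math.log,
}


def expand(row):
    """Return the original variables plus transformed and combined ones."""
    # Step 1: apply each non-linear transformation to each input variable.
    features = dict(row)
    for name, value in row.items():
        for pattern, transform in TRANSFORMS.items():
            features[pattern.format(name)] = transform(value)

    # Step 2: combine every pair of step-1 features with simple operators.
    combined = {}
    for (a, value_a), (b, value_b) in combinations(features.items(), 2):
        combined[f"{a}+{b}"] = value_a + value_b
        combined[f"{a}-{b}"] = value_a - value_b
        combined[f"{a}*{b}"] = value_a * value_b
    features.update(combined)
    return features


expanded = expand({"x1": 1.0, "x2": 2.0})
print(len(expanded))
```

Even two input variables produce dozens of candidates, which is why such an expansion has to be followed by a feature selection step.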

AutoFeat does not work like Featuretools: it is not intended for relational data but was created for scientific use cases. The data is stored in a single table, and the units of the input variables can be specified to avoid creating physically meaningless features.

Its limitations are that it does not handle relational databases and that the features it creates are less sophisticated than those of Featuretools.

TSFresh

TSFresh, which stands for “Time Series Feature Extraction Based on Scalable Hypothesis Tests”, is an open-source library dedicated to time series. It automatically extracts and selects features. TSFresh can extract more than 60 features from a time series variable, capturing information that ranges from the most basic to the most complex: mean value, maximum value, number of peaks, fast Fourier transform...
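A few of the simpler features listed above can be computed by hand to show what a per-series feature extractor returns. The function name extract_features, the sample signal and the definition of a peak (a value strictly greater than both neighbours) are assumptions made for this sketch. TSFresh's own feature definitions may differ.

```python
from statistics import mean


def extract_features(series):
    """Summarize one time series as a small dictionary of features."""
    # A peak is counted when a value is strictly above both its neighbours.
    peaks = sum(
        1
        for i in range(1, len(series) - 1)
        if series[i - 1] < series[i] > series[i + 1]
    )
    return {
        "mean": mean(series),
        "maximum": max(series),
        "number_of_peaks": peaks,
    }


signal = [0, 2, 1, 3, 1, 1, 4, 0]
features = extract_features(signal)
print(features)
```

Each series, whatever its length, is reduced to one fixed-size row of features, which is the form a standard model expects.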

TSFresh is very useful for time series data and can easily be integrated with Featuretools, but it cannot be used for other data types.

OneBM

OneBM, or One Button Machine, creates features from a relational database. It joins the tables incrementally by following the relationships between them. It then automatically identifies the type of each variable (categorical, numerical, time series, etc.) and applies a set of predefined operations corresponding to that type. OneBM also performs Feature Selection to remove irrelevant features built in the previous step.
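The "identify the type, then apply the matching operations" step can be sketched as a simple dispatch table. Everything in this example is hypothetical: the function names, the type-detection rule, the set of operations and the sample columns are invented for illustration and only two variable types are handled. It does not reproduce OneBM itself.

```python
from collections import Counter
from statistics import mean


def infer_type(values):
    """Very rough type detection: numbers are numerical, the rest categorical."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numerical"
    return "categorical"


# Predefined operations for each detected type.
OPERATIONS = {
    "numerical": {"mean": mean, "max": max},
    "categorical": {
        "most_frequent": lambda values: Counter(values).most_common(1)[0][0],
        "n_distinct": lambda values: len(set(values)),
    },
}


def build_features(columns):
    """Apply every operation registered for a column's detected type."""
    features = {}
    for name, values in columns.items():
        kind = infer_type(values)
        for op_name, operation in OPERATIONS[kind].items():
            features[f"{op_name}({name})"] = operation(values)
    return features


# Columns as they might look for one customer after joining the tables.
joined = {"amount": [20.0, 30.0, 8.0], "channel": ["web", "shop", "web"]}
features = build_features(joined)
print(features)
```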

The downside is that there is no open-source implementation for OneBM.

Conclusion

Feature Engineering is an essential step in any Data Science project. Even though it is time-consuming and often underestimated, it plays a key role in improving model performance.

While Feature Engineering can be complex to automate because it requires understanding the specific context of each project's data, tools like Featuretools, AutoFeat, and TSFresh have proven very useful for speeding up the process and minimizing the risk of human error.

A follow-up article will show how these tools work on a concrete case. Follow Aqsone's LinkedIn page https://www.linkedin.com/company/aqsone/ so as not to miss our next articles!