Automation of feature engineering with featuretools (~7 min)

In this article we will see what Feature Engineering is, the benefits of automating this step, and a comparison of different libraries dealing with the subject.

Colombe Becquart, Data Scientist

Introduction

In the field of predictive models, raw data must be cleaned and transformed before an algorithm can exploit it correctly. This is where Feature Engineering, an often overlooked but crucial step in any Machine Learning project, comes into play. Feature Engineering involves creating new features from raw data, thereby improving the representation of information and increasing the performance of the models.

 

In this context, Featuretools is a framework that automates Feature Engineering. It excels at transforming temporal and relational data into feature matrices for Machine Learning.

 

 

Feature Engineering

Feature Engineering is the process of selecting, manipulating and transforming raw data into features that can be used by a predictive model. This work relies on leveraging existing data and applying domain knowledge to create new, relevant variables that do not initially exist. 

 

Feature Engineering is a fairly broad subject that covers several stages, notably:

  • Feature Extraction, which takes raw data as input and transforms it into a set of features that can be used as input to a model, for example calculating the moving average on temporal data
  • Feature Selection, which identifies the most relevant features to keep for training the model
  • Feature Construction, which creates new variables from existing ones using mathematical functions or other transformations, for example taking the ratio of two variables
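
As a minimal sketch of these three stages, the snippet below uses only the Python standard library on made-up data. The variable names (`sales`, `visits`, `noise`, `target`), the window size and the correlation threshold of 0.5 are assumptions chosen for illustration, and a simple correlation filter stands in for a real selection method.

```python
from statistics import mean

# Made-up daily observations and a target to predict (illustrative values only).
sales = [10.0, 12.0, 11.0, 15.0, 18.0, 17.0]
visits = [100.0, 110.0, 105.0, 120.0, 150.0, 140.0]
noise = [5.0, 1.0, 4.0, 2.0, 5.0, 1.0]
target = [11.0, 12.5, 13.0, 16.0, 18.5, 19.0]


def moving_average(values, window):
    """Trailing moving average; the first window-1 positions have no value."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            result.append(mean(values[i + 1 - window : i + 1]))
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Feature Extraction: 3-day moving average of the raw sales series.
sales_ma3 = moving_average(sales, window=3)

# Feature Construction: ratio of two existing variables.
sales_per_visit = [s / v for s, v in zip(sales, visits)]

# Feature Selection: keep candidates whose absolute correlation with the
# target exceeds an arbitrary threshold (the moving average is left out
# here because its first values are missing).
candidates = {
    "sales": sales,
    "visits": visits,
    "noise": noise,
    "sales_per_visit": sales_per_visit,
}
selected = [name for name, col in candidates.items() if abs(pearson(col, target)) > 0.5]
print(selected)
```

In this toy dataset the `noise` column is filtered out, while the constructed ratio is kept alongside the raw variables.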

Feature Engineering is a very important step in a Machine Learning project. Good preparation and careful selection of features can improve the model's results. Simple, well-chosen features also make it possible to use models that are less complex, faster to run, and easier to understand and maintain.

 

Automating Feature Engineering

Automating Feature Engineering has several advantages. First, it saves time: data cleaning and preparation represent about 60% of a Data Scientist's work, and automating the creation of features reduces the time allocated to this task. It also helps avoid manual errors and allows the same code to be reused across several problems, whereas a manual approach must be redone for each dataset.

Figure: breakdown of a Data Scientist's time by activity

Why use Featuretools

Featuretools is an open source Python library whose objective is to carry out automated Feature Engineering. It includes functions that can be stacked to create features, using a method called Deep Feature Synthesis. This library works very well with other tools in the pipeline, including pandas dataframes and Scikit-Learn models. 

 

How Featuretools works

Concepts

Featuretools is based on a few key concepts:

  • Entities and EntitySets: tables (called dataframes in recent versions of Featuretools) and a data structure to keep track of them
  • Relationships: how tables can be linked to each other
  • Feature primitives: aggregations and transformations that are stacked to build features
  • Deep Feature Synthesis: method that stacks feature primitives to generate up to thousands of new features

 

Deep Feature Synthesis

Deep Feature Synthesis is an algorithm that automatically generates features for relational datasets. Essentially, the algorithm follows the relationships in the data down to a base field, then sequentially applies mathematical functions along that path to create the final feature. These functions can be standard aggregations (MIN(A), MAX(B)) or combinations of features (A * B).
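
The stacking idea can be illustrated with a toy sketch in plain Python. This is not Featuretools' implementation: the tables (`orders`, `items`), their columns and the `aggregate` helper are hypothetical, and only show how an aggregation computed at one level of a relationship becomes the input of another aggregation one level up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical relational data: customers -> orders -> items.
orders = [
    {"order_id": 1, "customer_id": "a"},
    {"order_id": 2, "customer_id": "a"},
    {"order_id": 3, "customer_id": "b"},
]
items = [
    {"order_id": 1, "price": 10.0},
    {"order_id": 1, "price": 5.0},
    {"order_id": 2, "price": 20.0},
    {"order_id": 3, "price": 7.0},
    {"order_id": 3, "price": 3.0},
]


def aggregate(child_rows, parent_key, value_key, func):
    """Group child rows by their parent key and apply one aggregation primitive."""
    groups = defaultdict(list)
    for row in child_rows:
        groups[row[parent_key]].append(row[value_key])
    return {key: func(values) for key, values in groups.items()}


# Depth 1: SUM(items.price) for each order.
order_total = aggregate(items, "order_id", "price", sum)

# Attach the depth-1 feature to the orders table so it can be aggregated again.
orders_with_total = [{**o, "total": order_total[o["order_id"]]} for o in orders]

# Depth 2: MEAN and MAX of SUM(items.price) over the orders of each customer.
mean_order_total = aggregate(orders_with_total, "customer_id", "total", mean)
max_order_total = aggregate(orders_with_total, "customer_id", "total", max)

print(mean_order_total)
print(max_order_total)
```

Each extra level of stacking multiplies the number of candidate features, which is why limiting the depth matters in practice.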

 

Steps

Featuretools is used in three main steps:

  1. Reading the data: it consists of one or more tables, which may or may not be linked together.
  2. Creating features: this is where Featuretools comes in. Default options exist, but this step can be customized to a greater or lesser extent. For example, it is possible to ignore certain tables or variables, choose which aggregation or transformation functions to use, specify the maximum depth allowed when functions are stacked, and create seed features, which are manually defined features that encode specific domain knowledge (for example, flagging a variable as high when its value exceeds 100).
  3. Training the model on the matrix of new features.

Figure: data processing steps with Featuretools

Other tools to automate Feature Engineering

AutoFeat

AutoFeat is another open-source Feature Engineering library, specialized for linear regression and classification models. It automates the creation of new features, the conversion of categorical variables, the selection of features, and the training of a linear regression or classification model on the selected features. New features are created in several steps: first, non-linear transformations are applied to the input variables (1/x, log(x)…), then the results are combined with different operators (+, -, *).
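
The two-step expansion described above can be sketched in plain Python. This does not use the AutoFeat package and is not its implementation: the input variables `x1` and `x2`, the set of transformations and the naming scheme are assumptions for illustration, and the selection step that follows in AutoFeat is omitted.

```python
import math
from itertools import combinations

# Hypothetical single-table input: two strictly positive variables.
data = {"x1": [1.0, 2.0, 4.0], "x2": [2.0, 4.0, 8.0]}

# Step 1: apply non-linear transformations to each input variable.
transforms = {
    "1/{}": lambda v: 1.0 / v,
    "log({})": math.log,
    "{}^2": lambda v: v * v,
}
features = dict(data)
for name, values in data.items():
    for pattern, func in transforms.items():
        features[pattern.format(name)] = [func(v) for v in values]

# Step 2: combine every pair of features with +, - and *.
operators = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
combined = {}
for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
    for symbol, op in operators.items():
        combined[f"({name_a}){symbol}({name_b})"] = [op(a, b) for a, b in zip(col_a, col_b)]

features.update(combined)
print(len(features), "candidate features")
```

Even with two inputs and three transformations the number of candidates grows quickly, which is why a selection step is needed afterwards.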

AutoFeat does not work like Featuretools: it is not intended for relational data but was created for scientific use cases. The data is stored in a single table, and the units of the input variables can be specified in order to avoid creating physically meaningless features.

Its limitations are that it does not manage relational databases and that it creates less sophisticated features than Featuretools.

 

TsFresh

TsFresh, which stands for “Time Series Feature extraction based on scalable hypothesis tests”, is an open-source library specific to time series. It automatically extracts and selects features. TsFresh can extract more than 60 characteristics from a time series variable, capturing information from the most basic to the most complex: mean value, maximum value, number of peaks, fast Fourier transform, etc.
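
To give an idea of what such characteristics look like, here is a plain Python sketch that computes three of them on a made-up series. It does not call TsFresh, whose own calculators are far more numerous and may be defined differently. The `count_peaks` helper and its definition of a peak (a point strictly greater than both direct neighbours) are assumptions for illustration.

```python
from statistics import mean

# Made-up time series (illustrative values only).
series = [1.0, 3.0, 2.0, 5.0, 4.0, 4.0, 6.0, 2.0]


def count_peaks(values):
    """Count points strictly greater than both of their direct neighbours."""
    return sum(
        1
        for prev, cur, nxt in zip(values, values[1:], values[2:])
        if cur > prev and cur > nxt
    )


# One row of a feature matrix: each key is a feature summarizing the series.
features = {
    "mean": mean(series),
    "maximum": max(series),
    "number_of_peaks": count_peaks(series),
}
print(features)
```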

TsFresh is very useful for time series data and can be easily integrated with Featuretools, but cannot be used for other data types. 

 

OneBM

OneBM, or One Button Machine, creates features from a relational database. It joins tables incrementally by following the relationships between them. It then automatically identifies the type of each variable (categorical, numerical, time series, etc.) and applies a set of predefined operations corresponding to the identified type. OneBM can also perform Feature Selection to remove irrelevant features built in the previous step.

The downside is that there is no open-source implementation for OneBM.

 

Comparative table

AutoFeat                         | TsFresh                          | OneBM                                       | Featuretools
---------------------------------+----------------------------------+---------------------------------------------+--------------------------------------------
Open source                      | Open source                      | No open-source implementation               | Open source
Does not support relational data | Does not support relational data | Supports relational and non-relational data | Supports relational and non-relational data
Lacks important features         | Can be used with Featuretools    | Generates simple and complex features       | Generates simple and complex features
Created for scientific use cases | Specific to time series          | No specific domain                          | No specific domain

Conclusion 

Feature Engineering is an essential step in any project related to Data Science. Although it is time-consuming and often underestimated, it plays a key role in improving model performance.

Although Feature Engineering can be complex to automate, because it requires understanding the specific context of each project's data, tools such as Featuretools, AutoFeat and TsFresh have proven very useful in speeding up this process and minimizing the risk of human error.

 

A follow-up article will be published to show how these tools work on a concrete case. Follow the Aqsone LinkedIn page https://www.linkedin.com/company/aqsone/ so you don't miss our next articles!
