In this article, we look at what Feature Engineering is, why it is worth automating, and how several libraries that address the subject compare.
In predictive modeling, raw data must be cleaned and transformed before an algorithm can use it correctly. This is where Feature Engineering comes in: an often overlooked but crucial step in any machine learning project. Feature Engineering consists of creating new features from raw data, which improves the representation of the information and increases model performance.
In this context, Featuretools is a framework that automates Feature Engineering. It excels at transforming temporal and relational data into feature matrices for Machine Learning.
Feature Engineering is the process of selecting, manipulating, and transforming raw data into features that can be used by a predictive model. It relies on exploiting existing data and applying domain knowledge to create new, relevant variables that do not exist in the original data.
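To make the definition concrete, here is a minimal sketch in plain Python. The order records and the two derived variables (`is_weekend`, `amount_per_item`) are hypothetical and chosen only for illustration; they are not taken from a real data set.

```python
from datetime import date

# Hypothetical raw records: one row per order.
raw_orders = [
    {"order_date": date(2023, 3, 4), "amount": 120.0, "n_items": 4},
    {"order_date": date(2023, 3, 6), "amount": 45.0, "n_items": 1},
]


def engineer_features(row):
    """Return the raw record plus variables that did not exist in it."""
    return {
        **row,
        # Domain knowledge: weekend orders may behave differently.
        # weekday() returns 5 for Saturday and 6 for Sunday.
        "is_weekend": row["order_date"].weekday() >= 5,
        # Ratio feature: average price per item in the order.
        "amount_per_item": row["amount"] / row["n_items"],
    }


order_features = [engineer_features(row) for row in raw_orders]
```

Neither new column adds information that was absent from the raw data; they only expose it in a form a model can use more directly.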
Feature Engineering is a fairly broad subject that covers several stages, including:
Feature Engineering is a very important step in a Machine Learning project. Good preparation and careful selection of features improve the results of the model. Simple, well-chosen features also make it possible to use less complex models that are faster to run and easier to understand and maintain.
Automating Feature Engineering has several advantages. First, it saves time: cleaning and preparing data is often estimated to represent around 60% of a Data Scientist's work, so automating feature creation reduces the time spent on this task. It also helps avoid manual errors, and the same code can be reused across several problems, whereas a manual approach must be redone for each data set.
Featuretools is an open-source Python library for automated Feature Engineering. It provides functions that can be stacked to create features, using a method called Deep Feature Synthesis. The library works well with the other tools in the pipeline, including pandas DataFrames and Scikit-Learn models.
Featuretools is based on a few key concepts:
Deep Feature Synthesis is an algorithm that automatically generates features for relational data sets. Essentially, the algorithm follows the relationships in the data down to a base column, then sequentially applies mathematical functions along that path to create the final feature. These functions can be standard operators (MIN(A), MAX(B)) or combinations of features (A * B).
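The idea behind Deep Feature Synthesis can be sketched in plain Python. This is a conceptual illustration only: it does not use the Featuretools API, and the tables, the `aggregate_children` helper, and the feature names are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical relational data: a parent table and a child table.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 20.0},
    {"order_id": 11, "customer_id": 1, "amount": 40.0},
    {"order_id": 12, "customer_id": 2, "amount": 15.0},
]

# Functions applied along the orders -> customers relationship.
AGGREGATIONS = {"SUM": sum, "MEAN": mean, "MAX": max, "COUNT": len}


def aggregate_children(parents, children, key, column):
    """Build one feature per aggregation for each parent row."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child[key]].append(child[column])

    rows = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        feats = dict(parent)
        for name, func in AGGREGATIONS.items():
            # Parents without any child row get None instead of an error.
            feats[f"{name}(orders.{column})"] = func(values) if values else None
        rows.append(feats)
    return rows


feature_matrix = aggregate_children(customers, orders, "customer_id", "amount")

# Stacking: combine two generated features into a deeper one.
# Every customer in this toy data set has at least one order.
for row in feature_matrix:
    row["MAX(orders.amount) / MEAN(orders.amount)"] = (
        row["MAX(orders.amount)"] / row["MEAN(orders.amount)"]
    )
```

Each customer ends up with one row of features built by walking the relationship to the `amount` column and aggregating it; the last loop shows how generated features can themselves be combined.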
Featuretools is used in 3 main steps:
AutoFeat is another open-source feature engineering library, specialized in linear classification and regression models. It automates the creation of new features, the conversion of categorical variables, the selection of features, and the training of a classification or linear regression model on the selected features. New features are created in several steps: first, non-linear transformations are applied to the input variables (1/x, log(x), ...), then the results are combined with various operators (+, -, ...).
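The two-step construction described above (transform, then combine) might look like the following sketch. It is written in plain Python and does not call the AutoFeat library; the input columns and the `expand` and `combine` helpers are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical single-table input: column name -> strictly positive values.
data = {"x1": [1.0, 2.0, 4.0], "x2": [2.0, 4.0, 8.0]}

# Step 1: non-linear transformations applied to each input variable.
TRANSFORMS = {
    "1/{}": lambda v: 1.0 / v,
    "log({})": math.log,
}


def expand(columns):
    """Add one transformed column per (input column, transformation) pair."""
    out = dict(columns)
    for name, values in columns.items():
        for pattern, func in TRANSFORMS.items():
            out[pattern.format(name)] = [func(v) for v in values]
    return out


def combine(columns):
    """Step 2: combine every pair of columns with + and -."""
    out = dict(columns)
    for (a, va), (b, vb) in combinations(columns.items(), 2):
        out[f"{a} + {b}"] = [x + y for x, y in zip(va, vb)]
        out[f"{a} - {b}"] = [x - y for x, y in zip(va, vb)]
    return out


candidates = combine(expand(data))
```

Even with two inputs, two transformations, and two operators, the number of candidate columns grows quickly, which is why a feature selection step follows the generation step.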
AutoFeat does not work like Featuretools: it is not intended for relational data but was created for scientific use cases. The data is stored in a single table, and the units of the input variables can be specified to avoid creating physically meaningless features.
Its limitations are that it does not handle relational databases and that the features it creates are less sophisticated than those of Featuretools.
TSFresh, which stands for “Time Series Feature Extraction Based on Scalable Hypothesis Tests”, is an open-source library dedicated to time series. It automatically extracts and selects features. TSFresh can extract more than 60 features from a time-dependent variable, capturing information that ranges from the most basic to the most complex: average value, maximum value, number of peaks, fast Fourier transform...
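As a rough illustration of what extracting features from a time series means, the sketch below computes three of the features mentioned above in plain Python. It does not use the TSFresh API; the function names, the peak definition (a point strictly greater than both neighbours), and the signal values are assumptions for the example.

```python
from statistics import mean


def count_peaks(series):
    """Count points strictly greater than both of their neighbours."""
    return sum(
        1
        for prev, cur, nxt in zip(series, series[1:], series[2:])
        if cur > prev and cur > nxt
    )


def extract_features(series):
    """Summarise one time series as a flat dictionary of features."""
    return {
        "mean": mean(series),
        "maximum": max(series),
        "number_of_peaks": count_peaks(series),
    }


# Hypothetical sensor readings.
signal = [1, 3, 2, 5, 4, 4, 6, 1]
ts_features = extract_features(signal)
```

The whole series is reduced to a fixed-length row, which is what allows a standard tabular model to be trained on time series data.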
TSFresh is very useful for time series data and can be easily integrated with Featuretools, but cannot be used for other data types.
OneBM, or One Button Machine, creates features from a relational database. It joins the tables incrementally by following the relationships between them. It then automatically identifies the type of each variable (categorical, numerical, time series, etc.) and applies a set of predefined operations corresponding to that type. OneBM also performs feature selection to remove irrelevant features built in the previous step.
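The type-driven step can be sketched as follows. Since no implementation of OneBM is publicly available, this is only an illustration of the principle under simplifying assumptions: the type detection rule, the operations per type, and the data are all hypothetical.

```python
from collections import Counter
from statistics import mean


def infer_type(values):
    """Very rough type detection: numerical if every value is a number."""
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numerical"
    return "categorical"


# Predefined operations for each detected type.
OPERATIONS = {
    "numerical": {"mean": mean, "max": max},
    "categorical": {
        "most_frequent": lambda vs: Counter(vs).most_common(1)[0][0],
        "n_distinct": lambda vs: len(set(vs)),
    },
}


def build_features(joined_columns):
    """Apply to each column the operations that match its detected type."""
    feats = {}
    for column, values in joined_columns.items():
        for op_name, op in OPERATIONS[infer_type(values)].items():
            feats[f"{op_name}({column})"] = op(values)
    return feats


# Hypothetical values already joined to one customer through table relationships.
customer_orders = {
    "amount": [20.0, 40.0, 30.0],
    "channel": ["web", "store", "web"],
}
onebm_features = build_features(customer_orders)
```

The point of the design is that the user never chooses the operations: the detected type of each column decides which ones are applied.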
The downside is that there is no open-source implementation for OneBM.
Feature Engineering is an essential step in any Data Science project. Even though it is time-consuming and often underestimated, it plays a key role in improving model performance.
Although Feature Engineering can be complex to automate, because it requires understanding the specific context of each project's data, tools like Featuretools, AutoFeat, and TSFresh have proven very useful for speeding up the process and reducing the risk of human error.
An article will be published to show how these tools work in a concrete case. Follow Aqsone's LinkedIn page https://www.linkedin.com/company/aqsone/ so as not to miss our next articles!