Tech

How automated feature engineering works

PU
Dr. Patrick Urbanke on

Introducing the most efficient feature engineering solution for relational data and time series.

Motivation

The deep learning revolution has enabled automated feature engineering for images and sound data. Yet, for relational data and classical time series analysis, feature engineering is still mainly done by hand or using very simple brute force methods. Our mission is to change that.

In this blog post, we will introduce the basic idea of a novel algorithm that automates feature engineering for relational data and time series.

Objectives

Unfortunately, the term “feature engineering” is ambiguous. In the Kaggle community, “feature engineering” is often meant to describe simple numerical transformations of existing features or encoding techniques for categorical features.

However, we mean more than just that. Under our definition of feature engineering, we assume that the raw data is stored in several relational tables and our task is to bring these tables together and engineer features out these relationships.

When we talk about automated feature engineering for relational data and time series, we imagine that the general procedure should look like this:

Objective 1: The user is required to set up the schema of the relational data that he or she would like to have analyzed. In particular, there needs to be some sort of target variable(s), which the users wants to predict. For time series, the schema would typically be a self-join. Other than this general information on the data schema and the user’s intent, no input from the user should be required.

Objective 2: Features are often of the type COUNT the number of transactions in the last X days, where X is some sort of fixed value. We would like our algorithm to identify appropriate values for X without any input from the user. The user should not be required to suggest possible values for X.

Objective 3: Features can also take the form of COUNT the number of transactions for which the transaction type is ‘type_1’ OR ‘type_2’ OR ’type_3’ OR … (obviously, aggregations other than COUNT should be possible as well). We would like our algorithm to find appropriate conditions based on categorical data without any input from the user.

Objective 4: We would like our algorithm to handle combinations of conditions. So features of the form SOME_AGGREGATION( over some column ) WHERE ( condition_1 AND condition_2 ) OR ( condition_3 AND condition_4 ) OR condition_5 should be possible. Again, no input from the user should be required.

Objective 5: We would like to be able to express our features in SQL code. Even though automatically generated features will always be less intuitive than hand-crafted ones and could be quite complex, we want the user to get an understanding of what is going on.

To give you an idea, here are some features that were actually produced by our algorithm:


  CREATE TABLE FEATURE_1 AS
  SELECT COUNT( DISTINCT t2.UCC4 ) AS feature_1,
         t1.BASKETID,
         t1.TIME_STAMP
  FROM (
       SELECT *,
              ROW_NUMBER() OVER ( ORDER BY BASKETID, TIME_STAMP ASC ) AS rownum
       FROM EXPD
  ) t1
  LEFT JOIN EXPD t2
  ON t1.BASKETID = t2.BASKETID
  WHERE (
     ( t1.UCC NOT IN ( '330410', '280120', '630110', '320905', '660000', '380315', '370211', '390310', '400220', '360330', '360420', '390223', '410120', '410901', '670902', '380901', '610110', '690230', '004190' ) AND t2.UCC5 NOT IN ( '56031', '28014', '34090', '67031', '67090', '20021', '59023', '38034', '32033', '32013', '47022', '60031', '49030', '56021', '53041', '41090', '31014', '38041', '67011', '29031', '62041', '38011', '68022', '23090', '64012', '55011', '99990', '62061', '68011', '60090', '00410', '00419' ) AND t2.UCC NOT IN ( '190212', '160310', '180612', '200512', '200532', '610110', '470112' ) AND t2.UCC IN ( '200410', '690114', '680903', '600420' ) )
  OR ( t1.UCC NOT IN ( '330410', '280120', '630110', '320905', '660000', '380315', '370211', '390310', '400220', '360330', '360420', '390223', '410120', '410901', '670902', '380901', '610110', '690230', '004190' ) AND t2.UCC5 NOT IN ( '56031', '28014', '34090', '67031', '67090', '20021', '59023', '38034', '32033', '32013', '47022', '60031', '49030', '56021', '53041', '41090', '31014', '38041', '67011', '29031', '62041', '38011', '68022', '23090', '64012', '55011', '99990', '62061', '68011', '60090', '00410', '00419' ) AND t2.UCC IN ( '190212', '160310', '180612', '200512', '200532', '610110', '470112' ) AND t1.UCC IN ( '170210', '330510', '100410', '140420', '150110', '180710', '120210', '320903', '590230', '110310', '330610' ) )
  OR ( t1.UCC NOT IN ( '330410', '280120', '630110', '320905', '660000', '380315', '370211', '390310', '400220', '360330', '360420', '390223', '410120', '410901', '670902', '380901', '610110', '690230', '004190' ) AND t2.UCC5 IN ( '56031', '28014', '34090', '67031', '67090', '20021', '59023', '38034', '32033', '32013', '47022', '60031', '49030', '56021', '53041', '41090', '31014', '38041', '67011', '29031', '62041', '38011', '68022', '23090', '64012', '55011', '99990', '62061', '68011', '60090', '00410', '00419' ) AND t1.UCC5 IN ( '11041', '17021', '33051', '10041', '14042', '15011', '18022', '18071', '26011', '12021', '36035', '32090', '50011', '59023', '38034', '64022', '11031', '53021', '39021', '38021', '34053', '44012', '38043', '29031', '23090', '39023' ) )
  OR ( t1.UCC IN ( '330410', '280120', '630110', '320905', '660000', '380315', '370211', '390310', '400220', '360330', '360420', '390223', '410120', '410901', '670902', '380901', '610110', '690230', '004190' ) AND t2.UCC5 NOT IN ( '13031', '18032', '19021', '33041', '36021', '17053', '18022', '18051', '18071', '19031', '12021', '37021', '14021', '62021', '47021', '64031', '17041', '39012', '60021', '52053', '55031', '34011', '49031', '20051', '61032', '24021', '32033', '20052', '52054', '36031', '36042', '36051', '60031', '62012', '13012', '37031', '53031', '62011', '69011', '38021', '24011', '62092', '34021', '38041', '67011', '32022', '37090', '62081', '38032', '28022', '29043', '61014', '34091', '69023', '36012', '01012', '02011', '01032', '03081', '00400', '00210', '00419' ) AND t2.UCC IN ( '280210', '320370', '190322', '180520', '360350', '190324', '270210', '280140', '320903', '320420', '320902', '200532', '390110', '390310', '640220', '200410', '320130', '470220', '430110', '580000', '560210', '310316', '530903', '230110', '340520', '410901', '660210', '670902', '320232', '520110', '320110', '620410', '380110', '550330', '680220', '680903', '560330', '230900', '290320', '290120', '320901', '320522', '220210', '999900', '290410', '290440', '620320', '310232', '620710', '310210', '680110', '320511', '310340', '630900', '520902', '320120', '140410', '610230', '410140', '620930', '310241', '310331', '310314', '030410', '004100' ) )
  OR ( t1.UCC IN ( '330410', '280120', '630110', '320905', '660000', '380315', '370211', '390310', '400220', '360330', '360420', '390223', '410120', '410901', '670902', '380901', '610110', '690230', '004190' ) AND t2.UCC5 IN ( '13031', '18032', '19021', '33041', '36021', '17053', '18022', '18051', '18071', '19031', '12021', '37021', '14021', '62021', '47021', '64031', '17041', '39012', '60021', '52053', '55031', '34011', '49031', '20051', '61032', '24021', '32033', '20052', '52054', '36031', '36042', '36051', '60031', '62012', '13012', '37031', '53031', '62011', '69011', '38021', '24011', '62092', '34021', '38041', '67011', '32022', '37090', '62081', '38032', '28022', '29043', '61014', '34091', '69023', '36012', '01012', '02011', '01032', '03081', '00400', '00210', '00419' ) AND t2.UCC NOT IN ( '180320', '190211', '330410', '180220', '180710', '190212', '140210', '640310', '170410', '170533', '360420', '370212' ) AND t1.UCC NOT IN ( '630110', '380315', '400220', '390223' ) )
  OR ( t1.UCC IN ( '330410', '280120', '630110', '320905', '660000', '380315', '370211', '390310', '400220', '360330', '360420', '390223', '410120', '410901', '670902', '380901', '610110', '690230', '004190' ) AND t2.UCC5 IN ( '13031', '18032', '19021', '33041', '36021', '17053', '18022', '18051', '18071', '19031', '12021', '37021', '14021', '62021', '47021', '64031', '17041', '39012', '60021', '52053', '55031', '34011', '49031', '20051', '61032', '24021', '32033', '20052', '52054', '36031', '36042', '36051', '60031', '62012', '13012', '37031', '53031', '62011', '69011', '38021', '24011', '62092', '34021', '38041', '67011', '32022', '37090', '62081', '38032', '28022', '29043', '61014', '34091', '69023', '36012', '01012', '02011', '01032', '03081', '00400', '00210', '00419' ) AND t2.UCC IN ( '180320', '190211', '330410', '180220', '180710', '190212', '140210', '640310', '170410', '170533', '360420', '370212' ) AND t1.UCC NOT IN ( '330410', '320905', '360420', '380901', '610110' ) )
  ) AND t2.TIME_STAMP <= t1.TIME_STAMP
  GROUP BY t1.rownum,
           t1.BASKETID,
           t1.TIME_STAMP;

  CREATE TABLE FEATURE_2 AS
  SELECT COUNT( * ) - COUNT( DISTINCT t1.TIME_STAMP - t2.TIME_STAMP ) AS feature_2,
         t1.BASKETID,
         t1.TIME_STAMP
  FROM (
       SELECT *,
              ROW_NUMBER() OVER ( ORDER BY BASKETID, TIME_STAMP ASC ) AS rownum
       FROM EXPD
  ) t1
  LEFT JOIN EXPD t2
  ON t1.BASKETID = t2.BASKETID
  WHERE (
     ( t1.UCC NOT IN ( '180320', '330410', '270000', '390120', '600210', '170533', '430120', '340410', '130121', '370314', '390210', '410120', '410901', '330610', '610110', '690119', '600420', '620112', '030610', '004100', '004190' ) AND t2.UCC NOT IN ( '190324', '270210', '210110', '380333', '360312', '360420', '010120', '020410', '002100' ) AND t2.UCC IN ( '360350', '350110', '130121', '320430', '520110', '620420', '380311', '560400', '620510', '600900', '020620', '001000' ) AND t1.UCC NOT IN ( '110410', '110510', '120310', '120410', '130212', '140110', '170210', '190111', '190211', '320140', '330510', '100210', '100510', '110210', '150110', '180220', '180710', '330210', '190112', '190113', '470111', '630110', '190114', '190212', '190321', '560110', '190311', '100110', '330310', '170310', '640310', '380315', '520531', '180611', '540000', '200532', '640410', '320232', '520110', '560400', '320521', '010210', '090110', '090210', '020110', '020210', '020310', '010320', '020510', '040410', '070110', '004000' ) )
  OR ( t1.UCC NOT IN ( '180320', '330410', '270000', '390120', '600210', '170533', '430120', '340410', '130121', '370314', '390210', '410120', '410901', '330610', '610110', '690119', '600420', '620112', '030610', '004100', '004190' ) AND t2.UCC IN ( '190324', '270210', '210110', '380333', '360312', '360420', '010120', '020410', '002100' ) AND t1.UCC IN ( '180310', '320410', '330510', '620912', '650210', '170520', '180510', '280120', '320233', '190313', '270310', '190322', '240320', '550410', '320904', '340110', '280140', '320345', '340901', '640410', '590230', '380340', '610320', '130211', '140340', '320130', '470220', '360311', '620330', '530311', '590110', '660110', '380901', '340530', '520110', '180620', '670210', '310232', '620710', '630220', '620930', '090210', '020710', '070230', '050310', '030410', '030510' ) )
  OR ( t1.UCC IN ( '180320', '330410', '270000', '390120', '600210', '170533', '430120', '340410', '130121', '370314', '390210', '410120', '410901', '330610', '610110', '690119', '600420', '620112', '030610', '004100', '004190' ) AND t2.UCC NOT IN ( '110410', '120410', '180320', '180410', '190211', '320140', '320410', '330410', '330510', '100210', '100510', '110210', '180420', '320233', '320370', '190112', '330110', '470111', '190313', '190114', '560110', '190311', '180520', '320905', '640310', '660000', '370125', '390120', '340110', '190312', '200532', '390310', '640410', '200210', '380340', '200310', '200410', '360311', '370220', '370314', '340520', '440210', '410120', '410901', '380901', '620926', '300218', '340210', '320232', '610110', '380410', '180620', '240120', '320380', '640120', '620420', '010120', '040110', '010320', '040510', '050310', '004000', '004190' ) AND t1.UCC NOT IN ( '600210', '340410', '410120', '690119', '600420', '620112', '004190' ) AND t2.UCC IN ( '280210', '370213', '620214', '160110', '360350', '470211', '160320', '520531', '430120', '690120', '130110', '590230', '620310', '140340', '380333', '360312', '620121', '620111', '230110', '660110', '660210', '660900', '380210', '550320', '320430', '680903', '380320', '370110', '620213', '320521', '290440', '620710', '380510', '690230', '020610', '020620', '020510', '030810', '040410', '070110', '030210' ) )
  OR ( t1.UCC IN ( '180320', '330410', '270000', '390120', '600210', '170533', '430120', '340410', '130121', '370314', '390210', '410120', '410901', '330610', '610110', '690119', '600420', '620112', '030610', '004100', '004190' ) AND t2.UCC NOT IN ( '110410', '120410', '180320', '180410', '190211', '320140', '320410', '330410', '330510', '100210', '100510', '110210', '180420', '320233', '320370', '190112', '330110', '470111', '190313', '190114', '560110', '190311', '180520', '320905', '640310', '660000', '370125', '390120', '340110', '190312', '200532', '390310', '640410', '200210', '380340', '200310', '200410', '360311', '370220', '370314', '340520', '440210', '410120', '410901', '380901', '620926', '300218', '340210', '320232', '610110', '380410', '180620', '240120', '320380', '640120', '620420', '010120', '040110', '010320', '040510', '050310', '004000', '004190' ) AND t1.UCC IN ( '600210', '340410', '410120', '690119', '600420', '620112', '004190' ) )
  OR ( t1.UCC IN ( '180320', '330410', '270000', '390120', '600210', '170533', '430120', '340410', '130121', '370314', '390210', '410120', '410901', '330610', '610110', '690119', '600420', '620112', '030610', '004100', '004190' ) AND t2.UCC IN ( '110410', '120410', '180320', '180410', '190211', '320140', '320410', '330410', '330510', '100210', '100510', '110210', '180420', '320233', '320370', '190112', '330110', '470111', '190313', '190114', '560110', '190311', '180520', '320905', '640310', '660000', '370125', '390120', '340110', '190312', '200532', '390310', '640410', '200210', '380340', '200310', '200410', '360311', '370220', '370314', '340520', '440210', '410120', '410901', '380901', '620926', '300218', '340210', '320232', '610110', '380410', '180620', '240120', '320380', '640120', '620420', '010120', '040110', '010320', '040510', '050310', '004000', '004190' ) AND t2.UCC4 NOT IN ( '1104', '1204', '1803', '1901', '1902', '3305', '1002', '1005', '1102', '3301', '4701', '6403', '1806', '2005', '3903', '6404', '4101', '3804', '3809', '3402', '6101', '0101', '0103', '0405', '0503', '0040' ) )
  OR ( t1.UCC IN ( '180320', '330410', '270000', '390120', '600210', '170533', '430120', '340410', '130121', '370314', '390210', '410120', '410901', '330610', '610110', '690119', '600420', '620112', '030610', '004100', '004190' ) AND t2.UCC IN ( '110410', '120410', '180320', '180410', '190211', '320140', '320410', '330410', '330510', '100210', '100510', '110210', '180420', '320233', '320370', '190112', '330110', '470111', '190313', '190114', '560110', '190311', '180520', '320905', '640310', '660000', '370125', '390120', '340110', '190312', '200532', '390310', '640410', '200210', '380340', '200310', '200410', '360311', '370220', '370314', '340520', '440210', '410120', '410901', '380901', '620926', '300218', '340210', '320232', '610110', '380410', '180620', '240120', '320380', '640120', '620420', '010120', '040110', '010320', '040510', '050310', '004000', '004190' ) AND t2.UCC4 IN ( '1104', '1204', '1803', '1901', '1902', '3305', '1002', '1005', '1102', '3301', '4701', '6403', '1806', '2005', '3903', '6404', '4101', '3804', '3809', '3402', '6101', '0101', '0103', '0405', '0503', '0040' ) AND t2.UCC IN ( '180320', '100210', '640310', '200532', '390310', '010120', '004000' ) )
  ) AND t2.TIME_STAMP <= t1.TIME_STAMP
  GROUP BY t1.rownum,
           t1.BASKETID,
           t1.TIME_STAMP;

Existing approaches

There are two kinds of approaches that deserve mentioning:

  1. Brute-force approaches, which often come in the form of simple tools that data scientists have written for themselves.

  2. Multi-relational Decision Tree Learning (MRDTL), which was a topic of academic research in the early 2000s. For a description of the algorithm and overview of the literature, click here. Ironically, more recent papers on the topic of feature engineering are much closer to brute-force than they are to MRDTL.

The very inefficient method

A widespread approach to feature engineering is writing small private tools. These self-made tools are usually based on a simple, brute-force algorithm:

  1. Generate a large number of features.

  2. Do a feature selection and keep only a small share of these features (usually less than 1% or even less than 0.1%).

The reason this approach is so popular is that it is very easy to implement using libraries like data.tables in R, pandas in Python or Apache Spark.

However, it is also very inefficient and because of that, there are severe limitations to this approach.

For instance, consider our Objective 2: If we want to find an appropriate value for our threshold X, we have to generate a lot of features like COUNT the number of transactions in the last 90 days, COUNT the number of transactions in the last 91 days, COUNT the number of transactions in the last 92 days etc. But the difference between these features is not likely to be very large, meaning that there are many redundant operations.

Moreover, imagine generating some 20,000 features for a dataset with 1,000,000 lines. That requires a lot of memory (80 GB if you’re using single precision or 160 GB for double precision). And let’s face it: 1,000,000 lines isn’t that much these days.

Also consider our Objective 3: Let’s say you want to generate features of the form COUNT the number of transactions for which the transaction type is ‘type_1’ OR ‘type_2’ OR ’type_3’ OR …. Assume that there are 100 different transaction types (which isn’t a lot). That leaves us with 2^100 or 1,267,650,600,228,229,401,496,703,205,376 different combinations (I don’t even know how to pronounce that number). I think you get how this a problem.

By this point, we must come to the conclusion that a brute force approach can only generate very simple features and requires significant input from the user to restrict the search space (thus violating Objective 1).

Some people have tried to spin this as an advantage, because simple features are usually easier to interpret. But we were able to demonstrate on real-world projects that these approaches severely limit the predictive power of your models. Also, we can always regularize more powerful models to produce simpler features - but not the other way around.

The somewhat better method

Multi-relational Decision Tree Learning (MRDTL) is a strain of academic literature that was particularly popular in the early 2000s. It addresses some of the problems with brute force feature generation. In our view, it is much better than brute force.

However, it has inefficiencies of its own and is more difficult to implement. Maybe that is why the more recent academic literature on automated feature engineering is much closer to brute force than it is to MRDTL.

As the name implies, MRDTL is based on a greedy, tree-like approach:

  1. Define some sort of objective function that evaluates the quality of your feature as it relates to the target variable(s).

  2. Pick an aggregation and some column to be aggregated.

  3. Try out different conditions. Keep the one that generates the greatest improvement of your objective. Repeat until no improvement can be found or some sort of stopping criterion is reached.

This general approach seems far more promising than brute force. Decision-tree-based learners like xgboost are considered to be the state-of-the-art for applications with structured data. So why shouldn’t a similar approach work for relational data as well?

In our view, the reason this approach has never really taken off outside of academia is that an efficient implementation is far from trivial. Most papers on MRDTL implement the algorithm on top of an existing relational database system like MySQL.

The main problem with trying to implement something like this on top of an existing database is that it requires you to do many redundant operations. Consider our Objective 2:: As we iterate through different values for the threshold X, we will find that even the comparatively efficient approaches like MRDTL-2 force us to repeat the same operations on the same data over and over again.

Moreover, greater efficiency is often achieved at the expense of flexibility. For instance, some flavors of MRDTL only support combining different WHERE conditions using AND, but not OR.

In practice, most of these approaches also require the user to provide a list of conditions that the algorithms is allowed to try out. Going back to our initial example, user might be required to hard-code conditions like in the last 60 days, in the last 90 days and in the last 120 days. This obviously violates our objective that we not want to require the user to do this.

That being said, MRDTL is certainly a step in the right direction. Strangely enough, it is all but ignored but by the more recent literature on automated feature engineering, which is closer to brute force.

But MRDTL needs a more efficient implementation. And this is where we come in.

How it works

In this section, we will explain how our algorithm works.

Simply speaking, our algorithm is a more efficient variation of MRDTL. The core idea is to minimize redundancies through incremental updates. We then combine MRDTL with ensemble learning methods.

To allow for incremental updates and maximal efficiency, we developed everything from scratch in C++.

The core idea: Incremental updates

The most important novelty of our approach are incremental updates. When we evaluate a feature like COUNT the number of transactions in the last 90 days and COUNT the number of transactions in the last 91 days, very little changes in between. So the idea is to only recalculate what has changed and keep everything else untouched.

The first ingredient we need is an objective function that can be incrementally updated. When we move from 90 days to 91 days, presumably only very few lines in the population table (which defines our statistical population and contains our targets) actually change. So we shouldn’t be forced to recalculate the entire table.

In practice, most commonly used objective functions are fine and this is not much of a limitation. However, there are some objective functions, like rank correlation, that cannot be used.

The second ingredient are incremental updates to aggregations. This part is a bit harder, so let us elaborate:

Let’s say we have a match between the population table that contains our targets and another table (or a self-join). That match happens to be between the two thresholds 90 days and 91 days. As we move from 90 days to 91 days, we have to update our aggregation for that match. For maximum efficiency, this needs to be done incrementally. That means we do not want to recalculate the entire aggregation for all matches that it aggregates - instead just for the one match in between the two thresholds.

Consider Objective 4: We also want to support the AND and OR combinations of conditions. So it is possible that the match was NOT included in the aggregation, but it becomes part of it as we move the threshold. It is also possible that the match WAS included in the aggregation, but now it isn’t anymore.

For an aggregation like COUNT, incremental updates are straight-forward. If the match was not included, but now it is, then increment by 1. If was included, but it isn’t anymore, then decrement by 1.

Things are more tricky for aggregations like MAX, MEDIAN or COUNT DISTINCT. For instance, whereas incrementing MAX is easy, decrementing it is hard. If the match used to be included and is in fact the maximum value, we now have to find the next biggest match. And we have to find it quickly - ideally iterating through a set of thresholds should take linear time in the number of matches. To make it even more complicated, some cross-joins might result in a lot of matches, so any data structures that have non-trivial memory overhead are a no-go.

Even though we do not have the space to discuss all aggregations here, we do hope that this has helped you understand the core idea of our approach, which is that incremental updates yield large gains in efficiency.

Ensemble learning

Everything so far has shed light on how we train ONE feature. But in practice, we want more than one. So how do we do that?

Since we are using a tree-based algorithm anyway, we are able to harness the power of ensemble learning algorithms that have been shown to work very well with non-relational decision trees, namely bagging and gradient boosting.

With bagging, we just sample randomly from our population table (which defines the statistical population and contains the targets). We train a feature on that sample and then pick a different random sample to train the next feature.

With gradient boosting, we calculate the pseudo-residuals of our previously trained features. We then train features that predict these pseudo-residuals. This procedure guarantees that new features are targeted and compensating the weaknesses of older features.

Try it yourself

If you enjoyed reading this and want to give it a try, you can download our software, getML.

The basic version supports everything we have discussed here and it’s for free.

Head over to https://get.ml/ to download the latest version and jump straight into our getting started guide.