XY Titel

Benchmarking Feature Selection Algorithms with Xy()

André Bleier Blog, Data Science

Feature Selection

Feature Selection is one of the most interesting fields in machine learning in my opinion. It is a boundary point of two different perspectives on machine learning – performance and inference. From a performance point of view, feature selection is typically used to increase the model performance or to reduce the complexity of the problem in order to optimize computational efficiency. From an inference perspective, it is important to extract variable importance to identify key drivers of a problem.

Many people argue that in the era of deep learning feature selection is not important anymore. As a method of representation learning, deep learning models can find important features of the input data on their own. Those features are basically nonlinear transformations of the input data space. However, not every problem is suited to be approached with neural nets (actually, many problems). In many practical ML applications feature selection plays a key role on the road to success.

At STATWORX we work in a lot of dynamic pricing projects. One key piece in the puzzle of dynamic pricing is component-wise Gradient Boosting is an excellent example of such a white-box machine.

Introducing Xy()

xy hexagon logo

In this blog post I will try to challenge feature selection algorithms with my simulation function Xy(). The function is designed to simulate regression and classification problems with a maximum degree of freedom for the user. If you want to get a quick introduction to the function I recommend reading my previous blog post, where I have analyzed Ordinary Least Squares coefficients under different signal strengths.

To challenge the feature selection algorithms, we first need to simulate data. By default, Xy() simulates regression learning data with 1000 observations, two linear and two nonlinear features as well as five randomly generated variables, which are not used to generate the target. We will use these random variables as pitfalls for the feature selection algorithms. To make things a little spicier let’s see how we can generate ten linear, ten non-linear and three categorical variables along with 50 random variables.

# Simulating regression data 

# install the package

# load the library

fs_sim = Xy(n = obs,
            numvars = c(10,10),
            catvars = c(3, 2),
            noisevars = 50,
            task = Xy_task(),
            nlfun = function(x) {x^2},
            interactions = 1,
            sig = c(1,4), 
            cor = c(0),
            weights = c(-10,10),
            intercept = TRUE,
            stn = 4)

 Xy Simulation 
 	 | + task regression
 	 | + observations 1000
 	 | + interactions 1D
 	 | + signal to noise ratio 4
 	 | + effects 
 	   | - linear 10
 	   | - nonlinear 10
 	   | - categorical 3
 	   | - noise 50
 	 | + intervals 
 	   | - correlation 0
 	   | - weights [-10,10]
 	   | - sd [1,4]

Target generating process: 
y = -0.218 + 5.4NLIN_01 - 9.48NLIN_02 + 7NLIN_03 - 7.59NLIN_04 - 7.43NLIN_05 - 0.39NLIN_06 - 7.72NLIN_07 + 0.02NLIN_08 + 9.05NLIN_09 + 7.75NLIN_10 + 8.76LIN_01 + 7.7LIN_02 - 0.77LIN_03 + 8.16LIN_04 + 1.06LIN_05 + 4.46LIN_06 - 4.01LIN_07 + 3.4LIN_08 - 7.94LIN_09 - 8.65LIN_10 - 3.51DUMMY_01__2 - 10.04DUMMY_02__2 - 1.63DUMMY_03__2 + e ~ N(0,99.28)

You can extract the feature importance with the varimp() function. Furthermore, there is a build-in plotting argument, that allows for visualization. I decided to use Boxplots rather than a plain barchart as it visualizes a local and global view on the importance.

# Feature Importance
fs_varimp <- varimp(fs_sim, plot = TRUE)


Benchmarking Framework

Ok, we saw how we can generate data, but how about the algorithms? We are going to use four different classical machine learning algorithms.

  1. Gradient Boosting with library(xgboost)
  2. Component-wise Gradient Boosting with library(mboost)
  3. Ridge Regression with library(foba)
  4. Random Forest with library(rpart)

Two gradient boosting implementations, a greedy ridge regression and a random forest. If you want to read up on the mechanics of the algorithms I strongly recommend the package documentations.
In the end we want to elect a winner of course, thus we have to define the rules. There is not much literature on this matter, so I created three measures which made sense to me. Why three though? I think there are several angles to this problem. Sure a mean absolute error (M1) between the real and estimated importance could be the first approach.

    \[M1 = \frac{1}{n} \sum_{i=1}^{n} | \widehat{Imp} - Imp |\]

This gives us a feeling of how close an algorithm gets to the real importance. However, one could argue, that the deviation itself is not as interesting as the plain number of features it got right (M2).

    \[M2 = p_{est} / p_{real} * 100\]

where p_{est} are the number of true features selected by the algorithm, whereas p_{real} is the total number of features.
Another angle to this problem are falsely selected features. Remember, we simulated fifty noise variables in the example above. The last measure aims to clarify the ratio of true to noise features, which have been selected by an algorithm.

    \[M3 = \frac{p_{est} * imp_{est}}{(p_{true} * imp_{est} + p_{noise} * imp_{noise})}\]

where imp_{est} is the aggregated importance for all estimated true features and p_{noise} are the noise variables selected by the algorithm and their summarized importance imp_{noise}.

Each algorithm gets a total of 100 randomly drawn hyperparameter sets. The hyperparameter configuration with the lowest cross-validation error will represent the feature importance of that particular algorithm. Furthermore, we will repeat the data simulation 20 times to create a little more variation. Each time the number of features, noise variables, polynomial degree of nonlinearity will be drawn randomly. You can find the simulation code on my GitHub. The whole simulation took three days on eight cores, so you might want to change the parameters at the beginning of the script.


A further point to note is, that the tree algorithms gradient boosting (xgboost) and the random forest (rpart) can capture nonlinearities while the ridge regression and the component-wise gradient boosting apply ordinary least squares approaches. Since, we are simulating nonlinear and linear features, this might be an advantage for the tree algorithms. The results are visualized in the following figure.


While the metrics do not speak in favor of one particular algorithm overall, there are several takeaways.

  • The greedy ridge regression (foba) tends to select all important features (M2), however, at the cost of selecting a relatively large amount of irrelevant variables (M3).
  • The component-wise gradient boosting algorithm (glmboost) selects over 78% of the true features on average (M2), while it can reassemble the true feature importance with the lowest mean absolute error (M1). Unlike the ridge regression it does not select a large amount of noise variables (M3).
  • The random forest (rpart) tends to be more restrictive on the amount of features it selects (M2). However, those which get selected seem to be true features (M3).
  • The gradient boosted trees (xgbDART), seem to be the winner for most research questions, since it has the best M2 and M3 scores. On average xgboost finds over 93% of the true features (M2), while selecting a relatively low amount of irrelevant features (M3). However, compared to the other algorithms it cannot really reassemble the true feature importance (M1).

Of course, real-world problems are by far more complex than these simulations. However, I think that at least you can extract some tendencies about real-world problems from this example. If you have any ideas about other metrics or angles I did not cover in this example or you have an idea to improve my function, please feel free to give me some feedback.

Über den Autor
André Bleier

André Bleier

André ist im Data Science Team und organisiert intern unsere Afterwork Veranstaltungen. In seiner Freizeit treibt er Sport und programmiert gerne kleine und große R Codes.