Feature Selection
Feature Selection is one of the most interesting fields in machine learning in my opinion. It is a boundary point of two different perspectives on machine learning – performance and inference. From a performance point of view, feature selection is typically used to increase the model performance or to reduce the complexity of the problem in order to optimize computational efficiency. From an inference perspective, it is important to extract variable importance to identify key drivers of a problem.
Many people argue that in the era of deep learning feature selection is not important anymore. As a method of representation learning, deep learning models can find important features of the input data on their own. Those features are basically nonlinear transformations of the input data space. However, not every problem is suited to be approached with neural nets (actually, many problems). In many practical ML applications feature selection plays a key role on the road to success.
At STATWORX we work in a lot of dynamic pricing projects. One key piece in the puzzle of dynamic pricing is componentwise Gradient Boosting is an excellent example of such a whitebox machine.
Introducing Xy()
In this blog post I will try to challenge feature selection algorithms with my simulation function Xy()
. The function is designed to simulate regression and classification problems with a maximum degree of freedom for the user. If you want to get a quick introduction to the function I recommend reading my previous blog post, where I have analyzed Ordinary Least Squares coefficients under different signal strengths.
To challenge the feature selection algorithms, we first need to simulate data. By default, Xy()
simulates regression learning data with 1000 observations, two linear and two nonlinear features as well as five randomly generated variables, which are not used to generate the target. We will use these random variables as pitfalls for the feature selection algorithms. To make things a little spicier let’s see how we can generate ten linear, ten nonlinear and three categorical variables along with 50 random variables.
# Simulating regression data
# install the package
devtools::install_github(“andrebleier/Xy”)
# load the library
library(Xy)
fs_sim = Xy(n = obs,
numvars = c(10,10),
catvars = c(3, 2),
noisevars = 50,
task = Xy_task(),
nlfun = function(x) {x^2},
interactions = 1,
sig = c(1,4),
cor = c(0),
weights = c(10,10),
intercept = TRUE,
stn = 4)
print(fs_sim)
Xy Simulation

 + task regression
 + observations 1000
 + interactions 1D
 + signal to noise ratio 4
 + effects
  linear 10
  nonlinear 10
  categorical 3
  noise 50
 + intervals
  correlation 0
  weights [10,10]
  sd [1,4]
Target generating process:
y = 0.218 + 5.4NLIN_01  9.48NLIN_02 + 7NLIN_03  7.59NLIN_04  7.43NLIN_05  0.39NLIN_06  7.72NLIN_07 + 0.02NLIN_08 + 9.05NLIN_09 + 7.75NLIN_10 + 8.76LIN_01 + 7.7LIN_02  0.77LIN_03 + 8.16LIN_04 + 1.06LIN_05 + 4.46LIN_06  4.01LIN_07 + 3.4LIN_08  7.94LIN_09  8.65LIN_10  3.51DUMMY_01__2  10.04DUMMY_02__2  1.63DUMMY_03__2 + e ~ N(0,99.28)
You can extract the feature importance with the varimp()
function. Furthermore, there is a buildin plotting argument, that allows for visualization. I decided to use Boxplots rather than a plain barchart as it visualizes a local and global view on the importance.
# Feature Importance
fs_varimp < varimp(fs_sim, plot = TRUE)
Benchmarking Framework
Ok, we saw how we can generate data, but how about the algorithms? We are going to use four different classical machine learning algorithms.
 Gradient Boosting with
library(xgboost)
 Componentwise Gradient Boosting with
library(mboost)
 Ridge Regression with
library(foba)
 Random Forest with
library(rpart)
Two gradient boosting implementations, a greedy ridge regression and a random forest. If you want to read up on the mechanics of the algorithms I strongly recommend the package documentations.
In the end we want to elect a winner of course, thus we have to define the rules. There is not much literature on this matter, so I created three measures which made sense to me. Why three though? I think there are several angles to this problem. Sure a mean absolute error () between the real and estimated importance could be the first approach.
This gives us a feeling of how close an algorithm gets to the real importance. However, one could argue, that the deviation itself is not as interesting as the plain number of features it got right ().
where are the number of true features selected by the algorithm, whereas is the total number of features.
Another angle to this problem are falsely selected features. Remember, we simulated fifty noise variables in the example above. The last measure aims to clarify the ratio of true to noise features, which have been selected by an algorithm.
where is the aggregated importance for all estimated true features and are the noise variables selected by the algorithm and their summarized importance .
Each algorithm gets a total of 100 randomly drawn hyperparameter sets. The hyperparameter configuration with the lowest crossvalidation error will represent the feature importance of that particular algorithm. Furthermore, we will repeat the data simulation 20 times to create a little more variation. Each time the number of features, noise variables, polynomial degree of nonlinearity will be drawn randomly. You can find the simulation code on my GitHub. The whole simulation took three days on eight cores, so you might want to change the parameters at the beginning of the script.
Evaluation
A further point to note is, that the tree algorithms gradient boosting (xgboost) and the random forest (rpart) can capture nonlinearities while the ridge regression and the componentwise gradient boosting apply ordinary least squares approaches. Since, we are simulating nonlinear and linear features, this might be an advantage for the tree algorithms. The results are visualized in the following figure.
While the metrics do not speak in favor of one particular algorithm overall, there are several takeaways.
 The greedy ridge regression (foba) tends to select all important features (M2), however, at the cost of selecting a relatively large amount of irrelevant variables (M3).
 The componentwise gradient boosting algorithm (glmboost) selects over 78% of the true features on average (M2), while it can reassemble the true feature importance with the lowest mean absolute error (M1). Unlike the ridge regression it does not select a large amount of noise variables (M3).
 The random forest (rpart) tends to be more restrictive on the amount of features it selects (M2). However, those which get selected seem to be true features (M3).
 The gradient boosted trees (xgbDART), seem to be the winner for most research questions, since it has the best M2 and M3 scores. On average xgboost finds over 93% of the true features (M2), while selecting a relatively low amount of irrelevant features (M3). However, compared to the other algorithms it cannot really reassemble the true feature importance (M1).
Of course, realworld problems are by far more complex than these simulations. However, I think that at least you can extract some tendencies about realworld problems from this example. If you have any ideas about other metrics or angles I did not cover in this example or you have an idea to improve my function, please feel free to give me some feedback.
ABOUT US
STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an email addressed to blog(at)statworx.com.