
bounceR 0.1.2: Automated Feature Selection

Lukas Strömsdörfer Blog, Data Science

New Features

As promised, we kept working on our bounceR package. For one, we changed the interface: users no longer have to choose a number of tuning parameters that, thanks to my somewhat cryptic documentation, sound more complicated than they are. Inspired by the idea of letting users set how long they are willing to wait, instead of a number of cryptic tuning parameters, we added a similar option: you now simply specify a maximum run time.


Further, we changed the source code quite a bit. Henrik Bengtsson gave a very inspiring talk on parallelization using the genius future package at this year's eRum conference. A couple of days later, Davis Vaughan released furrr, an incredibly smart (kudos!) wrapper on top of the no less genius purrr package. Davis' package combines purrr's mapping functions with future's parallelization madness. As you can tell, I am a big fan of all three packages. Inspired by these new inventions, we wanted to make use of them in our package, so the entire parallelization setup of bounceR now builds on furrr. This makes the parallelization smarter and faster, and it works seamlessly across operating systems.
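To give a feeling for what this combination looks like, here is a minimal sketch of the general furrr pattern. It is not bounceR's internal code: the function slow_square() and the worker count are made up for illustration, and it assumes the future and furrr packages are installed.

```r
library(future)
library(furrr)

# Hypothetical stand-in for an expensive step, e.g. fitting one model
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Choose a backend once; multisession runs on Windows, macOS and Linux
plan(multisession, workers = 2)

# Same interface as purrr::map_dbl(), but the iterations run in parallel
results <- future_map_dbl(1:8, slow_square)

# Switch back to sequential processing when done
plan(sequential)
```

The nice part of this design is that only the plan() call knows about the backend; the mapping code itself stays the same whether it runs sequentially or in parallel.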

Practical Example

So, let's see how you can use it now. Let's start by downloading the package.

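# bounceR is not on CRAN yet, so you need devtools to install it from GitHub
install.packages("devtools")

# now you are good to go
# (repository path assumed to be STATWORX/bounceR; check the project page if it differs)
devtools::install_github("STATWORX/bounceR")

# now you can load it like any other package
library(bounceR)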

To show how the feature selection works, we need some data, so let's simulate some with our sim_data() function.

# simulate some data
data <- sim_data(n = 100,
                 modelvars = 10,
                 noisevars = 300)
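If you want a rough mental model of such a data set without the package, the following base R sketch builds something comparable. It is only an assumption about the general shape (a target driven by a few features, plus many pure-noise columns); the column names, the coefficients and the data-generating process are made up and may differ from what sim_data() actually does.

```r
set.seed(42)

n <- 100          # observations
modelvars <- 10   # features that actually drive the target
noisevars <- 300  # features unrelated to the target

# Draw all features as independent standard normal variables
X <- matrix(rnorm(n * (modelvars + noisevars)), nrow = n)
colnames(X) <- c(paste0("x", seq_len(modelvars)),
                 paste0("noise", seq_len(noisevars)))

# The target depends only on the first `modelvars` columns (hypothetical weights)
beta <- runif(modelvars, min = 0.5, max = 2)
y <- as.numeric(X[, seq_len(modelvars)] %*% beta + rnorm(n))

# One target column plus 310 feature columns
toy_data <- data.frame(y = y, X)
dim(toy_data)
```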

Now you guys can all imagine that with 310 features and only 100 observations, building models could be a little challenging. To be able to model the target nonetheless, you need to reduce your feature space. There are numerous ways to do so. In my last blog post I described our solution. Let's see how to use our algorithm.

# run our algorithm
selected_features <- featureSelection(data = data,
                                      target = "y",
                                      max_time = "30 mins",
                                      bootstrap = "regular",
                                      early_stopping = "aic",
                                      parallel = TRUE)

What can you expect to get out of it? Well, we return a list that contains, of course, the optimal formula found by our algorithm. You also get a stability matrix, which shows a ranking of the features by importance. Additionally, we built in some convenient S4 methods, so you can easily access all the information you need.
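To make the idea of a time budget and a stability ranking more tangible, here is a deliberately simple base R sketch. It is not bounceR's algorithm: it just bootstraps the rows, draws random feature subsets, fits a linear model and counts how often each feature looks significant, until a time budget or an iteration cap is reached. All names, the 5-second budget, the subset size and the p-value threshold are assumptions chosen for illustration.

```r
set.seed(1)

# Toy data: x1 and x2 drive y, the remaining columns are noise
n <- 100
X <- as.data.frame(matrix(rnorm(n * 20), nrow = n))
names(X) <- paste0("x", 1:20)
toy <- cbind(y = 2 * X$x1 - 3 * X$x2 + rnorm(n), X)

features <- setdiff(names(toy), "y")
picked <- setNames(numeric(length(features)), features)  # times judged useful
tried  <- picked                                          # times drawn

deadline   <- Sys.time() + 5  # time budget in seconds (hypothetical)
max_rounds <- 200             # hard cap so the loop always terminates
rounds     <- 0

while (Sys.time() < deadline && rounds < max_rounds) {
  rounds <- rounds + 1
  rows <- sample(n, replace = TRUE)   # bootstrap sample of observations
  vars <- sample(features, 5)         # random subset of features
  fit  <- lm(reformulate(vars, response = "y"), data = toy[rows, ])
  pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept
  tried[vars] <- tried[vars] + 1
  keep <- names(pvals)[pvals < 0.05]
  picked[keep] <- picked[keep] + 1
}

# Share of draws in which each feature looked useful, best first
stability <- sort(picked / pmax(tried, 1), decreasing = TRUE)
head(stability, 5)
```

In this toy setup, x1 and x2 should end up at the top of the ranking, while the noise columns are only picked occasionally by chance.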


I hope I could tease you a little into checking out the package and helping us improve it further. We are currently developing two new algorithms for feature selection, and we plan to include both in the next iteration. I am looking forward to your comments, issues and thoughts on the package.

Cheers Guys!

About the author
Lukas Strömsdörfer


I am a data scientist at STATWORX. Apart from automating my job, I enjoy taking my vintage bike for a spin and am building an ML tool that lets me become a below-average gardener.

