As promised, we kept working on our bounceR package. For one, we changed the interface: users no longer have to choose a number of tuning parameters that, thanks to my somewhat cryptic documentation, sound more complicated than they are. Inspired by the H2O.ai feature that lets users set how long they are willing to wait instead of choosing a number of cryptic tuning parameters, we added a similar option.
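To make the idea of a time budget concrete, here is a minimal sketch in base R. It is not how bounceR implements it internally; `run_for()` is a hypothetical helper that simply repeats a unit of work until the allotted time has passed.

```r
# Hypothetical helper (not part of bounceR): repeat `step` until the
# time budget, given in seconds, is used up, and count the rounds.
run_for <- function(step, seconds) {
  deadline <- Sys.time() + seconds
  rounds <- 0
  while (Sys.time() < deadline) {
    step()
    rounds <- rounds + 1
  }
  rounds
}

# Each "round" here just sleeps briefly; in practice it would fit a model.
rounds <- run_for(function() Sys.sleep(0.05), seconds = 0.5)
rounds
```

The appeal of this design is that the user only states how long they are willing to wait, and the number of iterations follows from that.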
Further, we changed the source code quite a bit. Henrik Bengtsson gave a very inspiring talk on parallelization using the genius future package at this year's eRum conference. A couple of days later, Davis Vaughan released furrr, an incredibly smart (kudos) wrapper on top of the no less genius purrr package. Davis' package combines purrr's mapping functions with future's parallelization madness. As you can tell, I am a big fan of all three packages. Inspired by these new inventions, we wanted to make use of them in our package. So the entire parallelization setup of bounceR now leverages furrr. This way, the parallelization is much smarter and faster, and it works seamlessly across different operating systems.
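For readers who have not seen furrr yet, here is a small sketch of the general pattern, assuming the future and furrr packages are installed. It is not taken from the bounceR source; `slow_square()` is a made-up stand-in for an expensive model fit.

```r
library(future)
library(furrr)

# Run work in background R sessions; this backend is available on
# Windows, macOS and Linux alike.
plan(multisession, workers = 2)

# Hypothetical stand-in for one expensive computation
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Same call shape as purrr::map_dbl(), but evaluated in parallel
res <- future_map_dbl(1:4, slow_square)

# Switch back to sequential processing when done
plan(sequential)
```

The nice part is that the mapping code stays the same whether the plan is sequential or parallel; only the `plan()` call changes.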
So let's see how you can use it now. We start by downloading the package.
# you need devtools, cause we are just about to get it to CRAN, though we are not that far
library(devtools)
# now you are good to go
devtools::install_github("STATWORX/bounceR")
# now you can source it like every normal package
library(bounceR)
To show how the feature selection works, we need some data, so let's simulate some with our sim_data() function.
# simulate some data
data <- sim_data(n = 100, modelvars = 10, noisevars = 300)
Now you can all imagine that with 310 features and only 100 observations, building models can be a little challenging. To model the target nonetheless, you need to reduce your feature space. There are numerous ways to do so. In my last blog post, I described our solution. Let's see how to use our algorithm.
# run our algorithm
selected_features <- featureSelection(data = data,
                                      target = "y",
                                      max_time = "30 mins",
                                      bootstrap = "regular",
                                      early_stopping = "aic",
                                      parallel = TRUE)
What can you expect to get out of it? Well, we return a list containing, of course, the optimal formula calculated by our algorithm. You also get a stability matrix, which ranks the features by importance. Additionally, we built in some convenient S4 methods, so you can easily access all the information you need.
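This post does not spell out how the stability matrix is computed, so the following is only a toy illustration of the general idea of ranking features by how often they survive across bootstrap rounds. It uses base R only; the data, the p-value cutoff of 0.05 and the subset size of 5 are all assumptions made for this sketch, not bounceR's actual settings.

```r
set.seed(1)

# Toy data: 2 informative features (x1, x2) and 18 pure-noise features
n <- 100
toy <- as.data.frame(matrix(rnorm(n * 20), nrow = n))
names(toy) <- paste0("x", 1:20)
toy$y <- 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

features <- setdiff(names(toy), "y")
rounds <- 200
drawn <- kept <- setNames(numeric(length(features)), features)

for (r in seq_len(rounds)) {
  rows <- sample(n, replace = TRUE)        # bootstrap the observations
  cand <- sample(features, 5)              # random subset of features
  fit <- lm(reformulate(cand, response = "y"), data = toy[rows, ])
  p <- summary(fit)$coefficients[-1, 4]    # p-values without the intercept
  survivors <- names(p)[p < 0.05]          # assumed cutoff for this sketch
  drawn[cand] <- drawn[cand] + 1
  kept[survivors] <- kept[survivors] + 1
}

# Share of rounds in which a feature survived, given that it was drawn
stability <- sort(kept / drawn, decreasing = TRUE)
head(stability)
```

In this toy setup the two informative features should end up at the top of the ranking, while the noise features survive only occasionally.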
I hope this teaser makes you want to check out the package and help us improve it further. We are currently developing two new algorithms for feature selection, which we will implement in the next iteration. I am looking forward to your comments, issues and thoughts on the package.