Automated Data Science
From a very philosophical point of view, as humans evolve, we tend to automate repetitive tasks so that we can spend our time on more pleasant matters. The same holds true for the field of data science as a whole, as much as for many tasks at STATWORX. What started as a super fancy and fun profession quickly became tedious work. When the first data science prophets faced their first projects, they realized that rather than coding fancy models all day, you are stuck cleaning data, building and selecting features, selecting algorithms and so on. As data scientists as a species become more and more evolved, it is no wonder they are trying to automate the boring tasks. Having worked in data science for quite a while now, we see that trend as well. The automation of data science is progressing at an incredible pace, and not only because of companies like H2O.ai.
Automation at STATWORX
Most data scientists do what they do out of passion for the entire idea of extracting information from messy data. The same applies to us. Thus, when we are not busy working on projects, we are reading up on the latest developments in the industry. This inspires us to automate the tasks that we find extremely time consuming every single time. For one, there is data prep: everyone who has had to clean up and merge data to make it somewhat interpretable to a machine knows what I am talking about. The next task is feature engineering. Man, that takes time, you know what I mean. Building all kinds of interaction terms and mathematical transformations by hand is quite an effort. The next step, the selection of relevant features, is extremely time consuming as well. Besides taking a lot of time, your feature selection procedure is one of the most important aspects of a solid anti-overfitting campaign. The next task is, of course, the selection of a feasible algorithm. This, again, is very tedious and will probably make it into one of our other articles. This one, though, is devoted to our approach to solving feature selection in a fully automated fashion.
Feature Selection at STATWORX
Most of the time when we face new data, we are, let's say, "charmingly uninformed" about the actual meaning of the data. Thus, we talk to our business partners (they are the ones with the expertise), visualize information, and find statistical relationships to somehow make sense of the data. However, this knowledge often does not suffice to select all relevant features for a forecasting or prediction problem. Thus, we developed an automated way to help us solve our feature selection issues. Our selection approach relies on the cleverness of componentwise boosting and the genius learning procedure of backpropagation. We put everything together in a nice little R package, so that the community can challenge our approach. Sure, it is not the only way to select features, and sure, it is probably not the one solution to select them all. However, for all our use cases in which we have a lot of features and few observations, it is working exceedingly well. In fact, in a controlled simulation environment, we see that our algorithm is much better at selecting relevant features than methods such as correlation-covariance filters, maximum relevance minimum redundancy filters, random forests, penalized linear models, etc. We are currently working on a more generalizable simulation study, so stay tuned and check this blog from time to time, because I will be getting back to this.
bounceR for real now!
Before I start talking about all the stuff we are going to do, I'd rather show you what we have done so far. The algorithm is quite simple, really. By the way, I gave a talk on this lately, so you can check that out on YouTube. So, how does the algorithm work? I am putting some pseudo code below, so you guys can check it out.
Looks legit, right? In principle, what it does is split the feature space into small chunks of randomly drawn features with bootstrapped observations. It repeats this many times to cover as many feature combinations as possible. Then it evaluates each subset and selects the most relevant features within it. The outcomes of all subsets are then aggregated into a global distribution, and this aggregated distribution is what we are essentially interested in. So basically we ask the question: if we simulate little datasets with randomly drawn features and bootstrapped observations, which features will survive in this setting? Features that survive many of these little simulations are strong candidates for the final model. If you want to have a close look at the code, you should check out our GitHub repo.
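To make the idea a bit more tangible, here is a minimal Python sketch of the resampling loop described above. It is not the bounceR code (that lives in the R package), and all function names, parameters, and the toy data are made up for illustration. Each little dataset is evaluated with a bare-bones componentwise L2 boosting, and the stopping rule `min_gain_share` is an assumption of this sketch; the backpropagation part of our approach is left out entirely.

```python
import random
from collections import Counter


def boosted_survivors(X, y, n_steps=20, shrinkage=0.1, min_gain_share=0.1):
    """Componentwise L2 boosting on one small dataset.

    X maps feature name -> list of values, y is the target list.
    Returns the features that were picked in at least one boosting step.
    The stopping rule (min_gain_share) is an assumption of this sketch.
    """
    n = len(y)
    mean_y = sum(y) / n
    residuals = [v - mean_y for v in y]
    total_ss = sum(r * r for r in residuals)
    if total_ss == 0:
        return set()
    # Center every feature so a univariate slope is sum(x * r) / sum(x * x).
    centered, sxx = {}, {}
    for name, values in X.items():
        mean_x = sum(values) / n
        centered[name] = [v - mean_x for v in values]
        sxx[name] = sum(c * c for c in centered[name])
    survivors = set()
    for _ in range(n_steps):
        best_gain, best_name, best_slope = 0.0, None, 0.0
        for name, x in centered.items():
            if sxx[name] == 0:
                continue  # constant column in this bootstrap sample
            sxr = sum(a * b for a, b in zip(x, residuals))
            gain = sxr * sxr / sxx[name]  # squared-error reduction of a full step
            if gain > best_gain:
                best_gain, best_name, best_slope = gain, name, sxr / sxx[name]
        if best_name is None or best_gain < min_gain_share * total_ss:
            break  # no remaining feature explains enough of the target
        survivors.add(best_name)
        x = centered[best_name]
        residuals = [r - shrinkage * best_slope * v for r, v in zip(residuals, x)]
    return survivors


def survival_rates(X, y, n_rounds=300, subset_size=3, seed=1):
    """Share of rounds in which each feature survived, given it was drawn."""
    rng = random.Random(seed)
    names = list(X)
    n = len(y)
    drawn, survived = Counter(), Counter()
    for _ in range(n_rounds):
        subset = rng.sample(names, subset_size)      # random chunk of features
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrapped observations
        X_small = {f: [X[f][i] for i in rows] for f in subset}
        y_small = [y[i] for i in rows]
        drawn.update(subset)
        survived.update(boosted_survivors(X_small, y_small))
    return {f: survived[f] / drawn[f] for f in names if drawn[f]}


# Toy data: only x0 and x1 drive the target, x2..x7 are pure noise.
rng = random.Random(42)
n_obs = 100
X = {f"x{j}": [rng.gauss(0, 1) for _ in range(n_obs)] for j in range(8)}
y = [2.0 * a - 1.5 * b + rng.gauss(0, 0.5) for a, b in zip(X["x0"], X["x1"])]

rates = survival_rates(X, y)
for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rate:.2f}")
```

The dictionary returned by `survival_rates` plays the role of the aggregated distribution: features that keep getting picked across many small, resampled datasets end up with a high survival rate, while noise features only get through occasionally.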
A fully automated feature selection, of course, is just one module in our stack of automated data science tools. Writing about automated data science and about automating my job, I cannot help but wonder about my job security. So, if you are looking for someone with the brightness to make his or her own job obsolete, give me a call… 😉