
Using Machine Learning for Causal Inference

  • Expert Markus Berroth
  • Date 13. July 2018
  • Topic Coding, R, Statistics & Methods
  • Format Blog
  • Category Technology

Machine Learning (ML) is still an underdog in the field of economics, although it has gained more and more recognition in recent years. One reason for its underdog status is that economics and other social sciences are interested not only in prediction but also in causal inference. Many “off-the-shelf” ML algorithms therefore solve a fundamentally different problem. We here at STATWORX also face a variety of such problems, e.g. dynamic pricing optimization.

“Prediction by itself is only occasionally sufficient. The post office is happy with any method that predicts correct addresses from hand-written scrawls…[But] most statistical surveys have the identification of causal factors as their ultimate goal.” – Bradley Efron


However, the literature on combining ML and causal inference is growing by the day. One common problem in causal inference is the estimation of heterogeneous treatment effects. We will therefore take a look at three interesting and different approaches to it and focus on a very recent paper by Athey et al., which is forthcoming in “The Annals of Statistics” [1].

Model-based Recursive Partitioning

One of the earlier papers about causal trees is by Zeileis et al. (2008) [2]. They describe an algorithm for Model-based Recursive Partitioning (MOB), which extends recursive partitioning to more complex models. They first fit a parametric model to the data set via maximum likelihood, then test for parameter instability over a set of predefined variables, and finally split the model on the variable with the highest parameter instability. These steps are repeated in each of the daughter nodes until a stopping criterion is reached. However, they do not provide statistical properties for MOB, and the partitions are still quite large.

Bayesian Additive Regression Tree

Another paper uses Bayesian Additive Regression Trees (BART) for the estimation of heterogeneous treatment effects [3]. One advantage of this approach is that BART can detect and handle interactions and non-linearity in the response surface. It uses a sum-of-trees model: first, a weak-learning tree is grown, then the residuals are calculated and the next tree is fitted to these residuals. Similar to boosting algorithms, BART wants to avoid overfitting. This is achieved by using a regularization prior, which restricts overfitting and the contribution of each tree to the final result.
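The residual-fitting idea behind a sum-of-trees model can be sketched in a few lines. The example below is written in plain Python rather than R and is only a boosting-style illustration, not BART itself: BART samples its trees in a Bayesian fashion, whereas this sketch simply uses a fixed shrinkage factor as a stand-in for the regularization prior. The function names, the single feature and the one-split trees (stumps) are all assumptions made for illustration.

```python
from statistics import mean


def fit_stump(x, residuals):
    """Fit a one-split tree to the residuals (needs at least two distinct x values)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_sum_of_stumps(x, y, n_trees=50, shrinkage=0.1):
    """Grow weak trees one after another, each on the residuals of the previous fit."""
    stumps = []
    fitted = [0.0] * len(y)
    for _ in range(n_trees):
        residuals = [yi - fi for yi, fi in zip(y, fitted)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Shrinkage limits how much a single tree contributes to the final result.
        fitted = [fi + shrinkage * predict_stump(stump, xi) for fi, xi in zip(fitted, x)]
    return stumps


def predict_sum_of_stumps(stumps, xi, shrinkage=0.1):
    return sum(shrinkage * predict_stump(stump, xi) for stump in stumps)
```

Because every tree only adds a small, shrunken correction, no single tree can dominate the fit, which is the intuition behind restricting each tree's contribution.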

Generalized Random Forest

However, this and the next blog post will mainly focus on the Generalized Random Forest (GRF) by Athey et al., who have already explored the possibilities of ML in economics before. It is a method for non-parametric statistical estimation which uses the basic ideas of the Random Forest. It therefore keeps the recursive partitioning, subsampling and random split selection. Nevertheless, the final outcome is not estimated via simple averaging over the trees. Instead, the forest is used to estimate an adaptive weighting function: we grow a set of trees, and each observation is weighted by how often it falls into the same leaf as the target observation. Those weights are used to solve a “local GMM” model.
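To make the weighting idea concrete, here is a minimal sketch in plain Python. It assumes the leaf membership of every training observation and of the target point is already known for each tree; the data layout and function names are hypothetical and not part of the grf package. Following my reading of the paper, each tree's contribution is divided by the size of the leaf, so that the weights sum to one. The weighted mean at the end is the simplest possible "local" estimate and corresponds to the regression case.

```python
def forest_weights(train_leaves, target_leaves):
    """Weight each training observation by how often it shares a leaf with the target.

    train_leaves[b][i] is the leaf id of training observation i in tree b,
    target_leaves[b] is the leaf id of the target observation in tree b.
    """
    n_trees = len(train_leaves)
    n_obs = len(train_leaves[0])
    weights = [0.0] * n_obs
    for leaves, target_leaf in zip(train_leaves, target_leaves):
        neighbours = [i for i, leaf in enumerate(leaves) if leaf == target_leaf]
        for i in neighbours:
            # Each tree spreads a total weight of 1 / n_trees over the target's leaf.
            weights[i] += 1.0 / (len(neighbours) * n_trees)
    return weights


def weighted_mean(y, weights):
    """Local estimate for the regression case: a weighted average of the outcomes."""
    return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)


# Toy example: three trees, four training observations.
train_leaves = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
target_leaves = [0, 1, 0]
weights = forest_weights(train_leaves, target_leaves)
estimate = weighted_mean([1, 2, 3, 4], weights)
```

For other estimation targets, such as quantiles or treatment effects, the same weights would enter a weighted estimating equation instead of a weighted mean.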

Another important piece of the GRF is the split selection algorithm, which emphasizes maximizing heterogeneity. With this framework, a wide variety of applications is possible, such as quantile regression but also the estimation of heterogeneous treatment effects. The split selection must therefore be suitable for many different purposes. As in Breiman’s Random Forest, splits are selected greedily. However, in the case of general moment estimation, we don’t have a direct loss criterion to minimize. Instead, we want to maximize a criterion ∆ which favors splits that increase the heterogeneity of our in-sample estimates. Maximizing ∆ directly, however, would be computationally costly, so Athey et al. use a gradient-based approximation for it. This results in computational performance similar to standard CART approaches.
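As a rough illustration of a heterogeneity-maximizing split, the sketch below handles only the simplest case, in which the quantity estimated in each child node is the mean of the outcome. The specific form of the criterion used here, the share-weighted squared difference of the two child estimates, is my assumption for this special case and may differ in detail from the paper; it is also computed directly rather than through the gradient-based approximation. The function names are hypothetical.

```python
from statistics import mean


def heterogeneity(left, right):
    """Criterion for one candidate split: large when the child estimates differ."""
    n_left, n_right = len(left), len(right)
    n_parent = n_left + n_right
    return n_left * n_right / n_parent ** 2 * (mean(left) - mean(right)) ** 2


def best_split(x, y):
    """Greedily pick the threshold on x that maximizes the heterogeneity criterion."""
    best_threshold, best_score = None, float("-inf")
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        score = heterogeneity(left, right)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score
```

In this mean-only case, maximizing the criterion picks the same split as minimizing the squared error in CART, which fits the claim that the regression forest is a special case of the general framework.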

Comparing the regression forest of GRF to standard random forest

Athey et al. claim in their paper that, in the special case of a regression forest, the GRF yields the same results as the standard random forest by Breiman (2001). One estimation method already implemented in the grf package [4] is a regression forest. I will therefore compare its results with the random forest implementations of the randomForest package and the ranger package. For tuning purposes, I will use a random search with 50 iterations for randomForest and ranger, and the built-in tune_regression_forest() function for grf. The algorithms will be benchmarked on three data sets, using the RMSE to compare the results. For easy handling, I implemented regression_forest() in the caret framework; the implementation can be found on my GitHub.
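The two ingredients of this benchmark, the RMSE and a random search over tuning parameters, can be sketched generically as follows. This is plain Python and not the caret setup used for the results; the evaluate callback, the parameter names in the usage note and the default seed are placeholders chosen for illustration.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def random_search(evaluate, space, n_iter=50, seed=1):
    """Draw n_iter random parameter combinations and keep the one with the lowest score.

    evaluate(params) is expected to return a validation RMSE for the given parameters,
    space maps each parameter name to a list of candidate values.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A call could look like random_search(evaluate, {"mtry": [1, 2, 3], "min_node_size": [1, 5, 10]}), where evaluate fits a forest with the given parameters and returns its cross-validated RMSE.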

Data Set   Metric   grf    ranger   randomForest
air        RMSE     0.25   0.24     0.24
bike       RMSE     2.90   2.41     2.67
gas        RMSE     36.0   32.6     34.4

The GRF performs slightly worse than the other implementations. However, this could also be due to the tuning of the parameters, because the GRF has more parameters to tune. According to their GitHub, the authors are planning to improve the tune_regression_forest() function.

One advantage of the GRF is that it produces unbiased confidence intervals for each estimation point. To do so, the authors perform honest tree splitting, which was first described in their paper about causal trees [5]. With honest tree splitting, one sample is used to make the splits and another distinct sample is used to estimate the coefficients.
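The idea of honest splitting can be sketched with a single split. In the plain-Python example below, the threshold is chosen on one set of observations and the leaf means are then estimated on a second, disjoint set. The function name, the explicit index lists and the restriction to one feature and one split are assumptions made to keep the illustration short; they are not how the grf package is organized internally.

```python
from statistics import mean


def honest_stump(x, y, split_idx, estimate_idx):
    """Choose the split on one sample, estimate the leaf values on another."""
    xs = [x[i] for i in split_idx]
    ys = [y[i] for i in split_idx]

    # Step 1: pick the threshold using only the splitting sample.
    best_threshold, best_sse = None, float("inf")
    for threshold in sorted(set(xs))[:-1]:
        left = [yi for xi, yi in zip(xs, ys) if xi <= threshold]
        right = [yi for xi, yi in zip(xs, ys) if xi > threshold]
        sse = sum((yi - mean(left)) ** 2 for yi in left)
        sse += sum((yi - mean(right)) ** 2 for yi in right)
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse

    # Step 2: estimate the leaf means using only the estimation sample.
    left = [y[i] for i in estimate_idx if x[i] <= best_threshold]
    right = [y[i] for i in estimate_idx if x[i] > best_threshold]
    left_mean = mean(left) if left else None
    right_mean = mean(right) if right else None
    return best_threshold, left_mean, right_mean
```

Since the outcomes used for the leaf estimates played no role in choosing the split, the estimates are not tuned to the noise that determined the tree structure.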

However, standard regression is not the exciting part of the Generalized Random Forest. Therefore, I will take a look at how the GRF performs in estimating heterogeneous treatment effects with simulated data and compare it to the estimation results of the MOB and the BART in my next blog post.


  1. Athey, Tibshirani, Wager. Forthcoming. “Generalized Random Forests.”
  2. Zeileis, Hothorn, Hornik. 2008. “Model-based Recursive Partitioning.”
  3. Hill. 2011. “Bayesian Nonparametric Modeling for Causal Inference.”
  4. https://github.com/swager/grf
  5. Athey and Imbens. 2016. “Recursive partitioning for heterogeneous causal effects.”
