## Introduction

Sometimes, here at STATWORX, impromptu discussions about statistical methods happen. In one such discussion, one of my colleagues decided to declare (albeit jokingly) Bayesian statistics unnecessary. This made me ask myself: Why would I ever use Bayesian models in the context of a standard regression problem? Surely, existing approaches such as Ridge Regression are just as good, if not better. However, the Bayesian approach has the advantage that it lets you regularize your model to prevent overfitting **and** meaningfully interpret the regularization parameters. Contrary to the usual way of looking at ridge regression, the regularization parameters are no longer abstract numbers but can be interpreted through the Bayesian paradigm as derived from prior beliefs. In this post, I’ll show you the formal similarity between a generalized ridge estimator and its Bayesian equivalent.

## A (very brief) primer on Bayesian Stats

To understand the Bayesian regression estimator a minimal amount of knowledge about Bayesian statistics is necessary, so here’s what you need to know (if you don’t already): In Bayesian statistics we think about model parameters (i.e. regression coefficients) probabilistically. In other words, the data given to us is fixed, and the parameters are considered random. This runs counter to the standard, frequentist perspective in which the underlying model parameters are treated as fixed, while the data are considered random realizations of the stochastic process driven by those fixed model parameters. The end goal of Bayesian analysis is to find the posterior distribution, which you may remember from Bayes Rule:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

While $p(y \mid \theta)$ is our likelihood and $p(y)$ is a normalizing constant, $p(\theta)$ is our prior, which does not depend on the data, $y$. In classical statistics, $p(\theta)$ is set to 1 (an improper reference prior), so that when the posterior ‘probability’ is maximized, really just the likelihood is maximized, because it’s the only part that still depends on $\theta$. However, in Bayesian statistics we use an actual probability distribution in place of $p(\theta)$, a Normal distribution for example. So let’s consider the case of a regression problem, and we’ll assume that our target, $y$, **and** our prior follow normal distributions. This leads us to **conjugate** Bayesian analysis, in which we can neatly write down an equation for the posterior distribution. In many cases this is actually not possible, and for this reason Markov Chain Monte Carlo methods were invented to sample from the posterior – taking a frequentist approach, ironically.

We’ll make the usual assumption about the data: $y_i$ is i.i.d. $N(x_i^\top \beta, \sigma^2)$ for all observations $i$. This gives us our standard likelihood for the Normal distribution. Now we can specify the prior for the parameter we’re trying to estimate, $\beta$. If we choose a Normal prior (conditional on the variance, $\sigma^2$) for the vector of weights, i.e. $\beta \mid \sigma^2 \sim N(b_0, \sigma^2 B_0)$, and an inverse-Gamma prior over the variance parameter, it can be shown that the posterior distribution for $\beta$ is Normally distributed with mean

$$\bar{b} = (B_0^{-1} + X^\top X)^{-1}(B_0^{-1} b_0 + X^\top X \hat{\beta})$$

If you’re interested in a proof of this result check out Jackman (2009, p.526).

Let’s look at it piece by piece:

– $\hat{\beta} = (X^\top X)^{-1} X^\top y$ is our standard OLS estimator,

– $b_0$ is the mean vector of the (multivariate normal) prior distribution, so it lets us specify what we think the average values of each of our model parameters are,

– $B_0$ is the covariance matrix and contains our respective uncertainties about the model parameters. The inverse of the variance is called the **precision**.

What we can see from the equation is that the mean of our posterior is a **precision weighted average** of our prior mean (information not based on data) and the OLS estimator (based solely on the data). The second term in parentheses indicates that we are taking the precision weighted prior mean, $B_0^{-1} b_0$, and adding it to the precision weighted OLS estimator, $X^\top X \hat{\beta}$. Imagine for a moment that $B_0^{-1} = 0$. Then

$$\bar{b} = (X^\top X)^{-1}(X^\top X \hat{\beta}) = \hat{\beta}$$

This would mean that we are **infinitely** uncertain about our prior beliefs, so that the mean vector of our prior distribution vanishes, contributing nothing to our posterior! Likewise, if our uncertainty decreases (and the precision thus increases), the prior mean, $b_0$, contributes more to the posterior mean.
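The post’s own code is in R, but the precision-weighting logic is easy to verify numerically. Here is a minimal NumPy sketch (with simulated data and made-up prior values, not the diamonds example) showing that the posterior mean collapses to the OLS estimator when the prior precision is zero, and is pulled towards the prior mean when the precision is high:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated design matrix and response (two predictors, true beta = [2, -1])
n = 200
X = rng.normal(size=(n, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# OLS estimator: (X'X)^-1 X'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Prior: mean b0, precision B0^-1 (large entries = small uncertainty)
b0 = np.array([0.0, 0.0])
B0_inv = np.eye(2) * 1000.0  # very high precision, prior should dominate

# Posterior mean: (B0^-1 + X'X)^-1 (B0^-1 b0 + X'X beta_ols)
post_mean = np.linalg.solve(B0_inv + X.T @ X, B0_inv @ b0 + X.T @ X @ beta_ols)

# With zero precision (infinite prior uncertainty) the posterior mean is exactly OLS
post_mean_flat = np.linalg.solve(np.zeros((2, 2)) + X.T @ X, X.T @ X @ beta_ols)
```

With the high-precision prior, `post_mean` sits much closer to `b0` than the OLS estimate does; with zero precision, the prior contributes nothing.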

After this short primer on Bayesian statistics, we can now formally compare the Ridge estimator with the above Bayesian estimator. But first, we need to take a look at a more general version of the Ridge estimator.

## Generalizing the Ridge estimator

A standard tool used in many regression problems, the standard Ridge estimator is derived by solving a least squares problem from the following loss function:

$$L(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

While minimizing this gives us the standard Ridge estimator you have probably seen in textbooks on the subject, there’s a slightly more general version of this loss function:

$$L(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} (\beta_j - \mu_j)^2$$

Let’s derive the estimator by first re-writing the loss function in terms of matrices:

$$L(\beta) = (y - X\beta)^\top (y - X\beta) + \lambda (\beta - \mu)^\top (\beta - \mu)$$

Differentiating with respect to the parameter vector, we end up with this expression for the gradient:

$$\frac{\partial L}{\partial \beta} = -2 X^\top (y - X\beta) + 2 \lambda (\beta - \mu)$$

So, minimizing over $\beta$, we get this expression for the generalized ridge estimator:

$$\hat{\beta}_{ridge} = (X^\top X + \lambda I)^{-1}(X^\top y + \lambda \mu)$$

The standard Ridge estimator can be recovered by setting $\mu = 0$. Usually, we regard $\lambda$ as an abstract parameter that regulates the penalty size, and $\mu$ as a vector of values (one for each predictor) that increases the loss the further the coefficients deviate from these values. When $\mu = 0$, the coefficients are pulled towards zero.
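To make the closed form concrete, here’s a small NumPy sketch of the generalized ridge estimator (simulated data; the function name and values are illustrative, not from the original R code):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

def generalized_ridge(X, y, lam, mu):
    """Closed-form generalized ridge: (X'X + lam*I)^-1 (X'y + lam*mu)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y + lam * mu)

mu = np.array([1.0, 2.0, 3.0])   # 'prior' values for the coefficients
beta_small = generalized_ridge(X, y, lam=0.01, mu=mu)  # close to OLS
beta_large = generalized_ridge(X, y, lam=1e6, mu=mu)   # pulled towards mu

# With mu = 0 and a huge penalty, the coefficients are pulled towards zero
beta_zero = generalized_ridge(X, y, lam=1e6, mu=np.zeros(3))
```

With a huge penalty the estimate is pulled all the way to $\mu$, and with $\mu = 0$ it collapses towards zero, matching the shrinkage behaviour described above.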

Let’s take a look at how the estimator behaves when the parameters $\lambda$ and $\mu$ change. We’ll define a meaningful ‘prior’ for our example and then vary the penalty parameter. As an example, we’ll use the `diamonds` dataset from the `ggplot2` package and model the price as a linear function of the carat, depth, table, x, y and z attributes of each diamond.

![penalty_plots](penalty_plots.tiff)

As we can see from the plot, both with and without a prior, the coefficient estimates change rapidly for the first few increases in the penalty size. We also see the ‘shrinkage’ effect in the upper plot: as the penalty increases, the coefficients tend towards zero, some faster than others. The plot on the right shows how the coefficients change when we set a sensible ‘prior’. The coefficients still change, but they now tend towards the ‘prior’ we specified. This is because $\lambda$ penalizes deviations from our $\mu$, which means that larger values of the penalty pull the coefficients towards $\mu$. You might be asking yourself how this compares to the Bayesian estimator. Let’s find out!

## Comparing the Ridge and Bayesian Estimator

Now that we’ve seen both the Ridge and the Bayesian estimators, it’s time to compare them. We discovered that the Bayesian estimator contains the OLS estimator. Since we know its form, let’s substitute it and see what happens:

$$\bar{b} = (B_0^{-1} + X^\top X)^{-1}(B_0^{-1} b_0 + X^\top y)$$

– $\lambda I$ corresponds to $B_0^{-1}$, the matrix of precisions. In other words, since $I$ is the identity matrix, the ridge estimator assumes no covariances between the regression coefficients and a constant precision across all coefficients (recall that $\lambda$ is a scalar).

– $\lambda \mu$ corresponds to $B_0^{-1} b_0$, which makes sense, since the vector $b_0$ is the mean of our prior distribution, which essentially pulls the estimator towards it, just like $\mu$ ‘shrinks’ the coefficients towards its values. This ‘pull’ depends on the uncertainty captured by $B_0^{-1}$, or by $\lambda$ in the ridge estimator.
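This correspondence can be checked numerically: if we set the prior precision $B_0^{-1} = \lambda I$ and the prior mean $b_0 = \mu$, the Bayesian posterior mean and the generalized ridge estimator coincide exactly. A quick NumPy sketch (simulated data, illustrative values):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 150, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lam = 5.0
mu = np.array([0.5, -0.5, 1.0, 0.0])

# Generalized ridge estimator: (X'X + lam*I)^-1 (X'y + lam*mu)
ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y + lam * mu)

# Bayesian posterior mean with prior precision B0^-1 = lam*I and prior mean b0 = mu
B0_inv = lam * np.eye(p)
bayes = np.linalg.solve(B0_inv + X.T @ X, B0_inv @ mu + X.T @ y)
```

Both solves produce term-by-term identical systems, so the two estimators agree to machine precision.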

That’s all well and good, but let’s see how changing the uncertainty in the Bayesian case compares to the behaviour of the ridge estimator. Using the same data and the same model specification as above, we’ll set the covariance matrix $B_0$ to equal $\lambda I$ and then change $\lambda$. Remember, smaller values of $\lambda$ now imply a greater contribution of the prior (less uncertainty), and increasing them makes the prior less important.

![bayes_penalty_plots](bayes_penalty_plots.tiff)

The above plots match our understanding so far: with a prior mean of zeros, the coefficients are shrunken towards zero when the prior dominates, i.e. when the precision is high, just as in the ridge regression case. And when a prior mean is set, the coefficients tend towards it as the precision increases. So much for the coefficients, but what about the performance? Let’s have a look!

## Performance comparison

Lastly, we’ll compare the predictive performance of the two models. Although we could treat the parameters in the model as hyperparameters to be tuned, this would defeat the purpose of using prior knowledge. Instead, let’s choose a prior specification for both models and then compare the performance on a hold-out set (30% of the data). While we can use the simple point prediction $X\hat{\beta}_{ridge}$ for the Ridge model, the Bayesian model provides us with a full posterior predictive distribution, which we can sample from to get model predictions. To estimate the model I used the `brms` package.

| | RMSE | MAE | MAPE |
| :-------------------- | ------: | ------: | ----: |
| Bayesian Linear Model | 1625.38 | 1091.36 | 44.15 |
| Ridge Estimator | 1756.01 | 1173.50 | 43.44 |

Overall, both models perform similarly, although some error metrics slightly favor one model over the other. Judging by these errors, we could certainly improve our models by specifying a more appropriate probability distribution for our target variable. After all, prices cannot be negative, yet our models can and do produce negative predictions.

## Recap

In this post, I’ve shown you how the ridge estimator compares to the Bayesian conjugate linear model. Now you understand the connection between the two models and how a Bayesian approach can provide a more readily interpretable way of regularizing your model. Normally, $\lambda$ would be considered a penalty size, but now it can be interpreted as a measure of prior uncertainty. Similarly, the parameter vector $\mu$ can be seen as a vector of prior means for our model parameters in the extended ridge model. As far as the Bayesian approach goes, we can also use prior distributions to incorporate expert knowledge in the estimation process. This regularizes your model and allows for the incorporation of external information. If you are interested in the code, check it out at our [GitHub page](https://github.com/STATWORX/blog/tree/master/bayesian_regularization)!

## References

– Jackman S. 2009. Bayesian Analysis for the Social Sciences. West Sussex: Wiley.

At STATWORX we are excited that a new promising field of Machine Learning has evolved in recent years: Causal Machine Learning. In short, Causal Machine Learning is the scientific study of Machine Learning algorithms that allow estimating causal effects. Over the last few years, different Causal Machine Learning algorithms have been developed, combining the advances from Machine Learning with the theory of causal inference to estimate different types of causal effects. My colleague Markus has already introduced some of these algorithms in an earlier blog post.

As Causal Machine Learning is a rather complex topic, I will write a series of blog posts to slowly dive into this new fascinating world of data science. In my first blog post, I gave an introduction to the topic, focusing on what Causal Machine Learning is and why it is important in practice and for the future of data science. In this second blog post, I will introduce the so-called Causal Forest, one of the most popular Causal Machine Learning algorithms to estimate heterogeneous treatment effects.

## Why Heterogeneous Treatment Effects?

In Causal Forests, the goal is to estimate heterogeneity in treatment effects. As explained in my previous blog post, a treatment effect refers to a causal effect of a treatment or intervention on an outcome variable of scientific or political interest, for example the causal effect of a subsidized training program on earnings. As individual treatment effects are unobservable, practice focuses on estimating unbiased and consistent averages of the individual treatment effect. The most common parameter thereof is the average treatment effect, which is the mean of all individual treatment effects in the entire population of interest. However, sometimes treatment effects may vary widely between different subgroups in the population, be it larger or smaller than the average treatment effect. In some cases, it might therefore be more interesting to estimate these different, i.e. heterogeneous, treatment effects.

> In most applications it is also interesting to look beyond the average effects in order to understand how the causal effects vary with observable characteristics. (Knaus, Lechner & Strittmatter, 2018)

The estimation of heterogeneous treatment effects can assist in answering questions like: For whom are there big or small treatment effects? For which subgroup does a treatment generate beneficial or adverse effects? In the field of marketing, for example, the estimation of heterogeneous treatment effects can help to optimize resource allocation by answering the question of which customers respond the most to a certain marketing campaign or for which customers is the causal effect of intervention strategies on their churn behavior the highest. Or when it comes to pricing, it might be interesting to quantify how a change in price has a varying impact on sales among different age or income groups.

## Where Old Estimation Methods Fail

Estimating heterogeneous treatment effects is nothing new. Econometrics and other social sciences have long been studying which variables predict a smaller or larger than average treatment effect, which in statistical terms is also known as Moderation. One of the most traditional ways to find heterogeneous treatment effects is to use a Multiple Linear Regression with interaction terms between the variables of interest (i.e. the ones which might lead to treatment heterogeneity) and the treatment indicator. In this blog post, I will always assume that the data is from a randomized experiment, such that the assumptions to identify treatment effects are valid without further complications. We then conclude that the treatment effect depends on the variables whose interaction term is statistically significant. For example, if we have only one variable, the regression model would look as follows:

$$y_i = \beta_0 + \beta_1 D_i + \beta_2 x_i + \beta_3 (D_i \cdot x_i) + \varepsilon_i$$

where $D_i$ is the treatment indicator and $x_i$ is the variable of interest. In that case, if $\beta_3$ is significant, we know that the treatment effect depends on the variable $x$. The treatment effect for each observation can then be calculated as

$$\tau_i = \beta_1 + \beta_3 x_i$$

which is dependent on the value of $x_i$ and therefore heterogeneous among the different observations.
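As a sanity check, the interaction model really does recover a heterogeneous effect when it is linear in $x$. A small simulation sketch in Python (the coefficients and data are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
x = rng.uniform(size=n)
D = rng.integers(0, 2, size=n)          # randomized treatment indicator
# True treatment effect is 2 + 3*x, i.e. heterogeneous in x
y = 1.0 + 2.0 * D + 0.5 * x + 3.0 * D * x + rng.normal(scale=0.1, size=n)

# Design matrix with intercept, treatment, covariate and the interaction D*x
Z = np.column_stack([np.ones(n), D, x, D * x])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)

# Per-observation treatment effect: beta_D + beta_Dx * x
tau_hat = coef[1] + coef[3] * x
```

The fitted coefficients on $D$ and on the interaction recover the true values 2 and 3, so `tau_hat` varies across observations exactly as the heterogeneity implies.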

So why is there a need for more advanced methods to estimate heterogeneous treatment effects? The example above was very simple; it only included one variable. However, usually we have more than one variable which might influence the treatment effect. To see which variables predict heterogeneous treatment effects, we have to include many interaction terms, not only between each variable and the treatment indicator but also for all possible combinations of variables with and without the treatment indicator. If we have $p$ variables and one treatment, this gives a total number of parameters of:

$$2^{p+1}$$

So, for example, if we had 5 variables, we would have to include a total of $2^6 = 64$ parameters in our Linear Regression Model. This approach suffers from a lack of statistical power and could also cause computational issues. The use of a Multiple Linear Regression also imposes linear relationships unless more interactions with polynomials are included. Because Machine Learning algorithms can handle enormous numbers of variables and combine them in nonlinear and highly interactive ways, researchers have found ways to better estimate heterogeneous treatment effects by combining the field of Machine Learning with the study of Causal Inference.
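The count $2^{p+1}$ is just the number of subsets of the $p$ covariates plus the treatment indicator (the empty subset being the intercept). A tiny Python sketch makes this concrete:

```python
from itertools import combinations

def n_parameters(p):
    """Count regression parameters when every combination of the p covariates
    and the treatment indicator (p + 1 terms in total) enters the model,
    including the intercept (the empty combination)."""
    terms = [f"x{j}" for j in range(1, p + 1)] + ["D"]
    count = 0
    for k in range(len(terms) + 1):
        count += len(list(combinations(terms, k)))
    return count
```

For 5 variables this gives 64 parameters, and for the 10 variables used later in the simulation study it gives 2048.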

## Generalized Random Forests

Over recent years, different Machine Learning algorithms have been developed to estimate heterogeneous treatment effects. Most of them are based on the idea of Decision Trees or Random Forests, just like the one I focus on in this blog post: Generalized Random Forests by Athey, Tibshirani, and Wager (2019).

Generalized Random Forests follow the idea of Random Forests, and apart from heterogeneous treatment effect estimation, the algorithm can also be used for non-parametric quantile regression and instrumental variable regression. It keeps the main structure of Random Forests, such as the recursive partitioning, subsampling, and random split selection. However, instead of averaging over the trees, Generalized Random Forests estimate a weighting function and use the resulting weights to solve a local GMM model. To estimate heterogeneous treatment effects, this algorithm has two important additional features, which distinguish it from standard Random Forests.

### 1. Splitting Criterion

The first important difference to Random Forests is the splitting criterion. In Random Forests, where we want to predict an outcome variable $Y$, the split at each tree node is performed by minimizing the mean squared error of the outcome variable $Y$. In other words, the variable and value to split on at each tree node are chosen such that the greatest reduction in the mean squared error with regard to the outcomes is achieved. After each tree partition has been completed, the tree’s prediction for a new observation is obtained by letting it trickle through from the tree’s root all the way into a terminal node, and then taking the average of the outcomes of all the observations that fell into the same node during training. The Random Forest prediction is then calculated as the average of the predicted tree values.

In Causal Forests, we want to estimate treatment effects. As stated by the Fundamental Problem of Causal Inference, however, we can never observe a treatment effect on an individual level. Therefore, the prediction of a treatment effect is given by the difference in the average outcomes between the treated and the untreated observations in a terminal node. Without going into too much detail, to find the most heterogeneous but also accurate treatment effects, the splitting criterion is adapted such that it searches for a partitioning where the treatment effects differ the most, including a correction that accounts for how the splits affect the variance of the parameter estimates.

### 2. Honesty

Random Forests are usually evaluated by applying them to a test set and measuring the accuracy of the predictions of $Y$ using an error measure such as the mean squared error. Because we can never observe treatment effects, this form of performance measure is not possible in Causal Forests. When estimating causal effects, one therefore evaluates their accuracy by examining the bias, the standard error, and the related confidence interval of the estimates. To ensure that an estimate is as accurate as possible, the bias should asymptotically disappear, and the standard error, and thus the confidence interval, should be as small as possible. To enable this statistical inference in their Generalized Random Forest, Athey, Tibshirani, and Wager introduce so-called *honest trees*.

In order to make a tree honest, the training data is split into two subsamples: a splitting subsample and an estimating subsample. The splitting subsample is used to perform the splits and thus grow the tree. The estimating subsample is then used to make the predictions. That is, all observations in the estimating subsample are dropped down the previously grown tree until they fall into a terminal node. The prediction of the treatment effects is then given by the difference in the average outcomes between the treated and the untreated observations of the estimating subsample in the terminal nodes. With such honest trees, the estimates of a Causal Forest are consistent (i.e. the bias vanishes asymptotically) and asymptotically Gaussian, which, together with the estimator for the asymptotic variance, allows valid confidence intervals.
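The honest estimation step can be sketched in a few lines of Python. This is a deliberately simplified illustration (a single, hand-picked split instead of a grown tree; all values are made up), showing only how the estimating subsample is used to compute leaf-level treatment effects:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
x = rng.uniform(size=n)
D = rng.integers(0, 2, size=n)          # randomized treatment
# True effect: 1 where x < 0.5, 3 where x >= 0.5
tau = np.where(x < 0.5, 1.0, 3.0)
y = 2.0 * x + tau * D + rng.normal(scale=0.2, size=n)

# Honest split: half the data to 'grow' the tree, half to estimate effects
idx = rng.permutation(n)
split_idx, est_idx = idx[: n // 2], idx[n // 2 :]

# Pretend the splitting subsample produced a single split at x = 0.5
# (in a real Causal Forest this threshold comes from the splitting criterion)
threshold = 0.5

# Estimating subsample only: difference in mean outcomes per leaf
effects = {}
for leaf, mask in [("left", x[est_idx] < threshold),
                   ("right", x[est_idx] >= threshold)]:
    y_leaf, d_leaf = y[est_idx][mask], D[est_idx][mask]
    effects[leaf] = y_leaf[d_leaf == 1].mean() - y_leaf[d_leaf == 0].mean()
```

Because the observations used to estimate the leaf effects were never used to choose the split, the leaf-level estimates are unbiased for the true effects (1 and 3 here).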

## Causal Forest in Action

To show the advantages of Causal Forests compared to old estimation methods, in the following I will compare the Generalized Random Forest to a Regression with interaction terms in a small simulation study. I use simulated data to be able to compare the estimated treatment effects with the actual treatment effects, which, as we know, would not be observable in real data. To compare the two algorithms with respect to the estimation of heterogeneous treatment effects, I test them on two different data sets, one with and one without heterogeneity in the treatment effect:

| Data Set | Heterogeneity | Heterogeneity Variables | Variables | Observations |
| --- | --- | --- | --- | --- |
| 1 | No Heterogeneity | – | 10 | 20,000 |
| 2 | Heterogeneity | two covariates | 10 | 20,000 |
This means that in the first data set, all observations have the same treatment effect. In this case, the average treatment effect and the heterogeneous treatment effects are the same. In the second data set, the treatment effect varies with two of the covariates. Without going into too much detail here (I will probably write a separate blog post only about causal data generating processes), the relationship between those heterogeneity variables and the treatment effect is not linear. Both simulated data sets have 20,000 observations containing an outcome variable and 10 covariates with values between zero and one.

To evaluate the two algorithms, the data sets are split into a train (75%) and a test set (25%). For the Causal Forest, I use the `causal_forest()` function from the `grf` package with `tune.parameters = "all"`. I compare this to an `lm()` model, which includes all variables, the treatment indicator and the necessary interaction terms of the heterogeneity variables and the treatment indicator:

**Linear Regression Model for data set with heterogeneity:**

**Linear Regression Model for data set with no heterogeneity:**

where the heterogeneity variables are the two covariates mentioned above and $D$ is the treatment indicator (i.e. $D = 1$ if treated and $D = 0$ if not treated). As already explained above, we usually do not know which variables affect the treatment effect and therefore have to include all possible interaction terms in the Linear Regression Model to see which variables lead to treatment effect heterogeneity. In the case of 10 variables, as we have it here, this means we would have to include a total of $2^{11} = 2048$ parameters in our Linear Regression Model. However, since the heterogeneity variables are known in the simulated data, here I only include the interaction terms for those variables.

| Data Set | Metric | grf | lm |
| --- | --- | --- | --- |
| No Heterogeneity | RMSE | 0.01 | 0.00 |
| Heterogeneity | RMSE | 0.08 | 0.45 |

Looking at the results, we can see that without heterogeneity, the treatment effect is equally well predicted by the Causal Forest (RMSE of 0.01) and the Linear Regression (RMSE of 0.00). However, as the heterogeneity level increases, the Causal Forest is far more accurate (RMSE of 0.08) than the Linear Regression (RMSE of 0.45). As expected, the Causal Forest seems to be better at detecting the underlying non-linear relationship between the heterogeneity variables and the treatment effect than the Linear Regression Model, which can also be seen in the plots below. Thus, even if we already know which variables influence the treatment effect and only need to include the necessary interaction terms, the Linear Regression Model is still less accurate than the Causal Forest due to its lack of modeling flexibility.

## Outlook

I hope that this blog post has helped you to understand what Causal Forests are and what advantages they bring in estimating heterogeneous treatment effects compared to old estimation methods. In my upcoming blog posts on Causal Machine Learning, I will explore this new field of data science further. I will, for example, take a look at the problems of using classical Machine Learning algorithms to estimate causal effects in more detail or introduce different data generating processes to evaluate Causal Machine Learning methods in simulation studies.

## References

- Athey, S., Tibshirani, J., & Wager, S. (2019). Generalized random forests. *The Annals of Statistics*, *47*(2), 1148–1178.
- Knaus, M. C., Lechner, M., & Strittmatter, A. (2018). Machine learning estimation of heterogeneous causal effects: Empirical Monte Carlo evidence. *arXiv:1810.13237v2*.

At STATWORX, we are excited that a new promising field of Machine Learning has evolved in recent years: Causal Machine Learning. In short, Causal Machine Learning is the scientific study of Machine Learning algorithms that allow estimating causal effects. Over the last few years, different Causal Machine Learning algorithms have been developed, combining the advances from Machine Learning with the theory of causal inference to estimate different types of causal effects. My colleague Markus has already introduced some of these algorithms in an earlier blog post.

As Causal Machine Learning is a rather complex topic, I will write a series of blog posts to slowly dive into this new fascinating world of data science. This first blog post is an introduction to the topic, focusing on what Causal Machine Learning is and why it is important in practice and for the future of data science.

## The Origins of Causal Machine Learning

As Markus has already explained in his earlier blog post, analysis in economic and other social sciences revolves primarily around the estimation of causal effects, that is, the isolated effect of a feature $x$ on the outcome variable $y$. Actually, in most cases, the interest lies in so-called treatment effects. A treatment effect refers to the causal effect of a treatment or intervention on an outcome variable of scientific or political interest. In economics, one of the most analyzed treatment effects is the causal effect of a subsidized training program on earnings.

Following the potential outcome framework introduced by Rubin (1974), the treatment effect of an individual is defined as follows:

$$\tau_i = Y_i(1) - Y_i(0)$$

where $Y_i(1)$ indicates the potential outcome of individual $i$ with treatment and, conversely, $Y_i(0)$ denotes the potential outcome of individual $i$ without treatment. However, as an individual can either receive the treatment or not, and thus we can only ever observe one of the two potential outcomes for an individual at one point in time, the individual treatment effect is unobservable. This problem is also known as the Fundamental Problem of Causal Inference. Nevertheless, under certain assumptions, averages of the individual treatment effects may be identified. In randomized experiments, where the treatment is randomly assigned, these assumptions are naturally valid, and the identification of any aggregation level of individual treatment effects is possible without further complications. In many situations, however, randomized experiments are not possible, and the researcher has to work with observational data, where these assumptions are usually not valid. Thus, extensive literature in economics and other fields has focused on techniques for identifying causal effects in cases where these assumptions are not given.
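A small Python simulation illustrates both the Fundamental Problem and why randomization helps: we generate both potential outcomes (something impossible with real data), observe only one of them per individual, and still recover the average treatment effect from the difference in group means (all values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 10000

# Potential outcomes Y(0) and Y(1); the individual effect is Y(1) - Y(0)
y0 = rng.normal(loc=1.0, scale=1.0, size=n)
y1 = y0 + 2.0 + rng.normal(scale=0.5, size=n)   # true ATE = 2

# Random treatment assignment: we only ever observe ONE potential outcome
D = rng.integers(0, 2, size=n)
y_obs = np.where(D == 1, y1, y0)

true_ate = (y1 - y0).mean()
# Under randomization, the difference in observed group means identifies the ATE
ate_hat = y_obs[D == 1].mean() - y_obs[D == 0].mean()
```

With observational (non-random) assignment this simple difference in means would generally be biased, which is exactly why the identifying assumptions matter.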

> Prediction and causal inference are distinct (though closely related) problems. (Athey, 2017, p. 484)

In contrast, the (Supervised) Machine Learning literature has traditionally focused on prediction, that is, producing predictions of the outcome variable $y$ from the feature(s) $x$. Machine Learning models are designed to discover complex structures in given data and generalize them so that they can be used to make accurate predictions on new data. These algorithms can handle enormous numbers of predictors and combine them in nonlinear and highly interactive ways. They have been proven to be hugely successful in practice and are used in applications ranging from medicine to resource allocations in cities.

## Bringing Together the Best of Both Worlds

Although economists and other social scientists prioritize precise estimates of causal effects above predictive power, they were intrigued by the advantages of Machine Learning methods, such as the precise out-of-sample prediction power or the ability to deal with large numbers of features. But as we have seen, classical Machine Learning models are not designed to estimate causal effects. Using off-the-shelf prediction methods from Machine Learning leads to biased estimates of causal effects. The existing Machine Learning techniques had to be modified to use the advantages of Machine Learning for consistently and efficiently estimating causal effects – the birth of Causal Machine Learning!

Currently, Causal Machine Learning can be broadly divided into two lines of research, defined by the type of causal effect to be estimated. One line of Causal Machine Learning research focuses on modifying Machine Learning methods to estimate unbiased and consistent average treatment effects. The average treatment effect is the mean of all individual treatment effects in an entire population of interest, and probably the most common parameter analyzed in econometric causal studies. Models from this line of research try to answer questions like: How will customers react on average to a marketing campaign? What is the average effect of a price change on sales? The other line of Causal Machine Learning research focuses on modifying Machine Learning methods to uncover treatment effect heterogeneity. That is, identifying subpopulations (based on features) of individuals who have a larger or smaller than average treatment effect. These models are designed to answer questions such as: Which customers respond the most to a marketing campaign? How does the effect of a price change on sales change with the age of customers?

## Decision-Making Questions Need Causal Answers

Although the study of Causal Machine Learning has been mainly driven by economic researchers, its importance for other areas such as business should not be neglected. Companies often reach for classical Machine Learning tools to solve decision-making problems, such as where to set the price or which customers to target with a marketing campaign. However, there is a significant gap between making a prediction and making a decision. To make a data-driven decision, the understanding of causal relationships is key. Let me illustrate this problem with two examples from our daily business.

**Example 1: Price Elasticities**

At the core of every company’s pricing management is the understanding of how customers will respond to a change in price. To set an optimal price, the company needs to know how much it will sell at different (hypothetical) price levels. The most practicable and meaningful metric answering this question is the price elasticity of demand. Although it might seem straightforward to estimate the price elasticity of demand using classical Machine Learning methods to predict sales as the outcome with the price level as a feature, in practice, this approach does not simply give us the causal effect of price on sales.

> There are a number of gaps between making a prediction and making a decision, and underlying assumptions need to be understood in order to optimise data-driven decision making. (Athey, 2017, p. 483)

Following a similar example introduced by Athey (2017), assume we have historical data on airline prices and the respective occupancy rates. Typically, prices and occupancy rates are positively correlated, as usual pricing policies of airlines specify that airlines raise their seat prices when their occupancy rate increases. In this case, a classical Machine Learning model will answer the following question: If on a particular day airline ticket prices are high, what is the best prediction for the occupancy rate on that day? The model will correctly predict that the occupancy rate is likely to be high. However, it would be wrong to infer from this that an increase in prices leads to a higher occupancy rate. From common experience, we know that the true causal effect is quite the opposite – if an airline systematically raised its ticket prices by 10% everywhere, it would be unlikely to sell more tickets.
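This confounding story is easy to reproduce in a toy simulation (all numbers are made up; demand plays the role of the unobserved confounder): the naive regression slope of occupancy on price is positive, while the true causal effect built into the data is negative:

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 20000

# Unobserved demand drives BOTH price (airlines raise prices when demand is
# high) and occupancy, so price and occupancy are positively correlated...
demand = rng.normal(size=n)
price = 50 + 10 * demand + rng.normal(size=n)
# ...even though the true causal effect of price on occupancy is negative
occupancy = (0.5 + 0.15 * demand
             - 0.005 * (price - 50)
             + rng.normal(scale=0.05, size=n))

# Naive 'predictive' slope of occupancy on price (simple OLS)
naive_slope = np.cov(price, occupancy)[0, 1] / np.var(price)

# Controlling for demand (the confounder) recovers the causal slope
Z = np.column_stack([np.ones(n), price, demand])
coef, *_ = np.linalg.lstsq(Z, occupancy, rcond=None)
causal_slope = coef[1]
```

The naive slope is positive (high prices predict high occupancy), yet the regression that adjusts for demand recovers the negative causal effect, which is precisely the prediction-versus-decision gap described above.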

**Example 2: Customer Churn**

Another common problem that companies like to solve with the help of Machine Learning is the prediction of customer churn (i.e., customers abandoning the firm or its services). Companies are interested in identifying the customers with the highest risk of churn so that they can respond by allocating interventions, in the hope of preventing these customers from leaving.

Classical Machine Learning algorithms have proven to be very good at predicting customer churn. Unfortunately, these results cannot sufficiently address the company’s resource allocation problem of which customers to best target with intervention strategies. The question of the optimal allocation of resources to customers is of a causal nature: For which customers is the causal effect of intervention strategies on their churn behavior the largest? A study has shown that in many cases, the overlap between the customers with the highest risk of churning and the customers who would respond most to interventions was much lower than 100%. Thus, treating customer churn as a pure prediction problem, and therefore using classical Machine Learning models, is not optimal and yields lower payoffs to companies.

## The Wish of Every Data Scientist

Looking beyond these practical examples, we can observe that there is a more profound reason why Causal Machine Learning should be of interest to any data scientist: model generalisability. A Machine Learning model that can capture causal relationships of data will be generalizable to new settings, which is still one of the biggest challenges in Machine Learning.

To illustrate this, I’ll use the example of the rooster and the sun, from “The Book of Why” by Pearl and Mackenzie (2018). A Machine Learning algorithm that is shown data about a rooster and the sun would associate the rising of the sun with the crow of the rooster and may be able to predict when the sun will rise accurately: If the rooster has just crowed, the sun will rise shortly after that. Such a model that is only capable of predicting correlations will not generalize to a situation where there is no rooster. In that case, a Machine Learning model will never predict that the sun will rise because it has never observed such a data point (i.e., without a rooster). If, however, the model captured the true causal relationship, that is, the sun being about to rise causes the rooster to crow, it would be perfectly able to predict that the sun will rise even if there is no rooster.

## No True Artificial Intelligence Without Causal Reasoning

Pearl and Mackenzie (2018) go even further, arguing that we can never reach true human-level Artificial Intelligence without teaching machines causal reasoning since cause and effect are the key mechanisms through which we humans process and understand the complex world around us. The ability to predict correlations does not make machines intelligent; it merely allows them to model a reality based on data the algorithm is provided.

The algorithmisation of counterfactuals invites thinking machines to benefit from the ability to reflect on one’s past actions and to participate in this (until now) uniquely human way of thinking about the world. Pearl & Mackenzie, 2018, p. 10

Furthermore, Machine Learning models need the capacity to detect causal effects to ask counterfactual questions, that is, to inquire how some relationship would change given some kind of intervention. As counterfactuals are the building blocks of moral behavior and scientific thought, machines will only be able to communicate more effectively with us humans and reach the status of moral beings with free will if they learn causal and thus counterfactual reasoning.

## Outlook

Although this last part became rather philosophical in the end, I hope that this blog post has helped you to understand what Causal Machine Learning is and why it is necessary, not only in practice but also for the future of data science in general. In my upcoming blog posts, I will discuss various aspects of this topic in more detail. I will, for example, take a closer look at the problems of using classical Machine Learning algorithms to estimate causal effects, or compare different Causal Machine Learning algorithms in a simulation study.

## References

- Athey, S. (2017). Beyond prediction: Using big data for policy problems. *Science, 355*(6324), 483-485.
- Pearl, J., & Mackenzie, D. (2018). *The book of why*. New York, NY: Basic Books.
- Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. *Journal of Educational Psychology, 66*(5), 688-701.

Cross-validation is a widely used technique to assess the generalization performance of a machine learning model. Here at STATWORX, we often discuss performance metrics and how to incorporate them efficiently in our data science workflow. In this blog post, I will introduce the basics of cross-validation, provide guidelines to tweak its parameters, and illustrate how to build it from scratch in an efficient way.

## Model evaluation and cross-validation basics

Cross-validation is a model evaluation technique. The central intuition behind model evaluation is to figure out if the trained model is generalizable, that is, whether the predictive power we observe while training is also to be expected on unseen data. We could feed the model directly with the data it was developed for, i.e., the data it is meant to predict. But then again, there would be no way for us to know, or *validate*, whether its predictions are accurate.

Naturally, we would want some kind of benchmark of our model’s generalization performance before launching it into production. Therefore, the idea is to split the existing training data into an actual training set and a hold-out test partition which is not used for training and serves as the “unseen” data. Since this test partition is, in fact, part of the original training data, we have a full range of “correct” outcomes to validate against. We can then use an appropriate error metric, such as the Root Mean Squared Error (RMSE) or the Mean Absolute Percentage Error (MAPE) to evaluate model performance. However, the applicable evaluation metric has to be chosen with caution as there are pitfalls (as described in this blog post by my colleague Jan).
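To make the hold-out logic concrete, here is a minimal sketch in Python (simulated data with made-up numbers; the RMSE and MAPE definitions follow the standard formulas):

```python
import numpy as np

def rmse(f, o):
    """Root Mean Squared Error between fitted (f) and observed (o) values."""
    f, o = np.asarray(f, float), np.asarray(o, float)
    return float(np.sqrt(np.mean((f - o) ** 2)))

def mape(f, o):
    """Mean Absolute Percentage Error (in percent); requires non-zero observed values."""
    f, o = np.asarray(f, float), np.asarray(o, float)
    return float(np.mean(np.abs((o - f) / o)) * 100)

# Simulated data: a linear signal plus unit-variance noise
rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
y = 2 * x + rng.normal(0, 1, 200)

# Hold-out evaluation: fit on a random 80%, validate on the held-out 20%
idx = rng.permutation(len(x))
train_idx, test_idx = idx[:160], idx[160:]
slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)
pred = slope * x[test_idx] + intercept
holdout_rmse = rmse(pred, y[test_idx])  # close to the noise standard deviation
```

Because the held-out observations come with their true outcomes, the error metric is a meaningful estimate of performance on unseen data.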

Many machine learning algorithms allow the user to specify hyperparameters, such as the number of neighbors in k-Nearest Neighbors or the number of trees in a Random Forest. Cross-validation can also be leveraged for “tuning” the hyperparameters of a model by comparing the generalization error of different model specifications.

### Common approaches to model evaluation

There are dozens of model evaluation techniques that are always trading off between variance, bias, and computation time. It is essential to know these trade-offs when evaluating a model, since choosing the appropriate technique highly depends on the problem and the data we observe. I will cover this topic once I have introduced two of the most common model evaluation techniques: the train-test-split and k-fold cross-validation. In the former, the training data is randomly split into a train and test partition (Figure 1), commonly with a significant part of the data being retained as the training set. Proportions of 70/30 or 80/20 are the most frequently used in the literature, though the exact ratio depends on the size of your data.

The drawback of this approach is that this one-time random split can end up partitioning the data into two very imbalanced parts, thus yielding biased generalization error estimates. That is especially critical if you only have limited data, as some features or patterns could end up entirely in the test part. In such a case, the model has no chance to learn them, and you will potentially underestimate its performance.

A more robust alternative is the so-called k-fold cross-validation (Figure 2). Here, the data is shuffled and then randomly partitioned into *k* folds. The main advantage over the train-test-split approach is that *each* of the *k* partitions is iteratively used as a test (i.e., validation) set, with the remaining parts serving as the training set in that iteration. This process is repeated *k* times, such that every observation is included in both training and test sets. The appropriate error metric is then simply calculated as the mean over all *k* folds, giving the cross-validation error.
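The fold assignment itself takes only a few lines. Here is an illustrative Python sketch (the post's own hands-on implementation later on is in R): shuffle once, split into *k* folds, and let each fold take its turn as the validation set.

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 once, then split them into k (nearly) equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

folds = kfold_indices(n=10, k=5)
for i, fold in enumerate(folds):
    validation_idx = fold                                               # fold i validates...
    train_idx = np.hstack([f for j, f in enumerate(folds) if j != i])   # ...the rest trains
```

Every observation lands in exactly one validation fold, so each one is used for both training and validation across the *k* iterations.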

This is more of an *extension* of the train-test split than a completely new method: the train-test procedure is simply repeated *k* times. However, note that even if *k* is chosen as low as *k* = 2, i.e., you end up with only two parts, this approach is still superior to the train-test split in that *both* parts are iteratively chosen for training, so that the model has a chance to learn *all* the data rather than just a random subset of it. Therefore, this approach usually results in more robust performance estimates.

Comparing the two figures above, you can see that a train-test split with a ratio of 80/20 is equivalent to *one iteration* of a 5-fold (that is, *k* = 5) cross-validation where 4/5 of the data are retained for training and 1/5 is held out for validation. The crucial difference is that in k-fold CV the validation set is shifted in each of the *k* iterations. Note that a k-fold cross-validation is more robust than merely repeating the train-test split *k* times: in k-fold CV, the partitioning is done *once*, and then you iterate through the folds, whereas in the repeated train-test split, you re-partition the data *k* times, potentially omitting some data from training.

### Repeated CV and LOOCV

There are many flavors of k-fold cross-validation. For instance, you can do "repeated cross-validation" as well. In standard k-fold CV, once the data is divided into *k* folds, this partitioning is fixed for the whole procedure, so an unlucky split can still bias the result. In repeated CV, you repeat the process of shuffling and randomly partitioning the data into *k* folds a certain number of times, so you are not risking that some portions of the data are excluded by chance. You can then average over the resulting cross-validation errors of each run to get a global performance estimate.

Another special case of k-fold cross-validation is "Leave-One-Out Cross-Validation" (LOOCV), where you set *k* = *n*, the number of observations. That is, in each iteration, you use a *single* observation from your data as the validation portion and the remaining *n* - 1 observations as the training set. While this might sound like a hyper-robust version of cross-validation, its usage is generally discouraged for two reasons:

- First, it’s usually *very computationally expensive*. For most datasets used in applied machine learning, training your model *n* times is neither desirable nor feasible (although it may be useful for very small datasets).
- Second, even if you had the computational power (and time on your hands) to endure this process, critics of LOOCV argue from a statistical point of view that the resulting cross-validation error can exhibit high variance: your "validation set" consists of only one observation at a time, and depending on the distribution of your data (and potential outliers), the resulting error estimates can vary substantially.
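To get a feel for the cost, here is a minimal LOOCV sketch in Python (simple linear regression on simulated, made-up data): one full model fit per observation.

```python
import numpy as np

def loocv_rmse(x, y):
    """Leave-One-Out CV for simple linear regression: n model fits for n points."""
    n = len(x)
    sq_errors = []
    for i in range(n):
        mask = np.arange(n) != i                       # train on all but observation i
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        sq_errors.append((slope * x[i] + intercept - y[i]) ** 2)
    return float(np.sqrt(np.mean(sq_errors)))

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3 * x + rng.normal(0, 1, 50)
err = loocv_rmse(x, y)  # 50 separate fits for just 50 observations
```

For a cheap linear model this is harmless, but with *n* in the millions and an expensive learner, *n* refits quickly become infeasible.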

In general, note that the performance of LOOCV is a somewhat controversial topic, both in the scientific literature and the broader machine learning community. Therefore, I encourage you to read up on this debate if you consider using LOOCV for estimating the generalization performance of your model (for example, check out this and related posts on StackExchange). As is often the case, the answer might end up being “it depends”. In any case, keep in mind the computational overhead of LOOCV, which is hard to deny (unless you have a tiny dataset).

### The value of and the bias-variance trade-off

If *k* = *n* is not (necessarily) the best choice, then how do we find an appropriate value for *k*? It turns out that the answer to this question boils down to the notorious *bias-variance trade-off*. Why is that?

The value of *k* governs how many folds your data is partitioned into and therefore the size of (i.e., number of observations contained in) each fold. We want to choose *k* in such a way that a sufficiently large portion of our data remains in the training set; after all, we don’t want to give away too many observations that could be used to train our model. The higher the value of *k*, the more observations are included in our training set in each iteration.

For instance, suppose we have 1,200 observations in our dataset; then with a small *k*, say *k* = 2, our training set would consist of only 600 observations, but with *k* = 8 it would include 1,050 observations. Naturally, with more observations used for training, you better approximate your model’s actual performance (as if it were trained on the whole dataset), hence reducing the bias of your error estimate compared to a smaller fraction of the data. But with increasing *k*, the size of your validation partition decreases, and your error estimate in each iteration is more sensitive to these few data points, potentially increasing its overall variance. Basically, it’s choosing between the "extremes" of the train-test-split on the one hand and LOOCV on the other. The figure below schematically (!) illustrates the bias-variance performance and computational overhead of different cross-validation methods.
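The arithmetic behind these numbers is simple enough to sketch (Python, using the 1,200-observation example):

```python
def train_size(n, k):
    """Observations available for training in each iteration of k-fold CV."""
    return n - n // k

# With 1,200 observations, a larger k leaves more data for training per iteration:
sizes = {k: train_size(1200, k) for k in (2, 5, 8, 10, 1200)}
# k = 2 -> 600, k = 5 -> 960, k = 8 -> 1050, k = 10 -> 1080, k = n (LOOCV) -> 1199
```

The flip side, as discussed above, is that the validation fold shrinks at the same rate, which drives up the variance of the per-fold error estimates.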

As a rule of thumb, with higher values for *k*, bias decreases and variance increases. By convention, values like *k* = 5 or *k* = 10 have been deemed a good compromise and have thus become the quasi-standard in most applied machine learning settings.

“These values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.” James et al. 2013: 184

If you are not particularly concerned with the process of cross-validation itself but rather want to seamlessly integrate it into your data science workflow (which I highly recommend!), you should be fine choosing either of these values for *k* and leaving it at that.

## Implementing cross-validation in `caret`

Speaking of integrating cross-validation into your daily workflow: which possibilities are there? Luckily, cross-validation is a standard tool in popular machine learning libraries such as the `caret` package in R, where you can specify the method with the `trainControl` function. Below is a script where we fit a random forest with 10-fold cross-validation to the `iris` dataset.

```
library(caret)

set.seed(12345)
inTrain <- createDataPartition(y = iris[["Species"]], p = .7, list = FALSE)
iris.train <- iris[inTrain, ]
iris.test <- iris[-inTrain, ]

fit.control <- caret::trainControl(method = "cv", number = 10)

rf.fit <- caret::train(Species ~ .,
                       data = iris.train,
                       method = "rf",
                       trControl = fit.control)
```

We define our desired cross-validation method in the `trainControl` function, store the output in the object `fit.control`, and then pass this object to the `trControl` argument of the `train` function. You can specify the other methods introduced in this post in a similar fashion:

```
# Leave-One-Out Cross-Validation (the number argument is not needed here):
fit.control <- caret::trainControl(method = "LOOCV")

# Repeated CV (remember to specify the number of repeats!)
fit.control <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 5)
```

## The old-fashioned way: Implementing k-fold cross-validation by hand

However, data science projects can quickly become so complex that the ready-made functions in machine learning packages are not suitable anymore. In such cases, you will have to implement the algorithm, including cross-validation techniques, by hand, tailored to the specific project needs. Let me walk you through a makeshift script for implementing simple k-fold cross-validation in R by hand (we will tackle the script step by step here; you can find the whole code on our GitHub).

### Simulating data, defining the error metric, and setting *k*

```
# devtools::install_github("andrebleier/Xy")
library(tidyverse)
library(Xy)

sim <- Xy(n = 1000,
          numvars = c(2, 2),
          catvars = 0,
          cor = c(-0.5, 0.9),
          noisevars = 0)

sim_data <- sim[["data"]]

RMSE <- function(f, o) {
  sqrt(mean((f - o)^2))
}

k <- 5
```

We start by loading the required packages and simulating some data with 1,000 observations using the `Xy()` function from the package developed by my colleague André (check out his blog post on simulating regression data with Xy). Because we need some kind of error metric to evaluate model performance, we define our RMSE function, which is pretty straightforward: the RMSE is the root of the mean of the squared error, where the error is the difference between our fitted (`f`) and observed (`o`) values; you can pretty much read the function from left to right. Lastly, we specify *k*, which is set to 5 in this example and stored as a simple integer.

### Partitioning the data

```
set.seed(12345)

sim_data <- mutate(sim_data,
                   my.folds = sample(1:k,
                                     size = nrow(sim_data),
                                     replace = TRUE))
```

Next up, we partition our data into *k* folds. For this purpose, we add a new column, `my.folds`, to the data: we sample (with replacement) from 1 to *k*, so 1 to 5 in our case, and randomly assign one of these five numbers to each row (observation) in the data. With 1,000 observations, each number should be assigned about 200 times.

### Training and validating the model

```
cv.fun <- function(this.fold, data) {

  train <- filter(data, my.folds != this.fold)
  validate <- filter(data, my.folds == this.fold)

  model <- lm(y ~ NLIN_1 + NLIN_2 + LIN_1 + LIN_2,
              data = train)

  pred <- predict(model, newdata = validate) %>% as.vector()

  this.rmse <- RMSE(f = pred, o = validate[["y"]])

  return(this.rmse)
}
```

Next, we define `cv.fun`, which is the heart of our cross-validation procedure. This function takes two arguments: `this.fold` and `data`. I will come back to the meaning of `this.fold` in a minute; let’s just set it to 1 for now. Inside the function, we divide the data into a training and a validation partition by subsetting according to the values of `my.folds` and `this.fold`: every observation with a randomly assigned `my.folds` value **other than 1** (so approximately 4/5 of the data) goes into training, and every observation with a `my.folds` value **equal to 1** (the remaining 1/5) forms the validation set. For illustration purposes, we then fit a simple linear model with the simulated outcome and four predictors. Note that we only fit this model on the `train` data! We then use this model to `predict()` our validation data, and since we have true observed outcomes for this *subset of the original overall training data* (this is the whole point!), we can compute our RMSE and return it.

### Iterating through the folds and computing the CV error

```
cv.error <- sapply(seq_len(k),
                   FUN = cv.fun,
                   data = sim_data) %>%
  mean()

cv.error
```

Lastly, we wrap the call to `cv.fun` into a `sapply()` loop; this is where all the magic happens. Here we iterate over the range of *k*, so `seq_len(k)` leaves us with the vector `[1] 1 2 3 4 5` in this case. We apply each element of this vector to `cv.fun`. In `apply()` statements, the iteration vector is always passed as the first argument of the function being called, so in our case, each element of this vector is passed to `this.fold`, one at a time. We also pass our simulated `sim_data` as the `data` argument.

Let us quickly recap what this means: in the first iteration, `this.fold` equals 1. This means that our training set consists of all the observations where `my.folds` is not 1, and observations with a value of 1 form the validation set (just as in the example above). In the next iteration of the loop, `this.fold` equals 2. Consequently, observations with values 1, 3, 4, and 5 form the training set, observations with a value of 2 go to validation, and so on. Iterating over all values of *k*, this schematically provides us with the diagonal pattern seen in Figure 2 above, where each data partition is used as a validation set exactly once.

To wrap it all up, we calculate the mean: this is the mean of our *k* individual RMSE values and leaves us with our cross-validation error. And there you go: we just defined our custom cross-validation function!

This is merely a template: You can insert any model and any error metric. If you’ve been following along so far, feel free to try implementing repeated CV yourself or play around with different values for *k*.

## Conclusion

As you can see, implementing cross-validation yourself isn’t all that hard. It gives you great flexibility to account for project-specific needs, such as custom error metrics. If you don’t need that much flexibility, enabling cross-validation in popular machine learning packages is a breeze.

I hope that I could provide you with a sufficient overview of cross-validation and how to implement it both in pre-defined functions as well as by hand. If you have questions, comments, or ideas, feel free to drop me an e-mail.

## References

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. New York: Springer.

A major problem arises when comparing forecasting methods and models across different time series. This is a challenge we regularly face at STATWORX. Unit-dependent measures like the MAE (Mean Absolute Error) and the RMSE (Root Mean Squared Error) turn out to be unsuitable and hardly helpful if the time series are measured in different units. If this is not the case, however, both measures provide valuable information. The MAE is perfectly interpretable, as it embodies the average absolute deviation from the actual values. The RMSE is not as easy to interpret and is more vulnerable to extreme values, but it is still often used in practice.

One of the most commonly used measures that avoid this problem is the MAPE (Mean Absolute Percentage Error). It solves the problem of the approaches mentioned above, as it does not depend on the unit of the time series. Furthermore, decision-makers without a statistical background can easily interpret and understand this measure. Despite its popularity, however, the MAPE has been, and still is, criticized.

In this article, I evaluate these critical arguments and show that at least some of them are highly questionable. The second part of my article concentrates on the true weaknesses of the MAPE, some of them well-known and others hiding in the shadows. In the third section, I discuss various alternatives and summarize under which circumstances the use of the MAPE seems appropriate (and when it’s not).

## What the MAPE is FALSELY blamed for!

### It Puts Heavier Penalties on Negative Errors Than on Positive Errors

Most sources dealing with the MAPE point out this “major” issue of the measure. The statement is primarily based on two different arguments. First, they claim that interchanging the actual value with the forecasted value proves their point *(Makridakis 1993)*.

Case 1: actual value = 150 & forecast = 100 (positive error)

Case 2: actual value = 100 & forecast = 150 (negative error)

It is true that Case 1 (positive error of 50) is related to a lower APE (Absolute Percentage Error) than Case 2 (negative error of 50). However, the reason here is not that the error is positive or negative but simply that the actual value changes. If the actual value stays constant, the APE is equal for both types of errors *(Goodwin & Lawton 1999)*. That is clarified by the following example.

Case 3: actual value = 100 & forecast = 50 (positive error)

Case 4: actual value = 100 & forecast = 150 (negative error)
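These cases are easy to verify numerically. A small Python sketch, using the numbers from above:

```python
def ape(actual, forecast):
    """Absolute Percentage Error, relative to the actual value (in percent)."""
    return abs(actual - forecast) / abs(actual) * 100

# Cases 1 & 2: the APE differs only because the ACTUAL value differs
case1 = ape(actual=150, forecast=100)   # positive error of 50 -> ~33.3%
case2 = ape(actual=100, forecast=150)   # negative error of 50 -> 50%

# Cases 3 & 4: with the actual value held fixed, both error signs score the same
case3 = ape(actual=100, forecast=50)    # positive error of 50 -> 50%
case4 = ape(actual=100, forecast=150)   # negative error of 50 -> 50%
```

The asymmetry in Cases 1 and 2 disappears once the denominator (the actual value) is held constant, which is exactly the Goodwin & Lawton (1999) point.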

The second, equally invalid argument supporting the asymmetry of the MAPE arises from the assumption about the predicted data. As the MAPE is mainly suited to be used to evaluate predictions on a ratio scale, the MAPE is bounded on the lower side by an error of 100% *(Armstrong & Collopy 1992)*. However, this does not imply that the MAPE overweights or underweights some types of errors, but that these errors are not possible.

## Its TRUE weaknesses!

### It Fails if Some of the Actual Values Are Equal to Zero

This statement describes a well-known problem of the focal measure. This problem and the (alleged) asymmetry discussed above were the reason for the development of a modified form of the MAPE, the SMAPE ("Symmetric" Mean Absolute Percentage Error). Ironically, in contrast to the original MAPE, this modified form suffers from true asymmetry *(Goodwin & Lawton 1999)*. I will clarify this argument in the last section of the article.

### Particularly Small Actual Values Bias the MAPE

If any true values are very close to zero, the corresponding absolute percentage errors will be extremely high and therefore bias the informative value of the MAPE *(Hyndman & Koehler 2006)*. The following graph clarifies this point. Although all three forecasts have the same absolute errors, the MAPE of the time series with only one extremely small value is approximately twice as high as the MAPE of the other forecasts. This implies that the MAPE should be used carefully if there are extremely small observations, and it directly motivates the last and often ignored weakness of the MAPE.
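A quick numerical sketch (Python, with made-up numbers) shows how a single near-zero actual value dominates the MAPE even though all absolute errors are identical:

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error (in percent)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

actual_normal = np.array([100.0, 100, 100, 100, 100])
actual_small  = np.array([100.0, 100, 0.1, 100, 100])  # one near-zero observation

# Identical absolute errors of 10 on every single point:
m_normal = mape(actual_normal, actual_normal - 10)  # 10%
m_small  = mape(actual_small,  actual_small - 10)   # explodes into the thousands
```

One observation of 0.1 contributes an APE of 10,000% on its own and completely swamps the other four points.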

### The MAPE Implies Only Which Forecast Is Proportionally Better

As mentioned at the beginning of this article, one advantage of using the MAPE for comparisons between forecasts of different time series is its unit independence. However, it is essential to keep in mind that the MAPE only implies which forecast is *proportionally better*. The following graph shows three different time series and their corresponding forecasts; the only difference between them is their general level. The same absolute errors therefore lead to profoundly different MAPEs. This raises the question of whether it is reasonable to use such a percentage-based measure to compare forecasts for different time series. If the time series do not fluctuate around comparable levels (as shown in the following graphic), using the MAPE to infer that a forecast is *generally better* for one time series than for another relies on the assumption that the same absolute errors are less problematic for time series on higher levels than for time series on lower levels:

“If a time series fluctuates around 100, then predicting 101 is way better than predicting 2 for a time series fluctuating around 1.”

That might be true in some cases. In general, however, this is a questionable assumption, or at least one that people should always be aware of when using the MAPE to compare forecasts between different time series.
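This level dependency is easy to see numerically. A small Python sketch (made-up numbers): the same absolute error of 10 produces very different MAPEs depending on the level of the series.

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error (in percent)."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

# Identical absolute errors of 10, on series at very different levels:
high_level = mape([1000.0] * 5, [990.0] * 5)  # series around 1000 -> MAPE of 1%
low_level  = mape([20.0] * 5,   [10.0] * 5)   # series around 20   -> MAPE of 50%
```

Whether the 50% forecast really is "worse" than the 1% forecast is exactly the assumption the MAPE makes for you.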

### Summary

In summary, the discussed findings show that the MAPE should be used with caution as an instrument for comparing forecasts across different time series. A necessary condition is that the time series contains only strictly positive values. Second, even a few extremely small values have the potential to bias the MAPE heavily. Last, the MAPE depends systematically on the level of the time series, as it is a percentage-based error. This article therefore critically questions whether it is meaningful to generalize from a forecast being *proportionally better* to it being *generally better*.

## BETTER alternatives!

The discussion above implies that the MAPE alone is often not very useful when the objective is to compare accuracy between different forecasts for different time series. Although relying on one easily understandable measure appears convenient, it comes with a high risk of drawing misleading conclusions. In general, it is always recommended to use different measures in combination. In addition to numerical measures, a visualization of the time series, including the actual and the forecasted values, always provides valuable information. However, if one single numeric measure is the only option, there are some excellent alternatives.

### Scaled Measures

Scaled measures compare the measure of a forecast, for example the MAE, relative to the MAE of a benchmark method. Similar measures can be defined using the RMSE, the MAPE, or other measures. Common benchmark methods are the "random walk", the "naïve" method, and the "mean" method. These measures are easy to interpret, as they show how the focal model compares to the benchmark methods. However, it is important to keep in mind that relative measures rely on the selection of the benchmark method and on how well the time series can be forecast by the selected method.
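As an illustration of the idea, here is a minimal Python sketch of a relative MAE (hypothetical numbers; the naïve method forecasts each value by its predecessor):

```python
import numpy as np

def mae(actual, forecast):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(np.asarray(actual, float) - np.asarray(forecast, float))))

def relative_mae(actual, forecast):
    """MAE of the focal forecast divided by the MAE of the naive benchmark.
    Values below 1 mean the focal forecast beats the benchmark."""
    actual = np.asarray(actual, float)
    naive = actual[:-1]                      # naive forecast: the previous observation
    # compare from t = 1 on, where a naive forecast exists
    return mae(actual[1:], np.asarray(forecast, float)[1:]) / mae(actual[1:], naive)

series   = [100, 102, 104, 106, 108]
forecast = [100, 101, 103, 105, 107]         # off by 1 from t = 1 onwards
score = relative_mae(series, forecast)       # < 1: better than the naive benchmark
```

Here the naïve method errs by 2 at every step and the focal forecast by 1, so the relative MAE is 0.5; the same benchmark caveat from the text applies.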

### Scaled Errors

Scaled error approaches also try to remove the scale of the data by comparing the forecasted values to those obtained by some benchmark forecast method, like the naïve method. The MASE (Mean Absolute Scaled Error), proposed by *Hyndman & Koehler 2006*, is defined slightly differently depending on the seasonality of the time series. In the simple case of a non-seasonal time series, the error of the focal forecast is scaled by the in-sample MAE of the naïve forecast method. One major advantage is that it can handle actual values of zero and that it is not biased by very extreme values. Once again, it is important to keep in mind that relative measures rely on the selection of the benchmark method and on how well the time series can be forecast by the selected method.

*Non-Seasonal:* MASE = mean(|e_t|) / [ (1/(n-1)) * Σ_{i=2..n} |y_i - y_{i-1}| ]

*Seasonal:* MASE = mean(|e_t|) / [ (1/(n-m)) * Σ_{i=m+1..n} |y_i - y_{i-m}| ], where m denotes the seasonal period
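Following this definition, here is a minimal MASE sketch in Python (the function and test series are illustrative; `m = 1` gives the non-seasonal case):

```python
import numpy as np

def mase(actual, forecast, m=1):
    """Mean Absolute Scaled Error (Hyndman & Koehler, 2006).
    Errors are scaled by the in-sample MAE of the (seasonal) naive method;
    m = 1 recovers the non-seasonal case."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    scale = np.mean(np.abs(actual[m:] - actual[:-m]))   # naive in-sample MAE
    return float(np.mean(np.abs(actual - forecast)) / scale)

y  = np.array([0.0, 2, 0, 2, 0, 2])    # actual values of zero are no problem here
fc = np.array([1.0, 1, 1, 1, 1, 1])
score = mase(y, fc)                    # < 1 means better than the naive method in-sample
```

Unlike the MAPE, nothing blows up at the zero observations, because the scaling happens through the naïve method's errors rather than the actual values.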

### SDMAE

In my understanding, the basic idea of using the MAPE to compare different time series between forecasts is that the same absolute error is assumed to be less problematic for time series on higher levels than for time series on lower levels. Based on the examples shown earlier, I think that this idea is at least questionable.

I argue that how good or bad a specific absolute error is judged should not depend on the general level of the time series but on its variation. Accordingly, the following measure, the SDMAE (Standard Deviation adjusted Mean Absolute Error), is a product of the discussed issues and my imagination. It can be used for evaluating forecasts for time series containing negative values and suffers neither from actual values being equal to zero nor from particularly small ones. Note that this measure is not defined for time series that do not fluctuate at all. Furthermore, there might be other limitations of this measure that I am currently not aware of.

### Summary

I suggest using a combination of different measures to get a comprehensive understanding of the performance of the different forecasts. I also suggest complementing the MAPE with a visualization of the time series, including the actual and the forecasted values, the MAE, and a Scaled Measure or Scaled Error approach. The SDMAE should be seen as an alternative approach that was not discussed by a broader audience so far. I am thankful for your critical thoughts and comments on this idea.

## Worse alternatives!

### SMAPE

The SMAPE was created to solve and respond to the problems of the MAPE. However, it solves neither the problem of extremely small actual values nor the level dependency of the MAPE. The reason is that extremely small actual values are typically related to extremely small predictions *(Hyndman & Koehler 2006)*. Additionally, and in contrast to the unmodified MAPE, the SMAPE raises the problem of asymmetry *(Goodwin & Lawton 1999)*. This is clarified by the following graphic, where "APE" relates to the MAPE and "SAPE" relates to the SMAPE. It shows that the SAPE is higher for positive errors than for negative errors and is therefore asymmetric. Several scientists recommend against using the SMAPE (Hyndman & Koehler 2006).
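The asymmetry can be checked numerically. A quick Python sketch, defining the error as actual minus forecast as in the discussion above:

```python
def sape(actual, forecast):
    """Symmetric Absolute Percentage Error (in percent)."""
    return 200 * abs(actual - forecast) / (abs(actual) + abs(forecast))

# Same absolute error of 10 around an actual value of 100:
positive_error = sape(actual=100, forecast=90)    # A - F > 0 -> ~10.53%
negative_error = sape(actual=100, forecast=110)   # A - F < 0 -> ~9.52%
```

With the error sign defined this way, under-forecasts are penalized more heavily than over-forecasts of the same magnitude, which is exactly the Goodwin & Lawton (1999) asymmetry.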

*On the asymmetry of the symmetric MAPE* *(Goodwin & Lawton 1999)*

## References

- Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. *International Journal of Forecasting, 15*(4), 405-408.
- Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. *International Journal of Forecasting, 22*(4), 679-688.
- Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. *International Journal of Forecasting, 9*(4), 527-529.
- Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing about forecasting methods: Empirical comparisons. *International Journal of Forecasting, 8*(1), 69-80.

On my daily commute, I pass the gas station down the street twice a day. Even though I’m riding a bike, I still look at the gas prices. Lately, after a few weeks of observation, I thought I had found some patterns. My statistics senses were tingling!

Of course, I am not the first one to notice patterns. Nearly everyone has their own theory about when to buy cheap gas! Some of the more famous examples are:

- Everything was better in the old days
- It’s cheaper in the evening
- Gas stations at motorways are more expensive
- No-name stations have lower prices than top brands
- The difference between diesel and gas (E10) is shrinking
- It’s cheapest on Mondays

But, are these facts or just rumors? To check all those myths, I gathered some data, explored relations, and finally plotted my results!

FYI: The code I used will be presented in a future blog post, in which I will also take a deeper look at my coding and optimization process – stay tuned!

## The king of gas price data

Since December 2013, all gas stations in Germany must report their prices to the *Markttransparenzstelle für Kraftstoffe*, which is part of the *Bundeskartellamt*. While searching for this data, I stumbled upon the site . They have gathered the price data for nearly all gas stations since 2014. At first, they only provided a database dump with all historical data, but since October 2018, they also offer daily CSV files.

The data contains all price changes of around 15,000 gas stations and looks as follows:

```
date station_uuid diesel e5 e10
1: 2019-01-01 00:01:06 2ae0e4dc 1.319 1.399 1.379
2: 2019-01-01 00:01:06 08a67e2e 1.339 1.409 1.389
3: 2019-01-01 00:01:06 939f06e5 1.299 1.379 1.359
4: 2019-01-01 00:01:06 d299da0e 1.279 1.389 1.369
5: 2019-01-01 00:01:06 a1e15688 1.228 1.328 1.308
```

In addition to the price information, tankerkoenig.de also provides additional information like the brand, the geo-coordinates, and the city of each gas station. Since I was also interested in whether a station is near a motorway or not, I included data from (the German automobile club). This did not seem to be complete but was good enough for my purpose. With the result, I was able to plot the stations, and the motorways emerged!

If you want to know more about how to make plots with geo-coordinates and `ggplot2`, check out !

## Only believe the facts!

To confirm or debunk the myths I chose, I went ahead and filtered, aggregated, and merged the raw data. Be prepared for some nice plots and insights!

### 1. Everything was better in the old days

Nothing shows a development better than a good old timeline! As you can see in the following graphic, I plotted the mean price per day. To get a little more insight about the price ranges, I included the 10% and 90% quantiles of prices for each day.
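A sketch of this aggregation, using the column names from the sample above on a tiny made-up data set (a thorough analysis would additionally weight each price by how long it was valid, since the raw data contains price changes rather than regular observations):

```
import pandas as pd

# toy data in the same shape as the tankerkoenig CSV files
prices = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-01-01 00:01:06", "2019-01-01 08:15:00",
        "2019-01-02 07:30:00", "2019-01-02 17:45:00"]),
    "station_uuid": ["2ae0e4dc", "08a67e2e", "2ae0e4dc", "08a67e2e"],
    "e10": [1.379, 1.389, 1.359, 1.399],
})

# daily mean plus the 10% and 90% quantiles of the E10 price
g = prices.set_index("date")["e10"].resample("D")
daily = pd.DataFrame({
    "mean": g.mean(),
    "q10": g.quantile(0.1),
    "q90": g.quantile(0.9),
})
print(daily)
```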

Of course, this was just a glimpse, and there were times (many years back) when the price was below 1 DM (~0.50 €). But for the last five years, the price has stayed roughly within the same range. There was a low point early in 2016, but the latest peak was still a bit lower than the one in mid-2014.

**Conclusion**: For the “short-term” old days the myth does not hold up – for the “long-run” … probably yes!

### 2. It’s cheaper in the evening

To check this myth, I aggregated the data over days and years and noticed different patterns over the years. From 2014 till 2017, there seemed to be a pattern where the price was high in the morning, slowly decreased over the day with a small peak at noon, and rose again in the evening. Since 2018, a new pattern with three peaks in a wave emerged: the highest one in the morning around 7:00, one at noon, and the last one around 17:00.

Since the price level varied a bit over the years, I also looked at the scaled prices. Here, the difference between the patterns was even more visible.

**Conclusion**: From 2015 to 2017, the myth seemed to be right. Lately, however, there have been multiple “good times” across the day, mostly right before the peaks in the afternoon and evening. As the price did not rise much after its evening peak, the myth is still correct. The lowest point at the moment seems to be around 21:00.

**Update**

While writing this post, I found , which shows that there is a new pattern with even more peaks!

### 3. Gas stations at motorways are more expensive

In the plot above, I already showed where the motorway stations are – the blue dots. In the next plot, the blue lines represent the prices at those motorway stations. One can clearly spot the difference, even though the prices of some stations within the 10% and 90% quantile range overlap.

**Conclusion**: Obviously right!

### 4. No-name stations have lower prices than top brands

To get the top brands, I took the nine brands with the most stations in Germany and labelled the rest “other”. After that, I calculated the difference between the overall mean price and the mean price per brand. Lines above zero therefore indicate that a brand is more expensive than the mean, and vice versa. Of the top brands, ARAL and SHELL were always above the mean, some fluctuated around the mean, and some were always below it. The no-name stations were cheaper than the mean over the whole timespan.

**Conclusion**: The result depends on how a no-name station is defined. Since the top five brands are mostly more expensive than the mean, I think the myth is somewhat right. At least, it is safe to say that there are differences between the brands.

### 5. The difference between diesel and gas (E10) is shrinking

To compare the differences, I took the E10 price as the baseline and plotted the other two types against it. Between E5 and E10, there was nearly no change over the last five years. For diesel, however, it was a different story. Since 2014, the difference took a wild ride from 10% cheaper than E10, to 20% cheaper at the beginning of 2016, and back to only 5% now.

**Conclusion**: Yes, the difference is shrinking at the moment compared to the last five years.

### 6. It’s cheapest on Mondays

To check the last myth, I calculated the mean price per hour, type and weekday. The weekday with the lowest price at a given time can be seen in the following plot. All gas types revealed a similar pattern. Between the years, there was a subtle change mostly between Wednesday and Thursday.

**Conclusion**: Since Monday barely ever was the cheapest day – this myth got debunked!

## Conclusion

After all of that data hacking and pattern finding, these are my results:

- Everything was better in the old days – **FALSE**ish
- It’s cheaper in the evening – **TRUE**
- Gas stations at motorways are more expensive – **TRUE**
- No-name stations have lower prices than top brands – **TRUE**ish
- The difference between diesel and gas (E10) is shrinking – **TRUE**
- It’s cheapest on Mondays – **FALSE**

Of course, this analysis is far from complete and took some shortcuts as well. Still, I had a lot of fun analysing this big data set, regardless of all the internal fights with my RAM. I will present some of these fights and their outcomes in my next blog post, in which you will also get more information about my code optimisation process.

In another upcoming blog post, my colleague Matthias will use this data to predict the future price. So, stay tuned!

## References

- The data is used under the Creative Commons license (CC BY 4.0). For more information on the data, check out the

## Introduction

At STATWORX, we are not only helping customers find, develop, and implement a suitable data strategy but we also spend some time doing research to improve our own tool stack. This way, we can give back to the open-source community.

One special focus of my research lies in tree-based ensemble methods. These algorithms are bread and butter tools in predictive learning and you can find them as a standalone model or as an ingredient to an ensemble in almost every supervised learning Kaggle challenge. Renowned models are Random Forest by Leo Breiman, Extremely Randomized Trees (Extra-Trees) by Pierre Geurts, Damien Ernst & Louis Wehenkel, and Multiple Additive Regression Trees (MART; also known as Gradient Boosted Trees) by Jerome Friedman.

One thing I was particularly interested in was how much randomization techniques have helped improve prediction performance in all of the algorithms named above. In Random Forest and Extra-Trees, it is quite obvious: randomization is the reason why the ensembles offer an improvement over bagging, because de-correlating the base learners decreases the variance of the ensemble and therefore its prediction error. You achieve this de-correlation by “shaking up” the base trees, as is done in the two ensembles. However, MART also profits from randomization. In 2002, Friedman published another paper on boosting, showing that you can improve the prediction performance of boosted trees by training each tree on only a random subsample of your data; as a side effect, training time also decreases. Furthermore, in 2015, Rashmi and Gilad-Bachrach suggested adding dropout, a method originally developed for neural nets, to the boosting ensemble.

## The Idea behind Random Boost

Inspired by theoretical readings on **randomization techniques** in boosting, I developed a new algorithm that I call **Random Boost** (RB). In essence, Random Boost sequentially grows regression trees with random depth. More precisely, the algorithm is almost identical to MART and has the exact same input arguments. The only difference is the tree depth parameter: in MART, it determines the maximum depth of all trees in the ensemble, whereas in Random Boost, it constitutes the upper bound of possible tree sizes. In each boosting iteration, a random number between 1 and this upper bound is drawn, which then defines the maximum depth of that tree.
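The idea can be sketched in a few lines. This is not the author's actual implementation (which, as described further below, builds on `sklearn`'s gradient boosting classes), but a minimal least-squares boosting loop in which each tree's depth is drawn uniformly at random; all names and defaults here are illustrative:

```
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def random_boost_fit(X, y, n_estimators=100, learning_rate=0.1,
                     max_depth=4, random_state=0):
    """Least-squares gradient boosting where each tree's maximum depth
    is drawn uniformly from {1, ..., max_depth}."""
    rng = np.random.RandomState(random_state)
    f0 = y.mean()                 # constant initial prediction
    resid = y - f0
    trees = []
    for _ in range(n_estimators):
        depth = rng.randint(1, max_depth + 1)  # the one change vs. MART
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        tree.fit(X, resid)                     # fit tree to residuals
        resid -= learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def random_boost_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)

# tiny smoke test on synthetic data
rng = np.random.RandomState(42)
X = rng.uniform(size=(200, 5))
y = np.sin(3 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=200)
f0, trees = random_boost_fit(X, y)
pred = random_boost_predict(X, f0, trees)
```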

In comparison to MART, this has two advantages:

First, **RB is faster than MART on average** when being equipped with the same value for the tree size parameter. When RB and MART are trained with the same maximum tree depth $d$, Random Boost will in many cases grow trees smaller than depth $d$ by nature. If you assume that for MART all trees are grown to their full size (i.e. there is enough data left in each internal node so that tree growing doesn’t stop before the maximum size is reached), and that the cost of training a tree of depth $k$ is proportional to $2^k$, you can derive a formula showing the relative computation time of RB over MART:

$\frac{T_{RB}(d)}{T_{MART}(d)} = \frac{1}{d}\sum_{k=1}^{d} \frac{2^k}{2^d} = \frac{2^{d+1}-2}{d \cdot 2^d}$

Here, $T_{RB}(d)$ is the training time of an RB boosting ensemble with the tree size parameter equal to $d$, and $T_{MART}(d)$ is defined analogously.

To make it a bit more practical, the formula predicts, for example, that for $d = 2$, $4$, and $8$, RB takes 75%, 47%, and 25% of the computation time of MART, respectively. These predictions, however, should be seen as RB’s best-case scenario, as MART is also not necessarily growing full trees. Still, the calculations suggest that efficiency gains can be expected (more on that later).
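Assuming that the cost of training a tree of depth $k$ grows like $2^k$, the “theoretical lower bound” column of the training-time table further down can be reproduced with a few lines:

```
def relative_training_time(d_max):
    """Expected cost of a tree whose depth is drawn uniformly from
    {1, ..., d_max}, relative to a full tree of depth d_max,
    assuming training cost proportional to 2**depth."""
    expected = sum(2 ** k for k in range(1, d_max + 1)) / d_max
    return expected / 2 ** d_max

for d in range(2, 9):
    print(d, round(relative_training_time(d), 3))
```

For $d = 2$ this yields 0.75, i.e. a best-case saving of 25% of MART's training time, and the saving grows with the tree size.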

Second, there are also reasons to assume that randomizing over tree depths can have a beneficial effect on prediction performance. As already mentioned, from a variance perspective, boosting suffers from overcapacity for various reasons. One of them is choosing too rich a base learner in terms of depth. If, for example, one assumes that the dominant interaction in the data generating process is of order three, one would pick a tree of equivalent depth in MART in order to capture this interaction depth. However, this may be overkill: fully grown trees of depth three have eight leaves and therefore learn noise in the data if there are only a few such high-order interactions. Perhaps, in this case, a tree with depth three but fewer than eight leaves would be optimal. This is not accounted for in MART, unless one adds a pruning step to each boosting iteration at the expense of computational overhead. Random Boost may offer a more efficient remedy to this issue. With some probability, a tree is grown that is able to capture the high-order effect at the cost of also learning noise. In all other cases, Random Boost constructs smaller trees that do not show the overcapacity behavior and that can focus on interactions of smaller order. If overcapacity is an issue in MART because the data is governed by a small number of high-order interactions, Random Boost may perform better than MART. Furthermore, Random Boost also de-correlates trees through the extra source of randomness, which has a variance-reducing effect on the ensemble.

The concept of Random Boost constitutes only a slight change to MART. I used the `sklearn` package as a basis for my code. As a result, the algorithm is developed based on `sklearn.ensemble.GradientBoostingRegressor` and `sklearn.ensemble.GradientBoostingClassifier` and is used in exactly the same way (i.e. argument names match exactly, and CV can be carried out with `sklearn.model_selection.GridSearchCV`). The only difference is that the `RandomBoosting*` object uses `max_depth` to randomly draw tree depths for each iteration. As an example, you can use it like this:

```
rb = RandomBoostingRegressor(learning_rate=0.1, max_depth=4, n_estimators=100)
rb = rb.fit(X_train, y_train)
rb.predict(X_test)
```

For the full code, check out my GitHub account.

## Random Boost versus MART – A Simulation Study

In order to compare the two algorithms, I ran a simulation on 25 datasets generated by the Random Target Function Generator that Jerome Friedman introduced in his famous 2001 boosting paper (you can find the details in his paper; Python code can be found here). Each dataset (containing 20,000 observations) was randomly split into a 25% test set and a 75% training set. RB and MART were tuned via 5-fold CV on the same tuning grid:

```
learning_rate = 0.1
max_depth = (2, 3, ..., 8)
n_estimators = (100, 105, 110, ..., 195)
```

For each dataset, I tuned both models to obtain the best parameter constellation. Then, I trained each model on every point of the tuning grid again and saved the test MAE as well as the total training time in seconds. Why did I train every model again and didn’t simply store the prediction accuracy of the tuned models along with the overall tuning time? Well, I wanted to be able to see how training time varies with the tree size parameter.

### A Comparison of Prediction Accuracies

You can see the distribution of MAEs of the best models on all 25 datasets below.

Evidently, both algorithms perform similarly.

For a better comparison, I computed the relative difference between the predictive performance of RB and MART for each dataset, i.e. the relative MAE of RB with respect to MART. If this quantity is positive, RB had a larger mean absolute error than MART on that dataset, and vice versa.

In the majority of cases, RB did worse than MART in terms of prediction accuracy. In the worst case, RB had a 1% higher MAE than MART; in the median, RB had a 0.19% higher MAE. I’ll leave it up to you to decide whether that difference is practically significant.

### A Comparison of Training Times

When we look at training time, we get a quite clear picture. In absolute terms, it took 433 seconds to train all parameter combinations of RB on average, as opposed to 803 seconds for MART.

The small black lines on top of each bar are the error bars (two times the standard deviation of the mean; rather small in this case).

To give you a better feeling of how each model performed on each dataset, I also plotted the training times for each round.

If you now compute the training time ratio between MART and RB, you see that RB is roughly 1.8 times faster than MART on average.

Another perspective on the case is to compute the relative training time, which is just 1 over the speedup. Note that this measure has to be interpreted a bit differently from the relative MAE measure above: if it equals 1, RB is as fast as MART; if it is larger than 1, it takes longer to train RB than MART; and if it is smaller than 1, RB is faster than MART.

In the median, RB needs only roughly 54% of MART’s tuning time, and it is noticeably faster in all cases. I was also wondering how the relative training time varies with the tree size and how well the theoretically derived lower bound from above fits the actually measured relative training time. That’s why I computed the relative training time across all 25 datasets by tree size.

| Tree size | Actual training time (RB / MART) | Theoretical lower bound |
|---|---|---|
| 2 | 0.751 | 0.750 |
| 3 | 0.652 | 0.583 |
| 4 | 0.596 | 0.469 |
| 5 | 0.566 | 0.388 |
| 6 | 0.532 | 0.328 |
| 7 | 0.505 | 0.283 |
| 8 | 0.479 | 0.249 |

The theoretical figures are optimistic, but the relative performance gain of RB increases with tree size.

## Results in a Nutshell and Next Steps

As part of my research on tree-based ensemble methods, I developed a new algorithm called Random Boost. Random Boost is based on Jerome Friedman’s MART, with the slight difference that it fits trees of random size. In total, this little change can reduce the problem of overfitting and noticeably speed up computation. Using a Random Target Function Generator suggested by Friedman, I found that, on average, RB is roughly twice as fast as MART with a comparable prediction accuracy in expectation.

Since running the whole simulation takes quite some time (finding the optimal parameters and retraining every model takes roughly one hour for each data set on my Mac), I couldn’t run hundreds or more simulations for this blog post. That’s the objective for future research on Random Boost. Furthermore, I want to benchmark the algorithm on real-world datasets.

In the meantime, feel free to look at my code and run the simulations yourself. Everything is on GitHub. Moreover, if you find something interesting and you want to share it with me, please feel free to shoot me an email.

## References

- Breiman, L. (2001). Random Forests. *Machine Learning*, 45, 5–32.
- Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 785–794.
- Chapelle, O., & Chang, Y. (2011). Yahoo! Learning to Rank Challenge overview. *Proceedings of the Learning to Rank Challenge*, 1–24.
- Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. *Annals of Statistics*, 1189–1232.
- Friedman, J. H. (2002). Stochastic gradient boosting. *Computational Statistics & Data Analysis*, 38(4), 367–378.
- Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. *Machine Learning*, 63(1), 3–42.
- Rashmi, K. V., & Gilad-Bachrach, R. (2015). DART: Dropouts meet Multiple Additive Regression Trees. *Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS)*, San Diego, CA, USA. JMLR: W&CP volume 38.

In this blog post, we will explore the Bagging algorithm and a computationally more efficient variant thereof, Subagging. With minor modifications, these algorithms are also known as Random Forest, and they are widely applied here at STATWORX, in industry, and in academia.

Almost all statistical prediction and learning problems encounter a bias-variance tradeoff. This is particularly pronounced for so-called unstable predictors. While they yield estimates with low bias due to their flexible adaptation to the data, such predictors react very sensitively to small changes in the underlying dataset and hence have high variance. A common example are Regression Tree predictors.

Bagging bypasses this tradeoff by reducing the variance of the unstable predictor while leaving its bias mostly unaffected.

## Method

In particular, Bagging uses repeated bootstrap sampling to construct multiple versions of the same prediction model (e.g. Regression Trees) and averages over the resulting predictions.

Let’s see how Bagging works in detail:

- Construct a bootstrap sample (with replacement) of the original i.i.d. data at hand.
- Fit a Regression Tree to the bootstrap sample; we will call the result the tree predictor.
- Repeat steps one and two many times and average the resulting predictions.

OK – so let us take a glimpse into the construction phase: we draw a number of different bootstrap samples simultaneously from the original data. Then a tree is fitted to each of those samples, and the (in-sample) fitted values are averaged in Step 3, yielding the Bagged predictor.

The variance-reduction happens in Step 3. To see this, consider the following toy example.

Let $X_1, \dots, X_B$ be i.i.d. random variables with $\mathbb{E}[X_i] = \mu$ and $\mathrm{Var}(X_i) = \sigma^2$, and let $\bar{X}_B = \frac{1}{B}\sum_{i=1}^{B} X_i$ denote the sample mean. Easy re-formulations show that

$\mathbb{E}[\bar{X}_B] = \mu \quad \text{and} \quad \mathrm{Var}(\bar{X}_B) = \frac{\sigma^2}{B} \leq \sigma^2.$

We observe that the variance of the mean is indeed weakly smaller than that of the individual random variables, while the sample mean remains unbiased.
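The variance reduction is easy to verify numerically. A small sketch with made-up values for the number of variables and their variance:

```
import numpy as np

rng = np.random.default_rng(0)
B, n_rep = 50, 20000   # variables per batch, number of repetitions
sigma2 = 4.0           # variance of each individual variable

# draw n_rep batches of B i.i.d. variables and average each batch
x = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(n_rep, B))
means = x.mean(axis=1)

print(x.var())       # close to sigma2 = 4
print(means.var())   # close to sigma2 / B = 0.08
```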

Why Bagging works is widely discussed in the literature, and it remains an open research question. Bühlmann and Yu (2002) propose a subsampling variant of Bagging, called Subagging, which is more tractable from a theoretical point of view.

In particular, Bühlmann and Yu (2002) replace the bootstrap procedure of Bagging by subsampling without replacement. Essentially, we only change Step 1 of our Bagging algorithm by randomly drawing observations without replacement from our original data, hence obtaining a subset whose size is a fraction of the original sample. With this variant at hand, it is possible to state upper bounds for the variance and mean squared error of the predictor, given an appropriate choice of the subsample size.

## Simulation Set-Up

As the theory is a little bit cumbersome and involves knowledge of real analysis, we simulate the main findings of Bühlmann and Yu (2002).

Let’s compare the mean-squared prediction error (MSPE) of the Regression Tree, Bagged and Subagged predictor and illustrate the theory part a little bit more.

In order to do this, we consider the following model

$y = f(\mathbf{X}) + \epsilon$

where $f(\mathbf{X})$ is the regression function, $\mathbf{X}$ is the design matrix generated from a uniform distribution, and $\epsilon$ is the error term ($\epsilon \sim \mathcal{N}(0, \sigma^2)$).

For the true data-generating process (DGP), we consider the following model, which is quite frequently used in the machine learning literature and termed the “Friedman #1” model:

$f(\mathbf{X}) = 10 \sin(\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5$

where $X_j$ is the $j$-th column of the design matrix $\mathbf{X}$ (for $j = 1, \dots, 10$).

As you can see, this model is highly non-linear – Regression Tree models shall therefore be appropriate to approximate our DGP.
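As a side note, scikit-learn ships a generator for exactly this DGP, which can serve as a sanity check for a custom simulation (the sample size and noise level here mirror the settings used below):

```
from sklearn.datasets import make_friedman1

# X is uniform on [0, 1]^10; y follows the "Friedman #1" model plus noise
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0,
                      random_state=0)
print(X.shape, y.shape)  # (500, 10) (500,)
```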

To evaluate the prediction performance of Bagging and Subagging predictors we conduct a Monte Carlo simulation in Python.

We first import the relevant packages.

```
import numpy as np
import sklearn.model_selection
import sklearn.ensemble
import simulation_class
import math
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
```

The module `simulation_class` is a user-specified class that we will not discuss in this blog post but in a subsequent one.

Further, we specify the simulation set-up:

```
# Number of regressors
n_reg = 10
# Observations
n_obs = 500
# Simulation runs
n_sim = 50
# Number of trees, i.e. number of bootstrap samples (Step 1)
n_tree = 50
# Error Variance
sigma = 1
# Grid for subsample size
start_grid = 0.1
end_grid = 1
n_grid = 100
grid_range = np.linspace(start_grid, end_grid, num = n_grid)
```

Below, we will explain in more detail what we need the grid specification for.

To store our simulation results we set up containers.

```
# Container Set-up
mse_temp_bagging = np.empty(shape = (n_obs, n_sim))
mse_temp_subagging = np.empty(shape = (n_obs, n_sim))
y_predict_bagging = np.empty(shape = (n_obs, n_sim))
y_predict_subagging = np.empty(shape = (n_obs, n_sim))
mse_decomp = np.empty(shape = (len(grid_range), 2))
```

With this initialization at hand, we generate the test and train data with the `simulation_class` module.

```
# Creation of Simulation-Data
train_setup = simulation_class.simulation(n_reg = n_reg,
                                          n_obs = n_obs,
                                          n_sim = n_sim,
                                          sigma = sigma,
                                          random_seed_design = 0,
                                          random_seed_noise = 1)
test_setup = simulation_class.simulation(n_reg = n_reg,
                                         n_obs = n_obs,
                                         n_sim = n_sim,
                                         sigma = sigma,
                                         random_seed_design = 2,
                                         random_seed_noise = 3)
f_train = train_setup.friedman_model()
X_train, y_train = train_setup.error_term(f_train)
f_test = test_setup.friedman_model()
X_test, y_test = test_setup.error_term(f_test)
```

As we have generated the data for our “Friedman #1” model, we are now able to simulate the mean squared error of the Bagged and Subagged predictors. In Python, both algorithms are implemented via the `BaggingRegressor` method of the `sklearn.ensemble` package. Observe that for the Subagged predictor we need to specify the parameter `max_samples` in the `BaggingRegressor`. This ensures that we draw a subsample of the corresponding subsample fraction from the original data. For the subsample fraction, we have already specified the grid above via the variable `grid_range`.

```
# Subagging-Simulation
for index, a in enumerate(grid_range):
    for i in range(0, n_sim):
        # bagged estimator
        bagging = sklearn.ensemble.BaggingRegressor(
            bootstrap = True,
            n_estimators = 50)
        y_predict_bagging[:, i] = bagging.fit(
            X_train,
            y_train[:, i]).predict(X_test)
        mse_temp_bagging[:, i] = mean_squared_error(
            y_test[:, i],
            y_predict_bagging[:, i])
        # subagged estimator
        subagging = sklearn.ensemble.BaggingRegressor(
            max_samples = math.ceil(a * n_obs),
            bootstrap = False,
            n_estimators = 50)
        y_predict_subagging[:, i] = subagging.fit(
            X_train,
            y_train[:, i]).predict(X_test)
        mse_temp_subagging[:, i] = mean_squared_error(
            y_test[:, i],
            y_predict_subagging[:, i])
    mse_decomp[index, 0] = np.mean(mse_temp_bagging)
    mse_decomp[index, 1] = np.mean(mse_temp_subagging)
```

On my GitHub-Account you can find additional code which also calculates the simulated bias and variance for the fully grown tree and the Bagged tree.

## Results

The results of our above simulation can be found in Figure 1.

Let us first compare the performance in terms of MSPE of the Regression Tree and the Bagged predictor. Table 1 shows us that Bagging drastically reduces the MSPE by decreasing the variance while almost not affecting the bias. (Recall – the mean squared prediction error is just the sum of the squared bias of the estimate, variance of the estimate and the variance of the error term (not reported).)

*Table 1: Performance of fully grown tree and Bagged Predictor*

| Predictor | Tree (fully grown) | Bagged Tree |
|---|---|---|
| Squared bias | 3.47 | 2.94 |
| Variance | 6.13 | 0.35 |
| MSPE | 10.61 | 4.26 |

Figure 1 displays the MSPE as a function of the subsample fraction for the Bagged and Subagged predictors (produced by our code above). From Figure 1 and Table 1, we make several observations:

- We see that both the Bagged and Subagged predictor outperform a single tree (in terms of MSPE).
- For a subsampling fraction of approximately 0.5, Subagging achieves nearly the same prediction performance as Bagging while coming at a lower computational cost.

## References

- Breiman, L.: Bagging predictors. *Machine Learning*, **24**, 123–140 (1996).
- Bühlmann, P., Yu, B.: Analyzing bagging. *Annals of Statistics*, **30**, 927–961 (2002).

Last time we dove deep into the world of a little salad bar just a few steps away from the STATWORX office. This time we are going to dig even deeper … well, we are going to dig a little deeper. Today’s Specials are cross-elasticities and the effect of promotions.

We talked so much about salads because the situation of our little island of semi-healthy lunch choices makes for an illustrative case on how we can calculate price elasticities – the measure generally agreed upon by economists to evaluate how any change in price affects demand. Since the intricacies of deriving these price elasticities of demand with regressions are to be the subject of this short blog series, the salad bar was a cheap example.

## What we did so far

Specifically, in my last post, we wanted to know how a linear regression function relates to elasticity. It turns out that this depends on how the price and demand variables have been transformed. We explored four different transformations and in the end, we came to the conclusion that the log-log model fits our data best.

This was by no means an accident. Empirical explorations may occasionally guide us in a different direction, but microeconomic arguments are heavily in favor of log-log models. The underlying demand curve describes demand most like economists assume it to behave: it ensures that demand cannot sink below zero as the price increases, while demand grows exponentially as the price decreases. Yet the fact that it implies a constant elasticity is its most desirable feature, because it makes the utilization of elasticity that much simpler.

Granted, if I’m being honest, the real reason the log-log model worked is that I created the data. The salad vendor does exist, but they certainly did not fork over their sales data just because I thought their store would make an illustrative example. All the data we worked with was the result of how I imagined this little salad vendor’s market situation to be: daily sales prices over the past two years were simulated by randomly selecting prices within a certain range, and then a multiplicative demand function was used to derive sales with some added randomness. And with that, we were done simulating the data.
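The mechanism can be sketched in a few lines. This is not the original simulation code (that is on the STATWORX GitHub page), and the price range, elasticity, and noise level here are made up; the point is that a multiplicative demand function generates sales, and a log-log regression recovers the constant elasticity as its slope:

```
import numpy as np

rng = np.random.default_rng(1)
elasticity = -1.5   # assumed true price elasticity
n = 730             # two years of daily prices

# random daily prices within a certain range
price = rng.uniform(4.0, 8.0, size=n)
# multiplicative demand with lognormal noise
demand = 500 * price ** elasticity * rng.lognormal(sigma=0.1, size=n)

# log-log OLS: the slope is the (constant) elasticity
slope, intercept = np.polyfit(np.log(price), np.log(demand), 1)
print(slope)  # close to -1.5
```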

Obviously, this is far from the often messy historical data that one will encounter at retailers or any business that sells anything. There was no consideration of competition, no in-store alternatives, no promotional activities, no seasonal effects, or any of the other business-specific factors that obscure price effects. The relationship between price and demand is usually obfuscated by the many other factors that influence a consumer’s buying decision. Thus, it is the intricacy of isolating these interfering factors that determines the success of empirical work.

## A closer look at the data at hand

To illustrate this, I went back to the sales data drawing board and made up some more data. For details, check out the code on our Github page. The results can be seen in the graph below. It actually consists of two parts: a scatter plot that illustrates daily sales quantities over time and a line graph that describes the price development over time. If one looks more closely, the development of sales cannot always be explained by the price alone. The second half of the year 2014 illustrates this most glaringly: sales spikes occur that seem unrelated to the product price.

Additional information is thus needed. Obviously, there is a multitude of possible factors that might explain the discrepancies in the relationship between price and demand, and of course we look at anything provided to us in order to evaluate whether we can extract some pricing-relevant insight from it. The two most common requests concern information about promotional activities and intel on competitor prices. Luckily, we do not have to ask, as I simulated the data and integrated it into the previously seen graph (see below). Promotional activities by the salad bar are indicated by thin blue lines across the two graphs, and the pricing history of the closest competitor is illustrated in the graph below.

Almost surprisingly (though not really), the graph illustrates that both promotions and competitor prices impact the number of salads sold by our small salad bar. If we now ran the same log-log regression as before, the resulting elasticity estimate would be skewed.

But how does one include both the promotional data and the competitor price information? There are several fancy ways of doing so, but for now we’ll stick with simple (and often efficient) ones. For the promotional data, we will use a dummy variable – coded one for every day our salad bar turned on its promotions engine and zero otherwise. In reality, the salad bar’s key promotional tools are an incentive-lacking stamp card and the occasional flyer with some time-limited coupons; the data describes the latter.

## The theory of cross-price elasticity

To utilize the prices of competitors, we quickly need to delve into a bit of economic theory. Next to the price elasticity of demand, there is a second concept called cross-price elasticity of demand. The two concepts are as similar as their names suggest. Just look at the two formulas.

The formula for price elasticity:

$$\epsilon = \frac{\% \Delta Q_A}{\% \Delta P_A} = \frac{\partial Q_A}{\partial P_A} \cdot \frac{P_A}{Q_A}$$

And the formula for cross-price elasticity:

$$\epsilon_{AB} = \frac{\% \Delta Q_A}{\% \Delta P_B} = \frac{\partial Q_A}{\partial P_B} \cdot \frac{P_B}{Q_A}$$

The only thing that differs is the price with which one calculates: instead of using the product’s own price, one uses the competitor’s price. Conceptually, however, price elasticity and cross-price elasticity differ slightly. With cross-elasticity scores, positive values are plausible. For own-price elasticity, by contrast, a positive score is highly implausible and almost always indicates omitted-variable bias, though some luxury goods, like iPhones, may be the exception.

Cross elasticities, however, are commonly positive. The reason is captured by the terms substitutional goods and complementary goods. An obvious substitutional good would be a salad from the restaurant next to the salad bar: one eats either at the salad bar or at the restaurant. Thus, a higher price for the salad at the restaurant makes it more likely that customers buy their salad at our salad bar instead. A higher price at the restaurant therefore positively impacts sales at the salad bar, and the corresponding cross-elasticity score is positive.
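The sign logic can be made concrete with a quick back-of-the-envelope calculation on made-up numbers (purely illustrative, not the post's data): the substitute raises its price by 10%, and our salad bar sells 5% more salads.

```python
# Illustrative numbers: the neighboring restaurant (a substitute)
# raises its salad price from 7.00 to 7.70 EUR, and our salad bar's
# daily sales rise from 100 to 105 salads.
p_old, p_new = 7.00, 7.70
q_old, q_new = 100.0, 105.0

pct_change_q = (q_new - q_old) / q_old   # +5% demand at our salad bar
pct_change_p = (p_new - p_old) / p_old   # +10% competitor price

cross_elasticity = pct_change_q / pct_change_p  # positive: a substitute
```

For a complementary good, sales would fall as the other good's price rises, flipping the sign of the score.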

## The influence of competitors

But we can also imagine negative cross-elasticity scores. In this case, we are confronted with a complementary product. For example, next to our salad bar is a cute little coffee place. Coffee is perfect at overcoming the after-lunch food coma – so many people want one. Unfortunately, the greedy owner of this cute coffee shop just increased her prices – again! Annoying, but not a big problem for the customers: the restaurant with which our salad bar competes also sells cheap coffee. For the salad bar, however, this is a big problem. Its salad and the coffee from the cute but greedy shop are complementary products: as the coffee price increases, salad sales at our salad bar slow.

In Frankfurt, there are obviously only coffee shops with very reasonably priced coffee, so we stick to just the information about the main competitor. But how to operationalize the competitor price data? The conceptual and mathematical closeness between elasticity and cross elasticity suggests that one could treat them similarly. And indeed, it is a good starting point to include competitor prices the same way as the product’s own prices – spoiler alert: logged.

| | Advanced Estimate | Advanced Std. Error | Simple Estimate | Simple Std. Error |
|---|---|---|---|---|
| Intercept | 5.36 | 0.09 | 8.43 | 0.11 |
| Log Price | -1.45 | 0.03 | -1.76 | 0.05 |
| Promo | 0.45 | 0.02 | | |
| Log Competitor Price | 1.23 | 0.03 | | |
| Adjusted R² | 0.80 | | 0.41 | |

Let’s look at the output in the table above. As I was in full control of the data-generating process, we know what the underlying elasticity actually was: I set it at -1.5. Because of the minuscule level of randomness added, both models do reasonably well. Yet even accounting for the standard errors, there is still a clear winner: by ignoring the impact of promotional activities and the competitor’s prices, the price elasticity estimate of the simpler model is biased. Here the difference is relatively small, but with real data the difference can be substantial or, since essential information is omitted, systematic.
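The omitted-variable bias can be reproduced in a few lines. Below is a minimal Python sketch (the coefficients and the positive correlation between the two prices are my own assumptions, not the post's simulation code): it generates log-log data with promotion and competitor-price effects, then fits the advanced and the simple model by ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 730

# Hypothetical data-generating process: own price and competitor price
# are positively correlated, as they often are in practice -- this
# correlation is exactly what biases the simple model.
comp_price = rng.uniform(5.0, 9.0, size=n)
price = 0.5 * comp_price + rng.uniform(2.0, 4.0, size=n)
promo = rng.binomial(1, 0.1, size=n)

log_q = (8.0 - 1.5 * np.log(price) + 1.2 * np.log(comp_price)
         + 0.45 * promo + rng.normal(0.0, 0.05, size=n))

def ols(X, y):
    """Least-squares coefficients with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Advanced model: controls for promotions and the competitor's price
adv = ols(np.column_stack([np.log(price), promo, np.log(comp_price)]), log_q)

# Simple model: log price only -- its elasticity estimate absorbs the
# effect of the omitted, correlated competitor price
simple = ols(np.log(price).reshape(-1, 1), log_q)
```

Here `adv[1]` lands close to the true elasticity of -1.5, while `simple[1]` is pulled away from it by the omitted competitor price.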

That is all! In the next blog post in this series, we introduce more complex relationships between price, promotions, competitor pricing, and concepts to utilize the insights for business purposes.

Machine Learning (ML) is still an underdog in the field of economics, although it has been gaining more and more recognition in recent years. One reason for this underdog status is that in economics and other social sciences one is not only interested in prediction but also in causal inference, so many “off-the-shelf” ML algorithms solve a fundamentally different problem. We here at STATWORX also face such problems, e.g. in dynamic pricing optimization.

“Prediction by itself is only occasionally sufficient. The post office is happy with any method that predicts correct addresses from hand-written scrawls…[But] most statistical surveys have the identification of causal factors as their ultimate goal.” – Bradley Efron

## Introduction

However, the literature combining ML and causal inference is growing by the day. One common problem in causal inference is the estimation of heterogeneous treatment effects. So, we will take a look at three interesting and different approaches to it and focus on a very recent paper by Athey et al., which is forthcoming in “The Annals of Statistics”^{1}.

### Model-based Recursive Partitioning

One of the earlier papers about causal trees is by Zeileis et al., 2008^{2}. They describe an algorithm for Model-based Recursive Partitioning (MOB), which extends recursive partitioning to more complex models. They first fit a parametric model to the data set using maximum likelihood, then test for parameter instability over a set of predefined variables, and finally split the sample along the variable with the highest parameter instability. These steps are repeated in each of the daughter nodes until a stopping criterion is reached. However, they do not provide statistical properties for the MOB, and the partitions are still quite large.

### Bayesian Additive Regression Tree

Another paper uses Bayesian Additive Regression Trees (BART) for the estimation of heterogeneous treatment effects^{3}. One advantage of this approach is that BART can detect and handle interactions and non-linearity in the response surface. It uses a sum-of-trees model: first a weak-learner tree is grown, the residuals are calculated, and the next tree is fitted to these residuals. Similar to boosting algorithms, BART needs to avoid overfitting. This is achieved with a regularization prior, which restricts overfitting by limiting the contribution of each tree to the final result.
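The residual-fitting idea behind the sum-of-trees model can be sketched with a greedy boosting stand-in (BART itself samples trees via Bayesian backfitting MCMC, so this is only an analogy, not the BART algorithm): a shrinkage factor plays the role of the regularization prior by limiting each tree's contribution.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n)

def fit_stump(x, y, candidates):
    """Weak learner: a single-split tree (stump) minimizing SSE."""
    def sse(c):
        l, r = y[x <= c], y[x > c]
        return ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
    c = min(candidates, key=sse)
    return c, y[x <= c].mean(), y[x > c].mean()

candidates = np.linspace(0.05, 0.95, 19)
shrinkage = 0.3   # stands in for the regularization prior:
                  # each tree may only contribute a little
pred = np.zeros(n)
for _ in range(100):
    resid = y - pred                          # fit the next tree to residuals
    c, mean_l, mean_r = fit_stump(x, resid, candidates)
    pred += shrinkage * np.where(x <= c, mean_l, mean_r)
```

Each weak stump on its own fits the target poorly, but the shrunken sum of 100 of them approximates the non-linear response surface closely.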

## Generalized Random Forest

However, this and the next blog post will mainly focus on the Generalized Random Forest (GRF) by Athey et al., who have been exploring the possibilities of ML in economics for some time. It is a method for non-parametric statistical estimation which keeps the basic ideas of the random forest: recursive partitioning, subsampling, and random split selection. Nevertheless, the final outcome is not estimated by simply averaging over the trees. Instead, the forest is used to estimate an adaptive weighting function: we grow a set of trees, and each observation is weighted by how often it falls into the same leaf as the target observation. These weights are then used to solve a “local GMM” model.
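The adaptive weighting can be illustrated with a deliberately simplified toy in Python (random split points stand in for fitted trees; this is not the grf implementation): the weights count how often each observation shares a leaf with the target point, and in the plain regression case the local estimate reduces to a weighted average of the outcomes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = x ** 2 + rng.normal(0.0, 0.05, size=n)

def grf_style_weights(x, x0, rng, n_trees=200, n_cuts=4):
    """Toy forest weights alpha_i(x0): average over trees of
    1{x_i shares a leaf with x0} / |leaf of x0|."""
    total = np.zeros(len(x))
    for _ in range(n_trees):
        cuts = np.sort(rng.uniform(0.0, 1.0, size=n_cuts))  # a "random tree"
        leaves = np.searchsorted(cuts, x)
        in_leaf = leaves == np.searchsorted(cuts, x0)
        m = in_leaf.sum()
        if m > 0:                       # guard against empty leaves in this toy
            total += in_leaf / m
    w = total / n_trees
    return w / w.sum()                  # normalize the weights to sum to one

x0 = 0.5
alpha = grf_style_weights(x, x0, rng)

# In the plain regression case, the weighted "local GMM" problem
# collapses to a weighted average of the outcomes:
estimate = float(np.sum(alpha * y))
```

Observations close to the target point share a leaf with it more often and hence receive larger weights, so the estimate localizes around x0.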

Another important piece of the GRF is the split selection algorithm, which emphasizes maximizing heterogeneity. Within this framework, a wide variety of applications is possible, like quantile regression but also the estimation of heterogeneous treatment effects, so the split selection must be suitable for many different purposes. As in Breiman’s random forest, splits are selected greedily. However, in the case of general moment estimation, there is no direct loss criterion to minimize. Instead, we want to maximize a criterion ∆ that favors splits increasing the heterogeneity of the in-sample estimates. Maximizing ∆ directly would be computationally costly, so Athey et al. use a gradient-based approximation, which results in computational performance similar to standard CART approaches.

## Comparing the regression forest of GRF to standard random forest

Athey et al. claim in their paper that, in the special case of a regression forest, the GRF yields the same results as the standard random forest by Breiman (2001). One estimation method already implemented in the `grf` package^{4} is a regression forest. Therefore, I will compare its results with the random forest implementations of the `randomForest` package as well as the `ranger` package. For tuning purposes, I will use a random search with 50 iterations for the `randomForest` and `ranger` packages, and for the grf the implemented `tune_regression_forest()` function. The algorithms will be benchmarked on three data sets, using the RMSE to compare the results. For easy handling, I implemented `regression_forest()` into the caret framework, which can be found on my GitHub.

| Data Set | Metric | grf | ranger | randomForest |
|---|---|---|---|---|
| air | RMSE | 0.25 | 0.24 | 0.24 |
| bike | RMSE | 2.90 | 2.41 | 2.67 |
| gas | RMSE | 36.0 | 32.6 | 34.4 |

The GRF performs a little bit worse than the other implementations. However, this could also be due to the tuning of the parameters, as there are more parameters to tune. According to their GitHub, the authors are planning to improve the `tune_regression_forest()` function.

One advantage of the GRF is that it produces unbiased confidence intervals for each estimation point. To do so, it performs honest tree splitting, which was first described in the authors’ paper about causal trees^{5}. With honest tree splitting, one sample is used to make the splits and another, disjoint sample is used to estimate the coefficients.
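Honest splitting is easy to sketch for a single split (a hypothetical toy, not the grf code): one half of the sample picks the split point, and the disjoint other half estimates the leaf means, so the leaf estimates are not fitted to the same noise that selected the split.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 400
x = rng.uniform(0.0, 1.0, size=n)
y = np.where(x > 0.5, 1.0, 0.0) + rng.normal(0.0, 0.2, size=n)

# Honest splitting: disjoint halves for splitting and estimation
idx = rng.permutation(n)
I, J = idx[: n // 2], idx[n // 2:]   # I chooses the split, J estimates

def best_split(x, y, candidates):
    """Pick the candidate threshold minimizing within-group SSE."""
    def sse(c):
        l, r = y[x <= c], y[x > c]
        return ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
    return min(candidates, key=sse)

candidates = np.linspace(0.1, 0.9, 17)
c = best_split(x[I], y[I], candidates)      # chosen on sample I only

# Leaf means estimated on the held-out sample J
mu_left = y[J][x[J] <= c].mean()
mu_right = y[J][x[J] > c].mean()
```

Because J played no part in choosing the threshold, `mu_left` and `mu_right` are unbiased for the leaf means, which is what makes valid confidence intervals possible.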

However, standard regression is not the exciting part of the Generalized Random Forest. Therefore, I will take a look at how the GRF performs in estimating heterogeneous treatment effects with simulated data and compare it to the estimation results of the MOB and the BART in my next blog post.

## References

- Athey, Tibshirani, Wager. Forthcoming. “Generalized Random Forests.”
- Zeileis, Hothorn, Hornik. 2008. “Model-based Recursive Partitioning.”
- Hill. 2011. “Bayesian Nonparametric Modeling for Causal Inference.”
- https://github.com/swager/grf
- Athey and Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.”