
The Hidden Risks of Black-Box Algorithms

Reading and evaluating countless resumes in the shortest possible time and making recommendations for suitable candidates – this is now possible with artificial intelligence in applicant management. This is because advanced AI technologies can efficiently analyze even large volumes of complex data. In HR management, this not only saves valuable time in the pre-selection process but also enables applicants to be contacted more quickly. Artificial intelligence also has the potential to make application processes fairer and more equitable.

However, real-world experience has shown that artificial intelligence is not always “fair”. A few years ago, for example, an Amazon recruiting algorithm stirred up controversy for discriminating against women when selecting candidates. Additionally, facial recognition algorithms have repeatedly led to incidents of discrimination against People of Color.

One reason for this is that complex AI algorithms independently calculate predictions and results based on the data fed into them. How exactly they arrive at a particular result is not initially comprehensible. This is why they are also known as black-box algorithms. In Amazon’s case, the AI determined suitable applicant profiles based on the current workforce, which was predominantly male, and thus made biased decisions. In a similar way, algorithms can reproduce stereotypes and reinforce discrimination.

Principles for Trustworthy AI

The Amazon incident shows that transparency is highly relevant in the development of AI solutions to ensure that they function ethically. This is why transparency is also one of the seven statworx Principles for trustworthy AI. The employees at statworx have collectively defined the following AI principles: Human-centered, transparent, ecological, respectful, fair, collaborative, and inclusive. These serve as guidance for everyday work with artificial intelligence. Universally applicable standards, rules, and laws do not yet exist. However, this could change in the near future.

The European Union (EU) has been discussing a draft law on the regulation of artificial intelligence for some time. Known as the AI Act, this draft has the potential to be a gamechanger for the global AI industry, because it does not target European companies alone: all companies that offer AI systems on the European market, whose AI-generated output is used within the EU, or that operate AI systems for internal use within the EU would be affected. The requirements that an AI system must meet depend on its application.

Recruiting algorithms are likely to be classified as high-risk AI. Accordingly, companies would have to fulfill comprehensive requirements during the development, publication, and operation of the AI solution. Among other things, companies are required to comply with data quality standards, prepare technical documentation, and establish risk management. Violations may result in heavy fines of up to 6% of global annual sales. Companies should therefore start preparing for the upcoming requirements and examining their AI algorithms now. Explainable AI (XAI) methods can be a useful first step: with their help, black-box algorithms can be better understood, and the transparency of the AI solution can be increased.

Unlocking the Black Box with Explainable AI Methods

XAI methods enable developers to better interpret the concrete decision-making processes of algorithms. This makes it more transparent how an algorithm forms patterns and rules and arrives at its decisions. As a result, potential problems such as discrimination in the application process can be discovered and corrected. In this way, XAI not only contributes to greater transparency of AI but also supports its ethical use and improves conformity with the upcoming AI Act.

Some XAI methods are even model-agnostic, i.e. applicable to any AI algorithm from decision trees to neural networks. The field of research around XAI has grown strongly in recent years, which is why there is now a wide variety of methods. However, our experience shows that there are large differences between different methods in terms of the reliability and meaningfulness of their results. Furthermore, not all methods are equally suitable for robust application in practice and for gaining the trust of external stakeholders. Therefore, we have identified our top 3 methods based on the following criteria for this blog post:

  1. Is the method model agnostic, i.e. does it work for all types of AI models?
  2. Does the method provide global results, i.e. does it say anything about the model as a whole?
  3. How meaningful are the resulting explanations?
  4. How good is the theoretical foundation of the method?
  5. Can malicious actors manipulate the results or are they trustworthy?

Our Top 3 XAI Methods at a Glance

Using the above criteria, we selected three widely used and proven methods that are worth diving a bit deeper into: Permutation Feature Importance (PFI), SHAP Feature Importance, and Accumulated Local Effects (ALE). In the following, we explain how each of these methods works and what it is used for. We also discuss their advantages and disadvantages and illustrate their application using the example of a recruiting AI.

Efficiently Identify Influential Variables with Permutation Feature Importance

The goal of Permutation Feature Importance (PFI) is to find out which variables in the data set are particularly crucial for the model to make accurate predictions. In the case of the recruiting example, PFI analysis can shed light on what information the model relies on to make its decision. For example, if gender emerges as an influential factor here, it can alert the developers to potential bias in the model. In the same way, a PFI analysis creates transparency for external users and regulators. Two things are needed to compute PFI:

  1. An accuracy metric such as the error rate (proportion of incorrect predictions out of all predictions).
  2. A test data set that can be used to determine accuracy.

In the test data set, one variable after the other is concealed from the model by randomly permuting its values. Then, the accuracy of the model is determined on the transformed test data set. From there, we conclude that those variables whose concealment affects model accuracy the most are particularly important. Once all variables are analyzed and sorted, we obtain a visualization like Figure 1. Using our artificially generated sample data set, we can derive the following: work experience did not play a major role in the model, but the ratings from the interview were influential.


Figure 1 – Permutation Feature Importance using the example of a recruiting AI (data artificially generated).
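To make the procedure concrete, here is a minimal sketch of how PFI could be computed by hand in R. The fitted model, the data frames `train` and `test`, and the target column `hired` are purely illustrative assumptions, not part of the analysis behind Figure 1.

```r
# Minimal PFI sketch: shuffle one variable at a time and measure how much the
# error rate increases (train/test, the target "hired", and the model choice
# are illustrative assumptions).
library(randomForest)

set.seed(42)
model <- randomForest(hired ~ ., data = train)

error_rate <- function(model, data) {
  mean(predict(model, data) != data$hired)
}

baseline <- error_rate(model, test)

pfi <- sapply(setdiff(names(test), "hired"), function(feature) {
  permuted <- test
  permuted[[feature]] <- sample(permuted[[feature]])  # conceal this variable
  error_rate(model, permuted) - baseline              # importance = increase in error
})

sort(pfi, decreasing = TRUE)  # variables whose concealment hurts accuracy most
```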

A great strength of PFI is that it follows a clear mathematical logic. The correctness of its explanation can be proven by statistical considerations. Furthermore, there are hardly any manipulable parameters in the algorithm with which the results could be deliberately distorted. This makes PFI particularly suitable for gaining the trust of external observers. Finally, the computation of PFI is very resource efficient compared to other XAI methods.

One weakness of PFI is that it can provide misleading explanations under some circumstances. If a variable is assigned a low PFI value, it does not always mean that the variable is unimportant to the issue. For example, if the bachelor’s degree grade has a low PFI value, this may simply be because the model can look at the master’s degree grade instead, since the two are usually similar. Such correlated variables can complicate the interpretation of the results. Nonetheless, PFI is an efficient and useful method for creating transparency in black-box models.

| Strengths | Weaknesses |
| --- | --- |
| Little room for malicious manipulation of results | Does not consider interactions between variables |
| Efficient computation | |

Uncover Complex Relationships with SHAP Feature Importance

SHAP Feature Importance is a method for explaining black box models based on game theory. The goal is to quantify the contribution of each variable to the prediction of the model. As such, it closely resembles Permutation Feature Importance at first glance. However, unlike PFI, SHAP Feature Importance provides results that can account for complex relationships between multiple variables.

SHAP is based on a concept from game theory: Shapley values. Shapley values are a fairness criterion that assigns each variable a weight corresponding to its contribution to the outcome. This is analogous to a team sport, where the winning prize is divided fairly among all players according to their contribution to the victory. With SHAP, we can look at every individual observation in the data set and analyze what contribution each variable has made to the prediction of the model.

If we now determine the average absolute contribution of a variable across all observations in the data set, we obtain the SHAP Feature Importance. Figure 2 illustrates the results of this analysis. The similarity to the PFI is evident, even though the SHAP Feature Importance only places the rating of the job interview in second place.


Figure 2 – SHAP Feature Importance using the example of a recruiting AI (data artificially generated).
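As a rough illustration of how such values can be obtained, the sketch below approximates SHAP Feature Importance with the `iml` package by averaging absolute Shapley values over a sample of observations. It reuses the hypothetical `model`, `train`, and `hired` objects from the PFI sketch and restricts the prediction to a single class probability.

```r
# Sketch: SHAP Feature Importance as the mean absolute Shapley value per variable
library(iml)

X <- train[, setdiff(names(train), "hired")]
predictor <- Predictor$new(
  model, data = X, y = train$hired,
  # return a single numeric score (probability of the second class)
  predict.function = function(model, newdata) predict(model, newdata, type = "prob")[, 2]
)

rows <- sample(nrow(X), 50)  # a subsample keeps the runtime manageable
contributions <- do.call(rbind, lapply(rows, function(i) {
  Shapley$new(predictor, x.interest = X[i, ])$results
}))

# Average absolute contribution of each variable across the sampled observations
sort(tapply(abs(contributions$phi), contributions$feature, mean), decreasing = TRUE)
```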

A major advantage of this approach is the ability to account for interactions between variables. By simulating different combinations of variables, it is possible to show how the prediction changes when two or more variables vary together. For example, the final grade of a university degree should always be considered in the context of the field of study and the university. In contrast to the PFI, the SHAP Feature Importance takes this into account. Also, Shapley Values, once calculated, are the basis of a wide range of other useful XAI methods.

However, one weakness of the method is that it is more computationally expensive than PFI. Efficient implementations are available only for certain types of AI algorithms like decision trees or random forests. Therefore, it is important to carefully consider whether a given problem requires a SHAP Feature Importance analysis or whether PFI is sufficient.

| Strengths | Weaknesses |
| --- | --- |
| Little room for malicious manipulation of results | Calculation is computationally expensive |
| Considers complex interactions between variables | |

Focus in on Specific Variables with Accumulated Local Effects

Accumulated Local Effects (ALE) is a further development of the commonly used Partial Dependence Plots (PDP). Both methods aim at simulating the influence of a certain variable on the prediction of the model. This can be used to answer questions such as “Does the chance of getting a management position increase with work experience?” or “Does it make a difference if I have a 1.9 or a 2.0 on my degree certificate?”. Therefore, unlike the previous two methods, ALE makes a statement about the model’s decision-making, not about the relevance of certain variables.

In the simplest case, the PDP, a sample of observations is selected and used to simulate what effect, for example, an isolated increase in work experience would have on the model prediction. Isolated means that none of the other variables are changed in the process. The average of these individual effects over the entire sample can then be visualized (Figure 3, above). Unfortunately, PDP results are not particularly meaningful when variables are correlated. For example, let us look at university degree grades. The PDP simulates all possible combinations of grades in bachelor’s and master’s programs. Unfortunately, this results in cases that rarely occur in the real world, e.g., an excellent bachelor’s degree and a terrible master’s degree. The PDP is blind to such unrealistic cases, and the results may suffer accordingly.

ALE analysis, on the other hand, attempts to solve this problem by using a more realistic simulation that adequately represents the relationships between variables. Here, the variable under consideration, e.g., bachelor’s grade, is divided into several sections (e.g., 6.0-5.1, 5.0-4.1, 4.0-3.1, 3.0-2.1, and 2.0-1.0). Now, the simulation of the bachelor’s grade increase is performed only for individuals in the respective grade group. This prevents unrealistic combinations from being included in the analysis. An example of an ALE plot can be found in Figure 3 (below). Here, we can see that ALE identifies a negative impact of work experience on the chance of employment, which PDP was unable to find. Is this behavior of the AI desirable? For example, does the company want to hire young talent in particular? Or is there perhaps an unwanted age bias behind it? In both cases, the ALE plot helps to create transparency and to identify undesirable behavior.


Figure 3 – Partial Dependence Plot and Accumulated Local Effects using a recruiting AI as an example (data artificially generated).
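Such curves can be produced, for example, with the `iml` package. The sketch below reuses the hypothetical `predictor` object from the SHAP sketch; the feature name `work_experience` is likewise an assumption.

```r
# Sketch: ALE curve (and a PDP for comparison) for a single variable
library(iml)

ale <- FeatureEffect$new(predictor, feature = "work_experience", method = "ale")
ale$plot()

pdp <- FeatureEffect$new(predictor, feature = "work_experience", method = "pdp")
pdp$plot()
```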

In summary, ALE is a suitable method to gain insight into the influence of a certain variable on the model prediction. This creates transparency for users and even helps to identify and fix unwanted effects and biases. A disadvantage of the method is that ALE can only meaningfully analyze one or two variables together in the same plot. Thus, to understand the influence of all variables, multiple ALE plots must be generated, which makes the analysis less compact than PFI or SHAP Feature Importance.

| Strengths | Weaknesses |
| --- | --- |
| Considers complex interactions between variables | Only one or two variables can be analyzed in one ALE plot |
| Little room for malicious manipulation of results | |

Build Trust with Explainable AI Methods

In this post, we presented three Explainable AI methods that can help make algorithms more transparent and interpretable. This also favors meeting the requirements of the upcoming AI Act. Even though it has not yet been passed, we recommend starting to work on creating transparency and traceability for AI models based on the draft law as soon as possible. Many data scientists have little experience in this field and need further training and time to familiarize themselves with XAI concepts before they can identify relevant algorithms and implement effective solutions. Therefore, it makes sense to get acquainted with our recommended methods preemptively.

With Permutation Feature Importance (PFI) and SHAP Feature Importance, we demonstrated two techniques to determine the relevance of certain variables to the prediction of the model. In summary, SHAP Feature Importance is a powerful method for explaining black-box models that considers the interactions between variables. PFI, on the other hand, is easier to implement but less powerful for correlated data. Which method is most appropriate in a particular case depends on the specific requirements.

We also introduced Accumulated Local Effects (ALE), a technique that can analyze and visualize exactly how an AI responds to changes in a specific variable. The combination of one of the two feature importance methods with ALE plots for selected variables is particularly promising. This can provide a theoretically sound and easily interpretable overview of the model – whether it is a decision tree or a deep neural network.

The application of Explainable AI is a worthwhile investment – not only to build internal and external trust in one’s own AI solutions. Rather, we expect that the skillful use of interpretation-enhancing methods can help avoid impending fines under the AI Act, prevent legal consequences, and protect those affected from harm – as in the case of incomprehensible recruiting software.
Our free AI Act Quick Check helps you assess whether any of your AI systems could be affected by the AI Act: https://www.statworx.com/en/ai-act-tool/

Sources & Further Information:

https://www.faz.net/aktuell/karriere-hochschule/buero-co/ki-im-bewerbungsprozess-und-raus-bist-du-17471117.html (last opened 03.05.2023)
https://t3n.de/news/diskriminierung-deshalb-platzte-amazons-traum-vom-ki-gestuetzten-recruiting-1117076/ (last opened 03.05.2023)
For more information on the AI Act: https://www.statworx.com/en/content-hub/blog/how-the-ai-act-will-change-the-ai-industry-everything-you-need-to-know-about-it-now/
Statworx principles: https://www.statworx.com/en/content-hub/blog/statworx-ai-principles-why-we-started-developing-our-own-ai-guidelines/
Christoph Molnar: Interpretable Machine Learning: https://christophm.github.io/interpretable-ml-book/

Max Hilsdorf, Julia Rettig

Image Sources
AdobeStock 566672394 – by TheYaksha

Introduction

Forecasts are of central importance in many industries. Whether it’s predicting resource consumption, estimating a company’s liquidity, or forecasting product sales in retail, forecasts are an indispensable tool for making successful decisions. Despite their importance, many forecasts still rely primarily on the prior experience and intuition of experts. This makes it difficult to automate the relevant processes, potentially scale them, and provide efficient support. Furthermore, experts may be biased due to their experiences and perspectives or may not have all the relevant information necessary for accurate predictions.

These reasons have led to the increasing importance of data-driven forecasts in recent years, and the demand for such predictions is accordingly strong.

At statworx, we have already successfully implemented a variety of projects in the field of forecasting. As a result, we have faced many challenges and become familiar with numerous industry-specific use cases. One of our internal working groups, the Forecasting Cluster, is particularly passionate about the world of forecasting and continuously develops their expertise in this area.

We now aim to combine our collected experience into a user-friendly tool that allows anyone to obtain an initial assessment for a specific forecasting use case, depending on the data and requirements. Both customers and employees should be able to use the tool quickly and easily to receive methodological recommendations. Our long-term goal is to make the tool publicly accessible. However, we are first testing it internally to optimize its functionality and usefulness. We place special emphasis on ensuring that the tool is intuitive to use and provides easily understandable outputs.

Although our Recommender Tool is still in the development phase, we would like to provide an exciting sneak peek.

Common Challenges

Model Selection

In the field of forecasting, there are various modeling approaches. We differentiate between three central approaches:

  1. Time Series Models
  2. Tree-based Models
  3. Deep Learning Models

There are many criteria that can be used when selecting a model. For univariate time series data with strong seasonality and trends, classical time series models such as (S)ARIMA and ETS are appropriate. On the other hand, for multivariate time series data with potentially complex relationships and large amounts of data, deep learning models are a good choice. Tree-based models like LightGBM offer greater flexibility compared to time series models, are well-suited for interpretability due to their architecture, and tend to have lower computational requirements compared to deep learning models.
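To make the classical end of this spectrum concrete, a seasonal ARIMA baseline can be fitted automatically with the `forecast` package; the built-in `AirPassengers` series merely stands in for real data.

```r
# Sketch: automatic (S)ARIMA baseline for a univariate monthly series
library(forecast)

fit <- auto.arima(AirPassengers)  # selects non-seasonal and seasonal orders automatically
fc  <- forecast(fit, h = 12)      # 12-month-ahead forecast with prediction intervals
plot(fc)
```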

Seasonality

Seasonality refers to recurring patterns in a time series that occur at regular intervals (e.g. daily, weekly, monthly, or yearly). Including seasonality in the modeling is important to capture these regular patterns and improve the accuracy of forecasts. Time series models such as SARIMA, ETS, or TBATS can explicitly account for seasonality. For tree-based models like LightGBM, seasonality can only be considered by creating corresponding features, such as dummies for relevant seasonalities. One way to explicitly account for seasonality in deep learning models is by using sine and cosine functions. It is also possible to use a deseasonalized time series. This involves removing the seasonality initially, followed by modeling on the deseasonalized time series. The resulting forecasts are then supplemented with seasonality by applying the process used for deseasonalization in reverse. However, this process adds another level of complexity, which is not always desirable.
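For models that do not handle seasonality natively, a minimal sketch of such seasonal features in R, assuming a data frame `df` with a `date` column, could look like this:

```r
# Sketch: seasonal dummy and Fourier-style features for tree-based or deep learning models
df$month   <- factor(format(df$date, "%m"))  # monthly dummies
df$weekday <- factor(weekdays(df$date))      # weekday dummies

doy <- as.numeric(format(df$date, "%j"))     # day of the year
df$sin_year <- sin(2 * pi * doy / 365.25)    # sine/cosine terms encode yearly
df$cos_year <- cos(2 * pi * doy / 365.25)    # seasonality as smooth features
```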

Hierarchical Data

Especially in the retail industry, hierarchical data structures are common as products can often be represented at different levels of granularity. This frequently results in the need to create forecasts for different hierarchies that do not contradict each other. The aggregated forecasts must therefore match the disaggregated forecasts. There are various approaches to this. With top-down and bottom-up methods, forecasts are created at one level and then disaggregated or aggregated downstream. Reconciliation methods such as Optimal Reconciliation involve creating forecasts at all levels and then reconciling them to ensure consistency across all levels.
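The reconciliation idea can be sketched with the `fable`/`fabletools` workflow below, which fits a model on every level of an aggregation hierarchy and then reconciles the forecasts; the `tourism` data from the `tsibble` package merely stands in for retail sales data.

```r
# Sketch: hierarchical forecasting with bottom-up and optimal (MinT) reconciliation
library(fable)
library(tsibble)

hierarchy <- tourism |>
  aggregate_key(State / Region, Trips = sum(Trips))  # build all aggregation levels

fc <- hierarchy |>
  model(ets = ETS(Trips)) |>
  reconcile(
    bottom_up = bottom_up(ets),                         # aggregate bottom-level forecasts
    mint      = min_trace(ets, method = "mint_shrink")  # optimal reconciliation
  ) |>
  forecast(h = "2 years")
```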

Cold Start

In a cold start, the challenge is to forecast products that have little or no historical data. In the retail industry, this usually refers to new product introductions. Since it is not possible to train a model for these products due to the lack of history, alternative approaches must be used. A classic approach to handling a cold start is to rely on expert knowledge. Experts can provide initial estimates of demand, which can serve as a starting point for forecasting. However, this approach can be highly subjective and does not scale. Alternatively, similar products or potential predecessor products can be used as a reference. Products can be grouped based on product categories or with clustering algorithms such as K-Means. Using cross-learning models trained on many products represents a scalable option.
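The grouping idea can be sketched in a few lines of R: cluster existing products on static attributes and let a new product borrow the demand history of its nearest cluster. All object and attribute names below are illustrative.

```r
# Sketch: cold start via product clustering with k-means
attrs <- scale(products[, c("price", "pack_size", "margin")])
km    <- kmeans(attrs, centers = 5, nstart = 25)
products$cluster <- km$cluster

# Scale a new product with the same centering/scaling and find its nearest cluster
new_raw    <- c(price = 4.99, pack_size = 6, margin = 0.3)
new_scaled <- (new_raw - attr(attrs, "scaled:center")) / attr(attrs, "scaled:scale")
nearest    <- which.min(colSums((t(km$centers) - new_scaled)^2))

# Use the average demand history of that cluster as a proxy for the new product
proxy_history <- colMeans(sales[products$cluster == nearest, ])
```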

Recommender Concept

With our Recommender Tool, we aim to address different problem scenarios to enable the most efficient development process. It is an interactive tool where users provide inputs based on their objectives or requirements and the characteristics of the available data. Users can also prioritize certain requirements, and the output will weight them accordingly. Based on these inputs, the tool generates methodological recommendations that best cover the solution requirements, depending on the available data characteristics. Currently, the output is a purely content-based representation of the recommendations, providing concrete guidelines for central topics such as model selection, pre-processing, and feature engineering. The following example gives an idea of the conceptual approach:

The output presented here is based on a real project where the implementation in R and the possibility of local interpretability were of central importance. At the same time, new products were frequently introduced, which the developed solution also had to forecast. To achieve this goal, several global models were trained using CatBoost. Thanks to this approach, over 200 products could be included in the training. Even for newly introduced products with no historical data, forecasts could be generated. To ensure the interpretability of the forecasts, SHAP values were used. This made it possible to clearly explain each prediction based on the features used.

Summary

The current development is focused on creating a tool optimized for forecasting. Through its use, we aim to increase efficiency in forecasting projects. By combining gathered experience and expertise, the tool will offer guidelines for modeling, pre-processing, and feature engineering, among other topics. It will be designed to be used by both customers and employees to quickly and easily obtain estimates and methodological recommendations. An initial test version will be available soon for internal use, but the tool is ultimately intended to be made accessible to external users as well. In addition to the technical output currently in development, a less technical output will also be available. The latter will focus on the most important aspects and their associated efforts. In particular, it will cover the business perspective in the form of expected effort and potential trade-offs between effort and benefit.

 

 

Benefit from our forecasting expertise!

If you need support in addressing the challenges in your forecasting projects or have a forecasting project planned, we are happy to provide our expertise and experience to assist you.

     

    Image Source:

AdobeStock 83282923 – Mego-studio

Marlon Schumacher

    Introduction

    Sometimes here at STATWORX impromptu discussions about statistical methods happen. In one such discussion one of my colleagues decided to declare (albeit jokingly) Bayesian statistics unnecessary. This made me ask myself: Why would I ever use Bayesian models in the context of a standard regression problem? Surely, existing approaches such as Ridge Regression are just as good, if not better. However, the Bayesian approach has the advantage that it lets you regularize your model to prevent overfitting **and** meaningfully interpret the regularization parameters. Contrary to the usual way of looking at ridge regression, the regularization parameters are no longer abstract numbers, but can be interpreted through the Bayesian paradigm as derived from prior beliefs. In this post, I’ll show you the formal similarity between a generalized ridge estimator and the Bayesian equivalent.

    A (very brief) primer on Bayesian Stats

    To understand the Bayesian regression estimator a minimal amount of knowledge about Bayesian statistics is necessary, so here’s what you need to know (if you don’t already): In Bayesian statistics we think about model parameters (i.e. regression coefficients) probabilistically. In other words, the data given to us is fixed, and the parameters are considered random. This runs counter to the standard, frequentist perspective in which the underlying model parameters are treated as fixed, while the data are considered random realizations of the stochastic process driven by those fixed model parameters. The end goal of Bayesian analysis is to find the posterior distribution, which you may remember from Bayes Rule:

        \[p(\theta|y) = \frac{p(y|\theta) p(\theta)}{p(y)}\]


While p(y|\theta) is our likelihood and p(y) is a normalizing constant, p(\theta) is our prior, which does not depend on the data, y. In classical statistics, p(\theta) is set to 1 (an improper reference prior) so that when the posterior ‘probability’ is maximized, really just the likelihood is maximized, because it’s the only part that still depends on \theta. However, in Bayesian statistics we use an actual probability distribution in place of p(\theta), a Normal distribution for example. So let’s consider the case of a regression problem, and we’ll assume that our target, y, **and** our prior follow normal distributions. This leads us to **conjugate** Bayesian analysis, in which we can neatly write down an equation for the posterior distribution. In many cases this is actually not possible, and for this reason Markov Chain Monte Carlo methods were invented to sample from the posterior – taking a frequentist approach, ironically.

We’ll make the usual assumption about the data: y_i is i.i.d. N(\mathbf{x}_i \beta, \sigma^2) for all observations i. This gives us our standard likelihood for the Normal distribution. Now we can specify the priors for the parameters we’re trying to estimate, (\beta, \sigma^2). If we choose a Normal prior (conditional on the variance, \sigma^2) for the vector of weights \beta, i.e. N(b_0, \sigma^2 B_0), and an inverse-Gamma prior over the variance parameter, it can be shown that the posterior distribution for \beta is Normally distributed with mean

        \[\hat\beta_{Bayesian} = (B_0^{-1} + X'X)^{-1}(B_0^{-1} b_0 + X'X \hat\beta)\]


    If you’re interested in a proof of this result check out Jackman (2009, p.526).

    Let’s look at it piece by piece:

- \hat\beta is our standard OLS estimator, (X'X)^{-1}X'y
- b_0 is the mean vector of the (multivariate normal) prior distribution, so it lets us specify what we think the average values of each of our model parameters are
- B_0 is the covariance matrix and contains our respective uncertainties about the model parameters. The inverse of the variance is called the **precision**

What we can see from the equation is that the mean of our posterior is a **precision weighted average** of our prior mean (information not based on data) and the OLS estimator (based solely on the data). The second term in parentheses indicates that we are taking the uncertainty weighted prior mean, B_0^{-1} b_0, and adding it to the weighted OLS estimator, X'X\hat\beta. Imagine for a moment that B_0^{-1} = 0. Then

        \[\hat\beta_{Bayesian} = (X'X)^{-1}(X'X \hat\beta) = \hat\beta\]


This would mean that we are **infinitely** uncertain about our prior beliefs, so that the mean vector of our prior distribution would vanish, contributing nothing to our posterior! Likewise, if our uncertainty decreases (and the precision thus increases), the prior mean, b_0, contributes more to the posterior mean.

    After this short primer on Bayesian statistics, we can now formally compare the Ridge estimator with the above Bayesian estimator. But first, we need to take a look at a more general version of the Ridge estimator.

    Generalizing the Ridge estimator

A standard tool in many regression problems, the Ridge estimator is derived by minimizing the following least squares loss function:

        \[L(\beta,\lambda) = \frac{1}{2}\sum(y-X\beta)^2 + \frac{1}{2} \lambda ||\beta||^2\]


    While minimizing this gives us the standard Ridge estimator you have probably seen in textbooks on the subject, there’s a slightly more general version of this loss function:

        \[L(\beta,\lambda,\mu) = \frac{1}{2}\sum(y-X\beta)^2 + \frac{1}{2} \lambda ||\beta - \mu||^2\]


    Let’s derive the estimator by first re-writing the loss function in terms of matrices:

        \[\begin{aligned}L(\beta,\lambda,\mu) &= \frac{1}{2}(y - X \beta)^{T}(y - X \beta) + \frac{1}{2} \lambda||\beta - \mu||^2 \\&= \frac{1}{2} y^Ty - \beta^T X^T y + \frac{1}{2} \beta^T X^T X \beta + \frac{1}{2} \lambda||\beta - \mu||^2\end{aligned}\]


    Differentiating with respect to the parameter vector, we end up with this expression for the gradient:

        \[\nabla_{\beta} L (\beta, \lambda, \mu) = -X^T y + X^T X \beta + \lambda (\beta - \mu)\]


Setting the gradient to zero and solving for \beta, we get this expression for the generalized ridge estimator:

        \[\hat\beta_{Ridge} = (X'X + \lambda I )^{-1}(\lambda \mu + X'y)\]


The standard Ridge estimator can be recovered by setting \mu=0. Usually we regard \lambda as an abstract parameter that regulates the penalty size, and \mu as a vector of values (one for each predictor) towards which the coefficients are pulled: the further the coefficients deviate from these values, the larger the loss. When \mu=0 the coefficients are pulled towards zero.

Let’s take a look at how the estimator behaves when the parameters \mu and \lambda change. We’ll define a meaningful ‘prior’ for our example and then vary the penalty parameter. As an example, we’ll use the `diamonds` dataset from the `ggplot2` package and model the price as a linear function of the carat, depth, table, x, y, and z attributes of each diamond.
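Before looking at the plots, here is a small base R sketch of the generalized ridge estimator applied to the `diamonds` data; the concrete values of \lambda and \mu are chosen purely for illustration.

```r
# Sketch: generalized ridge estimator (X'X + lambda*I)^{-1} (lambda*mu + X'y)
library(ggplot2)  # provides the diamonds data

X <- scale(as.matrix(diamonds[, c("carat", "depth", "table", "x", "y", "z")]))
y <- diamonds$price - mean(diamonds$price)  # center y so the intercept can be dropped

generalized_ridge <- function(X, y, lambda, mu = rep(0, ncol(X))) {
  p <- ncol(X)
  drop(solve(t(X) %*% X + lambda * diag(p), lambda * mu + t(X) %*% y))
}

generalized_ridge(X, y, lambda = 1e4)                    # standard ridge (mu = 0)
generalized_ridge(X, y, lambda = 1e4, mu = rep(500, 6))  # pulled towards a 'prior' of 500
```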

![penalty_plots](penalty_plots.tiff)

As we can see from the plot, both with and without a prior the coefficient estimates change rapidly for the first few increases in the penalty size. We also see the ‘shrinkage’ effect in the upper plot: as the penalty increases, the coefficients tend towards zero, some faster than others. The second plot shows how the coefficients change when we set a sensible ‘prior’. The coefficients still change, but they now tend towards the ‘prior’ we specified. This is because \lambda penalizes deviations from \mu, which means that larger values of the penalty pull the coefficients towards \mu. You might be asking yourself how this compares to the Bayesian estimator. Let’s find out!

    Comparing the Ridge and Bayesian Estimator

Now that we’ve seen both the Ridge and the Bayesian estimators, it’s time to compare them. We discovered that the Bayesian estimator contains the OLS estimator. Since we know its form, let’s substitute it and see what happens:

        \[\begin{aligned}\hat\beta_{Bayesian} &= (X'X + B_0^{-1})^{-1}(B_0^{-1} b_0 + X'X \hat\beta) \\&= (X'X + B_0^{-1})^{-1}(B_0^{-1} b_0 + X'X (X'X)^{-1}X'y) \\&= (X'X + B_0^{-1})^{-1}(B_0^{-1} b_0 + X'y)\end{aligned}\]

     
    This form makes the analogy much clearer:

- \lambda I corresponds to B_0^{-1}, the matrix of precisions. In other words, since I is the identity matrix, the ridge estimator assumes no covariances between the regression coefficients and a constant precision across all coefficients (recall that \lambda is a scalar).
- \lambda \mu corresponds to B_0^{-1} b_0, which makes sense, since the vector b_0 is the mean of our prior distribution, which essentially pulls the estimator towards it, just like \mu ‘shrinks’ the coefficients towards its values. This ‘pull’ depends on the uncertainty captured by B_0^{-1} or \lambda I in the ridge estimator.

That’s all well and good, but let’s see how changing the uncertainty in the Bayesian case compares to the behaviour of the ridge estimator. Using the same data and the same model specification as above, we’ll set the prior covariance matrix B_0 equal to \lambda I and then vary \lambda. Remember, smaller values of \lambda now imply a greater contribution of the prior (less uncertainty), and therefore increasing them makes the prior less important.

![bayes_penalty_plots](bayes_penalty_plots.tiff)

The above plots match our understanding so far: with a prior mean of zeros, the coefficients are shrunken towards zero when the prior dominates, i.e. when the precision is high, as in the ridge regression case. And when a prior mean is set, the coefficients tend towards it as the precision increases. So much for the coefficients, but what about the performance? Let’s have a look!

    Performance comparison

Lastly, we’ll compare the predictive performance of the two models. Although we could treat the parameters in the model as hyperparameters which we would need to tune, this would defy the purpose of using prior knowledge. Instead, let’s choose a prior specification for both models and then compare the performance on a hold-out set (30% of the data). While we can use the simple X\hat\beta as our predictor for the Ridge model, the Bayesian model provides us with a full posterior predictive distribution which we can sample from to get model predictions. To estimate the model I used the `brms` package.
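A sketch of what such a model could look like in `brms` is shown below; the priors and the train/test objects are illustrative assumptions, not the exact specification behind the numbers in the table.

```r
# Sketch: Bayesian linear model with informative normal priors on the coefficients
library(brms)

fit <- brm(
  price ~ carat + depth + table + x + y + z,
  data   = train_set,                          # hypothetical training split
  prior  = prior(normal(0, 10), class = "b"),  # roughly b_0 = 0, B_0 = 10^2 * I
  chains = 2, iter = 2000, seed = 42
)

# Predictions summarize draws from the posterior predictive distribution
pred <- predict(fit, newdata = test_set)
head(pred)  # Estimate, Est.Error, Q2.5, Q97.5
```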

|  | RMSE | MAE | MAPE |
| :--- | ---: | ---: | ---: |
| Bayesian Linear Model | 1625.38 | 1091.36 | 44.15 |
| Ridge Estimator | 1756.01 | 1173.50 | 43.44 |

Overall both models perform similarly, although some error metrics slightly favor one model over the other. Judging by these errors, we could certainly improve our models by specifying a more appropriate probability distribution for our target variable. After all, prices cannot be negative, yet our models can and do produce negative predictions.

    Recap

In this post, I’ve shown you how the ridge estimator compares to the Bayesian conjugate linear model. Now you understand the connection between the two models and how a Bayesian approach can provide a more readily interpretable way of regularizing your model. Normally \lambda would be considered a penalty size, but now it can be interpreted as a measure of prior uncertainty. Similarly, the parameter vector \mu can be seen as a vector of prior means for our model parameters in the extended ridge model. Beyond that, the Bayesian approach lets us use prior distributions to incorporate expert knowledge into the estimation process. This regularizes the model and allows external information to be included. If you are interested in the code, check it out at our [GitHub page](https://github.com/STATWORX/blog/tree/master/bayesian_regularization)!

    References

    – Jackman S. 2009. Bayesian Analysis for the Social Sciences. West Sussex: Wiley.

Thomas Alcock

    At STATWORX we are excited that a new promising field of Machine Learning has evolved in recent years: Causal Machine Learning. In short, Causal Machine Learning is the scientific study of Machine Learning algorithms that allow estimating causal effects. Over the last few years, different Causal Machine Learning algorithms have been developed, combining the advances from Machine Learning with the theory of causal inference to estimate different types of causal effects. My colleague Markus has already introduced some of these algorithms in an earlier blog post.

    As Causal Machine Learning is a rather complex topic, I will write a series of blog posts to slowly dive into this new fascinating world of data science. In my first blog post, I gave an introduction to the topic, focusing on what Causal Machine Learning is and why it is important in practice and for the future of data science. In this second blog post, I will introduce the so-called Causal Forest, one of the most popular Causal Machine Learning algorithms to estimate heterogeneous treatment effects.

    Why Heterogeneous Treatment Effects?

In Causal Forests, the goal is to estimate heterogeneity in treatment effects. As explained in my previous blog post, a treatment effect refers to the causal effect of a treatment or intervention on an outcome variable of scientific or political interest, for example the causal effect of a subsidized training program on earnings. As individual treatment effects are unobservable, practice focuses on estimating unbiased and consistent averages of the individual treatment effect. The most common such parameter is the average treatment effect, which is the mean of all individual treatment effects in the entire population of interest. However, treatment effects may sometimes vary widely between different subgroups in the population, be it larger or smaller than the average treatment effect. In some cases, it might therefore be more interesting to estimate these different, i.e. heterogeneous, treatment effects.

    In most applications it is also interesting to look beyond the average effects in order to understand how the causal effects vary with observable characteristics. (Knaus, Lechner & Strittmatter, 2018)

    The estimation of heterogeneous treatment effects can assist in answering questions like: For whom are there big or small treatment effects? For which subgroup does a treatment generate beneficial or adverse effects? In the field of marketing, for example, the estimation of heterogeneous treatment effects can help to optimize resource allocation by answering the question of which customers respond the most to a certain marketing campaign or for which customers is the causal effect of intervention strategies on their churn behavior the highest. Or when it comes to pricing, it might be interesting to quantify how a change in price has a varying impact on sales among different age or income groups.

    Where Old Estimation Methods Fail

    Estimating heterogeneous treatment effects is nothing new. Econometrics and other social sciences have long been studying which variables predict a smaller or larger than average treatment effect, which in statistical terms is also known as Moderation. One of the most traditional ways to find heterogeneous treatment effects is to use a Multiple Linear Regression with interaction terms between the variables of interest (i.e. the ones which might lead to treatment heterogeneity) and the treatment indicator. In this blog post, I will always assume that the data is from a randomized experiment, such that the assumptions to identify treatment effects are valid without further complications. We then conclude that the treatment effect depends on the variables whose interaction term is statistically significant. For example, if we have only one variable, the regression model would look as follows:

    \[Y = \beta_0 + \beta_1 w + \beta_2 x_1 + \beta_3 (w \cdot x_1),\]

where w is the treatment indicator and x_1 is the variable of interest. In that case, if \beta_3 is significant, we know that the treatment effect depends on variable x_1. The treatment effect for each observation can then be calculated as

    \[\beta_1 + \beta_3 \cdot x_1,\]

    which is dependent on the value of x_1 and therefore heterogeneous among the different observations.
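In R, this moderation approach boils down to a single regression call; the data frame `df` with columns Y, w, and x_1 below is a hypothetical example.

```r
# Sketch: detecting treatment effect heterogeneity via an interaction term
fit <- lm(Y ~ w * x_1, data = df)  # expands to w + x_1 + w:x_1
summary(fit)                       # a significant w:x_1 term => effect depends on x_1

# Estimated treatment effect per observation: beta_1 + beta_3 * x_1
tau_hat <- coef(fit)["w"] + coef(fit)["w:x_1"] * df$x_1
```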

    So why is there a need for more advanced methods to estimate heterogeneous treatment effects? The example above was very simple, it only included one variable. However, usually, we have more than one variable which might influence the treatment effect. To see which variables predict heterogeneous treatment effects, we have to include many interaction terms, not only between each variable and the treatment indicator but also for all possible combinations of variables with and without the treatment indicator. If we have p variables and one treatment, this gives a total number of parameters of:

    \[\sum_{k = 0}^{p + 1} \binom{p + 1}{k}.\]

So, for example, if we had 5 variables, we would have to include a total of 64 parameters in our Linear Regression Model. This approach suffers from a lack of statistical power and could also cause computational issues. The use of a Multiple Linear Regression also imposes linear relationships unless more interactions with polynomials are included. Because Machine Learning algorithms can handle enormous numbers of variables and combine them in nonlinear and highly interactive ways, researchers have found ways to better estimate heterogeneous treatment effects by combining the field of Machine Learning with the study of Causal Inference.

    Generalized Random Forests

Over recent years, different Machine Learning algorithms have been developed to estimate heterogeneous treatment effects. Most of them are based on the idea of Decision Trees or Random Forests, just like the one I focus on in this blog post: Generalised Random Forests by Athey, Tibshirani, and Wager (2019).

Generalized Random Forests follow the idea of Random Forests and, apart from heterogeneous treatment effect estimation, this algorithm can also be used for non-parametric quantile regression and instrumental variable regression. It keeps the main structure of Random Forests such as the recursive partitioning, subsampling, and random split selection. However, instead of averaging over the trees, Generalised Random Forests estimate a weighting function and use the resulting weights to solve a local GMM model. To estimate heterogeneous treatment effects, this algorithm has two important additional features, which distinguish it from standard Random Forests.

    1. Splitting Criterion

The first important difference to Random Forests is the splitting criterion. In Random Forests, where we want to predict an outcome variable Y, the split at each tree node is performed by minimizing the mean squared error of the outcome variable Y. In other words, the variable and value to split at each tree node are chosen such that the greatest reduction in the mean squared error with regard to the outcomes Y is achieved. After each tree partition has been completed, the tree’s prediction for a new observation x is obtained by letting it trickle all the way down from the tree’s root into a terminal node, and then taking the average of the outcomes Y of all the training observations that fell into the same node. The Random Forest prediction is then calculated as the average of the predicted tree values.

    In Causal Forests, we want to estimate treatment effects. As stated by the Fundamental Problem of Causal Inference, however, we can never observe a treatment effect on an individual level. Therefore, the prediction of a treatment effect is given by the difference in the average outcomes Y between the treated and the untreated observations in a terminal node. Without going into too much detail, to find the most heterogeneous but also accurate treatment effects, the splitting criterion is adapted such that it searches for a partitioning where the treatment effects differ the most including a correction that accounts for how the splits affect the variance of the parameter estimates.


    2. Honesty

Random Forests are usually evaluated by applying them to a test set and measuring the accuracy of the predictions of Y using an error measure such as the mean squared error. Because we can never observe treatment effects, this form of performance measure is not possible in Causal Forests. When estimating causal effects, one therefore evaluates their accuracy by examining the bias, the standard error, and the related confidence interval of the estimates. To ensure that an estimate is as accurate as possible, the bias should vanish asymptotically, and the standard error and, thus, the confidence interval should be as small as possible. To enable this statistical inference in their Generalised Random Forest, Athey, Tibshirani, and Wager introduce so-called honest trees.

In order to make a tree honest, the training data is split into two subsamples: a splitting subsample and an estimating subsample. The splitting subsample is used to perform the splits and thus grow the tree. The estimating subsample is then used to make the predictions. That is, all observations in the estimating subsample are dropped down the previously grown tree until they fall into a terminal node. The prediction of the treatment effects is then given by the difference in the average outcomes between the treated and the untreated observations of the estimating subsample in the terminal nodes. With such honest trees, the estimates of a Causal Forest are consistent (i.e. the bias vanishes asymptotically) and asymptotically Gaussian, which, together with the estimator for the asymptotic variance, allows valid confidence intervals.

    Causal Forest in Action

To show the advantages of Causal Forests compared to older estimation methods, in the following I compare the Generalised Random Forest to a Regression with interaction terms in a small simulation study. I use simulated data to be able to compare the estimated treatment effects with the actual treatment effects, which, as we know, would not be observable in real data. To compare the two algorithms with respect to the estimation of heterogeneous treatment effects, I test them on two different data sets, one with and one without heterogeneity in the treatment effect:

| Data Set | Heterogeneity | Heterogeneity Variables | Variables | Observations |
| --- | --- | --- | --- | --- |
| 1 | No Heterogeneity | – | x_1 to x_{10} | 20000 |
| 2 | Heterogeneity | x_1 and x_2 | x_1 to x_{10} | 20000 |

    This means that in the first data set, all observations have the same treatment effect. In this case, the average treatment effect and the heterogeneous treatment effects are the same. In the second data set, the treatment effect varies with the variables x_1 and x_2. Without going into too much detail here (I will probably write a separate blog post only about causal data generating processes), the relationship between those heterogeneity variables (x_1 and x_2) and the treatment effect is not linear. Both simulated data sets have 20’000 observations containing an outcome variable Y and 10 covariates with values between zero and one.

To evaluate the two algorithms, the data sets are split into a train (75%) and a test set (25%). For the Causal Forest, I use the causal_forest() from the grf-package with tune.parameters = "all". I compare this to an lm() model, which includes all variables, the treatment indicator, and the necessary interaction terms of the heterogeneity variables and the treatment indicator:

    Linear Regression Model for data set with heterogeneity:

    \[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{10} x_{10} + \beta_{11} w + \beta_{12} (w \cdot x_1) + \beta_{13} (w \cdot x_2) + \beta_{14} (x_1 \cdot x_2) + \beta_{15} (w \cdot x_1 \cdot x_2)\]

    Linear Regression Model for data set with no heterogeneity:

    \[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{10} x_{10} + \beta_{11} w\]

where x_1 to x_{10} are the covariates and w is the treatment indicator (i.e. w = 1 if treated and w = 0 if not treated). As already explained above, we usually do not know which variables affect the treatment effect and therefore have to include all possible interaction terms in the Linear Regression Model to see which variables lead to treatment effect heterogeneity. In the case of 10 variables, as we have here, this means we would have to include a total of 2048 parameters in our Linear Regression Model. However, since the heterogeneity variables are known in the simulated data, I only include the interaction terms for those variables here.
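A rough sketch of the Causal Forest side of this comparison with the `grf` package is shown below; the simulated objects (X_train, Y_train, W_train, X_test, tau_true) stand in for the data-generating process, which is not shown here.

```r
# Sketch: Causal Forest on the simulated data
library(grf)

cf <- causal_forest(X = X_train, Y = Y_train, W = W_train, tune.parameters = "all")

tau_hat <- predict(cf, X_test)$predictions  # heterogeneous treatment effect estimates
sqrt(mean((tau_hat - tau_true)^2))          # RMSE against the true simulated effects

average_treatment_effect(cf)                # ATE with standard error, for reference
```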

| Data Set | Metric | grf | lm |
| --- | --- | --- | --- |
| No Heterogeneity | RMSE | 0.01 | 0.00 |
| Heterogeneity | RMSE | 0.08 | 0.45 |

Looking at the results, we can see that without heterogeneity, the treatment effect is equally well predicted by the Causal Forest (RMSE of 0.01) and the Linear Regression (RMSE of 0.00). In the presence of heterogeneity, however, the Causal Forest is far more accurate (RMSE of 0.08) than the Linear Regression (RMSE of 0.45). As expected, the Causal Forest is better at detecting the underlying non-linear relationship between the heterogeneity variables and the treatment effect than the Linear Regression Model, which can also be seen in the plots below. Thus, even if we already know which variables influence the treatment effect and only need to include the necessary interaction terms, the Linear Regression Model is still less accurate than the Causal Forest due to its lack of modeling flexibility.

Figure: treatment effect hexplot

    Outlook

    I hope that this blog post has helped you to understand what Causal Forests are and what advantages they bring in estimating heterogeneous treatment effects compared to old estimation methods. In my upcoming blog posts on Causal Machine Learning, I will explore this new field of data science further. I will, for example, take a look at the problems of using classical Machine Learning algorithms to estimate causal effects in more detail or introduce different data generating processes to evaluate Causal Machine Learning methods in simulation studies.

    References

    • Athey, S., Tibshirani, J., & Wager, S. (2019). Generalised random forests. The Annals of Statistics, 47(2), 1148-1178.
    • Knaus, M. C., Lechner, M., & Strittmatter, A. (2018). Machine learning estimation of heterogeneous causal effects: Empirical monte carlo evidence. arXiv:1810.13237v2.
Livia Eichenberger

    At STATWORX, we are excited that a new promising field of Machine Learning has evolved in recent years: Causal Machine Learning. In short, Causal Machine Learning is the scientific study of Machine Learning algorithms that allow estimating causal effects. Over the last few years, different Causal Machine Learning algorithms have been developed, combining the advances from Machine Learning with the theory of causal inference to estimate different types of causal effects. My colleague Markus has already introduced some of these algorithms in an earlier blog post.

    As Causal Machine Learning is a rather complex topic, I will write a series of blog posts to slowly dive into this new fascinating world of data science. This first blog post is an introduction to the topic, focusing on what Causal Machine Learning is and why it is important in practice and for the future of data science.

    The Origins of Causal Machine Learning

    As Markus has already explained in his earlier blog post, analysis in economic and other social sciences revolves primarily around the estimation of causal effects, that is, the isolated effect of a feature X on the outcome variable Y. Actually, in most cases, the interest lies in so-called treatment effects. A treatment effect refers to a causal effect of a treatment or intervention on an outcome variable of scientific or political interest. In economics, one of the most analyzed treatment effects is the causal effect of a subsidized training program on earnings.

Following the potential outcome framework introduced by Rubin (1974), the treatment effect of an individual is defined as follows:

    \[\gamma_i = Y_i(1) - Y_i(0)\]

where Y_i(1) indicates the potential outcome of individual i with treatment and, conversely, Y_i(0) denotes the potential outcome of individual i without treatment. However, as an individual can either receive the treatment or not, we can only ever observe one of the two potential outcomes for an individual at one point in time; the individual treatment effect is therefore unobservable. This problem is also known as the Fundamental Problem of Causal Inference. Nevertheless, under certain assumptions, averages of the individual treatment effect may be identified. In randomized experiments, where the treatment is randomly assigned, these assumptions are naturally valid, and the identification of any aggregation level of individual treatment effects is possible without further complications. In many situations, however, randomized experiments are not possible, and the researcher has to work with observational data, where these assumptions are usually not valid. Thus, an extensive literature in economics and other fields has focused on techniques for identifying causal effects in cases where these assumptions do not hold.
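Under randomization, the average treatment effect can be estimated without any modeling machinery at all; the data frame `df` with outcome `Y` and treatment indicator `w` below is purely illustrative.

```r
# Sketch: ATE in a randomized experiment as a simple difference in means
ate_hat <- mean(df$Y[df$w == 1]) - mean(df$Y[df$w == 0])

# Equivalent regression formulation: the coefficient on w estimates the ATE
coef(lm(Y ~ w, data = df))["w"]
```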

    Prediction and causal inference are distinct (though closely related) problems. Athey, 2017, p. 484

In contrast, (Supervised) Machine Learning literature has traditionally focused on prediction, that is, producing predictions of the outcome variable from the features. Machine Learning models are designed to discover complex structures in given data and generalize them so that they can be used to make accurate predictions on new data. These algorithms can handle enormous numbers of predictors and combine them in nonlinear and highly interactive ways. They have been proven to be hugely successful in practice and are used in applications ranging from medicine to resource allocations in cities.

    Bringing Together the Best of Both Worlds

Although economists and other social scientists prioritize precise estimates of causal effects above predictive power, they were intrigued by the advantages of Machine Learning methods, such as the precise out-of-sample prediction power or the ability to deal with large numbers of features. But as we have seen, classical Machine Learning models are not designed to estimate causal effects. Using off-the-shelf prediction methods from Machine Learning leads to biased estimates of causal effects. The existing Machine Learning techniques had to be modified to use the advantages of Machine Learning for consistently and efficiently estimating causal effects – the birth of Causal Machine Learning!


    Currently, Causal Machine Learning can be broadly divided into two lines of research, defined by the type of causal effect to be estimated. One line of Causal Machine Learning research focuses on modifying Machine Learning methods to estimate unbiased and consistent average treatment effects. The average treatment effect is the mean of all individual treatment effects in an entire population of interest, and probably the most common parameter analyzed in econometric causal studies. Models from this line of research try to answer questions like: How will customers react on average to a marketing campaign? What is the average effect of a price change on sales? The other line of Causal Machine Learning research focuses on modifying Machine Learning methods to uncover treatment effect heterogeneity. That is, identifying subpopulations (based on features) of individuals who have a larger or smaller than average treatment effect. These models are designed to answer questions such as: Which customers respond the most to a marketing campaign? How does the effect of a price change on sales change with the age of customers?

    Decision-Making Questions Need Causal Answers

    Although the study of Causal Machine Learning has been mainly driven by economic researchers, its importance for other areas such as business should not be neglected. Companies often reach for classical Machine Learning tools to solve decision-making problems, such as where to set the price or which customers to target with a marketing campaign. However, there is a significant gap between making a prediction and making a decision. To make a data-driven decision, the understanding of causal relationships is key. Let me illustrate this problem with two examples from our daily business.

    Example 1: Price Elasticities

    At the core of every company’s pricing management is the understanding of how customers will respond to a change in price. To set an optimal price, the company needs to know how much it will sell at different (hypothetical) price levels. The most practicable and meaningful metric answering this question is the price elasticity of demand. Although it might seem straightforward to estimate the price elasticity of demand using classical Machine Learning methods to predict sales as the outcome with the price level as a feature, in practice, this approach does not simply give us the causal effect of price on sales.

    There are a number of gaps between making a prediction and making a decision, and underlying assumptions need to be understood in order to optimise data-driven decision making. Athey, 2017, p. 483

    Following a similar example introduced by Athey (2017), assume we have historical data on airline prices and the respective occupancy rates. Typically, prices and occupancy rates are positively correlated, as airlines' usual pricing policies specify raising seat prices when the occupancy rate increases. In this case, a classical Machine Learning model will answer the following question: If on a particular day airline ticket prices are high, what is the best prediction for the occupancy rate on that day? The model will correctly predict that the occupancy rate is likely to be high. However, it would be wrong to infer from this that an increase in prices leads to a higher occupancy rate. From common experience, we know that the true causal effect is quite the opposite – if an airline systematically raised its ticket prices by 10% everywhere, it would be unlikely to sell more tickets.
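
    The following stylized R sketch (all numbers invented for illustration) mimics this situation: an unobserved demand shock drives both prices and occupancy, so the naive predictive regression of occupancy on price picks up a positive association even though the built-in causal effect of price on occupancy is negative:

    set.seed(123)
    n <- 5000
    demand    <- rnorm(n)                                          # unobserved demand shock (the confounder)
    price     <- 100 + 20 * demand + rnorm(n, sd = 5)              # airlines raise prices when demand is high
    occupancy <- 0.7 + 0.08 * demand - 0.002 * (price - 100) +     # causal price effect is negative (-0.002)
                 rnorm(n, sd = 0.02)

    # naive "predictive" regression: the price coefficient is positive,
    # although the causal effect of a price increase on occupancy is negative
    coef(lm(occupancy ~ price))["price"]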

    Example 2: Customer Churn

    Another common problem, which companies like to solve with the help of Machine Learning, is the prediction of customer churn (i.e., customers abandoning the firm or its services). Companies are interested in identifying the customers with the highest risk of churn so that they can respond by allocating interventions in the hope of preventing these customers from leaving.

    Classical Machine Learning algorithms have proven to be very good at predicting customer churn. Unfortunately, these results cannot sufficiently address the company's resource allocation problem of which customers to best target with intervention strategies. The question of the optimal allocation of resources to customers is of a causal nature: For which customers is the causal effect of an intervention strategy on their churn behavior the largest? A study has shown that, in many cases, the overlap between the customers with the highest risk of churning and the customers who would respond most to interventions is much lower than 100%. Thus, treating customer churn as a pure prediction problem and therefore using classical Machine Learning models is not optimal and yields lower payoffs to companies.

    The Wish of Every Data Scientist

    Looking beyond these practical examples, there is a more profound reason why Causal Machine Learning should be of interest to any data scientist: model generalizability. A Machine Learning model that captures the causal relationships in the data will generalize to new settings, which is still one of the biggest challenges in Machine Learning.


    To illustrate this, I'll use the example of the rooster and the sun from "The Book of Why" by Pearl and Mackenzie (2018). A Machine Learning algorithm that is shown data about a rooster and the sun would associate the rising of the sun with the crow of the rooster and may be able to accurately predict when the sun will rise: if the rooster has just crowed, the sun will rise shortly after. Such a model, which is only capable of capturing correlations, will not generalize to a situation where there is no rooster. In that case, a Machine Learning model will never predict that the sun will rise, because it has never observed such a data point (i.e., one without a rooster). If, however, the model captured the true causal relationship, that is, that the sun being about to rise causes the rooster to crow, it would be perfectly able to predict that the sun will rise even if there is no rooster.

    No True Artificial Intelligence Without Causal Reasoning

    Pearl and Mackenzie (2018) go even further, arguing that we can never reach true human-level Artificial Intelligence without teaching machines causal reasoning since cause and effect are the key mechanisms through which we humans process and understand the complex world around us. The ability to predict correlations does not make machines intelligent; it merely allows them to model a reality based on data the algorithm is provided.

    The algorithmisation of counterfactuals invites thinking machines to benefit from the ability to reflect on one’s past actions and to participate in this (until now) uniquely human way of thinking about the world. Pearl & Mackenzie, 2018, p. 10

    Furthermore, Machine Learning models need the capacity to detect causal effects to ask counterfactual questions, that is, to inquire how some relationship would change given some kind of intervention. As counterfactuals are the building blocks of moral behavior and scientific thought, machines will only be able to communicate more effectively with us humans and reach the status of moral beings with free will if they learn causal and thus counterfactual reasoning.

    Outlook

    Although this last part turned out rather philosophical, I hope that this blog post has helped you understand what Causal Machine Learning is and why it is necessary, not only in practice but also for the future of data science in general. In my upcoming blog posts, I will discuss various aspects of this topic in more detail. For example, I will take a closer look at the problems of using classical Machine Learning algorithms to estimate causal effects, and I will compare different Causal Machine Learning algorithms in a simulation study.

    References

    • Athey, S. (2017). Beyond prediction: Using big data for policy problems. Science, 355(6324), 483-485.
    • Pearl, J., & Mackenzie, D. (2018). The book of why. New York, NY: Basic Books.
    • Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688-701.
    Livia Eichenberger

    Cross-validation is a widely used technique to assess the generalization performance of a machine learning model. Here at STATWORX, we often discuss performance metrics and how to incorporate them efficiently in our data science workflow. In this blog post, I will introduce the basics of cross-validation, provide guidelines to tweak its parameters, and illustrate how to build it from scratch in an efficient way.

    Model evaluation and cross-validation basics

    Cross-validation is a model evaluation technique. The central intuition behind model evaluation is to figure out whether the trained model is generalizable, that is, whether the predictive power we observe during training is also to be expected on unseen data. We could, of course, feed the model directly with the data it was meant to predict. But then there would be no way for us to know, or validate, whether the predictions are accurate.

    Naturally, we would want some kind of benchmark of our model’s generalization performance before launching it into production. Therefore, the idea is to split the existing training data into an actual training set and a hold-out test partition which is not used for training and serves as the “unseen” data. Since this test partition is, in fact, part of the original training data, we have a full range of “correct” outcomes to validate against. We can then use an appropriate error metric, such as the Root Mean Squared Error (RMSE) or the Mean Absolute Percentage Error (MAPE) to evaluate model performance. However, the applicable evaluation metric has to be chosen with caution as there are pitfalls (as described in this blog post by my colleague Jan).

    Many machine learning algorithms allow the user to specify hyperparameters, such as the number of neighbors in k-Nearest Neighbors or the number of trees in a Random Forest. Cross-validation can also be leveraged for “tuning” the hyperparameters of a model by comparing the generalization error of different model specifications.

    Common approaches to model evaluation

    There are dozens of model evaluation techniques that are always trading off between variance, bias, and computation time. It is essential to know these trade-offs when evaluating a model, since choosing the appropriate technique highly depends on the problem and the data we observe. I will cover this topic once I have introduced two of the most common model evaluation techniques: the train-test-split and k-fold cross-validation. In the former, the training data is randomly split into a train and test partition (Figure 1), commonly with a significant part of the data being retained as the training set. Proportions of 70/30 or 80/20 are the most frequently used in the literature, though the exact ratio depends on the size of your data.

    The drawback of this approach is that this one-time random split can end up partitioning the data into two very imbalanced parts, thus yielding biased generalization error estimates. That is especially critical if you only have limited data, as some features or patterns could end up entirely in the test part. In such a case, the model has no chance to learn them, and you will potentially underestimate its performance.

    [Figure 1: Train-test split]

    A more robust alternative is the so-called k-fold cross-validation (Figure 2). Here, the data is shuffled and then randomly partitioned into k folds. The main advantage over the train-test-split approach is that each of the k partitions is iteratively used as a test (i.e., validation) set, with the remaining k-1 parts serving as the training sets in this iteration. This process is repeated k times, such that every observation is included in both training and test sets. The appropriate error metric is then simply calculated as a mean of all of the k folds, giving the cross-validation error.

    This is more of an extension of the train-test split than a completely new method: the train-test procedure is simply repeated k times. Note, however, that even if k is chosen as low as k=2, i.e., you end up with only two parts, this approach is still superior to the train-test split, because both parts are iteratively used for training, so the model has a chance to learn from all the data rather than just a random subset of it. Therefore, this approach usually results in more robust performance estimates.

    [Figure 2: k-fold cross-validation]

    Comparing the two figures above, you can see that a train-test split with a ratio of 80/20 is equivalent to one iteration of a 5-fold (that is, k=5) cross-validation where 4/5 of the data are retained for training, and 1/5 is held out for validation. The crucial difference is that in k-fold the validation set is shifted in each of the k iterations. Note that a k-fold cross-validation is more robust than merely repeating the train-test split k times: In k-fold CV, the partitioning is done once, and then you iterate through the folds, whereas in the repeated train-test split, you re-partition the data k times, potentially omitting some data from training.

    Repeated CV and LOOCV

    There are many flavors of k-fold cross-validation. For instance, you can do "repeated cross-validation" as well. In standard k-fold CV, the data is divided into k folds once, and this partitioning stays fixed for the whole procedure, so the error estimate still depends on that single random split. In repeated CV, you repeat the process of shuffling and randomly partitioning the data into k folds a certain number of times, which reduces the risk of one unlucky split distorting the result. You can then average over the resulting cross-validation errors of each run to get a global performance estimate.

    Another special case of k-fold cross-validation is “Leave One Out Cross-Validation” (LOOCV), where you set k = n. That is, in each iteration, you use a single observation from your data as the validation portion and the remaining n-1 observations as the training set. While this might sound like a hyper robust version of cross-validation, its usage is generally discouraged for two reasons:

    • First, it’s usually very computationally expensive. For most datasets used in applied machine learning, training your model n times (each time on n-1 observations) is neither desirable nor feasible (although it may be useful for very small datasets).
    • Second, even if you had the computational power (and time on your hands) to endure this process, critics of LOOCV also argue from a statistical point of view that the resulting cross-validation error can exhibit high variance. The reason is that your “validation set” consists of only one observation, and depending on the distribution of your data (and potential outliers), this estimate can vary substantially.

    In general, note that the performance of LOOCV is a somewhat controversial topic, both in the scientific literature and the broader machine learning community. Therefore, I encourage you to read up on this debate if you consider using LOOCV for estimating the generalization performance of your model (for example, check out this and related posts on StackExchange). As is often the case, the answer might end up being “it depends”. In any case, keep in mind the computational overhead of LOOCV, which is hard to deny (unless you have a tiny dataset).

    The value of k and the bias-variance trade-off

    If k=n is not (necessarily) the best choice, then how to find an appropriate value for k? It turns out that the answer to this question boils down to the notorious bias-variance trade-off. Why is that?

    The value for k governs how many folds your data is partitioned into and therefore the size of (i.e., number of observations contained in) each fold. We want to choose k in a way that a sufficiently large portion of our data remains in the training set – after all, we don’t want to give too many observations away that could be used to train our model. The higher the value of k, the more observations are included in our training set in each iteration.

    For instance, suppose we have 1,200 observations in our dataset, then with k=3 our training set would consist of \frac{k-1}{k} \times N = 800 observations, but with k=8 it would include 1,050 observations. Naturally, with more observations used for training, you approximate your model’s actual performance (as if it were trained on the whole dataset), hence reducing the bias of your error estimate compared to a smaller fraction of the data. But with increasing k, the size of your validation partition decreases, and your error estimate in each iteration is more sensitive to these few data points, potentially increasing its overall variance. Basically, it’s choosing between the “extremes” of the train-test-split on the one hand and LOOCV on the other. The figure below schematically (!) illustrates the bias-variance performance and computational overhead of different cross-validation methods.

    [Figure: Bias-variance trade-off and computational overhead of different cross-validation methods]

    As a rule of thumb, with higher values for k, bias decreases and variance increases. By convention, values like k=5 or k=10 have been deemed to be a good compromise and have thus become the quasi-standard in most applied machine learning settings.

    “These values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.” James et al. 2013: 184

    If you are not particularly concerned with the process of cross-validation itself but rather want to seamlessly integrate it into your data science workflow (which I highly recommend!), you should be fine choosing either of these values for k and leave it at that.

    Implementing cross-validation in caret

    Speaking of integrating cross-validation into your daily workflow—which possibilities are there? Luckily, cross-validation is a standard tool in popular machine learning libraries such as the caret package in R. Here you can specify the method with the trainControl function. Below is a script where we fit a random forest with 10-fold cross-validation to the iris dataset.

    library(caret)
    
    set.seed(12345)
    
    # hold out 30% of the data as a test partition
    inTrain <- createDataPartition(y = iris[["Species"]], p = .7, list = FALSE)
    
    iris.train <- iris[inTrain, ]
    iris.test <- iris[-inTrain, ]
    
    # 10-fold cross-validation
    fit.control <- caret::trainControl(method = "cv", number = 10)
    
    # fit a random forest on the training partition, evaluated via 10-fold CV
    rf.fit <- caret::train(Species ~ .,
                           data = iris.train,
                           method = "rf",
                           trControl = fit.control)

    We define our desired cross-validation method in the trainControl function, store the output in the object fit.control, and then pass this object to the trControl argument of the train function. You can specify the other methods introduced in this post in a similar fashion:

    # Leave-One-Out Cross-validation (the number argument is not needed here):
    fit.control <- caret::trainControl(method = "LOOCV")
    
    # Repeated CV (remember to specify the number of repeats!)
    fit.control <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 5)
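
    Once you have a trained model such as the rf.fit object from above, you may also want a final check on the untouched test partition. A minimal sketch could look as follows, using caret's predict method and confusionMatrix (object names as in the example above):

    # predict the held-out test partition with the cross-validated model
    preds <- predict(rf.fit, newdata = iris.test)
    
    # compare predictions against the true species
    caret::confusionMatrix(data = preds, reference = iris.test[["Species"]])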

    The old-fashioned way: Implementing k-fold cross-validation by hand

    However, data science projects can quickly become so complex that the ready-made functions in machine learning packages are not suitable anymore. In such cases, you will have to implement the algorithm—including cross-validation techniques—by hand, tailored to the specific project needs. Let me walk you through a make-shift script for implementing simple k-fold cross-validation in R by hand (we will tackle the script step by step here; you can find the whole code on our GitHub).

    Simulating data, defining the error metric, and setting k

    # devtools::install_github("andrebleier/Xy")
    library(tidyverse)
    library(Xy)
    
    sim <- Xy(n = 1000,
              numvars = c(2,2),
              catvars = 0,
              cor = c(-0.5, 0.9),
              noisevars = 0)
    
    sim_data <- sim[["data"]]
    
    RMSE <- function(f, o){
      sqrt(mean((f - o)^2))
    }
    
    k <- 5

    We start by loading the required packages and simulating a dataset with 1,000 observations using the Xy() function developed by my colleague André (check out his blog post on simulating regression data with Xy). Because we need some kind of error metric to evaluate model performance, we define our RMSE function, which is pretty straightforward: the RMSE is the root of the mean of the squared errors, where the error is the difference between our fitted (f) and observed (o) values; you can pretty much read the function from left to right. Lastly, we specify our k, which is set to 5 in the example and is stored as a simple integer.

    Partitioning the data

    set.seed(12345)
    sim_data <- mutate(sim_data,
                       my.folds = sample(1:k,
                                         size = nrow(sim_data),
                                         replace = TRUE))

    Next up, we partition our data into k folds. For this purpose, we add a new column, my.folds, to the data: We sample (with replacement) from 1 to the value of k, so 1 to 5 in our case, and randomly add one of these five numbers to each row (observation) in the data. With 1,000 observations, each number should be assigned about 200 times.
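
    If you want to verify that this random assignment is roughly balanced, a quick look at the fold counts (plain base R, shown just for illustration) should confirm about 200 observations per fold:

    # count how many observations ended up in each fold
    table(sim_data[["my.folds"]])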

    Training and validating the model

    cv.fun <- function(this.fold, data){
    
      train <- filter(data, my.folds != this.fold)
      validate <- filter(data, my.folds == this.fold)
    
      model <- lm(y ~ NLIN_1 + NLIN_2 + LIN_1 + LIN_2,
                  data = train)
    
      pred <- predict(model, newdata = validate) %>% as.vector()
    
      this.rmse <- RMSE(f = pred, o = validate[["y"]])
    
      return(this.rmse)
    }

    Next, we define cv.fun, which is the heart of our cross-validation procedure. This function takes two arguments: this.fold and data. I will come back to the meaning of this.fold in a minute; for now, let’s just set it to 1. Inside the function, we divide the data into a training and validation partition by subsetting according to the values of my.folds and this.fold: Every observation with a randomly assigned my.folds value other than 1 (so approximately 4/5 of the data) goes into training. Every observation with a my.folds value equal to 1 (the remaining 1/5) forms the validation set. For illustration purposes, we then fit a simple linear model with the simulated outcome and four predictors. Note that we only fit this model on the training data! We then use this model to predict() our validation data, and since we have true observed outcomes for this subset of the original overall training data (this is the whole point!), we can compute our RMSE and return it.

    Iterating through the folds and computing the CV error

    cv.error <- sapply(seq_len(k),
                       FUN = cv.fun,
                       data = sim_data) %>%
      mean()
    
    cv.error

    Lastly, we wrap the function call to cv.fun into a sapply() loop—this is where all the magic happens: Here we iterate over the range of k, so seq_len(k) leaves us with the vector [1] 1 2 3 4 5 in this case. We apply each element of this vector to cv.fun. In apply() statements, the iteration vector is always passed as the first argument of the function which is called, so in our case, each element of this vector at a time is passed to this.fold. We also pass our simulated sim_data as the data argument.

    Let us quickly recap what this means: In the first iteration, this.fold equals 1. This means that our train set consists of all the observations where my.folds is not 1, and observations with a value of 1 form the validation set (just as in the example above). In the next iteration of the loop, this.fold equals 2. Consequently, observations with 1, 3, 4, and 5 form the training set, and observations with a value of 2 go to validation, and so on. Iterating over all values of k, this schematically provides us with the diagonal pattern seen in Figure 2 above, where each data partition at a time is used as a validation set.

    To wrap it all up, we calculate the mean: This is the mean of our k individual RMSE values and leaves us with our cross-validation error. And there you go: We just defined our custom cross-validation function!

    This is merely a template: You can insert any model and any error metric. If you’ve been following along so far, feel free to try implementing repeated CV yourself or play around with different values for k.

    Conclusion

    As you can see, implementing cross-validation yourself isn’t all that hard. It gives you great flexibility to account for project-specific needs, such as custom error metrics. If you don’t need that much flexibility, enabling cross-validation in popular machine learning packages is a breeze.

    I hope that I could provide you with a sufficient overview of cross-validation and how to implement it both in pre-defined functions as well as by hand. If you have questions, comments, or ideas, feel free to drop me an e-mail.

    References

    James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. New York: Springer.

    Lukas Feick

    A major problem arises when comparing forecasting methods and models across different time series. This is a challenge we regularly face at STATWORX. Unit-dependent measures like the MAE (Mean Absolute Error) and the RMSE (Root Mean Squared Error) turn out to be unsuitable and hardly helpful if the time series to be compared are measured in different units. If that is not the case, however, both measures provide valuable information. The MAE is perfectly interpretable, as it embodies the average absolute deviation from the actual values. The RMSE, on the other hand, is not as easy to interpret and more vulnerable to extreme values, but it is still often used in practice.

    MAE = \frac{1}{n}\sum_{i=1}^{n} \left| {\rm Actual}_i - {\rm Forecast}_i \right|

    RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left({\rm Actual}_i - {\rm Forecast}_i\right)^2}

    One of the most commonly used measures that avoid this problem is the MAPE (Mean Absolute Percentage Error). It solves the problem of the approaches mentioned above, as it does not depend on the unit of the time series. Furthermore, decision-makers without a statistical background can easily interpret and understand this measure. Despite its popularity, the MAPE has been and still is criticized.

    MAPE = \frac{1}{n}\sum_{i=1}^{n} \left|\frac{{\rm Actual}_i - {\rm Forecast}_i}{{\rm Actual}_i}\right| \times 100

    In this article, I evaluate these critical arguments and prove that at least some of them are highly questionable. The second part of my article concentrates on true weaknesses of the MAPE, some of them well-known but others hiding in the shadows. In the third section, I discuss various alternatives and summarize under which circumstances the use of the MAPE seems to be appropriate (and when it’s not).

    What the MAPE is FALSELY blamed for!

    It Puts Heavier Penalties on Negative Errors Than on Positive Errors

    Most sources dealing with the MAPE point out this “major” issue of the measure. The statement is primarily based on two different arguments. First, they claim that interchanging the actual value with the forecasted value proves their point (Makridakis 1993).

    Case 1: {\rm Actual}_1 = 150 and {\rm Forecast}_1 = 100 (positive error)

    {\rm APE}_1 = \left|\frac{{\rm Actual}_1 - {\rm Forecast}_1}{{\rm Actual}_1}\right| \times 100 = \left|\frac{150 - 100}{150}\right| \times 100 = 33.33\ {\rm Percent}

    Case 2: {\rm Actual}_2 = 100 and {\rm Forecast}_2 = 150 (negative error)

    {\rm APE}_2 = \left|\frac{{\rm Actual}_2 - {\rm Forecast}_2}{{\rm Actual}_2}\right| \times 100 = \left|\frac{100 - 150}{100}\right| \times 100 = 50\ {\rm Percent}

    It is true that Case 1 (positive error of 50) is related to a lower APE (Absolute Percentage Error) than Case 2 (negative error of 50). However, the reason here is not that the error is positive or negative but simply that the actual value changes. If the actual value stays constant, the APE is equal for both types of errors (Goodwin & Lawton 1999). That is clarified by the following example.

    Case 3: {\rm Actual}_3 = 100 and {\rm Forecast}_3 = 50

    {\rm APE}_3 = \left|\frac{{\rm Actual}_3 - {\rm Forecast}_3}{{\rm Actual}_3}\right| \times 100 = \left|\frac{100 - 50}{100}\right| \times 100 = 50\ {\rm Percent}

    Case 4: {\rm Actual}_4 = 100 and {\rm Forecast}_4 = 150

    {\rm APE}_4 = \left|\frac{{\rm Actual}_4 - {\rm Forecast}_4}{{\rm Actual}_4}\right| \times 100 = \left|\frac{100 - 150}{100}\right| \times 100 = 50\ {\rm Percent}
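
    A quick check of these four cases in R (using a small helper function written just for this illustration) confirms that the APE changes only when the actual value changes, not with the sign of the error:

    # absolute percentage error, written out for illustration
    ape <- function(actual, forecast) abs((actual - forecast) / actual) * 100

    ape(150, 100)  # Case 1: 33.33, different actual value
    ape(100, 150)  # Case 2: 50
    ape(100,  50)  # Case 3: 50, same actual value, positive error
    ape(100, 150)  # Case 4: 50, same actual value, negative error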

    The second, equally invalid argument supporting the alleged asymmetry of the MAPE arises from an assumption about the predicted data. As the MAPE is mainly suited to evaluating predictions on a ratio scale, the actual and forecasted values are non-negative, so the absolute percentage error of a forecast below the actual value is bounded by 100% (Armstrong & Collopy 1992). However, this does not imply that the MAPE over- or underweights certain types of errors, but rather that errors beyond this bound are simply not possible.

    Its TRUE weaknesses!

    It Fails if Some of the Actual Values Are Equal to Zero

    This statement describes a well-known problem of the measure. This weakness, together with the alleged asymmetry discussed above, was the reason for the development of a modified form of the MAPE, the SMAPE (“Symmetric” Mean Absolute Percentage Error). Ironically, in contrast to the original MAPE, this modified form suffers from true asymmetry (Goodwin & Lawton 1999). I will clarify this argument in the last section of the article.

    Particularly Small Actual Values Bias the MAPE

    If any true values are very close to zero, the corresponding absolute percentage errors will be extremely high and therefore bias the informative value of the MAPE (Hyndman & Koehler 2006). The following graph clarifies this point: although all three forecasts have the same absolute errors, the MAPE of the time series with only one extremely small value is approximately twice as high as the MAPE of the other forecasts. This implies that the MAPE should be used carefully if there are extremely small observations, and it directly motivates the last and often ignored weakness of the MAPE.

    The MAPE Implies Only Which Forecast Is Proportionally Better

    As mentioned at the beginning of this article, one advantage of using the MAPE to compare forecasts across different time series is its unit independence. However, it is essential to keep in mind that the MAPE only implies which forecast is proportionally better. The following graph shows three different time series and their corresponding forecasts. The only difference between them is their general level. The same absolute errors therefore lead to profoundly different MAPEs. This article critically questions whether it is reasonable to use such a percentage-based measure to compare forecasts for different time series. If the time series do not move on roughly comparable levels (as in the following graphic), using the MAPE to infer that a forecast is generally better for one time series than for another relies on the assumption that the same absolute errors are less problematic for time series on higher levels than for time series on lower levels:

    “If a time series fluctuates around 100, then predicting 101 is way better than predicting 2 for a time series fluctuating around 1.”

    That might be true in some cases. In general, however, this is a questionable assumption, or at least one people should always be aware of when using the MAPE to compare forecasts between different time series.

    Summary

    In summary, the discussed findings show that the MAPE should be used with caution as an instrument for comparing forecasts across different time series. A necessary condition is that the time series contain only strictly positive values. Second, even a few extremely small actual values have the potential to bias the MAPE heavily. Last, the MAPE depends systematically on the level of the time series, as it is a percentage-based error. It is therefore questionable whether being a proportionally better forecast can be generalized to being a generally better forecast.

    BETTER alternatives!

    The discussion above implies that the MAPE alone is often not very useful when the objective is to compare accuracy between forecasts for different time series. Although relying on a single easily understandable measure appears comfortable, it comes with a high risk of drawing misleading conclusions. In general, it is recommended to use different measures in combination. In addition to numerical measures, a visualization of the time series, including the actual and the forecasted values, always provides valuable information. However, if one single numeric measure is the only option, there are some excellent alternatives.

    Scaled Measures

    Scaled measures relate an accuracy measure of the focal forecast, for example its MAE, to the MAE of a benchmark method. Similar measures can be defined using the RMSE, the MAPE, or other measures. Common benchmark methods are the “random walk”, the “naïve” method, and the “mean” method. These measures are easy to interpret, as they show how the focal model compares to the benchmark methods. However, it is important to keep in mind that relative measures depend on the selection of the benchmark method and on how well the time series can be forecast by the selected method.

    {\rm Relative\ MAE} = \frac{{\rm MAE}_{\rm focal\ model}}{{\rm MAE}_{\rm benchmark\ model}}

    Scaled Errors

    Scaled error approaches also remove the scale of the data by comparing the forecast errors to those obtained by some benchmark forecast method, like the naïve method. The MASE (Mean Absolute Scaled Error), proposed by Hyndman & Koehler (2006), is defined slightly differently depending on the seasonality of the time series. In the simple case of a non-seasonal time series, the error of the focal forecast is scaled by the in-sample MAE of the naïve forecast method. One major advantage is that it can handle actual values of zero and that it is not biased by very extreme values. Once again, it is important to keep in mind that relative measures depend on the selection of the benchmark method and on how well the time series can be forecast by the selected method.

    Non-Seasonal

    MASE = \frac{1}{n}\sum_{i=1}^{n} \left|\frac{{\rm Actual}_i - {\rm Forecast}_i}{\frac{1}{T-1}\sum_{t=2}^{T} |{\rm Actual}_t - {\rm Actual}_{t-1}|}\right|

    Seasonal

    MASE = \frac{1}{n}\sum_{i=1}^{n} \left|\frac{{\rm Actual}_i - {\rm Forecast}_i}{\frac{1}{T-m}\sum_{t=m+1}^{T} |{\rm Actual}_t - {\rm Actual}_{t-m}|}\right|
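
    As a rough illustration of the non-seasonal case, a MASE could be computed in R along the following lines; the helper function and the toy numbers are made up, and the errors are scaled by the in-sample MAE of the naïve one-step forecast, as in the formula above:

    # MASE for a non-seasonal series: scale errors by the in-sample MAE of the naive forecast
    mase <- function(actual, forecast, train) {
      naive_scale <- mean(abs(diff(train)))        # mean absolute one-step change in the training data
      mean(abs(actual - forecast) / naive_scale)
    }

    # toy usage with invented numbers
    train_y  <- c(10, 12, 11, 13, 14, 13)
    actual_y <- c(15, 14)
    fc_y     <- c(14, 15)
    mase(actual_y, fc_y, train_y)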

    SDMAE

    In my understanding, the basic idea of using the MAPE to compare forecasts across different time series is that the same absolute error is assumed to be less problematic for time series on higher levels than for time series on lower levels. Based on the examples shown earlier, I think this idea is at least questionable.

    I argue that how good or bad a specific absolute error is judged should not depend on the general level of the time series but on its variation. Accordingly, the following measure, the SDMAE (Standard Deviation adjusted Mean Absolute Error), is a product of the discussed issues and my imagination. It can be used for evaluating forecasts for time series containing negative values and suffers neither from actual values equal to zero nor from particularly small ones. Note that this measure is not defined for time series that do not fluctuate at all. Furthermore, there might be other limitations of this measure that I am currently not aware of.

    SDMAE = \frac{{\rm MAE}_{\rm focal\ model}}{{\rm SD}_{\rm actual\ values}}
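
    In R, this measure is essentially a one-liner; the following sketch simply mirrors the formula above (the function name is chosen for illustration):

    # SDMAE: scale the MAE by the standard deviation of the actual values
    sdmae <- function(actual, forecast) {
      mean(abs(actual - forecast)) / sd(actual)
    }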

    Summary

    I suggest using a combination of different measures to get a comprehensive understanding of the performance of the different forecasts. I also suggest complementing the MAPE with a visualization of the time series, including the actual and the forecasted values, the MAE, and a Scaled Measure or Scaled Error approach. The SDMAE should be seen as an alternative approach that has not yet been discussed by a broader audience. I am thankful for your critical thoughts and comments on this idea.

    Worse alternatives!

    SMAPE

    The SMAPE was created to solve and respond to the problems of the MAPE. However, it solves neither the problem of extremely small actual values nor the level dependency of the MAPE. The reason is that extremely small actual values are typically related to extremely small predictions (Hyndman & Koehler 2006). Additionally, and in contrast to the unmodified MAPE, the SMAPE introduces the problem of asymmetry (Goodwin & Lawton 1999). This is clarified in the following graphic, where “APE” relates to the MAPE and “SAPE” relates to the SMAPE. It shows that the SAPE is higher for positive errors than for negative errors and is therefore asymmetric. Several researchers consequently advise against using the SMAPE (Hyndman & Koehler 2006).

    SMAPE = \frac{1}{n}\sum_{i=1}^{n} \left|\frac{{\rm Actual}_i - {\rm Forecast}_i}{({\rm Actual}_i + {\rm Forecast}_i)/2}\right| \times 100

    [Figure: On the asymmetry of the symmetric MAPE (Goodwin & Lawton 1999)]

    References

    • Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. International Journal of Forecasting, 15(4), 405-408.
    • Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688.
    • Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 9(4), 527-529.
    • Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing about forecasting methods: Empirical comparisons. International Journal of Forecasting, 8(1), 69-80.

     

    Jan Fischer

    On my daily commute to STATWORX, I pass the gas station down the street twice a day. Even though I’m riding a bike, I still look at the gas prices. Lately, after a few weeks of observation, I thought I had found some patterns. My statistical senses were tingling!

    Of course, I am not the first one to notice patterns. Nearly everyone has their own theory about when to buy cheap gas! Some of the more famous examples are:

    1. Everything was better in the old days
    2. It’s cheaper in the evening
    3. Gas stations at motorways are more expensive
    4. No-name stations have lower prices than top brands
    5. The difference between diesel and gas (E10) is shrinking
    6. It’s cheapest on Mondays

    But, are these facts or just rumors? To check all those myths, I gathered some data, explored relations, and finally plotted my results!

    FYI: The code I used will be presented in a future blog when I also will take a deeper look at my coding and optimization progress – stay tuned!

    The king of gas price data

    Since December 2013, all gas stations in Germany must report their prices to the Markttransparenzstelle für Kraftstoffe, which is part of the Bundeskartellamt. While searching for this data, I stumbled upon the site tankerkoenig.de, which has gathered the price data for nearly all gas stations since 2014. At first, they only provided a database dump with all historical data, but since October 2018, they also offer a git repository with daily CSV files.

    The data contain all price changes of around 15,000 gas stations and look as follows:

                      date station_uuid diesel    e5   e10
    1: 2019-01-01 00:01:06     2ae0e4dc  1.319 1.399 1.379
    2: 2019-01-01 00:01:06     08a67e2e  1.339 1.409 1.389
    3: 2019-01-01 00:01:06     939f06e5  1.299 1.379 1.359
    4: 2019-01-01 00:01:06     d299da0e  1.279 1.389 1.369
    5: 2019-01-01 00:01:06     a1e15688  1.228 1.328 1.308

    In addition to the price information, tankerkoenig.de also provides further details like the brand, the geo-coordinates, and the city of each gas station. Since I was also interested in whether a station is near a motorway, I included this information from the ADAC (the German automobile club). This list did not seem to be complete, but it was good enough for my purpose. With the result, I was able to plot the stations, and the motorways emerged!

    If you want to know more about how to make plots with geo-coordinates and ggplot2, check out the blog post of my colleague Lea!

    Only believe the facts!

    To confirm or debunk the myths I chose, I went ahead and filtered, aggregated, and merged the raw data. Be prepared for some nice plots and insights!

    1. Everything was better in the old days

    Nothing shows a development better than a good old timeline! As you can see in the following graphic, I plotted the mean price per day. To get a little more insight into the price ranges, I included the 10% and 90% quantiles of prices for each day.
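
    To give a rough idea of what such an aggregation can look like (the full code follows in a later post, as mentioned above), here is a minimal data.table sketch that assumes a prices table with the columns shown in the snippet above:

    library(data.table)

    # prices: one row per price change, with the columns from the snippet above
    prices[, day := as.Date(date)]
    daily <- prices[, .(mean_e10 = mean(e10),
                        q10_e10  = quantile(e10, 0.10),
                        q90_e10  = quantile(e10, 0.90)),
                    by = day]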

    Of course, this was just a glimpse, and there were times (many years back) when the price was below 1 DM (~0.50 €). But for the last five years, the price has stayed within roughly the same range. There was a low point early in 2016, but the latest peak was still a bit lower than in mid-2014.

    Conclusion: For the “short-term” old days the myth does not hold up – for the “long-run” … probably yes!

    2. It’s cheaper in the evening

    To check this myth, I aggregated the data over days and years. I noticed different patterns across the years. From 2014 until 2017, there seemed to be a pattern where the price was high in the morning, slowly decreased over the day with a small peak at noon, and rose again in the evening. Since 2018, a new pattern with three peaks in a wave shape has emerged: the highest one in the morning around 7:00, one at noon, and the last one around 17:00.

    Since the price level varied a bit over the years, I also looked at the scaled prices. Here, the difference between the patterns was even more visible.

    Conclusion: From 2015 to 2017, the myth seemed to be right. Lately, however, there have been multiple “good times” during the day, mostly right before the peaks in the afternoon or the evening. As the price did not rise that much after its evening peak, the myth is still correct. The lowest point at the moment seems to be around 21:00.

    Update

    While writing this post, I found a recent study from Goethe University Frankfurt that shows there is a new pattern with even more peaks!

    3. Gas stations at motorways are more expensive

    In the plot above, I already showed where the motorway stations are – the blue dots. In the next plot, the blue lines represent the prices at those motorway stations. One can clearly spot the difference, even though there are some stations within the 10% and 90% quantile range where the prices overlap.

    Conclusion: Obviously right!

    4. No-name stations have lower prices than top brands

    To get the top brands, I took the nine brands with the most stations in Germany and labelled the rest “other”. After that, I calculated the difference between the overall mean price and the mean price per brand. Lines above zero therefore indicate that a brand is more expensive than the mean, and vice versa. Of the top brands, ARAL and SHELL were always above the mean, some fluctuated around the mean, and some were always below it. The no-name stations were cheaper than the mean over the whole timespan.

    Conclusion: The result depends on how a no-name station is defined. Since the top five brands are mostly more expensive than the mean, I think the myth is somewhat right. At least, it is safe to say that there are differences between the brands.

    5. The difference between diesel and gas (E10) is shrinking

    To compare the differences, I took the E10 price as the baseline and plotted the other two types against it. Between E5 and E10, there was nearly no change over the last five years. For diesel, however, it was a different story. Since 2014, the difference has taken a wild ride, from 10% cheaper than E10, to 20% cheaper at the beginning of 2016, back to only 5% now.

    Conclusion: Yes, the difference is shrinking at the moment compared to the last five years.

    6. It’s cheapest on Mondays

    To check the last myth, I calculated the mean price per hour, type and weekday. The weekday with the lowest price at a given time can be seen in the following plot. All gas types revealed a similar pattern. Between the years, there was a subtle change mostly between Wednesday and Thursday.

    Conclusion: Since Monday barely ever was the cheapest day – this myth got debunked!

    Conclusion

    After all of that data hacking and pattern finding, these are my results:

    1. Everything was better in the old days – FALSEish
    2. It’s cheaper in the evening – TRUE
    3. Gas stations at motorways are more expensive – TRUE
    4. No-name stations have lower prices than top brands – TRUEish
    5. The difference between diesel and gas (E10) is shrinking – TRUE
    6. It’s cheapest on Mondays – FALSE

    Of course, this analysis is far from complete and has some shortcuts in it as well. Still, I had a lot of fun analysing this big dataset, regardless of all the internal fights with my RAM. I will present some of these fights and their outcomes in my next blog post, where you will also get more information about my code optimisation process.

    In another upcoming blog post, my colleague Matthias will use this data to predict the future price. So, stay tuned!

    References

    • The data is used under the Creative Commons license (CC BY 4.0). For more information on the data, check out the tankerkoenig website.

    Jakob Gepp

    Introduction

    At STATWORX, we are not only helping customers find, develop, and implement a suitable data strategy but we also spend some time doing research to improve our own tool stack. This way, we can give back to the open-source community.

    One special focus of my research lies in tree-based ensemble methods. These algorithms are bread and butter tools in predictive learning and you can find them as a standalone model or as an ingredient to an ensemble in almost every supervised learning Kaggle challenge. Renowned models are Random Forest by Leo Breiman, Extremely Randomized Trees (Extra-Trees) by Pierre Geurts, Damien Ernst & Louis Wehenkel, and Multiple Additive Regression Trees (MART; also known as Gradient Boosted Trees) by Jerome Friedman.

    One thing I was particularly interested in was how much randomization techniques have helped improve prediction performance in all of the algorithms named above. In Random Forest and Extra-Trees, it is quite obvious: randomization is the reason why these ensembles offer an improvement over bagging; through the de-correlation of the base learners, the variance of the ensemble and therefore its prediction error decreases. In the end, you achieve de-correlation by “shaking up” the base trees, as it’s done in the two ensembles. However, MART also profits from randomization. In 2002, Friedman published another paper on boosting, showing that you can improve the prediction performance of boosted trees by training each tree on only a random subsample of your data. As a side effect, your training time also decreases. Furthermore, in 2015, Rashmi and Gilad-Bachrach suggested adding dropout, a method developed for and used in neural nets, to the boosting ensemble.

    The Idea behind Random Boost

    Inspired by theoretical readings on randomization techniques in boosting, I developed a new algorithm that I call Random Boost (RB). In essence, Random Boost sequentially grows regression trees with random depth. More precisely, the algorithm is almost identical to and has the exact same input arguments as MART. The only difference is the parameter d_{max}. In MART, d_{max} determines the maximum depth of all trees in the ensemble. In Random Boost, the argument constitutes the upper bound of possible tree sizes. In each boosting iteration i, a random number d_i between 1 and d_{max} is drawn, which then defines the maximum depth of that tree T_i(d_i).

    In comparison to MART, this has two advantages:

    First, RB is faster than MART on average when equipped with the same value for the tree size parameter. When RB and MART are trained with a maximum tree depth of d_{max}, Random Boost will in many cases naturally grow trees of size d < d_{max}. If you assume that for MART all trees are grown to their full size d_{max} (i.e., there is enough data left in each internal node so that tree growing doesn’t stop before the maximum size is reached), you can derive a formula for the relative computation gain of RB over MART:

        \[t_{rel}(d_{max}) = \frac{t_{\mathrm{RB}}(d_{max})}{t_{\mathrm{MART}}(d_{max})} \approx \frac{2}{d_{max}}\left(1 - \left(\frac{1}{2}\right)^{d_{max}} \right).\]

    Here, t_{\mathrm{RB}}(d_{max}) denotes the training time of an RB boosting ensemble with the tree size parameter set equal to d_{max}, and analogously for MART.

    To make it a bit more practical, the formula predicts that for d_{max}=2, 3, and 4, RB takes 75%, 58%, and 47% of the computation time of MART, respectively. These predictions, however, should be seen as RB’s best-case scenario, as MART is also not necessarily growing full trees. Still, the calculations suggest that efficiency gains can be expected (more on that later).

    Second, there are also reasons to assume that randomizing over tree depths can have a beneficial effect on prediction performance. As already mentioned, from a variance perspective, boosting suffers from overcapacity for various reasons. One of them is choosing too rich a base learner in terms of depth. If, for example, one assumes that the dominant interaction in the data generating process is of order three, one would pick a tree of equivalent depth in MART in order to capture this interaction depth. However, this may be overkill, as fully grown trees with a depth of 3 have eight leaves and therefore learn noise in the data if there are only a few such high-order interactions. Perhaps, in this case, a tree with depth 3 but fewer than eight leaves would be optimal. This is not accounted for in MART, unless one adds a pruning step to each boosting iteration at the expense of computational overhead. Random Boost may offer a more efficient remedy to this issue. With probability 1/d_{max}, a tree is grown that is able to capture the high-order effect at the cost of also learning noise. In all other cases, however, Random Boost constructs smaller trees that do not show this overcapacity behavior and that can focus on interactions of smaller order. If overcapacity is an issue in MART because the data is governed by a small number of high-order interactions, Random Boost may perform better than MART. Furthermore, Random Boost also decorrelates trees through the extra source of randomness, which has a variance-reducing effect on the ensemble.

    The concept of Random Boost constitutes a slight change to MART. I used the sklearn package as a basis for my code. As a result, the algorithm is developed based on sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier and is used in exactly the same way (i.e. argument names match exactly and CV can be carried out with sklearn.model_selection.GridSearchCV). The only difference is that the RandomBoosting*-object uses max_depth to randomly draw tree depths for each iteration. As an example, you can use it like this:

    # same interface as sklearn's GradientBoostingRegressor;
    # max_depth is the upper bound from which each tree's depth is drawn at random
    rb = RandomBoostingRegressor(learning_rate=0.1, max_depth=4, n_estimators=100)
    rb = rb.fit(X_train, y_train)
    rb.predict(X_test)

    For the full code, check out my GitHub account.

    Random Boost versus MART – A Simulation Study

    In order to compare the two algorithms, I ran a simulation on 25 datasets generated by the Random Target Function Generator that Jerome Friedman introduced in his famous boosting paper from 2001 (you can find the details in his paper; Python code can be found here). Each dataset (containing 20,000 observations) was randomly split into a 25% test set and a 75% training set. RB and MART were tuned via 5-fold CV on the same tuning grid:

    • learning_rate = 0.1
    • max_depth = (2, 3, ..., 8)
    • n_estimators = (100, 105, 110, ..., 195)

    For each dataset, I tuned both models to obtain the best parameter constellation. Then, I trained each model on every point of the tuning grid again and saved the test MAE as well as the total training time in seconds. Why did I train every model again instead of simply storing the prediction accuracy of the tuned models along with the overall tuning time? Well, I wanted to be able to see how training time varies with the tree size parameter.

    A Comparison of Prediction Accuracies

    You can see the distribution of MAEs of the best models on all 25 datasets below.

    [Figure: Distribution of the absolute MAE of the best models across the 25 datasets]

    Evidently, both algorithms perform similarly.

    For a better comparison, I compute the relative difference between the predictive performance of RB and MART for each dataset j, i.e. MAE_{rel,j} = \frac{MAE_{RB,j} - MAE_{MART,j}}{MAE_{MART,j}}. If MAE_{rel,j} > 0, then RB had a larger mean absolute error than MART on dataset j, and vice versa.

    [Figure: Relative MAE by dataset (boxplot)]

    In the majority of cases, RB did worse than MART in terms of prediction accuracy (MAE_{rel} > 0). In the worst case, RB had a 1% higher MAE than MART; in the median, RB had a 0.19% higher MAE. I’ll leave it up to you to decide whether that difference is practically significant.

    A Comparison of Training Times

    When we look at training time, we get a quite clear picture. In absolute terms, it took 433 seconds to train all parameter combinations of RB on average, as opposed to 803 seconds for MART.

    [Figure: Average total training time of RB and MART]

    The small black lines on top of each bar are the error bars (two times the standard deviation of the mean; rather small in this case).

    To give you a better feeling of how each model performed on each dataset, I also plotted the training times for each round.

    [Figure: Total training time per dataset]

    If you now compute the training time ratio between MART and RB (\frac{t_{MART}}{t_{RB}}), you see that RB is roughly 1.8 times faster than MART on average.

    Another perspective is to compute the relative training time t_{rel,j} = \frac{t_{RB,j}}{t_{MART,j}}, which is just 1 over the speedup. Note that this measure has to be interpreted a bit differently from the relative MAE measure above: if t_{rel,j} = 1, RB is as fast as MART; if t_{rel,j} > 1, it takes longer to train RB than MART; and if t_{rel,j} < 1, RB is faster than MART.

    In the median, RB needs only roughly 54% of MART’s tuning time, and it is noticeably faster in all cases. I was also wondering how the relative training time varies with d_{max} and how well the theoretically derived lower bound from above fits the actually measured relative training time. That’s why I computed the relative training time across all 25 datasets by tree size.

    Tree size (max_depth)    Actual training time (RB / MART)    Theoretical lower bound
    2                        0.751                               0.750
    3                        0.652                               0.583
    4                        0.596                               0.469
    5                        0.566                               0.388
    6                        0.532                               0.328
    7                        0.505                               0.283
    8                        0.479                               0.249

    The theoretical figures are optimistic, but the relative performance gain of RB increases with tree size.

    Results in a Nutshell and Next Steps

    As part of my research on tree-based ensemble methods, I developed a new algorithm called Random Boost. Random Boost is based on Jerome Friedman’s MART, with the slight difference that it fits trees of random size. In total, this little change can reduce the problem of overfitting and noticeably speed up computation. Using a Random Target Function Generator suggested by Friedman, I found that, on average, RB is roughly twice as fast as MART with a comparable prediction accuracy in expectation.

    Since running the whole simulation takes quite some time (finding the optimal parameters and retraining every model takes roughly one hour for each data set on my Mac), I couldn’t run hundreds or more simulations for this blog post. That’s the objective for future research on Random Boost. Furthermore, I want to benchmark the algorithm on real-world datasets.

    In the meantime, feel free to look at my code and run the simulations yourself. Everything is on GitHub. Moreover, if you find something interesting and you want to share it with me, please feel free to shoot me an email.

    References

    • Breiman, L. (2001). “Random Forests”. Machine Learning 45 (1): 5–32.
    • Chen, T., and C. Guestrin (2016). “XGBoost: A Scalable Tree Boosting System”. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
    • Chapelle, O., and Y. Chang (2011). “Yahoo! Learning to Rank Challenge Overview”. In Proceedings of the Learning to Rank Challenge, 1–24.
    • Friedman, J. H. (2001). “Greedy Function Approximation: A Gradient Boosting Machine”. Annals of Statistics 29 (5): 1189–1232.
    • Friedman, J. H. (2002). “Stochastic Gradient Boosting”. Computational Statistics & Data Analysis 38 (4): 367–378.
    • Geurts, P., D. Ernst, and L. Wehenkel (2006). “Extremely Randomized Trees”. Machine Learning 63 (1): 3–42.
    • Rashmi, K. V., and R. Gilad-Bachrach (2015). “DART: Dropouts Meet Multiple Additive Regression Trees”. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 38.
    Tobias Krabel

    In this blog post, we will explore the Bagging algorithm and a computationally more efficient variant thereof, Subagging. With minor modifications, these algorithms are also known as Random Forests and are widely applied here at STATWORX, in industry, and in academia.

    Almost all statistical prediction and learning problems encounter a bias-variance tradeoff. This is particularly pronounced for so-called unstable predictors. While yielding estimates with low bias due to their flexible adaptation to the data, these predictors react very sensitively to small changes in the underlying dataset and hence have high variance. A common example is the Regression Tree predictor.

    Bagging bypasses this tradeoff by reducing the variance of the unstable predictor while leaving its bias mostly unaffected.

    Method

    In particular, Bagging uses repeated bootstrap sampling to construct multiple versions of the same prediction model (e.g. Regression Trees) and averages over the resulting predictions.

    Let’s see how Bagging works in detail:

    1. Construct a bootstrap sample \( (y_{i}^{*}, \mathbf{x}_{i}^{*}), \; i = 1, \dots, n \) (drawn with replacement) from the original i.i.d. data at hand \( (y_{i}, \mathbf{x}_{i}), \; i = 1, \dots, n \).
    2. Fit a Regression Tree to the bootstrap sample – we will denote the tree predictor by \( \hat{\theta}_{n}^{*}(\mathbf{x}) \).
    3. Repeat Steps 1 and 2 B times and calculate \( \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}_{n, b}^{*}(\mathbf{x}) \).

    OK – so let us take a glimpse into the construction phase: we draw a total of B different bootstrap samples from the original data. A tree is then fitted to each of those samples, and the resulting tree predictors are averaged in Step 3, yielding the Bagged predictor.
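    To make the three steps concrete, here is a minimal from-scratch sketch. It assumes NumPy arrays X_train, y_train, and X_test and uses scikit-learn’s DecisionTreeRegressor as base learner; it is an illustration only, not the implementation used later in this post.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagged_predict(X_train, y_train, X_test, B = 50, random_state = 0):
        """Steps 1-3: draw B bootstrap samples, fit a tree to each, average the tree predictions."""
        rng = np.random.default_rng(random_state)
        n = X_train.shape[0]
        predictions = np.zeros((B, X_test.shape[0]))
        for b in range(B):
            # Step 1: bootstrap sample (n draws with replacement)
            idx = rng.integers(0, n, size = n)
            # Step 2: fit a Regression Tree to the bootstrap sample
            tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
            predictions[b] = tree.predict(X_test)
        # Step 3: average over the B tree predictors
        return predictions.mean(axis = 0)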

    The variance-reduction happens in Step 3. To see this, consider the following toy example.

    Let \( X_1, \dots, X_n \) be i.i.d. random variables with \( \mu = E[X_1] \) and \( \sigma^2 = Var[X_1] \), and let \( \bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_{i} \). Easy re-formulations show that

    • \( Var[\bar{X}] = \frac{\sigma^2}{n} \leq \sigma^2 \)
    • \( E[\bar{X}] = \mu \)
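    The variance claim follows directly from the independence of the \( X_{i} \):

        \[ Var[\bar{X}] = \frac{1}{n^{2}} \sum_{i=1}^{n} Var[X_{i}] = \frac{n \sigma^{2}}{n^{2}} = \frac{\sigma^{2}}{n} \]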

    We observe that the variance of the sample mean is indeed weakly smaller than the variance of the individual random variables, while the sample mean remains unbiased.

    Why Bagging works is widely discussed in the literature and remains an open research question. Bühlmann and Yu (2002) propose a subsampling variant of Bagging, called Subagging, which is more tractable from a theoretical point of view.

    In particular, Bühlmann and Yu (2002) replace the bootstrap procedure of Bagging by subsampling without replacement. Essentially, we only change Step 1 of our Bagging algorithm: instead of a bootstrap sample, we randomly draw m observations without replacement from the original data, with m < n, and hence obtain a subset of size m (a sketch of this one-line change follows below). With this variant at hand, it is possible to state upper bounds for the variance and mean squared error of the predictor, given an appropriate choice of the subsample size m.
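    Relative to the hypothetical bagged_predict sketch above, Subagging changes only the sampling step:

    import math
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def subagged_predict(X_train, y_train, X_test, a = 0.5, B = 50, random_state = 0):
        """Like bagged_predict, but Step 1 draws m = ceil(a * n) observations without replacement."""
        rng = np.random.default_rng(random_state)
        n = X_train.shape[0]
        m = math.ceil(a * n)
        predictions = np.zeros((B, X_test.shape[0]))
        for b in range(B):
            # Step 1 (modified): subsample of size m, drawn without replacement
            idx = rng.choice(n, size = m, replace = False)
            # Steps 2 and 3 are unchanged
            tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
            predictions[b] = tree.predict(X_test)
        return predictions.mean(axis = 0)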

    Simulation Set-Up

    As the theory is a little bit cumbersome and requires some knowledge of real analysis, we instead simulate the main findings of Bühlmann and Yu (2002).

    Let’s compare the mean squared prediction error (MSPE) of the Regression Tree, the Bagged predictor, and the Subagged predictor, and thereby illustrate the theory a little further.

    In order to do this, we consider the following model

        \[ y_{i} = f(\mathbf{x}_{i}) + \epsilon_{i} \]

    where \( f(\mathbf{x}_{i}) \) is the regression function, \( \mathbf{x}_{i} \sim U^{10}[0,1] \) is the i-th row of the design matrix, generated from a uniform distribution, and \( \epsilon_{i} \sim N(0,1) \) is the error term (\( \forall i = 1, \dots, n \)).

    For the true data-generating process (DGP), we consider the following model which is quite frequently used in the machine learning literature and termed “Friedman #1”-model:

        \[ f(\mathbf{x}) = 10 \sin(\pi x^{(1)} x^{(2)}) + 20 \left( x^{(3)} - \frac{1}{2} \right)^{2} + 10 x^{(4)} + 5 x^{(5)} \]

    where \( \mathbf{x}^{(j)} \) is the j-th column of the design matrix \( \mathbf{x} \) (for \( 1 \leq j \leq 10 \)).

    As you can see, this model is highly non-linear – Regression Tree models should therefore be appropriate to approximate our DGP.
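    As a minimal, self-contained sketch of this DGP (independent of the simulation_class module used below; the function and variable names here are illustrative only):

    import numpy as np

    def friedman_1(X):
        """Friedman #1 regression function; X has shape (n_obs, 10) with entries in [0, 1]."""
        return (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
                + 20 * (X[:, 2] - 0.5) ** 2
                + 10 * X[:, 3]
                + 5 * X[:, 4])

    # Draw a uniform design and add N(0, 1) noise
    rng = np.random.default_rng(0)
    X = rng.uniform(size = (500, 10))
    y = friedman_1(X) + rng.normal(size = 500)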

    To evaluate the prediction performance of Bagging and Subagging predictors we conduct a Monte Carlo simulation in Python.

    We first import the relevant packages.

    import numpy as np
    import sklearn.model_selection
    import sklearn.ensemble
    import simulation_class
    import math
    from sklearn.metrics import mean_squared_error
    from sklearn.tree import DecisionTreeRegressor
    

    The module simulation_class is a user-specified class that we will not discuss in this blog post but in a subsequent one.

    Further, we specify the simulation set-up:

    # Number of regressors
    n_reg = 10

    # Observations
    n_obs = 500

    # Simulation runs
    n_sim = 50

    # Number of trees, i.e. number of bootstrap samples (Step 1)
    n_tree = 50

    # Error variance
    sigma = 1

    # Grid for the subsample fraction
    start_grid = 0.1
    end_grid = 1
    n_grid = 100

    grid_range = np.linspace(start_grid, end_grid, num = n_grid)
    

    Below, we will explain in more detail why we need this grid specification.

    To store our simulation results we set up containers.

    # Container set-up
    mse_temp_bagging = np.empty(shape = (n_obs, n_sim))
    mse_temp_subagging = np.empty(shape = (n_obs, n_sim))

    y_predict_bagging = np.empty(shape = (n_obs, n_sim))
    y_predict_subagging = np.empty(shape = (n_obs, n_sim))

    mse_decomp = np.empty(shape = (len(grid_range), 2))
    

    With this initialization at hand, we generate the training and test data with the simulation_class module.

    # Creation of simulation data
    train_setup = simulation_class.simulation(n_reg = n_reg,
                                              n_obs = n_obs,
                                              n_sim = n_sim,
                                              sigma = sigma,
                                              random_seed_design = 0,
                                              random_seed_noise = 1)

    test_setup = simulation_class.simulation(n_reg = n_reg,
                                             n_obs = n_obs,
                                             n_sim = n_sim,
                                             sigma = sigma,
                                             random_seed_design = 2,
                                             random_seed_noise = 3)

    f_train = train_setup.friedman_model()
    X_train, y_train = train_setup.error_term(f_train)

    f_test = test_setup.friedman_model()
    X_test, y_test = test_setup.error_term(f_test)
    

    Having generated the data for our “Friedman #1” model, we can now simulate the mean squared error of the Bagged and Subagged predictors. In Python, both algorithms are implemented via the BaggingRegressor class of the sklearn.ensemble package. Observe that for the Subagged predictor, we need to specify the parameter max_samples in the BaggingRegressor. This ensures that we can draw a subsample of size \( m = a \cdot n \) with subsample fraction a from the original data. For the subsample fraction a, we have already specified the grid above via the variable grid_range.

    # Subagging simulation
    for index, a in enumerate(grid_range):
        for i in range(0, n_sim):
            # bagged estimator
            bagging = sklearn.ensemble.BaggingRegressor(
                bootstrap = True,
                n_estimators = 50)

            y_predict_bagging[:,i] = bagging.fit(
                X_train,
                y_train[:,i]).predict(X_test)

            mse_temp_bagging[:,i] = mean_squared_error(
                y_test[:,i],
                y_predict_bagging[:,i])

            # subagged estimator
            subagging = sklearn.ensemble.BaggingRegressor(
                max_samples = math.ceil(a*n_obs),
                bootstrap = False,
                n_estimators = 50)

            y_predict_subagging[:,i] = subagging.fit(
                X_train,
                y_train[:,i]).predict(X_test)

            mse_temp_subagging[:,i] = mean_squared_error(
                y_test[:,i],
                y_predict_subagging[:,i])

        # columns 0 and 1 hold the Bagging and Subagging MSPE, respectively
        mse_decomp[index, 0] = np.mean(mse_temp_bagging)
        mse_decomp[index, 1] = np.mean(mse_temp_subagging)
    

    On my GitHub account, you can find additional code that also calculates the simulated bias and variance for the fully grown tree and the Bagged tree.

    Results

    The results of our above simulation can be found in Figure 1.

    Let us first compare the performance, in terms of MSPE, of the Regression Tree and the Bagged predictor. Table 1 shows that Bagging drastically reduces the MSPE by decreasing the variance while leaving the bias almost unaffected. (Recall that the mean squared prediction error is just the sum of the squared bias of the estimate, the variance of the estimate, and the variance of the error term (not reported).)
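    Written out, this decomposition reads

        \[ MSPE = \text{Bias}^{2} + \text{Variance} + \sigma^{2}, \]

    and you can check it against Table 1: for the fully grown tree, 3.47 + 6.13 + 1 ≈ 10.61, and for the Bagged tree, 2.94 + 0.35 + 1 ≈ 4.26 (up to simulation and rounding error).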

    Table 1: Performance of the fully grown tree and the Bagged predictor

    Predictor    Tree (fully grown)    Bagged Tree
    Bias^2       3.47                  2.94
    Variance     6.13                  0.35
    MSPE         10.61                 4.26

    Figure 1 displays the MSPE as a function of the subsample fraction a for the Bagged and Subagged predictors (computed with the code above). From Figure 1 and Table 1, we make several observations:

    • We see that both the Bagged and Subagged predictor outperform a single tree (in terms of MSPE).
    • For a subsampling fraction of approximately 0.5, Subagging achieves nearly the same prediction performance as Bagging while coming at a lower computational cost.

    Figure 1: MSPE comparison of Bagging and Subagging

    References

    1. Breiman, L.: Bagging predictors. Machine Learning, 24, 123–140 (1996).
    2. Bühlmann, P., Yu, B.: Analyzing bagging. Annals of Statistics 30, 927–961 (2002).

    Robin Kraft
