greedy forest

Regularized Greedy Forest – The Scottish Play (Act I)

Fabian Müller Blog, Data Science, Statistik

Macbeth shall never vanquish'd be until

Great Birnam Wood to high Dunsinane Hill

Shall come against him. (Act 4, Scene 1)

In Shakespeare's The Tragedy of Macbeth, the prophecy of Birnam Wood is one of three misleading prophecies foreshadowing the defeat of the protagonist of the same name. While highly unlikely, the event of a nearby forest moving towards his castle marks the beginning of Macbeth's doom.

This blog post likewise centers on a (greedy) forest having the potential to overthrow a ruling king. It is about a direct decision forest learning algorithm threatening the dominance of Gradient Boosting Decision Trees (GBDT) for problems in machine learning requiring to learn nonlinear functions from training data.

Recap Gradient Boosting Decision Trees

Before turning to the prophecy, let us explore one of the most widely applied machine learning techniques – Gradient Boosting Decision Trees. Given a nonlinear function F(x) with input x, we want to find a suitable prediction function \hat{F}(x) with training data {y,x} by minimising an arbitrary loss function:

    \[\hat{F}=\underset{f}{\mathrm{argmin}}\, \mathcal{L}(y,F(x))\]

Gradient Boosting solves this problem by fitting an additive model:

    \[F(x, \alpha, \beta) = \sum_{m=0}^{M}\alpha_mf(x,\beta_m)\]

with M base learners f(x,\beta) characterised by \beta and step size \alpha. For the case of a quasi least-squares regression (\mathcal{L}=\frac{(y-f(x))^2}{2}), the complete algorithm, can be summarised as follows (Friedman 2001):

  1. Set the initial model \hat{F}_0(x) =\hat{f}_0(x)=\bar{y}, where \bar{y} is the empirical mean of all training data.

  2. For each boosting iteration m=1,2,…M:

    1. Compute the residual vector \tilde{y_i}=y_i-F_{m-1}(x),i=1,2,…,N, which is equivalent to the negative gradient -g(x_i).
    2. Fit a base learner f(x,\beta) to the negative gradient vector and obtain \beta_m by: \beta_m=\underset{\beta}{\mathrm{argmin}}\, \sum_{i=1}^{N}(-g_m(x_i)-f(x_i,\beta_m))^2
    3. Find the optimal step size \alpha_m by \alpha_m = \underset{\alpha}{\mathrm{argmin}}\sum_{i=1}^{N}\mathcal{L}(y_i, F_{m-1}(x_i)+\alpha_mf_m(x,\beta_m))
    4. Update model: F_m(x)=F_{m-1}+\alpha_mf_m(x,\beta_m)

In order to control overfitting, the line 2.4 is normally replaced by F_m(x)=F_{m-1}+\eta\alpha_mf_m(x,\beta_m), where \eta\in(0,1) is the shrinkage parameter (or learning rate).

Motivation

While GBDT is among the most successfully applied machine learning algorithms, its design is not above all doubt. Johnson et al. (2014) point out three shortcomings of the general boosting framework outlined above:

  1. A lack of explizit regularization
  2. Huge models in terms of boosting iterations due to small step sizes
  3. Base learner treated exclusively as black boxes

The first criticism is especially interesting. While modern implementations of boosting, like XGBoost and LightGBM, add explicit regularization terms, the general boosting framework does not. It is often argued, that the combination of small step sizes and frequent boosting iterations act as an implicit regularization (Hastie et al. 2009). While this is certainly true, it is still unclear in which way those implicit regularization techniques interact with each other and whether this interaction is optimal or not (Rosset et al. 2004, Johnson et al. 2014).

As a result of this regularization technique, the small step sizes, lead to more complex models that need long boosting paths. Besides the small step sizes, performing only a partial corrective step in every iteration m is another reason that calls for long boosting paths. More precise, in GBDT the parameter vector (\alpha, \beta) gets optimized only for the current model F_m while all previous models F_{m-1}, ... , F_0 remain untouched (see 2.4). This is what we call forward stagewise additive modeling. Warmuth et al. (2006) argued that a fully-corrective step, where all parameters (\alpha_m, \beta_m){|m=1,…M} get optimized in every iteration m, can significantly accelerate the convergence of boosting procedures, hence making small step sizes redundant.

A Regularized Greedy Forest

The Regularized Greedy Forest (RGF) extends the general boosting framework by the following three ideas:

  1. Add an explicit regularization function
  2. Perform fully-corrective steps
  3. Add structured sparsity based on the forest structure

The Idea of a Decision Forest

As in GBDT, RGF uses a forest \mathcal{F} composted of multiple decision trees T_1, T_2, ... , T_K. Each T_k consists of nodes v and edges e. Each e is associated with a splitting variable k_e and a corresponding splitting threshold t_e. For a better understanding of the later fully-corrective steps, it helps to visualise \mathcal{F} in it's entire structure (see figure below).

tree structure

If \mathcal{F} is a forest with nodes v, it can be written as an additive model over all nodes:

    \[h_{\mathcal{F}}(x)=\sum_{v}\alpha_vb_v(x)\]

where b_v(x) is a nonlinear decision rule (basis function) at node v. Hence, b_v(x) can be summarized as the product of single decisions of the form I(x_{k_e}\leq t_e) | I(x_{k_e}>t_e) along the path from the root to v; \alpha_v is a weight assigned to v (for more details on the weight see Hastie et al. 2009, p. 312). Furthermore, the regularized loss function can be expressed as:

    \[\mathcal{Q}(\mathcal{F}) = \mathcal{L}(h_{\mathcal{F}}(X), Y) + \mathcal{R}(\mathcal{h}_{\mathcal{F}})\]

Implementation of Greedy Optimization

In order to minimize \mathcal{Q}(\mathcal{F}), one can either change the forest structure by changing the basis functions b_v(x) or change the weights \alpha_v. The GBDT algorithm does both in an iterative alternating way.

For the structural change, two options are allowed: (1) split an existing leaf, or (2) start a new tree. Both options are displayed in the figure below, where T_3 is split in two new nodes and T_4 is a new tree starting from the root.

change in tree structure

On the computation side, one step of structure-changing operation from the current forest \mathcal{F} to the new forest \hat{\mathcal{F}}(\beta_1,\beta_2) can be done by considering the loss reduction resulting from splitting a node v with parameters (b, \alpha) into (b_1, \alpha+\delta_1) and (b_2, \alpha+\delta_2). Hence, find \underset{\delta_1, \delta_2}{\mathrm{argmin}}\, \mathcal{Q}(\hat{\mathcal{F}}(\beta_1,\beta_2)). Johnson et al. (2014) suggested to use Newton's method based on the first and second order partial derivatives of \mathcal{Q} to solve the optimisation problem. This is a slightly different approach compared to GBDT where gradient-descent approximation is normally used. Having fixed a new structure of the forest, the second step involves optimising the weights \alpha_v while keeping the structure b_v fixed.

Tree-structured Regularization

Recall that the partial corrective step in combination with small step sizes acts as an implizit regularization for the GBDT. Since the RGF does a fully-corrective step, which does not involve a step size, it needs a more explizit regularization to prevent overfitting.

Johnson et al. (2014) suggested to use a tree-structured penalization technique derived from the grouped lasso (Yuan and Lin 2006). In the grouped lasso, sparsity is enforced by applying the L_1 norm penalty on prior known groups of features instead of individual features. Hence, a group of features shares a common penalty. Zhao et al. (2008) generalized this regularization theme by suggesting to assume the feature to form a tree structure (see als tree-guided group lasso in Kim and Xing (2010)). Johnson et al. (2014) leverage this idea by using the already available forest structure as a grouping schema. More precise, they define a shared regularization penalty for all trees sharing a similar topology. Thus, if two trees {T(\alpha_1)_1, T(\alpha_2)_2} have an equivalent (see the original paper for a detailed explanation of equivalence) structure, their weight vectors {\alpha_1, \alpha_2} share a common penalty term.

Without going to must into details, Johnson et al. (2014) suggested three different tree-structured regularizers including a simple L_2 norm regularizer on the leaf weights \alpha_v, v\in T:

    \[\mathcal{R}(\mathcal{h}_{\mathcal{F}})=\sum_{k=1}^K \mathcal{R}(\mathcal{h}_{T_{k}}) = \sum_{k=1}^K \lambda \sum_{v \in T_k}||\alpha_v||_{2}\]

Summary

Regularized Greedy Forest (RGF) is another tree-based sequential boosting algorithm. It differs from the more common gradient boosting framework in three aspects. First, RGF applies fully-corrective steps, which means updating all parameters and weights for all trees (the forest) in every iteration. Second, RFG optimizes an explicitly regularized loss function. Third, RGF leverages the forest structure by imposing a tree-structured regularization technique.

This blog post summarized the most important theoretical aspects of RGF. In the next post, we will apply RGF to various real world problems and compare its speed and performance with GBDT to see whether RGF is the new go-to algorithm in applied data science. Until then, keep Macbeth's words in good memory:

That will never be.

Who can impress the forest, bid the tree

Unfix his earthbound root? (Act 4, Scene 1)

Appendix – RGF in Competitions

References

  • Johnson, Rie and Tong Zhang. 2014. "Learning Nonlinear Functions Using Regularized Greedy Forest." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 36, No. 5.
  • Warmuth, Manfred K., Jun Liao, and Gunnar Rätsch. 2006. "Totally Corrective Boosting Algorithmus that Maximize the Margin." Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA.
  • Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. "The Elements of Statistical Learning." 2nd Edition. Springer, New York.
  • Yuan, Ming and Yi Lin. 2006. "Model selection and estimation in regression with grouped variables." Royal Statistical Society, Vol. 68, No. 1.
  • Zhao, P., Rocha, G., and Yu, B. 2008. "Grouped and hierarchical model selection through composite absolute penalties." Technical Report 703, Department of Statistics, University of California, Berkeley.
  • Kim, Seyoung and Eric P. Xing. 2010. "Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity." Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel.
  • Rosset, Saharon, Ji Zhu, and Trevor Hastie. 2004. "Boosting as a Regularized Path to a Maximum Margin Classifier." Journal of Machine Learning Research, Vol. 5, No. 1.
Über den Autor
Fabian Müller

Fabian Müller

I am the Head of Data Science at STATWORX and responsible for our data science teams and key accounts. In my spare time, I'm into doing sports and fast cars.

ABOUT US


STATWORX
is a consulting company for data science, statistics, machine learning and artificial intelligence located in Frankfurt, Zurich and Vienna. Sign up for our NEWSLETTER and receive reads and treats from the world of data science and AI. If you have questions or suggestions, please write us an e-mail addressed to blog(at)statworx.com.