
Regularized Greedy Forest – The Scottish Play (Act II)


In part one of this blog post, the Regularized Greedy Forest (RGF) was introduced as a contender to the more frequently used technique of Gradient Boosting Decision Trees (GBDT). Now it is time to turn words into actions and find out whether it actually is one. Among all GBDT implementations, XGBoost is probably the most widely used in the field of machine learning. Hence, it serves as a natural benchmark for the RGF.

Regularized Greedy Forest in Python

The authors of the original paper implemented the RGF algorithm in C++. Fortunately, the rgf_python library offers a convenient wrapper which allows fitting directly from Python. rgf_python is available on GitHub and supports both classification and regression tasks using RGFClassifier() and RGFRegressor(), respectively.

In addition to the vanilla RGF, rgf_python also includes the FastRGF implementation, which adds multi-core support. The latter uses some approximations to accelerate the training process and should be used with caution. During our tests, we found that FastRGF performs significantly worse on some (notably smaller) datasets compared to the vanilla RGF. Therefore, we focus on the vanilla RGF algorithm as described by Johnson and Zhang (2014).
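For readers who still want to give FastRGF a try, here is a minimal sketch of how it can be swapped in; FastRGFClassifier ships with rgf_python, while the n_estimators value shown is purely illustrative:

# Minimal sketch: FastRGF as a drop-in alternative (see caveats above)
from rgf.sklearn import FastRGFClassifier

# Approximate, multi-core variant of RGF; illustrative setting
fast_rgf = FastRGFClassifier(n_estimators=500)
# fast_rgf.fit(X_train, y_train)  # same scikit-learn API as RGFClassifier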

# Imports
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from rgf.sklearn import RGFClassifier

We use Python’s pandas library to load training and test data from two flat files. To be compatible with the scikit-learn API, one has to create a feature matrix (X_train, X_test) and a vector of class labels (y_train, y_test). X is assumed to be a numeric (potentially one-hot-encoded) matrix and y a vector of zeros and ones.

# Load data
df_train = pd.read_csv("path_to_training.csv")
df_test = pd.read_csv("path_to_test.csv")

# Target and features
y_train = df_train.loc[:, "label"]
X_train = df_train.drop("label", axis=1)
y_test = df_test.loc[:, "label"]
X_test = df_test.drop("label", axis=1)
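If the raw files still contain categorical columns, these have to be encoded before fitting. A minimal sketch using pandas, assuming the data requires this step at all:

# One-hot-encode categorical features (RGF expects a purely numeric matrix)
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

# Align columns in case train and test contain different category levels
X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)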

The RGF algorithm features several hyperparameters affecting both training speed and accuracy. A complete list of all available hyperparameters can be found in the rgf_python documentation.

max_leaf is the maximum number of leafs in the complete forest. Similar to the number of trees in GBDT, this hyperparameter controls the complexity of the model.

In their paper, Johnson and Zhang (2014) discussed several tree-structured penalization techniques. The algorithm hyperparameter controls which technique is used:

  • RGF : RGF with L_2 regularization on leaf-only models
  • RGF_Opt : RGF with min-penalty regularization
  • RGF_Sib : RGF with min-penalty regularization with the sum-to-zero sibling constraint

Finally, the l2 hyperparameter controls the degree of L_2 regularization added to the model.

# Parameter grid
param_grid = {'max_leaf': [1000, 5000, 10000],
              'algorithm': ['RGF_Sib'],
              'l2': [1.0, 0.1, 0.01]}

We use the scikit-learn API to run a 5-fold cross-validation over the hyperparameter grid.

# Init grid search object
grid_search = GridSearchCV(estimator=RGFClassifier(),
                           param_grid=param_grid,
                           scoring="roc_auc",
                           cv=5,
                           n_jobs=4,
                           verbose=3)

# Run grid search
grid_search.fit(X=X_train, y=y_train)
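Once the search has finished, the winning hyperparameter combination and its mean cross-validated score can be inspected directly on the fitted object:

# Inspect the best hyperparameter combination and its mean CV AUC
print(grid_search.best_params_)
print(grid_search.best_score_)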

Given the optimal cross-validated hyperparameters, the final model is applied to the test dataset. The AUC score on the test dataset is used as the ultimate criterion to judge model performance.

# Predict class probabilities on test set (AUC needs scores, not labels)
y_predict = grid_search.predict_proba(X=X_test)[:, 1]

# Calculate AUC on test set
test_score = roc_auc_score(y_true=y_test, y_score=y_predict)

RGF vs. XGBoost

The two algorithms fight for the best performance on four datasets from the UCI Machine Learning Repository: one classification task (skin) and three regression tasks (air, bike, and gas).

We frequently use all four datasets to evaluate various machine learning algorithms. This makes it possible to compare the RGF not only to XGBoost but also to a range of different algorithm classes without having to fit each of them again (more on that later).

As for evaluation metrics, we use AUC score for the classification task and mean squared error (MSE) for the regression tasks. The complete code can be found on GitHub.
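For the regression tasks, the setup is analogous to the classification example above. A minimal sketch using RGFRegressor and scikit-learn's MSE; the hyperparameter grid here is illustrative, not the one used in the benchmark:

# Regression counterpart: RGFRegressor scored with (negative) MSE
from rgf.sklearn import RGFRegressor
from sklearn.metrics import mean_squared_error

reg_search = GridSearchCV(estimator=RGFRegressor(),
                          param_grid={'max_leaf': [1000, 5000],  # illustrative
                                      'l2': [1.0, 0.1]},
                          scoring="neg_mean_squared_error",
                          cv=5)
reg_search.fit(X=X_train, y=y_train)
test_mse = mean_squared_error(y_true=y_test, y_pred=reg_search.predict(X_test))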

Both algorithms solved the skin classification task extremely well. Both achieve AUC scores above 0.999 and differ only in the fourth decimal place. A similar tie was found in the air regression task, where the RGF is slightly better with an MSE of 0.061 compared to 0.063 for XGBoost.

In the bike regression task, the RGF clearly outperforms XGBoost (MSE 1.898 vs. 4.899), while XGBoost outperforms the RGF in the gas regression task (MSE 1284 vs. 1458). For both datasets, the differences in performance are notable and might also be relevant for real-world applications.

(Figure: Performance RGF vs. XGB)

Overall, the results are mixed and indicate that the RGF can outperform XGBoost in certain instances, but is definitely not always the better choice. Given the overall tie in terms of performance, one can consult a second criterion for evaluation – training speed. Here, the (vanilla) RGF has trouble keeping up with the extremely well implemented and highly parallelized XGBoost. Even with 16 times more hyperparameter combinations to evaluate, the overall runtime of XGBoost was still almost always below that of the RGF.
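Such runtime differences are easy to quantify by timing the fit calls directly. A rough sketch, assuming xgboost and its scikit-learn wrapper XGBClassifier are installed; the settings are illustrative:

# Rough timing comparison of a single fit per library
import time
from xgboost import XGBClassifier
from rgf.sklearn import RGFClassifier

for name, model in [("RGF", RGFClassifier(max_leaf=1000)),
                    ("XGBoost", XGBClassifier(n_estimators=100))]:
    start = time.time()
    model.fit(X_train, y_train)
    print(f"{name}: {time.time() - start:.1f}s")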

RGF vs. All The Rest

Expanding the range of competitors produced some surprising results. Neither RGF nor XGBoost is the dominant algorithm for the selected datasets. Instead, the Extremely Randomized Trees (extraTrees, Geurts et al. 2006), a derivative of the Random Forest, bagged its way to the best performance on three out of four datasets. Seems like the good old bagging way of life is still suitable for many machine learning tasks.

Dataset | Metric | RGF (Python) | XGBoost (Python) | GLMNet (R) | RF (R) | extraTrees (R) | blackboost (R)
Skin    | AUC    | 0.9994       | 0.9996           | 0.9650     | –      | 0.9998         | –
Air     | MSE    | 0.0612       | 0.0632           | 0.0452     | 0.0610 | 0.0588         | 0.0640
Bike    | MSE    | 1.8989       | 4.8993           | 5.2583     | 2.5120 | 1.7251         | 2.8222
Gas     | MSE    | 1458.7       | 1284.0           | 8732.9     | 1179.9 | 850.30         | 2900.9
(– : value not reported)
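The benchmark used the R package extraTrees; for Python users, scikit-learn ships a comparable implementation. A sketch with illustrative settings:

# Extremely Randomized Trees in Python via scikit-learn
# (the benchmark itself used the R extraTrees package)
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error

et = ExtraTreesRegressor(n_estimators=500, n_jobs=-1)  # illustrative settings
et.fit(X_train, y_train)
print(mean_squared_error(y_true=y_test, y_pred=et.predict(X_test)))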

And What About R?

For R users, the RGF package allows calling the rgf_python functions from R. It is based on the reticulate package, which provides a convenient R interface to Python, and is available on both GitHub and CRAN. For more details on using RGF in R, see this detailed blog post.

Summary

In the end, the prophecy was right once again. The current ruling king of supervised learning algorithms – Gradient Boosting Decision Trees – was overthrown by another forest algorithm. But it was not the greedy one, it was the extremely randomized one. However, keeping the no free lunch theorem in mind, the Regularized Greedy Forest is yet another algorithm worth trying out. While it is certainly not always the best one, there are situations where it performs considerably better than classic Gradient Boosting Decision Trees.

All hail, extraTrees, that shalt be king hereafter!

References

  • Johnson, Rie, and Tong Zhang. 2014. "Learning Nonlinear Functions Using Regularized Greedy Forest." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 36, No. 5.
  • Geurts, Pierre, Damien Ernst, and Louis Wehenkel. 2006. "Extremely Randomized Trees." Machine Learning, Vol. 63, No. 1.
About the Author

Fabian Müller

Fabian is our Team Lead for Data Science and, together with his team, supports our major corporate clients. In his free time, he does a lot of sports and is a big car enthusiast.