Motivation
There are dozens of machine learning algorithms out there. It is impossible to learn all their mechanics, however, many algorithms sprout from the most established algorithms, e.g. ordinary least squares, gradient boosting, support vector machines, tree-based algorithms, and neural networks. At STATWORX we discuss algorithms daily to evaluate their usefulness for a specific project or problem. In any case, understanding these core algorithms is key to most machine learning algorithms in the literature.
While I like reading machine learning research papers, the math is sometimes hard to follow. That is why I am a fan of implementing the algorithms in R by myself. Of course, this means digging through the maths as well. However, you can challenge your understanding of the algorithm directly.
In my two subsequent blog posts, I will introduce two machine learning algorithms in under 150 lines of R Code. The algorithms will cover all core mechanics while being very generic. You can find all code on my GitHub.
This blog post will introduce you to gradient boosted machines. Surely, there are tons of great articles out there which explain gradient boosting theoretically accompanied with a hands-on example. This is not the objective of this blog post. If you are interested in the theory and the evolution of the algorithm I would strongly recommend reading the paper of Bühlmann and Hothorn (2007).(1) The objective of this blog post is to establish the theory of the algorithm by writing simple R code. You do not need any prior knowledge of the algorithm to follow.
Gathering all puzzle pieces
Gradient boosting is a very special machine learning algorithm because it is rather a vehicle for machine learning algorithms rather than a machine learning algorithm itself. That is because you can incorporate any machine learning algorithm within gradient boosting. I admit that sounds quite confusing, but it will be clear by the end of this post.
Anyway, what do we need to boost those gradients? Well, we need to define our problem (a little maths is necessary 🙂 ).
We have the following problem:
This is just a simple regression equation, where we state that we want to map our features onto a real-valued target with estimator and an error term (residuals) . This is nothing out of the ordinary every machine learning algorithm starts here or somewhere close to this equation. However, unlike other machine learning algorithms, specifying a loss function is mandatory.
# this is our loss function
# y - the target
# yhat - the fitted values
loss <- function(y, yhat) return(mean(1/2*(y-yhat)^2))
Our objective is to find the minimum of this loss function, which is an adaption of the mean squared error. The algorithm exploits the fact that in a minimum there is no slope in our loss function. The gradient of our function, which is the negative partial derivative with respect to yhat
, describes this slope. So we can actually reformulate our objective to searching a gradient of zero or close to zero to ultimately fulfill our goal of minimizing the loss.
# the derivative of our loss function is the gradient
# y - the target
# yhat - the fitted values
gradient <- function(y, yhat) {return(y - yhat)}
Now comes the interesting part of the algorithm. In our case, the gradient coincides with the residuals u = y – yhat
. Remember, we want the gradient to be zero or close to zero. Or put differently, we want the state y = yhat
. The algorithm exploits this fact by simply iteratively fitting (boosting) the residuals (the gradient) instead of y
itself. This means we update our gradient in every iteration to improve our fit in general. This step describes our movement to the minimum of our loss function. We will clarify this observation with a little example once we see the theory in action.
The algorithm
Let us first have a look at the algorithm as a whole.
# grad_boost
#' Fit a boosted linear model
#'
#' @param formula an object of class formula
#' @param data a data.frame or matrix
#' @param nu a numeric within the range of [0,1], the learning rate
#' @param stop a numeric value determining the total boosting iterations
#' @param loss.fun a convex loss function which is continuously differentiable
#' @param grad.fun a function which computes the gradient of the loss function
#' @param yhat.init a numeric value determining the starting point
#'
#' @return itemize{
#' item theta - this is our estimator
#' item u - this is the last gradient value
#' item fit - our fitted values, i.e. X %*% theta
#' item formula - the underlying formula
#' item data - the underlying data
#' }
#' @export
#'
#' @examples # Complete runthrough see: www.github.com/andrebleier/cheapml
grad_boost <- function(formula, data, nu = 0.01, stop,
grad.fun, loss.fun, yhat.init = 0) {
# coerce to data.frame
data <- as.data.frame(data)
# handle formula
formula <- terms.formula(formula)
# get the design matrix
X <- model.matrix(formula, data)
# extract target
y <- data[, as.character(formula)[2]]
# initial fit
fit <- yhat.init
# initialize the gradient with yhat.init
u <- grad.fun(y = y, yhat = fit)
# initial working parameter (our theta)
# this is just an empty body
theta <- rep(0, ncol(X))
# track loss
loss <- c()
# boost from 1 to stop
for (i in 1:stop) {
# estimate working parameter (via OLS)
# theta_i = (X'X)^-1 X'y
# This is the (base procedure) where you can literally place
# any optimization algorithm
base_prod <- lm.fit(x = X, y = u)
theta_i <- coef(base_prod)
# update our theta
theta <- theta + nu*as.vector(theta_i)
# new fitted values
fit <- fit + nu * fitted(base_prod)
# update gradient
u <- grad.fun(y = y, yhat = fit)
# update loss
loss <- append(loss, loss.fun(y = y, yhat = fit))
}
# naming the estimator
names(theta) <- colnames(X)
# define return
return(list(theta = theta, u = u, fit = fit, loss = loss,
formula = formula, data = data))
}
All the magic happens within the for
loop, which is coded in less than 30 lines of code. Everything else is just input/output handling and so on.
Let us break down what is happening.
# coerce to data.frame
data <- as.data.frame(data)
# handle formula
formula <- terms.formula(formula)
# get the design matrix
X <- model.matrix(formula, data)
# extract target
y <- data[, as.character(formula)[2]]
# initial fit
fit <- f.init
# initialize the gradient with yhat.init
u <- grad.fun(y = y, yhat = fit)
# initial working parameter (our theta)
# this is just an empty body
theta <- rep(0, ncol(X))
# track loss
loss <- c()
The first part is just input handling, but we can see that we have to initialize the gradient, by starting with a best-guess fit yhat.init
, e.g. zero. After initializing, we need some containers to store our loss and our estimator theta
.
The next part is all about our optimization, where we iteratively approach the minimum of our prespecified loss function.
# boost from 1 to stop
for (i in 1:stop) {
# estimate working parameter (via OLS)
# theta_i = (X'X)^-1 X'y
# This is the (base procedure) where you can literally place
# any optimization algorithm
base_prod <- lm.fit(x = X, y = u)
theta_i <- coef(base_prod)
# update our theta
theta <- theta + nu*as.vector(theta_i)
# new fitted values
fit <- fit + nu * fitted(base_prod)
# update gradient
u <- grad.fun(y = y, yhat = fit)
# update loss
loss <- append(loss, loss.fun(y = y, yhat = fit))
}
Unveiling the magic
Alright, let’s get to the gravy. We will go through the first iteration step by step. The first thing we are doing is to estimate the gradient with a simple regression. Yes, a simple regression which you, unfortunately, have to accept here for a moment. Remember how we initialized the gradient. In the first iteration, we are doing ordinary least squares with our features X
and our target y
, because the gradient coincides with our target, i.e. u = y - 0
. What happens next is the last missing puzzle piece to gradient boosting. We are simply updating our estimator theta
the fitted values fit
and our gradient u
. But what does this accomplish? Suppose, this multiplier nu
is equal to one, then the following is quite straightforward. We are stating that our new fit – we initialized it with zero – is plain ordinary least squares and our estimator theta
is the OLS estimate as well. After updating the gradient we observe that the gradient in iteration two is actually the residuals from the OLS fit of iteration one.
Let us recap that in some code real quick. In the first iteration, we were actually fitting y
, since u = y - 0
and in the second iteration we are fitting u = resid(base_prod)
, since u = y – fitted(base_prod)
. Thus, if we repeat this often enough u
should converge to zero, since our fit will improve everytime. But why should it improve every time? The algorithm allows us to fit the gradient – which again are the residuals – at every iteration with completely new estimates theta_i
. Naturally, this guess will be good enough to let the gradient converge towards zero or put differently minimize the distance between our target y
and our fit yhat
. And yes, this means that with gradient boosting we can theoretically reach the equilibrium of y = yhat
, which we desire. But do we though? Remember we could theoretically reach that state in every problem we estimate with gradient boosting. Well, the important point to note here is that we can only do this for the data we have at hand. Chances are very high when observing new data, that our gradient boosted machine is garbage, because our estimator theta
does not cover the underlying effect of our features on the target, but covers rather a combination that minimizes our loss function with the data we fed it. That is why we have to control the boosting iterations. We have to terminate the iterative fitting at the right time, to get a theta
that explains the effect we are interested in rather than minimizing loss functions. There are so-called early-stopping techniques that terminate the learning by exceeding an out-of-sample risk to avoid this phenomenon called overfitting the data.
Anyway, let us come back to the parameter nu
. This is the shrinkage (learning rate) parameter. It controls how big the steps towards the minimum of our loss function are. Suppose, the same scenario as above with nu = 0.1
, that would mean we would not get the full OLS fit, but only 10% of it. As a result, we would need more iterations to converge to the minimum, however, empirically smaller learning rates have been found to yield better predictive accuracies.(2)
In the beginning, I have stated that gradient boosting is rather a vehicle for machine learning algorithms than a machine learning algorithm itself. Essentially, this is because instead of lm.fit
we could have used any other optimization function here. Most gradient boosted machines out there uses tree-based algorithms, e.g. xgboost. This makes the gradient boosted machine a very unique machine learning algorithm.
I have created a little run-through with data from my simulation function on my GitHub, which you can check out and try everything on your own step by step. Make sure to check out my other blog post about coding regression trees from scratch.
References
- Bühlmann, P., & Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4), 477-505
- Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.