Boosting is a powerful ensemble learning technique in machine learning (ML) that improves model accuracy by reducing errors. By training sequential models to address prior shortcomings, boosting creates robust predictive systems. This guide covers how boosting works; its advantages, challenges, and applications; and how it compares to bagging.
What is boosting?
Boosting is an ensemble learning technique that trains new, sequential models to correct the errors of the previous models in the ensemble. Ensemble learning techniques are ways of using multiple similar models to improve performance and accuracy. In boosting, the new models are trained solely on the prior errors of the ensemble. Then the new models join the ensemble to help it give more accurate predictions. Any new input is passed through the models and aggregated to reduce the errors over all the models.
Accuracy is a broad concept. Boosting specifically increases model performance by reducing model bias (and, to a lesser extent, variance). Variance and bias are two important ML concepts we’ll cover in the next section.
Bias vs. variance
Bias and variance are two fundamental properties of machine learning as a whole. The goal of any ML algorithm is to reduce the variance and bias of models. Given their importance, we’ll explain more about each and why they’re usually at odds with each other.
To explain each concept, let’s take the example of predicting the sale price of houses given data about their features (e.g., square footage, number of bedrooms, etc.).
Bias
Bias is a measure of how wrong a model is on average. If a house actually sold for $400,000 and the model predicted $300,000, the error for that data point is −$100,000. Average that error over the entire training dataset, and you have the model’s bias.
Bias usually results from models being too simple to pick up on the complex relationships between features and outputs. A too-simple model may learn to only look at square footage and will be wrong consistently, even on the training data. In ML parlance, this is called underfitting.
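To make this concrete, here is a minimal sketch (using NumPy and made-up prices and predictions) of estimating a model’s bias as its average error over a handful of houses:

```python
import numpy as np

# Hypothetical actual sale prices and the predictions of a too-simple model
actual = np.array([400_000, 350_000, 500_000, 275_000])
predicted = np.array([300_000, 320_000, 420_000, 260_000])

# Per-house error; the first house is off by -$100,000
errors = predicted - actual

# A consistently negative (or positive) average error signals bias, i.e., underfitting
print("Average error (bias estimate):", errors.mean())
```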
Variance
Variance measures how much a model’s outputs differ given similar inputs. In most cases, houses in similar neighborhoods and with similar square footage, number of bedrooms, and number of bathrooms should have similar prices. But a model with high variance may give wildly different prices. Why?
The model may have learned spurious relationships from the training data (e.g., thinking that house numbers affect price). These spurious relationships can then drown out the useful relationships in the data. Generally, complex models pick up on these irrelevant relationships, which is called overfitting.
Bias–variance trade-off
Ideally, you want a low-bias, low-variance ML model that will pick up on the true relationships in the data but not anything more. However, this is hard to do in practice.
Increasing a model’s sophistication or complexity can reduce its bias by giving it the power to find deeper patterns in the data. However, that same power can also help it latch onto irrelevant patterns, increasing variance, and simplifying the model does the reverse. This tension is what makes the bias–variance trade-off hard to resolve.
Boosting improves bias and variance
Boosting is a very popular ensemble learning technique because it can reduce both bias and variance (though variance reduction is not as common).
By correcting prior errors, boosting reduces both the frequency and the size of the ensemble’s errors, lowering bias.
By using multiple models, individual models’ errors can be canceled out, potentially leading to lower variance.
Boosting vs. bagging
In ensemble learning, the two most common techniques are boosting and bagging. Bagging takes the training dataset, makes randomized subsets of it, and trains a different model on each subset. Then the models are used in conjunction to make predictions. This leads to quite a few differences between bagging and boosting, which we detail below.
| | Bagging | Boosting |
| --- | --- | --- |
| Model training | Models are trained in parallel on different subsets of the data. | Models are trained sequentially, with each model focusing on the errors of the previous model. |
| Error reduction focus | Reduces variance | Reduces bias |
| Common algorithms | Random forest, bagged decision trees | AdaBoost, gradient boosting, XGBoost |
| Overfitting risk | Lower risk of overfitting due to random sampling | Higher risk of overfitting |
| Computational complexity | Lower | Higher |
Both techniques are common, but boosting is the more popular choice because it can reduce bias and variance.
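As a rough illustration of the table above, the sketch below trains one bagging ensemble (a random forest) and one boosting ensemble (gradient-boosted trees) on the same synthetic dataset with scikit-learn; the hyperparameters are arbitrary placeholders, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

# Bagging: many deep trees trained independently on random bootstrap samples
bagging = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: shallow trees trained sequentially, each correcting the last
boosting = GradientBoostingClassifier(n_estimators=200, max_depth=2, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```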
How boosting works
Let’s get into how boosting works. Essentially, boosting consists of training each new model on the data points that the previous models got wrong. There are three parts:
- Weighting the training data by errors
- Training a new model on this weighted error dataset
- Adding the new model to the ensemble
To begin with, let’s assume we have trained the initial model (an ensemble of one).
Weighting the training data by errors
We run the training data through the existing ensemble and note which inputs the ensemble gave incorrect predictions for. Then we create a modified version of the training dataset where those troublesome inputs are more represented or more important.
Training the new model
We use the modified dataset we created to train a new model, which is the same type as the other models in the ensemble. However, this new model focuses more on the hard examples from the training data, so it will likely perform better on them. This improvement in error performance is an important part of reducing bias.
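Here is a minimal sketch of these two steps, assuming scikit-learn decision stumps as the base models. The doubling factor for misclassified points is purely illustrative; real algorithms such as AdaBoost (covered below) derive the weights from the ensemble’s error rate:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Start with uniform weights and an initial "ensemble of one"
weights = np.full(len(y), 1 / len(y))
first_model = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)

# Step 1: upweight the training points the current ensemble gets wrong
misclassified = first_model.predict(X) != y
weights[misclassified] *= 2.0   # illustrative factor only
weights /= weights.sum()        # renormalize so the weights still sum to 1

# Step 2: train a new model of the same type on the reweighted data
second_model = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=weights)
```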
Incorporating the new model
The newly trained model is added to the ensemble, and its predictions are weighted according to its accuracy. At prediction time, a new input is passed to every model in the ensemble, and each model’s output is weighted and combined to produce the ensemble’s output.
For classification tasks (usually choosing between two labels in boosting problems), the class with the highest sum of weighted votes for it is chosen as the ensemble’s prediction.
For regression tasks, the ensemble’s prediction is the weighted average of each model’s prediction.
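With made-up model weights and predictions, the aggregation for both cases might look like this:

```python
import numpy as np

# Hypothetical per-model weights (e.g., derived from each model's accuracy)
model_weights = np.array([0.5, 0.3, 0.2])

# Classification: weighted vote between two labels (0 and 1)
class_preds = np.array([1, 0, 1])                   # each model's predicted label
vote_for_1 = model_weights[class_preds == 1].sum()  # 0.7
vote_for_0 = model_weights[class_preds == 0].sum()  # 0.3
ensemble_label = 1 if vote_for_1 > vote_for_0 else 0

# Regression: weighted average of each model's predicted price
reg_preds = np.array([410_000.0, 395_000.0, 430_000.0])
ensemble_price = np.average(reg_preds, weights=model_weights)

print(ensemble_label, ensemble_price)
```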
At this point, the process can repeat if the bias is still too high.
Types of boosting algorithms
There are several variants of boosting algorithms, with some hefty differences between them. The most popular are adaptive boosting (AdaBoost), gradient boosting, extreme gradient boosting (XGBoost), and CatBoost. We’ll cover each in turn.
AdaBoost
AdaBoost is very similar to the boosting algorithm we laid out earlier: Training data that poses problems for earlier ensembles is weighted more when training the next model. AdaBoost was one of the original boosting algorithms and is known for its simplicity.
AdaBoost is less prone to overfitting than other boosting algorithms since each new model sees a different variation of the training dataset, with hard data points weighted more heavily. But, compared to other boosting techniques, it’s more sensitive to outlier data and doesn’t reduce bias as much.
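As a starting point, a short sketch with scikit-learn’s AdaBoostClassifier on synthetic data might look like this; by default it boosts decision stumps, and the hyperparameter values are placeholders to tune rather than recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new stump focuses on the examples the previous stumps misclassified
model = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```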
Gradient boosting
Gradient boosting is a unique approach to boosting. In contrast to adaptive boosting, new models don’t get an error-weighted version of the training dataset. They get the original dataset. However, instead of trying to predict the outputs for the inputs in the dataset, they try to predict the negative gradient of the previous ensemble on each input.
The negative gradient is essentially the direction in which the ensemble’s predictions would need to move to decrease the error, that is, to get closer to the right answer. The negative gradients are added (with a weighting factor applied) to the prior ensemble’s output prediction to nudge it closer to being correct.
Gradient boosting is far more performant than AdaBoost, especially on complex data. There are also more hyperparameters to tune, which gives people more control but also increases the need for experimentation.
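To see the idea in miniature, here is a rough sketch of gradient boosting for squared error, where the negative gradient is simply the residual (the actual value minus the current prediction). It uses scikit-learn regression trees, and the learning rate and tree depth are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # start from a constant model: the mean target
trees = []

for _ in range(100):
    # For squared error, the negative gradient is just the residual: y - current prediction
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    # Nudge the ensemble's predictions toward the targets
    prediction += learning_rate * tree.predict(X)

print("Mean absolute error on training data:", np.abs(y - prediction).mean())
```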
XGBoost
XGBoost (or extreme gradient boosting) is a highly optimized version of gradient boosting. It parallelizes much of the training and inference work, adds regularization (i.e., penalties for complexity) to prevent overfitting, and handles missing data much better. Finally, XGBoost is much more scalable for large datasets and workloads.
XGBoost is even more performant than gradient boosting and was one of the most popular ML algorithms in the 2010s. But it’s also harder to interpret and much more computationally expensive to run.
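If you want to try it, a minimal sketch with the xgboost Python package (a separate install) might look like the following; the hyperparameters are illustrative, with reg_lambda adding L2 regularization and n_jobs enabling parallel tree building:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor  # requires the xgboost package

X, y = make_regression(n_samples=5_000, n_features=20, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBRegressor(n_estimators=300, learning_rate=0.1, max_depth=4,
                     reg_lambda=1.0, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```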
CatBoost
CatBoost is a form of gradient boosting that’s designed to work on categorical data. Categorical data is data where the values can be in a few, limited groups. Here are some examples:
- Yes–no data (e.g., does the house have a garage?)
- Color categories (e.g., red, blue, green)
- Product categories (e.g., electronics, clothing, furniture)
Gradient boosting models don’t generally work well with categorical data, while CatBoost does. CatBoost can also handle continuous data, making it another popular boosting choice. As with other gradient boosting models, CatBoost suffers from computational complexity and overfitting.
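A small sketch with the catboost Python package (a separate install) might look like this; the tiny, made-up housing table exists only to show how categorical columns are declared:

```python
import pandas as pd
from catboost import CatBoostClassifier  # requires the catboost package

# Made-up housing data mixing categorical and continuous features
X = pd.DataFrame({
    "has_garage": ["yes", "no", "yes", "no", "yes", "no"],
    "color": ["red", "blue", "green", "red", "blue", "green"],
    "square_footage": [1500, 900, 2100, 1200, 1700, 1000],
})
y = [1, 0, 1, 0, 1, 0]

# cat_features tells CatBoost which columns to treat as categorical
model = CatBoostClassifier(iterations=50, verbose=0, cat_features=["has_garage", "color"])
model.fit(X, y)
print(model.predict(X.head(1)))
```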
Applications of boosting
Boosting can be applied to almost any ML problem since errors and bias are often higher than we’d like. Classification and regression are two of the big subdivisions of ML, and boosting applies to both. Content recommendations and fraud detection are two examples of ML problems facing companies that boosting can also help with.
Classification and regression
Classification and regression are two of the core ML tasks. A user may want to predict whether an image contains a dog or a cat (classification), or they may want to predict the sale price of a house (regression). Boosting works well for both tasks, especially when the underlying models are weak or not complex.
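As a quick illustration, the sketch below boosts weak decision stumps for both a classification task and a regression task using scikit-learn’s AdaBoost implementations on synthetic data:

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

# Classification: predict one of two labels (e.g., dog vs. cat)
Xc, yc = make_classification(n_samples=1_000, random_state=0)
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(Xc, yc)
print("Classification accuracy:", clf.score(Xc, yc))

# Regression: predict a continuous value (e.g., a house's sale price)
Xr, yr = make_regression(n_samples=1_000, noise=10.0, random_state=0)
reg = AdaBoostRegressor(n_estimators=100, random_state=0).fit(Xr, yr)
print("Regression R^2:", reg.score(Xr, yr))
```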
Content recommendations
Boosting enhances content recommendations (e.g., Netflix’s suggested movies for you) by iteratively improving prediction accuracy for user preferences. When a recommender model fails to capture certain viewing patterns (like seasonal preferences or context-dependent choices), boosting creates additional models that specifically focus on these missed patterns. Each new model in the sequence gives extra weight to previously poorly predicted user preferences, resulting in lower errors.
Fraud detection
In fraud detection, a common use case for finance companies, boosting excels by progressively learning from misclassified transactions. If initial models miss sophisticated fraud patterns, the newer boosted models specifically target these troublesome cases. The technique adapts particularly well to changing fraud tactics by giving higher weights to recent misclassifications, allowing the system to maintain high detection rates.
Advantages of boosting
Boosting is excellent at reducing model bias and, to a lesser extent, variance. Compared to other ensemble techniques, it requires less data and gives people more control over overfitting.
Reduced bias and variance
High bias means that models are often wrong. Boosting is a great technique for reducing bias in models. Since each model focuses on correcting the errors of the previous models, the ensemble as a whole reduces its error rate.
Reduced variance comes as a side effect: Because newer models are trained on different mixes of the training data, errors in different models can cancel each other out.
Needs less data
Unlike other ensemble techniques, boosting doesn’t need a huge dataset to work well. Since each new model focuses primarily on the errors of the older ones, it has a narrow goal and doesn’t need a ton of data. The new model can use the existing training data and repeatedly train on the errors.
More control over overfitting
Boosting has a few hyperparameters that control how much each new model contributes to the ensemble prediction. By modifying these hyperparameters, users can downweight the influence of new models. This would increase bias but potentially lower variance, giving users control of where on the bias–variance trade-off they want to land.
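For example, in scikit-learn’s gradient boosting, the learning_rate hyperparameter shrinks each new model’s contribution to the ensemble. The sketch below (synthetic data, arbitrary values) shows one way to compare settings:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1_000, noise=15.0, random_state=0)

# Smaller learning_rate values downweight each new model's contribution:
# more bias, but potentially less variance and overfitting
for lr in (1.0, 0.1, 0.01):
    model = GradientBoostingRegressor(n_estimators=200, learning_rate=lr, random_state=0)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"learning_rate={lr}: mean R^2 = {scores.mean():.3f}")
```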
Challenges and limitations of boosting
Boosting has its caveats though. It requires more time to train and use, is sensitive to outlier data, and requires more hyperparameter tuning.
Longer training time
In boosting, each new model depends on the previous ensemble’s errors. This means the models must be trained one at a time, leading to long training times. Another downside is that sequential training means you may not know if boosting will be effective until you train a dozen models.
Outlier sensitivity
In boosting, newer models focus solely on the errors of prior models. Some outlier data in the training set that should be ignored may instead become the sole focus of later models. This can degrade the ensemble’s overall performance and waste training time. Careful data processing may be needed to counteract the effects of outliers.
More hyperparameter tuning
The advantage of giving users more control over overfitting also means that users need to tune more hyperparameters to find a good balance between bias and variance. Multiple boosting experiments are often needed, which makes the already slow, computationally expensive sequential training even more tedious.