The term “boosting” comes from the idea of using a meta-algorithm which takes “weak” learners (that may be able to only barely predict slightly better than random) and turn them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although there were theoretical boosting algorithms floating around since 1990 (see the bottom of this page).
Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in the annals of statistics, including a response by Yoav Freund and Robert Schapire.
I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to really capture important properties.
Comparing with all other views of boosting is too clumsy, so I will pick one: “boosting coordinate-wise gradient descent (CWGD for short) on an exponential loss function” which started here and compare it with Adaboost.
There are two advantages of the “weak to strong learning” view:
- Automatic computational speedups. In the “weak to strong learning” view, you automatically think about using a learning algorithm as a subroutine. As a consequence, we know the computation can be quite fast. In the CWGD view, using C4.5 (or some other algorithm) to pick the coordinate is an unmotivated decision. The straightforward thing to do is simply check each coordinate in turn which yields no computational speedups.
- Meta-algorithm based performance gains. Using a nontrivial base learning algorithm seems to improve performance. This is unsurprising—simply consider the limit where only one round of boosting is done. This is not well-accounted for by the CWGD view.
The point here is not that the old view subsumes the CWGD view, but rather that the CWGD view does not account for all the value in the old view. In particular, the CWGD view suggests that a broader family of algorithms may be useful than the weak-to-strong view might suggest.
This might seem like a “too meta” discussion, but it is very relevant to the process of research. We as researchers in machine learning have a choice of many methods of thinking about developing algorithms. Some methods are harder to use than others, so it is useful to think about what works well. Gradient descent is a core algorithmic tool in machine learning. After making a sequence of more-or-less unmotivated steps, we can derive Adaboost (and other algorithms) as an application of gradient descent. Or, we can study the notion of boosting weak learning to achieve strong learning and derive Adaboost. My impression is that the “weak learning to achieve strong learning” view is significantly more difficult to master than gradient descent, but it is also a significantly more precise mechanism for deriving useful algorithms. There are many gradient descent algorithms which are not very useful in machine learning. Amongst other things, the “weak to strong” view significantly informed some of the early development of learning reductions. It is no coincidence that Adaboost can be understood in this framework.