The term “boosting” comes from the idea of a meta-algorithm which takes “weak” learners (that may predict only slightly better than random) and turns them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although theoretical boosting algorithms had been floating around since 1990 (see the bottom of this page).
Since then, many different interpretations of why boosting works have arisen. There is significant discussion of these different views in the Annals of Statistics, including a response by Yoav Freund and Robert Schapire.
I believe there is a great deal of value to be found in the original view of boosting (a meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to fully capture certain important properties of the original one.
Comparing with all other views of boosting is too clumsy, so I will pick one, “boosting is coordinate-wise gradient descent (CWGD for short) on an exponential loss function”, which started here, and compare it with Adaboost.
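To make the CWGD reading concrete, here is a brief sketch of the standard derivation (the notation is mine, not drawn from the paper linked above): coordinate-wise descent on the exponential loss, with an exact line search on the step size, recovers Adaboost's classifier weight.

```latex
% Sketch: Adaboost as coordinate-wise descent on the exponential loss.
% Labels y_i \in \{-1,+1\}; the ensemble after round t is F_t(x) = \sum_{s \le t} \alpha_s h_s(x).
\[
  L(F) \;=\; \sum_{i=1}^{n} \exp\bigl(-y_i F(x_i)\bigr).
\]
% Adding one "coordinate" h with step size \alpha, and writing
% w_i \propto \exp(-y_i F_{t-1}(x_i)) and \epsilon = \sum_{i:\, h(x_i) \ne y_i} w_i,
% the loss of F_{t-1} + \alpha h is proportional to
\[
  e^{-\alpha}(1-\epsilon) + e^{\alpha}\epsilon,
  \qquad \text{minimized at} \qquad
  \alpha = \tfrac{1}{2}\ln\frac{1-\epsilon}{\epsilon},
\]
% which is exactly the weight Adaboost assigns to h.
```

Choosing the coordinate h with the smallest weighted error is then the same as choosing the coordinate that most decreases the loss, so the two descriptions pick out the same algorithm.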
There are two advantages of the “weak to strong learning” view:
- Automatic computational speedups. In the “weak to strong learning” view, you automatically think about using a learning algorithm as a subroutine (see the sketch after this list). As a consequence, we know the computation can be quite fast. In the CWGD view, using C4.5 (or some other algorithm) to pick the coordinate is an unmotivated decision. The straightforward thing to do is simply check each coordinate in turn, which yields no computational speedup.
- Meta-algorithm based performance gains. Using a nontrivial base learning algorithm seems to improve performance. This is unsurprising: simply consider the limit where only one round of boosting is done. This is not well accounted for by the CWGD view.
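To make the subroutine point concrete, here is a minimal Adaboost sketch (my own illustration, not code from any of the papers above); the depth-1 scikit-learn tree is just a stand-in for whatever weak learner you have, since any weight-accepting learner could be dropped in.

```python
# Minimal Adaboost sketch: the meta-algorithm's only access to the data is a
# call to a weak learner that accepts example weights. The depth-1 tree here
# is a stand-in; any weight-accepting learner could be substituted.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=50):
    """y must be labeled +1/-1. Returns a list of (alpha, weak_hypothesis)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # example weights
    ensemble = []
    for _ in range(rounds):
        # Black-box weak learner call: the only data-dependent search step.
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()               # weighted error
        if eps <= 0.0 or eps >= 0.5:           # no usable predictive edge left
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * pred)         # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(alpha * h.predict(X) for alpha, h in ensemble))
```

Nothing in this loop enumerates coordinates; the weak learner is free to search its hypothesis space however cleverly it can, which is where the computational speedup comes from.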
The point here is not that the old view subsumes the CWGD view, but rather that the CWGD view does not account for all the value in the old view. In particular, the CWGD view does suggest a broader family of potentially useful algorithms than the weak-to-strong view does.
This might seem like a “too meta” discussion, but it is very relevant to the process of research. We as researchers in machine learning have a choice of many methods of thinking about developing algorithms. Some methods are harder to use than others, so it is useful to think about what works well. Gradient descent is a core algorithmic tool in machine learning. After making a sequence of more-or-less unmotivated steps, we can derive Adaboost (and other algorithms) as an application of gradient descent. Or, we can study the notion of boosting weak learning to achieve strong learning and derive Adaboost. My impression is that the “weak learning to achieve strong learning” view is significantly more difficult to master than gradient descent, but it is also a significantly more precise mechanism for deriving useful algorithms. There are many gradient descent algorithms which are not very useful in machine learning. Amongst other things, the “weak to strong” view significantly informed some of the early development of learning reductions. It is no coincidence that Adaboost can be understood in this framework.
I believe there is much value in the original view, but I’ll play devil’s advocate nevertheless.
I disagree with the procedural part. The gradient view directly points to a function call to get the best direction.
The gradient view is more illuminating because:
1) It doesn’t make assumptions about weak learnability: that is an unwarranted and unnecessary assumption.
2) It immediately suggests new approaches and new problems. I’ve found this (personally) very powerful.
3) The “boosting” view turns out in practice to not be very useful. Incremental gradient descent on other loss functions (logistic, for example) performs as well, and often better, in practice. In fact, when many people say they use “boosting”, they really use stagewise coordinate ascent on some other loss, or take a smaller step instead (GentleBoost, etc.). See the sketch after this list.
4) It immediately fixed the multi-class problem that stymied the classical view.
5) Adding additional regularization becomes easy.
6) It has provided much more insight into the relation between stagewise methods and sparsity/L_1 regularization. This connects to other linear methods as well as to neural-network-style methods.
7) The overfitting problem the method potentially has is immediately clarified, and the importance of effective (algorithmic) regularization comes to the fore.
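As an illustration of point 3, here is a minimal sketch of stagewise fitting of the same shallow trees under a logistic loss with a small step size, via scikit-learn's gradient boosting implementation. The dataset and parameter values are placeholders, and the loss name below is the one used by recent scikit-learn versions.

```python
# Stagewise (coordinate-wise) fitting of shallow trees under a logistic loss
# with a small step size ("shrinkage"), instead of Adaboost's exact
# exponential-loss line search. Dataset and parameters are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

model = GradientBoostingClassifier(
    loss="log_loss",       # logistic loss instead of exponential
    learning_rate=0.1,     # the "smaller step"
    n_estimators=200,
    max_depth=1,           # decision stumps as the base "coordinates"
)
print(cross_val_score(model, X, y, cv=5).mean())
```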
I would argue that the stagewise coordinate ascent view has been a much more powerful and fruitful view.
I would also cite as the earliest version:
http://scholar.google.com/scholar?q=author:"Mason"%20intitle:"Functional%20gradient%20techniques%20for%20combining%20hypotheses"%20&num=20&hl=en&lr=&safe=off&oi=scholarr
I think that if you want to argue for the original view, you should take it from something like “Boosting by Branching Programs” instead of Adaboost.
Some of these comments seem misplaced. The basic claim of this post is not that the weak learning view is superior in all ways (which the comments seem aimed at), but rather that the CWGD view is not superior in all ways.
1) It isn’t clear to me that weak learnability is unwarranted. At least in the initial steps, I think our learning algorithm can often successfully gain some predictive edge. In more practical terms, any theory can be warranted based upon its predictive power.
2) Consider two methods of designing algorithms, A and B. Method A is a precision instrument where every designed algorithm performs well. Method B suggests many more algorithms, some of which are useful and some of which are not. There may be a place for B when you can’t easily use A, but it seems like A is the preferred first try. The gradient descent view looks much more like ‘B’ to me, while the traditional view is more like ‘A’.
3) Adaboost was derived from the boosting view. Adaboosted decision trees were the best single performers in the evaluation paper pointed to. It looks like your experience is contrary to that of others.
I don’t understand (4). There are (and were) several methods to reduce multiclass to binary classification and then apply Adaboost. I have not seen (5) be important, but your experience may vary. (6) may be true. I don’t understand (7) that well.
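For concreteness on the multiclass point in (4), here is a minimal sketch of one such reduction (my illustration; the dataset and the scikit-learn classes are just convenient stand-ins): a one-versus-all reduction to binary problems, with Adaboost run on each.

```python
# One-vs-all reduction: train one binary Adaboost per class and predict with
# the highest-scoring class. scikit-learn classes are used as stand-ins.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# The boosting meta-algorithm itself only ever sees binary problems.
ova_boost = OneVsRestClassifier(AdaBoostClassifier(n_estimators=100))
print(cross_val_score(ova_boost, X, y, cv=5).mean())
```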