Hunch.net has shifted to a new server, and WordPress has been updated to the latest version. If anyone notices difficulties associated with this, please comment. (Note that DNS updates can take a while, so the shift may not yet be complete.)
More generally, this is a good time to ask for suggestions. What would make this blog more useful?
What is the best regret transform reduction from multiclass to binary?
This post is about an open problem in learning reductions.
Background A reduction might transform a multiclass prediction problem, where there are k possible labels, into a binary learning problem where there are only 2 possible labels. On this induced binary problem we might learn a binary classifier with some error rate e. After subtracting the minimum possible (Bayes) error rate b, we get a regret r = e – b. The PECOC (Probabilistic Error Correcting Output Code) reduction has the property that binary regret r implies multiclass regret at most 4r^{0.5}.
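In symbols, this regret transform (just restating the prose above) is:

```latex
% PECOC regret transform, restated: binary regret r = e - b
% implies multiclass regret at most 4 * sqrt(r).
\[
  r \;=\; e - b,
  \qquad
  \mathrm{regret}_{\text{multiclass}} \;\le\; 4\, r^{1/2} .
\]
```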
The problem This is not the “rightest” answer. Consider the k=2 case, where we reduce binary to binary. There exists a reduction (the identity) with the property that regret r implies regret r. This is substantially superior to the transform given by the PECOC reduction (for small regrets r, 4r^{0.5} is much larger than r), which suggests that a better reduction may exist for general k. For example, we cannot rule out the possibility that a reduction R exists with a regret transform guaranteeing that binary regret r implies multiclass regret at most c(k)r, where c(k) is a k-dependent constant between 1 and 4.
Difficulty I believe this is a solvable problem, given some serious thought.
Impact The use of some reduction from multiclass to binary is common practice, so a good solution could be widely useful. One thing to be aware of is that there is a common and reasonable concern about the ‘naturalness’ of induced problems. There seems to be no way to address this concern other than via empirical testing. On the theoretical side, a better reduction may help us understand whether classification or l2 regression is the more natural primitive for reduction. The PECOC reduction essentially first turns a binary classifier into an l2 regressor and then uses the regressor repeatedly to make multiclass predictions.
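To make the shape of such reductions concrete, here is a minimal sketch of a code-matrix reduction with squared-loss (l2) decoding. This is illustrative only, not the actual PECOC construction; `LogisticRegression` stands in for “a binary learner whose outputs we treat as probability estimates,” and the function names are made up for the example.

```python
# Illustrative sketch: reduce a k-class problem to several induced binary
# problems via a random code matrix, estimate per-column probabilities with
# a binary learner, and decode with squared (l2) loss.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_code_reduction(X, y, k, n_columns=16, seed=0):
    rng = np.random.default_rng(seed)
    # code[c, j] = 1 if class c is on the "positive" side of binary problem j
    code = rng.integers(0, 2, size=(k, n_columns))
    learners, kept = [], []
    for j in range(n_columns):
        binary_labels = code[y, j]              # the induced binary problem
        if binary_labels.min() == binary_labels.max():
            continue                            # skip degenerate columns
        learners.append(LogisticRegression().fit(X, binary_labels))
        kept.append(j)
    return code[:, kept], learners

def predict_code_reduction(X, code, learners):
    # p[:, j] estimates the probability that code bit j of the true label is 1
    p = np.column_stack([clf.predict_proba(X)[:, 1] for clf in learners])
    # squared-loss decoding: predict the class whose code row is closest
    dists = ((p[:, None, :] - code[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)
```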
Some background material which may help:
- Dietterich and Bakiri introduce Error Correcting Output Codes.
- Guruswami and Sahai analyze ECOC as an error transform reduction. (see lemma 2)
- Allwein, Schapire, and Singer generalize ECOC to use loss-based decoding.
- Beygelzimer and Langford showed that ECOC is not a regret transform and proved the PECOC regret transform.
NIPS paper evaluation criteria
John Platt, who is PC-chair for NIPS 2006, has organized a NIPS paper evaluation criteria document with input from the program committee and others.
The document contains specific advice about what is appropriate for the various subareas within NIPS. It may be very helpful, because the standards of evaluation for papers vary significantly.
This is a bit of an experiment: the hope is that by carefully thinking about and stating what is important, authors can better understand whether and where their work fits.
Update: The general submission page and author instructions, including how to submit an appendix, are now available.
The value of the orthodox view of Boosting
The term “boosting” comes from the idea of using a meta-algorithm which takes “weak” learners (that may only be able to predict slightly better than random) and turns them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although there were theoretical boosting algorithms floating around since 1990 (see the bottom of this page).
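For concreteness, here is a minimal sketch of that weak-to-strong interface in Adaboost style: the meta-algorithm interacts with the base learner only by asking it to fit a reweighted sample. The depth-1 tree (a stump) is just one illustrative choice of weak learner, and the code is a sketch rather than a faithful reproduction of any particular implementation.

```python
# Minimal Adaboost-style sketch of the weak-to-strong view.
# Labels y are assumed to be in {-1, +1}.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=50):
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                  # distribution over examples
    hypotheses, alphas = [], []
    for _ in range(rounds):
        weak = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = weak.predict(X)
        eps = w[pred != y].sum()             # weighted error of the weak learner
        if eps >= 0.5:                       # no better than random: stop
            break
        eps = max(eps, 1e-12)                # avoid division by zero below
        alpha = 0.5 * np.log((1 - eps) / eps)
        w *= np.exp(-alpha * y * pred)       # upweight the mistakes
        w /= w.sum()
        hypotheses.append(weak)
        alphas.append(alpha)
    return hypotheses, alphas

def boosted_predict(X, hypotheses, alphas):
    votes = sum(a * h.predict(np.asarray(X)) for h, a in zip(hypotheses, alphas))
    return np.sign(votes)
```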
Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in the Annals of Statistics, including a response by Yoav Freund and Robert Schapire.
I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to fully capture the important properties of the original view.
Comparing with all other views of boosting is too clumsy, so I will pick one, “boosting is coordinate-wise gradient descent (CWGD for short) on an exponential loss function” (a view which started here), and compare it with Adaboost.
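For reference, a quick sketch of the calculation behind the CWGD view (standard, and only restating the exponential-loss interpretation): choosing the step size along a single weak hypothesis h to minimize the exponential loss recovers Adaboost's hypothesis weight.

```latex
% Exponential loss of the current combination F plus a step of size \alpha
% along "coordinate" h, with per-example weights w_i = e^{-y_i F(x_i)}.
\[
  L(F + \alpha h) \;=\; \sum_i e^{-y_i \left( F(x_i) + \alpha h(x_i) \right)}
  \;=\; \sum_i w_i \, e^{-\alpha\, y_i h(x_i)} .
\]
% Setting dL/d\alpha = 0, with \epsilon the w-weighted error of h, gives
\[
  \alpha \;=\; \tfrac{1}{2} \ln \frac{1-\epsilon}{\epsilon},
  \qquad
  \epsilon \;=\; \frac{\sum_{i:\, h(x_i) \ne y_i} w_i}{\sum_i w_i},
\]
% which is exactly Adaboost's weight for the hypothesis h.
```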
There are two advantages of the “weak to strong learning” view:
- Automatic computational speedups. In the “weak to strong learning” view, you automatically think about using a learning algorithm as a subroutine. As a consequence, we know the computation can be quite fast. In the CWGD view, using C4.5 (or some other algorithm) to pick the coordinate is an unmotivated decision. The straightforward thing to do is simply check each coordinate in turn, which yields no computational speedup.
- Meta-algorithm based performance gains. Using a nontrivial base learning algorithm seems to improve performance. This is unsurprising—simply consider the limit where only one round of boosting is done. This is not well-accounted for by the CWGD view.
The point here is not that the old view subsumes the CWGD view, but rather that the CWGD view does not account for all the value in the old view. In particular, the CWGD view suggests that a broader family of algorithms may be useful than the weak-to-strong view might suggest.
This might seem like a “too meta” discussion, but it is very relevant to the process of research. We as researchers in machine learning have a choice of many methods of thinking about developing algorithms. Some methods are harder to use than others, so it is useful to think about what works well. Gradient descent is a core algorithmic tool in machine learning. After making a sequence of more-or-less unmotivated steps, we can derive Adaboost (and other algorithms) as an application of gradient descent. Or, we can study the notion of boosting weak learning to achieve strong learning and derive Adaboost. My impression is that the “weak learning to achieve strong learning” view is significantly more difficult to master than gradient descent, but it is also a significantly more precise mechanism for deriving useful algorithms. There are many gradient descent algorithms which are not very useful in machine learning. Amongst other things, the “weak to strong” view significantly informed some of the early development of learning reductions. It is no coincidence that Adaboost can be understood in this framework.
Big machine learning
According to the New York Times, Yahoo is releasing Project Panama shortly. Project Panama is about better predicting which advertisements are relevant to a search, implying a higher click-through rate, implying larger income for Yahoo. There are two things that seem interesting here:
- A significant portion of that improved accuracy is almost certainly machine learning at work.
- The quantitative effect is huge—the estimate in the article is $600*10^6.
Google already has such improvements and Microsoft Search is surely working on them, which suggests this is (perhaps) a $10^9 per year machine learning problem.
The exact methodology in use is unlikely to be publicly discussed in the near future because of the competitive environment. Hopefully we’ll have some public “war stories” at some point in the future when this information becomes less sensitive. For now, it’s reassuring to simply note that machine learning is having a big impact.