The value of the orthodox view of Boosting

The term “boosting” comes from the idea of a meta-algorithm which takes “weak” learners (which may predict only slightly better than random) and turns them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although theoretical boosting algorithms had been floating around since 1990 (see the bottom of this page).
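
To make the meta-algorithm view concrete, here is a minimal Adaboost sketch, assuming binary labels in {-1,+1} and decision stumps (scikit-learn's DecisionTreeClassifier with max_depth=1) as the weak learner; the weak learner choice and round count are illustrative, not canonical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=50):
    """Minimal Adaboost sketch: y in {-1,+1}; weak learner = decision stump."""
    y = np.asarray(y)
    n = len(y)
    w = np.ones(n) / n                      # example weights, initially uniform
    learners, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)    # the weak learner sees weighted data
        pred = stump.predict(X)
        err = np.clip(w @ (pred != y), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)      # upweight the examples it got wrong
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    def predict(Xnew):                      # strong learner: a weighted vote
        return np.sign(sum(a * h.predict(Xnew) for a, h in zip(alphas, learners)))
    return predict
```

The only thing the meta-algorithm asks of the subroutine is slightly-better-than-random weighted accuracy, which is exactly the weak learning assumption.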

Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in The Annals of Statistics, including a response by Yoav Freund and Robert Schapire.

I believe there is a great deal of value to be found in the original view of boosting (a meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to fully capture some important properties of the original.

Comparing with all other views of boosting would be too clumsy, so I will pick one, “boosting is coordinate-wise gradient descent (CWGD for short) on an exponential loss function” (a view which started here), and compare it with Adaboost.

There are two advantages of the “weak to strong learning” view:

  1. Automatic computational speedups. In the “weak to strong learning” view, you automatically think about using a learning algorithm as a subroutine. As a consequence, we know the computation can be quite fast. In the CWGD view, using C4.5 (or some other algorithm) to pick the coordinate is an unmotivated decision. The straightforward thing to do is simply to check each coordinate in turn, which yields no computational speedup (see the sketch after this list).
  2. Meta-algorithm based performance gains. Using a nontrivial base learning algorithm seems to improve performance. This is unsurprising—simply consider the limit where only one round of boosting is done. This is not well-accounted for by the CWGD view.
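
For contrast, here is a sketch of the CWGD view under a hypothetical setup where every weak hypothesis has been precomputed as a ±1 column of a matrix H. Picking the best coordinate means scanning every column on every round, an O(n*d) cost with no obvious shortcut:

```python
import numpy as np

def cwgd_boost(H, y, rounds=50):
    """Coordinate-wise gradient descent on exponential loss (sketch).
    H: n x d matrix; column j holds weak hypothesis j's predictions in {-1,+1}.
    y: length-n labels in {-1,+1}."""
    y = np.asarray(y)
    n, d = H.shape
    f = np.zeros(n)                  # current ensemble scores on the data
    coefs = np.zeros(d)
    for _ in range(rounds):
        w = np.exp(-y * f)           # exponential-loss example weights
        # Weighted error of every coordinate: the O(n*d)-per-round scan.
        errs = (w * (H.T != y)).sum(axis=1) / w.sum()
        j = np.argmin(errs)
        err = np.clip(errs[j], 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # exact line search for this loss
        f += alpha * H[:, j]
        coefs[j] += alpha
    return coefs
```

Replacing the argmin scan with a call to a base learner recovers Adaboost's efficiency, but from inside the CWGD view that replacement is an unmotivated trick rather than the natural design.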

The point here is not that the old view subsumes the CWGD view, but rather that the CWGD view does not account for all the value in the old view. At the same time, the CWGD view suggests a broader family of potentially useful algorithms than the weak-to-strong view does.

This might seem like a “too meta” discussion, but it is very relevant to the process of research. We as researchers in machine learning have a choice of many methods of thinking about developing algorithms. Some methods are harder to use than others, so it is useful to think about what works well. Gradient descent is a core algorithmic tool in machine learning. After making a sequence of more-or-less unmotivated steps, we can derive Adaboost (and other algorithms) as an application of gradient descent. Or, we can study the notion of boosting weak learning to achieve strong learning and derive Adaboost. My impression is that the “weak learning to achieve strong learning” view is significantly more difficult to master than gradient descent, but it is also a significantly more precise mechanism for deriving useful algorithms. There are many gradient descent algorithms which are not very useful in machine learning. Amongst other things, the “weak to strong” view significantly informed some of the early development of learning reductions. It is no coincidence that Adaboost can be understood in this framework.

Big machine learning

According to the New York Times, Yahoo is releasing Project Panama shortly. Project Panama is about better predicting which advertisements are relevant to a search, implying a higher click-through rate and hence larger income for Yahoo. There are two things that seem interesting here:

  1. A significant portion of that improved accuracy is almost certainly machine learning at work.
  2. The quantitative effect is huge—the estimate in the article is $600*10^6.

Google already has such improvements and Microsoft Search is surely working on them, which suggests this is (perhaps) a $10^9 per year machine learning problem.

The exact methodology in use is unlikely to be publicly discussed in the near future because of the competitive environment. Hopefully we’ll have some public “war stories” at some point in the future when this information becomes less sensitive. For now, it’s reassuring to simply note that machine learning is having a big impact.

An ICML reject

Hal, Daniel, and I have been working on the algorithm Searn for structured prediction. This was just conditionally accepted and then rejected from ICML, and we were quite surprised. By any reasonable criteria, it seems this is an interesting algorithm.

  1. Prediction Performance: Searn performed better than any other algorithm on all the problems we tested against using the same feature set. This is true even using the numbers reported by authors in their papers.
  2. Theoretical underpinning. Searn is a reduction which comes with a reduction guarantee: good performance of the base classifiers implies good performance for the overall system. No theorem of this type exists for other structured prediction algorithms, as far as we know.
  3. Speed. Searn has no problem handling much larger datasets than other algorithms we tested against.
  4. Simplicity. Given code for a binary classifier and a problem-specific search algorithm, only a few tens of lines are necessary to implement Searn (a rough sketch of the loop appears after this list).
  5. Generality. Searn applies in a superset of the situations where other algorithms apply. It can use (and cope with) arbitrary loss functions over the data. It can also solve new problems not previously thought of as learning problems (and we do so!).
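
What follows is only my rough, heavily hedged sketch of the shape of the Searn loop, not the paper's exact algorithm; every argument (initial_state, actions, step, features, rollout_loss, train_csc, oracle) is a hypothetical problem-specific hook, and the interpolation details are simplified:

```python
import random

def searn_train(examples, initial_state, actions, step, features,
                rollout_loss, train_csc, oracle, beta=0.1, iterations=5):
    """Sketch only (see the Searn paper for the real algorithm and analysis).
    examples: (x, y) pairs; oracle: the loss-optimal starting policy, which is
    available on training data; train_csc: trains a cost-sensitive classifier
    from (feature vector, per-action costs) pairs."""
    policy = oracle
    for _ in range(iterations):
        data = []
        for x, y in examples:
            s = initial_state(x)
            while actions(s):              # until the search hits a final state
                # Cost of each action: take it, then follow the current policy
                # to completion and measure the global loss.
                costs = [rollout_loss(policy, s, a, y) for a in actions(s)]
                data.append((features(s), costs))
                s = step(s, policy(s))     # advance along the current policy
        h = train_csc(data)
        old = policy                       # stochastic interpolation of policies
        policy = lambda s, old=old, h=h: h(s) if random.random() < beta else old(s)
    return policy
```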

Much of the time, papers are about tradeoffs. A very typical (although often unstated) tradeoff is expending extra computation to gain better predictive performance in practice. In Searn, it seems there is no tradeoff compared to other approaches: you can have your lunch and eat it too.

In addition, this settles a question: yes, any classifier can be effectively and efficiently applied to complex structured prediction problems via Searn.

Why reject?
We, as respectfully as possible, simply disagree with the SPC about the stated grounds for rejection.

  1. The paper is difficult to read. This is true to some extent, but it doesn’t seem a reasonable ground for rejection. Fundamentally, there is a lot going on in terms of new ways of thinking about the problem, and that simply requires some patience to read. One endorsement of this comes from a reviewer who said

    In the end, I do think there is something there, but I think its introduction should have been like that for CRF. Present the idea, with an understanding of why it should work, with one or two easily replicable examples. Then over a few papers reapplying it to new settings, people would get it.

    It isn’t often that a reviewer complains that there is too much content so it should be split across multiple papers.

  2. The SPC stated:

    The results, though, which essentially show that good local classifiers imply good global performance, are not that significant, and hold for other approaches that use local classifiers as building blocks. After all, perfect local classifiers provide perfect local accuracy, and therefore provide perfect global accuracy, and again provide perfect global loss of any kind.

    Both sentences are simply false in the setting we consider. In particular, no other algorithm appears to have a good-local-performance-to-global-performance guarantee for general global loss functions. Furthermore, it is not the case that perfect local performance implies perfect global performance, except (perhaps) in a noise-free world. Most of us believe that the problems we address typically contain fundamental ambiguities and noise (that was certainly our mathematical model). It’s easy to set up a (noisy) distribution over inputs+loss such that best-possible-up-to-the-noise-limit local predictors are globally suboptimal (a tiny counterexample appears after this list).

  3. The SPC wanted us to contrast with Michael Collins, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, EMNLP02. We instead contrasted with Michael Collins and Brian Roark, Incremental Parsing with the Perceptron Algorithm, ACL04. I believe any reasonable reading of these and the Searn paper will find the ACL04 paper supersedes the EMNLP02 paper in relevance.
  4. The SPC wanted us to contrast with IBT in V. Punyakanok, D. Roth, W. Yih, and D. Zimak, Learning and Inference over Constrained Output. IJCAI 2005. Here we messed up and confused it with V. Punyakanok and D. Roth, The Use of Classifiers in Sequential Inference. NIPS 2001 (which we did compare with). IBT is a modification of the earlier algorithm which integrates global information (in the form of problem specific constraints) into the local training process yielding performance gains. This modification addresses an orthogonal issue: Searn has not been used in any experiments yet where this side information is available. Searn is made to cope with global loss functions rather than global constraints.

These reasons for rejection seem (1) weak, (2) false, (3) very weak, and (4) weak.
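
To make the response to (2) concrete, here is a tiny sketch (my own illustrative numbers, not anything from the paper) of a noisy two-bit problem where the best-possible local predictors are globally suboptimal under whole-sequence 0/1 loss:

```python
from itertools import product

# A noisy joint distribution over a two-bit label y = (y1, y2).
P = {(0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.2, (0, 0): 0.0}

# The best *local* predictor for each bit predicts its marginal majority.
p_y1 = sum(p for (a, b), p in P.items() if a == 1)   # P(y1 = 1) = 0.6
p_y2 = sum(p for (a, b), p in P.items() if b == 1)   # P(y2 = 1) = 0.6
local = (int(p_y1 > 0.5), int(p_y2 > 0.5))           # -> (1, 1)

# Global 0/1 loss: a prediction only scores when the whole sequence matches.
acc = lambda pred: P.get(pred, 0.0)
print(acc(local))                                      # 0.2
print(max(acc(g) for g in product([0, 1], repeat=2)))  # 0.4, from (0,1) or (1,0)
```

Both local predictors are optimal up to the noise limit, yet the jointly predicted sequence (1,1) is right only 20% of the time while (0,1) achieves 40%.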

Why post this? There are several reasons.

  1. The biggest personal reason is that we want to build on Searn. For whatever reason, it is psychologically difficult to build on rejected work. Posting it here gives us some sense that it is “out there” which may make it easier to build on it. We are trying to do research and simply don’t want to be delayed for 6 months or a year.
  2. One of the purposes of publishing a paper is to acquire feedback about the research, and there is some opportunity here.
  3. This is a nice algorithm for which we can easily imagine immediate use and general interest. Hal will be releasing the source code shortly which should make it very easy to use.
  4. Communal mental hygiene. It should not be the case, under any circumstances, that an (S)PC/reviewer makes false statements. If you think something is true but aren’t sure, it is appropriate to say “I think …” rather than simply asserting it as a fact. Asserting nonfact as fact is injurious to the process of research because it distorts the understanding of other people.

We aren’t looking for a flamefest. If there are no comments or simply comments about Searn, that’s fine. We aren’t looking for a reversal of the decision. A final decision has to be made for any conference and for better or worse this is it. The PC chairs should not be pestered—they are very busy people. Running ICML is a very difficult task, and we understand that some mistakes are inevitable when there are hundreds of papers involved.

This post violates a standard: avoiding talking about specific papers the poster has been working on. To remedy this, I will consider posts on other rejected papers. Any such post must have a very strong case.

A conversation between Theo and Pat

Pat (the practitioner) I need to do multiclass classification and I only have a decision tree.

Theo (the theoretician) Use an error correcting output code.

Pat Oh, that’s cool. But the created binary problems seem unintuitive. I’m not sure the decision tree can solve them.

Theo Oh? Is your problem a decision list?

Pat No, I don’t think so.

Theo Hmm. Are the classes well separated by axis aligned splits?

Pat Err, maybe. I’m not sure.

Theo Well, if they are, under the IID assumption I can tell you how many samples you need.

Pat IID? The data is definitely not IID.

Theo Oh dear.

Pat Can we get back to the choice of ECOC? I suspect we need to build it dynamically in response to which subsets of the labels are empirically separable from each other.

Theo Ok. What do you know about your problem?

Pat Not much. My friend just gave me the dataset.

Theo Then, no one can help you.

Pat (What a fuzzy thinker. Theo keeps jumping to assumptions that just aren’t true.)

Theo (What a fuzzy thinker. Pat’s problem is unsolvable without making extra assumptions.)

I’ve heard variants of this conversation several times. The fundamental difference in viewpoint is the following:

  1. Theo lives in a world where he chooses the problem to solve based upon the learning model (and assumptions) used.
  2. Pat lives in a world where the problem is imposed on him.

I’d love for these confusions to go away, but there is no magic wand. The best advice seems to be: listen carefully and avoid assuming too much in what you hear.
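
For concreteness, here is a minimal sketch of the ECOC reduction Theo recommends, assuming scikit-learn's DecisionTreeClassifier as the binary learner and a random code matrix (both choices are mine, for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ecoc_train(X, y, k, bits=15, seed=0):
    """y: integer class labels in 0..k-1. Trains one tree per code column."""
    rng = np.random.RandomState(seed)
    code = (rng.rand(k, bits) > 0.5).astype(int)  # one codeword per class
    trees = []
    for j in range(bits):
        t = DecisionTreeClassifier()
        t.fit(X, code[np.asarray(y), j])          # binary problem j
        trees.append(t)
    return trees, code

def ecoc_predict(X, trees, code):
    """Decode to the class whose codeword is nearest in Hamming distance."""
    b = np.column_stack([t.predict(X) for t in trees])        # n x bits
    dists = np.abs(b[:, None, :] - code[None, :, :]).sum(-1)  # n x k
    return dists.argmin(axis=1)
```

Pat’s complaint maps onto the random columns: each one defines a binary problem (an arbitrary subset of classes vs. the rest) that an axis-aligned learner may find unnatural, which is what motivates his suggestion to build the code dynamically.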

John Langford -> Yahoo Research, NY

I will join Yahoo Research (in New York) after my contract ends at TTI-Chicago.

The deciding reasons are:

  1. Yahoo is running into many hard learning problems. This is precisely the situation where basic research might hope to have the greatest impact.
  2. Yahoo Research understands research, including publishing, conferences, etc.
  3. Yahoo Research is growing, so there is a chance I can help it grow well.
  4. Yahoo understands the internet, including (but not at all limited to) experimenting with research blogs.

In the end, Yahoo Research seems like the place where I might have a chance to make the greatest difference.

Yahoo (as a company) has made a strong bet on Yahoo Research. We-the-researchers all hope that bet will pay off, and this seems plausible. I’ll certainly have fun trying.