Big machine learning

According to the New York Times, Yahoo is releasing Project Panama shortly. Project Panama is about better predicting which advertisements are relevant to a search, implying a higher click-through rate and hence larger income for Yahoo. There are two things that seem interesting here:

  1. A significant portion of that improved accuracy is almost certainly machine learning at work.
  2. The quantitative effect is huge—the estimate in the article is $600*10^6.

Google already has such improvements and Microsoft Search is surely working on them, which suggests this is (perhaps) a $10^9 per year machine learning problem.
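
For intuition about the scale, a small CTR improvement multiplies across an enormous query volume. All numbers below are made up purely for illustration; the article's $600*10^6 figure is the real estimate.

```python
# Back-of-the-envelope arithmetic (all numbers hypothetical) for why a
# small CTR improvement is worth a lot at search-engine scale.
queries_per_year = 50e9          # hypothetical query volume
revenue_per_click = 0.25         # hypothetical average revenue per click ($)
ctr_before, ctr_after = 0.010, 0.015

gain = queries_per_year * revenue_per_click * (ctr_after - ctr_before)
print(f"${gain:,.0f} per year")  # $62,500,000 per year at these numbers
```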

The exact methodology in use is unlikely to be publicly discussed in the near future because of the competitive environment. Hopefully we’ll have some public “war stories” at some point in the future when this information becomes less sensitive. For now, it’s reassuring to simply note that machine learning is having a big impact.

An ICML reject

Hal, Daniel, and I have been working on the algorithm Searn for structured prediction. This was just conditionally accepted and then rejected from ICML, and we were quite surprised. By any reasonable criteria, it seems this is an interesting algorithm.

  1. Prediction performance. Searn performed better than every other algorithm we tested against, on all problems, using the same feature set. This is true even when using the numbers reported by the authors in their own papers.
  2. Theoretical underpinning. Searn is a reduction which comes with a reduction guarantee: good performance of the base classifier implies good performance for the overall system. As far as we know, no theorem of this type exists for any other structured prediction algorithm.
  3. Speed. Searn has no problem handling much larger datasets than other algorithms we tested against.
  4. Simplicity. Given code for a binary classifier and a problem-specific search algorithm, only a few tens of lines are necessary to implement Searn (see the sketch after this list).
  5. Generality. Searn applies in a superset of the situations where other algorithms apply. It can use (and cope with) arbitrary loss functions over the data. It can also solve new problems not previously thought of as learning problems (and we do so!).
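
To make point 4 concrete, here is a rough sketch of the Searn training loop in Python. The interfaces (`initial_state`, `step`, `rollout_loss`, the cost-sensitive trainer `train_csc`) are our own hypothetical placeholders, not the authors' code; consult the Searn paper for the real algorithm and its parameter choices.

```python
import random

def searn_train(examples, oracle_policy, train_csc, rollout_loss,
                actions, iterations=5, beta=0.1):
    """Illustrative Searn sketch; not the authors' implementation.

    examples      -- structured training examples
    oracle_policy -- the optimal policy, derived from the labels
    train_csc     -- trains a cost-sensitive classifier from (features, costs)
    rollout_loss  -- loss of completing `state` after `action` under `policy`
    actions       -- the actions available in a state
    """
    policy = oracle_policy  # start from the optimal (oracle) policy
    for _ in range(iterations):
        csc_examples = []
        for ex in examples:
            state = ex.initial_state()  # hypothetical interface
            while not state.is_final():
                # One cost-sensitive example per visited state: the cost of
                # each action is the loss of the completed output.
                costs = {a: rollout_loss(state, a, policy)
                         for a in actions(state)}
                csc_examples.append((state.features(), costs))
                state = state.step(policy(state))  # follow the current policy
        h = train_csc(csc_examples)
        # Stochastically mix the new classifier into the current policy.
        old = policy
        policy = lambda s, h=h, old=old: h(s) if random.random() < beta else old(s)
    return policy
```

Even in the sketch, the key design point is visible: all the structure-specific work lives in the search space and the rollout loss, while learning reduces to ordinary (cost-sensitive) classification.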

Much of the time, papers are about tradeoffs. A very typical (although often unstated) tradeoff is expending extra computation to gain better predictive performance in practice. In Searn, it seems there is no tradeoff compared to other approaches: you can have your lunch and eat it too.

This also settles a question: yes, any classifier can be effectively and efficiently applied to complex structured prediction problems via Searn.
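
Loosely, the reduction guarantee mentioned above has the following flavor (our paraphrase with constants suppressed, not the paper's exact statement):

```latex
% Flavor of the Searn reduction guarantee (paraphrase; constants suppressed).
% T is the number of search steps and \bar{\epsilon} the average
% cost-sensitive regret of the learned base classifiers.
\[
  \mathrm{loss}(\pi_{\mathrm{Searn}})
    \;\le\; \mathrm{loss}(\pi^{*}) + O\!\left(T \, \bar{\epsilon}\right)
\]
```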

Why reject?
We, as respectfully as possible, simply disagree with the SPC (senior program committee member) about the stated grounds for rejection.

  1. The paper is difficult to read. This is true to some extent, but it doesn’t seem a reasonable ground for rejection. Fundamentally, there is a lot going on in terms of new ways of thinking about the problem, and that simply requires some patience to read. One endorsement of this comes from a reviewer who said:

    In the end, I do think there is something there, but I think its introduction should have been like that for CRF. Present the idea, with an understanding of why it should work, with one or two easily replicable examples. Then over a few papers reapplying it to new settings, people would get it.

    It isn’t often that a reviewer complains that there is too much content so it should be split across multiple papers.

  2. The SPC stated:

    The results, though, which essentially show that good local classifiers imply good global performance, are not that significant, and hold for other approaches that use local classifiers as building blocks. After all, perfect local classifiers provide perfect local accuracy, and therefore provide perfect global accuracy, and again provide perfect global loss of any kind.

    Both sentences are simply false in the setting we consider. In particular, no other algorithm appears to have a good-local-performance-implies-good-global-performance guarantee for general global loss functions. Furthermore, it is not the case that perfect local performance implies perfect global performance, except (perhaps) in a noise-free world. Most of us believe that the problems we address typically contain fundamental ambiguities and noise (that was certainly our mathematical model). It’s easy to set up a (noisy) distribution over inputs and losses such that best-possible-up-to-the-noise-limit local predictors are globally suboptimal (a concrete instance appears after this list).

  3. The SPC wanted us to contrast with Michael Collins, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, EMNLP02. We instead contrasted with Michael Collins and Brian Roark, Incremental Parsing with the Perceptron Algorithm, ACL04. I believe any reasonable reading of these and the Searn paper will find that the ACL04 paper supersedes the EMNLP02 paper in relevance.
  4. The SPC wanted us to contrast with IBT in V. Punyakanok, D. Roth, W. Yih, and D. Zimak, Learning and Inference over Constrained Output. IJCAI 2005. Here we messed up and confused it with V. Punyakanok and D. Roth, The Use of Classifiers in Sequential Inference. NIPS 2001 (which we did compare with). IBT is a modification of the earlier algorithm which integrates global information (in the form of problem specific constraints) into the local training process yielding performance gains. This modification addresses an orthogonal issue: Searn has not been used in any experiments yet where this side information is available. Searn is made to cope with global loss functions rather than global constraints.
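
Here is a concrete instance of the kind of distribution mentioned in point 2 (the numbers are ours, purely illustrative):

```python
# A toy distribution where the Bayes-optimal *local* predictors are
# globally suboptimal under a sequence-level 0/1 loss.

# Conditional distribution over two-bit label sequences for a fixed input.
p = {(0, 0): 0.4, (1, 1): 0.3, (1, 0): 0.3}

# Locally optimal predictions: choose each bit by its marginal probability.
p_y1 = sum(pr for (y1, _), pr in p.items() if y1 == 1)  # P(y1 = 1) = 0.6
p_y2 = sum(pr for (_, y2), pr in p.items() if y2 == 1)  # P(y2 = 1) = 0.3
local = (int(p_y1 > 0.5), int(p_y2 > 0.5))              # -> (1, 0)

# Globally optimal prediction under 0/1 loss: the most probable sequence.
best = max(p, key=p.get)                                # -> (0, 0)

print(local, p.get(local, 0.0))  # (1, 0): right with probability 0.3
print(best, p[best])             # (0, 0): right with probability 0.4
```

The per-position Bayes-optimal predictors choose (1, 0), which is correct with probability 0.3, while (0, 0) is correct with probability 0.4, so the best-possible local predictors are suboptimal for the global loss.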

These reasons for rejection seem (1) weak, (2) false, (3) very weak, and (4) weak.

Why post this? There are several reasons.

  1. The biggest personal reason is that we want to build on Searn. For whatever reason, it is psychologically difficult to build on rejected work. Posting it here gives us some sense that it is “out there” which may make it easier to build on it. We are trying to do research and simply don’t want to be delayed for 6 months or a year.
  2. One of the purposes of publishing a paper is to acquire feedback about the research, and there is some opportunity here.
  3. This is a nice algorithm for which we can easily imagine immediate use and general interest. Hal will be releasing the source code shortly which should make it very easy to use.
  4. Communal mental hygiene. It should not be the case, under any circumstances, that an (S)PC member or reviewer makes false statements. If you think something is true but aren’t sure, it is appropriate to say “I think …” rather than simply asserting it as a fact. Asserting a non-fact as fact is injurious to the process of research because it distorts the understanding of other people.

We aren’t looking for a flamefest. If there are no comments or simply comments about Searn, that’s fine. We aren’t looking for a reversal of the decision. A final decision has to be made for any conference and for better or worse this is it. The PC chairs should not be pestered—they are very busy people. Running ICML is a very difficult task, and we understand that some mistakes are inevitable when there are hundreds of papers involved.

This post violates a standard of this blog: avoid talking about specific papers the poster has been working on. To remedy this, I will consider posts on other rejected papers. Any such post must make a very strong case.

A conversation between Theo and Pat

Pat (the practitioner) I need to do multiclass classification and I only have a decision tree.

Theo (the theoretician) Use an error correcting output code.

Pat Oh, that’s cool. But the created binary problems seem unintuitive. I’m not sure the decision tree can solve them.

Theo Oh? Is your problem a decision list?

Pat No, I don’t think so.

Theo Hmm. Are the classes well separated by axis aligned splits?

Pat Err, maybe. I’m not sure.

Theo Well, if they are, under the IID assumption I can tell you how many samples you need.

Pat IID? The data is definitely not IID.

Theo Oh dear.

Pat Can we get back to the choice of ECOC? I suspect we need to build it dynamically in response to which subsets of the labels are empirically separable from each other.

Theo Ok. What do you know about your problem?

Pat Not much. My friend just gave me the dataset.

Theo Then, no one can help you.

Pat (What a fuzzy thinker. Theo keeps jumping to assumptions that just aren’t true.)

Theo (What a fuzzy thinker. Pat’s problem is unsolvable without making extra assumptions.)

I’ve heard variants of this conversation several times. The fundamental difference in viewpoint is the following:

  1. Theo lives in a world where he chooses the problem to solve based upon the learning model (and assumptions) used.
  2. Pat lives in a world where the problem is imposed on him.

I’d love for these confusions to go away, but there is no magic wand. The best advice seems to be: listen carefully and avoid assuming too much in what you hear.
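
For concreteness, here is a minimal sketch of the ECOC construction Theo suggests, using a decision tree as the binary learner. It assumes scikit-learn-style interfaces and a user-supplied coding matrix; it is an illustration, not a tuned implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ecoc_fit(X, y, code):
    """y: integer labels 0..k-1 (NumPy array); code: k x b binary matrix."""
    learners = []
    for j in range(code.shape[1]):
        yj = code[y, j]  # relabel every class by bit j of its codeword
        learners.append(DecisionTreeClassifier().fit(X, yj))
    return learners

def ecoc_predict(X, code, learners):
    bits = np.column_stack([h.predict(X) for h in learners])
    # Decode to the class whose codeword is nearest in Hamming distance.
    dists = (bits[:, None, :] != code[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)

# Example 4-class, 6-bit code; each column defines one binary subproblem.
code = np.array([[0, 0, 0, 1, 1, 1],
                 [0, 1, 1, 0, 0, 1],
                 [1, 0, 1, 0, 1, 0],
                 [1, 1, 0, 1, 0, 0]])
```

Pat's worry is visible in ecoc_fit: each column asks the tree to separate an arbitrary subset of classes from the rest, which may not suit axis-aligned splits. Building the code dynamically, as Pat suggests, would mean choosing columns that are empirically separable.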

John Langford -> Yahoo Research, NY

I will join Yahoo Research (in New York) after my contract ends at TTI-Chicago.

The deciding reasons are:

  1. Yahoo is running into many hard learning problems. This is precisely the situation where basic research might hope to have the greatest impact.
  2. Yahoo Research understands research, including publishing, conferences, etc…
  3. Yahoo Research is growing, so there is a chance I can help it grow well.
  4. Yahoo understands the internet, including (but not at all limited to) experimenting with research blogs.

In the end, Yahoo Research seems like the place where I might have a chance to make the greatest difference.

Yahoo (as a company) has made a strong bet on Yahoo Research. We-the-researchers all hope that bet will pay off, and this seems plausible. I’ll certainly have fun trying.

Conferences, Workshops, and Tutorials

This is a reminder that many deadlines for summer conference registration are coming up, and attendance is a very good idea.

  1. It’s entirely reasonable for anyone to visit a conference once, even without a paper there. For students, visiting a conference is almost a ‘must’: nowhere else is such a broad cross-section of research on display.
  2. Workshops are also a very good idea. ICML has 11, KDD has 9, and AAAI has 19. Workshops provide an opportunity to get a good understanding of some current area of research. They are probably the forum most conducive to starting new lines of research because they are so interactive.
  3. Tutorials are a good way to gain some understanding of a long-standing direction of research. They are generally more coherent than workshops. ICML has 7 and AAAI has 15.