Hal, Daniel, and I have been working on the algorithm Searn for structured prediction. This was just conditionally accepted and then rejected from ICML, and we were quite surprised. By any reasonable criteria, it seems this is an interesting algorithm.
- Prediction Performance: Searn performed better than any other algorithm on all the problems we tested against using the same feature set. This is true even using the numbers reported by authors in their papers.
- Theoretical underpinning. Searn is a reduction which comes with a reduction guarantee: the good performance on a base classifiers implies good performance for the overall system. No other theorem of this type has been made for other structured prediction algorithms, as far as we know.
- Speed. Searn has no problem handling much larger datasets than other algorithms we tested against.
- Simplicity. Given code for a binary classifier and a problem-specific search algorithm, only a few tens of lines are necessary to implement Searn.
- Generality. Searn applies in a superset of the situations where other algorithms apply. It can use (and cope with) arbitrary loss functions over the data. It can also solve new problems not previously thought of as learning problems (and we do so!)
Much of the time, papers are about tradeoffs. A very typical (although often unstated) tradeoff is expending extra computation to gain better predictive performance in practice. In Searn, it seems there is no tradeoff compared to other approaches: you can have your lunch and eat it too.
In addition, this also solves a problem: yes, any classifier can be effectively and efficiently applied on complex structured prediction problems via Searn.
Why reject?
We, as respectfully as possible, simply disagree with the SPC about the stated grounds for rejection.
- The paper is difficult to read. This is true to some extent, but doesn’t seem reasonable. Fundamentally, there is a lot going on in terms of new ways of thinking about the problem and that simply requires some patience to read. One endorsement of this comes from a reviewer who said
In the end, I do think there is something there, but I think its introduction should have been like that for CRF. Present the idea, with an understanding of why it should work, with one or two easily replicable examples. Then over a few papers reapplying it to new settings, people would get it.
It isn’t often that a reviewer complains that there is too much content so it should be split across multiple papers.
- The SPC stated:
The results, though, which essentially show that good local classifiers imply good global performance, are not that significant, and hold for other approaches that use local classifiers as building blocks. After all, perfect local classifiers provide perfect local accuracy, and therefore provide perfect global accuracy, and again provide perfect global loss of any kind.
Both sentences are simply false in the setting we consider. In particular, no other algorithms appear to have a good local performance to global performance guarantee for general global loss functions. Furthermore, it is not the case that perfect local performance implies perfect global performance except (perhaps) in a noise free world. Most of us believe that the problems we address typically contain fundamental ambiguities and noise (that was certainly our mathematical model). It’s easy to setup a (noisy) distribution over inputs+loss such that best-possible-up-to-the-noise-limit local predictors are globally suboptimal.
- The SPC wanted us to contrast with Michael Collins, Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, EMNLP02. We instead contrasted with Michael Collins and Brian Roark, Incremental Parsing with the Perceptron Algorithm, ACL04. I believe any reasonable reading of these and the Searn paper will find the ACL04 paper superceeds the EMNLP02 paper in relevance.
- The SPC wanted us to contrast with IBT in V. Punyakanok, D. Roth, W. Yih, and D. Zimak, Learning and Inference over Constrained Output. IJCAI 2005. Here we messed up and confused it with V. Punyakanok and D. Roth, The Use of Classifiers in Sequential Inference. NIPS 2001 (which we did compare with). IBT is a modification of the earlier algorithm which integrates global information (in the form of problem specific constraints) into the local training process yielding performance gains. This modification addresses an orthogonal issue: Searn has not been used in any experiments yet where this side information is available. Searn is made to cope with global loss functions rather than global constraints.
These reasons for rejection seem (1) weak, (2) false, (3) very weak, and (4) weak.
Why post this? There are several reasons.
- The biggest personal reason is that we want to build on Searn. For whatever reason, it is psychologically difficult to build on rejected work. Posting it here gives us some sense that it is “out there” which may make it easier to build on it. We are trying to do research and simply don’t want to be delayed for 6 months or a year.
- One of the purposes of publishing a paper is to acquire feedback about the research, and there is some opportunity here.
- This is a nice algorithm for which we can easily imagine immediate use and general interest. Hal will be releasing the source code shortly which should make it very easy to use.
- Communal mental hygiene. It should not be the case, under any circumstances, that an (S)PC/reviewer makes false statements. If you think something is true but aren’t sure, it is appropriate to say “I think …” rather than simply asserting it as a fact. Asserting nonfact as fact is injurious to the process of research because it distorts the understanding of other people.
We aren’t looking for a flamefest. If there are no comments or simply comments about Searn, that’s fine. We aren’t looking for a reversal of the decision. A final decision has to be made for any conference and for better or worse this is it. The PC chairs should not be pestered—they are very busy people. Running ICML is a very difficult task, and we understand that some mistakes are inevitable when there are hundreds of papers involved.
This post violates a standard: avoiding talking about specific papers the poster has been working on. To remedy this, I will consider posts on other rejected papers. Any such post must have a very strong case.