June 2008 – Machine Learning (Theory)

Essentially everyone who writes research papers suffers rejections. They always sting immediately, but upon further reflection many of these rejections come to seem reasonable. Maybe the equations had too many typos or maybe the topic just isn’t as important as was originally thought. A few rejections do not come to seem acceptable, and these form the basis of reviewing horror stories, a great material for conversations. I’ve decided to share three of mine, now all safely a bit distant in the past.

Prediction Theory for Classification Tutorial. This is a tutorial about tight sample complexity bounds for classification that I submitted to JMLR. The first decision I heard was a reject which appeared quite unjust to me—for example one of the reviewers appeared to claim that all the content was in standard statistics books. Upon further inquiry, several citations were given, none of which actually covered the content. Later, I was shocked to hear the paper was accepted. Apparently, the paper accidentally went to two different action editors, who each chose distinct reviewers.
Cover Tree. This paper was the first one to give a datastructure for nearest neighbor search for an arbitrary metric which both (a) took logarithmic time under dimensionality constraint and (b) always required space competitive with brute force nearest neighbor search. Previous papers had done (a) or (b), but not both, and achieving both appears key to a practical algorithm, which we backed up with experiments and code.
The cover tree paper suffered a triple rejection, the last one of which seems particularly poor to me. We submitted the draft to SODA, and got back 3 reviews. The first was blank. The second was a paragraph of positive but otherwise uninformative text. The third was blank. The decision was reject. We were rather confused, so we emailed the program chair asking if the decision was right and if so whether there was any more information we could get. We got back only a form letter providing no further information. Since then, the paper was accepted at ICML.
Ranking Reduction. This paper shows that learning how to predict which of a pair of items is better strongly transfers to optimizing a ranking loss, in contrast to (for example) simply predicting a score and ordering according to predicted score.
We submitted this paper to NIPS and it had the highest average review of any learning theory paper. The decision was to reject. Based upon what we could make out from a statement by the program committee, the logic of this decision is mostly kindly describable as badly flawed—somehow they confused the algorithm, the problem, and the analysis into a mess. Later it was accepted at COLT. (A bit of disclosure: I was on the program committee at NIPS that year, although obviously not involved in the decision on this paper.)

In all cases where a rejection occurs, the default presumption is that the correct decision was made because most of the time a good (or at least reasonable) decision was made. Consequently, it seems important to point out that there are some objective signs each of the above cases involved poor decisions.

The tutorial paper is fairly widely cited (Google scholar places it 8th amongst my papers), and I continue to find it useful material for a lecture when teaching a class.
The cover tree is also fairly widely cited, and I know from various emails and download counts that it is used by several people. It also won an award from IBM. To this day, it seems odd that an algorithms paper was only publishable at a machine learning conference.
It’s really too soon to tell with the ranking paper, but it was one of the few COLT papers invited to a journal special issue, and there has since been substantial additional work by Mehryar Mohri and Nir Ailon which broadens the claim to other ranking metrics and makes it more computationally tractable.

One of the reasons you hear for why a paper was rejected and then accepted is that the paper improved in the meantime. That’s often true, but in each of the above cases I don’t believe there were any substantial changes between submissions (and for the tutorial it was a perfect accidental experiment).

Normally reviewing horror stories are the academic equivalent of warstories, but these ones have slightly more point. They have each informed my thinking about how reviewing should be done. Relating these stories might make this thinking a bit more understandable.

Reviewer Choice. The tutorial case brings home the impact of how reviewers are chosen. If a paper is to have 3 reviews, it seems like a good idea to choose the reviewers in diverse ways, rather than one way. For example, at a conference, one reviewer by bidding preference, one reviewer by area chair, and one reviewer by another area chair or the program chair’s choice might reduce variance.
Uniform Author feedback. The standard at NIPS was to have author feedback when the ranking paper was submitted. In effect, the standard was not followed for the ranking paper, and it’s easy to imagine this making a substantial difference given how badly flawed the basis of rejection was. It is also easy to imagine that author feedback might have made a difference in the tutorial rejection, as the reviewer was wrong (author feedback was not the standard then).
Decision Basis. It’s helpful to relate the basis of decision by the program committee, especially when it is not summarized in the reviews. The cover tree case was one of the things which led me to add summaries to some of the NIPS papers when I was on the program committee, and I am committed to doing the same for SODA papers I’m reviewing this year. Not having a summary saves the program committee the embarassment of accidentally admitting mistakes, but it is badly disrepectful of the authors and generally promotes misunderstanding.
Fast decisions are bad. It’s not possible to reliably make good decisions about technical matters quickly. I suspect that the time crunch of the NIPS program committee meeting was a contributing factor in the ranking paper case.

As anyone educated in machine learning or statistics understands, drawing 4 conclusions from 3 datapoints is problematic, so the above should be understood as suggestions subject to further evidence.

This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time.

The basic trick has to do with importance weighting for monte carlo integration. Consider the problem of finding:
N = E_{x ~ D} f(x)
given samples from D and knowledge of f.

Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q. In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula:
E_{x ~ Q} f(x) D(x)/Q(x)

A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions.
Nevertheless, there is one special value Q(x) = f(x) D(x) / N where the sample complexity turns out to be 1, which is typically substantially better than the sample complexity of the original problem.

This observation underlies the motivation for voluntary importance weighting algorithms. Even under pretty terrible approximations, the logic of “Q(x) is something like f(x) D(x)” often yields substantial improvements over sampling directly from D(x).

Month: June 2008

ICML has a comment system

Reviewing Horror Stories

The Minimum Sample Complexity of Importance Weighting