Reviewing Horror Stories

Essentially everyone who writes research papers suffers rejections. They always sting immediately, but upon further reflection many of these rejections come to seem reasonable. Maybe the equations had too many typos or maybe the topic just isn’t as important as was originally thought. A few rejections never come to seem acceptable, and these form the basis of reviewing horror stories, which make great material for conversation. I’ve decided to share three of mine, now all safely a bit distant in the past.

  1. Prediction Theory for Classification Tutorial. This is a tutorial about tight sample complexity bounds for classification that I submitted to JMLR. The first decision I heard was a reject, which appeared quite unjust to me—for example, one of the reviewers appeared to claim that all the content was in standard statistics books. Upon further inquiry, several citations were given, none of which actually covered the content. Later, I was shocked to hear the paper was accepted. Apparently, the paper had accidentally gone to two different action editors, who each chose distinct reviewers.
  2. Cover Tree. This paper was the first to give a data structure for nearest neighbor search in an arbitrary metric which both (a) took logarithmic time under a dimensionality constraint and (b) always required space competitive with brute force nearest neighbor search. Previous papers had achieved (a) or (b), but not both, and achieving both appears key to a practical algorithm, which we backed up with experiments and code.

    The cover tree paper suffered a triple rejection, the last one of which seems particularly poor to me. We submitted the draft to SODA, and got back 3 reviews. The first was blank. The second was a paragraph of positive but otherwise uninformative text. The third was blank. The decision was reject. We were rather confused, so we emailed the program chair asking if the decision was right and if so whether there was any more information we could get. We got back only a form letter providing no further information. Since then, the paper was accepted at ICML.

  3. Ranking Reduction. This paper shows that learning how to predict which of a pair of items is better strongly transfers to optimizing a ranking loss, in contrast to (for example) simply predicting a score and ordering according to predicted score.

    We submitted this paper to NIPS and it had the highest average review of any learning theory paper. The decision was to reject. Based upon what we could make out from a statement by the program committee, the logic of this decision is most kindly describable as badly flawed—somehow they confused the algorithm, the problem, and the analysis into a mess. Later it was accepted at COLT. (A bit of disclosure: I was on the program committee at NIPS that year, although obviously not involved in the decision on this paper.)

In all cases where a rejection occurs, the default presumption is that the correct decision was made, because most of the time a good (or at least reasonable) decision is made. Consequently, it seems important to point out that there are objective signs that each of the above cases involved a poor decision.

  1. The tutorial paper is fairly widely cited (Google scholar places it 8th amongst my papers), and I continue to find it useful material for a lecture when teaching a class.
  2. The cover tree is also fairly widely cited, and I know from various emails and download counts that it is used by several people. It also won an award from IBM. To this day, it seems odd that an algorithms paper was only publishable at a machine learning conference.
  3. It’s really too soon to tell with the ranking paper, but it was one of the few COLT papers invited to a journal special issue, and there has since been substantial additional work by Mehryar Mohri and Nir Ailon which broadens the claim to other ranking metrics and makes it more computationally tractable.

One of the reasons you hear for why a paper was rejected and then accepted is that the paper improved in the meantime. That’s often true, but in each of the above cases I don’t believe there were any substantial changes between submissions (and for the tutorial it was a perfect accidental experiment).

Normally, reviewing horror stories are the academic equivalent of war stories, but these have slightly more of a point: each has informed my thinking about how reviewing should be done. Relating these stories might make this thinking a bit more understandable.

  1. Reviewer Choice. The tutorial case brings home the impact of how reviewers are chosen. If a paper is to have 3 reviews, it seems like a good idea to choose the reviewers in diverse ways, rather than one way. For example, at a conference, one reviewer by bidding preference, one reviewer by area chair, and one reviewer by another area chair or the program chair’s choice might reduce variance.
  2. Uniform Author Feedback. When the ranking paper was submitted, the standard at NIPS was to have author feedback. In effect, the standard was not followed for the ranking paper, and it’s easy to imagine this making a substantial difference given how badly flawed the basis of rejection was. It is also easy to imagine that author feedback might have made a difference in the tutorial rejection, as the reviewer was wrong (author feedback was not the standard then).
  3. Decision Basis. It’s helpful to relate the basis of decision by the program committee, especially when it is not summarized in the reviews. The cover tree case was one of the things which led me to add summaries to some of the NIPS papers when I was on the program committee, and I am committed to doing the same for SODA papers I’m reviewing this year. Not having a summary saves the program committee the embarrassment of accidentally admitting mistakes, but it is badly disrespectful of the authors and generally promotes misunderstanding.
  4. Fast decisions are bad. It’s not possible to reliably make good decisions about technical matters quickly. I suspect that the time crunch of the NIPS program committee meeting was a contributing factor in the ranking paper case.

As anyone educated in machine learning or statistics understands, drawing 4 conclusions from 3 datapoints is problematic, so the above should be understood as suggestions subject to further evidence.

The Minimum Sample Complexity of Importance Weighting

This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time.

The basic trick has to do with importance weighting for Monte Carlo integration. Consider the problem of finding:
$$N = E_{x \sim D} f(x)$$
given samples from D and knowledge of f.

Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q. In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula:
$$N = E_{x \sim Q}\left[ f(x) \frac{D(x)}{Q(x)} \right]$$

A basic question is: how many samples from Q are required to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded under these assumptions.
Nevertheless, there is one special choice, Q(x) = f(x) D(x) / N (a valid distribution when f is nonnegative), for which the sample complexity turns out to be 1: under this Q, every sample satisfies f(x) D(x)/Q(x) = N, so a single draw gives the exact answer with zero variance. This is typically substantially better than the sample complexity of the original problem.
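
As a minimal numerical sketch of this trick (a toy problem of my choosing: D uniform on [0, 1] and f(x) = x², so N = 1/3), every importance-weighted draw from the special proposal equals N exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (chosen for illustration): D = uniform on [0, 1], f(x) = x^2,
# so the true value is N = E_{x~D} f(x) = 1/3.
f = lambda x: x ** 2
N_true = 1.0 / 3.0

# Naive Monte Carlo: sample from D directly; error shrinks like 1/sqrt(n).
x = rng.uniform(0.0, 1.0, size=1000)
naive = f(x).mean()

# The special proposal Q(x) = f(x) D(x) / N = 3 x^2 on [0, 1].
# Inverse-CDF sampling: the CDF is x^3, so x = u^(1/3) for u uniform.
u = rng.uniform(0.0, 1.0, size=1)   # a single sample suffices
xq = u ** (1.0 / 3.0)
weight = 1.0 / (3.0 * xq ** 2)      # D(x)/Q(x), with D(x) = 1
special = (f(xq) * weight).mean()   # exactly 1/3, zero variance

print(naive, special, N_true)
```

The single-draw estimate is exact because the integrand times the weight is the constant N, which is the sense in which the sample complexity is 1.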

This observation underlies the motivation for voluntary importance weighting algorithms. Even under pretty terrible approximations, the logic of “Q(x) is something like f(x) D(x)” often yields substantial improvements over sampling directly from D(x).

Inappropriate Mathematics for Machine Learning

Reviewers and students are sometimes greatly concerned by the distinction between:

  1. An open set and a closed set.
  2. A supremum and a maximum.
  3. An event which happens with probability 1 and an event that always happens.

I don’t appreciate these distinctions in machine learning & learning theory. All machine learning takes place (by definition) on a machine where every parameter has finite precision. Consequently, every set is closed, a maximal element always exists, and probability 1 events always happen.
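
A small illustration of the finite-precision point (a numpy sketch, not part of the original argument): over 64-bit floats, the set of values strictly below 1.0 is finite, so its supremum is attained as a maximum.

```python
import numpy as np

# Over float64 there are only finitely many values below 1.0, so the set
# {x : x < 1.0} has a largest element: its supremum is a maximum.
sup = np.nextafter(1.0, 0.0)   # the largest double strictly less than 1.0
print(sup)          # 0.9999999999999999
print(sup < 1.0)    # True: the supremum belongs to the set
```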

The fundamental issue here is that substantial parts of mathematics don’t appear well-matched to computation in the physical world, because the mathematics has concerns which are unphysical. This mismatched mathematics makes irrelevant distinctions. We can ask “what mathematics is appropriate to computation?” Andrej has convinced me that a pretty good answer to this question is constructive mathematics.

So, here’s a basic challenge: Can anyone name a situation where any of the distinctions above (or similar distinctions) matter in machine learning?

Concerns about the Large Scale Learning Challenge

The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured.

From the instructions page, several issues come up:

  1. Large Definition My personal definition of dataset size is:
    1. small A dataset is small if a human could look at the dataset and plausibly find a good solution.
    2. medium A dataset is medium-sized if it fits in the RAM of a reasonably priced computer.
    3. large A large dataset does not fit in the RAM of a reasonably priced computer.

    By this definition, all of the datasets are medium-sized. This might sound like a pissing match over dataset size, but I believe it is more than that.

    The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require random access to examples (see the streaming sketch after this list).

  2. No Loading Time The medium scale nature of the datasets is tacitly acknowledged in the rules, which exclude data loading time. My experience is that parsing and loading large datasets is often the computational bottleneck. For example, when comparing Vowpal Wabbit to SGD, I used wall-clock time, which makes SGD look a factor of 40 or so worse than Leon’s numbers, which only count training time after loading. This timing difference is entirely due to the overhead of parsing, even though the format parsed is a carefully optimized binary format. (No ‘excluding loading time’ number can be found for VW, of course, because loading and learning are intertwined.)
  3. Optimal Parameter Time The rules specify that the algorithm should be timed with optimal parameters. It’s very common for learning algorithms to have a few parameters controlling learning rate or regularization. However, no constraints are placed on the number or meaning of these parameters. As an extreme form of abuse, for example, your initial classifier could be declared a parameter. With an appropriate choice of this initial parameter (which you can freely optimize on the data), training time is zero.
  4. Parallelism One approach to dealing with large amounts of data is to add computers that operate in parallel. This is very natural (the brain is vastly parallel at the neuron level), and there are substantial research questions in parallel machine learning. Nevertheless it doesn’t appear to be supported by the contest. There are good reasons for this: parallel architectures aren’t very standard yet, and buying multiple computers is still substantially more expensive than buying the RAM to fit the dataset sizes. Nevertheless, it’s disappointing to exclude such a natural avenue. The rules even appear unclear on whether or not the final test run is on an SMP machine.
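
To make the medium-to-large transition concrete, here is a minimal sketch (my own construction, not part of the contest) of a learning algorithm needing only sequential access to examples: one pass of stochastic gradient descent for logistic regression over a file that need not fit in RAM. The sparse ‘label idx:val ...’ line format is a hypothetical stand-in for whatever format the data actually uses.

```python
import numpy as np

def streaming_sgd(path, dim, lr=0.1):
    """One sequential pass of logistic-regression SGD over a file that
    need not fit in RAM. Each line is 'label idx:val idx:val ...' with
    labels +1/-1 (a hypothetical sparse format, loosely VW-like)."""
    w = np.zeros(dim)
    with open(path) as lines:
        for line in lines:
            tokens = line.split()
            if len(tokens) < 2:
                continue  # skip blank or featureless lines
            y = 1.0 if tokens[0] in ("1", "+1") else -1.0
            pairs = [t.split(":") for t in tokens[1:]]
            idx = np.array([int(i) for i, _ in pairs])
            val = np.array([float(v) for _, v in pairs])
            margin = y * val.dot(w[idx])
            # Gradient of log(1 + exp(-margin)) with respect to the margin.
            g = -y / (1.0 + np.exp(margin))
            w[idx] -= lr * g * val  # update only the active features
    return w
```

Memory use is independent of the number of examples, which is exactly the property the small/medium/large definitions above are tracking.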

As a consequence of this design, the contest prefers algorithms that load all data into memory then operate on it. It also essentially excludes parallel algorithms. These design decisions discourage large scale algorithms (where large is as defined above) in favor of medium scale learning algorithms. The design also favors highly parameterized learning algorithms over less parameterized algorithms, which is the opposite of my personal preference for research direction.

Many of these issues can be eliminated, or at least partially addressed. Limiting the parameter size to ‘20 characters on the command line’ or in some other reasonable way seems essential. It’s probably too late to get large datasets, but using wall-clock time would at least avoid bias against large scale algorithms. If the final evaluation is going to take place on an SMP machine, at least detailing that would be helpful.

Despite these concerns, it’s important to be clear that this is an interesting contest. Even without any rule changes, its outcome tells us something about which sorts of algorithms work at a medium scale. That’s good information to know if you are interested in tackling larger scale algorithms. The datasets are also large enough to break every Θ(m²) algorithm. We should also respect the organizers: setting up any contest of this sort is quite a bit of work that’s difficult to nail down perfectly in advance.

Update: Soeren has helped set up an SMP parallel track which addresses some of the concerns above. See the site for details, and see you there.

Watchword: Supervised Learning

I recently discovered that supervised learning is a controversial term. The two definitions are:

  1. Known Loss Supervised learning corresponds to the situation where you have unlabeled examples plus knowledge of the loss of each possible predicted choice. This is the definition I’m familiar and comfortable with. One reason to prefer this definition is that the analyses of sample complexity for this class of learning problems are all pretty similar.
  2. Any kind of signal Supervised learning corresponds to the situation where you have unlabeled examples plus any source of side information about what the right choice is. This notion of supervised learning seems to subsume reinforcement learning, which makes me uncomfortable, because it means there are two words for the same class. This also means there isn’t a convenient word to describe the first definition.

Reviews suggest there are people who are dedicated to the second definition out there, so it can be important to discriminate which you mean.