The Health of COLT

The health of COLT (Conference on Learning Theory or Computational Learning Theory, depending on who you ask) has been questioned over the last few years. Low points for the conference came when EuroCOLT merged into COLT in 2001 and attendance at the 2002 Sydney COLT fell to a new low. This occurred in the general context of machine learning conferences growing in both number and size over the last decade.

Any discussion of why COLT has had difficulties is inherently controversial, as is any story about well-intentioned people making the wrong decisions. Nevertheless, it may be worth discussing in the hope of avoiding future problems and of improving general understanding. In any such discussion, there is a strong tendency to identify with a conference or community in a patriotic manner that is detrimental to clear thinking. Keep in mind that conferences exist to further research.

My understanding (I wasn’t around) is that COLT started as a subcommunity of the computer science theory community. This implies several things:

  1. There was a basic tension facing authors: Do you submit to COLT, or to FOCS or STOC, the “big” theory conferences?
  2. The research programs at COLT were motivated by theoretical concerns (rather than, say, practical experience). This includes motivations like understanding the combinatorics of certain learning models and their relationship with cryptography.

This worked well in the beginning when new research programs were being defined and new learning models were under investigation. What went wrong from there is less clear.

  1. Perhaps the community shifted focus from thinking about new learning models to simply trying to find solutions in older models, and this went stale.
  2. Perhaps some critical motivations were left out. Many of the learning models under investigation at COLT strike empirically motivated people as unlikely to be useful in practice.
  3. Perhaps the conference/community was not inviting enough to new forms of learning theory. Many pieces of learning theory have not appeared at COLT over the last 20 years.

These concerns have been addressed since the low point of COLT, but the long-term health of the conference is still questionable: ICML has been accepting learning theory with plausible empirical motivations, and a mathematical learning theory conference has appeared, so authors now have several venues to choose from.

The good news is that this year’s COLT appeared healthy. The topics covered by the program were diverse and often interesting. Several of the papers seem quite relevant to the practice of machine learning. Perhaps an even better measure is that there were many younger people in attendance.

The Role of Impromptu Talks

COLT had an impromptu session which seemed as interesting as, or more interesting than, any other single technical session (despite being only an hour long). There are several roles that an impromptu session can play, including:

  1. Announcing new work completed since the paper deadline. Letting this happen now rather than later speeds the process of research.
  2. Discussing a paper that was rejected. Reviewers sometimes err, and an impromptu session provides a means to remedy that.
  3. Entertainment. We all like to have a bit of fun.

For design, the following seem important:

  1. Impromptu speakers should not have much time. At COLT, it was 8 minutes, but I have seen even 5 work well.
  2. The entire impromptu session should not last too long because the format is dense and promotes restlessness. A half hour or hour can work well.

Impromptu talks are a mechanism to let a little bit of chaos into the schedule. They will be chaotic in content, presentation, and usefulness. The fundamental advantage of this chaos is that it provides a means for covering material that the planned program did not (or could not). This seems like a “bargain use of time” considering the short duration. One caveat is that it is unclear how well this mechanism can scale to large conferences.

Workshops are not Conferences

… and you should use that fact.

A workshop differs from a conference in that it brings together a focused group of people to work on a focused topic. It also differs in that a workshop is typically a “one-time affair” rather than part of a series. (The Snowbird learning workshop counts as a conference in this respect.)

A common failure mode of both organizers and speakers at a workshop is to treat it as a conference. This is “ok”, but it is not really taking advantage of the situation. Here are some things I’ve learned:

  1. For speakers: A smaller audience means the talk can be more interactive. Interactivity means a better chance of not losing your audience and a more interesting presentation (because you can adapt to your audience). Greater focus amongst the participants means you can get to the heart of the matter more easily and discuss tradeoffs more carefully. Unlike at conferences, relevance is valued more than novelty.
  2. For organizers: Not everything needs to be in a conference style presentation format (i.e. regularly spaced talks of 20-30 minute duration). Significant (and variable) question time, different talk durations, flexible rescheduling, and panel discussions can all work well.

Question: “When is the right time to insert the loss function?”

Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time?

When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set.
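
As a concrete illustration, here is a minimal sketch (with entirely made-up numbers and a toy threshold classifier, not anything from the post) of the two recipes on a binary problem with an asymmetric loss:

    import numpy as np

    # Hypothetical asymmetric loss: loss[a, y] is the cost of predicting a
    # when the truth is y.
    loss = np.array([[0.0, 5.0],   # predict 0: free if y = 0, costly if y = 1
                     [1.0, 0.0]])  # predict 1: small cost if y = 0, free if y = 1

    # Recipe 1 (Bayesian): insert the loss at test time. Estimate conditional
    # probabilities (pretend a model produced these), then choose the action
    # minimizing expected loss under them.
    p_y_given_x = np.array([0.7, 0.3])            # P(y = 0 | x), P(y = 1 | x)
    expected_loss = loss @ p_y_given_x            # expected loss of each action
    bayes_action = int(np.argmin(expected_loss))  # here: predict 1

    # Recipe 2 (direct): insert the loss at training time. Choose the predictor
    # (here, a threshold on a 1-d score) that minimizes the actual loss on the
    # training set, with no intermediate probability estimate.
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.6])
    labels = np.array([0,   0,   1,    1,   1,   0])

    def empirical_loss(threshold):
        preds = (scores >= threshold).astype(int)
        return loss[preds, labels].sum()

    best_threshold = min(np.linspace(0.0, 1.0, 101), key=empirical_loss)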

The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields the best performance. This question is related to a basic principle which both Yann LeCun (applied) and Vladimir Vapnik (theoretical) advocate: “solve the simplest prediction problem that solves the problem”. (One difficulty with this principle is that ‘simplest’ is difficult to define in a satisfying way.)

One reason why it’s unclear is that optimizing an arbitrary loss is not an easy thing for a learning algorithm to cope with. Learning reductions (which I am a big fan of) give a mechanism for doing this, but they are new and relatively untried.
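
For a flavor of what a reduction looks like, here is a minimal sketch (with hypothetical data, and only one of many possible reductions) of the rejection-sampling idea behind the Costing reduction, which turns importance-weighted classification into ordinary classification:

    import random

    def rejection_sample(weighted_examples, seed=0):
        """Cost-proportionate rejection sampling (in the spirit of Costing):
        keep each example with probability weight / max_weight, so an ordinary
        unweighted learner trained on the output approximately optimizes the
        importance-weighted loss on the input."""
        rng = random.Random(seed)
        max_w = max(w for _, _, w in weighted_examples)
        return [(x, y) for (x, y, w) in weighted_examples if rng.random() < w / max_w]

    # Hypothetical usage: each example is (features, label, importance weight).
    data = [([0.2, 1.0], 0, 1.0), ([0.9, 0.1], 1, 5.0), ([0.5, 0.5], 0, 2.5)]
    unweighted = rejection_sample(data)
    # ...train any off-the-shelf binary classifier on `unweighted` (repeating
    # with different seeds and averaging the classifiers reduces variance).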

Drew Bagnell adds: Another approach to integrating loss functions into learning is to try to re-derive ideas about probability theory appropriate for other loss functions. For instance, Peter Grunwald and A.P. Dawid present a variant on maximum entropy learning. Unfortunately, it’s even less clear how often these approaches lead to efficient algorithms.

Maximum Margin Mismatch?

John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3Ns) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and the M3N paper is one of my favorites. I’m not alone: it won the best student paper award at NIPS in 2003.

There are some things I find odd about the paper. For instance, it says of probabilistic models

“cannot handle high dimensional feature spaces and lack strong theoretical guarantees.”

I’m aware of no such limitations. Also:

“Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.”

This is quite interesting and contradicts my own experience as well as that of a number of people I greatly respect. I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with?

The elegance of M3N, I think, is unrelated to this probabilistic/margin distinction. M3N provided the first computationally efficient implementation of the margin concept for multiple output variables, along with a sample complexity result with a much weaker dependence on the number of labels than previous approaches. Further, the authors carry out some nice experiments that speak well for the practicality of their approach. In particular, M3Ns outperform Conditional Random Fields (CRFs) in terms of per-variable (Hamming) loss. And I think this gets us to the crux of the matter, and ties back to John’s post. CRFs are trained by a MAP approach that is effectively per-sequence, while the loss function we care about at run time is per-variable.

The mismatch the post title refers to is that, at test time, M3Ns are Viterbi decoded: a per-sequence decoding. Intuitively, Viterbi is an algorithm that only gets paid for its services when it classifies an entire sequence correctly. This seems an odd mismatch, and makes one wonder: How well does a per-variable approach like the marginal likelihood approach of Roweis, Kakade, and Teh mentioned previously, combined with run-time belief propagation, compare with the M3N procedure? Does the mismatch matter, and if so, is there a decoding procedure (like BP) more appropriate for margin-trained methods? And finally, it seems we need to answer John’s question convincingly: if you really care about per-variable probabilities or classifications, isn’t it possible that structuring the output space actually hurts? (It seems clear to me that it can help when you insist on getting the entire sequence right, although perhaps others don’t concur.)
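
To make the per-sequence vs. per-variable distinction concrete, here is a minimal sketch on a toy three-node chain (hypothetical potentials, not taken from the M3N paper): Viterbi returns the single jointly most probable labeling, while per-variable decoding takes the argmax of each forward-backward marginal, which minimizes expected Hamming loss under the model, and the two can disagree.

    import numpy as np

    # Toy 3-node chain over binary labels, with hypothetical potentials in log
    # space: unary[t, y] and pairwise[y, y'].
    unary = np.log(np.array([[0.45, 0.55],
                             [0.45, 0.55],
                             [0.45, 0.55]]))
    pairwise = np.log(np.array([[0.3, 0.7],
                                [0.7, 0.3]]))  # adjacent labels prefer to differ

    def viterbi(unary, pairwise):
        """Per-sequence decoding: the single jointly most probable labeling."""
        T, K = unary.shape
        score, back = unary[0].copy(), np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = score[:, None] + pairwise + unary[t][None, :]
            back[t], score = cand.argmax(axis=0), cand.max(axis=0)
        path = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1]

    def logsumexp0(m):
        mx = m.max(axis=0)
        return mx + np.log(np.exp(m - mx).sum(axis=0))

    def marginal_decode(unary, pairwise):
        """Per-variable decoding: forward-backward marginals, argmax at each
        position, which minimizes expected Hamming loss under the model."""
        T, K = unary.shape
        fwd, bwd = np.zeros((T, K)), np.zeros((T, K))
        fwd[0] = unary[0]
        for t in range(1, T):
            fwd[t] = unary[t] + logsumexp0(fwd[t - 1][:, None] + pairwise)
        for t in range(T - 2, -1, -1):
            bwd[t] = logsumexp0((pairwise + unary[t + 1] + bwd[t + 1]).T)
        return [int(i) for i in (fwd + bwd).argmax(axis=1)]

    print(viterbi(unary, pairwise))          # [1, 0, 1]: best whole sequence
    print(marginal_decode(unary, pairwise))  # [1, 1, 1]: best label per variable

With these particular potentials the model prefers adjacent labels to differ, so the best whole sequence alternates, while each individual marginal favors label 1: the decoder you should use depends on whether you are scored per sequence or per variable.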