Here are a few papers from COLT 2008 that I found interesting.

- Maria-Florina Balcan, Steve Hanneke, and Jenn Wortman, The True Sample Complexity of Active Learning. This paper shows that in an asymptotic setting, active learning is
*always*better than supervised learning (although the gap may be small). This is evidence that the only thing in the way of universal active learning is us knowing how to do it properly. - Nir Ailon and Mehryar Mohri, An Efficient Reduction of Ranking to Classification. This paper shows how to robustly rank
*n*objects with*n log(n)*classifications using a quicksort based algorithm. The result is applicable to many ranking loss functions and has implications for others. - Michael Kearns and Jennifer Wortman. Learning from Collective Behavior. This is about learning in a new model, where the goal is to predict how a collection of interacting agents behave. One claim is that learning in this setting can be reduced to IID learning.

Due to the relation with Metric-E^{3}, I was particularly interested in a couple other papers on reinforcement learning in navigation-like spaces.

I also particularly enjoyed Dan Klein‘s talk, which was the most impressive application of graphical model technology I’ve seen.

I also attended the large scale learning challenge workshop and enjoyed Antoine Bordes talk about a fast primal space algorithm that won by a hair over other methods in the wild track. Ronan Collobert‘s talk was also notable in that they are doing relatively featuritis-free NLP.

Regarding the active learning paper, while it is a very interesting theoretical take on things, is it clear that it implies algorithmic statements too? Because it essentially argues that while the algorithm does reduce the error with a lower sample complexity, it cannot be sure of this fact in all cases with a provably low sample complexity. It is not clear in my mind if this gets in the way of designing algorithms with provable guarantees.

Also on a somewhat orthogonal note, the epsilon parameter assumed in typical scenarios seemed more easy to choose than a bound on N, the number of labels allowed. I briefly talked to Steve about it, and he also seemed to agree that there seems to be no good way to settle this point.

It is true that this notion of sample complexity does not actually inform the user (or algorithm) of how many labels to allow the algorithm to request (since it is sometimes unverifiable). However, I feel there should still be practical implications of this work for algorithm design. In particular, I would think we’d want algorithms that have strictly better asymptotic convergence rates than passive learning, even when they cannot directly detect the magnitude of improvements.

In some sense, this is the same issue faced in the analysis of universally consistent learning rules in passive learning (though it’s somewhat worse there). There is typically no way to verify how close to the Bayes error rate a classifier is (verifiable complexity is infinite), yet we still want learning rules whose error rates provably converge to the Bayes error in the limit (unverifiable complexity is a finite function of epsilon and the distribution of (X,Y)), and we often find such methods quite effective in practice. So this is one instance where unverifiable sample complexity seems to be a useful guide in algorithm design.

In active learning we are more fortunate, since the verifiable complexity is finite — and we certainly want algorithms with small verifiable sample complexity; however, the unverifiable complexity still seems relevant, particularly when the verifiable complexity is large. For instance, in the extreme case, there are concept spaces and distributions where the verifiable sample complexity of active learning is Omega(1/epsilon) for all targets, but where it’s possible to get O(1) unverifiable sample complexity for all targets. Not all algorithms that achieve near-optimal verifiable complexity also achieve this O(1) unverifiable complexity, but some do, and it seems reasonable (to me) to prefer the ones that do. In general, we should be able to design algorithms for any given active learning problem that achieve both a verifiable sample complexity that is near optimal and an unverifiable sample complexity that is asymptotically better than passive learning.

Featuritis, maybe. But the systems with the best features tend to win the bakeoffs. The gene mention tagging problem from BioCreative 2 was an interesting exercise in CRFs, in that half the entries were CRFs, with highly varying results depending on which features they used (the final report’s still in press, despite the bakeoff being in 2006 — the biologists are cagey about releasing preprints with results). I blogged about this under the title Why do you hate CRFs?. I’m predicting the same raging featuritis combined with heuristic rules for the i2b2 Obesity challenge (as with Biocreative, the bakeoff’s over, but who knows when the comparative results will be out).

There’s also structur-itis, as in parsing. You have tree-structured CFGs and dependency parsers. This kind of thing tends to get published in NLP venues, where I believe the state of the art can be summarized as “generate hypotheses using a simple model (like PCFG) and rescore/re-rank with a discriminative model (like perceptrons)”. Lots of opportunities for feature tuning, especially if we can get beyond testing on section 23 of the English Penn treebank.

In case you want more features, check out the latest issue of

Computational Linguistics(not open access yet, but it will be as of May 2008).