Here are a few of the papers I enjoyed at ICML.

- Steffen Bickel, Michael Brückner, Tobias Scheffer, Discriminative Learning for Differing Training and Test Distributions. There is a nice trick in this paper: they predict the probability that an unlabeled sample is in the training set vs. the test set, and then use this prediction to importance weight labeled samples in the training set. This paper uses a specific parametric model, but the approach is easily generalized.
- Steve Hanneke, A Bound on the Label Complexity of Agnostic Active Learning. This paper bounds the number of labels required by the A^2 algorithm for active learning in the agnostic case. Last year we figured out agnostic active learning was possible. This year, it’s quantified. Hopefully soon, it will be practical.
- Sylvain Gelly, David Silver, Combining Online and Offline Knowledge in UCT. This paper is about techniques for improving MoGo with various sorts of learning. MoGo has a fair claim at being the world’s best Go algorithm.
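
To make the importance-weighting trick from the first paper concrete, here is a minimal sketch of the generic two-stage version, not the paper's actual method. It assumes one-dimensional inputs and a plain logistic regression for P(test | x). The function names, the toy Gaussian data, and the clipping constant are all my own illustrative choices. The weight for a training point is the odds P(test | x) / P(train | x), rescaled by n_train / n_test to undo the class prior.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_domain_classifier(train_x, test_x, steps=1000, lr=0.5):
    """Fit a 1-D logistic regression for P(test | x) by batch gradient descent."""
    # Label 0 = came from the training set, label 1 = came from the test set.
    data = [(x, 0.0) for x in train_x] + [(x, 1.0) for x in test_x]
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, s in data:
            err = sigmoid(w * x + b) - s
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def importance_weights(train_x, test_x, clip=20.0):
    """Estimate p_test(x) / p_train(x) for every training point."""
    w, b = fit_domain_classifier(train_x, test_x)
    prior_ratio = len(train_x) / len(test_x)
    weights = []
    for x in train_x:
        p_test = sigmoid(w * x + b)
        p_train = 1.0 - p_test
        # The ratio blows up as p_train -> 0; the clip is a crude guard
        # (an assumption of this sketch, not something from the paper).
        ratio = prior_ratio * p_test / max(p_train, 1e-12)
        weights.append(min(ratio, clip))
    return weights


# Toy covariate shift: training inputs centered at 0, test inputs at 1.
random.seed(0)
train_x = [random.gauss(0.0, 1.0) for _ in range(200)]
test_x = [random.gauss(1.0, 1.0) for _ in range(200)]
weights = importance_weights(train_x, test_x)
```

The resulting `weights` would then be passed as per-example weights to whatever learner is trained on the labeled data. Note that this is the two-stage variant; the paper instead trains the weighting model and the predictor together.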

There were also a large number of online learning papers this year, especially if you count papers which use online learning techniques for optimization on batch datasets (as I do). This is expected, because larger datasets are becoming more common, and online learning makes more sense the larger the dataset. Many of these papers are of interest if your goal is learning fast, while others are about extending online learning into new domains.

(Feel free to add any other papers of interest in the comments.)

I also liked Steffen Bickel’s paper. One thing I found surprising was that accuracy was significantly better if the p(train/test|x) distribution and p(y|x) distribution were trained simultaneously to maximize joint likelihood, even though that increased the potential for overfitting and made the training procedure non-convex.

I talked to Tobias about this. My understanding is that this is necessary because a learned predictor of the probability of train vs. test can make a mistake, and these mistakes, if uncorrected, can result in weights that go to infinity. Training things together avoids this problem when you are in a probabilistic setting.

I don’t know how to fix this instability in a black-box manner, and I’d be interested to learn how.

Not sure I understand…does that have to do with assigning zero p(x is in training set) to data-points that are in the training set?

Yes

One of my favourites was the Non-isometric Manifold Learning paper. I also thought Bickel’s paper on learning under covariate shift was interesting.

If anyone’s interested, I’ve done a quick analysis of machine learning trends over at my blog based on ICML paper titles over the last 20 years. The data and scripts are freely available there if anyone else wants to rummage around.