Unfortunately, a scheduling failure meant I missed all of AIStat and most of the learning workshop, otherwise known as Snowbird, when it’s at Snowbird.
At snowbird, the talk on Sum-Product networks by Hoifung Poon stood out to me (Pedro Domingos is a coauthor.). The basic point was that by appropriately constructing networks based on sums and products, the normalization problem in probabilistic models is eliminated, yielding a highly tractable yet flexible representation+learning algorithm. As an algorithm, this is noticeably cleaner than deep belief networks with a claim to being an order of magnitude faster and working better on an image completion task.
Snowbird doesn’t have real papers—just the abstract above. I look forward to seeing the paper. (added: Rodrigo points out the deep learning workshop draft.)
At the PAC-Bayes workshop earlier this week, Olivier Catoni described a result that I hadn’t believed was possible: a deviation bound depending only on the variance of a random variable.
For people not familiar with deviation bounds, this may be hard to appreciate. Deviation bounds, are one of the core components for the foundations of machine learning theory, so developments here have a potential to alter our understanding of how to learn and what is learnable. My understanding is that the basic proof techniques started with Bernstein and have evolved into several variants specialized for various applications. All of the variants I knew had a dependence on the range, with some also having a dependence on the variance of an IID or martingale random variable. This one is the first I know of with a dependence on only the variance.
The basic idea is to use a biased estimator of the mean which is not influenced much by outliers. Then, a deviation bound can be proved by using the exponential moment method, with the sum of the bias and the deviation bounded. The use of a biased estimator is clearly necessary, because an unbiased empirical average is inherently unstable—which was precisely the reason I didn’t think this was possible.
Precisely how this is useful for machine learning isn’t clear yet, but it opens up possibilities. For example, it’s common to suffer from large ranges in exploration settings, such as contextual bandits or active learning.
I attended ALT (“Algorithmic Learning Theory”) for the first time this year. My impression is ALT = 0.5 COLT, by attendance and also by some more intangible “what do I get from it?” measure. There are many differences which can’t quite be described this way though. The program for ALT seems to be substantially more diverse than COLT, which is both a weakness and a strength.
One paper that might interest people generally is:
Alexey Chernov and Vladimir Vovk, Prediction with Expert Evaluators’ Advice. The basic observation here is that in the online learning with experts setting you can simultaneously compete with several compatible loss functions simultaneously. Restated, debating between competing with log loss and squared loss is a waste of breath, because it’s almost free to compete with them both simultaneously. This might interest anyone who has run into “which loss function?” debates that come up periodically.
When I was thinking about the best “10 year paper” for ICML, I also took a look at a few other conferences. Here is one from 10 years ago that interested me:
David McAllester PAC-Bayesian Model Averaging, COLT 1999. 2001 Journal Draft.
Prior to this paper, the only mechanism known for controlling or estimating the necessary sample complexity for learning over continuously parameterized predictors was VC theory and variants, all of which suffered from a basic problem: they were incredibly pessimistic in practice. This meant that only very gross guidance could be provided for learning algorithm design. The PAC-Bayes bound provided an alternative approach to sample complexity bounds which was radically tighter, quantitatively. It also imported and explained many of the motivations for Bayesian learning in a way that learning theory and perhaps optimization people might appreciate. Since this paper came out, there have been a number of moderately successful attempts to drive algorithms directly by the PAC-Bayes bound. We’ve gone from thinking that a bound driven algorithm is completely useless to merely a bit more pessimistic and computationally intense than might be necessary.
The PAC-Bayes bound is related to the “bits-back” argument that Geoff Hinton and Drew van Camp made at COLT 6 years earlier.
What other machine learning or learning theory papers from 10 years ago have had a substantial impact?
Here’s a list of papers that I found interesting at ICML/COLT/UAI in 2009.
- Elad Hazan and Comandur Seshadhri Efficient learning algorithms for changing environments at ICML. This paper shows how to adapt learning algorithms that compete with fixed predictors to compete with changing policies. The definition of regret they deal with seems particularly useful in many situation.
- Hal Daume, Unsupervised Search-based Structured Prediction at ICML. This paper shows a technique for reducing unsupervised learning to supervised learning which (a) make a fast unsupervised learning algorithm and (b) makes semisupervised learning both easy and highly effective.
- There were two papers with similar results on active learning in the KWIK framework for linear regression, both reducing the sample complexity to . One was Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona Robust Bounds for Classification via Selective Sampling at ICML and the other was Thomas Walsh, Istvan Szita, Carlos Diuk, Michael Littman Exploring compact reinforcement-learning representations with linear regression at UAI. The UAI paper covers application to RL as well.
- Ping Li, Improving Compressed Counting at UAI. This paper talks about how to keep track of the moments in a datastream with very little space and computation. I’m not sure I have a use for it yet, but it seems like a cool piece of basic technology.
- Mark Reid and Bob Williamson Surrogate Regret Bounds for Proper Losses at ICML. This paper points out that via the integral characterization of proper losses, proper scoring rules can be reduced to binary classification. The results unify and generalize the Probing and Quanting reductions we worked on previously. This paper is also related to Nicolas Lambert‘s work, which is quite thought provoking in terms of specifying what is learnable.
- Daniel Hsu, Sham M. Kakade and Tong Zhang. A Spectral Algorithm for Learning Hidden Markov Models COLT. This paper shows that a subset of HMMs can be learned using an SVD-based algorithm.
- Samory Kpotufe, Escaping the curse of dimensionality with a tree-based regressor at COLT. This paper shows how to directly applying regression in high dimensional vector spaces and have it succeed anyways because the data is naturally low-dimensional.
- Shai Ben-David, David Pal and Shai Shalev-Shwartz. Agnostic Online Learning at COLT. This paper characterizes the ability to learn when an adversary is choosing features in the online setting as the “Littlestone dimension”.