A Variance only Deviation Bound

At the PAC-Bayes workshop earlier this week, Olivier Catoni described a result that I hadn’t believed was possible: a deviation bound depending only on the variance of a random variable.

For people not familiar with deviation bounds, this may be hard to appreciate. Deviation bounds are one of the core components of the foundations of machine learning theory, so developments here have the potential to alter our understanding of how to learn and what is learnable. My understanding is that the basic proof techniques started with Bernstein and have evolved into several variants specialized for various applications. All of the variants I knew had a dependence on the range, with some also having a dependence on the variance of an IID or martingale random variable. This one is the first I know of with a dependence on only the variance.

The basic idea is to use a biased estimator of the mean which is not influenced much by outliers. Then, a deviation bound can be proved by using the exponential moment method, with the sum of the bias and the deviation bounded. The use of a biased estimator is clearly necessary, because an unbiased empirical average is inherently unstable—which was precisely the reason I didn’t think this was possible.
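To make the idea concrete, here is a minimal sketch of one estimator of this general kind: an M-estimator whose influence function grows only logarithmically, so a single outlier can move the estimate only slightly. The influence function `psi`, the scale parameter `alpha`, and the function names are assumptions chosen for illustration; they are not taken from the talk, and this is not claimed to be the estimator behind the result.

```python
import math

def psi(x):
    # Influence function: close to x near 0, but only logarithmic for large |x|.
    return math.copysign(math.log1p(abs(x) + 0.5 * x * x), x)

def robust_mean(xs, alpha, iters=200):
    """Solve sum(psi(alpha * (x - theta))) = 0 for theta by bisection.

    The sum is non-increasing in theta, non-negative at min(xs) and
    non-positive at max(xs), so a root lies in [min(xs), max(xs)].
    """
    lo, hi = min(xs), max(xs)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        total = sum(psi(alpha * (x - mid)) for x in xs)
        if total > 0:
            lo = mid   # theta is too small
        else:
            hi = mid   # theta is too large (or exactly right)
    return 0.5 * (lo + hi)

# One huge outlier among 100 points.
data = [0.0] * 99 + [1000.0]
empirical = sum(data) / len(data)      # dragged far from 0 by the outlier
robust = robust_mean(data, alpha=1.0)  # stays close to 0
```

The price of this stability is bias: the estimate is no longer the empirical average, which is why the analysis has to bound the bias and the deviation together.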

Precisely how this is useful for machine learning isn’t clear yet, but it opens up possibilities. For example, it’s common to suffer from large ranges in exploration settings, such as contextual bandits or active learning.

Another 10-year paper in Machine Learning

When I was thinking about the best “10 year paper” for ICML, I also took a look at a few other conferences. Here is one from 10 years ago that interested me:

David McAllester, PAC-Bayesian Model Averaging, COLT 1999. 2001 Journal Draft.

Prior to this paper, the only mechanism known for controlling or estimating the necessary sample complexity for learning over continuously parameterized predictors was VC theory and variants, all of which suffered from a basic problem: they were incredibly pessimistic in practice. This meant that only very gross guidance could be provided for learning algorithm design. The PAC-Bayes bound provided an alternative approach to sample complexity bounds which was radically tighter, quantitatively. It also imported and explained many of the motivations for Bayesian learning in a way that learning theory and perhaps optimization people might appreciate. Since this paper came out, there have been a number of moderately successful attempts to drive algorithms directly by the PAC-Bayes bound. We’ve gone from thinking that a bound driven algorithm is completely useless to merely a bit more pessimistic and computationally intense than might be necessary.
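To give a feel for what a PAC-Bayes bound computes, here is a small sketch that evaluates one commonly stated form of it: with probability at least 1 - delta, the binary KL divergence between the empirical and true error rates of the randomized classifier is at most (KL(Q||P) + ln((m+1)/delta)) / m. That particular form is an assumption on my part (it is a later refinement, not necessarily the statement in the 1999 paper), and the function names are hypothetical.

```python
import math

def binary_kl(q, p):
    # KL divergence between Bernoulli(q) and Bernoulli(p), for 0 < p < 1.
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total

def kl_inverse_upper(q, budget, iters=100):
    # Largest p >= q with binary_kl(q, p) <= budget, found by bisection.
    lo, hi = q, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid < 1.0 and binary_kl(q, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_bound(emp_err, kl_qp, m, delta):
    """Upper bound on the true error of the randomized classifier.

    emp_err: empirical error rate averaged over the posterior Q
    kl_qp:   KL(Q || P) between posterior Q and prior P, in nats
    m:       number of IID training examples
    delta:   allowed failure probability
    """
    budget = (kl_qp + math.log((m + 1) / delta)) / m
    return kl_inverse_upper(emp_err, budget)

# Example: 5% empirical error, KL of 5 nats, 10000 examples.
example_bound = pac_bayes_bound(0.05, 5.0, 10000, 0.05)
```

The bound degrades smoothly as KL(Q||P) grows, which is what makes it usable as a quantity for an algorithm to trade off against training error.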

The PAC-Bayes bound is related to the “bits-back” argument that Geoff Hinton and Drew van Camp made at COLT 6 years earlier.

What other machine learning or learning theory papers from 10 years ago have had a substantial impact?

Prediction Science

One view of machine learning is that it’s about how to program computers to predict well. This suggests a broader research program centered around the more pervasive goal of simply predicting well.
There are many distinct strands of this broader research program which are only partially unified. Here are the ones that I know of:

  1. Learning Theory. Learning theory focuses on several topics related to the dynamics and process of prediction. Convergence bounds like the VC bound give an intellectual foundation to many learning algorithms. Online learning algorithms like Weighted Majority provide an alternate purely game theoretic foundation for learning. Boosting algorithms yield algorithms for purifying prediction ability. Reduction algorithms provide means for changing esoteric problems into well known ones.
  2. Machine Learning. A great deal of experience has accumulated in practical algorithm design from a mixture of paradigms, including Bayesian, biological, optimization, and theoretical.
  3. Mechanism Design. The core focus in game theory is on equilibria, most typically Nash equilibria, but also many other kinds of equilibria. The point of equilibria, to a large extent, is predicting how agents will behave. When this is employed well, principally in mechanism design for auctions, it can be a very powerful concept.
  4. Prediction Markets. The basic idea in a prediction market is that commodities can be designed so that their buy/sell price reflects a form of wealth-weighted consensus estimate of the probability of some event. This is not simply mechanism design, because (a) the thin market problem must be dealt with and (b) the structure of plausible guarantees is limited.
  5. Predictive Statistics. Part of statistics focuses on prediction, essentially becoming indistinguishable from machine learning. The canonical example of this is tree building algorithms such as CART, random forests, and some varieties of boosting. Similarly, the notions of probability, counting, and estimation are all handy.
  6. Robust Search. I have yet to find an example of robust search which isn’t useful—and there are several varieties. This includes active learning, robust min finding, and (more generally) compressed sensing and error correcting codes.

The lack of unification is fertile territory for new research, so perhaps it’s worthwhile to think about how these different research programs might benefit from each other.

  1. Learning Theory. The concept of mechanism design is mostly missing from learning theory, but it is sure to be essential when interactive agents are learning. We’ve found several applications for robust search as well as new settings for robust search, such as active learning and error correcting tournaments, but there are surely others.
  2. Machine Learning and Predictive Statistics. Machine learning has been applied to auction design. There is a strong relationship between incentive compatibility and choice of loss functions, both for choosing proxy losses and approximating the real loss function imposed by the world. It’s easy to imagine designer loss functions from the study of incentive compatibility mechanisms giving learning algorithms an edge. I found this paper thought provoking that way. Since machine learning and information markets share a design goal, are there hybrid approaches which can outperform either?
  3. Mechanism Design. There are some notable similarities between papers in ML and mechanism design. For example, there are papers about learning on permutations and pricing in combinatorial markets. I haven’t yet taken the time to study these carefully, but I could imagine that one suggests advances for the other, and perhaps vice versa. In general, the idea of using mechanism design with context information (as is done in machine learning) could also be extremely powerful.
  4. Prediction Markets. Prediction markets are partly an empirical field and partly a mechanism design field. There seems to be relatively little understanding about how well and how exactly information from multiple agents is supposed to interact to derive a good probability estimate. For example, the current global recession reminds us that excess leverage is a very bad idea. The same problem comes up in machine learning and is solved by the weighted majority algorithm (and even more thoroughly by the hedge algorithm). Can an information market be designed with the guarantee that an imperfect but best player decides the vote after not-too-many rounds? How would this scale as a function of the ratio of a participant’s initial wealth to the total wealth?
  5. Robust Search. Investigations into robust search are extremely diverse, essentially only unified in a mathematically based analysis. For people interested in robust search, machine learning and information markets provide a fertile ground for empirical application and new settings. Can all mechanisms for robust search be done with context information, as is common in learning? Do these approaches work empirically in machine learning or information markets?
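For readers who have not seen it, the hedge algorithm mentioned under Prediction Markets can be sketched in a few lines. Each expert holds a weight (loosely analogous to wealth), the weight shrinks multiplicatively with that expert's loss, and no expert can ever influence the vote by more than its current share. The function name, the learning rate `eta`, and the toy losses are assumptions for illustration.

```python
import math

def hedge(loss_rounds, eta):
    """Run the Hedge multiplicative-weights update.

    loss_rounds: list of rounds, each a list of per-expert losses in [0, 1]
    eta:         learning rate (> 0)
    Returns the final normalized weights over experts.
    """
    n = len(loss_rounds[0])
    weights = [1.0 / n] * n  # start with equal shares
    for losses in loss_rounds:
        # Shrink each expert's weight exponentially in its loss this round.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights

# Expert 0 is always right; the other two are always wrong.
rounds = [[0.0, 1.0, 1.0]] * 20
final_weights = hedge(rounds, eta=0.5)
```

In this toy run nearly all of the weight ends up on the best expert after a modest number of rounds, which is the behavior the question about information markets asks for.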

There are almost surely many other interesting research topics and borrowable techniques here, and probably even other communities oriented around prediction. While the synthesis of these fields is almost sure to eventually happen, I’d like to encourage it sooner rather than later. For someone working on one of these branches, attending a conference on one of the other branches might be a good start. At a lesser time investment, Oddhead is a good start.

Interesting Papers at COLT 2007

Here are two papers that seem particularly interesting at this year’s COLT.

  1. Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper.
  2. Adam Tauman Kalai, Learning Nested Halfspaces and Uphill Decision Trees. Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X, is extraordinarily challenging, as judged by the lack of progress over a long period of time. This paper is about regression PAC-learning, and the results appear much more encouraging than those that exist in classification PAC-learning. Under the assumptions that:
    1. The level sets of the correct regressed value are halfspaces.
    2. The level sets obey a Lipschitz condition.

    this paper proves that a good regressor can be PAC-learned using a boosting algorithm. (The “uphill decision trees” part of the paper is about one special case where you don’t need the Lipschitz condition.)

The Missing Bound

Sham Kakade points out that we are missing a bound.

Suppose we have m samples x drawn IID from some distribution D. Through the magic of the exponential moment method we know that:

  1. If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m^0.5) (at least in crude form). A proof is on page 9 here.
  2. If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvement when the true variance is small.
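To see the size of the gap between the two cases above, here is a small sketch comparing deviation widths. It uses Bernstein's inequality, a standard relaxation of Bennett's bound, rather than Bennett's bound itself, because it inverts in closed form; that substitution and the function names are my choices for illustration. Both functions take the true variance (or a valid upper bound on it), never the observed one.

```python
import math

def hoeffding_width(m, I, delta):
    # One-sided Hoeffding deviation for the mean of m IID samples with range I:
    # Pr(deviation >= eps) <= exp(-2 * m * eps^2 / I^2).
    return I * math.sqrt(math.log(1.0 / delta) / (2.0 * m))

def bernstein_width(m, I, var, delta):
    # One-sided Bernstein deviation for the mean. Solves
    #   t^2 / (2 * (m * var + I * t / 3)) = ln(1 / delta)
    # for the deviation t of the sum, then divides by m.
    L = math.log(1.0 / delta)
    t = I * L / 3.0 + math.sqrt((I * L / 3.0) ** 2 + 2.0 * m * var * L)
    return t / m

m, I, delta = 1000, 1.0, 0.05
wide = hoeffding_width(m, I, delta)            # range only
narrow = bernstein_width(m, I, 0.001, delta)   # small known variance
# Largest variance any variable with range I can have is I^2 / 4.
worst = bernstein_width(m, I, I * I / 4.0, delta)
```

With a small true variance the variance-aware width is roughly an order of magnitude narrower here, while plugging in the worst-case variance I^2/4 gives a width slightly larger than the range-only one.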

What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. This means that many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. Most of the time, they get away with it, but this is a dangerous move when doing machine learning. In machine learning, we are typically trying to find a predictor with 0 expected loss. An observed loss of 0 (i.e. 0 training error) implies an observed variance of 0. Plugging this into Bennett’s bound, you can construct a wildly overconfident bound on the expected loss.
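A small calculation shows how the plug-in mistake fails. Suppose the loss is 1 with probability p and 0 otherwise. Whenever all m observed losses are 0, the observed variance is 0 and the plug-in bound is tiny, yet this event can happen far more often than the confidence parameter allows. The sketch uses the Bernstein relaxation of Bennett's bound for simplicity, and the numbers and names are assumptions chosen for illustration.

```python
import math

def plug_in_width(m, I, observed_var, delta):
    # Bernstein-style width with the OBSERVED variance substituted for the
    # true variance. That substitution is exactly the invalid step.
    L = math.log(1.0 / delta)
    t = I * L / 3.0 + math.sqrt((I * L / 3.0) ** 2 + 2.0 * m * observed_var * L)
    return t / m

m, delta = 100, 0.05
p = 0.025  # true expected loss of a 0/1 loss

# If every observed loss is 0, the observed mean and variance are both 0,
# so the claimed upper bound on the expected loss is just the width.
claimed_bound = 0.0 + plug_in_width(m, 1.0, 0.0, delta)

# Probability that all m observed losses are 0.
prob_all_zero = (1.0 - p) ** m

# The claimed bound is below p, so it is wrong whenever all losses are 0.
bound_is_wrong = claimed_bound < p
```

Here the claimed bound sits below the true expected loss, and the event producing it has probability noticeably larger than delta, so the stated confidence level does not hold.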

One safe way to apply Bennett’s bound is to use McDiarmid’s inequality to bound the true variance given an observed variance, and then plug this bound on the true variance into Bennett’s bound on the mean (making sure to share the confidence parameter between both applications). This is a clumsy and relatively inelegant method.
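The structure of this two-step method, including the sharing of the confidence parameter, can be sketched as follows. The variance step is left as a caller-supplied function `variance_upper_bound`, a hypothetical placeholder: the post does not give the McDiarmid-based constants, so none are invented here. The mean step uses the Bernstein relaxation of Bennett's bound, which is also my substitution.

```python
import math

def bernstein_width(m, I, var, delta):
    # One-sided Bernstein deviation for the mean, given a valid variance bound.
    L = math.log(1.0 / delta)
    t = I * L / 3.0 + math.sqrt((I * L / 3.0) ** 2 + 2.0 * m * var * L)
    return t / m

def two_step_bound(obs_mean, obs_var, m, I, delta, variance_upper_bound):
    """Upper bound on the true mean holding with probability >= 1 - delta.

    variance_upper_bound(obs_var, m, I, d) is a hypothetical callable that
    must return a value exceeding the true variance with probability >= 1 - d.
    """
    # Step 1: spend delta / 2 on bounding the true variance.
    var_hi = variance_upper_bound(obs_var, m, I, delta / 2.0)
    # No variable with range I has variance above I^2 / 4.
    var_hi = min(var_hi, I * I / 4.0)
    # Step 2: spend the other delta / 2 on the mean; the union bound
    # makes both steps hold together with probability >= 1 - delta.
    return obs_mean + bernstein_width(m, I, var_hi, delta / 2.0)

def worst_case_variance(obs_var, m, I, d):
    # Trivially valid placeholder that ignores the data entirely.
    return I * I / 4.0

safe = two_step_bound(0.1, 0.0, 100, 1.0, 0.05, worst_case_variance)
```

With the trivial placeholder this is no better than a range-only bound; any gain comes entirely from how tight the variance step can be made.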

There should exist a better bound. If we let the observed mean of a sample S be u(S) and the observed variance be v(S), there should exist a bound which requires only a bounded range (like Chernoff), yet which is almost as tight as the Bennett bound. It should have the form:

Pr_{S ~ D^m} ( E_{x ~ D} x <= f(u(S), v(S), d) ) >= 1 - d

For machine learning, a bound of this form may help design learning algorithms which learn by directly optimizing bounds. However, there are many other applications both within and beyond machine learning.

(*) Incidentally, sometimes people try to apply the Bennett inequality when they only know the range of the random variable by computing the worst case variance within that range. This is never as good as a proper application of the Chernoff/Hoeffding bound.