The Missing Bound

Sham Kakade points out that we are missing a bound.

Suppose we have m samples x drawn IID from some distribution D. Through the magic of the exponential moment method, we know that:

  1. If the range of x is bounded by an interval of size I, a Chernoff/Hoeffding style bound gives us a bound on the deviations like O(I/m^{0.5}) (at least in crude form). A proof is on page 9 here.
  2. If the range of x is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvement when the true variance is small. (Both bounds are sketched in code just below.)
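
To make the comparison concrete, here is a minimal Python sketch (mine, not from the post) computing one-sided deviation bounds at confidence 1 - d: a Hoeffding-style bound that uses only the range I, and a Bernstein-style relaxation of Bennett’s bound that additionally needs a bound on the true variance.

    # Minimal sketch (not from the post): one-sided deviation bounds on the mean
    # of m IID samples with range I, at confidence 1 - d.
    from math import log, sqrt

    def hoeffding_deviation(m, I, d):
        # Range-only bound: with probability >= 1 - d, the true mean exceeds
        # the observed mean by at most this much.
        return I * sqrt(log(1.0 / d) / (2.0 * m))

    def bernstein_deviation(m, I, variance, d):
        # Bernstein-style relaxation of Bennett's bound: requires a bound on
        # the *true* variance, not the observed one.
        return sqrt(2.0 * variance * log(1.0 / d) / m) + 2.0 * I * log(1.0 / d) / (3.0 * m)

    # Example: m = 10000, I = 1, d = 0.05
    # hoeffding_deviation(10000, 1.0, 0.05)        -> about 0.012
    # bernstein_deviation(10000, 1.0, 0.01, 0.05)  -> roughly 0.0026, much tighter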

What’s missing here is a bound that depends on the observed variance rather than on a bound on the true variance. In the absence of such a bound, many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, which invalidates the bound. Most of the time they get away with it, but this is a dangerous move when doing machine learning. In machine learning, we are typically trying to find a predictor with 0 expected loss. An observed loss of 0 (i.e. 0 training error) implies an observed variance of 0. Plugging this into Bennett’s bound, you can construct a wildly overconfident bound on the expected loss.

One safe way to apply Bennett’s bound is to use McDiarmid’s inequality to bound the true variance given the observed variance, and then plug this bound on the true variance into Bennett’s bound on the mean (making sure to share the confidence parameter between the two applications). This is a clumsy and relatively inelegant method.
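
Here is a rough sketch of that two-step recipe, again mine rather than anything from the post: the bounded-difference constant for the empirical variance (about 3*I^2/m) is a crude assumption on my part, and the confidence parameter is split evenly between the McDiarmid step and the Bennett/Bernstein step.

    # Rough sketch of the two-step approach (not from the post).
    from math import log, sqrt

    def true_variance_upper_bound(observed_variance, m, I, d):
        # McDiarmid step: the plug-in variance v(S) = (1/m) * sum((x_i - mean)^2)
        # changes by at most about 3*I^2/m when one sample is replaced (a crude
        # constant assumed here), so with probability >= 1 - d its expectation
        # is at most observed_variance + slack.
        c = 3.0 * I * I / m
        slack = c * sqrt(m * log(1.0 / d) / 2.0)
        # E[v(S)] = (1 - 1/m) * true variance, hence the m/(m-1) correction.
        return (m / (m - 1.0)) * (observed_variance + slack)

    def safe_mean_upper_bound(observed_mean, observed_variance, m, I, d):
        # Bennett/Bernstein step, fed the high-probability bound on the true
        # variance; each step gets confidence d/2 so both hold at once with
        # probability at least 1 - d.
        var_bound = true_variance_upper_bound(observed_variance, m, I, d / 2.0)
        deviation = (sqrt(2.0 * var_bound * log(2.0 / d) / m)
                     + 2.0 * I * log(2.0 / d) / (3.0 * m))
        return observed_mean + deviation

Note that under this crude constant the McDiarmid slack on the variance only shrinks at a 1/sqrt(m) rate, which is part of what makes this route feel clumsy.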

There should exist a better bound. If we let the observed mean of a sample S be u(S) and the observed variance be v(S), there should exist a bound which requires only a bounded range (like Chernoff), yet which is almost as tight as the Bennett bound. It should have the form:

Pr_{S ~ D^m} ( E_{x~D} x <= f(u(S), v(S), d) ) >= 1 - d

For machine learning, a bound of this form may help design learning algorithms which learn by directly optimizing bounds. However, there are many other applications both within and beyond machine learning.

(*) Incidentally, when people only know the range of the random variable, they sometimes try to apply the Bennett inequality by computing the worst case variance within that range. This is never as good as a proper application of the Chernoff/Hoeffding bound.

Conditional Tournaments for Multiclass to Binary

This problem has been cracked (but not quite completely solved) by Alina, Pradeep, and me. The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here:


For the single elimination tournament, we can prove that:
For all multiclass problems D, for all learned binary classifiers c, the regret of an induced multiclass classifier is bounded by the regret of the binary classifier times log_2 k. Restated:

reg_multiclass(D, Filter_tree_test(c)) <= log_2(k) * reg_binary(Filter_tree_train(D), c)

Here:

  1. Filter_tree_train(D) is the induced binary classification problem.
  2. Filter_tree_test(c) is the induced multiclass classifier (sketched in code below).
  3. reg_multiclass is the multiclass regret (= difference between error rate and minimum possible error rate).
  4. reg_binary is the binary regret.
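
To make the test-time side concrete, here is a hypothetical sketch of single elimination prediction (the tree representation and names are mine, not from the draft paper): each internal node holds a binary classifier choosing between the winners of its two subtrees, so the overall winner can be found by descending from the root.

    # Hypothetical sketch of test-time prediction in a single elimination
    # tournament (not the authors' code).  A tree is either a class label
    # (leaf) or a tuple (node_id, left, right); node_classifiers maps each
    # node_id to a binary classifier returning 0 (go left) or 1 (go right).

    def tournament_predict(x, tree, node_classifiers):
        while isinstance(tree, tuple):
            node_id, left, right = tree
            # The classifier at this node says which child's winner the overall
            # winner comes from, so only one subtree needs to be explored.
            tree = right if node_classifiers[node_id](x) else left
        return tree

One pleasant consequence of the structure is that a prediction needs only about log_2 k binary evaluations rather than one per class.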

This result has a slight dependence on k which we suspect is removable. The current conjecture is that this dependence can be removed by using higher order tournaments such as double elimination, triple elimination, up to log_2(k)-elimination.

The key insight which makes the result possible is conditionally defining the prediction problems at interior nodes. In essence, we use the learned classifiers from the first level of the tree to filter the distribution over examples reaching the second level of the tree. This process repeats until the root node is reached. Further details, including a more precise description and some experimental results, are in the draft paper.
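
Here is a hypothetical sketch of that filtering process (my own reading of the paragraph above; train_binary, the tree layout, and the helper names are not from the draft paper): nodes are trained bottom-up, and a multiclass example only generates a binary example at a node if its true label survived the tournament below.

    # Hypothetical sketch of conditionally defined training (not the authors'
    # code).  A tree is either a class label (leaf) or a tuple
    # (node_id, left, right); train_binary is any binary learner taking a list
    # of (x, label) pairs and returning a classifier.

    def subtree_winner(x, tree, node_classifiers):
        # Winner of this subtree's tournament for input x, using the
        # classifiers trained so far (the same descent as at test time).
        while isinstance(tree, tuple):
            node_id, left, right = tree
            tree = right if node_classifiers[node_id](x) else left
        return tree

    def filter_tree_train(examples, tree, train_binary, node_classifiers=None):
        if node_classifiers is None:
            node_classifiers = {}
        if not isinstance(tree, tuple):
            return node_classifiers                  # leaf: nothing to train
        node_id, left, right = tree
        # Train the subtrees first; their predictions define this node's problem.
        filter_tree_train(examples, left, train_binary, node_classifiers)
        filter_tree_train(examples, right, train_binary, node_classifiers)
        binary_examples = []
        for x, y in examples:
            if y == subtree_winner(x, left, node_classifiers):
                binary_examples.append((x, 0))       # true label survives on the left
            elif y == subtree_winner(x, right, node_classifiers):
                binary_examples.append((x, 1))       # true label survives on the right
            # otherwise y was eliminated below, so the example is filtered out here
        node_classifiers[node_id] = train_binary(binary_examples)
        return node_classifiers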

COLT 2007

Registration for COLT 2007 is now open.

The conference will take place on 13-15 June, 2007, in San Diego, California, as part of the 2007 Federated Computing Research Conference (FCRC), which includes STOC, Complexity, and EC.

The website for COLT: http://www.learningtheory.org/colt2007/index.html

The early registration deadline is May 11, and the cutoff date for discounted hotel rates is May 9.

Before registering, take note that the fees are substantially lower for members of ACM and/or SIGACT than for nonmembers. If you’ve been contemplating joining either of these two societies (annual dues: $99 for ACM, $18 for SIGACT), now would be a good time!

Videolectures.net

Davor has been working to set up videolectures.net, which is the new site for the many lectures mentioned here. (Tragically, they seem to only be available in Windows Media format.) I went through my own projects and added a few links to the videos. The day when every result is a set of {paper, slides, video} isn’t quite here yet, but it’s within sight. (For many papers, of course, code is a 4th component.)