Sham Kakade points out that we are missing a bound.

Suppose we have *m* samples *x* drawn IID from some distribution *D*. Through the magic of the exponential moment method we know that:

- If the range of *x* is bounded by an interval of size *I*, a Chernoff/Hoeffding style bound gives us a bound on the deviations like $O(I/m^{0.5})$ (at least in crude form). A proof is on page 9 here.
- If the range of *x* is bounded, and the variance (or a bound on the variance) is known, then Bennett’s bound can give tighter results (*). This can be a huge improvement when the true variance is small.
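To make the gap between the two regimes concrete, here is a minimal sketch that compares deviation widths at confidence 1 - d. It uses the Bernstein-style closed form (a standard, slightly looser relative of Bennett's bound) rather than Bennett's bound itself, because that form inverts without numerical search. The function names and the sample numbers are illustrative assumptions, not taken from the post.

```python
import math


def hoeffding_width(interval, m, d):
    """One-sided Hoeffding deviation of the sample mean.

    Holds with probability >= 1 - d for m IID samples whose range
    lies in an interval of size `interval`.
    """
    return interval * math.sqrt(math.log(1.0 / d) / (2.0 * m))


def bernstein_width(variance, b, m, d):
    """One-sided Bernstein-style deviation of the sample mean.

    Assumes `variance` is the TRUE variance (or a valid upper bound on it)
    and that |X - E[X]| <= b. Holds with probability >= 1 - d.
    """
    log_term = math.log(1.0 / d)
    return math.sqrt(2.0 * variance * log_term / m) + b * log_term / (3.0 * m)


if __name__ == "__main__":
    m, d = 1000, 0.05
    print("Hoeffding:               ", hoeffding_width(1.0, m, d))
    print("Bernstein, variance 1e-3:", bernstein_width(1e-3, 1.0, m, d))
    print("Bernstein, variance 1/4: ", bernstein_width(0.25, 1.0, m, d))
```

In this sketch, plugging in the worst-case variance `interval**2 / 4` reproduces the Hoeffding width plus an extra positive term, which is consistent with the footnote (*) about worst-case variances.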

What’s missing here is a bound that depends on the observed variance rather than a bound on the variance. This means that many people attempt to use Bennett’s bound (incorrectly) by plugging the observed variance in as the true variance, invalidating the bound application. Most of the time, they get away with it, but this is a dangerous move when doing machine learning. In machine learning, we are typically trying to find a predictor with 0 expected loss. An observed loss of 0 (i.e. 0 training error) implies an observed variance of 0. Plugging this into Bennett’s bound, you can construct a wildly overconfident bound on the expected loss.
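The failure mode can be checked with a small calculation. The sketch below assumes a loss in {0, 1} with true expected loss p, and it applies a Bernstein-style bound (a close relative of Bennett's bound, chosen here because it has a closed form) with the observed variance plugged in as if it were the true variance. The specific values m = 50, d = 0.05 and p = 0.03 are illustrative assumptions.

```python
import math


def plug_in_bound(sample, d, b=1.0):
    """INCORRECT usage, shown only to illustrate the problem.

    Bernstein-style upper bound on the true mean in which the observed
    variance is treated as the true variance.
    """
    m = len(sample)
    u = sum(sample) / m
    v = sum((x - u) ** 2 for x in sample) / (m - 1)
    log_term = math.log(1.0 / d)
    return u + math.sqrt(2.0 * v * log_term / m) + b * log_term / (3.0 * m)


def prob_all_zero(m, p):
    """Probability that m IID Bernoulli(p) losses are all zero."""
    return (1.0 - p) ** m


if __name__ == "__main__":
    m, d, p = 50, 0.05, 0.03
    claimed = plug_in_bound([0.0] * m, d)  # zero training error
    print("claimed upper bound on expected loss:", claimed)
    print("true expected loss:                  ", p)
    print("probability of seeing this sample:   ", prob_all_zero(m, p))
```

With these numbers the claimed bound sits below the true expected loss, yet a sample with zero training error occurs with probability well above d, so the stated confidence level does not hold.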

One safe way to apply Bennett’s bound is to use McDiarmid’s inequality to bound the true variance given an observed variance, and then plug this bound on the true variance into Bennett’s bound (making sure to share the confidence parameter between both applications) on the mean. This is a clumsy and relatively inelegant method.
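A minimal sketch of this two-step recipe follows. It rests on two assumptions that are not in the post: the McDiarmid step uses a per-sample sensitivity of `interval**2 / m` for the unbiased sample variance, and the second step uses the Bernstein-style closed form in place of Bennett's bound. The confidence parameter d is split evenly between the two steps (a union bound), and all names are hypothetical.

```python
import math


def sample_mean_and_variance(sample):
    """Return the sample mean and the unbiased sample variance."""
    m = len(sample)
    u = sum(sample) / m
    v = sum((x - u) ** 2 for x in sample) / (m - 1)
    return u, v


def two_step_upper_bound(sample, d, interval=1.0):
    """Upper bound on the true mean holding with probability >= 1 - d.

    Step 1 bounds the true variance from the observed variance
    (McDiarmid, confidence d/2). Step 2 plugs that bound into a
    Bernstein-style inequality for the mean (confidence d/2).
    """
    m = len(sample)
    u, v = sample_mean_and_variance(sample)
    log_term = math.log(2.0 / d)

    # Step 1: assumed sensitivity c_i = interval**2 / m for the sample variance.
    var_upper = v + interval ** 2 * math.sqrt(log_term / (2.0 * m))
    # The variance of any variable with range `interval` is at most interval**2 / 4.
    var_upper = min(var_upper, interval ** 2 / 4.0)

    # Step 2: Bernstein-style deviation using the variance upper bound.
    return u + math.sqrt(2.0 * var_upper * log_term / m) + interval * log_term / (3.0 * m)


if __name__ == "__main__":
    # Zero training error on 50 examples no longer yields a near-zero bound.
    print(two_step_upper_bound([0.0] * 50, d=0.05))
```

The slack added to the variance in step 1 shrinks only like one over the square root of m, which is one way to see why the method is clumsy when the observed variance is tiny.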

There should exist a better bound. If we let the observed mean of a sample *S* be *u(S)* and the observed variance be *v(S)*, there should exist a bound which requires only a bounded range (like Chernoff), yet which is almost as tight as the Bennett bound. It should have the form:

$$\Pr_{S \sim D^m}\left( E_{x \sim D}\, x \le f(u(S), v(S), d) \right) \ge 1 - d$$

For machine learning, a bound of this form may help design learning algorithms which learn by directly optimizing bounds. However, there are many other applications both within and beyond machine learning.

(*) Incidentally, sometimes people try to apply the Bennett inequality when they only know the range of the random variable by computing the worst case variance within that range. This is never as good as a proper application of the Chernoff/Hoeffding bound.

This does not really answer the problem, but it is possible to do better(*) than McDiarmid’s inequality when you want to upper bound the true variance of a bounded random variable by using the empirical variance. Basically one can show a Bernstein-type inequality for the variance where the ‘variance of the variance’ is replaced by B.V (B is the bound on the variable, V the variance itself) which you can then easily invert for an empirical bound on V.

When you plug that into Bernstein’s inequality for the mean, you can obtain a bound with the same general form as Bernstein’s but where the true variance is replaced by the empirical one… and with noticeably worse multiplicative constants. Of course getting a tighter inequality in 1 step remains the challenge here.

(*) better in the same sense that Bennett or Bernstein are better than Hoeffding/Chernoff, i.e. only when the sqrt(variance) is significantly smaller than the absolute bound on the variable.

What I do like about the Bennett bound is that it gets the right rate for both the Poisson limit and the normal limit. But I often find this difficult to see. So I wrote a little page that describes ways of seeing these limits without. See my notes.

It’s not clear to me from the wikipedia article how to use McDiarmid’s inequality to bound the true variance – could you briefly give or post a pointer to the resulting bound?

If we let u = the empirical mean of m variables, then an unbiased estimate of the variance is given by:

$$\frac{1}{m-1} \sum_{i} (X_i - u)^2$$

This function is stable with respect to each input variable, so McDiarmid’s inequality applies.
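One way to make the stability claim explicit is sketched below, under the assumption that every $X_i$ lies in $[0,1]$. The symbols $V$, $c_k$, $\sigma^2$ and $t$ are introduced here for illustration only. The unbiased estimate can be rewritten as an average over pairs, and replacing a single $X_k$ changes $m-1$ of the pairwise terms by at most 1 each:

$$
\begin{aligned}
V &= \frac{1}{m-1}\sum_{i}(X_i - u)^2 = \frac{1}{m(m-1)}\sum_{i<j}(X_i - X_j)^2, \\
c_k &\le \frac{m-1}{m(m-1)} = \frac{1}{m}, \\
\Pr\left(E[V] - V \ge t\right) &\le \exp\left(-\frac{2t^2}{\sum_k c_k^2}\right) = \exp\left(-2 m t^2\right).
\end{aligned}
$$

Since $E[V] = \sigma^2$ (the true variance), this gives, with probability at least $1 - d$,

$$\sigma^2 \le V + \sqrt{\frac{\ln(1/d)}{2m}}.$$

For a range of size $I$ rather than 1, the same argument scales the slack term by $I^2$.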

A bound of this type was given in Theorem 1 of our paper with Jean-Yves Audibert and Remi Munos (an early version was presented at this year's NIPS workshop). The bound says that with probability $1 - 3e^{-x}$ the difference between the true mean and the sample average can be bounded by

$$\sqrt{2 V_t x / t} + 3 b x / t.$$

Here $V_t$ is the empirical estimate of the variance and $b$ is the size of the interval that contains the samples. There are also some other variants. We were also surprised not to find this inequality in the literature! The proof uses martingale arguments (similar to the standard ones), plus the "square root trick".

What do the bounds c_i (using the Wikipedia notation) work out to be? It seems like they would be unbounded if you can substitute an arbitrary value for any x_i. Is there some restriction on substitute values? Can the x_i not be real-valued?

This is quite interesting.

The result seems to be almost the same as what you get by a careful combination of McDiarmid’s inequality + the Bennett inequality. In fact, I’m not sure which approach is better. Have you done a careful comparison? I didn’t see one in the tech report.

We typically think of the $x_i$ as bounded in [0,1]. When the loss isn’t actually bounded in that interval, we rescale the bound to whatever interval it is bounded in.

No, we have not compared the two approaches. It would be good to know which is better!

To obtain a bound of order 1/t in the best cases (i.e. when the empirical variance is O(1/t)), it is important not to use McDiarmid’s inequality (which only gives bounds of order 1/sqrt{t}).

So McDiarmid’s inequality + Bennett’s inequality is not as good as Bennett’s (or Bernstein’s) inequality on both the sum of the variables and the sum of the squares.
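The rate argument can be illustrated numerically. The sketch below evaluates the empirical bound quoted earlier in this thread, sqrt(2 V_t x / t) + 3 b x / t with probability 1 - 3e^{-x}, by setting x = ln(3/d). It also computes the additive slack that a McDiarmid-based variance estimate would carry, assuming a sensitivity of b²/t for the sample variance. Function names are hypothetical, and this is not a comparison taken from the paper.

```python
import math


def empirical_bernstein_width(v_emp, b, t, d):
    """Width sqrt(2 V_t x / t) + 3 b x / t with x chosen so that 3 e^{-x} = d."""
    x = math.log(3.0 / d)
    return math.sqrt(2.0 * v_emp * x / t) + 3.0 * b * x / t


def mcdiarmid_variance_slack(b, t, d):
    """Additive slack on the variance from McDiarmid's inequality.

    Assumes the sample variance has per-sample sensitivity b**2 / t.
    """
    return b ** 2 * math.sqrt(math.log(1.0 / d) / (2.0 * t))


if __name__ == "__main__":
    d, b = 0.05, 1.0
    for t in (1000, 2000, 4000):
        print(t,
              empirical_bernstein_width(0.0, b, t, d),
              mcdiarmid_variance_slack(b, t, d))
```

With zero empirical variance, doubling t halves the empirical width (order 1/t), while the McDiarmid slack shrinks only by a factor of sqrt(2) (order 1/sqrt(t)).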


Andreas Maurer has his own derivation of a bound here.

Andreas Maurer’s inequality still bounds the true variance with a probability parameterized by the *true variance*, which isn’t exactly what we want, is it?

The link to the paper referenced in “Comment by Csaba Szepesvari 2007-05-12 18:45:38” seems to be missing. As far as I can tell, the title of the paper is: Use of variance estimation in the multi-armed bandit problem