Nati Srebro and Shai Ben-David have a paper at COLT which, in the appendix, proves something very striking: several previous error bounds are *always* greater than 1.

**Background** One branch of learning theory focuses on theorems which

- Assume samples are drawn IID from an unknown distribution
*D*. - Fix a set of classifiers
- Find a high probability bound on the maximum true error rate (with respect to
*D*) as a function of the empirical error rate on the training set.

Many of these bounds become extremely complex and hairy.

**Current** Everyone working on this subject wants “tighter bounds”, however there are different definitions of “tighter”. Some groups focus on “functional tightness” (getting the right functional dependency between the size of the training set and a parameterization of the hypothesis space) while others focus on “practical tightness” (finding bounds which work well on practical problems). (I am definitely in the second camp.)

One of the dangers of striving for “functional tightness” is that the bound can depend on strangely interrelated parameters. In fact, apparently these strange interrelations can become so complex that they end up *always* larger than 1 (some bounds here and here).

It seems we should ask the question: “Why are we doing the math?” If it is just done to get a paper accepted under review, perhaps this is unsatisfying. The real value of math comes when it guides us in designing learning algorithms. Math from bounds greater than 1 is a dangerously weak motivation for learning algorithm design.

There is a significant danger in taking this “oops” too strongly.

- There exist some reasonable arguments (not made here) for aiming at functional tightness.
- The value of the research a person does is more related to the best they have done than the worst.

Looking at the referenced paper, this seems to be focussed specifically on the area of kernel learning, and specifically on a single paper by Lanckriet et al. I believe that the vast majority of sample complexity bounds derived are decreasing in the number of samples (while of course not being usably tight for practical data set sizes). More generally, I would suspect that data dependent bounds are much more prone to this sort of trouble.

I’m a bit confused by your use of the term “sample complexity bound”. I was under the impression that the sample complexity measures how many examples are required to reach a particular error rate with high probability. In the paper you are discussing the authors show that the

error boundis always greater than one which is meaningless. I interpret asample complexitybound being greater than one meaning you always need at least one example – entirely reasonable I would have thought.I agree with everything else you say if I read your initial “sample complexity” as “error” (you only talk about “bounds” after the first sentence).

I do not see how the fact that some always-greater-than-one bounds

were accidentally published relates to the usefulness of “functionally

tight” bounds.

Studying the dependence of a bound on various parameters, and how a

bound approaches zero as the sample size increases, can often tell us

quite a bit about the problem.

Take for example our paper whose appendix you mention here. We study

learning a kernel for SVMs (or other kernel-based methods). The

motivation for this paper is understanding what the extra sample cost,

i.e. how much additional data is required, when the kernel is learned

from data rather then fixed. This is very commonly done, e.g. when a

few parameters of the kernel are ‘tuned’, ‘tweaked’ or ‘learned’ based

on the training data, or when a few possible base kernels are

combined. Previous bounds suggest that learning a few kernel

parameters would result in a multiplicative increase in the required

sample size—quite prohibitive. In the paper, we show that in-fact

only a roughly additive number of extra samples are needed to account

for every additional kernel parameter (in a very general sense). This

is a much more reasonable price to pay, and suggests that it is

theoretically “OK” to do this.

Of course, to get any kind of understanding, one must look at the

bound and study its behavior (numerical or functional). When the

bounds is always greater than one, studying the bound should reveal

this, and that no understanding can be gained from it, as was the case

here. This is true both for understanding functional dependencies

through a bound, and for using a bound numerically. However, as Rif

notes, most bounds do in fact go to zero as the sample sizes increases.

(I should also note that the bounds that are always greater than one

are specific cases noted in the papers cited. It is certainly not the

case that all the bounds in those papers are meaningless. See our

appendix for full details.)

Good spot. I was being sloppy in nomenclature. (I tweaked the post.)