Nati Srebro and Shai Ben-David have a paper at COLT which, in the appendix, proves something very striking: several previous error bounds are always greater than 1.
Background
One branch of learning theory focuses on theorems which:
- Assume samples are drawn IID from an unknown distribution D.
- Fix a set of classifiers.
- Find a high probability bound on the maximum true error rate (with respect to D) as a function of the empirical error rate on the training set.
Many of these bounds become extremely complex and hairy.
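For concreteness, here is a minimal example of a theorem in this style (the standard Hoeffding-plus-union-bound statement for a finite classifier set; it is illustrative only and is not one of the bounds discussed below):

```latex
% For a finite classifier set H and m IID samples from D, with
% probability at least 1 - \delta, every h \in H satisfies:
\operatorname{err}_D(h) \;\le\; \widehat{\operatorname{err}}(h)
  \;+\; \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}
```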
Current
Everyone working on this subject wants “tighter bounds”, but there are different definitions of “tighter”. Some groups focus on “functional tightness” (getting the right functional dependency between the size of the training set and a parameterization of the hypothesis space), while others focus on “practical tightness” (finding bounds which work well on practical problems). (I am definitely in the second camp.)
One of the dangers of striving for “functional tightness” is that the bound can depend on strangely interrelated parameters. In fact, these interrelations can apparently become so complex that the resulting bounds end up always larger than 1 (some bounds here and here).
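To see concretely why a bound larger than 1 is useless: error rates live in [0, 1], so an upper bound above 1 says nothing. As a toy illustration (using the simple finite-class bound sketched above, not the kernel-learning bounds at issue here), plugging in practical-scale numbers can already push the bound past 1:

```python
import math

def occam_bound(train_err, log_num_classifiers, m, delta=0.05):
    """Finite-class upper bound on true error:
    train_err + sqrt((ln|H| + ln(1/delta)) / (2m))."""
    return train_err + math.sqrt((log_num_classifiers + math.log(1.0 / delta)) / (2.0 * m))

# Toy numbers (illustrative only): a classifier set indexed by 100,000 bits,
# so ln|H| = 100,000 * ln 2, with 10,000 training examples.
bound = occam_bound(train_err=0.0, log_num_classifiers=100_000 * math.log(2), m=10_000)
print(bound)  # ~1.86 -- greater than 1, so the "bound" carries no information.
```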
It seems we should ask the question: “Why are we doing the math?” If it is done just to get a paper through review, perhaps that is unsatisfying. The real value of the math comes when it guides us in designing learning algorithms, and math that yields bounds greater than 1 is a dangerously weak motivation for learning algorithm design.
There is a significant danger in taking this “oops” too strongly.
- There exist some reasonable arguments (not made here) for aiming at functional tightness.
- The value of the research a person does is more related to the best they have done than the worst.
Looking at the referenced paper, it seems to be focused specifically on the area of kernel learning, and in particular on a single paper by Lanckriet et al. I believe that the vast majority of sample complexity bounds derived are decreasing in the number of samples (while of course not being usably tight for practical data set sizes). More generally, I would suspect that data-dependent bounds are much more prone to this sort of trouble.
I’m a bit confused by your use of the term “sample complexity bound”. I was under the impression that sample complexity measures how many examples are required to reach a particular error rate with high probability. In the paper you are discussing, the authors show that the error bound is always greater than one, which is meaningless. A sample complexity bound greater than one I would interpret as meaning you always need at least one example, which seems entirely reasonable.
I agree with everything else you say if I read your initial “sample complexity” as “error” (you only talk about “bounds” after the first sentence).
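For reference, the two notions are related by inversion; in generic notation (introduced here for illustration, not taken from the post or the paper):

```latex
% Error-bound form: with probability at least 1 - \delta over m IID samples,
\operatorname{err}_D(h) \;\le\; \widehat{\operatorname{err}}(h) + \epsilon(m, \delta)
% Sample-complexity form: samples needed to guarantee excess error at most \epsilon,
\quad m(\epsilon, \delta) \;=\; \min\{\, m : \epsilon(m, \delta) \le \epsilon \,\}
```

An error bound that never drops below 1 is vacuous; in sample-complexity terms, no finite m achieves any nontrivial ε.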
I do not see how the fact that some always-greater-than-one bounds were accidentally published relates to the usefulness of “functionally tight” bounds.
Studying the dependence of a bound on various parameters, and how a bound approaches zero as the sample size increases, can often tell us quite a bit about the problem.
Take for example our paper whose appendix you mention here. We study learning a kernel for SVMs (or other kernel-based methods). The motivation for this paper is understanding the extra sample cost, i.e. how much additional data is required, when the kernel is learned from data rather than fixed. This is very commonly done, e.g. when a few parameters of the kernel are ‘tuned’, ‘tweaked’ or ‘learned’ based on the training data, or when a few possible base kernels are combined. Previous bounds suggest that learning a few kernel parameters would result in a multiplicative increase in the required sample size, which would be quite prohibitive. In the paper, we show that in fact only a roughly additive number of extra samples is needed to account for every additional kernel parameter (in a very general sense). This is a much more reasonable price to pay, and suggests that it is theoretically “OK” to do this.
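Schematically (constants and logarithmic factors suppressed, notation simplified rather than the papers' exact theorem statements), with k learned kernel parameters or combined base kernels, margin γ, and sample size m, the contrast between the two styles of estimation-error bound is roughly:

```latex
% Schematic only: k = number of learned kernel parameters / base kernels,
% \gamma = margin, m = sample size; constants and log factors suppressed.
\text{multiplicative in } k:\quad \tilde{O}\!\left(\sqrt{\tfrac{k/\gamma^{2}}{m}}\right)
\qquad\text{vs.}\qquad
\text{additive in } k:\quad \tilde{O}\!\left(\sqrt{\tfrac{k + 1/\gamma^{2}}{m}}\right)
```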
Of course, to get any kind of understanding, one must look at the bound and study its behavior (numerical or functional). When the bound is always greater than one, studying it should reveal this, and also that no understanding can be gained from it, as was the case here. This is true both for understanding functional dependencies through a bound and for using a bound numerically. However, as Rif notes, most bounds do in fact go to zero as the sample size increases.
(I should also note that the bounds that are always greater than one are specific cases noted in the papers cited. It is certainly not the case that all the bounds in those papers are meaningless. See our appendix for full details.)
Good spot. I was being sloppy in nomenclature. (I tweaked the post.)