A question of quantification – Machine Learning (Theory)

This is about methods for phrasing and think about the scope of some theorems in learning theory. The basic claim is that there are several different ways of quantifying the scope which sound different yet are essentially the same.

For all sequences of examples. This is the standard quantification in online learning analysis. Standard theorems would say something like “for all sequences of predictions by experts, the algorithm A will perform almost as well as the best expert.”
For all training sets. This is the standard quantification for boosting analysis such as adaboost or multiclass boosting.
Standard theorems have the form “for all training sets the error rate inequalities … hold”.
For all distributions over examples. This is the one that we have been using for reductions analysis. Standard theorem statements have the form “For all distributions over examples, the error rate inequalities … hold”.

It is not quite true that each of these is equivalent. For example, in the online learning setting, quantifying “for all sequences of examples” implies “for all distributions over examples”, but not vice-versa.

However, in the context of either boosting or reductions these are equivalent because the algorithms operate in an element-wise fashion. To see the equivalence, note that:

“For any training set” is equivalent to “For any sequence of examples” because a training set is a sequence and vice versa.
“For any sequence of examples” is equivalent to “For any distribution over examples” when the theorems are about unconditional example transformations because:
1. The uniform distribution over a sufficiently long sequence of examples can approximate any distribution we care about arbitrarily well.
2. If the theorem holds “for all distributions”, it holds for the uniform distribution over the elements in any sequence of examples.

The natural debate here is “how should the theorems be quantified?” It is difficult to answer this debate based upon mathematical grounds because we just showed an equivalence. It is nevertheless important because it strongly influences how we think about algorithms and how easy it is to integrate the knowledge across different theories. Here are the arguments I know.

For all sequences of examples.
1. Learning theory people (at least) are used to thinking about “For all sequences of examples”.
2. (Applied) Machine learning people are not so familiar with this form of quantification.
3. When the algorithm is example-conditional such as in online learning, the quantification is more general than “for all distributions”.
For all training sets.
1. This is very simple.
2. It is misleadingly simple. For example, a version of the adaboost theorem also applies to test sets using the test error rates of the base classifiers. It is fairly common for this to be misunderstood.
For all distributions over examples.
1. Distributions over examples is simply how most people think about learning problems.
2. “For all distributions over examples” is easily and often confused with “For all distributions over examples accessed by IID draws”. It seems most common to encounter this confusion amongst learning theory folks.

What quantification should be used and why?
(My thanks to Yishay Mansour for clarifying the debate.)

One Reply to “A question of quantification”

Yishay Mansour says:

11/30/2005 at 1:22 am

I would like to this interesting discussion on sequences by recalling Yao’s Lemma regarding average case complexity versus randomized algorithms.

The time complexity of a randomized algorithm R is its expected running time for the worse case input. The randomized complexity of a problem is the time complexity of the best randomized algorithm R.

The distributional complexity of a deterministic algorithm A with respect to a distribution D is the expected running time of A when the input is selected according to D. The distributional complexity of a distribution D is the best case over all deterministic algorithms A, and the distributional complexity of a problem is the distributional complexity of the worse distribution D.

Yao’s Lemma states that for any problem the distributional complexity equals the randomized complexity. [The proof is an application of the minimax theorem of zero-sum games.]

Yao’s Lemma gives a very nice connection between a distribution over inputs and a worse case input, but allowing randomization inside an algorithm.

Comments are closed.