The ideal of theoretical algorithm analysis is to construct an algorithm with accompanying optimality theorems proving that it is a useful algorithm. This ideal often fails, particularly for learning algorithms and theory. The general form of a theorem is: If preconditions Then postconditions
When we design learning algorithms it is very common to come up with precondition assumptions such as “the data is IID”, “the learning problem is drawn from a known distribution over learning problems”, or “there is a perfect classifier”. All of these example preconditions can be false for real-world problems in ways that are not easily detectable. This means that algorithms derived and justified by these very common forms of analysis may be prone to catastrophic failure in routine (mis)application.
We can hope for better. Several different kinds of learning algorithm analysis have been developed some of which have fewer preconditions. Simply demanding that these forms of analysis be used may be too strong—there is an unresolved criticism that these algorithm may be “too worst case”. Nevertheless, it is possible to have a learning algorithm that simultaneously provides strong postconditions given strong preconditions, reasonable postconditions given reasonable preconditions, and weak postconditions given weak preconditions. Some examples of this I’ve encountered include:
- Sham, Matthias and Dean showing that some Bayesian regression is robust in a minimax online learning analysis.
- The cover tree which creates an O(n) datastructure for nearest neighbor queries while simultaneously guaranteeing O(log(n)) query time when the metric obeys a dimensionality constraint.
The basic claim is that algorithms with a good fallback analysis are significantly more likely to achieve the theoretical algorithm analysis ideal. Both of the above algorithms have been tested in practice and found capable.
Several significant difficulties occur for anyone working on fallback analysis.
- It’s harder. This is probably the most valid reason—people have limited time to do things. Nevertheless, it is reasonable to hope that the core techniques used by many people have had this effort put into them.
- It is psychologically difficult to both assume and not assume a precondition, for a researcher. A critical valuable resource here is observing multiple forms of analysis.
- It is psychologically difficult for a reviewer to appreciate the value of both assuming and not assuming some precondition. This is a matter of education.
- It is neither “sexy” nore straightforward. In particular, theoretically inclined people 1) get great joy from showing that something new is possible and 1) routinely work on papers of the form “here is a better algorithm to do X given the same assumptions”. A fallback analysis requires a change in assumption invalidating (2) and the new thing that it shows for (1) is subtle: that two existing guarantees can hold for the same algorithm. My hope here is that this subtlety becomes better appreciated in time—making useful algorithms has a fundamental sexiness of it’s own.
One of the central concerns of learning is to understand and to
prevent overfitting. Various notion of “function complexity” often
arise: VC dimension, Rademacher complexity, comparison classes of
experts, and program length are just a few.
The term “complexity” to me seems somehow misleading; the terms never
capture something that meets my intuitive notion of complexity. The
Bayesian notion clearly captures what’s going on. Functions aren’t
“complex”– they’re just “surprising”: we assign to them low
probability. Most (all?) complexity notions I know boil down
to some (generally loose) bound on the prior probability of the function.
In a sense, “complexity” fundementally arises because probability
distributions must sum to one. You can’t believe in all possibilities
at the same time, or at least not equally. Rather you have to
carefully spread the probability mass over the options you’d like to
consider. Large complexity classes means that beliefs are spread
thinly. In it’s simplest form, this phenomenom give the log (1\n) for
n hypotheses in classic PAC bounds.
In fact, one way to think about good learning algorithms is that they
are those which take full advantage of their probability mass.
In the language of Minimum Description Length, they correspond to
So this raises a question: are there notions of complexity (preferably finite,
computable ones) that differ fundementally from the notions of “prior”
or “surprisingness”? Game-theoretic setups would seem to be promising,
although much of the work I’m familiar with ties it closely to the notion
of prior as well.
“Science” has many meanings, but one common meaning is “the scientific method” which is a principled method for investigating the world using the following steps:
- Form a hypothesis about the world.
- Use the hypothesis to make predictions.
- Run experiments to confirm or disprove the predictions.
The ordering of these steps is very important to the scientific method. In particular, predictions must be made before experiments are run.
Given that we all believe in the scientific method of investigation, it may be surprising to learn that cheating is very common. This happens for many reasons, some innocent and some not.
- Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine their effect. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here: including “reprobleming”, “data set selection”, and probably “overfitting by review”. It isn’t too surprising to observe this: when the testers of a drug have $109 or more riding on the outcome the temptation to make the outcome “right” is extreme.
- Wrong experiments. When conducting experiments of some new phenomena, it is common for the experimental apparatus to simply not work right. In that setting, throwing out the “bad data” can make the results much cleaner… or it can simply be cheating. Millikan did this in the ‘oil drop’ experiment which measured the electron charge.
Done right, allowing some kinds of “cheating” may be helpful to the progress of science since we can more quickly find the truth about the world. Done wrong, it results in modern nightmares like painkillers that cause heart attacks. (Of course, the more common outcome is that the drugs effectiveness is just overstated.)
A basic question is “How do you do it right?” And a basic answer is “With prediction theory bounds”. Each prediction bound has a number of things in common:
- They assume that the data is independently and identically drawn. This is well suited to experimental situations where experimenters work very hard to make different experiments be independent. In fact, this is a better fit than typical machine learning applications where independence of the data is typically more questionable or simply false.
- They make no assumption about the distribution that the data is drawn from. This is important for experimental testing of predictions because the distribution that observations are expected to come from is a part of the theory under test.
These two properties above form an ‘equivalence class’ over different mathematical bounds where each bound can be trusted to an equivalent degree. Inside of this equivalent class there are several that may be helpful in determining whether deviations from the scientific method are reasonable or not.
- The most basic test set bound corresponds to the scientific method above.
- The Occam’s Razor bound allows a careful reordering of steps (1), (2) and step (3). More “interesting” bounds like the VC-bound and the PAC-Bayes bound allow more radical alterations of these steps. Several are discussed here.
- The Sample Compression bound allows careful disposal of some datapoints.
- Progressive Validation bounds (such as here, here or here) allow hypotheses to be safely reformulated in arbitrary ways as experiments progress.
Scientific experimenters looking for a little extra flexibility in the scientific method may find these approaches useful. (And if they don’t, maybe there is another bound in this equivalence class that needs to be worked out.)
I found Tong Zhang’s paper on Data Dependent Concentration Bounds for Sequential Prediction Algorithms interesting. Roughly speaking, it states a tight bound on the future error rate for online learning algorithms assuming that samples are drawn independently. This bound is easily computed and will make the progressive validation approaches used here significantly more practical.
I just presented the cross validation problem at COLT.
The problem now has a cash prize (up to $500) associated with it—see the presentation for details.
The write-up for colt.