The Everything Ensemble Edge

Rich Caruana, Alexandru Niculescu, Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is:

  1. Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings.
  2. Run the methods on each problem of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression.
  3. For each loss of {accuracy, area under the ROC curve, cross entropy, squared error, etc…} evaluate the average performance of the method.

A series of conclusions can be drawn from the observations.

  1. (Calibrated) boosted decision trees appear to perform best, in general although support vector machines and neural networks give credible near-best performance.
  2. The metalearning algorithm which simply chooses the best (based upon a small validation set) performs much better.
  3. A metalearning algorithm which combines the predictors in an ensemble using stepwise refinement of validation set performance appears to perform even better.

There are a number of caveats to this work: it was only applied on large datasets there is no guarantee that the datasets are representative of your problem (although efforts were made to be representative in general), and the size of the training set was fixed rather than using the natural size given by the problem. Despite all these caveats, the story told above seems compelling: if you want maximum performance, you must try many methods and somehow combine them.

The most significant drawback of this method is computational complexity. Techniques for reducing the computational complexity are therefore of significant interest. It seems plausible that there exists some learning algorithm which typically performs well whenever any of the above algorithms can perform well at a computational cost which is significantly less than “run all algorithm on all settings and test”.

A fundamental unanswered question here is “why?” in several forms. Why have the best efforts of many machine learning algorithm designers failed to capture all the potential predictive strength into a single coherent learning algorithm? Why do ensembles give such a significant consistent edge in practice? A great many papers follow the scheme: invent a new way to create ensembles, test, observe that it improves prediction performance at the cost of more computation, and publish. There are several pieces of theory explain individual ensemble methods, but we seem to have no convincing theoretical statement explaining why they almost always work.

Fallback Analysis is a Secret to Useful Algorithms

The ideal of theoretical algorithm analysis is to construct an algorithm with accompanying optimality theorems proving that it is a useful algorithm. This ideal often fails, particularly for learning algorithms and theory. The general form of a theorem is:

If preconditions Then postconditions
When we design learning algorithms it is very common to come up with precondition assumptions such as “the data is IID”, “the learning problem is drawn from a known distribution over learning problems”, or “there is a perfect classifier”. All of these example preconditions can be false for real-world problems in ways that are not easily detectable. This means that algorithms derived and justified by these very common forms of analysis may be prone to catastrophic failure in routine (mis)application.

We can hope for better. Several different kinds of learning algorithm analysis have been developed some of which have fewer preconditions. Simply demanding that these forms of analysis be used may be too strong—there is an unresolved criticism that these algorithm may be “too worst case”. Nevertheless, it is possible to have a learning algorithm that simultaneously provides strong postconditions given strong preconditions, reasonable postconditions given reasonable preconditions, and weak postconditions given weak preconditions. Some examples of this I’ve encountered include:

  1. Sham, Matthias and Dean showing that some Bayesian regression is robust in a minimax online learning analysis.
  2. The cover tree which creates an O(n) datastructure for nearest neighbor queries while simultaneously guaranteeing O(log(n)) query time when the metric obeys a dimensionality constraint.

The basic claim is that algorithms with a good fallback analysis are significantly more likely to achieve the theoretical algorithm analysis ideal. Both of the above algorithms have been tested in practice and found capable.

Several significant difficulties occur for anyone working on fallback analysis.

  1. It’s harder. This is probably the most valid reason—people have limited time to do things. Nevertheless, it is reasonable to hope that the core techniques used by many people have had this effort put into them.
  2. It is psychologically difficult to both assume and not assume a precondition, for a researcher. A critical valuable resource here is observing multiple forms of analysis.
  3. It is psychologically difficult for a reviewer to appreciate the value of both assuming and not assuming some precondition. This is a matter of education.
  4. It is neither “sexy” nore straightforward. In particular, theoretically inclined people 1) get great joy from showing that something new is possible and 1) routinely work on papers of the form “here is a better algorithm to do X given the same assumptions”. A fallback analysis requires a change in assumption invalidating (2) and the new thing that it shows for (1) is subtle: that two existing guarantees can hold for the same algorithm. My hope here is that this subtlety becomes better appreciated in time—making useful algorithms has a fundamental sexiness of it’s own.

Complexity: It’s all in your head

One of the central concerns of learning is to understand and to
prevent overfitting. Various notion of “function complexity” often
arise: VC dimension, Rademacher complexity, comparison classes of
experts, and program length are just a few.

The term “complexity” to me seems somehow misleading; the terms never
capture something that meets my intuitive notion of complexity. The
Bayesian notion clearly captures what’s going on. Functions aren’t
“complex”– they’re just “surprising”: we assign to them low
probability. Most (all?) complexity notions I know boil down
to some (generally loose) bound on the prior probability of the function.

In a sense, “complexity” fundementally arises because probability
distributions must sum to one. You can’t believe in all possibilities
at the same time, or at least not equally. Rather you have to
carefully spread the probability mass over the options you’d like to
consider. Large complexity classes means that beliefs are spread
thinly. In it’s simplest form, this phenomenom give the log (1\n) for
n hypotheses in classic PAC bounds.

In fact, one way to think about good learning algorithms is that they
are those which take full advantage of their probability mass.
In the language of Minimum Description Length, they correspond to
“non-defective distributions”.

So this raises a question: are there notions of complexity (preferably finite,
computable ones) that differ fundementally from the notions of “prior”
or “surprisingness”? Game-theoretic setups would seem to be promising,
although much of the work I’m familiar with ties it closely to the notion
of prior as well.

Question: “When is the right time to insert the loss function?”

Hal asks a very good question: “When is the right time to insert the loss function?” In particular, should it be used at testing time or at training time?

When the world imposes a loss on us, the standard Bayesian recipe is to predict the (conditional) probability of each possibility and then choose the possibility which minimizes the expected loss. In contrast, as the confusion over “loss = money lost” or “loss = the thing you optimize” might indicate, many people ignore the Bayesian approach and simply optimize their loss (or a close proxy for their loss) over the representation on the training set.

The best answer I can give is “it’s unclear, but I prefer optimizing the loss at training time”. My experience is that optimizing the loss in the most direct manner possible typically yields best performance. This question is related to a basic principle which both Yann LeCun(applied) and Vladimir Vapnik(theoretical) advocate: “solve the simplest prediction problem that solves the problem”. (One difficulty with this principle is that ‘simplest’ is difficult to define in a satisfying way.)

One reason why it’s unclear is that optimizing an arbitrary loss is not an easy thing for a learning algorithm to cope with. Learning reductions (which I am a big fan of) give a mechanism for doing this, but they are new and relatively untried.

Drew Bagnell adds: Another approach to integrating loss functions into learning is to try to re-derive ideas about probability theory appropriate for other loss functions. For instance, Peter Grunwald and A.P. Dawid present a variant on maximum entropy learning. Unfortunately, it’s even less clear how often these approaches lead to efficient algorithms.

Advantages and Disadvantages of Bayesian Learning

I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of:

  1. Specify a prior over world models.
  2. Integrate using Bayes law with respect to all observed information to compute a posterior over world models.
  3. Predict according to the posterior.

Bayesian learning has many advantages over other learning programs:

  1. Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe of “think harder” (about specifying a prior over world models) and “compute harder” (to calculate a posterior) will eventually succeed. Many other machine learning approaches don’t have this guarantee.
  2. Language Bayesian and near-Bayesian methods have an associated language for specifying priors and posteriors. This is significantly helpful when working on the “think harder” part of a solution.
  3. Intuitions Bayesian learning involves specifying a prior and integration, two activities which seem to be universally useful. (see intuitions).

With all of these advantages, Bayesian learning is a strong program. However, there are also some very significant disadvantages.

  1. Information theoretically infeasible It turns out that specifying a prior is extremely difficult. Roughly speaking, we must specify a real number for every setting of the world model parameters. Many people well-versed in Bayesian learning don’t notice this difficulty for two reasons:
    1. They know languages allowing more compact specification of priors. Acquiring this knowledge takes some signficant effort.
    2. They lie. They don’t specify their actual prior, but rather one which is convenient. (This shouldn’t be taken too badly, because it often works.)
  2. Computationally infeasible Let’s suppose I could accurately specify a prior over every air molecule in a room. Even then, computing a posterior may be extremely difficult. This difficulty implies that computational approximation is required.
  3. Unautomatic The “think harder” part of the Bayesian research program is (in some sense) a “Bayesian employment” act. It guarantees that as long as new learning problems exist, there will be a need for Bayesian engineers to solve them. (Zoubin likes to counter that a superprior over all priors can be employed for automation, but this seems to add to the other disadvantages.)

Overall, if a learning problem must be solved a Bayesian should probably be working on it and has a good chance of solving it.
I wish I knew whether or not the drawbacks can be convincingly addressed. My impression so far is “not always”.