The Everything Ensemble Edge

Rich Caruana, Alexandru Niculescu, Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is:

  1. Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings.
  2. Run each method on each of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression.
  3. For each metric in {accuracy, area under the ROC curve, cross entropy, squared error, etc…}, evaluate the average performance of each method. (A rough sketch of this pipeline appears below.)
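
To make the procedure concrete, here is a rough sketch of what such a pipeline might look like in modern scikit-learn terms. This is not the authors’ code: the model list and parameter sweep are drastically simplified, and the library calls are merely stand-ins for whatever implementations were actually used.

```python
# A drastically simplified sketch: fit several model families, calibrate
# their scores (sigmoid here; isotonic is the other option), and score each
# calibrated model on several metrics.  Parameter grids are omitted.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "tree": DecisionTreeClassifier(),
    "boosted_trees": GradientBoostingClassifier(),
    "bagged_trees": BaggingClassifier(DecisionTreeClassifier()),
    "boosted_stumps": GradientBoostingClassifier(max_depth=1),
    "knn": KNeighborsClassifier(),
    "neural_net": MLPClassifier(max_iter=500),
    "svm": SVC(),
}

for name, clf in models.items():
    # Calibrate raw scores/margins into probabilities on held-out folds.
    calibrated = CalibratedClassifierCV(clf, method="sigmoid", cv=3)
    calibrated.fit(X_train, y_train)
    proba = calibrated.predict_proba(X_test)
    print(name, {
        "accuracy": accuracy_score(y_test, proba.argmax(axis=1)),
        "auc": roc_auc_score(y_test, proba[:, 1]),
        "cross_entropy": log_loss(y_test, proba),
        "squared_error": np.mean((proba[:, 1] - y_test) ** 2),
    })
```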

A series of conclusions can be drawn from the observations.

  1. (Calibrated) boosted decision trees appear to perform best in general, although support vector machines and neural networks give credible near-best performance.
  2. The metalearning algorithm which simply chooses the best method (based upon a small validation set) performs much better than committing to any single method.
  3. A metalearning algorithm which combines the predictors in an ensemble using stepwise refinement of validation set performance appears to perform even better. (A toy sketch of this greedy selection appears below.)
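
For concreteness, here is a toy sketch of what such stepwise selection over a validation set might look like. It is not necessarily the exact procedure used in the experiments; the val_preds input (a map from model name to validation-set probabilities) is a hypothetical placeholder.

```python
import numpy as np

# Greedy "ensemble selection" on a validation set: repeatedly add (with
# replacement) whichever model's predictions most reduce the ensemble's
# validation loss; the ensemble predicts by averaging the chosen models.
def ensemble_select(val_preds, y_val, loss, rounds=50):
    chosen = []                                   # selected model names, repeats allowed
    running_sum = np.zeros(len(y_val))
    for _ in range(rounds):
        best_name, best_loss = None, np.inf
        for name, p in val_preds.items():
            candidate = (running_sum + p) / (len(chosen) + 1)
            current = loss(y_val, candidate)      # e.g. squared error or log loss
            if current < best_loss:
                best_name, best_loss = name, current
        chosen.append(best_name)
        running_sum += val_preds[best_name]
    return chosen                                 # average these models' predictions at test time

# Example usage with squared error as the validation loss:
# picks = ensemble_select(val_preds, y_val, lambda y, p: np.mean((y - p) ** 2))
```

Because models can be selected repeatedly, the resulting average is effectively a weighted ensemble; selection can also stop once validation loss stops improving.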

There are a number of caveats to this work: it was only applied to large datasets, there is no guarantee that the datasets are representative of your problem (although efforts were made to be representative in general), and the size of the training set was fixed rather than using the natural size given by the problem. Despite all these caveats, the story told above seems compelling: if you want maximum performance, you must try many methods and somehow combine them.

The most significant drawback of this method is computational complexity. Techniques for reducing the computational complexity are therefore of significant interest. It seems plausible that there exists some learning algorithm which typically performs well whenever any of the above algorithms can perform well, at a computational cost significantly less than “run all algorithms on all settings and test”.

A fundamental unanswered question here is “why?” in several forms. Why have the best efforts of many machine learning algorithm designers failed to capture all the potential predictive strength into a single coherent learning algorithm? Why do ensembles give such a significant, consistent edge in practice? A great many papers follow the scheme: invent a new way to create ensembles, test, observe that it improves prediction performance at the cost of more computation, and publish. There are several pieces of theory explaining individual ensemble methods, but we seem to have no convincing theoretical statement explaining why they almost always work.

Prediction Competitions

There are two prediction competitions currently in the air.

  1. The Performance Prediction Challenge by Isabelle Guyon. Good entries minimize a weighted 0/1 loss + the difference between a prediction of this loss and the observed truth on 5 datasets. Isabelle tells me all of the problems are “real world” and the test datasets are large enough (17K minimum) that the winner should be well determined by ability rather than luck. This is due March 1.
  2. The Predictive Uncertainty Challenge by Gavin Cawley. Good entries minimize log loss on real valued output variables for one synthetic and 3 “real” datasets related to atmospheric prediction. The use of log loss (which can be infinite, so an average of it need not converge) and smaller test sets of size 1K to 7K examples makes the winner of this contest more luck dependent. Nevertheless, the contest may be of some interest, particularly to the branch of learning (typically Bayesian learning) which prefers to optimize log loss. (A tiny illustration of the log loss issue appears below.)
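
For concreteness, here is a tiny, generic illustration (not the contest’s actual scoring code) of why log loss makes results fragile: a single fully confident wrong prediction drives the average to infinity, no matter how good the remaining predictions are.

```python
import numpy as np

# Average negative log likelihood for binary outcomes.  Assigning probability
# zero to an outcome that actually occurs makes the whole average infinite.
def avg_log_loss(y_true, p_pred):
    p_of_truth = np.where(y_true == 1, p_pred, 1.0 - p_pred)
    return -np.mean(np.log(p_of_truth))

print(avg_log_loss(np.array([1, 1, 0]), np.array([0.9, 0.0, 0.1])))  # prints inf
```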

May the best predictor win.

The design of a computing cluster

This is about the design of a computing cluster from the viewpoint of applied machine learning using current technology. We just built a small one at TTI, so this is some evidence of what is feasible, along with thoughts about the design choices.

  1. Architecture There are several architectural choices.
    1. AMD Athlon64 based system. This seems to give the best bang for the buck. Maximum RAM is typically 2-3GB.
    2. AMD Opteron based system. Opterons provide the additional capability to buy an SMP motherboard with two chips, and the motherboards often support 16GB of RAM. The RAM is also the more expensive error correcting type.
    3. Intel PIV or Xeon based system. The PIV and Xeon based systems are the Intel analogues of the above two. For architectural reasons, these chips tend to run a bit hotter and be a bit more expensive.
    4. Dual core chips. Both Intel and AMD have chips that actually have 2 processors embedded in them.
    5. In the end, we decided to go with option (2). Roughly speaking, the AMD system seemed like a better value than Intel. The Opteron systems were preferable to the Athlon64 systems because halving the number of nodes aids system setup and maintenance while preserving about the same (or slightly more) cost/cpu. Another concern driving us towards the Opteron system was the ability to expand the RAM at a later date. In the last year or so, CPU speeds have not increased significantly; instead, dual core chips have come out. This trend seems likely to continue, which implies the time to obsolescence of a machine is driven by its maximum RAM capacity.

  2. Network Gigabit ethernet is cheap, easy, and even built into the motherboard.
  3. Operating System The options are
    1. Windows
    2. Linux

    We chose Linux (and in particular the Fedora Core 3 variant) because Linux is cheaper, has fewer licensing hassles, and has been used in clusters for much longer. An additional issue is the version of Linux:

    1. 32-bit Linux: The advantage here is everything “just works”. The disadvantage is that using more than 4GB of RAM is awkward and you lose out on some minor architectural speedups of 64-bit mode.
    2. 64-bit Linux: The mirror image of the above. Using more than 4GB of RAM becomes straightforward and the 64-bit speedups are available, but fewer things “just work”.
    3. We ended up choosing 32-bit Linux simply for stability and ease-of-setup reasons. The ease-of-setup was particularly compelling with respect to the openMosix choice (below). The exact variant of Linux we used was a matter of some initial exploration by Don Coleman (TTI’s master of machines). It is very plausible that we will want to switch to 64-bit Linux at some point in the future.

  4. Programming paradigm There are several paradigms for how to use a parallel machine. These are not exclusive.
    1. Use a standard language with a specialized parallel programming library such as mpich. This choice can significantly slow down programming, but is necessary to eke every bit of performance out of the system.
    2. Use a batch control system to match jobs with nodes. There are several custom systems around for doing this, and it’s not hard to make up your own script (a bare-bones sketch of such a dispatcher appears below). There is some learning curve here, although it is necessarily smaller than for (1). With this approach, you can achieve near maximal performance as long as the individual processes do not need to communicate.
    3. Turn the cluster into a large virtual machine via openMosix and then simply launch several processes. This is the worst option performance-wise and the best option convenience-wise. To use it, you simply start processes and the system takes care of distributing them across the cluster. Processes must, however, be designed to minimize disk IO (as well as program IO) in order to achieve high efficiency computation.

    At TTI, we focused on the openMosix approach because this fits well with standard machine learning programs. In addition, starting at the ease-of-use end and then graduating to more difficult paradigms as necessary seems reasonable.
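
As a purely hypothetical illustration of option (2), here is what a bare-bones hand-rolled dispatcher might look like: it round-robins independent jobs over nodes via ssh with a crude throttle. The node names and job commands are made up, and real batch systems do far more (queueing, fault tolerance, resource matching).

```python
import subprocess
from itertools import cycle

nodes = ["node01", "node02", "node03", "node04"]       # hypothetical hostnames
jobs = [f"./train.sh --fold {i}" for i in range(16)]   # hypothetical job commands

node_iter = cycle(nodes)
running = []
for job in jobs:
    node = next(node_iter)
    # Launch the job on the remote node via ssh; output stays on that node.
    running.append(subprocess.Popen(["ssh", node, job]))
    if len(running) >= len(nodes):   # crude throttle: wait for the current batch
        for proc in running:
            proc.wait()
        running = []

for proc in running:                 # wait for any leftover jobs
    proc.wait()
```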

There are a couple of things which did not quite work out. Ideally, each of the nodes would be rackmounted (for ease of maintenance) and, except for the “master node”, use ethernet boot on startup. The rackmounting was relatively easy, but the combination of ethernet boot, openMosix, and Linux was frustrating. Instead, Don ordered some very small hard drives for each node and simply installed Linux on them. Another minor surprise is that the Opteron motherboard required a video card in order to boot.

In the end, the total cost was about $1000 per CPU and it took (perhaps) a man-week to set up. There are many caveats here—prices change rapidly, software improves, and how you want to use the cluster is important to consider. Nevertheless, this design point is hopefully of some help in calibrating your own designs. (Feel free to add in any of your own experience below.)

Progress in Active Learning

Several bits of progress have been made since Sanjoy pointed out the significant lack of theoretical understanding of active learning. This is an update on the progress I know of. As a refresher, active learning as meant here is:

  1. There is a source of unlabeled data.
  2. There is an oracle from which labels can be requested for unlabeled data produced by the source.
  3. The goal is to perform well with minimal use of the oracle. (A toy sketch of this interaction loop appears below.)
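
To make the setting concrete, here is a toy pool-based loop. The query strategy (uncertainty sampling) and the learner (logistic regression via scikit-learn) are placeholders chosen for brevity; this is not any of the algorithms discussed below.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Unlabeled pool plus an oracle holding the labels; the learner tries to do
# well while requesting as few labels as possible.
X_pool, y_oracle = make_classification(n_samples=2000, n_features=10, random_state=0)

# Seed with one example from each class so the first fit is well-posed.
labeled = [int(np.where(y_oracle == 0)[0][0]), int(np.where(y_oracle == 1)[0][0])]
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]

clf = LogisticRegression()
for _ in range(50):                                  # budget of 50 label requests
    clf.fit(X_pool[labeled], y_oracle[labeled])
    proba = clf.predict_proba(X_pool[unlabeled])
    # Query the pool point the current model is least certain about.
    query = unlabeled[int(np.abs(proba[:, 1] - 0.5).argmin())]
    labeled.append(query)                            # ask the oracle for its label
    unlabeled.remove(query)
```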

Here is what I’ve learned:

  1. Sanjoy has developed sufficient and semi-necessary conditions for active learning given the assumptions of IID data and “realizability” (that one of the classifiers is a correct classifier).
  2. Nina, Alina, and I developed an algorithm for active learning relying on only the assumption of IID data. A draft is here.
  3. Nicolo, Claudio, and Luca showed that it is possible to do active learning in an entirely adversarial setting for linear threshold classifiers here. This was published a year or two ago and I recently learned about it.

All of these results are relatively ‘rough’: they don’t necessarily make good algorithms as stated (although the last one has a few experiments). None of these results are directly comparable because the assumptions vary. Comparing the assumptions and the results leads to a number of remaining questions:

  1. Do the sufficient and semi-necessary conditions apply to the IID-only case? The adversarial case?
  2. Is there a generic algorithm for any hypothesis space that works in the fully adversarial setting?
  3. What are special cases of these algorithms which are computationally tractable and useful?

The Foundations of Active Learning workshop at NIPS should be a good place to discuss these questions.

Fallback Analysis is a Secret to Useful Algorithms

The ideal of theoretical algorithm analysis is to construct an algorithm with accompanying optimality theorems proving that it is a useful algorithm. This ideal often fails, particularly for learning algorithms and theory. The general form of a theorem is:

If preconditions Then postconditions
When we design learning algorithms, it is very common to come up with precondition assumptions such as “the data is IID”, “the learning problem is drawn from a known distribution over learning problems”, or “there is a perfect classifier”. All of these example preconditions can be false for real-world problems in ways that are not easily detectable. This means that algorithms derived and justified by these very common forms of analysis may be prone to catastrophic failure in routine (mis)application.

We can hope for better. Several different kinds of learning algorithm analysis have been developed, some of which have fewer preconditions. Simply demanding that these forms of analysis be used may be too strong—there is an unresolved criticism that these algorithms may be “too worst case”. Nevertheless, it is possible to have a learning algorithm that simultaneously provides strong postconditions given strong preconditions, reasonable postconditions given reasonable preconditions, and weak postconditions given weak preconditions. Some examples of this I’ve encountered include:

  1. Sham, Matthias and Dean showing that some Bayesian regression is robust in a minimax online learning analysis.
  2. The cover tree, which creates an O(n)-space data structure for nearest neighbor queries while simultaneously guaranteeing O(log(n)) query time when the metric obeys a dimensionality constraint.

The basic claim is that algorithms with a good fallback analysis are significantly more likely to achieve the theoretical algorithm analysis ideal. Both of the above algorithms have been tested in practice and found capable.

Several significant difficulties occur for anyone working on fallback analysis.

  1. It’s harder. This is probably the most valid reason—people have limited time to do things. Nevertheless, it is reasonable to hope that the core techniques used by many people have had this effort put into them.
  2. It is psychologically difficult for a researcher to both assume and not assume a precondition. A critically valuable resource here is observing multiple forms of analysis.
  3. It is psychologically difficult for a reviewer to appreciate the value of both assuming and not assuming some precondition. This is a matter of education.
  4. It is neither “sexy” nor straightforward. In particular, theoretically inclined people (1) get great joy from showing that something new is possible and (2) routinely work on papers of the form “here is a better algorithm to do X given the same assumptions”. A fallback analysis requires a change in assumptions, invalidating (2), and the new thing it shows for (1) is subtle: that two existing guarantees can hold for the same algorithm. My hope here is that this subtlety becomes better appreciated in time—making useful algorithms has a fundamental sexiness of its own.