The Everything Ensemble Edge

Rich Caruana, Alexandru Niculescu, Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is:

  1. Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings.
  2. Run the methods on each of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression (a minimal calibration sketch appears just after this list).
  3. For each loss in {accuracy, area under the ROC curve, cross entropy, squared error, etc…}, evaluate the average performance of each method.
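
For concreteness, here is a minimal sketch of the calibration step using scikit-learn; the dataset, model, and split are placeholders rather than anything from the study, and it assumes a scikit-learn version where cv="prefit" is still accepted.

    # Hypothetical sketch: fit a base model, then calibrate its scores on
    # held-out data with either sigmoid (Platt) fitting or isotonic regression.
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, random_state=0)  # stand-in data
    X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.2, random_state=0)

    base = GradientBoostingClassifier().fit(X_tr, y_tr)
    for method in ("sigmoid", "isotonic"):
        calibrated = CalibratedClassifierCV(base, method=method, cv="prefit")
        calibrated.fit(X_cal, y_cal)
        print(method, calibrated.predict_proba(X_cal[:3])[:, 1])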

A series of conclusions can be drawn from the observations.

  1. (Calibrated) boosted decision trees appear to perform best in general, although support vector machines and neural networks give credible near-best performance.
  2. The metalearning algorithm which simply chooses the best (based upon a small validation set) performs much better.
  3. A metalearning algorithm which combines the predictors in an ensemble using stepwise refinement of validation set performance appears to perform even better (a greedy sketch of this selection step appears just after this list).
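
As a rough illustration of that greedy selection idea (my own sketch under simplifying assumptions, not the authors' exact procedure): repeatedly add, with replacement, whichever model most improves the validation accuracy of the averaged prediction.

    import numpy as np

    def greedy_ensemble(val_probs, y_val, rounds=50):
        """Greedy ensemble selection on a validation set.

        val_probs: dict mapping model name -> array of validation P(y=1).
        Returns a multiset of chosen models; average their predictions at test time.
        """
        chosen = []
        total = np.zeros_like(y_val, dtype=float)
        for _ in range(rounds):
            def accuracy_if_added(name):
                avg = (total + val_probs[name]) / (len(chosen) + 1)
                return np.mean((avg > 0.5) == y_val)
            best = max(val_probs, key=accuracy_if_added)
            chosen.append(best)
            total += val_probs[best]
        return chosen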

There are a number of caveats to this work: it was only applied to large datasets, there is no guarantee that the datasets are representative of your problem (although efforts were made to be representative in general), and the size of the training set was fixed rather than using the natural size given by the problem. Despite all these caveats, the story told above seems compelling: if you want maximum performance, you must try many methods and somehow combine them.

The most significant drawback of this method is computational complexity. Techniques for reducing the computational complexity are therefore of significant interest. It seems plausible that there exists some learning algorithm which typically performs well whenever any of the above algorithms can perform well, at a computational cost which is significantly less than “run all algorithms on all settings and test”.

A fundamental unanswered question here is “why?” in several forms. Why have the best efforts of many machine learning algorithm designers failed to capture all the potential predictive strength into a single coherent learning algorithm? Why do ensembles give such a significant and consistent edge in practice? A great many papers follow the scheme: invent a new way to create ensembles, test, observe that it improves prediction performance at the cost of more computation, and publish. There are several pieces of theory that explain individual ensemble methods, but we seem to have no convincing theoretical statement explaining why they almost always work.

18 Replies to “The Everything Ensemble Edge”

  1. Would you count a PAC-Bayesian approach which could directly handle averaging or voting classifiers as an important step towards the goal of your last sentence? As far as I’m aware, at the moment error rate bounds for these classifiers are calculated by doubling the bound for the corresponding Gibbs sampled classifier. Do you know what progress has been made in a more direct approach? If there were such results, they ought to tie in neatly with the minimum relative entropy approach of Jebara et al.

  2. The link ‘using all methods to make a prediction’ needs fixing. The paper referred to is very interesting. Something perhaps surprising is how small a contribution to the ensemble of models for the two classification problems is made by SVMs (last column of Table 3), when Table 1 shows them performing so well when working alone.

  3. I think the PAC-Bayes approach illustrates the shortcomings very well. Using the PAC-Bayes theorem, we know that it is possible to tightly bound the error rate of many stochastic classifiers, but this does not directly apply to ensemble classifiers. Every attempt to make this apply, so far, has been unappealing.

    The doubling argument uses the pigeonhole principle to show that the error rate of an ensemble is at most twice its stochastic average. The reason why this is unappealing in practice is that many problems have a minimum error rate of (say) 0.25. When doubled, a bound of 0.5 is not particularly compelling.
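
    (One way to write that argument, in my own notation: if the Q-weighted majority vote errs on an example, then at least half of the posterior mass errs there, so
    \[
    \Pr_{(x,y)\sim D}\big[\mathrm{vote}_Q(x)\neq y\big]
    \;\le\; \Pr_{(x,y)\sim D}\Big[\mathbb{E}_{h\sim Q}\,\mathbf{1}[h(x)\neq y]\ge \tfrac12\Big]
    \;\le\; 2\,\mathbb{E}_{(x,y)\sim D}\,\mathbb{E}_{h\sim Q}\,\mathbf{1}[h(x)\neq y],
    \]
    i.e. the vote error is at most twice the Gibbs error.)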

    There is another argument which avoids this doubling of the base error with messier constants. The argument was first made in the original “margins are good” theory paper. Perhaps a simpler version of the argument is in the Pac-Bayes and Margins paper.

    Neither of these arguments really proves that you should prefer the ensemble to the stochastic classifier, as seems to be absolutely true in practice.

  4. Link fixed (thanks).

    The lack of representation of SVMs in the ensemble might easily be a competition effect. If there are several base systems which get at the same bias, you might expect that the best of them is under-represented because one of the similar systems happens to edge it out on any particular problem.

  5. One way to get a sense of this might be in terms of a finite number of voters. Imagine 101 voters, each of whose true error rate is 10%. Then the voting error rate can vary between 0% and 10%, depending on how independent their errors are. It might look as though one could arrange things to get a rate higher than 10%, but if 51 of them made their errors on the same data, there would only be 50 voters left, and even if those voters’ errors were made on the same data, it wouldn’t be enough for an overall error.

    The voting error can be higher than the Gibbs error. Imagine 3 voters with error rates 10%, 20%, 30%. The voting error rate could be as high as 30% (and as low as 0%), while the Gibbs rate is 20%. The voting rate seems to be bounded by the maximum individual rate.

    On the other hand, if there is sufficient independence between the voters’ errors, then one would find much lower voting error rates. So, I suppose that’s why ensembles work well, taking models from different classes of models. The key for an algorithm is to search for voters whose errors aren’t made simultaneously.
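
    A small simulation (my own illustration, with made-up numbers) of how much error correlation matters for the 101-voter example above:

        # 101 voters, each with a 10% marginal error rate.  With independent
        # errors the majority vote almost never errs; with fully shared errors
        # the vote errs on roughly 10% of examples.
        import numpy as np

        rng = np.random.default_rng(0)
        n_voters, n_examples, p = 101, 20000, 0.10

        indep = rng.random((n_voters, n_examples)) < p               # independent errors
        shared = np.tile(rng.random(n_examples) < p, (n_voters, 1))  # identical errors

        for name, errs in [("independent", indep), ("correlated", shared)]:
            vote_err = np.mean(errs.sum(axis=0) > n_voters // 2)
            print(name, round(float(vote_err), 4))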

  6. First, to be able to generalize beyond the training set, the ML algorithm must have a bias. For example, neural networks are biased to learn in a different way (different concepts) than decision trees, SVMs, etc. Second, margin theory tells us how to control the combined complexity of many different hypotheses and prevent over-fitting.

    The intuitive idea goes like this: we have a set H of equally good hypotheses (assume their prediction accuracy is equal) and we can pick N of them to make a combined hypothesis. How do we choose those N? We select the N hypotheses that are as different as possible (the concepts they recognize differ maximally). This is best achieved if we use different ML algorithms altogether (if they are roughly equally good).

  7. I forgot to emphasize: combining similar “opinions” is not very effective. The whole idea behind voting methods is that opinions differ.

  8. Everything you say makes sense, but it does not address the theory/practice gap I see.

    The difficulty with thinking “margin theory tells us to use ensembles” is that every margin bound I have studied in detail has the property that I can extract a better non-ensemble bound in any particular application. This holds for the original margin bound, an improvement on it, and the PAC-Bayes margin bound. In each case, you can construct a stochastic classifier which will have a smaller bound than the margin bound.

    Margin theory seems to tell us “ensembles are not as bad as you might think” rather than “margin theory tells us to use ensembles”.

  9. Tighter bounds for ensembles can only come from more information than the individual error rates of their members. If you have a library of classifiers which you know to be 95% accurate, then if the world conspires against you, you could end up with only 90% voting accuracy. On the other hand, if you knew that a number of classifiers in the library err independently of one another, you could form a very accurate voting classifier. So how do you integrate error-covariance information into a PAC result?

  10. That’s a good point. The first level of analysis here should probably be unprobabilistic. One analysis along these lines is lemma 4 of this paper. It appears that lemma 4 generalizes to something like the following:

    Let b = the expected proportion of bad pairs = Pr_{(x,y) ~ D, i,j ~ {1,…,n}} (h_i(x) ≠ y and h_j(x) ≠ y)

    Let t be any threshold on the number of binary errors required to induce an overall error. (For example, this is 1/2 the number of voters in a strict majority vote.)

    Then the following theorem holds:
    For all D, sets of binary classifiers {h}, t, the error rate of the majority vote is less than b |{h}|(|{h}|-1)/[t(t-1)]

    The proof is a counting argument: every error must incur at least t choose 2 bad pairs, and b bounds the rate of bad pairs.
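
    Here is a quick numerical sanity check of that counting bound (my own sketch, with synthetic independent classifiers; b is estimated over distinct pairs):

        # Check empirically that Pr[majority vote errs] <= b * n(n-1) / [t(t-1)].
        import numpy as np

        rng = np.random.default_rng(1)
        n, m = 25, 10000                    # classifiers, examples
        errs = rng.random((n, m)) < 0.2     # errs[i, j]: classifier i errs on example j
        t = n // 2 + 1                      # strict majority threshold

        k = errs.sum(axis=0)                # number of erring classifiers per example
        vote_err = np.mean(k >= t)
        b = np.mean(k * (k - 1)) / (n * (n - 1))   # rate of bad (distinct) pairs

        print(vote_err, "<=", b * n * (n - 1) / (t * (t - 1)))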

  11. I’ll start a new thread as comments won’t nest further, but referring to your Nov 28, is there any reason a PAC-Bayesian bound won’t go through? We have voting error less than or equal to 4 × the double error rate for 2 Gibbs-sampled hypotheses.

    With the usual notation, e.g. Seeger’s thesis, the latter is

    E_{(w_i, w_j) ~ Q^2} ( E_{(x*, y*)} [ I{ sign(u(w_i, x*) + b) ≠ y* and sign(u(w_j, x*) + b) ≠ y* } ] ),

    where Q is the posterior distribution over W. With finitely many hypotheses one might exclude the diagonal.

    Of course, one might opt for triple errors, etc.

  12. Sorry, I’m not quite following you. What is the precise theorem you have in mind? (With the definitions of the variables?)

  13. Having bounded the voting classifier error by 4 × the probability that 2 random classifiers both err, I was wondering if the latter could be bounded along the lines of the PAC-Bayesian results, e.g., theorem 3.1 of Seeger’s thesis,

    http://www.kyb.tuebingen.mpg.de/bs/people/seeger/papers/thesis.html,

    so that it is within such and such of the empirical error rate for a pair of classifiers. One would expect the gap to involve the Kullback-Leibler distance between the prior over W × W and the posterior.

  14. I understand: you want to bound b. And the answer is “yes”, this can be done after making an additional IID data assumption. Typically, we measure the error rate of classifiers, but we could in general measure any parameterized event. This means we could treat hypothesis pairs as hypotheses and then apply the standard bounds (VC, PAC-Bayes, etc…) on the hypothesis pair space. The PAC-Bayes bound is particularly compelling because it becomes very tight when the average is over a significant fraction of the original space.

    I’m pretty satisfied with this explanation now. It seems to mathematically validate the empirical observations here and here and it improves on the theoretical arguments of the last one by eliminating various unnecessary assumptions.
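
    To make that concrete, here is a rough sketch (my own, assuming the Seeger/Langford kl form of the PAC-Bayes bound applied with hypothesis pairs as the hypothesis space, so the complexity term becomes 2 KL(Q||P); the numbers plugged in at the end are arbitrary):

        # Turn an empirical pair-error rate b_hat, measured on m IID examples,
        # into an upper bound on the true pair-error rate b by inverting the
        # binary kl:  kl(b_hat || b) <= (2*KL(Q||P) + ln(2*sqrt(m)/delta)) / m.
        import math

        def binary_kl(q, p):
            eps = 1e-12
            q = min(max(q, eps), 1 - eps)
            p = min(max(p, eps), 1 - eps)
            return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

        def kl_upper_inverse(q_hat, slack):
            """Largest p with kl(q_hat || p) <= slack, found by bisection."""
            lo, hi = q_hat, 1.0
            for _ in range(60):
                mid = (lo + hi) / 2
                if binary_kl(q_hat, mid) <= slack:
                    lo = mid
                else:
                    hi = mid
            return lo

        def pair_error_bound(b_hat, m, kl_qp, delta=0.05):
            slack = (2 * kl_qp + math.log(2 * math.sqrt(m) / delta)) / m
            return kl_upper_inverse(b_hat, slack)

        print(pair_error_bound(b_hat=0.05, m=5000, kl_qp=10.0))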

  15. And there’s no need to stop at pairs. The voting error would be bounded by 8 × the probability that 3 random hypotheses err simultaneously. And so on.

    I wonder how good these bounds would be for the GPs Seeger goes on to treat. The K-L distance doesn’t seem to grow inordinately. On the other hand, perhaps this approach would work better with classifiers coming from different algorithms, although VC or PAC-Bayesian treatment might be messier.

    Would it be worth writing this up?

    Another thought: in the maximum entropy discrimination approach to SVMs of Jebara et al, they are able to specify the prior over the margins of individual training points. I was wondering if these could be used to encourage the errors of different classifiers not to occur simultaneously.

    Something else to think about is how to use the probabilistic outputs of Bayesian classifiers. Rather than a vote of all experts, you’d rather have only those who were fairly confident vote. Like on ‘Who wants to be a millionaire’ when the contestant asks the audience, you would prefer it if you could tell them not to bother voting if they’re only guessing. Maybe this is more like a mixture of experts approach.

  16. Roughly speaking, this is a nice observation which can explain ensemble system performance mathematically. Before writing up we should ask the question: “Can we construct better ensembles?”

    From this viewpoint, exactly how we measure the error rates is irrelevant. Given that we have measured them well, how would we use the information?
    We have a set of n predictors. We know, for each predictor, e_i = the probability that predictor i errs, and, for each pair of predictors, e_ij = the probability that both elements in the pair err. It seems our theorem would advocate choosing (only) the pair with the lowest value of e_ij, which is rather trivial.

    If we back up and admit our imperfect knowledge of e_ij, optimizing the PAC-Bayes bound would give us a nontrivial ensemble. This may not be completely convincing because it says our lack of knowledge is forcing us to use a large ensemble. That might be part of the story, but it is not the full story.

    Generalizing to higher order statistics, we presumably get that the nth order version of this theorem suggests using the best single n-way ensemble when we have perfect knowledge and randomizing over multiple n-way ensembles according to the PAC-Bayes theorem when we have only partial knowledge.

    If we consider the joint set of all higher order versions with perfect knowledge, a tradeoff seems to arise because the theorem bound depends on 2^n b_n, where b_n is the nth-order joint error statistic. Presumably, b_n initially decreases faster than 2^n rises, and then vice-versa with increasing n. Choosing the set of size n which minimizes the bound would give us a satisfying answer. Admitting our lack of knowledge, the optimal choice of n may be smaller, but we would randomize over many different sets of size n, which may make it larger.

    There are some computational issues here, like "How do we compute the KL-divergence between two distributions over n-sets?" This can be addressed using element-wise independence. If we set up the prior and posterior over n objects so that, to draw n objects, you draw each at random, then the KL-divergence decomposes into a sum of n KL-divergences. If the prior and posterior do not vary with the draw, this becomes simply n times the KL-divergence between a 1-element prior and a 1-element posterior. This optimization should be tractable.
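
    A tiny numerical illustration of that decomposition (my own sketch, with arbitrary 3-element distributions): building the prior and posterior over n-tuples element-wise makes the joint KL-divergence equal n times the per-element KL-divergence.

        import numpy as np
        from itertools import product

        def kl(q, p):
            return float(np.sum(q * np.log(q / p)))

        q = np.array([0.7, 0.2, 0.1])   # hypothetical 1-element posterior
        p = np.array([1/3, 1/3, 1/3])   # uniform 1-element prior
        n = 3

        # Joint distributions over all n-tuples, built element-wise.
        joint_q = np.array([np.prod(q[list(idx)]) for idx in product(range(3), repeat=n)])
        joint_p = np.array([np.prod(p[list(idx)]) for idx in product(range(3), repeat=n)])

        print(kl(joint_q, joint_p), n * kl(q, p))   # the two values agree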

    There remains a question of trying this out. I’ll pester Rich Caruana to see if he’s interested. If you want to go ahead and write out the math in a more coherent form than a series of comments, that’s probably worthwhile.
