In many machine learning papers, experiments are run and little confidence bars are reported for the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ being used, and it’s easy to become confused.
- Confidence = Probability. For those who haven’t worried about confidence for a long time, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications because we want to reason about how much information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5, but it’s unclear if the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain.
- Classical Confidence Intervals. These are common in learning theory. The essential idea is that the world has some true-but-hidden value, such as the error rate of a classifier. Given observations from the world (such as err-or-not on examples), an interval is constructed around the hidden value. The semantics of the classical confidence interval is: the (random) interval contains the (deterministic but unknown) value, with high probability. Classical confidence intervals (as applied in machine learning) typically require that observations are independent. They have some drawbacks discussed previously. One drawback of concern is that classical confidence intervals break down rapidly when conditioning on information.
- Bayesian Confidence Intervals. These are common in several machine learning applications. If you have a prior distribution over the way the world creates observations, then you can use Bayes law to construct a posterior distribution over the way the world creates observations. With respect to this posterior distribution, you construct an interval containing the truth with high probability. The semantics of a Bayesian confidence interval is “If the world is drawn from the prior the interval contains the truth with high probability”. No assumption of independent samples is required. Unlike classical confidence intervals, it’s easy to have a statement conditioned on features. For example, “the probability of disease given the observations is in [0.8,1]”. My principal source of uneasiness with respect to Bayesian confidence intervals is the “If the world is drawn from the prior” clause—I believe it is difficult to know and specify a correct prior distribution. Many Bayesians aren’t bothered by this, but the meaning of a Bayesian confidence interval becomes unclear if you work with an incorrect (or subjective) prior.
- Asymptotic Intervals. These are also common in applied machine learning, and I strongly dislike them. The basic line of reasoning seems to be: “Someone once told me that if observations are IID, then their average converges to a normal distribution, so let’s use an unbiased estimate of the mean and variance, assume convergence, and then construct a confidence interval for the mean of a Gaussian”. Asymptotic intervals are asymptotically equivalent to classical confidence intervals, but they can differ spectacularly with finite sample sizes. The simplest example is when a classifier has zero error rate on a test set. A classical confidence interval for the error rate is [0,log(1/d)/n], where n is the size of the test set and d is the allowed probability that the interval fails to contain the truth. The asymptotic interval is [0,0], which is bogus in all applications I’ve encountered. (A small numerical sketch of this comparison appears after this list.)
- Internal Confidence Intervals. This is not used much, except in agnostic active learning analysis. The essential idea is that we cease to make intervals about the world, and instead make intervals around our predictions of the world. The real world might assign label 0 or label 1 given a particular context x, and we could only discover the world’s truth by actually observing x,y labeled examples. Yet, it turns out to sometimes be easy to infer “our learning algorithm will definitely predict label 1 given features x”. This allowed dependence on x means we can efficiently guide exploration. A basic question is: can this notion of internal confidence guide other forms of exploration?
- Gamesman intervals. Vovk and Shafer have been working on new foundations of probability, where everything is stated in terms of games. In this setting, a confidence interval is (roughly) a set of predictions output by an adaptive rule with the property that it contains the true observation a large fraction of the time. This approach has yet to catch on, but it is interesting because it provides a feature dependent confidence interval without making strong assumptions about the world.
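For concreteness, here is a minimal sketch of the zero-error-rate comparison mentioned in the asymptotic-intervals item above. The choices of n = 100 test examples, failure probability d = 0.05, and the 95% normal approximation are my own illustrative assumptions, not anything fixed by the discussion.

```python
import math

def classical_upper_bound(n, d):
    # Exact upper bound on the true error rate when zero errors are observed
    # on n independent test examples: the largest p with (1 - p)^n >= d,
    # i.e. p = 1 - d**(1/n), which is at most log(1/d)/n.
    return 1.0 - d ** (1.0 / n)

def asymptotic_interval(errors, n):
    # Normal-approximation ("asymptotic") 95% interval around the observed
    # error rate, using the plug-in variance estimate.
    mean = errors / n
    half_width = 1.96 * math.sqrt(mean * (1.0 - mean) / n)
    return max(0.0, mean - half_width), min(1.0, mean + half_width)

n, d = 100, 0.05  # 100 test examples; interval allowed to fail 5% of the time
print(classical_upper_bound(n, d))   # ~0.0295, close to log(1/d)/n ~ 0.030
print(asymptotic_interval(0, n))     # (0.0, 0.0): the bogus degenerate interval
```

The exact bound stays meaningful when zero errors are observed, while the plug-in variance collapses to zero and takes the asymptotic interval with it.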
‘The semantics of a Bayesian confidence interval is “If the world is drawn from the prior the interval contains the truth with high probability”’
I don’t know where this idea comes from, but it seems to be relatively common in the machine learning literature. I attended a NIPS workshop a while back where people made similar statements, to my disbelief.
I would say this is not a sensible definition. In general there is no such thing as “nature’s prior” or “the world’s prior”. In most settings it would be nonsense to say that the distribution of a correlation parameter “in the world” is “a priori” uniform on [-1, 1]. One can certainly believe that the distribution in the world of a correlation coefficient is a point-mass distribution (i.e., there is a true value), and still have a uniform prior on [-1, 1]. The prior can be widely different from the true distribution (point-mass or not), whatever that means.
The semantics of a Bayesian confidence interval is essentially that “if I previously believed in this prior and have now seen this evidence, then my belief about the unknown is this posterior, if I am to be coherent”. It is also true that this seems somewhat unfulfilling. Religious Bayesians would be satisfied with that, but for the rest of us there are practical ways of checking whether your posterior beliefs are anchored in reality (held-out sets, Andrew Gelman’s posterior checks, etc.).
It is also true that things can go awry for some families of priors in nonparametric settings, but this is very, very different from saying we should know the solution of the problem before seeing the data! (Isn’t knowing “nature’s prior” the same as saying we don’t have anything to learn?) Maybe that’s what you had in mind to begin with? Then I’m just misinterpreting what you wrote. Maybe it’s just the aftertaste of that NIPS workshop that is affecting me.
[I think this initial confusion might come from other forms of conditioning: if I claim the probability of having disease X is P before seeing a test for the disease, and that the probability of the test being correct given X is such-and-such, and I now have the outcome of a test, then I should now have a probability P’ for X given the test, which would be believable only if the prior was close to correct. But this is a totally different story.]
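To spell out the kind of conditioning described in that aside, here is a minimal sketch; the specific prior, test accuracies, and positive outcome are hypothetical numbers of my own, purely for illustration.

```python
# Bayes-law update for the hypothetical disease-test story above.
P = 0.01                  # assumed prior probability of disease X
p_pos_given_X = 0.95      # assumed probability the test is positive given X
p_pos_given_notX = 0.10   # assumed probability the test is (falsely) positive given not-X

# Posterior probability of X after observing a positive test.
P_prime = (P * p_pos_given_X) / (P * p_pos_given_X + (1 - P) * p_pos_given_notX)
print(P_prime)  # ~0.088: believable only to the extent the prior P was close to correct
```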
You do agree that this is a valid interpretation, even if not your preferred one, right?
The subjective interpretation above hinges deeply on what it means to “believe in a prior” which is at the essence of Bayesian philosophy. I’ve never quite understood it.
On a practical matter, one ‘unfulfilling’ aspect of a subjective Bayesian confidence interval is that your interval and mine can disagree profoundly given the same data due to differing priors. For example, I wouldn’t trust a drug company’s prior on whether or not their newest drug works to match my own.
Hi John,
I’m glad Ricardo said it because I didn’t want to bring it up again. But now that it’s out there:
I think it’s a valid frequentist interpretation of a subjectivist idea.
Indeed: that’s why minimax results are common in Bayesian analysis, and why one sometimes performs the statistics with a reference prior that one hopes is convincing.
To be honest, I don’t agree this definition is valid. It’s just that the premise ‘If the world is drawn from the prior…’ doesn’t make any sense to me, as in the example of priors for correlation coefficients.
Suppose the world draws a value mu from the uniform distribution on the interval [0,1], but does not announce it. Then the world produces observations IID according to N(mu,1). After some number n of observations, we can apply Bayes law to construct a posterior on mu. From this posterior, we can cut out an interval with measure 0.9, call it the “confidence set”, and draw little error bars.
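For concreteness, a minimal numerical sketch of this sequence of actions; the grid approximation, n = 50 observations, and the central 90% interval are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# The world draws mu from Uniform[0,1], keeps it hidden, and emits N(mu, 1) samples.
mu_true = rng.uniform(0.0, 1.0)
n = 50
x = rng.normal(mu_true, 1.0, size=n)

# Posterior over mu on a grid: uniform prior on [0,1] times the Gaussian likelihood,
# which depends on the data only through the sample mean.
grid = np.linspace(0.0, 1.0, 10_001)
log_post = -0.5 * n * (grid - x.mean()) ** 2
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Cut out a central interval with posterior measure 0.9 and report it as error bars.
cdf = np.cumsum(post)
lo, hi = grid[np.searchsorted(cdf, 0.05)], grid[np.searchsorted(cdf, 0.95)]
print(f"true mu = {mu_true:.3f}, 90% Bayesian interval = [{lo:.3f}, {hi:.3f}]")
```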
My impression is that all Bayesian confidence intervals are mathematically equivalent to the above sequence of actions (with different choices of priors and observations plugged in). I believe all disagreements lie in interpretation of the prior.
For your correlation coefficient example, I’m comfortable thinking of it as an unknown constant drawn from a known distribution and then producing observations. In this situation, applying Bayes law to process observations is the right thing to do (even though the constant might be formally fixed at the time the posterior is computed). Everything becomes murky of course when the distribution for the constant is unknown and disagreed on.
>On a practical matter one ‘unfulfilling’ aspect of a subjective bayesian confidence interval is that your interval and mine can disagree profoundly given the same data due to differing priors.
Some of us see this as one of the major advantages of the Bayesian approach! In the situation where different analysts use the same likelihood but obtain substantially different posteriors because they are using different priors one can say that the data are not conclusive. Rational people can genuinely disagree without being incoherent. Any approach that does anything different is, to use Adrian Smith’s phrase, “a travesty of the scientific process”. In this situation one should discuss one’s priors with those who disagree, and collect more data.
>For example, I wouldn’t trust a drug company’s prior on whether or not their newest drug works to match my own.
Sir David Cox, who is not a Bayesian, put it well when he said “Why should I be interested in your prior?” My response to this question is that there could be several reasons.
1) It could be that your prior is similar to my own. For example, the drug company might be using a prior that displays a degree of scepticism that is acceptable to you. You don’t want to be taking a drug that doesn’t work, but neither does the drug company want to spend a great deal of money developing a drug that could later be shown to be ineffective.
2) It could be that while my prior is substantially different to yours, my posterior is similar to yours, because the data is sufficiently strong that the conclusion is robust to substantial change of prior. This is the ideal outcome and it would be good to know.
3) It could be that my posterior is substantially different to yours because my prior is different to yours and the data does not overwhelm the prior. Here we need to debate the priors and get more data if we can. In this situation we are served particularly badly by a frequentist process that delivers a single answer that is supposedly ‘objective’. It’s not. It’s what you get with a non-informative prior. That’s worth looking at, but it’s not the definitive answer.
Thank you for this, it was incredibly helpful. +1 hp : )