In many machine learning papers, experiments are run and little confidence bars are reported alongside the results. This often seems quite clear, until you actually try to figure out what it means. There are several different kinds of ‘confidence’ in use, and it’s easy to become confused.
- Confidence = Probability. For those who haven’t spent much time worrying about it, confidence is simply the probability of some event. You are confident about events which have a large probability. This meaning of confidence is inadequate in many applications, because we want to reason about how much information we have, how much more is needed, and where to get it. As an example, a learning algorithm might predict that the probability of an event is 0.5, but it’s unclear whether the probability is 0.5 because no examples have been provided or 0.5 because many examples have been provided and the event is simply fundamentally uncertain.
- Classical Confidence Intervals. These are common in learning theory. The essential idea is that the world has some true-but-hidden value, such as the error rate of a classifier. Given observations from the world (such as err-or-not on examples), an interval is constructed from those observations. The semantics of the classical confidence interval is: the (random) interval contains the (deterministic but unknown) value with high probability. Classical confidence intervals (as applied in machine learning) typically require that observations are independent. They have some drawbacks discussed previously. One drawback of concern is that classical confidence intervals break down rapidly when conditioning on information. (A minimal sketch of one such interval appears after this list.)
- Bayesian Confidence Intervals. These are common in several machine learning applications. If you have a prior distribution over the way the world creates observations, then you can use Bayes’ law to construct a posterior distribution over the way the world creates observations. With respect to this posterior distribution, you construct an interval containing the truth with high probability. The semantics of a Bayesian confidence interval is: “If the world is drawn from the prior, the interval contains the truth with high probability.” No assumption of independent samples is required. Unlike classical confidence intervals, it’s easy to make a statement conditioned on features. For example, “the probability of disease given the observations is in [0.8,1]”. My principal source of uneasiness with Bayesian confidence intervals is the “if the world is drawn from the prior” clause: I believe it is difficult to know and specify a correct prior distribution. Many Bayesians aren’t bothered by this, but the meaning of a Bayesian confidence interval becomes unclear if you work with an incorrect (or subjective) prior. (Sketched below.)
- Asymptotic Intervals. These are also common in applied machine learning, and I strongly dislike them. The basic line of reasoning seems to be: “Someone once told me that if observations are IID, then the distribution of their average converges to a normal distribution, so let’s use an unbiased estimate of the mean and variance, assume convergence, and then construct a confidence interval for the mean of a Gaussian.” Asymptotic intervals are asymptotically equivalent to classical confidence intervals, but they can differ spectacularly at finite sample sizes. The simplest example is when a classifier has zero error rate on a test set. A classical confidence interval for the error rate is [0, ln(1/d)/n], where n is the size of the test set and d is the probability that the interval fails to contain the truth. The asymptotic interval is [0,0], which is bogus in every application I’ve encountered. (The two are compared in a sketch after this list.)
- Internal Confidence Intervals. These are not used much, except in the analysis of agnostic active learning. The essential idea is that we cease to make intervals about the world, and instead make intervals around our own predictions about the world. The real world might assign label 0 or label 1 given a particular context x, and we could only discover the world’s truth by actually observing labeled examples (x,y). Yet it sometimes turns out to be easy to infer “our learning algorithm will definitely predict label 1 given features x”. This allowed dependence on x means we can efficiently guide exploration. A basic question is: can this notion of internal confidence guide other forms of exploration? (A toy sketch appears after this list.)
- Gamesman Intervals. Vovk and Shafer have been working on new foundations of probability, where everything is stated in terms of games. In this setting, a confidence interval is (roughly) a set of predictions output by an adaptive rule, with the property that the set contains the true observation a large fraction of the time. This approach has yet to catch on, but it is interesting because it provides a feature-dependent confidence interval without making strong assumptions about the world. (A sketch in this spirit appears after this list.)
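To make some of these notions concrete, here are a few minimal sketches. They are illustrations under stated assumptions rather than definitive implementations, and every function name and number in them is made up for the example. First, a classical confidence interval for a classifier’s test error via Hoeffding’s inequality, one standard learning-theory construction (an exact binomial interval would serve equally well):

```python
import math

def hoeffding_interval(errors, n, d):
    """Interval containing the true error rate with probability at least
    1 - d over the random draw of the n independent test examples."""
    p_hat = errors / n
    eps = math.sqrt(math.log(2.0 / d) / (2.0 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

# Hypothetical numbers: 13 mistakes on a 1000-example test set, 95% confidence.
print(hoeffding_interval(errors=13, n=1000, d=0.05))
```

Note the semantics: the interval is the random object; the error rate is fixed but unknown.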
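Next, a Bayesian confidence (credible) interval for the probability of an event, assuming a uniform Beta(1,1) prior, conditionally independent Bernoulli observations, and SciPy for the Beta quantiles. It also illustrates the point from the first item: a posterior mean of 0.5 can come with a wide or a narrow interval depending on how much data has been seen.

```python
from scipy.stats import beta  # assumes SciPy is available

def beta_credible_interval(successes, failures, level=0.95):
    """Central posterior interval for an event probability, under a uniform
    Beta(1,1) prior and conditionally independent Bernoulli observations."""
    a, b = 1.0 + successes, 1.0 + failures
    tail = (1.0 - level) / 2.0
    return beta.ppf(tail, a, b), beta.ppf(1.0 - tail, a, b)

# Both posteriors have mean 0.5, but the intervals are very different:
print(beta_credible_interval(1, 1))      # 2 observations: wide interval
print(beta_credible_interval(500, 500))  # 1000 observations: narrow interval
```

The “if the world is drawn from the prior” caveat applies: the coverage statement is with respect to the posterior built from Beta(1,1), not with respect to repeated sampling from a fixed world.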
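Third, the asymptotic (normal-approximation) interval next to a classical one in the zero-observed-error case described above. With zero errors the estimated variance is zero, so the asymptotic interval collapses to [0,0], while the classical construction still reports [0, ln(1/d)/n].

```python
import math
from statistics import NormalDist

def asymptotic_interval(errors, n, d):
    """Normal-approximation interval for the error rate (failure probability d)."""
    p_hat = errors / n
    z = NormalDist().inv_cdf(1.0 - d / 2.0)
    eps = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

def zero_error_classical_interval(n, d):
    """If the true error rate exceeded ln(1/d)/n, observing zero errors would
    have probability below d, so [0, ln(1/d)/n] contains the truth with
    probability at least 1 - d."""
    return 0.0, math.log(1.0 / d) / n

n, d = 1000, 0.05
print(asymptotic_interval(0, n, d))         # (0.0, 0.0): the bogus answer
print(zero_error_classical_interval(n, d))  # (0.0, ~0.003)
```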
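Fourth, a hypothetical toy version of internal confidence, using one-dimensional threshold classifiers and assuming (unlike the agnostic setting) that some threshold is consistent with all labels. The point it illustrates is the one above: without knowing the world’s label for x, we can sometimes be certain what our own learning algorithm will predict on x, and otherwise we know x is worth exploring.

```python
def version_space(labeled):
    """Thresholds t consistent with the labeled pairs, for h_t(x) = 1[x >= t]:
    t must exceed every negative x and be at most every positive x."""
    lo = max((x for x, y in labeled if y == 0), default=float("-inf"))
    hi = min((x for x, y in labeled if y == 1), default=float("inf"))
    return lo, hi

def internal_prediction(labeled, x):
    """1 or 0 if every consistent threshold predicts that label on x (so our
    prediction is certain, whatever the world's label turns out to be),
    else None (x is in the disagreement region and worth exploring)."""
    lo, hi = version_space(labeled)
    if x >= hi:
        return 1
    if x <= lo:
        return 0
    return None

data = [(0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1)]  # made-up labeled examples
print([internal_prediction(data, x) for x in (0.1, 0.85, 0.5)])  # [0, 1, None]
```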
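Finally, in the spirit of the last item, a sketch of split conformal prediction, a concrete construction from the same Vovk/Shafer line of work: it outputs a set for each new observation that contains the truth roughly a 1 − alpha fraction of the time, assuming only exchangeable data. The 1-nearest-neighbour predictor and the synthetic data below are stand-ins for any learned model and any dataset.

```python
import math
import random

def one_nn_predict(train, x):
    """Stand-in for any learned model: 1-nearest-neighbour regression."""
    return min(train, key=lambda xy: abs(xy[0] - x))[1]

def split_conformal_interval(train, calib, x_new, alpha=0.1):
    """Interval for the unseen y at x_new that contains the truth roughly a
    1 - alpha fraction of the time, assuming only exchangeable (x, y) pairs."""
    residuals = sorted(abs(y - one_nn_predict(train, x)) for x, y in calib)
    k = min(len(residuals) - 1,
            math.ceil((len(residuals) + 1) * (1.0 - alpha)) - 1)
    q = residuals[k]
    center = one_nn_predict(train, x_new)
    return center - q, center + q

random.seed(0)
data = [(x, 2.0 * x + random.gauss(0.0, 0.1))
        for x in (random.random() for _ in range(200))]
print(split_conformal_interval(data[:100], data[100:], x_new=0.5))
```

The center of the interval depends on the features of the new point, but no prior over worlds and no parametric model of the observations is assumed.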