Machine Learning (Theory)


Incompatibilities between classical confidence intervals and learning.

Classical confidence intervals satisfy a theorem of the form: For some data sources D,

PrS ~ D(f(D) > g(S)) > 1-d

where f is some function of the distribution (such as the mean) and g is some function of the observed sample S. The constraints on D can vary between “Independent and identically distributed (IID) samples from a gaussian with an unknown mean” to “IID samples from an arbitrary distribution D“. There are even some confidence intervals which do not require IID samples.

Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bounds holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world where d is the safety parameter.

There are a number of results in theoretical machine learning which use confidence intervals. For example,

  1. The E3 algorithm uses confidence intervals to learn a near optimal policy for any MDP with high probability.
  2. Set Covering Machines minimize a confidence interval upper bound on the true error rate of a learned classifier.
  3. The A2 uses confidence intervals to safely deal with arbitrary noise while taking advantage of active learning.

Suppose that we want to generalize thse algorithms in a reductive style. The goal would be to train a regressor to predict the output of g(S) for new situations. For example, a good regression prediction of g(S) might allow E3 to be applied to much larger state spaces. Unfortunately, this approach seems to fail badly due to a mismatch between the semantics of learning and the semantics of a classical confidence interval.

  1. It’s difficult to imagine a constructive sampling mechanism. In a large state space, we may never encounter the same state twice, so we can not form meaningful examples of the form “for this state-action, the correct confidence interval is y“.
  2. When we think of succesful learning, we typically think of it in an l1 sense—the expected error rate over the data generating distribution is small. Confidence intervals have a much stronger meaning as we would like to apply them: with high probability, in all applications, the confidence interval holds. This mismatch appears unaddressable.

It is tempting to start plugging in other notions such as Bayesian confidence intervals or quantile regression systems. Making these approaches work at a theoretical level on even simple systems is an open problem, but there is plenty of motivation to do so.

4 Comments to “Incompatibilities between classical confidence intervals and learning.”
  1. Anonymous says:

    Could you elaborate on how and why Bayesian Confidence Intervals do not “work at a theoretical level”?
    Do you mean for some specific choices of f(D)/g(S) ?

  2. jl says:

    Note I did not say “do not work”, but rather “are an open problem”.

    What I mean is that you can’t simply substitute Bayesian confidence intervals for classical confidence intervals in the existing results on learning MDPs. The different meaning of a Bayesian confidence interval implies you can not prove the same thing.

  3. Pier says:

    I’ve problems understanding what you’re trying to say here. I see (but I might be wrong) a bit of confusion between data-models like (quantile|Bayesian) regression, and *functions* of these models like confidence (“frequentist” term) or credible (Bayesian counterpart) intervals for values *predicted* by these models. BTW, you should know that *every* Bayesian procedure satisfies some kind of (not-necessarily-optimal) “frequentist” property and viceversa. Problem is: which property is most important for you and (then) how to prove it.


  4. […] Classical Confidence Intervals. These are common in learning theory. The essential idea is that world has some true-but-hidden value, such as the error rate of a classifier. Given observations from the world (such as err-or-not on examples), an interval is constructed around the hidden value. The semantics of the classical confidence interval is: the (random) interval contains the (determistic but unknown) value, with high probability. Classical confidence intervals (as applied in machine learning) typically require that observations are independent. They have some drawbacks discussed previously. One drawback of concern is that classical confidence intervals breakdown rapidly when conditioning on information. […]

Sorry, the comment form is closed at this time.

Powered by WordPress