A calibrated predictor is one which predicts the probability of a binary event with the property: For all predictions p, the proportion of the time that 1 is observed is p.
Since there are infinitely many p, this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p.
A great deal of effort has been devoted to strategies for achieving calibrated (such as here) prediction. With statements like: (under minimal conditions) you can always make calibrated predictions.
Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way:
- We want good probabilistic predictions.
- Good probabilistic predictions are calibrated.
- Therefore, we want calibrated predictions.
The “Therefore” step misses the fact that calibration is a necessary but not a sufficient characterization of good probabilities. For example on the sequence “010101010…”, always predicting p=0.5 is calibrated.
This leads to the question: What is a sufficient characterization of good probabilities? There are several candidates:
- From Vohra: Calibrated on all simple subsequences.
- Small squared error: sumx (x-px)2.
- Small log probability: sumx log (1/px)
I don’t yet understand which of these candidates is preferrable.
There is a sense in which none of them can be preferred. In any complete prediction system, the probabilities are used in some manner, and there is some loss (or utility) associated with it’s use. The “real” goal is minimizing that loss. Depending on the sanity of the method using the probabilities, this may even imply that lieing about the probabilities is preferred. Nevertheless, we can hope for a sane use of probabilities and a sufficient mechanism for predicting good probabilities might eventually result in good performance for any sane use.