A calibrated predictor is one which predicts the probability of a binary event with the property: for every prediction value p, among the events assigned probability p, the proportion of the time that 1 is observed is p.
Since there are infinitely many p, this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p.
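As a rough illustration of this softened definition, here is a minimal Python sketch; the function name, the bin width eps, and the return format are my own choices, not anything canonical. It groups predictions into neighborhoods of width eps and compares the average prediction in each neighborhood to the observed frequency of 1s:

```python
from collections import defaultdict

def calibration_table(predictions, outcomes, eps=0.1):
    """Group predictions into neighborhoods of width eps and compare the
    average prediction in each neighborhood to the observed frequency of 1s."""
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        bins[int(p / eps)].append((p, y))
    table = []
    for b in sorted(bins):
        pairs = bins[b]
        avg_p = sum(p for p, _ in pairs) / len(pairs)
        freq_1 = sum(y for _, y in pairs) / len(pairs)
        table.append((avg_p, freq_1, len(pairs)))
    return table  # calibrated: avg_p close to freq_1 in every well-populated bin
```

The choice of eps here is arbitrary, which is exactly the discretization issue that comes up again later.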
A great deal of effort has been devoted to strategies for achieving calibrated prediction (such as here), accompanied by statements like: (under minimal conditions) you can always make calibrated predictions.
Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way:
- We want good probabilistic predictions.
- Good probabilistic predictions are calibrated.
- Therefore, we want calibrated predictions.
The “Therefore” step misses the fact that calibration is a necessary but not a sufficient characterization of good probabilities. For example, on the sequence “010101010…”, always predicting p=0.5 is calibrated.
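To make the example concrete, here is a tiny self-contained check (purely illustrative) that the constant prediction 0.5 on the alternating sequence passes the calibration test:

```python
outcomes = [0, 1] * 50        # the sequence 010101...01
predictions = [0.5] * 100     # always predict p = 0.5

# Every prediction falls in the single neighborhood around 0.5, and the
# observed frequency of 1s there is exactly 0.5, so the constant predictor
# is calibrated even though it says nothing useful about individual events.
print(sum(outcomes) / len(outcomes))  # 0.5
```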
This leads to the question: What is a sufficient characterization of good probabilities? There are several candidates:
- From Vohra: Calibrated on all simple subsequences.
- Small squared error: sum_x (x - p_x)^2.
- Small log probability: sum_x log(1/p_x). (Both are sketched in code below.)
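For concreteness, here is a minimal sketch of the last two candidates; the function names are mine, x (written y in code) is the observed outcome in {0,1}, and p_x is the predicted probability that the outcome is 1:

```python
import math

def squared_error(predictions, outcomes):
    # sum_x (x - p_x)^2, where x is the observed outcome in {0, 1}
    return sum((y - p) ** 2 for p, y in zip(predictions, outcomes))

def log_loss(predictions, outcomes):
    # sum_x log(1 / p_x), where p_x is the probability the predictor
    # assigned to the outcome that actually occurred (assumes 0 < p < 1)
    return sum(math.log(1.0 / (p if y == 1 else 1.0 - p))
               for p, y in zip(predictions, outcomes))
```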
I don’t yet understand which of these candidates is preferable.
There is a sense in which none of them can be preferred. In any complete prediction system, the probabilities are used in some manner, and there is some loss (or utility) associated with their use. The “real” goal is minimizing that loss. Depending on the sanity of the method using the probabilities, this may even imply that lying about the probabilities is preferred. Nevertheless, we can hope for a sane use of probabilities, and a sufficient mechanism for predicting good probabilities might eventually result in good performance for any sane use.
The other necessary condition for good probabilities is ranking ability,
that is, how likely are we to predict a larger probability for an example
of class 1 than for an example of class 0.
One way of measuring ranking ability is using the area under the ROC curve.
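Here is one possible sketch of that measure, written as the equivalent pairwise comparison: the probability that a randomly chosen class-1 example receives a larger predicted probability than a randomly chosen class-0 example, counting ties as one half. The function name and the quadratic-time formulation are my own choices, not a standard implementation:

```python
def auc(predictions, outcomes):
    """Probability that a random class-1 example receives a larger predicted
    probability than a random class-0 example (ties count as 1/2)."""
    pos = [p for p, y in zip(predictions, outcomes) if y == 1]
    neg = [p for p, y in zip(predictions, outcomes) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# A constant prediction (e.g. always 0.5 on 0101...) gives auc = 0.5:
# calibrated, but with no ranking ability.
```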
Ranking and calibration seem to be complementary. In the example you mentioned,
the probabilities are calibrated but have no ranking ability.
I have a feeling that ranking and calibration taken together are a sufficient
condition for good probabilities, but I don’t know how to formalize this or how
to combine them.
I also don’t know whether minimizing the different measures you listed (cross-entropy
and MSE) amounts to different trade-offs between calibration and ranking.
I spoke to Dean Foster and Rick Vohra about Bianca’s idea of good calibration + good ranking = good probabilities. It seems to be correct, at least in an asymptotic sense.
There are issues with the finite sample case which are not entirely resolved. Part of the difficulty is that the definition of calibration is a bit tricky unless you have many samples because you must introduce some form of discretization, and there is no canonical choice.
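To illustrate the discretization point, here is a small sketch (the data and bin widths are invented for illustration) showing that the measured calibration error of the same finite set of predictions changes with the choice of bin width:

```python
from collections import defaultdict

def calibration_error(predictions, outcomes, eps):
    """Average absolute gap between mean prediction and observed frequency,
    after discretizing predictions into bins of width eps."""
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        bins[int(p / eps)].append((p, y))
    gaps = []
    for pairs in bins.values():
        avg_p = sum(p for p, _ in pairs) / len(pairs)
        freq_1 = sum(y for _, y in pairs) / len(pairs)
        gaps.append(abs(avg_p - freq_1))
    return sum(gaps) / len(gaps)

predictions = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
outcomes    = [0,   0,   1,   0,   1,   1  ]
print(calibration_error(predictions, outcomes, eps=0.5))   # one answer
print(calibration_error(predictions, outcomes, eps=0.25))  # a different answer, same data
```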