The essential problem here is the large gap between experimental observation and theoretical understanding of cross validation.
Method K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K. For each fold, a classifier is trained on the other folds and then tested on the held-out fold.
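For concreteness, here is a minimal sketch of the procedure in Python. It assumes a scikit-learn-style classifier with fit/predict methods and numpy arrays for the data; the function and parameter names are illustrative, not part of the problem statement.

```python
import numpy as np

def k_fold_cross_validation(make_classifier, X, y, K=10, seed=0):
    """Train K classifiers, each on all folds but one, and test each
    on its held-out fold. Returns the K classifiers and the average
    held-out error."""
    m = len(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)          # shuffle before splitting
    folds = np.array_split(order, K)    # K folds of size roughly m/K
    classifiers, fold_errors = [], []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        clf = make_classifier()
        clf.fit(X[train_idx], y[train_idx])        # train on the other folds
        err = np.mean(clf.predict(X[test_idx]) != y[test_idx])  # test on fold k
        classifiers.append(clf)
        fold_errors.append(err)
    return classifiers, float(np.mean(fold_errors))
```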
Problem Assume only that the samples are drawn independently. Derive a single classifier from the K fold classifiers with a small bound on its true error rate.
Past Work (I’ll add more as I remember/learn.)
- Devroye, Rogers, and Wagner analyzed cross validation and found algorithm-specific bounds. Not all of this work is online, but here is one paper.
- Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K.
- Avrim Blum, Adam Kalai, and I analyzed cross validation and found that, with no additional assumptions, you can do at least as well as a test set of size m/K using the randomized classifier which draws uniformly from the set of K fold classifiers (see the sketch after this list).
- Yoshua Bengio and Yves Grandvalet analyzed cross validation and concluded that there is no unbiased estimator of the variance of K-fold cross validation.
- Matti Kääriäinen noted that you can safely derandomize a stochastic classifier (such as one that randomizes over the K folds) using unlabeled data without additional assumptions.
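The randomized classifier in the third bullet is easy to state in code. The sketch below assumes the list of fold classifiers produced by the procedure above and a scikit-learn-style predict interface; both are assumptions for illustration.

```python
import random

class RandomizedFoldClassifier:
    """The stochastic classifier from the Blum-Kalai-Langford result:
    for each prediction, draw one of the K fold classifiers uniformly
    at random and use its prediction."""

    def __init__(self, classifiers):
        self.classifiers = classifiers

    def predict_one(self, x):
        clf = random.choice(self.classifiers)  # uniform draw over the K folds
        return clf.predict([x])[0]
```

Kääriäinen’s observation is that a stochastic classifier of this form can then be safely derandomized using unlabeled data.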
Some Extreme Cases to Sharpen Intuition
- Suppose on every fold the learned classifier is the same. Then, the cross-validation error should behave something like a test set of size m. This is radically superior to a test set of size m/K.
- Consider leave-one-out cross validation. Suppose we have a “learning” algorithm that uses the classification rule “always predict the parity of the labels on the training set”. Suppose the learning problem is defined by a distribution which picks y=1 with probability 0.5, independent of the features. The prediction for example i is then the parity of the other m-1 labels, which errs exactly when the parity of all m labels is 1. So, with probability 0.5, all leave-one-out errors will be 0 and otherwise all 1 (like a single coin flip, despite using m examples; see the simulation sketch below).
(some discussion is here)
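A small simulation makes the second extreme case concrete. It uses the stated setup (labels uniform on {0,1}, features irrelevant); the function name and parameters are illustrative.

```python
import numpy as np

def loo_parity_errors(m=100, trials=10000, seed=0):
    """Simulate the leave-one-out error of the 'predict the parity of
    the training labels' rule. Each trial's error estimate is 0 or 1,
    so across trials the estimate has mean ~0.5 and std ~0.5: a single
    coin flip, no matter how large m is."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(trials):
        y = rng.integers(0, 2, size=m)
        total = y.sum() % 2
        # Prediction for example i is the parity of the other m-1 labels,
        # i.e. total XOR y[i]; it disagrees with y[i] exactly when total == 1.
        preds = total ^ y
        estimates.append(np.mean(preds != y))
    return float(np.mean(estimates)), float(np.std(estimates))
```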
Difficulty I consider this a hard problem. I’ve worked on it without success and it’s an obvious problem (due to the pervasive use of cross validation) that I suspect other people have considered. Analyzing the dependency structure of cross validation is quite difficult.
Impact On any individual problem, solving this might have only a small impact due to slightly improved judgement of success. But, because cross validation is used extensively, the overall impact of a good solution might be very significant.