# Problem: Cross Validation

The essential problem here is the large gap between experimental observation and theoretical understanding.

Method K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K. For each fold, a classifier is trained on the other folds and then test on the fold.

Problem Assume only independent samples. Derive a classifier from the K classifiers with a small bound on the true error rate.

Past Work (I’ll add more as I remember/learn.)

1. Devroye, Rogers, and Wagner analyzed cross validation and found algorithm specific bounds. Not all of this is online, but here is one paper.
2. Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K.
3. Avrim Blum, Adam Kalai, and myself analyzed cross validation and found that you can do at least as well as a test set of size m/K with no additional assumptions using the randomized classifier which draws uniformly from the set of size K.
4. Yoshua Bengio and Yves Grandvalet analyzed cross validation and concluded that there was no unbiased estimator of variance.
5. Matti KÃ¤Ã¤riÃ¤inen noted that you can safely derandomize a stochastic classifier (such as one that randomizes over the K folds) using unlabeled data without additional assumptions.

Some Extreme Cases to Sharpen Intuition

1. Suppose on every fold the learned classifier is the same. Then, the cross-validation error should behave something like a test set of size m. This is radically superior to a test set of size m/K.
2. Consider leave-one-out cross validation. Suppose we have a “learning” algorithm that uses the classification rule “always predict the parity of the labels on the training set”. Suppose the learning problem is defined by a distribution which picks y=1 with probability 0.5. Then, with probability 0.5, all leave-one-out errors will be 0 and otherwise 1 (like a single coin flip).

(some discussion is here)

Difficulty I consider this a hard problem. I’ve worked on it without success and it’s an obvious problem (due to the pervasive use of cross validation) that I suspect other people have considered. Analyzing the dependency structure of cross validation is quite difficult.

Impact On any individual problem, solving this might have only have a small impact due to slightly improved judgement of success. But, because cross validation is used extensively, the overall impact of a good solution might be very significant.

## 11 Replies to “Problem: Cross Validation”

1. rif says:

Very nice example. Thanks. You certainly bring up a lot of good issues.

I’m not quite sure what the practical implications are. Your example is obviously correct, but I wonder to what extent it should be viewed as “pathological.” My experience across a moderate range of algorithms and datasets tends to be that the LOO error tracks the error on a held-out set quite closely.

Do you think there is any value in restricting attention to algorithms which are going to learn in some sense, and trying to derive bounds under that restriction? I’m not sure it’s useful…

2. John Langford says:

That particular example is clearly pathological, but it’s existence seems worrisome.

If you want a less pathological example, then consider the learning rule “predict according to the majority of the labels”. With probability about 1/sqrt(m) the set of labels will be precisely balanced, and you get the same high variance behavior as each individual example is held out.

3. rif says:

Yes, another nice example. That’s one I came up with (I believe Misha Belkin came up with it independently) back when we were talking about algorithmic stability — it’s an easy way to show that empirical risk minimization in a small space needn’t have fast (O(1/n)) uniform stability. I’ll agree that the existance of such examples is possibly worrisome, but I’m not still not sure how relevant they are in practice.
Certainly, the second example is closer to a realistic algorithm, but on the other hand, the probability that you get bitten is at least decaying moderately in the number of data points.

I probably don’t have anything else deep to add to this. Nice food for thought.

4. nikki says:

Why we prefer 10-fold cross validation to 2-fold cross validation?

5. Aleks says:

10 fold cross-validation is the one most habitually used. Generally, one should perform several replications of 2-fold CV (it is slightly unstable), but usually a single replication of 10-fold CV is considered sufficient.

In general, if someone deviates from the standard, people begin to wonder if the number of folds was chosen to show the particular approach in the best light.

In summary, there is no strong reason other than social inertia and the stability of results.

6. Michel Valstar says:

First of all Nikki, you must understand that the confidence interval of our predicted error narrows if we take more folds. That being said, there is a difference between 10 fold cv and multiple 2 fold CVs. If we repeat a 2 fold CV, the test sets are no longer qualify as being drawn independently with respect to the underlying distribution D.

7. jl says:

The confidence interval does not always narrow with more folds. Take a look at this long comment.

8. George Forman says:

We often have limited size training sets. 2-fold –> 50% data used for each classifier. 10-fold –> 90%. I sometimes use 50-fold –> 98%. It matters because when your training set is small, adding another few% training cases can give you a significant improvement in the learning curve. So, the k separate fold classifiers might behave very differently that the ‘final’ classifier trained on 100% of the training set.

9. azam says:

hi
i have a question abut k*2 cross validation,
i know this method is a variation of k fold cross validation .in this method for each fold we should devide dataset to 2 section,d0 and d1,and once we should use d0 for train and d1 for test,followed by d0 for test and d1 for train.but i dont know wheter we should average between them or not?