The simplest bound arises for the classical technique of splitting the data set into two pieces: a training set of size $n$ and a test set of size $m$. In this setting, the following simple bound applies:
(Holdout Sample Complexity) Let $\hat\epsilon$ be the empirical error rate of the hypothesis on the $m$ test examples and $\epsilon$ be its true error rate. Then, for all $\delta \in (0,1]$, with probability at least $1-\delta$ over the draw of the test set, we have: $\epsilon \le \overline{\mathrm{Bin}}(m, \hat\epsilon, \delta)$, where $\overline{\mathrm{Bin}}(m, \hat\epsilon, \delta) = \max\left\{ p : \sum_{j=0}^{m\hat\epsilon} \binom{m}{j} p^j (1-p)^{m-j} \ge \delta \right\}$ is the largest true error rate under which observing $m\hat\epsilon$ or fewer test errors still has probability at least $\delta$.
PROOF. The proof is just a simple identification with the Binomial. For any distribution $D$ over (input, label) pairs and any hypothesis $h$, there exists some probability $\epsilon$ that the hypothesis predicts incorrectly on a random draw from $D$. We can regard this event as a coin flip with bias $\epsilon$. Since each test example is drawn independently, the distribution of the number of test errors (and hence of the empirical error rate) is Binomial. Given that the distribution is Binomial, we can calculate an upper bound on $\epsilon$ which holds with high probability by inverting the Binomial tail. ▫
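The Binomial tail inversion in theorem 4.1.1 is easy to compute numerically. The following is a minimal sketch of one way to do so by bisection, assuming the scipy library is available; the function name test_set_bound and the example numbers are illustrative, not the procedure documented in Appendix Section 16.1.

```python
# Minimal sketch: invert the Binomial tail by bisection to obtain the test set bound.
from scipy.stats import binom

def test_set_bound(m, k, delta, tol=1e-10):
    """Largest true error rate p such that observing k or fewer errors out of
    m independent test examples still has probability at least delta."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binom.cdf(k, m, mid) >= delta:   # k or fewer errors is still plausible at this p
            lo = mid                        # so the true error could be at least this large
        else:
            hi = mid
    return hi

# Example (illustrative numbers): 23 errors observed on 1000 test examples, delta = 0.05
print(test_set_bound(1000, 23, 0.05))
```

Because the Binomial tail probability is nonincreasing in $p$ for a fixed number of errors, the bisection converges to the maximum in the definition of $\overline{\mathrm{Bin}}$.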
There are two immediate corollaries of the holdout theorem (4.1.1) which are mathematically simpler although not as tight. The first corollary applies to the limited “realizable” setting where you happen to observe zero test errors.
PROOF. Specializing theorem 4.1.1 to the zero empirical error case, we get: $\Pr(\hat\epsilon = 0) = (1-\epsilon)^m \le e^{-\epsilon m}$. Setting this equal to $\delta$ and solving for $\epsilon$ gives us the result, $\epsilon \le \frac{\ln(1/\delta)}{m}$. ▫
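For example, with illustrative numbers: a hypothesis making zero errors on $m = 1000$ test examples satisfies $\epsilon \le \ln(1/0.05)/1000 \approx 0.003$ with probability at least $0.95$.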
A second corollary applies to all results, not just those where we observe zero errors.
PROOF. Loosening theorem 4.1.1 with the Hoeffding approximation of the Binomial tail, we get: $\Pr(\hat\epsilon \le \epsilon - \gamma) \le e^{-2\gamma^2 m}$. Using the inversion lemma 3.4.1 we can set this equal to $\delta$, and solve for $\gamma$ to get the result: with probability at least $1-\delta$, $\epsilon \le \hat\epsilon + \sqrt{\frac{\ln(1/\delta)}{2m}}$. ▫
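For example, with illustrative numbers: $m = 1000$ test examples and $\delta = 0.05$ give an additive slack of $\sqrt{\ln(20)/2000} \approx 0.039$, regardless of the observed error rate.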
How tight is the test sample complexity theorem 4.1.1? The answer is: very tight. Let us define $\bar\epsilon \equiv \overline{\mathrm{Bin}}(m, \hat\epsilon, \delta)$ as our true error bound. We wish to know how much $\bar\epsilon$ and $\hat\epsilon$ differ. Applying the Hoeffding approximation, we know that with high probability, $\bar\epsilon \le \hat\epsilon + \sqrt{\frac{\ln(1/\delta)}{2m}}$. Thus the region in which $\epsilon$ is confined with high confidence is of size $\sqrt{\frac{\ln(1/\delta)}{2m}}$ or smaller.
It is common practice in the field of machine learning to use the Gaussian approximation in reporting error bars. The practice is reasonably safe because it is usually pessimistic. However, it can occasionally lead to embarrassing results in which the reported error bars extend below $0$ or above $1$. The test sample complexity theorem never produces an upper bound greater than $1$ or a lower bound less than $0$, because it uses the fundamental Binomial distribution. This approach is the “right” way to report test-set based errors, given the assumption of independence. Appendix Section 16.1 documents how to apply this bound. Pictorially, we can represent this as in figure 4.1.1.
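To illustrate the contrast, the following sketch compares a two-sided Gaussian error bar with the Binomial inversion, reusing the illustrative test_set_bound function from the earlier sketch; the numbers are made up for illustration.

```python
# Contrast Gaussian-approximation error bars with the Binomial tail inversion.
# Reuses the illustrative test_set_bound() defined above; numbers are made up.
import math

m, k, delta = 100, 1, 0.05          # 1 error observed on 100 test examples
p_hat = k / m

z = 1.96                            # two-sided 95% Gaussian quantile
half_width = z * math.sqrt(p_hat * (1 - p_hat) / m)
print(p_hat - half_width, p_hat + half_width)   # Gaussian lower "bound" is negative here
print(test_set_bound(m, k, delta))              # Binomial bound stays within [0, 1]
```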
Some results from applying the simple test set bound are presented in figure 12.3.3 on page 326. In summary, the test set bound tends to work quite well in practice when sufficient examples are available.
Given that the bounds for the simple holdout technique are so tight, why do we need to engage in further work? There is one serious drawback to the holdout technique: its application requires otherwise unused examples. This can strongly degrade the value of the learned hypothesis, because devoting those extra examples to the training set could substantially reduce the true error of the learned hypothesis on some learning problems.
There is another reason why training set based bounds are important. Many learning algorithms implicitly assume that the training error “behaves like” the true error in choosing the hypothesis. With an inadequate number of training examples, there may be very little relationship between the behavior of the training error and the true error. Training error based bounds can be used in the training algorithm.
There are two basic approaches to this difficulty: (1) reuse the same examples for both training and evaluation, as in cross validation, and (2) derive bounds on the true error in terms of the training error.
Before discussing approach (2), we will make a few comments about approach (1) to suggest the variety of theoretical difficulties which occur when using it.
One of the standard techniques for attempting to improve on the holdout bound is cross validation. $K$-fold cross validation divides the $m$ data points into $K$ folds of size $m/K$ (assume $m$ is divisible by $K$ for simplicity). Then, for every fold $i$, hold out fold $i$, train on the remainder of the data, and test on fold $i$. Let the hypotheses found by training be $h_1, \dots, h_K$ and their respective holdout errors be $\hat\epsilon_1, \dots, \hat\epsilon_K$. Also let $\hat\epsilon_{\mathrm{cv}} = \frac{1}{K} \sum_{i=1}^{K} \hat\epsilon_i$.
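The procedure can be summarized in a few lines. The following is a minimal sketch of $K$-fold cross validation; the toy majority-vote learner and the synthetic data are illustrative placeholders for an arbitrary learning algorithm.

```python
# Minimal sketch of K-fold cross validation; learner and data are placeholders.
import numpy as np

def k_fold_cv(X, y, K, train):
    """For each fold i, train on the other K-1 folds and record the holdout
    error on fold i.  Returns the per-fold errors and their average e_cv."""
    m = len(y)
    idx = np.arange(m)                      # assume the data are already in random order
    folds = np.array_split(idx, K)
    errors = []
    for i in range(K):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != i])
        h = train(X[train_idx], y[train_idx])                   # hypothesis h_i
        errors.append(np.mean(h(X[test_idx]) != y[test_idx]))   # holdout error on fold i
    return errors, float(np.mean(errors))

def majority_learner(X, y):
    """Toy learner: always predict the most common label in the training set."""
    label = np.bincount(y).argmax()
    return lambda X_new: np.full(len(X_new), label)

# Illustrative use on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (rng.random(100) < 0.3).astype(int)
per_fold, e_cv = k_fold_cv(X, y, K=10, train=majority_learner)
print(per_fold, e_cv)
```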
There are several variations of cross validation. If $K = m$, the procedure is often called “leave one out cross validation”. In one variant, you train on all of the data to learn a new hypothesis, $h$, and assume it has a true error rate near $\hat\epsilon_{\mathrm{cv}}$. In another variant, you predict according to $h_i$ for an index $i$ chosen uniformly at random, i.e. according to a stochastic mixture of $h_1, \dots, h_K$. The latter variant is simpler to analyze because linearity of expectation implies that $\hat\epsilon_{\mathrm{cv}}$ is an unbiased estimate of the true error rate of this stochastic mixture.
There are strong results known for cross validation on nearest-neighbor, kernel, and histogram classifiers [11]. In general, only very weak results are known about bounds on the variance of cross validation for general classifiers. The “general” results include “sanity check bounds” [27], which state that cross validation is not much worse than a holdout set, and some slightly stronger results [?] and [25].
(Open) Construct a bound on the deviation of cross validation for arbitrary classifiers which is a quantitative improvement on the results of [?].