Bounding the deviation of the progressive validation error $\hat{e}_{prog}$ from $e_{prog}$ is more difficult than bounding the deviation of a holdout error. To understand this, we can think of two games, the holdout game and the progressive validation game.
In the holdout game, your opponent chooses a bias $p$ and then nature flips $n$ coins with that bias. If the deviation of the average number of heads from $p$ is larger than some threshold $\gamma$, then you lose. Otherwise, you win.
In the progressive validation game, the opponent chooses the bias of each coin just before it is flipped, as a function of the flips so far. The goal of the opponent remains the same: it wins if the average number of heads deviates from the average of the chosen biases by more than $\gamma$.
The progressive validation opponent is at least as strong as the holdout opponent, since it could choose the same bias for every coin. Nonetheless, we will see that the progressive validation opponent is not much stronger.
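As a concrete illustration of the two games, the following Python sketch plays both many times and compares the typical deviations. The greedy history-dependent rule, the parameters n and trials, and the function names are illustrative choices, not anything prescribed above; the point is only that an adaptive opponent of this kind does not produce noticeably larger deviations than the holdout opponent.

    import random
    import statistics

    def holdout_deviation(n, p, rng):
        # Holdout game: a single bias p is fixed up front and nature flips n coins.
        heads = sum(rng.random() < p for _ in range(n))
        return abs(heads / n - p)

    def progressive_deviation(n, rng):
        # Progressive validation game: the opponent picks each coin's bias just
        # before the flip, as a function of the flips so far. The rule below
        # (a low bias when heads are running ahead of the chosen biases, a high
        # bias otherwise) is only one illustrative adaptive strategy.
        heads, bias_sum = 0, 0.0
        for _ in range(n):
            p = 0.25 if heads > bias_sum else 0.75
            bias_sum += p
            heads += rng.random() < p
        return abs(heads / n - bias_sum / n)

    if __name__ == "__main__":
        rng = random.Random(0)
        n, trials = 1000, 2000
        hold = [holdout_deviation(n, 0.5, rng) for _ in range(trials)]
        prog = [progressive_deviation(n, rng) for _ in range(trials)]
        print("typical holdout deviation     :", statistics.mean(hold))
        print("typical progressive deviation :", statistics.mean(prog))

Whatever rule the opponent uses, each flip's expected contribution to the gap is zero given the history, which is the intuition behind the arguments below; under these settings both typical deviations come out on the order of $1/\sqrt{n}$.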
There are two ways in which we can show that the progressive validation opponent is not much stronger. The first shows that the variance of the progressive validation estimate is smaller than one might expect. The second shows that the deviations produced by the progressive validation opponent behave much like the deviations produced by an independent opponent.
Suppose we test the progressive validation hypothesis on $n$ additional examples. Let $\hat{e}_{test}$ be the empirical error on these $n$ examples. Then, we have:
$$E\left[\left(\hat{e}_{prog} - e_{prog}\right)^2\right] \;\le\; E\left[\left(\hat{e}_{test} - e_{prog}\right)^2\right].$$
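Written out, with $\hat{e}_i$ the observed error of the hypothesis $h_i$ on the $i$th training example and $(x'_k, y'_k)$ the additional examples, the two estimates being compared are:
$$\hat{e}_{prog} = \frac{1}{n}\sum_{i=1}^{n}\hat{e}_i, \qquad \hat{e}_{test} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{1}\!\left[h_{I_k}(x'_k) \neq y'_k\right], \qquad I_k \sim \mathrm{Uniform}\{1,\ldots,n\},$$
where the indices $I_k$ are drawn independently. Reading "testing the progressive validation hypothesis" as classifying each additional example with an independently drawn member of $h_1, \ldots, h_n$ is an interpretive assumption of this presentation; it is what makes each additional example an error with probability exactly $e_{prog}$ given the training set.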
PROOF. Every example on the left hand side can be thought of as a coin with bias $e(h_i)$: conditioned on the first $i-1$ examples, the observed error $\hat{e}_i$ of the hypothesis $h_i$ (trained on those $i-1$ examples) on the $i$th example is a Bernoulli random variable with bias $e(h_i)$. The variance of the LHS is then:
$$E\left[\left(\hat{e}_{prog} - e_{prog}\right)^2\right] = \frac{1}{n^2}\,E\left[\left(\sum_{i=1}^{n}\left(\hat{e}_i - e(h_i)\right)\right)^2\right] = \frac{1}{n^2}\,E\left[\sum_{i=1}^{n}\left(\hat{e}_i - e(h_i)\right)^2 + \sum_{i\neq j}\left(\hat{e}_i - e(h_i)\right)\left(\hat{e}_j - e(h_j)\right)\right].$$
The right hand side is:
$$E\left[\left(\hat{e}_{test} - e_{prog}\right)^2\right] = E\left[\frac{e_{prog}\left(1 - e_{prog}\right)}{n}\right],$$
where the equality holds because, conditioned on the $n$ training examples, each additional example is an error independently with probability $e_{prog}$. The cross product term is:
$$\left(\hat{e}_i - e(h_i)\right)\left(\hat{e}_j - e(h_j)\right), \qquad i \neq j.$$
Without loss of generality, assume that $i < j$. What we wish to prove is that the expected value of this quantity is $0$ conditional on the values of all random variables other than the $i$th or $j$th example. Let $Z$ be the set of examples minus the $i$th and $j$th examples. Also, let $z_i = (x_i, y_i)$ and $z_j = (x_j, y_j)$ be the $i$th and $j$th labeled examples. If we can show that:
$$E_{z_i, z_j}\left[\left(\hat{e}_i - e(h_i)\right)\left(\hat{e}_j - e(h_j)\right) \,\middle|\, Z\right] = 0,$$
then taking the expectation over $Z$ will imply that:
$$E\left[\left(\hat{e}_i - e(h_i)\right)\left(\hat{e}_j - e(h_j)\right)\right] = 0.$$
The value of $\hat{e}_i - e(h_i)$ is fixed after conditioning on $Z$ and $z_i$ (and assuming a deterministic learning algorithm), while the value of $\hat{e}_j - e(h_j)$ is not fixed: it is dependent on the random variable $z_j$. Let $\delta_j = \hat{e}_j - e(h_j)$ be the derived random variable. Then, the expectation (implicitly conditioning on $Z$ and $z_i$) is:
$$E_{z_j}\left[\left(\hat{e}_i - e(h_i)\right)\delta_j\right] = \left(\hat{e}_i - e(h_i)\right) E_{z_j}\left[\delta_j\right].$$
For any fixed $h_j$, we want to show that $E_{z_j}[\delta_j] = 0$. With $h_j$ fixed, the true error rate of the $j$th hypothesis is $e(h_j)$. Therefore, the probability of observing an error ($\hat{e}_j = 1$) is $e(h_j)$, and the probability of observing no error ($\hat{e}_j = 0$) is $1 - e(h_j)$. This implies:
$$E_{z_j}\left[\delta_j\right] = e(h_j)\left(1 - e(h_j)\right) + \left(1 - e(h_j)\right)\left(0 - e(h_j)\right) = 0.$$
Putting this together, we get:
$$E_{z_i, z_j}\left[\left(\hat{e}_i - e(h_i)\right)\left(\hat{e}_j - e(h_j)\right) \,\middle|\, Z\right] = E_{z_i}\left[\left(\hat{e}_i - e(h_i)\right) E_{z_j}\left[\delta_j\right] \,\middle|\, Z\right] = 0.$$
Since this expectation is zero regardless of the values of all the other random variables, the cross product terms all have expectation zero. Note that we can extend this proof to randomized algorithms by conditioning on the random bits of the algorithm in the above argument. Consequently:
$$E\left[\left(\hat{e}_{prog} - e_{prog}\right)^2\right] = \frac{1}{n^2}\sum_{i=1}^{n} E\left[\left(\hat{e}_i - e(h_i)\right)^2\right] = \frac{1}{n^2}\sum_{i=1}^{n} E\left[e(h_i)\left(1 - e(h_i)\right)\right].$$
So, all that we must show is:
$$\frac{1}{n^2}\sum_{i=1}^{n} E\left[e(h_i)\left(1 - e(h_i)\right)\right] \;\le\; E\left[\frac{e_{prog}\left(1 - e_{prog}\right)}{n}\right].$$
Writing $e_i = e(h_i)$, it suffices to show that $\frac{1}{n}\sum_{i=1}^{n} e_i(1 - e_i) \le e_{prog}(1 - e_{prog})$ for every realization, which is equivalent to $e_{prog}^2 \le \frac{1}{n}\sum_{i=1}^{n} e_i^2$ and follows from Jensen's inequality and the convexity of $x^2$ on the interval $[0, 1]$. □
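As a sanity check of the inequality just proved, the following Python sketch estimates both sides by Monte Carlo. The setup is an illustrative assumption, not anything from the text: labels are i.i.d. Bernoulli(q) with no features, and the deterministic learner simply predicts the most recently seen label, so that the hypotheses' true error rates vary from round to round; the parameters n, q, and trials are arbitrary. An unstable learner is chosen deliberately, because when all of the $e(h_i)$ coincide the two sides are equal, and the gap is visible only when the true error rates differ across rounds.

    import random
    import statistics

    def one_trial(n, q, rng):
        # Illustrative toy setup: labels are i.i.d. Bernoulli(q) with no features.
        # The deterministic "learner" h_i predicts the most recent label seen
        # (0 before any example is seen), so the true error of h_i is q when it
        # predicts 0 and 1 - q when it predicts 1.
        labels = [1 if rng.random() < q else 0 for _ in range(n)]
        preds = [0] + labels[: n - 1]        # prediction of h_i, for i = 1..n
        errs_hat = [float(p != y) for p, y in zip(preds, labels)]
        errs_true = [(1.0 - q) if p == 1 else q for p in preds]
        e_prog_hat = sum(errs_hat) / n       # progressive validation estimate
        e_prog = sum(errs_true) / n          # quantity it estimates
        # Right hand side: test on n fresh examples, classifying each with an
        # independently drawn member of h_1, ..., h_n.
        wrong = sum(
            rng.choice(preds) != (1 if rng.random() < q else 0) for _ in range(n)
        )
        e_test_hat = wrong / n
        return (e_prog_hat - e_prog) ** 2, (e_test_hat - e_prog) ** 2

    if __name__ == "__main__":
        rng = random.Random(1)
        n, q, trials = 20, 0.25, 20000
        samples = [one_trial(n, q, rng) for _ in range(trials)]
        print("LHS estimate:", statistics.mean(s[0] for s in samples))
        print("RHS estimate:", statistics.mean(s[1] for s in samples))

With these settings the left-hand estimate should come out a little below the right-hand one, in line with the bound.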