Chapter 6
PAC-Bayes bounds

The work presented here is also published in [35].

PAC-Bayes bounds are a generalization of the Occam’s razor bound for algorithms which output a distribution over classifiers rather than just a single classifier. Since a point mass on a single classifier is a special case of such a distribution, this is a strict generalization. Most learning algorithms do not output a distribution over base classifiers; instead, they output either a single classifier or an average over base classifiers. Nonetheless, PAC-Bayes bounds are interesting for several reasons:

  1. PAC-Bayes bounds are much tighter (in practice) than most common VC-related [51] approaches on continuous classifier spaces. This can be shown by application to stochastic neural networks (see section 13) as well as other classifiers. It can also be seen by observation: when the PAC-Bayes bound is specialized to discrete hypothesis spaces, only O(ln m) sample complexity is lost.
  2. Due to the achievable tightness, the result motivates new learning algorithms which strongly limit the amount of overfitting they incur.
  3. The result found here will turn out to be useful for averaging hypotheses.

PAC-Bayes bounds were first introduced by McAllester [39].
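For orientation, the KL form of the PAC-Bayes bound (the form this chapter tightens; the exact term inside the logarithm varies across statements in the literature, so the version below should be read as a sketch) bounds the gap between the empirical error $\hat{e}_q$ and the true error $e_q$ of a classifier randomized according to $q(h)$, given a prior $p(h)$ chosen before seeing the $m$ examples: with probability at least $1 - \delta$,

$$\mathrm{KL}\!\left(\hat{e}_q \,\middle\|\, e_q\right) \;\le\; \frac{\mathrm{KL}(q \,\|\, p) + \ln\frac{m+1}{\delta}}{m},$$

where $\mathrm{KL}(\cdot\,\|\,\cdot)$ on the left is the relative entropy between two Bernoulli distributions. Note that the dependence on the hypothesis space enters only through $\mathrm{KL}(q\|p)$, which is what makes the bound applicable to continuous classifier spaces.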

There are three relatively independent observations in this chapter:

  1. A quantitative improvement of the PAC-Bayes bound by retrofitting it with the relative entropy Chernoff bound 3.2.1. This retrofit is not as trivial as might be expected, but it can be done. The result is the tightest known PAC-Bayes bound. In addition to the quantitative improvements, this tightening simplifies the proof and adds to our qualitative understanding of the bound.
  2. A method for (partially) derandomizing the PAC-Bayes stochastic hypothesis.
  3. A method for stochastic evaluation of the empirical error.

The first observation is the most important. Observation (3) is important for many practical applications because it safely avoids a (sometimes) very complicated evaluation problem. Observation (2) is of little theoretical interest, but it might interest those who feel reassured when every classifier randomized over has a low empirical error rate.
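As a rough illustration of how a KL-form bound of this kind is evaluated in practice (a numerical sketch, not the derivation from this chapter: the constant inside the logarithm and the function names here are illustrative choices), the relative entropy inequality is inverted by bisection to obtain an explicit upper bound on the true error:

```python
import math

def kl_bernoulli(q, p):
    """Relative entropy KL(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12  # clip away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse_upper(q_hat, c, tol=1e-9):
    """Largest p >= q_hat with KL(q_hat || p) <= c, found by bisection.

    KL(q_hat || p) is increasing in p on [q_hat, 1], so bisection applies.
    """
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(q_hat, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

def pac_bayes_bound(emp_err, kl_qp, m, delta):
    """Upper bound on the true error of the q-randomized classifier.

    emp_err: empirical error of the stochastic classifier on m examples
    kl_qp:   KL(q || p) between the posterior q(h) and the prior p(h)
    """
    c = (kl_qp + math.log((m + 1) / delta)) / m
    return kl_inverse_upper(emp_err, c)
```

For example, with an empirical error of 0.1, KL(q‖p) = 5 nats, m = 1000 examples, and δ = 0.05, the resulting bound is a true-error guarantee somewhat above 0.1, and it tightens toward the empirical error as m grows.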

Figure 6.0.1 shows what the PAC-Bayes bound looks like as an interactive proof of learning.


Figure 6.0.1: The PAC-Bayes bound can be viewed as a new style for a proof of learning. The learner must commit to a “Prior” as in the Occam’s Razor Bound 4.6.1 before seeing examples, but it does not commit to a single hypothesis. Instead, it commits to a distribution over hypotheses, q(h), and the bound applies to a randomization with respect to the distribution q(h).

 6.1.  PAC-Bayes Basics
 6.2.  A Tighter PAC-Bayes Bound
 6.3.  PAC-Bayes Approximations
   6.3.1.  Approximating the empirical error
   6.3.2.  Derandomizing the PAC-Bayes bound
 6.4.  Application of the PAC-Bayes bound