## Chapter 6

PAC-Bayes bounds

The work presented here is also published in [35].

PAC-Bayes bounds generalize the Occam’s razor bound to algorithms
that output a distribution over classifiers rather than a single classifier. A distribution
concentrated on a single classifier is a special case, which is why the result is a generalization.
Most learning algorithms do not output a distribution over base classifiers. Instead, they output
either a single classifier or an average over base classifiers. Nonetheless, PAC-Bayes bounds are
interesting for several reasons:

- PAC-Bayes bounds are much tighter (in practice) than most common
VC-related [51] approaches on continuous classifier spaces. This can be
shown by application to stochastic neural networks (see section 13) as well
as other classifiers. It can also be seen directly: when the PAC-Bayes
bound is specialized to discrete hypothesis spaces, only $O(\ln m)$
sample complexity is lost relative to the Occam’s razor bound.
- Because the bound can be made this tight, it motivates new learning
algorithms that strongly limit the amount of overfitting they
incur.
- The result found here will turn out to be useful for averaging hypotheses.
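
The $O(\ln m)$ remark in the first item can be made concrete with a short comparison. This is a sketch that assumes the relative entropy forms in which the two bounds are commonly stated; the notation ($P$ for the prior, $Q$ for the output distribution, $\hat{e}$ and $e$ for empirical and true error, $m$ for the number of examples, $\delta$ for the failure probability) is chosen here for illustration, and the exact constants in this chapter's theorems may differ.

$$
\begin{aligned}
\text{Occam's razor:}\quad & \mathrm{KL}\left(\hat{e}_c \,\|\, e_c\right) \le \frac{\ln\frac{1}{P(c)} + \ln\frac{1}{\delta}}{m} \quad \text{for all } c, \\
\text{PAC-Bayes:}\quad & \mathrm{KL}\left(\hat{e}_Q \,\|\, e_Q\right) \le \frac{\mathrm{KL}\left(Q \,\|\, P\right) + \ln\frac{m+1}{\delta}}{m} \quad \text{for all } Q, \\
\text{point mass } Q \text{ on } c:\quad & \mathrm{KL}\left(Q \,\|\, P\right) = \ln\frac{1}{P(c)} \;\Longrightarrow\; \mathrm{KL}\left(\hat{e}_c \,\|\, e_c\right) \le \frac{\ln\frac{1}{P(c)} + \ln\frac{1}{\delta} + \ln(m+1)}{m}.
\end{aligned}
$$

Under these assumed forms, the two right-hand sides differ only by the additive $\ln(m+1)$ term in the numerator.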

PAC-Bayes bounds were first introduced by McAllester [39].

There are three relatively independent observations in this chapter:

- A quantitative improvement of the PAC-Bayes bound, obtained by retrofitting
it with the relative entropy Chernoff bound (3.2.1). This retrofit is not as
trivial as might be expected, but it can be done. The result is the tightest
known PAC-Bayes bound. Beyond the quantitative improvement, the tightening
simplifies the proof and adds to our qualitative understanding of the
bound.
- A method for (partially) derandomizing the PAC-Bayes stochastic
hypothesis.
- A method for stochastic evaluation of the empirical error.

The first observation is the most important. The third observation matters for many
practical applications because it safely avoids a (sometimes) very complicated
evaluation problem. The second observation is of little theoretical interest, but it might
interest readers who feel reassured when every classifier in the randomization has a low
empirical error rate.
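
To give a feel for the evaluation problem behind the third observation, the following minimal Python sketch estimates the empirical error of a stochastic classifier by sampling instead of integrating over the distribution. It is an illustration under stated assumptions, not the construction developed in this chapter: the distribution is assumed to be an isotropic Gaussian over the weights of a linear classifier, and the names `linear_predict`, `sample_classifier`, and `monte_carlo_empirical_error` are hypothetical.

```python
import random


def linear_predict(weights, x):
    """Return the label in {-1, +1} that a linear classifier assigns to x."""
    score = sum(w * xi for w, xi in zip(weights, x))
    return 1 if score >= 0 else -1


def sample_classifier(mean, sigma, rng):
    """Draw one weight vector from Q = N(mean, sigma^2 I).

    The Gaussian form of Q is an assumption made for this illustration.
    """
    return [rng.gauss(mu, sigma) for mu in mean]


def monte_carlo_empirical_error(mean, sigma, examples, n_draws, rng):
    """Estimate the empirical error of the stochastic classifier by sampling.

    Each draw picks a fresh classifier from Q and a uniformly random
    training example, then records whether the classifier errs on it.
    """
    mistakes = 0
    for _ in range(n_draws):
        weights = sample_classifier(mean, sigma, rng)
        x, y = rng.choice(examples)
        if linear_predict(weights, x) != y:
            mistakes += 1
    return mistakes / n_draws


if __name__ == "__main__":
    # Toy, linearly separable training set (made up for this sketch).
    examples = [
        ((1.0, 2.0), 1),
        ((-1.0, -0.5), -1),
        ((2.0, 0.5), 1),
        ((-2.0, -1.0), -1),
    ]
    rng = random.Random(0)
    estimate = monte_carlo_empirical_error([1.0, 1.0], 0.5, examples, 2000, rng)
    print(f"Monte Carlo estimate of the empirical error: {estimate:.3f}")
```

The estimate is itself a random quantity, so a bound built on it has to account for the sampling error as well. The sketch shows only the sampling step, not that correction.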

Figure 6.0.1 shows what the PAC-Bayes bound looks like as an interactive proof of
learning.