0$, we have: \[
\textrm{Bin}\left(m,k,p\right)=\Pr_{X^{m}\sim p^{m}}\left(\sum_{i=1}^{m}X_{i}\leq k\right)=\Pr_{X^{m}\sim p^{m}}\left(e^{-m\lambda\frac{1}{m}\sum_{i=1}^{m}X_{i}}\geq e^{-m\lambda\frac{k}{m}}\right).\]
Using Markov's inequality ($X\geq0$, $EX=\mu$, $\Rightarrow\Pr(X\geq\delta)\leq\frac{\mu}{\delta}$):
\[
\leq\frac{E_{X^{m}\sim p^{m}}e^{-\lambda\sum_{i=1}^{m}X_{i}}}{e^{-\lambda k}}.\]
Using independence, we get:\[
=e^{\lambda k}\left(pe^{-\lambda}+(1-p)\right)^{m}\]
and rewriting, we get:\[
=e^{mf(\lambda)}\]
where $f(\lambda)=\lambda\frac{k}{m}+\ln\left(pe^{-\lambda}+1-p\right)$.
$\lambda$ is a free parameter which can be optimized to find the
tightest possible bound. To find the optimal value, find $\lambda^{*}$
so that $f'(\lambda^{*})=0$. \[
0=f'(\lambda^{*})=\frac{k}{m}-\frac{pe^{-\lambda^{*}}}{pe^{-\lambda^{*}}+1-p}\]
\[
\Rightarrow\frac{\frac{k}{m}}{p}\left(pe^{-\lambda^{*}}+1-p\right)=e^{-\lambda^{*}}\]
\[
\Rightarrow\frac{\frac{k}{m}}{p}(1-p)=\left(1-\frac{k}{m}\right)e^{-\lambda^{*}}\]
\[
\Rightarrow e^{\lambda^{*}}=\frac{p\left(1-\frac{k}{m}\right)}{\frac{k}{m}(1-p)}\]
which is valid for $p>\frac{k}{m}$. Using this, we get:\[
f(\lambda^{*})=\frac{k}{m}\ln\frac{p\left(1-\frac{k}{m}\right)}{\frac{k}{m}(1-p)}+\ln\left(\frac{1-p}{1-\frac{k}{m}}\right)\]
\[
=\frac{k}{m}\ln\frac{p}{\frac{k}{m}}+\left(1-\frac{k}{m}\right)\ln\left(\frac{1-p}{1-\frac{k}{m}}\right)=-\textrm{KL}\left(\frac{k}{m}||p\right).\]
\end{proof}
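The relative entropy Chernoff bound is easy to check numerically. The following sketch (with arbitrary illustrative values of $m$, $k$, and $p$, chosen so that $p>\frac{k}{m}$) compares the exact binomial tail to $e^{-m\textrm{KL}\left(\frac{k}{m}||p\right)}$:

```python
import math

def kl(q, p):
    """Bernoulli KL divergence KL(q||p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def bin_tail(m, k, p):
    """Bin(m, k, p): probability of at most k heads in m flips of bias p."""
    return sum(math.comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k + 1))

# illustrative values with p > k/m, where the bound applies
m, k, p = 100, 20, 0.4
exact = bin_tail(m, k, p)
chernoff = math.exp(-m * kl(k / m, p))
```

Here the exact tail probability never exceeds the Chernoff approximation, which is what the lemma guarantees.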
Using the Chernoff bound, we can loosen the test set bound to achieve
a more analytic form.
\begin{cor}
(Agnostic Test Set Bound) For all $D$, for all classifiers $c$,
for all $\delta\in(0,1]$\[
\Pr_{S\sim D^{m}}\left(\textrm{KL}\left(\frac{\hat{c}_{S}}{m}||c_{D}\right)\leq\frac{\ln\frac{1}{\delta}}{m}\right)\geq1-\delta.\]
\end{cor}
\begin{proof}
Loosening the test set bound (theorem \ref{th-hb}) with the relative
entropy Chernoff bound (lemma \ref{lem:Relative-Entropy-Chernoff})
gives the result.
\end{proof}
The first lemma needed for the PAC-Bayes theorem shows that, with
high probability over the draw of the training set, a certain
expectation over the prior is not too large.
\begin{lem}
\label{lem:exp-approx}For all distributions $D$, for all priors $P(c)$,
and for all $\delta\in(0,1]$:\[
\Pr_{S\sim D^{m}}\left(E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}\leq\frac{m+1}{\delta}\right)\geq1-\delta.\]
\end{lem}
\begin{proof}
First note that the expectation over $S$ is small. Since $S$ and $S'$
are drawn from the same distribution,\[
E_{S\sim D^{m}}E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}=E_{c\sim P}\sum_{k=0}^{m}\Pr_{S\sim D^{m}}\left(\hat{c}_{S}=k\right)\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S'}=k\right)}\leq m+1\]
where the sum ranges over the (at most $m+1$) values of $k$ with positive
probability, and each such term equals $1$. Using Markov's inequality
($X\geq0$, $EX\leq\mu$, $\Rightarrow\Pr\left(X>\frac{\mu}{\delta}\right)<\delta$),
we get:
\[
\forall P\,\,\Pr_{S\sim D^{m}}\left(E_{c\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}>\frac{m+1}{\delta}\right)<\delta.\]
\end{proof}
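Bounds stated in this KL form are applied by numerically inverting the Bernoulli KL divergence: given an empirical error rate $q$ and a budget $\epsilon$, find the largest $p$ with $\textrm{KL}(q||p)\leq\epsilon$. A minimal sketch (the error count, sample size, and $\delta$ below are hypothetical):

```python
import math

def kl(q, p):
    """Bernoulli KL divergence KL(q||p), with the 0 ln 0 = 0 convention."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)

def kl_upper_inverse(q, eps):
    """Largest p in [q, 1) with KL(q||p) <= eps, found by binary search
    (KL(q||p) is increasing in p for p >= q)."""
    lo, hi = q, 1.0 - 1e-12
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if kl(q, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

# hypothetical: 10 test errors out of m = 1000 examples, delta = 0.05
m, k, delta = 1000, 10, 0.05
bound = kl_upper_inverse(k / m, math.log(1.0 / delta) / m)
```

The same inversion reappears below for the PAC-Bayes margin bound.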
The next lemma shows that a certain expectation is bounded by the
Kullback-Leibler distance between two coin flips, just as for the
relative entropy Chernoff bound (Lemma \ref{lem:Relative-Entropy-Chernoff}).
\begin{lem}
\label{lemma: KL-approx}For all example sequences $S$ and all distributions $Q(c)$:
\[
\frac{E_{c\sim Q}\ln\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}}{m}\geq\textrm{KL}(\hat{Q}_{S}||Q_{D}).\]
\end{lem}
\begin{proof}
\[
\frac{E_{c\sim Q}\ln\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S}=\hat{c}_{S'}\right)}}{m}=\frac{1}{m}E_{c\sim Q}\ln\frac{1}{\left(\begin{array}{c}
m\\
\hat{c}_{S}\end{array}\right)c_{D}^{\hat{c}_{S}}(1-c_{D})^{m-\hat{c}_{S}}}\]
\[
\geq\frac{1}{m}E_{c\sim Q}\ln\frac{1}{\sum_{k=0}^{\hat{c}_{S}}\left(\begin{array}{c}
m\\
k\end{array}\right)c_{D}^{k}(1-c_{D})^{m-k}}\geq E_{c\sim Q}\textrm{KL}\left(\frac{\hat{c}_{S}}{m}||c_{D}\right)\]
where the last inequality follows from the relative entropy Chernoff
bound. Since $\textrm{KL}(q||p)$ is jointly convex in $(q,p)$, Jensen's
inequality ($f(x,y)$ convex $\Rightarrow Ef(x,y)\geq f(Ex,Ey)$) gives us:\[
\geq\textrm{KL}\left(E_{c\sim Q}\frac{\hat{c}_{S}}{m}||E_{c\sim Q}c_{D}\right)=\textrm{KL}(\hat{Q}_{S}||Q_{D})\]
which completes the proof.
\end{proof}
With these two lemmas, the PAC-Bayes theorem is easy to prove.
\begin{proof}
(Of the PAC-Bayes theorem) Fix a training set $S$. Let \[
P_{G}(c)=\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S'}=\hat{c}_{S}\right)E_{d\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{d}_{S}=\hat{d}_{S'}\right)}}P(c).\]
$P_{G}(c)$ is a normalized distribution because it has the form $\frac{a_{c}}{Ea_{c}}P(c)$
where $P(c)$ is a distribution.\[
\Rightarrow0\leq\textrm{KL}(Q||P_{G})=E_{c\sim Q}\ln\left[\frac{Q(c)}{P(c)}\Pr_{S'\sim D^{m}}\left(\hat{c}_{S'}=\hat{c}_{S}\right)E_{d\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{d}_{S}=\hat{d}_{S'}\right)}\right]\]
\[
=\textrm{KL}(Q||P)-E_{c\sim Q}\ln\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S'}=\hat{c}_{S}\right)}+\ln E_{d\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{d}_{S}=\hat{d}_{S'}\right)}\]
\[
\Rightarrow E_{c\sim Q}\ln\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{c}_{S'}=\hat{c}_{S}\right)}\leq\textrm{KL}(Q||P)+\ln E_{d\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{d}_{S}=\hat{d}_{S'}\right)}.\]
Applying lemma \ref{lemma: KL-approx} to the left-hand term, we get:
\[
m\textrm{KL}(\hat{Q}_{S}||Q_{D})\leq\textrm{KL}(Q||P)+\ln E_{d\sim P}\frac{1}{\Pr_{S'\sim D^{m}}\left(\hat{d}_{S}=\hat{d}_{S'}\right)}.\]
This holds for all $S$. Applying Lemma \ref{lem:exp-approx} which
randomizes over $S$, we get the theorem.
\end{proof}
\subsection{The PAC-Bayes Bound is sometimes Tight}
\label{sub:How-Tight-is}
Since the PAC-Bayes bound is (almost) a generalization of the Occam's
Razor bound, the tightness result for Occam's Razor also applies to
PAC-Bayes bounds.
\subsection{Application of the PAC-Bayes Bound}
\label{sub:Application-of-the}
Applying the PAC-Bayes bound requires specialization
\citep{PB-margin}. Here, we specialize to classifiers of the form:
\[
c(x)=\textrm{sign}\left(\vec{w}\cdot\vec{x}\right).\]
Note that via the kernel trick, Support Vector Machines also have
this form.
The specialization is naturally expressed in terms of a few derived
quantities:
\begin{enumerate}
\item The tail probability of a Gaussian. Let $\bar{F}(x)=\int_{x}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-t^{2}/2}dt$.
Here we use $\bar{F}$ rather than $F$ to denote the fact that we
integrate from $x$ to $\infty$ rather than $-\infty$ to $x$.
\item A {}``posterior'' distribution $Q(\vec{w},\mu)$ which is $N(\mu,1)$
for some $\mu>0$ in the direction of $\vec{w}$ and $N(0,1)$ in
all perpendicular directions.
\item The normalized margin of the examples\[
\gamma(\vec{x},y)=\frac{y\vec{w}\cdot\vec{x}}{||\vec{w}||||\vec{x}||}.\]
\item A stochastic error rate, $\hat{Q}(\vec{w},\mu)_{S}=E_{\vec{x},y\sim S}\bar{F}\left(\mu\gamma(\vec{x},y)\right).$
\end{enumerate}
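The derived quantities above are all directly computable. As a sketch, the stochastic error rate can be evaluated using the identity $\bar{F}(x)=\frac{1}{2}\textrm{erfc}(x/\sqrt{2})$; the sample $S$, weight vector, and $\mu$ below are hypothetical:

```python
import math

def F_bar(x):
    """Gaussian tail probability: standard normal mass above x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def margin(w, x, y):
    """Normalized margin gamma(x, y) = y (w . x) / (||w|| ||x||)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    return y * dot / (norm_w * norm_x)

def stochastic_error(S, w, mu):
    """Q_hat(w, mu)_S: the margin-softened empirical error rate."""
    return sum(F_bar(mu * margin(w, x, y)) for x, y in S) / len(S)

# hypothetical sample of (x, y) pairs in R^2 and a weight vector
S = [((1.0, 0.2), 1), ((0.9, -0.1), 1), ((-1.0, 0.3), -1), ((0.2, 1.0), -1)]
err = stochastic_error(S, (1.0, 0.0), mu=3.0)
```

Here the one misclassified example contributes a soft error near $1$ while the three correct large-margin examples contribute almost nothing, illustrating the softening effect described above.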
This last quantity in particular is very important to understand.
Consider the case as $\mu$ approaches $\infty$. When the margin
is negative (indicating an incorrect classification), $\bar{F}\left(\mu\gamma(\vec{x},y)\right)$
approaches $1$. When the margin is positive $\bar{F}\left(\mu\gamma(\vec{x},y)\right)$
approaches $0$. Thus, $\hat{Q}(\vec{w},\mu)_{S}$ is a softened form
of the empirical error $\hat{c}_{S}$ which takes into account the
margin.
\begin{cor}
(PAC-Bayes Margin Bound) For all distributions $D$, for all $\delta\in(0,1]$,
we have: \[
\Pr_{S\sim D^{m}}\left(\forall\vec{w},\mu:\,\,\textrm{KL}\left(\hat{Q}(\vec{w},\mu)_{S}||Q(\vec{w},\mu)_{D}\right)\leq\frac{\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}}{m}\right)\geq1-\delta.\]
\end{cor}
\begin{proof}
The proof is very simple. We just choose the prior $P=N(0,1)^{n}$
and work out the implications.
Since the Gaussian distribution is the same in every direction, we
can reorient the coordinate system of the prior to have one dimension
parallel to $\vec{w}$. Since the draws in the parallel and perpendicular
directions are independent, we have:\[
\textrm{KL}(Q||P)=\textrm{KL}(Q_{\perp}||P_{\perp})+\textrm{KL}(N(\mu,1)||N(0,1))\]
\[
=0+\frac{\mu^{2}}{2}\]
as required: the perpendicular marginals of $Q$ and $P$ are identical
(so the first term is $0$), and $\textrm{KL}(N(\mu,1)||N(0,1))=E_{x\sim N(\mu,1)}\left[\frac{x^{2}}{2}-\frac{(x-\mu)^{2}}{2}\right]=E_{x\sim N(\mu,1)}\left[\mu x-\frac{\mu^{2}}{2}\right]=\frac{\mu^{2}}{2}$.
All that remains is calculating the stochastic error rate $\hat{Q}(\vec{w},\mu)_{S}$.
Fix a particular example $(\vec{x},y)$. This example has a natural
decomposition $\vec{x}=\vec{x}_{||}+\vec{x}_{\perp}$ into a component
$\vec{x}_{||}$ parallel to the weight vector $\vec{w}$ and a component
$\vec{x}_{\perp}$ perpendicular to the weight vector.
To classify, we draw a weight vector $\vec{w}^{'}$ from $Q(\vec{w},\mu)$.
This $\vec{w}^{'}$ consists of three components, $\vec{w}^{'}=\vec{w}_{||}^{'}+\vec{w}_{\perp}^{'}+\vec{w}_{\perp\perp}^{'}$.
Here $\vec{w}_{||}^{'}\sim N(\mu,1)$ is parallel to the original
weight vector, $\vec{w}_{\perp}^{'}\sim N(0,1)$ which is parallel
to $\vec{x}_{\perp}$ and $\vec{w}_{\perp\perp}^{'}$ is perpendicular
to both $\vec{w}$ and $\vec{x}$. We have:\[
\hat{Q}(\vec{w},\mu)_{S}=E_{\vec{x},y\sim S,\vec{w}^{'}\sim Q(\vec{w},\mu)}I\left(y\neq\textrm{sign}\left(\vec{w}^{'}\cdot\vec{x}\right)\right)\]
\[
=E_{\vec{x},y\sim S,\vec{w}^{'}\sim Q(\vec{w},\mu)}I\left(y\vec{w}^{'}\cdot\vec{x}\leq0\right).\]
If we let $w_{||}^{'}$ and $w_{\perp}^{'}$ denote the (signed) scalar
components of $\vec{w}_{||}^{'}$ and $\vec{w}_{\perp}^{'}$ along their
directions, let $x_{||}=||\vec{x}_{||}||$ and $x_{\perp}=||\vec{x}_{\perp}||$,
and assume (without loss of generality) that $y=1$, we get: \[
=E_{\vec{x},y\sim S,w_{||}^{'}\sim N(\mu,1),w_{\perp}^{'}\sim N(0,1)}I\left(y(w_{||}^{'}x_{||}+w_{\perp}^{'}x_{\perp})\leq0\right)\]
\[
=E_{\vec{x},y\sim S}E_{w_{||}^{'}\sim N(\mu,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y(w_{||}^{'}x_{||}+w_{\perp}^{'}x_{\perp})\leq0\right)\]
\[
=E_{\vec{x},y\sim S}E_{z^{'}\sim N(0,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y\mu\leq-yz^{'}-yw_{\perp}^{'}\frac{x_{\perp}}{x_{||}}\right).\]
Using the symmetry of the Gaussian, this is:\[
=E_{\vec{x},y\sim S}E_{z^{'}\sim N(0,1)}E_{w_{\perp}^{'}\sim N(0,1)}I\left(y\mu\leq yz^{'}+yw_{\perp}^{'}\frac{x_{\perp}}{x_{||}}\right)\]
Using the fact that the sum of two Gaussians is a Gaussian:\[
=E_{\vec{x},y\sim S}E_{v\sim N\left(0,1+\frac{x_{\perp}^{2}}{x_{||}^{2}}\right)}I\left(y\mu\leq yv\right)\]
\[
=E_{\vec{x},y\sim S}E_{v\sim N\left(0,\frac{1}{\gamma(\vec{x},y)^{2}}\right)}I\left(y\mu\leq yv\right)\]
\[
=E_{\vec{x},y\sim S}\bar{F}\left(\mu\gamma(\vec{x},y)\right)\]
finishing the proof.
\end{proof}
Using the corollary, an upper bound $\bar{Q}(\vec{w},\mu)_{D}$ on the
true error rate of the stochastic classifier satisfies the equation: \[
\textrm{KL}\left(\hat{Q}(\vec{w},\mu)_{S}||\bar{Q}(\vec{w},\mu)_{D}\right)=\frac{\frac{\mu^{2}}{2}+\ln\frac{m+1}{\delta}}{m}.\]
This is an implicit equation for $\bar{Q}$ which can be easily solved
numerically.
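The numerical solution is the same binary search used for the other KL-form bounds. A sketch, with hypothetical values of $\hat{Q}$, $m$, $\mu$, and $\delta$ (the function name is ours, not standard):

```python
import math

def kl(q, p):
    """Bernoulli KL divergence KL(q||p), with the 0 ln 0 = 0 convention."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)

def pac_bayes_true_error(q_hat, m, mu, delta):
    """Largest Q_bar with KL(q_hat||Q_bar) <= (mu^2/2 + ln((m+1)/delta))/m,
    found by binary search over the second argument of the KL."""
    eps = (mu ** 2 / 2.0 + math.log((m + 1) / delta)) / m
    lo, hi = q_hat, 1.0 - 1e-12
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if kl(q_hat, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

# hypothetical: stochastic train error 0.1 on m = 5000 examples
bound = pac_bayes_true_error(q_hat=0.1, m=5000, mu=10.0, delta=0.01)
```

A scan over candidate $\mu$ values can then pick the minimizing bound, which is legitimate because the bound holds for all $\mu$ simultaneously.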
The bound is stated in terms of dot products here, so naturally it
is possible to kernelize the result using methods from \citep{kernelize}.
In kernelized form, the bound applies to classifiers (as output by
SVM learning algorithms) of the form: \begin{equation}
c(x)=\textrm{sign}\left(\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)\right).\label{eq:class-kernel}\end{equation}
Since, by assumption, $k$ is a kernel, we know that $k(x_{i},x)=\vec{\Phi}(x_{i})\cdot\vec{\Phi}(x)$
where $\vec{\Phi}(x)$ is some projection into another space. In kernelized
form, we get $\vec{w}\cdot\vec{x}=\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)$,
$\vec{x}\cdot\vec{x}=k(x,x)$, $\vec{w}\cdot\vec{w}=\sum_{i,j}\alpha_{i}\alpha_{j}k(x_{i},x_{j})$,
defining all of the necessary quantities to calculate the normalized
margin, \[
\gamma(\vec{x},y)=\frac{y\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)}{\sqrt{k(x,x)\sum_{i,j=1}^{m}\alpha_{i}\alpha_{j}k(x_{i},x_{j})}}.\]
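As a sketch, the normalized margin can be computed from kernel evaluations alone; the kernel choice, $\alpha$ values, and data below are hypothetical:

```python
import math

def gaussian_kernel(x1, x2, bandwidth=1.0):
    """Gaussian kernel k(x1, x2) = exp(-||x1 - x2||^2 / (2 bandwidth^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_margin(alphas, xs, x, y, k):
    """Normalized margin gamma(x, y) from kernel evaluations only."""
    wx = sum(a * k(xi, x) for a, xi in zip(alphas, xs))   # w . Phi(x)
    ww = sum(ai * aj * k(xi, xj)                          # w . w
             for ai, xi in zip(alphas, xs)
             for aj, xj in zip(alphas, xs))
    return y * wx / math.sqrt(k(x, x) * ww)               # Phi(x) . Phi(x) = k(x, x)

# hypothetical support vectors and alpha values
xs = [(0.0, 0.0), (1.0, 1.0)]
alphas = [1.0, -1.0]
g = kernel_margin(alphas, xs, (0.1, 0.0), 1, gaussian_kernel)
```

By the Cauchy-Schwarz inequality the normalized margin always lies in $[-1,1]$, which is a useful sanity check on any implementation.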
One element remains, which is the value of $\mu$. Unfortunately the
bound can be nonmonotonic in the value of $\mu$, but it turns out
that for classifiers learned by support vector machines on reasonable
datasets, there is only one value of $\mu$ which is (locally, and
thus globally) minimal. A binary search over some reasonable range
of $\mu$ (say from $1$ to $100$) can find the minimum quickly, given
the precomputation of the margins. It is worth noting again here that
we are not {}``cheating''---the bound holds for all values of $\mu$
simultaneously.
The computational time of the bound calculation is dominated by the
calculation of the margins which is $O\left(m^{2}\right)$ where $m$
is the number of support vectors with a nonzero associated $\alpha$.
This computational time is typically dominated by the time of the
SVM learning algorithm.
\subsubsection{Results}
Application of this bound to support vector machines is of significant
importance because SVMs are reasonably effective and adaptable classifiers
in common and widespread use. An SVM learns a kernelized classifier
as per equation (\ref{eq:class-kernel})%
\footnote{Some SVM learning algorithms actually learn a classifier of the form:
$c(x)=\textrm{sign}\left(b+\sum_{i=1}^{m}\alpha_{i}k(x_{i},x)\right)$.
We don't handle this form here.%
}.
We apply the support vector machine to 8 UCI database problems chosen
to fit the criteria {}``two classes'' and {}``real valued input
features''. The problems vary in size over an order of magnitude
from $145$ to $1428$ examples. In Figure \ref{fig:test} we use
a 70/30 train/test split of the data.
In all experiments, we use SVMlight \citep{svmlight} with a Gaussian
kernel and the default bandwidth. Results for other reasonable choices
of the {}``C'', bandwidth%
\footnote{Note that the bandwidth of a Gaussian kernel used by an SVM is not
directly related to the optimized value of $\mu$ we find.%
}, and kernel appear to be qualitatively similar (although of course
they differ quantitatively).
%
\begin{figure}
\includegraphics[%
width=0.60\columnwidth,
angle=270]{test_and_margin.ps}
\caption{\label{fig:test} This figure shows the results of applying SVMlight
to 8 datasets with a Gaussian kernel and a 70/30 train/test split.
The observed test error rate is graphed as an X. On the test set,
we calculate a Binomial confidence interval (probability of bound
failure = $0.01$) which upper bounds the true error rate. On the
training set we calculate the PAC-Bayes margin bound for an optimized
choice of $\mu$. }
\end{figure}
It is important to note that the PAC-Bayes margin bound is \emph{not}
precisely a bound (or confidence interval) on the true error rate
of the learned classifier. Instead, it is a true error rate bound
on an associated stochastic classifier chosen so as to have a similar
test error rate. These bounds can be regarded as bounds for the original
classifier only under an additional assumption: that picking a classifier
according to the majority vote of this stochastic distribution does
not worsen the true error rate. This is not true in general, but may
be true in practice.
It is of course unfair to compare the train set bound with the test
set bound on a 70/30 train/test split because a very tight train set
bound would imply that it is unnecessary to even have a test set.
In Figure \ref{fig:full-train} we compare the true error bounds on
all of the data to the true error bounds generated from the 70/30
train/test split.
%
\begin{figure}
\includegraphics[%
width=0.60\columnwidth,
angle=270]{test_vs_margin.ps}
\caption{\label{fig:full-train} In addition to comparing with everything
in Figure \ref{fig:test}, we graph the margin bound when \emph{all}
of the data is used for the train set. Note that it improves somewhat
on the margin bound calculated using the 70\% train set (7/10 margin
bound), but not enough to compete with the test set bound. }
\end{figure}
The results show that the PAC-Bayes margin bound is tight enough to
give useful information, but still not competitive with the test set
bounds. This is in strong contrast with a tradition of quantitatively
impractical margin bounds. There are several uses available for bounds
which provide some information but which are not fully tight.
\begin{enumerate}
\item They might be combined with a train/test bound \citep{TnT}.
\item The train set bound might easily become tighter for smaller sample
sizes. This was observed in \citep{TnT}.
\item The train set bound might still have the right {}``shape'' for choosing
an optimal parameter setting, such as {}``C'' in a Support Vector
Machine.
\end{enumerate}
\section{Sample Compression Bound}
\label{sec:Sparsity-Bound}
The sample compression bound \citep{LW}, \citep{FW} is like the PAC-Bayes
bound in that it applies to arbitrary precision continuous valued
classifiers. Unlike the PAC-Bayes bound, it applies meaningfully to
nonstochastic classifiers. Mainstream learning algorithms do not optimize
the sample compression metric, so the bound application is somewhat
rarer. Nonetheless, there do exist some reasonably competitive learning
algorithms for which the sample compression bound produces significant
results.
The section is organized as follows:
\begin{enumerate}
\item Subsection \ref{sub:The-Sample-Compression-Bound} states and proves
the sample compression bound.
\item Subsection \ref{sub:SC-tightness} shows that the sample compression
bound is nearly as tight as possible given the observations.
\item Subsection \ref{sub:SC-application} discusses results from the application
of the sample compression bound to support vector machines.
\end{enumerate}
\subsection{The Sample Compression Bound}
\label{sub:The-Sample-Compression-Bound}
The Sample Compression bound \citep{LW} \citep{FW} stated here differs
from older results by generalization and simplification but the bound
behavior is qualitatively identical.
Suppose we have a learning algorithm $A(S)$ whose training is {}``sparse''%
\footnote{This is satisfied, for example, by the Support Vector Machine algorithm
which only depends upon the set of support vectors.%
} in the sense that the output classifier depends upon only a
subset of the data: $A(S)=A(S')$ for some $S'\subseteq S$. The sample
compression bound depends on the number of errors, $\hat{c}_{S-S'}$, on
the subset $S-S'$. The motivation here is that the examples which
the learning algorithm does \emph{not} depend upon are {}``almost''
independent and so we can {}``almost'' get a test set bound. In
general, the bound becomes tighter as the dependent subset $S'$ becomes
smaller and as the error number on the nondependent subset $S-S'$
becomes smaller.
%
\begin{figure}
\includegraphics[%
scale=0.5]{sample_compression.eps}
\caption{\label{fig:SC-POL} The interactive proof of learning for the sample
compression bound. Note that the learning algorithm is arbitrary here,
similar to the test set bound.}
\end{figure}
Viewed as an interactive proof of learning (in Figure \ref{fig:SC-POL}),
the sample compression bound is unique amongst training set bounds
because it does not require \emph{any} initial commitment to a measure
over the classifiers%
\footnote{However, we can regard the commitment to a learning algorithm as an
implicit commitment to a measure over classifiers which is dependent
on the learning algorithm and the distribution generating the data.
Viewed from this perspective, the sample compression bound is the
Occam's Razor bound, except for the minor detail that the set of evaluating
examples varies.%
}.
\begin{thm}
(Sample Compression Bound) For all $\delta\in(0,1]$, $D$, $A$:\[
\Pr_{S\sim D^{m}}\left(\forall S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}\leq\overline{\textrm{Bin}}\left(|S-S'|,\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)\geq1-\delta.\]
\end{thm}
\begin{proof}
Suppose we knew in advance that the learning algorithm will not depend
upon some subset of the examples. Then, the {}``undependent'' subset
acts like a test set and gives us a test set bound:\[
\forall S'\subseteq S,\,\, c=A(S'):\Pr_{S\sim D^{m}}\left(c_{D}\leq\overline{\textrm{Bin}}\left(|S-S'|,\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)\geq1-\frac{\delta}{m{{m \choose |S-S'|}}}.\]
(Note that, technically, it is possible to refer to $S'$ unambiguously
before randomizing over $S$ by specifying the indexes of $S$ contained
in $S'$.) Negating this, we get:
\[
\forall S'\subseteq S,\,\, c=A(S'):\Pr_{S\sim D^{m}}\left(c_{D}>\overline{\textrm{Bin}}\left(|S-S'|,\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)<\frac{\delta}{m{{m \choose |S-S'|}}}\]
and using the union bound ($\Pr(A\textrm{ or }B)\leq\Pr(A)+\Pr(B)$)
over each possible subset, $S'$, we get:\[
\Pr_{S\sim D^{m}}\left(\exists S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}>\overline{\textrm{Bin}}\left(|S-S'|,\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)<\delta.\]
Negating this again gives us the proof.
\end{proof}
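A sketch of the bound computation: $\overline{\textrm{Bin}}$ is the inverted binomial tail, computed here by binary search, and the union-bound adjustment divides $\delta$ by $m{m \choose |S-S'|}$. The sizes and error count below are hypothetical:

```python
import math

def bin_tail(n, k, p):
    """Probability of at most k errors among n independent draws of rate p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def bin_bar(n, k, delta):
    """Inverted binomial tail: largest p with bin_tail(n, k, p) >= delta."""
    lo, hi = k / n, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if bin_tail(n, k, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo

def sample_compression_bound(m, n_unused, errors, delta):
    """True error bound for a classifier depending on only m - n_unused
    examples, with the union-bound adjustment delta / (m * C(m, n_unused))."""
    adjusted = delta / (m * math.comb(m, n_unused))
    return bin_bar(n_unused, errors, adjusted)

# hypothetical: 200 examples, 150 unused by the learner, 5 errors on them
b = sample_compression_bound(m=200, n_unused=150, errors=5, delta=0.01)
```

Note how severe the ${m \choose |S-S'|}$ penalty is: even with few errors on the unused examples, the adjusted confidence is astronomically small, which is consistent with the weak empirical results reported below.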
\subsection{The Sample Compression Bound is Sometimes Tight}
\label{sub:SC-tightness}
We can construct a learning algorithm/learning problem pair such that
the Sample compression bound is provably near optimal, as a function
of its observables.
\begin{thm}
(Sample Compression Tightness) For all $\delta\in(0,1]$, $m$, $k$,
there exists a distribution $D$ and learning algorithm $A$ s.t.\[
\Pr_{S\sim D^{m}}\left(\exists S'\subseteq S\,\,\,\textrm{with }c=A(S'):\,\,\, c_{D}>\overline{\textrm{Bin}}\left(|S-S'|,\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)\right)>\delta-\delta^{2}.\]
Furthermore, if $S^{*}$ minimizes $\overline{\textrm{Bin}}\left(|S-S'|,\hat{c}_{S-S'},\frac{\delta}{m{{m \choose |S-S'|}}}\right)$,
then \[
\Pr_{S\sim D^{m}}\left(c^{*}=A(S^{*}):\,\,\, c_{D}^{*}>\overline{\textrm{Bin}}\left(|S-S^*|,\hat{c}_{S-S^{*}}^{*},\frac{\delta}{m{{m \choose |S-S^{*}|}}}\right)\right)>\delta-\delta^{2}.\]
\end{thm}
\begin{proof}
The proof is constructive and similar to the Occam's Razor tightness
result. In particular, we show how to construct a learning algorithm
which outputs classifiers that err independently depending on the
subset $S'$ used.
Consider an input space $X=\{0,1\}^{2^{m}}$. Each variable $x_{S'}$
in the input space can be thought of as indexing a unique subset
$S'\subseteq S$ of the examples. In the rest of the proof, we index
variables by the subset they correspond to.
Draws from the distribution $D$ can be made by first flipping an
unbiased coin to get $y=1$ with probability $0.5$ and $y=-1$ with
probability $0.5$. The distribution on $X$ consists of a set of
independent values after conditioning on $y$. Choose \[
\Pr(x_{S'}\neq y)=\overline{\textrm{Bin}}\left(m,k,\frac{\delta}{m{{m \choose |S-S'|}}}\right).\]
Now, the learning algorithm $A(S')$ is very simple---it just outputs
the classifier $c(x)=x_{S'}$. On the set $S-S'$, we have:\[
\forall S'\,\,\,\Pr_{S\sim D^{m}}\left(\hat{c}_{S-S'}\geq\frac{k}{m}\right)=1-\frac{\delta}{m{{m \choose |S-S'|}}}.\]
Using independence, we get:\[
\Pr_{S\sim D^{m}}\left(\forall S'\,\,\,\hat{c}_{S-S'}\geq\frac{k}{m}\right)=\prod_{S'}\left(1-\frac{\delta}{m{{m \choose |S-S'|}}}\right).\]
Negating, we get:\[
\Pr_{S\sim D^{m}}\left(\exists S'\,\,\,\hat{c}_{S-S'}<\frac{k}{m}\right)=1-\prod_{S'}\left(1-\frac{\delta}{m{{m \choose |S-S'|}}}\right)\]
and doing some algebra, we get the result.
\end{proof}
\subsection{Application of the Sample Compression Bound}
\label{sub:SC-application}
One obvious application of the Sample Compression bound is to support
vector machines, since the learned classifier is only dependent on
the set of support vectors. If $S'$ is the set of support vectors
then $S-S'$ is the set of nonsupport vectors. Unfortunately, it turns
out that this does not work so well, as observed in Figure \ref{fig:sc-svm}.
%
\begin{figure}
\includegraphics[%
width=0.60\columnwidth,
angle=270]{test_and_compress.ps}
\caption{\label{fig:sc-svm}The sample compression bound applied to the output
of a support vector machine with a Gaussian kernel. Here we use $\delta=0.01$}
\end{figure}
There are other less common learning algorithms for which the sample
compression bound works well. The Set Covering machine \citep{SCM}
has an associated bound which is a variant of the Sample Compression
Bound.
\section{Discussion}
Here, we discuss several aspects and implications of the presented
bounds.
\subsection{Learning algorithm design}
\emph{Every} train set bound implies a learning algorithm: choose
the classifier which minimizes the true error bound. This sounds like
a rich source of learning algorithms, but there are some severe caveats
to that statement.
\begin{enumerate}
\item It is important to note that the form of a train set bound does \emph{not}
imply that this minimization is a good idea. Choosing between two
classifiers based upon their true error bound implies a better worst-case
bound on the true error. It does not imply an improved true error.
In many situations, there is some other metric of comparison (such
as train error) which in fact creates better behavior.
\item Another strong caveat is that, historically, train set bounds have
simply not been tight enough on real datasets for a nonvacuous application.
This is changing with new results, but more progress is necessary.
\item Often the optimization problem is simply not very tractable. In addition
to sample complexity, learning algorithms must be concerned with run
time and space usage.
\end{enumerate}
\subsection{Philosophy}
Train set bounds teach us about ways in which verifiable learning
is possible, a subject which borders on philosophy. The train set
bound presented here essentially shows that a reasonable person will
be convinced of learning success when a short-description classifier
does well on train set data. The results here do \emph{not} imply
that this is the only way to convincingly learn. In fact, the (sometimes
large) looseness of the Occam's Razor bound suggests that other methods
for convincing learning processes exist. This observation is partially
shown by the other train set bounds which are presented.
\subsection{Conclusion}
This introduction to prediction theory covered two styles of bound:
the test set bound and the train set bound. There are two important
lessons here:
\begin{enumerate}
\item Test set bounds provide a better way to report error rates and confidence
intervals on future error rates than some current methods.
\item Train set bounds can provide useful information.
\end{enumerate}
It is important to note that the train set bound and test set bound
techniques are not mutually exclusive. It is possible to use both
simultaneously \citep{TnT}, and doing so is often desirable. Test
set bounds are improved by the {}``free'' information about the
training error and train set bounds become applicable, even when not
always tight.
\acks{ Many people were critical to this tutorial. This includes Sam
Roweis who started this, anonymous reviewers who were patient and
capable, several coauthors on previous papers, Imre Kondor, Arindam
Banerjee, Eric Bax, Alina Beygelzimer, Varsha Dani, Tom Hayes, Timothy
Ross, Robert Schapire, and Belinda Thom. I would also like to thank
IBM which supported me during much of the preparation of this
tutorial. }
\appendix
\section{}
\label{sec-model}
For those interested in comparing models, uniform convergence
\citep{Vapnik} additionally requires the axiom of choice (to achieve
empirical risk minimization) and a hypothesis space of bounded
complexity. Typical theorems are of the form {}``after $m$ examples,
all training errors are near to true errors''.
The PAC learning model \citep{Valiant} requires a polynomial time complexity
learning algorithm and the assumption that the learning problem comes
from some class. Theorems are of the form {}``after $m$ examples
learning will be achieved''.
Both of these models can support stronger statements than the basic
prediction theory model presented here. Results from both of these
models can apply here after appropriate massaging.
The online learning model \citep{Online} makes \emph{no} assumptions.
Typical theorems have the form {}``This learning algorithm's performance
will be nearly as good as any one of a set of classifiers.'' The online
learning model has very general results and no%
\footnote{Note, there do exist online to batch conversions, but these conversions
always occur under an assumption of i.i.d. samples, essentially changing
the setting to the one described here.%
} ability to answer questions about future performance as we address
here.
The prediction theory model can most simply be understood as a slight
refinement of the information theory model.
\bibliography{langford05a}
\end{document}