\documentclass{article}
\usepackage{jmlr2e}
\usepackage{natbib}
%\usepackage[T1]{fontenc}
%\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
%\providecommand{\LyX}{L\kern-.1667em\lower.25em\hbox{Y}\kern-.125emX\@}
%\newcommand{\noun}[1]{\textsc{#1}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
%\theoremstyle{plain}
\newtheorem{thm}{Theorem}[section]
%\numberwithin{equation}{section} %% Comment out for sequentially-numbered
%\numberwithin{figure}{section} %% Comment out for sequentially-numbered
%\theoremstyle{plain}
\newtheorem{lem}[thm]{Lemma} %%Delete [thm] to re-start numbering
\newtheorem{con}{Conjecture}[section]
\newenvironment{proof_outline}{\par{\bf Proof Outline:}}{\qed \par}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
%\input{preamble}
%\input{psfig}
\textwidth = 6.5in
\textheight = 8.5in
\setlength\topmargin{-0.25in}
\setlength\oddsidemargin{-0.25in}
\setlength\evensidemargin{-0.25in}
\newcommand{\ignore}[1]{}
\newcommand{\mtt}[1]{\mbox{\tt #1}}
\newcommand{\calh}{{\mathcal H}}
\newcommand{\calx}{{\mathcal X}}
\newcommand{\tuple}[1]{{\mbox{$\langle#1\rangle$}}}
\newcommand{\hateps}{\hat{e}}
\newcommand{\hate}{\hat{e}}
\newcommand{\hats}{\hat{s}}
\newcommand{\hathats}{\hat{s}^+}
\newcommand{\tils}{\bar{s}}
\newcommand{\hatL}{\hat{L}}
\newcommand{\hatl}{\hat{\ell}}
\newcommand{\hatR}{\hat{R}}
\newcommand{\hathateps}{\hat{\hat{e}}}
\newcommand{\Cgamma}{{\cal C}_\gamma}
\newcommand{\bareps}{\overline{\epsilon}}
\newcommand{\hatP}{\hat{P}}
\newcommand{\hatZ}{\hat{Z}}
\newcommand{\myfloor}[1]{\lfloor#1\rfloor}
\newcommand{\myceil}[1]{\lceil#1\rceil}
\newcommand{\dceil}[1]{\lceil \lceil #1 \rceil \rceil}
\newcommand{\qed}{\mbox{$\;\;\;\Box$}}
\newcommand{\lub}{\mtt{lub}}
\newcommand{\glb}{\mtt{glb}}
%\newlength{\linelength}
%\setlength{\linelength}{0.25\textwidth}
\title{Computable Shell Decomposition Bounds}
\author{ {\bf
John Langford} \\
TTI-Chicago\\
jcl@cs.cmu.edu \\
\AND
{\bf David McAllester} \\
TTI-Chicago \\
dmac@autoreason.com \\
}
\begin{document}
\editor{Leslie Pack Kaelbling and David Cohn}
\maketitle
\begin{abstract}
Haussler, Kearns, Seung and Tishby introduced the notion of a shell
decomposition of the union bound as a means of understanding certain
empirical phenomena in learning curves such as phase transitions. Here
we use a variant of their ideas to derive an upper bound on the
generalization error of a hypothesis computable from its training
error and the histogram of training errors for the hypotheses in the
class. In most cases this new bound is significantly tighter than
traditional bounds computed from the training error and the
cardinality of the class. Our results can also be viewed as providing
a rigorous foundation for a model selection algorithm proposed by
Scheffer and Joachims.
\end{abstract}
\begin{keywords}
Sample Complexity, Classification, True Error Bounds, Shell bounds
\end{keywords}
\section{Introduction}
For an arbitrary finite hypothesis class we consider the hypothesis of
minimal training error. We give a new upper bound on the
generalization error of this hypothesis computable from the training
error of the hypothesis and the histogram of the training errors of
the other hypotheses in the class. This new bound is typically much
tighter than more traditional upper bounds computed from the training
error and cardinality of the class.
As a simple example, suppose we observe that every hypothesis in the
class has empirical error $1/2$ except for one, which has empirical
error $0$. Furthermore, suppose that the sample size is large enough
(relative to the size of the hypothesis class) that with high
confidence we have that, for all hypotheses in the class, the true
(generalization) error of a hypothesis is within $1/5$ of its training
error. This implies that, with high confidence, hypotheses with
training error near $1/2$ have true error in $[3/10,7/10]$.
Intuitively, we would expect the true error of the hypothesis with
minimum empirical error to be very near to $0$ rather than simply less
than $1/5$ because none of the hypotheses which produced an empirical
error of $1/2$ could have a true error close enough to $0$ that there
exists a significant probability of producing $0$ empirical error.
The bound presented here validates this intuition. We show that you
can ignore hypotheses with training error near $1/2$ in calculating an
``effective size'' of the class for hypotheses with training error
near $0$. This new effective class size allows us to calculate a
tighter bound on the difference between training error and true error
for hypotheses with training error near $0$. The new bound is proved
using a distribution-dependent application of the union bound similar
in spirit to the shell decomposition introduced by Haussler, Kearns,
Seung and Tishby \cite{HKST96}.
We actually give two upper bounds on generalization error --- an
uncomputable bound and a computable bound. The uncomputable bound is
a function of the unknown distribution of true error rates of the
hypotheses in the class. The computable bound is, essentially, the
uncomputable bound with the unknown distribution of true errors
replaced by the known histogram of training errors. Our main
contribution is that this replacement is sound, i.e., the computable
version remains, with high confidence, an upper bound on
generalization error.
When considering asymptotic properties of learning theory bounds it is
important to take limits in which the cardinality (or VC dimension) of
the hypothesis class is allowed to grow with the size of the sample.
In practice, more data typically justifies a larger hypothesis class.
For example, the size of a decision tree is generally proportional to the
amount of training data available. Here we analyze the asymptotic
properties of our bounds by considering an infinite sequence of
hypothesis classes $\calh_m$, one for each sample size $m$, such that
$\frac{\ln|\calh_m|}{m}$ approaches a limit larger than zero. This
kind of asymptotic analysis provides a clear account of the
improvement achieved by bounds that are functions of error rate
distributions rather than simply the size (or VC dimension) of the
class.
We give a lower bound on generalization error showing that the
uncomputable upper bound is asymptotically as tight as possible ---
any upper bound on generalization error given as a function of the
unknown distribution of true error rates must asymptotically be
greater than or equal to our uncomputable upper bound.
Our lower bound on generalization error also shows that
there is essentially no loss in working with an upper bound computed from the
true error distribution rather than expectations computed from this
distribution as used by Scheffer and Joachims \cite{Tobias}.
Asymptotically, the computable bound is simply the uncomputable bound
with the unknown distribution of true errors replaced with the
observed histogram of training errors. Unfortunately, we can show
that in limits where $\frac{\ln|\calh_m|}{m}$ converges to a value
greater than zero, the histogram of training errors need not converge
to the distribution of true errors --- the histogram of training
errors is a ``smeared out'' version of the distribution of true
errors. This smearing loosens the bound even in the large-sample
asymptotic limit. We give a precise asymptotic characterization of
this smearing effect for the case where distinct hypotheses have
independent training errors. In spite of the divergence between these
bounds, the computable bound is still significantly tighter than
classical bounds not involving error distributions.
The computable bound can be used for model selection. In the case of
model selection we can assume an infinite sequence of finite model
classes $\calh_0, \calh_1, ... $ where each $\calh_j$ is a finite
class with $\ln|\calh_j|$ growing linearly in $j$. To perform model
selection we find the hypothesis of minimal training error in each
class and use the computable bound to bound its generalization error.
We can then select, among these, the model with the smallest upper
bound on generalization error. Scheffer and Joachims propose
(without formal justification) replacing the distribution of true errors
with the histogram of training errors.
Under this replacement, the model selection algorithm based on our
computable upper bound is asymptotically identical to the algorithm
proposed by Scheffer and Joachims.
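The model selection procedure just described can be sketched numerically. The sketch below is not from the paper: for brevity it bounds each class's training-error minimizer with the simple class-size (Occam) bound rather than the computable shell bound, and all class sizes and error rates are hypothetical.

```python
import math

def kl(q, p):
    """Bernoulli KL-divergence D(q||p), with 0 ln(0/p) = 0 and q ln(q/0) = inf."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    d = 0.0
    if q > 0.0:
        d += q * math.log(q / p)
    if q < 1.0:
        d += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return d

def kl_upper_inverse(q, eps):
    """Largest p in [q,1] with D(q||p) <= eps; D(q||.) is increasing there,
    so bisection applies."""
    lo, hi = q, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if kl(q, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

def select_model(candidates, m, delta):
    """candidates[j] = (minimal training error in class H_j, ln |H_j|).
    Split delta evenly across the classes, bound each minimizer's true
    error, and return the index of the class with the smallest bound."""
    n = len(candidates)
    bounds = [kl_upper_inverse(e_hat, (ln_size + math.log(2 * n / delta)) / m)
              for (e_hat, ln_size) in candidates]
    return min(range(n), key=lambda j: bounds[j]), bounds

# Hypothetical nested classes: training error falls as ln|H_j| grows.
choice, bounds = select_model(
    [(0.20, 0.0), (0.08, 40.0), (0.05, 120.0), (0.049, 400.0)],
    m=1000, delta=0.05)
```

Here the second class wins: the largest classes fit slightly better but pay too large a complexity penalty, while the smallest class underfits.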
The shell decomposition is a distribution-dependent use of the union
bound. Distribution-dependent uses of the union bound have been
previously exploited in so-called self-bounding algorithms. Freund
\cite{Freund} defines, for a given learning algorithm \emph{and data
distribution}, a set \( S \) of hypotheses such that with high
probability over the sample, the algorithm always returns a hypothesis
in that set. Although \( S \) is defined in terms of the unknown data
distribution, Freund gives a way of computing a set \( S' \) from the
given algorithm and the sample such that, with high confidence, \( S'
\) contains \( S \) and hence the ``effective size'' of the hypothesis
class is bounded by \( |S'| \). Langford and Blum \cite{Microchoice}
give a more practical version of this algorithm. Given an algorithm
and data distribution they conceptually define a weighting over the
possible executions of the algorithm. Although the data distribution
is unknown, they give a way of computing a lower bound on the weight
of the particular execution of the algorithm generated by the sample
at hand. In this paper we consider distribution-dependent union
bounds defined independently of any particular learning algorithm.
The bounds given in this paper apply to finite concept classes. Of
course more sophisticated measures of the complexity of a concept
class, such as VC dimension or Rademacher complexity, are possible and
can sometimes result in tighter bounds.
However, insight into finite classes remains useful in at least two
ways. Finite class analysis is useful as a pedagogical tool, teaching
about directions in which to look for the removal of slack from these
more sophisticated bounds. Indeed, various localized Rademacher
complexity results \cite{LRad} and the ``peeling" technique
\cite{Peel} appear to (roughly) correspond to the orthogonal
combination of shell bounds and earlier Rademacher complexity results.
One advantage of the shell bounds is the KL-divergence form of the
bounds which smoothly interpolates between the linear bounds of the
realizable case and the quadratic bounds of the unrealizable case.
This realizable-unrealizable interpolation is orthogonal to the shell
principle that concepts with large empirical error are unlikely to be
confused with concepts with low error rate. The shell bound also
supports intuitions that are difficult to achieve in more complex
settings. For example, the simple shell bounds clearly exhibit phase
transitions in the learning bound, something which does not appear to
be well-elucidated for localized Rademacher bounds. In summary, the
simplicity of finite classes (and a shell bound analysis on a finite
class) provides a clarity that is difficult to achieve with more
complex structure-exploiting bounds.
Finite class analysis is also useful in a more practical sense. In
practice a finite VC dimension class usually has a finite
parameterization. Given that these real parameters are typically
represented as 32-bit floating-point numbers, the class becomes finite
and the log of the class size is linear in the number of parameters.
Since many of the more sophisticated infinite-class techniques are
loose by large multiplicative constants, a finite class analysis
applied to a VC class discretized to a small number of bits can
actually yield \emph{tighter} bounds as shown in figure
\ref{fig:compare}.
\begin{figure}
\includegraphics[angle=270]{vc_vs_occam.ps}
\caption{\label{fig:compare} A graph comparing the (infinite
hypothesis) VC bound to the finite hypothesis Occam's razor bound.
For all curves we use VC dimension $d=10$, bound failure probability,
$\delta = 0.1$, and $m=1000$ examples. For the VC bound calculation
(see \cite{AWM_tutorial} for details) the formula is $\mbox{true
error} \leq \mbox{train error} + \sqrt{(d \ln \frac{2m}{d} + \ln
\frac{4}{\delta})/m}$. For the Occam's Razor Bound (ORB) (see
\cite{JCL_tutorial} for details) calculation, we use a uniform
distribution over the $2^{kd}$ discrete classifiers which might be
representable when we discretize $d$ parameters to $k = 8,16,32$ bits
per dimension. The basic formula is: $ \mbox{KL}(\mbox{train
error}||\mbox{true error}) \leq (k d \ln 2 + \ln \frac{1}{\delta})/m$.
This graph is approximately the same for any similar ratio of $d/m$,
with smaller values favoring the Occam's Razor Bound. }
\end{figure}
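The two formulas in the caption can be evaluated directly. The following sketch (not part of the original paper) computes both bounds at a hypothetical training error of $0.1$, inverting the KL form of the Occam bound by bisection; with $d=10$, $m=1000$, $\delta=0.1$, and $k=8$ bits per parameter the discretized Occam bound is the tighter of the two.

```python
import math

def kl(q, p):
    """Bernoulli KL-divergence D(q||p), with 0 ln(0/p) = 0 and q ln(q/0) = inf."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    d = 0.0
    if q > 0.0:
        d += q * math.log(q / p)
    if q < 1.0:
        d += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return d

def kl_upper_inverse(q, eps):
    """Largest p >= q with D(q||p) <= eps, by bisection."""
    lo, hi = q, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if kl(q, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

d, m, delta, train_err = 10, 1000, 0.1, 0.1

# VC bound: true error <= train error + sqrt((d ln(2m/d) + ln(4/delta))/m)
vc_bound = train_err + math.sqrt(
    (d * math.log(2 * m / d) + math.log(4 / delta)) / m)

# Occam's razor bound with k bits per parameter:
# KL(train error || true error) <= (k d ln 2 + ln(1/delta))/m
k = 8
orb_bound = kl_upper_inverse(
    train_err, (k * d * math.log(2) + math.log(1 / delta)) / m)
```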
\section{Mathematical Preliminaries}
For an arbitrary measure on an arbitrary sample space we use the
notation\footnote{This can be read as ``for all but $\delta$ sets $S$,
the predicate $\Phi [S,\delta]$ holds'' or ``with probability
$1-\delta$ over the draw of $S$, the predicate $\Phi [S,\delta]$ holds''.} \( \forall ^{\delta }S\; \Phi
[S,\delta] \) to mean that with probability at least \( 1-\delta \)
over the choice of the sample \( S \) we have that \( \Phi [S,\delta]
\) holds. In practice \( S \) is the training sample of a learning
algorithm. Note that \( \forall x\; \forall ^{\delta }S\; \Phi [x,\;
S,\; \delta] \) does not imply \( \forall ^{\delta }S\; \forall x\;
\Phi [x,\; S,\; \delta] \). If $X$ is a finite set, and for all $x
\in X$ we have the assertion $\forall \delta > 0 \;\forall^\delta
S\;\Phi[S,x,\delta]$ then by a standard application of the union bound
we have the assertion $\forall \delta > 0 \;\forall^\delta S\;\forall
x \in X \;\Phi[S,x,\frac{\delta}{|X|}]$. We call this the
quantification rule. If $\forall \delta > 0 \;\forall^\delta
S\;\Phi[S,\delta]$ and $\forall \delta > 0 \;\forall^\delta
S\;\Psi[S,\;\delta]$ then by a standard application of the union bound
we have $\forall \delta > 0 \;\forall^\delta
S\;\Phi[S,\frac{\delta}{2}] \wedge \Psi[S,\frac{\delta}{2}]$. We call
this the conjunction rule.
The KL-divergence of $p$ from $q$, denoted \( D(q||p) \), is \( q\ln
(\frac{q}{p})+(1-q)\ln (\frac{1-q}{1-p}) \) with \( 0 \ln
(\frac{0}{p}) =0 \) and \( q \ln (\frac{q}{0}) = \infty \). Let \(
\hat{p} \) be the fraction of heads in a sequence \( S \) of \( m \)
tosses of a biased coin where the probability of heads is \( p \). We
have the following inequality given by Chernoff in 1952 \cite{Ch52}.
\begin{equation}
\label{eqn:relent-prelim}
\forall q\in [p,1]:\; \; Pr(\hat{p}\geq q)\; \; \; \leq \; \; \; e^{-mD(q||p)}
\end{equation}
This bound can be rewritten as follows.
\begin{equation}
\label{eqn:relent}
\forall \delta >0\; \forall ^{\delta }S\; \; \; D(\max(\hat{p},p)||p)\leq \frac{\ln (\frac{1}{\delta })}{m}
\end{equation}
To derive (\ref{eqn:relent}) from (\ref{eqn:relent-prelim}) note that
$Pr(D(\max(\hat{p},p)||p) \geq \frac{\ln (\frac{1}{\delta })}{m})$
equals $Pr(\hat{p} \geq q)$ where $q \geq p$ and $D(q||p) = \frac{\ln (\frac{1}{\delta})}{m}$.
By (\ref{eqn:relent-prelim}) we then have that this probability
is no larger than $e^{-mD(q||p)} = \delta$.
It is just as easy to derive (\ref{eqn:relent-prelim}) from (\ref{eqn:relent})
so the two statements are equivalent.
By duality, i.e., by considering the problem defined by replacing $p$ by $1-p$, we get the following.
\begin{equation}
\label{eqn:relent2}
\forall \delta >0\; \forall ^{\delta }S\; \; \; D(\min(\hat{p},p)||p)\leq \frac{\ln ( \frac{1}{\delta })}{m}
\end{equation}
Conjoining (\ref{eqn:relent}) and (\ref{eqn:relent2}) yields the following
corollary of (\ref{eqn:relent-prelim}).
\begin{equation}
\label{eqn:relent-main}
\forall \delta >0\; \forall ^{\delta }S\; \; \; D(\hat{p}||p)\leq \frac{\ln (\frac{2}{\delta })}{m}
\end{equation}
%This bound can be thought of as essentially the same
%as the standard Chernoff bound except that instead of an $l_2$ metric the
%more natural $D(\hat{p}||p)$ is used.
Using the inequality \( D(q||p)\geq 2(q-p)^{2} \) one can show that
(\ref{eqn:relent-main}) implies the following better known form of the Chernoff bound.
\begin{equation}
\label{eqn:chernoff}
\forall \delta >0\; \forall ^{\delta }S\; \; \; |p-\hat{p}| \leq \sqrt{\frac{\ln (\frac{2}{\delta })}{2m}}
\end{equation}
Using the inequality $D(q||p) \geq \frac{(p-q)^2}{2p}$, which holds for $q \leq p$, we can show
that (\ref{eqn:relent2}) implies the following.\footnote{A derivation of this formula
can be found in \cite{Mc1} or \cite{Mc2}. To see the need for the last term consider the case
where $\hat{p}=0$.}
\begin{equation}
\label{eqn:chernoff2}
\forall \delta >0\; \forall ^{\delta }S\; \; \; p \leq \hat{p} + \sqrt{\frac{2\hat{p}\ln (\frac{1}{\delta })}{m}}
+ \frac{2\ln(\frac{1}{\delta})}{m}
\end{equation}
Note that for small values of $\hat{p}$ formula (\ref{eqn:chernoff2}) gives a tighter upper bound on $p$ than does
(\ref{eqn:chernoff}).
The upper bound on $p$ implicit in (\ref{eqn:relent-main}) is somewhat tighter than
the minimum of the bounds given by (\ref{eqn:chernoff}) and (\ref{eqn:chernoff2}).
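To make this comparison concrete, the following numerical sketch (not from the paper) inverts the bound implicit in (\ref{eqn:relent-main}) by bisection and evaluates (\ref{eqn:chernoff}) and (\ref{eqn:chernoff2}) at a hypothetical $\hat{p}=0.01$, $m=1000$, $\delta=0.05$; the KL form gives the smallest upper bound on $p$.

```python
import math

def kl(q, p):
    """Bernoulli KL-divergence D(q||p), with 0 ln(0/p) = 0 and q ln(q/0) = inf."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    d = 0.0
    if q > 0.0:
        d += q * math.log(q / p)
    if q < 1.0:
        d += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return d

def kl_upper_inverse(q, eps):
    """Largest p >= q with D(q||p) <= eps, by bisection
    (D(q||.) is increasing on [q,1])."""
    lo, hi = q, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if kl(q, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

m, delta, p_hat = 1000, 0.05, 0.01

# Bound implicit in the KL form: largest p with D(p_hat||p) <= ln(2/delta)/m.
b_kl = kl_upper_inverse(p_hat, math.log(2 / delta) / m)

# Additive (two-sided) form.
b_add = p_hat + math.sqrt(math.log(2 / delta) / (2 * m))

# Form tailored to small p_hat, with the additional lower-order term.
b_real = (p_hat + math.sqrt(2 * p_hat * math.log(1 / delta) / m)
          + 2 * math.log(1 / delta) / m)
```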
We now consider a formal setting for hypothesis learning. We assume a finite
set ${\mathcal H}$ of hypotheses and a space
\( {\mathcal{X}} \) of instances. We assume that each hypothesis represents
a function from \( {\mathcal{X}} \) to \( \{0,1\} \) where we write \( h(x) \)
for the value of the function represented by hypothesis \( h \) when applied
to instance \( x \). We also assume a distribution \( D \) on pairs \( \tuple {x,\; y} \)
with $x \in {\mathcal X}$ and $y \in \{0,1\}$.
For any hypothesis \( h \) we define the error rate of \( h \), denoted \( e(h) \),
to be \( P_{\tuple {x,\; y}\sim D}(h(x)\not =y) \). For a given sample \( S \)
of \( m \) pairs drawn from \( D \) we write \( \hateps (h) \) to denote
the fraction of the pairs \( \tuple {x,\; y} \) in \( S \) such that \( h(x)\not =y \).
Quantifying over $h \in \calh$ in (\ref{eqn:relent-main}) yields the following
second corollary of (\ref{eqn:relent-prelim}).
\begin{equation}
\label{eqn:tight-naive}
\forall ^{\delta }S\; \; \; \forall h \in \calh\; \; \; D(\hateps (h)||e(h))\leq
\frac{\ln|{\mathcal H}|+\ln (\frac{2}{\delta})}{m}
\end{equation}
By considering bounds on $D(q||p)$ we can derive the following more
well known corollary of (\ref{eqn:tight-naive}).
\begin{equation}
\label{eqn:naive2}
\forall ^{\delta }S\; \; \; \forall h\in \calh \; \; \;|e(h)-\hateps(h)|
\leq \sqrt{\frac{\ln|\calh| +\ln (\frac{2}{\delta })}{2m}}
\end{equation}
These two formulas both limit the distance between $\hateps(h)$ and
$e(h)$. In this paper we work with (\ref{eqn:tight-naive}) rather
than (\ref{eqn:naive2}) because it yields an (implicit) upper bound on
generalization error that is optimal up to asymptotic equality.
\section{The Upper Bound}
\label{sec:upper}
Our goal now is to improve on (\ref{eqn:tight-naive}). Our first step
is to divide the hypotheses in $\calh$ into $m$ disjoint sets based on their
true error rates. More specifically, for $p \in [0,1]$ define $\dceil{p}$
to be $\frac{\max(1,\myceil{mp})}{m}$. Note that $\dceil{p}$ is of the form $\frac{k}{m}$
where either $p = 0$ and $k = 1$ or $p > 0$ and $p \in (\frac{k-1}{m},\;\frac{k}{m}]$.
In either case we have $\dceil{p} \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$ and
if $\dceil{p} = \frac{k}{m}$ then $p \in [\frac{k-1}{m},\;\frac{k}{m}]$.
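For concreteness, the rounding operator can be written in a few lines of code (an illustrative sketch, not part of the original development):

```python
import math

def dceil(p, m):
    """The rounding operator: dceil(p) = max(1, ceil(m*p)) / m, so that
    dceil(p) lies in {1/m, ..., m/m} and p lies in [dceil(p) - 1/m, dceil(p)].
    Note that p = 0 is assigned to the first shell, dceil(0) = 1/m."""
    return max(1, math.ceil(m * p)) / m
```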
Now we define $\calh(\frac{k}{m})$ to be the set of $h \in \calh$ such that $\dceil{e(h)} = \frac{k}{m}$.
We define $s(\frac{k}{m})$ to be $\ln(\max(1,|\calh(\frac{k}{m})|))$.
We now have the following lemma.
\begin{lem}
\label{lem:main1} With high probability over the draw of $S$, the deviation between
the empirical error $\hat{e}(h)$ and the true error $e(h)$ of every hypothesis
is bounded in terms of $s(\dceil{e(h)})$. More precisely,
$\forall \delta > 0 \;
\forall^\delta S\;\;\; \forall h \in \calh$
$$D(\hateps(h)||e(h)) \leq \frac{s(\dceil{e(h)}) + \ln(\frac{2m}{\delta})}{m}$$
\end{lem}
\begin{proof}
Quantifying over $p \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$ and $h \in \calh(p)$ in
(\ref{eqn:relent-main}) gives $\forall \delta > 0$, $\forall^\delta S$,
$\forall p \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$, $\forall h \in \calh(p)$,
\medskip
$$D(\hateps(h)||e(h)) \leq \frac{s(p) + \ln(\frac{2m}{\delta})}{m}$$
But this implies the lemma.
\end{proof}
Lemma~\ref{lem:main1} imposes a constraint, and hence a bound, on
$e(h)$. More specifically, we have the following
where $\lub\;\{x:\;\Phi[x]\}$ denotes the least upper bound
(the maximum) of the set $\{x:\;\Phi[x]\}$.
{\footnotesize \begin{equation}
\label{eqn:main1-equation}
e(h) \leq \lub\;\{q: D(\hateps(h)||q) \leq \frac{s(\dceil{q}) + \ln(\frac{2m}{\delta})}{m}\}
\end{equation}}
This is our uncomputable bound. It is uncomputable because
the $m$ numbers $s(\frac{1}{m})$, $\ldots$, $s(\frac{m}{m})$ are
unknown. Ignoring this problem, however, we can see that this bound
is typically significantly tighter than (\ref{eqn:tight-naive}). More
specifically, we can rewrite (\ref{eqn:tight-naive}) as follows.
\begin{equation}
\label{eqn:tight-naive2}
e(h) \leq \lub\;\{q :D(\hateps(h)||q) \leq \frac{\ln|\calh| + \ln(\frac{2}{\delta})}{m}\}
\end{equation}
Since $s(\frac{k}{m}) \leq \ln|\calh|$, and since $\frac{\ln m}{m}$ is
small for large $m$, we have that (\ref{eqn:main1-equation}) is never
significantly looser than (\ref{eqn:tight-naive2}). Now consider a
hypothesis $h$ such that the bound on $e(h)$ given by
(\ref{eqn:tight-naive}), or equivalently, (\ref{eqn:tight-naive2}), is
significantly less than 1/2. Assuming $m$ is large, the bound given
by (\ref{eqn:main1-equation}) must also be significantly less than
1/2. But for $q$ significantly less than 1/2 we typically have that
$s(\dceil{q})$ is significantly smaller than $\ln|\calh|$. For
example, suppose $\calh$ is the set of all decision trees of size
$m/10$. For large $m$, a random decision tree of this size has error
rate near 1/2. The set of decision trees with error rate
significantly smaller than 1/2 is an exponentially small fraction of
the set of all possible trees. So for $q$ small compared to 1/2 we
get that $s(\dceil{q})$ is significantly smaller than $\ln|H|$. This
makes the bound given by (\ref{eqn:main1-equation}) significantly
tighter than the bound given by (\ref{eqn:tight-naive2}).
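The effect can be seen numerically. The following sketch (not from the paper; the shell sizes are hypothetical) computes the least upper bound in (\ref{eqn:main1-equation}) shell by shell and compares it with (\ref{eqn:tight-naive2}), using shell sizes that peak at error rate $1/2$ as in the decision-tree example above.

```python
import math

def kl(q, p):
    """Bernoulli KL-divergence D(q||p), with 0 ln(0/p) = 0 and q ln(q/0) = inf."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    d = 0.0
    if q > 0.0:
        d += q * math.log(q / p)
    if q < 1.0:
        d += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return d

def kl_upper_inverse(q, eps):
    """Largest p >= q with D(q||p) <= eps, by bisection."""
    lo, hi = q, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if kl(q, mid) <= eps:
            lo = mid
        else:
            hi = mid
    return lo

def shell_bound(e_hat, s_of_k, m, delta):
    """Numerical version of the uncomputable bound: the least upper bound of
    { q : D(e_hat||q) <= (s(dceil(q)) + ln(2m/delta))/m }.  Within the shell
    ((k-1)/m, k/m] the right-hand side is constant, so each shell contributes
    at most min(k/m, largest q with D(e_hat||q) <= rhs)."""
    log_term = math.log(2.0 * m / delta)
    best = e_hat  # q = e_hat is always feasible since D(e_hat||e_hat) = 0
    for shell in range(1, m + 1):
        rhs = (s_of_k(shell) + log_term) / m
        q_star = kl_upper_inverse(e_hat, rhs)
        if q_star > (shell - 1) / m:  # this shell contains feasible points
            best = max(best, min(q_star, shell / m))
    return best

m, delta, e_hat = 1000, 0.05, 0.05
ln_class_size = 200.0
# Hypothetical shell sizes: s peaks at error rate 1/2 and vanishes at 0 and 1,
# mimicking the decision-tree example in the text.
s_of_k = lambda k: ln_class_size * (1.0 - 2.0 * abs(k / m - 0.5))

b_shell = shell_bound(e_hat, s_of_k, m, delta)
# The naive bound uses ln|H| for every shell.
b_naive = kl_upper_inverse(e_hat, (ln_class_size + math.log(2 / delta)) / m)
```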
We now show that the distribution of true errors can be replaced, essentially,
by the histogram of training errors. We first introduce the following definitions.
{\scriptsize \begin{eqnarray*}
\hat{\calh}\left(\frac{k}{m},\delta\right)\!\!\!\!\! & \equiv & \!\!\left\{h\in \calh:\;\; \left|\hateps(h) - \frac{k}{m}\right|
\leq \frac{1}{m} + \sqrt{\frac{\ln(\frac{16m^2}{\delta})}{2m-1}}\right\} \\
\\
\hats\left(\frac{k}{m},\;\delta\right) \!\! & \equiv &
\ln\left(\max\left(1,\;2\left|\hat{\calh}\left(\frac{k}{m},\;\delta\right)\right|\right)\right)
\end{eqnarray*}}
The definition of $\hats\left(\frac{k}{m},\;\delta\right)$ is motivated by the following lemma.
\begin{lem}
\label{lem:main2} With high probability over the draw of $S$, for all $q$,
$s(q) \leq \hat{s}(q,2\delta)$. More precisely,
$\forall \delta > 0$, $\forall^\delta S$, $\forall q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$,
\medskip
$$s(q) \leq \hats(q,\;2\delta)$$
\end{lem}
Before proving lemma~\ref{lem:main2} we note that
by conjoining (\ref{eqn:main1-equation}) and lemma~\ref{lem:main2}
we get the following. This is our main result.
\begin{thm}
\label{thm:main1} With high probability over the draw of $S$, the true error $e(h)$
of every hypothesis is bounded in terms of its empirical error $\hateps(h)$ and the
computable quantities $\hats(\dceil{q},\;\delta)$. More precisely,
$\forall \delta > 0$, $\forall^\delta S$, $\forall h \in \calh$,
{\footnotesize $$e(h) \leq \lub\;\left\{q: D(\hateps(h)||q)
\leq \frac{\hats(\dceil{q},\;\delta) + \ln (\frac{4m}{\delta})}{m}\right\}
$$}
\end{thm}
As for lemma~\ref{lem:main1}, the bound implicit in theorem~\ref{thm:main1} is typically
significantly tighter than the bound in (\ref{eqn:tight-naive}) or its equivalent
form (\ref{eqn:tight-naive2}).
The argument
for the improved tightness of theorem~\ref{thm:main1} over (\ref{eqn:tight-naive2})
is similar to the argument for (\ref{eqn:main1-equation}). More specifically,
consider a hypothesis $h$ for which the bound in (\ref{eqn:tight-naive2}) is significantly
less than 1/2. Since $\hats(\dceil{q},\;\delta) \leq \ln|\calh|$,
the set of values of $q$ satisfying the condition in theorem~\ref{thm:main1}
must all be significantly less than 1/2. But for large $m$ we have that
$\sqrt{\frac{\ln(16m^2/\delta)}{2m-1}}$ is small. So if $q$ is significantly less
than 1/2 then all hypotheses in $\hat{\calh}(\dceil{q}, \delta)$ have empirical
error rates significantly less than 1/2. But for most hypothesis classes, e.g., decision trees,
the set of hypotheses with empirical error rates far from 1/2 should be an exponentially small
fraction of the class. Hence we get that $\hats(\dceil{q},\;\delta)$
is significantly less than $\ln|\calh|$ and theorem~\ref{thm:main1} is tighter
than (\ref{eqn:tight-naive2}).
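The quantity $\hats$ is directly computable from the histogram of training errors. The following sketch (not from the paper) evaluates it for the hypothetical histogram used in the introduction: one hypothesis with training error $0$ and the rest at $1/2$.

```python
import math

def s_hat(train_errors, k, m, delta):
    """hat-s(k/m, delta) = ln max(1, 2 |hat-H(k/m, delta)|), where hat-H
    collects the hypotheses whose training error lies within
    1/m + sqrt(ln(16 m^2/delta)/(2m-1)) of k/m."""
    width = 1.0 / m + math.sqrt(math.log(16.0 * m * m / delta) / (2 * m - 1))
    count = sum(1 for e in train_errors if abs(e - k / m) <= width)
    return math.log(max(1, 2 * count))

# Hypothetical class of 1000 hypotheses: one with training error 0,
# the rest with training error 1/2.
m, delta = 1000, 0.05
errors = [0.0] + [0.5] * 999
s_low = s_hat(errors, 1, m, delta)    # shell near error 0
s_mid = s_hat(errors, 500, m, delta)  # shell at error 1/2
```

As expected, the shell near zero has an effective size of a single hypothesis, while the shell at $1/2$ carries essentially the whole class.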
The remainder of this section is a proof of lemma~\ref{lem:main2}.
Our departure point for the proof is the following lemma from \cite{PACBayes2}.
\begin{lem}[McAllester 99]
\label{lem:helper1}
For any measure on any hypothesis class we have
the following where $\mtt{E}_h f(h)$ denotes the expectation of $f(h)$ under the
given measure on $h$.
\[ \forall \delta > 0 \;\forall ^{\delta }S \;\;\;\mtt{E}_h e^{(2m-1)(\hat{e}(h)-e(h))^2}
\leq \frac{4m}{\delta } \]
\end{lem}
Intuitively, this lemma states that with high confidence over the
choice of the sample most hypotheses have empirical error near their
true error. This allows us to prove that $\hats(\dceil{q},\;\delta)$
bounds $s(\dceil{q})$. More specifically, by considering the uniform
distribution on $\calh(\frac{k}{m})$, lemma \ref{lem:helper1} implies
the following. {\scriptsize
\begin{eqnarray*}
\mtt{E}_{h\sim \calh\left(\frac{k}{m}\right)} \left( e^{(2m-1)(\hat{e}(h)-e(h))^2} \right)
& \leq & \frac{4m}{\delta } \\
\\
Pr_{h\sim \calh\left(\frac{k}{m}\right)} \left(e^{(2m-1)(\hat{e}(h)-e(h))^2} \geq \frac{8m}{\delta }\right)
& \leq & \frac{1}{2} \\
\\
Pr_{h\sim \calh\left(\frac{k}{m}\right)} \left(e^{(2m-1)(\hat{e}(h)-e(h))^2} \leq \frac{8m}{\delta }\right)
& \geq & \frac{1}{2} \\
\\
\left|\left\{h\in \calh(\frac{k}{m}):\;|\hateps(h)-e(h)|
\leq \sqrt{\frac{\ln(\frac{8m}{\delta})}{2m-1}}\right\}\right|
& \geq & \frac{1}{2}|\calh(\frac{k}{m})| \\
\\
\left|\left\{h\in \calh(\frac{k}{m}):\;|\hateps(h) - \frac{k}{m}| \leq \frac{1}{m} +
\sqrt{\frac{\ln(\frac{8m}{\delta})}{2m-1}}\right\}\right|
& \geq & \frac{1}{2}|\calh(\frac{k}{m})| \\
\end{eqnarray*}
$$\left|\calh(\frac{k}{m})\right| \leq 2\left|\hat{\calh}\left(\frac{k}{m},\;2m\delta\right)\right|$$
}
Lemma~\ref{lem:main2} now follows by quantification over $q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$.
\qed
\section{Asymptotic Analysis and Phase Transitions}
\label{sec:asym-upper}
This section and the two that follow give an asymptotic analysis of
the bounds presented earlier. The asymptotic analysis is stated in
theorem~\ref{thm:upper-limit} and statement~\ref{thm:smearing}. To
develop the asymptotic analysis, however, a preliminary discussion is
needed regarding the phenomenon of phase transitions. The bounds
given in (\ref{eqn:main1-equation}) and theorem~\ref{thm:main1}
exhibit phase transitions. More specifically, the bounding expression
can be discontinuous in $\delta$ and $m$, e.g., arbitrarily small
changes in $\delta$ can cause large changes in the bound. To see how
this happens consider the following constraint on the quantity $q$.
\begin{equation}\label{eqn:constraint}
D(\hateps(h)||q) \leq \frac{s(\dceil{q}) + \ln(\frac{2m}{\delta})}{m}
\end{equation}
The bound given by~(\ref{eqn:main1-equation}) is the least upper bound
of the values of $q$ satisfying~(\ref{eqn:constraint}). Assume that
$m$ is sufficiently large that we can think of
$\frac{s(\dceil{q})}{m}$ as a continuous function of $q$ which we
write as $\tils(q)$. We can then rewrite (\ref{eqn:constraint}) as
follows where $\lambda$ is a quantity not depending on $q$ and
$\tils(q)$ does not depend on $\delta$.
\begin{equation}
\label{eqn:constraint2}
D(\hateps(h)||q) \leq \tils(q) + \lambda
\end{equation}
For $q \geq \hateps(h)$ we know that $D(\hateps(h)||q)$ is a
monotonically increasing function of $q$. It is reasonable to assume
that for $q \leq 1/2$ we also have that $\tils(q)$ is a monotonically
increasing function of $q$. But even under these conditions it is
possible that the feasible values of $q$, i.e., those satisfying
(\ref{eqn:constraint2}), can be divided into separated regions.
Furthermore, increasing $\lambda$ can cause a new feasible region to
come into existence. When this happens the bound, which is the least
upper bound of the feasible values, can increase discontinuously. At
a more intuitive level, consider a large number of high-error concepts
and a smaller number of lower-error concepts. At a certain confidence
level the high error concepts can be ruled out. But as the confidence
requirement becomes more stringent suddenly (and discontinuously) the
high error concepts must be considered. A similar discontinuity can
occur in sample size. Phase transitions in shell decomposition bounds
are discussed in more detail by Haussler et al. \cite{HKST96}.
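The discontinuity can be reproduced numerically. In the sketch below (hypothetical, not from the paper), $\tils(q)$ places a small mass of low-error concepts below $q = 0.3$ and a much larger mass above; a modest increase in $\lambda$ makes the high-error region feasible and the least upper bound jumps.

```python
import math

def kl(q, p):
    """Bernoulli KL-divergence D(q||p), with 0 ln(0/p) = 0 and q ln(q/0) = inf."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if q == p else float("inf")
    d = 0.0
    if q > 0.0:
        d += q * math.log(q / p)
    if q < 1.0:
        d += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return d

def lub_feasible(e_hat, s_tilde, lam, grid=5000):
    """Numerical lub{ q in [0,1] : D(e_hat||q) <= s_tilde(q) + lam }."""
    best = e_hat
    for i in range(grid + 1):
        q = i / grid
        if kl(e_hat, q) <= s_tilde(q) + lam:
            best = max(best, q)
    return best

# A two-shell limit function: a small mass of low-error concepts and a much
# larger mass of concepts with error rate around 1/2 (values hypothetical).
s_tilde = lambda q: 0.02 if q < 0.3 else 0.19
e_hat = 0.05

b_lo_lam = lub_feasible(e_hat, s_tilde, 0.005)  # high-error shell ruled out
b_hi_lam = lub_feasible(e_hat, s_tilde, 0.020)  # new feasible region appears
```

Between the two values of $\lambda$ the feasible set splits into two separated regions and the least upper bound jumps discontinuously from roughly $0.11$ to roughly $0.31$.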
Phase transitions complicate asymptotic analysis. But asymptotic
analysis illuminates the nature of phase transitions. As mentioned in
the introduction, in the asymptotic analysis of learning theory bounds
it is important that one does not hold $\calh$ fixed as the sample
size increases. If we hold $\calh$ fixed then $\lim_{m\rightarrow
\infty} \frac{\ln|\calh|}{m} = 0$. But this is not what one expects
for large samples in practice. As the sample size increases one
typically uses larger hypothesis classes. Intuitively, we expect that
even for very large $m$ we have that $\frac{\ln|\calh|}{m}$ is far
from zero.
For the asymptotic analysis of the bound in~(\ref{eqn:main1-equation})
we assume an infinite sequence of hypothesis classes
$\calh_1$, $\calh_2$, $\calh_3$ $\ldots$ and an infinite sequence
of data distributions
$D_1$, $D_2$, $D_3$, $\ldots$. Let $s_m(\frac{k}{m})$ be $s(\frac{k}{m})$
defined relative to $\calh_m$ and $D_m$.
In the asymptotic analysis we assume that
the sequence of functions $\frac{s_m(\dceil{q})}{m}$, viewed as functions of $q \in [0,1]$, converges uniformly to
a continuous function $\tils(q)$. This means that for any $\epsilon > 0$ there
exists a $k$ such that for all $m > k$ we have the following.
$$\forall q \in [0,1] \;\; |\frac{s_m(\dceil{q})}{m} - \tils(q)| \leq \epsilon$$
Given the functions $\frac{s_m(\dceil{p})}{m}$ and their limit function
$\tils(p)$, we define the following functions of an empirical error
rate $\hat{e}$.
{\footnotesize \begin{eqnarray*}
B_m(\hat{e}) & \equiv & \lub\;\left\{q:D(\hat{e}||q) \leq \frac{s_m(\dceil{q}) + \ln(\frac{2m}{\delta})}{m}\right\} \\
\\
B(\hat{e}) & \equiv & \lub\;\{q:D(\hat{e}||q) \leq \tils(q)\}
\end{eqnarray*}}
The function $B_m(\hat{e})$ corresponds directly to the upper bound in (\ref{eqn:main1-equation}).
The function $B(\hat{e})$ is intended to be the large $m$ asymptotic limit of $B_m(\hat{e})$.
However, phase transitions complicate asymptotic analysis. The bound $B(\hat{e})$
need not be a continuous function of $\hat{e}$. A value of $\hat{e}$ where
the bound $B(\hat{e})$ is discontinuous corresponds to a phase transition in the bound.
At a phase transition the sequence $B_m(\hat{e})$ need not converge.
Away from phase transitions, however, we have the following theorem.
\begin{thm}
\label{thm:upper-limit}
If the bound $B(\hat{e})$ is continuous at the point $\hat{e}$ (so we
are not at a phase transition), and the functions (parameterized by
$m$) $\frac{s_m(\dceil{q})}{m}$, viewed as functions of $q \in [0,1]$,
converge uniformly to a continuous function $\tils(q)$, then we have
the following.
$$\lim_{m\rightarrow \infty} B_m(\hat{e}) = B(\hat{e})$$
\end{thm}
\begin{proof}
Define the set $F_m(\hat{e})$ as follows.
{\footnotesize $$F_m(\hat{e}) \equiv \left\{q: \;D(\hat{e}||q)
\leq \frac{s_m(\dceil{q}) + \ln(\frac{2m}{\delta})}{m}\right\}$$}
This gives the following.
$$B_m(\hat{e}) = \mtt{lub}\; F_m(\hat{e})$$
Similarly, define $F(\hat{e},\;\epsilon)$ and $B(\hat{e},\;\epsilon)$
as follows.
\begin{eqnarray*}
F(\hat{e},\;\epsilon) & \equiv & \{q \in [0,1]:\;D(\hat{e}||q) \leq \tils(q) + \epsilon\} \\
B(\hat{e},\;\epsilon) & \equiv & \mtt{lub}\; F(\hat{e},\;\epsilon)
\end{eqnarray*}
We first show that the continuity of $B(\hat{e})$ at the point $\hat{e}$
implies the continuity of $B(\hat{e},\;\epsilon)$ at the point
$\tuple{\hat{e},\;0}$.
We note that there exists a continuous function $f(\hat{e},\;\epsilon)$
with $f(\hat{e},\;0) =\hat{e}$ and such that
for any $\epsilon$ sufficiently near 0 we have the following.
$$D(f(\hat{e},\;\epsilon)||q) = D(\hat{e}||q) - \epsilon$$
We then get the following equation.
$$B(\hat{e},\;\epsilon) = B(f(\hat{e},\;\epsilon))$$
Since $f$ is continuous,
and $B(\hat{e})$ is continuous at the point $\hat{e}$, we get that
$B(\hat{e},\;\epsilon)$ is continuous at the point $\tuple{\hat{e},\;0}$.
We now prove the theorem.
The functions of the form $\frac{s_m(\dceil{q}) + \ln\frac{2m}{\delta}}{m}$ converge uniformly
to the function $\tils(q)$. This implies that for any $\epsilon > 0$
there exists a $k$ such that for all $m > k$ we have the following.
$$F(\hat{e},\;-\epsilon) \subseteq F_m(\hat{e}) \subseteq F(\hat{e},\;\epsilon)$$
But this in turn implies the following.
\begin{equation}
\label{eqn:proof-bounds}
B(\hat{e},\;-\epsilon) \leq B_m(\hat{e}) \leq B(\hat{e},\;\epsilon)
\end{equation}
The theorem now follows from the continuity of the function $B(\hat{e},\;\epsilon)$
at the point $\tuple{\hat{e},\;0}$.
\end{proof}
Theorem~\ref{thm:upper-limit} can be interpreted as saying that for large sample
sizes, and for values of $\hat{e}$ other than the special phase transition values,
the bound has a well defined value independent of the confidence parameter $\delta$
and determined only by a smooth function $\tils(q)$.
A similar statement can be made for the bound in
theorem~\ref{thm:main1} --- for
large $m$, and at points other than phase transitions, the bound is
independent of $\delta$ and is determined by a smooth limit curve.
For the asymptotic analysis of theorem~\ref{thm:main1}
we assume
an infinite sequence $\calh_1$, $\calh_2$, $\calh_3$, $\ldots$ of hypothesis
classes and an infinite sequence $S_1$, $S_2$, $S_3$, $\ldots$ of samples
such that sample $S_m$ has size $m$. Let $\calh_m(\frac{k}{m},\;\delta)$
and $\hats_m(\frac{k}{m},\;\delta)$ be $\calh(\frac{k}{m},\;\delta)$ and
$\hats(\frac{k}{m},\;\delta)$ respectively defined relative to hypothesis class $\calh_m$
and sample $S_m$.
Let $U_m(\frac{k}{m})$ be the set of hypotheses in $\calh_m$ having an empirical
error of exactly $\frac{k}{m}$ in the sample $S_m$.
Let $u_m(\frac{k}{m})$ be $\ln(\max(1,\;|U_m(\frac{k}{m})|))$.
In the analysis of theorem~\ref{thm:main1} we allow that the functions
$\frac{u_m(\dceil{q})}{m}$ are only locally uniformly convergent
to a continuous function $\bar{u}(q)$, i.e., for
any $q \in [0,1]$ and any $\epsilon> 0$ there exists an integer
$k$ and real number $\gamma > 0$ satisfying the following.
$$\forall m > k, \;\forall p \in (q-\gamma,\;q+\gamma)\; | \frac{u_m(\dceil{p})}{m} - \bar{u}(p)| \leq \epsilon$$
Locally uniform convergence plays a role in the analysis in section~\ref{sec:smearing}.
\begin{thm}
\label{thm:upper-agreement}
If the functions $\frac{u_m(\dceil{q})}{m}$ converge locally
uniformly to a continuous function $\bar{u}(q)$ then,
for any fixed value of $\delta$,
the functions $\frac{\hats_m(\dceil{q},\;\delta)}{m}$ also converge locally
uniformly to $\bar{u}(q)$. If the convergence of $\frac{u_m(\dceil{q})}{m}$
is uniform, then so is the convergence of $\frac{\hats_m(\dceil{q},\;\delta)}{m}$.
\end{thm}
\begin{proof} Consider an arbitrary value $q \in [0,1]$ and $\epsilon > 0$.
We construct the desired $k$ and $\gamma$. More specifically, select
$k$ sufficiently large and $\gamma$ sufficiently small that we have
the following properties. {\footnotesize
$$\forall m > k, \;\forall p \in (q-2\gamma,\;q+2\gamma)\; \left| \frac{u_m(\dceil{p})}{m} - \bar{u}(p)\right|
< \frac{\epsilon}{3}$$}
\medskip
$$\forall p \in (q-2\gamma,\;q+2\gamma)\;\; |\bar{u}(p)-\bar{u}(q)| \leq \frac{\epsilon}{3}$$
\medskip
$$\frac{1}{k} + \sqrt{\frac{\ln(\frac{16k^2}{\delta})}{2k-1}} < \gamma$$
\medskip
$$\frac{\ln k}{k} \leq \frac{\epsilon}{3}$$
Consider an $m > k$ and $p \in (q-\gamma,\;q+\gamma)$.
It now suffices to show the following.
$$\left|\frac{\hats_m(\dceil{p},\;\delta)}{m} - \bar{u}(p)\right| \leq \epsilon$$
Because $U_m(\dceil{p})$ is a subset of $\calh_m(\dceil{p},\;\delta)$
we have the following.
$$\frac{\hats_m(\dceil{p},\;\delta)}{m} \geq \frac{u_m(\dceil{p})}{m} \geq \bar{u}(p) - \frac{\epsilon}{3}$$
We can also upper bound $\frac{\hats_m(\dceil{p},\;\delta)}{m}$
as follows.
\begin{eqnarray*}
|\calh_m(\dceil{p},\delta)| & \leq & \sum_{|\frac{k}{m} - p|
\leq \gamma} \left|U_m\left(\frac{k}{m}\right)\right| \\
\\
& \leq & \sum_{|\frac{k}{m} - p| \leq \gamma} e^{u_m(\frac{k}{m})} \\
\\
& \leq & \sum_{|\frac{k}{m} - p| \leq \gamma} e^{m(\bar{u}(\frac{k}{m}) + \frac{\epsilon}{3})} \\
\\
& \leq & \sum_{|\frac{k}{m} - p| \leq \gamma} e^{m(\bar{u}(p) + \frac{2\epsilon}{3})} \\
\\
& \leq & me^{m(\bar{u}(p) + \frac{2\epsilon}{3})} \\
\\
\frac{\hats_m(\dceil{p},\;\delta)}{m} & \leq & \bar{u}(p) + \frac{2\epsilon}{3} + \frac{\ln m}{m} \\
\\
& \leq & \bar{u}(p) + \epsilon
\end{eqnarray*}
A similar argument shows that if $\frac{u_m(\dceil{q})}{m}$ converges
uniformly to $\bar{u}(q)$ then so does $\frac{\hats_m(\dceil{q},\;\delta)}{m}$.
\end{proof}
Given quantities $\frac{\hats_m(\dceil{q},\;\delta)}{m}$ that converge uniformly to $\bar{u}(q)$,
the remainder of the analysis is identical to that for the asymptotic analysis of
(\ref{eqn:main1-equation}). We define the following upper bounds.
{\scriptsize \begin{eqnarray*}
\hat{B}_m(\hat{e}) &\equiv & \lub\;\left\{q: D(\hat{e}||q)
\leq \frac{\hats_m(\dceil{q},\;\delta) + \ln\left(\frac{4m}{\delta}\right)}{m}\right\} \\
\\
\hat{B}(\hat{e}) & \equiv & \lub\;\{q: D(\hat{e}||q) \leq \bar{u}(q)\}
\end{eqnarray*}}
Again we say that $\hat{e}$ is at a phase transition if the function $\hat{B}(\hat{e})$
is discontinuous at the value $\hat{e}$.
We then get the following whose proof is identical to that of theorem~\ref{thm:upper-limit}.
\begin{thm}
\label{thm:upper-limit2}
If the bound $\hat{B}(\hat{e})$ is continuous at the point $\hat{e}$ (so we are not at
a phase transition), and the functions $\frac{u_m(\dceil{q})}{m}$
converge uniformly to $\bar{u}(q)$, then
we have the following.
$$\lim_{m\rightarrow \infty} \hat{B}_m(\hat{e}) = \hat{B}(\hat{e})$$
\end{thm}
\section{Asymptotic Optimality of (\ref{eqn:main1-equation})}
\label{sec:lower}
Formula~(\ref{eqn:main1-equation}) can be viewed as providing an upper
bound on $e(h)$ as a function of $\hateps(h)$ and the function $s$.
In this section we show that for any curve $s$ and value $\hateps$
there exists a hypothesis class and data distribution such that the
upper bound in (\ref{eqn:main1-equation}) is realized up to asymptotic
equality. Up to asymptotic equality, (\ref{eqn:main1-equation}) is
the tightest possible bound computable from $\hateps(h)$ and the $m$
numbers $s(\frac{1}{m})$, $\ldots$, $s(\frac{m}{m})$.
The classical VC dimension bounds are nearly optimal among bounds
computable from the error rate $\hateps(h^*)$ of the chosen hypothesis and
the class $\calh$. The $m$ numbers $s(\frac{1}{m})$, $\ldots$,
$s(\frac{m}{m})$ depend on both $\calh$ and the data distribution.
Hence the bound in (\ref{eqn:main1-equation}) uses information about
the distribution and hence can be tighter than classical VC bounds. A
similar statement applies to the bound in theorem~\ref{thm:main1}
computed from the empirically observable numbers $\hats(\frac{1}{m})$,
$\ldots$, $\hats(\frac{m}{m})$. In this case, the bound uses more
information from the sample than just $\hateps(h)$. The optimality
theorem given here also differs from the traditional lower bound
results for VC dimension in that here the lower bounds match the upper
bounds up to asymptotic equality.
The departure point for our optimality analysis
is the following lemma from~\cite{TnC}.
\begin{lem}[Cover and Thomas]
\label{lem:ct2}
If $\hat{p}$ is the fraction of heads out of $m$ tosses of a coin
where the true probability of heads is $p$ then for $q \geq p$ we have
the following.
$$Pr(\hat{p} \geq q) \geq \frac{1}{m+1} e^{-mD(q||p)}$$
\end{lem}
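Lemma~\ref{lem:ct2} is easy to verify numerically when $q$ is a multiple of $\frac{1}{m}$. The following sketch (our own code, with illustrative function names) compares the exact binomial tail probability with the stated lower bound.

```python
import math
from math import comb

def kl(p, q):
    """Relative entropy D(p||q) between Bernoulli(p) and Bernoulli(q)."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

def binom_tail(m, p, k):
    """Pr(#heads >= k) in m tosses of a coin with bias p, computed exactly."""
    return sum(comb(m, j) * p**j * (1 - p)**(m - j) for j in range(k, m + 1))

# Example instance: m tosses of a 0.3-coin, tail at q = 40/100.
m, p, k = 100, 0.3, 40
q = k / m
lower = math.exp(-m * kl(q, p)) / (m + 1)  # Cover-Thomas lower bound
tail = binom_tail(m, p, k)                 # exact tail probability
```

Here the exact tail dominates the lower bound, as the lemma requires, and the two agree up to the polynomial factor $\frac{1}{m+1}$ on the exponential scale.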
This lower bound on $Pr(\hat{p} \geq q)$ is very close to Chernoff's 1952
upper bound~(\ref{eqn:relent-prelim}). The tightness of (\ref{eqn:main1-equation})
is a direct reflection of the tightness of~(\ref{eqn:relent-prelim}). To exploit
Lemma~\ref{lem:ct2} we need to construct hypothesis classes and data distributions
where distinct hypotheses have independent training errors. More specifically,
we say that a set of hypotheses $\{h_1,\;\ldots,\;h_n\}$ has independent training errors
if the random variables
$\hateps(h_1)$, $\ldots$, $\hateps(h_n)$ are independent.
By an argument similar to the derivation of (\ref{eqn:relent2}) from (\ref{eqn:relent-prelim})
we can prove the following from Lemma \ref{lem:ct2}.
{\footnotesize \begin{equation}
\label{eqn:lower-relent}
Pr\left(D(\min(\hat{p},p)||p) \geq \frac{\ln(\frac{1}{\delta}) - \ln(m+1)}{m}\right) \geq \delta
\end{equation}}
\begin{lem}
\label{lem:exists}
Let $X$ be any finite set, $S$ a random variable,
and $\Theta[S,x,\delta]$ a formula such that
for every $x\in X$ and $\delta >0$
we have \( Pr(\Theta [S,x,\delta ])\geq \delta \), and
$Pr(\forall x\in X\; \Theta [S,\; x,\; \delta])=\prod _{x\in X}Pr(\Theta [S,\; x,\; \delta])$.
We then have $\forall \delta >0 \;\forall ^{\delta }S \;\;\;\exists x\in X \;\;\Theta [S,x,\frac{\ln (\frac{1}{\delta })}{|X|}]$.
\end{lem}
\begin{proof}
\[
\begin{array}{ccc}
Pr(\Theta [S,x,\frac{\ln (\frac{1}{\delta })}{|X|}]) & \geq & \frac{\ln (\frac{1}{\delta })}{|X|}\\
Pr(\neg \Theta [S,x,\frac{\ln(\frac{1}{\delta })}{|X|}]) & \leq & 1-\frac{\ln (\frac{1}{\delta })}{|X|}\\
& \leq & e^{-\frac{\ln (\frac{1}{\delta })}{|X|}}\\
Pr(\forall x \in X \neg \Theta[S,\;x,\frac{\ln(\frac{1}{\delta })}{|X|}]) & \leq & e^{-\ln ({\frac{1}{\delta}})} = \delta
\end{array}\]
\end{proof}
Now define $h^*(\frac{k}{m})$ to be the hypothesis of minimal training
error in the set $\calh(\frac{k}{m})$. Let $\glb\;\{x:\;\Phi[x]\}$
denote the greatest lower bound (the minimum) of the set $\{x:\;\Phi[x]\}$.
We now have the following lemma.
\begin{lem}
\label{lem:ct-consequence}
If the hypotheses in the class $\calh(\dceil{q})$ have independent training errors then
$\forall \delta > 0$, $\forall^\delta S$, $\forall q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$,
{\scriptsize
$$\hateps(h^*(q)) \leq \glb\;\left\{\hateps: \begin{array}{l}
D(\min(\hateps,\;q-\frac{1}{m})||q)
\leq \frac{s(q) - \ln(m+1) - \ln(\ln(\frac{m}{\delta}))}{m}
\end{array}\right\}$$
}
\end{lem}
\begin{proof}
To prove lemma~\ref{lem:ct-consequence} let $q$ be a fixed rational number of the form $\frac{k}{m}$.
Assuming independent hypotheses we can apply Lemma~\ref{lem:exists} to (\ref{eqn:lower-relent})
to get $\forall \delta > 0$, $\forall^\delta S$, $\exists h \in \calh(\frac{k}{m})$,
{\scriptsize $$D(\min(\hateps(h),e(h))||e(h)) \geq \frac{s(q) - \ln(m+1) -
\ln(\ln(\frac{1}{\delta}))}{m}$$}
Let $w$ be the hypothesis in $\calh(q)$ satisfying this formula.
We now have $\hateps(h^*(q)) \leq \hateps(w)$
and $q-\frac{1}{m} \leq e(w) \leq q$.
These two conditions imply $\forall \delta > 0$, $\forall^\delta S$,
$$\begin{array}{l}
D(\min(\hateps(h^*(q)),q-\frac{1}{m})||q)
\geq \frac{s(q) - \ln(m+1) -\ln(\ln\frac{1}{\delta})}{m}
\end{array}$$
This implies the following.
{\footnotesize
$$\hateps(h^*(q)) \leq \glb\;\left\{\hateps: \begin{array}{l}
D(\min(\hateps,\;q-\frac{1}{m})||q)
\leq \frac{s(q) - \ln(m+1) - \ln(\ln(\frac{1}{\delta}))}{m}
\end{array}\right\}$$}
Lemma~\ref{lem:ct-consequence} now follows by quantification over
$q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$.
\end{proof}
For $q \in [0,1]$, lemma~\ref{lem:main1} implies the following.
{\scriptsize
$$\hateps(h^*(\dceil{q})) \geq \glb\;\left\{\hateps:\begin{array}{l}
D\left(\hateps||\dceil{q}-\frac{1}{m}\right)
\leq \frac{s(\dceil{q}) + \ln\left(\frac{2m}{\delta}\right)}{m}
\end{array}\right\}$$
}
We now have upper and lower bounds on the quantity $\hateps(h^*(\dceil{q}))$
which agree up to asymptotic equality --- in a large $m$ limit where
$\frac{s_m(\dceil{q})}{m}$ converges (pointwise) to a continuous function
$\tils(q)$ we have that the upper and lower bound on
$\hateps(h^*(\dceil{q}))$ both converge (pointwise) to the following.
$$\hateps(h^*(q)) = \glb\;\{\hateps: D(\hateps||q) \leq \tils(q)\}$$
This asymptotic value of $\hateps(h^*(q))$ is a continuous
function of $q$. Since $q$ is held fixed in calculating the bounds on $\hateps(h^*(\dceil{q}))$,
phase transitions are not an issue and uniform
convergence of the functions $\frac{s_m(\dceil{q})}{m}$ is not required.
Note that for
large $m$ and independent hypotheses we get that $\hateps(h^*(q))$ is determined as a
function of the true error rate $q$ and $\frac{s(\dceil{q})}{m}$.
The following lemma states that any limit function $\tils(p)$ is consistent
with the possibility that hypotheses are independent. This, together with
lemma~\ref{lem:ct-consequence} implies that no uniform bound on $e(h)$
as a function of $\hateps(h)$ and $|\calh(\frac{1}{m})|$, $\ldots$,
$|\calh(\frac{m}{m})|$ can be asymptotically tighter than
(\ref{eqn:main1-equation}).
\begin{thm}
\label{thm:construction}
Let $\tils(p)$ be any continuous function of $p \in [0,1]$.
There exists an infinite sequence of hypothesis spaces $\calh_1$, $\calh_2$,
$\calh_3$, $\ldots$, and sequence of data distributions $D_1$, $D_2$, $D_3$, $\ldots$
such that each class $\calh_m$ has independent hypotheses for data distribution
$D_m$ and such that
$\frac{s_m(\dceil{p})}{m}$ converges (pointwise) to $\tils(p)$.
\end{thm}
\begin{proof}
First we show that if $|\calh_m(\frac{i}{m})| =
e^{m\tils(\frac{i}{m})}$ then the functions
$\frac{s_m(\dceil{p})}{m}$ converge (pointwise) to $\tils(p)$.
Assume $|\calh_m(\frac{i}{m})| =
e^{m\tils(\frac{i}{m})}$. In this case
we have the following.
$$\frac{s_m(\dceil{p})}{m} = \tils(\dceil{p})$$
Since $\tils(p)$ is continuous, for any fixed value of $p$ we get that
$\frac{s_m(\dceil{p})}{m}$ converges to $\tils(p)$.
Recall that $D_m$ is a probability distribution on pairs $\tuple{x,\;y}$
with $y \in \{0,1\}$ and $x \in X_m$ for some set $X_m$.
We take $\calh_m$ to be a disjoint union of sets
$\calh_m(\frac{k}{m})$ where $|\calh_m(\frac{k}{m})|$ is selected as above.
Let $f_1$, $\ldots$, $f_N$ be the elements of $\calh_m$
with $N = |\calh_m|$.
Let $X_m$ be the set of all $N$-bit bit strings and define
$f_i(x)$ to be the value of $i$th bit of the bit vector $x$.
Now define the distribution $D_m$ on pairs $\tuple{x,\;y}$
by selecting $y$ to be 1 with probability 1/2 and then selecting
each bit of $x$ independently where the $i$th bit is selected to disagree
with $y$ with probability $\frac{k}{m}$ where $k$ is such that $f_i \in \calh_m(\frac{k}{m})$.
\end{proof}
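The construction above is straightforward to simulate: hypothesis $f_i$ reads bit $i$ of the input, and the bits are flipped independently, so the empirical errors $\hateps_m(f_i)$ are independent binomial random variables. The sketch below (our own illustrative code) collapses the shell structure to a list of per-hypothesis true error rates.

```python
import random

def sample_dataset(m, true_errors, rng):
    """Draw m iid pairs <x, y> from the construction: y is a fair coin and,
    independently for each hypothesis i, bit i of x disagrees with y with
    probability true_errors[i]."""
    data = []
    for _ in range(m):
        y = rng.randint(0, 1)
        x = [y ^ (rng.random() < p) for p in true_errors]
        data.append((x, y))
    return data

def empirical_errors(data):
    """Empirical error rate of each bit-reading hypothesis f_i."""
    m = len(data)
    n = len(data[0][0])
    return [sum(x[i] != y for x, y in data) / m for i in range(n)]
```

Because the coordinates of $x$ are chosen independently given $y$, the training errors of distinct hypotheses are independent, which is exactly the property needed by the optimality analysis.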
\section{Relating $\hat{s}$ and $s$}
\label{sec:smearing}
In this section we show that in large $m$ limits of
the type discussed in section~\ref{sec:asym-upper} the
histogram of empirical errors need not converge to the histogram
of true errors. So even in the large $m$ asymptotic limit,
the bound given by theorem~\ref{thm:main1} is significantly weaker
than the bound given by~(\ref{eqn:main1-equation}).
To show that $\hats(\dceil{q},\;\delta)$ can be asymptotically
different from $s(\dceil{q})$ we consider the case of independent
hypotheses. More specifically, given a continuous function $\tils(p)$
we construct an infinite sequence of
hypothesis spaces $\calh_1$, $\calh_2$, $\calh_3$, $\ldots$ and an
infinite sequence of data distributions $D_1$, $D_2$, $D_3$, $\ldots$
using the construction in the proof of theorem~\ref{thm:construction}.
We note that if $\tils(p)$ is differentiable with
bounded derivative then the functions $\frac{s_m(\dceil{p})}{m}$
converge uniformly to $\tils(p)$.
For a given infinite sequence of data distributions
we generate an infinite sample sequence
$S_1$, $S_2$, $S_3$, $\ldots$ by selecting $S_m$ to consist of $m$
pairs $\tuple{x,\;y}$ drawn IID from distribution $D_m$. For a given sample
sequence and $h \in \calh_m$ we define $\hateps_m(h)$ and
$\hats_m(\frac{k}{m},\;\delta)$ in a manner similar to $\hateps(h)$
and $\hats(\frac{k}{m},\;\delta)$ but for sample $S_m$.
The main result of this section is the following.
\begin{con}
\label{thm:smearing}
If each $\calh_m$ has independent hypotheses under data distribution $D_m$,
and the functions $\frac{s_m(\dceil{p})}{m}$ converge uniformly
to a continuous function $\tils(p)$, then for any $\delta > 0$ and $p \in [0,1]$,
we have the following with probability 1 over the generation of the sample sequence.
$$\lim_{m\rightarrow \infty} \frac{\hats_m(\dceil{p},\;\delta)}{m} = \sup_{q\in[0,1]} \left(\tils(q) - D(p||q)\right)$$
\end{con}
We call this a conjecture rather than a theorem because the proof has
not been worked out to a high level of rigor. Nonetheless, we believe
the proof sketch given below can be expanded to a fully rigorous
argument.
Before giving the proof sketch we note that the limiting value of
$\frac{\hats_m(\dceil{p},\;\delta)}{m}$ is independent of $\delta$. This is
consistent with theorem~\ref{thm:upper-agreement}. Define
$\bar{\hats}(p)$ as follows.
$$\bar{\hats}(p) \equiv \sup_{q \in [0,1]}\; \left(\tils(q) - D(p||q)\right)$$
Note that $\bar{\hats}(p) \geq \tils(p)$. This gives an
asymptotic version of lemma~\ref{lem:main2}. But since $D(p||q)$ can
be locally approximated as $c(p-q)^2$ (up to its second order Taylor
expansion), if $\tils(p)$ is increasing at the point $p$ then
we also get that $\bar{\hats}(p)$ is strictly larger than $\tils(p)$.
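The smeared curve $\bar{\hats}(p)$ is easy to compute from $\tils$ by a one-dimensional grid search. In the sketch below (our own code) the increasing limit curve \mtt{s\_bar} is hypothetical.

```python
import math

def kl(p, q):
    """Relative entropy D(p||q) between Bernoulli(p) and Bernoulli(q)."""
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return term(p, q) + term(1 - p, 1 - q)

def smeared(s_bar, p, grid=2000):
    """bar{s-hat}(p) = sup_q ( s_bar(q) - D(p||q) ), approximated on a grid."""
    return max(s_bar(i / grid) - kl(p, i / grid) for i in range(1, grid))

# A hypothetical increasing limit curve.  Since D(p||q) is only quadratic in
# q near q = p while s_bar gains linearly, the supremum is attained at some
# q > p, so smeared(s_bar, p) strictly exceeds s_bar(p).
s_bar = lambda q: 0.5 * q
```

This makes the smearing effect concrete: wherever $\tils$ is increasing, the computed $\bar{\hats}(p)$ comes out strictly above $\tils(p)$.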
\begin{proof_outline}
To prove conjecture~\ref{thm:smearing} we first define $\calh_m(p,\;q)$
for $p,q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}$ to be the set of
all $h \in \calh_m(q)$ such that $\hateps_m(h) = p$. Intuitively,
$\calh_m(p,\;q)$ is the set of concepts with true error rate near $q$
that have empirical error rate $p$. Ignoring factors that are only
polynomial in $m$, the probability of a hypothesis with true error
rate $q$ having empirical error rate $p$ can be written as
(approximately) $e^{-mD(p||q)}$. So the expected size of
$\calh_m(p,\;q)$ can be written as $|\calh_m(q)|e^{-mD(p||q)}$, or
alternatively, (approximately) as $e^{m\tils(q)}e^{-mD(p||q)}$ or
$e^{m(\tils(q)-D(p||q))}$. More formally, we have the following for
any fixed value of $p$ and $q$.
$$\lim_{m \rightarrow \infty} \frac{\ln(\max(1,\;\mtt{E}(|\calh_m(\dceil{p},\;\dceil{q})|)))}{m} = \max(0,\;\tils(q) - D(p||q))$$
We now show that the expectation can be eliminated from the above
limit.
First, consider distinct values of $p$ and $q$ such that $\tils(q) - D(p||q) > 0$.
Since $p$ and $q$ are distinct, the probability that a fixed
hypothesis in $\calh_m(\dceil{q})$ is in $\calh_m(\dceil{p},\;\dceil{q})$ declines
exponentially in $m$. Since $\tils(q) -D(p||q) > 0$ the expected size of
$\calh_m(\dceil{p},\;\dceil{q})$ grows exponentially in $m$.
Since the hypotheses are independent, the distribution of possible values of
$|\calh_m(\dceil{p},\;\dceil{q})|$ becomes essentially a Poisson distribution
with an expected number of arrivals growing exponentially in $m$.
The probability that $|\calh_m(\dceil{p},\;\dceil{q})|$ deviates
from its expectation by as much as a factor of 2 declines exponentially in $m$.
We say that a sample sequence is safe after $k$ if for all $m > k$
we have that $|\calh_m(\dceil{p},\;\dceil{q})|$ is within a factor of 2 of its expectation.
Since the probability of being unsafe at $m$ declines exponentially in $m$,
for any $\delta$ there exists a $k$
such that with probability at least $1-\delta$ the sample sequence is
safe after $k$. So for any $\delta > 0$ we have that with probability at least $1-\delta$
the sequence is safe after some $k$. But since this holds for
all $\delta > 0$, with probability 1 such a $k$ must exist. Hence with
probability 1 we have the following.
$$\lim_{m\rightarrow \infty} \;\frac{\ln(\max(1,|\calh_m(\dceil{p},\;\dceil{q})|))}{m} = \tils(q) - D(p||q)$$
We now define $s_m(\dceil{p},\;\dceil{q})$ as follows.
$$s_m(\dceil{p},\;\dceil{q}) \equiv \ln(\max(1,\;|\calh_m(\dceil{p},\;\dceil{q})|))$$
It is also possible to show that for $p=q$ we have, with probability 1,
that $\frac{s_m(\dceil{p},\;\dceil{q})}{m}$ approaches $\tils(p)$,
and that for distinct $p$ and $q$ with $\tils(q)-D(p||q) \leq 0$ we have
that $\frac{s_m(\dceil{p},\;\dceil{q})}{m}$ approaches 0. Putting these together
yields that with probability 1 we have the following.
{\footnotesize \begin{equation}
\label{eqn:smearing-limit1}
\lim_{m\rightarrow \infty} \;\frac{s_m(\dceil{p},\;\dceil{q})}{m} =
\max(0,\;\tils(q) - D(p||q))
\end{equation}}
Define $U_m(\frac{k}{m})$ and $u_m(\frac{k}{m})$ as in section~\ref{sec:asym-upper}.
We now have the following equality.
$$U_m(p) = \cup_{q \in \{\frac{1}{m},\;\ldots,\;\frac{m}{m}\}}
\calh_m(p,\;q)$$ We now show that with probability $1$,
$\frac{u_m(p)}{m}$ approaches $\bar{\hats}(p)$. First, consider a $p
\in [0,1]$ such that $\bar{\hats}(p) > 0$. Since
$\tils(q)-D(p||q)$ is a continuous function of $q$, and $[0,1]$ is a compact
set, $\sup_{q\in[0,1]} \left(\tils(q) - D(p||q)\right)$ must be realized at some
value $q^* \in [0,1]$, i.e., there is a $q^*$ with $\tils(q^*)-D(p||q^*)$
equal to $\bar{\hats}(p)$. We have that $u_m(\dceil{p}) \geq
s_m(\dceil{p},\;\dceil{q^*})$. This, together with
(\ref{eqn:smearing-limit1}), implies the following.
$$\liminf_{m \rightarrow \infty} \; \frac{u_m(\dceil{p})}{m} \geq
\bar{\hats}(p)$$ The sample sequence is ``safe'' at $m$ and
$\frac{k}{m}$ if $|\calh_m(\dceil{p},\;\dceil{\frac{k}{m}})|$ does not
exceed twice the expectation of $|\calh_m(\dceil{p},\;\dceil{q^*})|$.
Assuming uniform convergence of $\frac{s_m(\dceil{p})}{m}$, the
probability of not being safe at $m$ and $\frac{k}{m}$ declines
exponentially in $m$ at a rate at least as fast as the rate of decline
of the probability of not being safe at $m$ and $\dceil{q^*}$. By the
union bound this implies that for a given $m$ the probability that
there exists an unsafe $\frac{k}{m}$ also declines exponentially. We
say that the sequence is safe after $N$ if it is safe for all $m$ and
$\frac{k}{m}$ with $m > N$. The probability of not being safe
after $N$ also declines exponentially with $N$. By an argument
similar to that given above, this implies that with probability 1 over
the choice of the sequence there exists an $N$ such that the sequence
is safe after $N$. But if we are safe at $m$ then $|U_m(\dceil{p})|
\leq 2m\mtt{E}|\calh_m(\dceil{p},\;\dceil{q^*})|$. This implies the following.
$$\limsup_{m \rightarrow \infty} \; \frac{u_m(\dceil{p})}{m} \leq \bar{\hats}(p)$$
Putting the two bounds together we get the following.
$$\lim_{m \rightarrow \infty} \; \frac{u_m(\dceil{p})}{m} = \bar{\hats}(p)$$
The above argument establishes (to some level of rigor) pointwise
convergence of $\frac{u_m(\dceil{p})}{m}$ to $\bar{\hats}(p)$. It is
also possible to establish a convergence rate that is a continuous
function of $p$. This implies that the convergence of
$\frac{u_m(\dceil{p})}{m}$ can be made locally uniform.
Theorem~\ref{thm:upper-agreement} then implies the desired result.
\end{proof_outline}
\section{Improvements}
Theorem~\ref{thm:main1} has been improved
in various ways \cite{thesis}:
\begin{itemize}
\item Removing the discretization of true errors.
\item Using one-sided bounds.
\item Using nonuniform union bounds over discrete values of the form $\frac{k}{m}$.
\item Tightening the Chernoff bound using direct calculation of
Binomial coefficients.
\item Improving Lemma \ref{lem:helper1}.
\end{itemize}
These improvements allow the removal of all but one of the $\ln(m)$ terms from
the statement of the bound. However, they do not improve the asymptotic equations given by
theorem~\ref{thm:upper-limit} and conjecture~\ref{thm:smearing}.
A practical difficulty with the bound in theorem~\ref{thm:main1} is
that it is usually impossible to enumerate the elements of an
exponentially large hypothesis class and hence impractical to compute
the histogram of training errors for the hypotheses in the class. In
practice the values of $\hats(\frac{k}{m})$ might be estimated using some
form of Markov chain Monte Carlo sampling over the hypotheses. For
certain hypothesis spaces it might also be possible to directly
calculate the empirical error distribution without evaluating every
hypothesis. For example, this can be done with ``partition rules''
which, given a fixed partition of the input space, make predictions
which are constant on each partition. If there are $n$ elements in
the partition then there are $2^n$ partition rules. For a fixed
partition, the histogram of empirical errors for the $2^n$ partition
rules can be computed in polynomial time. Note that the class of
decision trees is a union of partition rules where the structure of a
tree defines a partition and the labels at the leaves of the tree
define a particular partition rule relative to that partition. Taking
advantage of this, it is surprisingly easy to compute a shell bound
for small decision trees \cite{thesis}.
\section{Discussion \& Conclusion}
\ignore{
Given that a failure has occurred, what can be said for a typical
sample complexity result? With an application of the conditional
limit theorem \cite{TnC} we can say that with high probability ``bad''
events typically are not ``very bad'' i.e. the true error is not far
from the true error upper bound. This is no longer the case when using
a shell bound - ``bad'' events can be ``very bad'' (an error
arbitrarily large) with high probability when the bound is used.
There are two caveats to this statement. First, whenever ``very bad''
errors can occur the corresponding standard sample complexity bound
has a ``very large'' true error bound. The other caveat is that the
``very bad'' errors can only occur when a phase transition in the
shell bound occurs at a larger $m$.}
Traditional PAC bounds are stated in terms of the training error and
class size or VC dimension. The computable bound given here is
sometimes much tighter because it exploits the additional information
in the histogram of training errors. The uncomputable bound uses the
additional (unavailable) information in the distribution of true
errors. Any distribution of true errors can be realized in a case
with independent hypotheses. We have shown that in such cases this
uncomputable bound is asymptotically equal to actual generalization
error. Hence this is the tightest possible bound, up to asymptotic
equality, over all bounds expressed as functions of $\hate(h^*)$ and
the distribution of true errors. We have also shown that the use of
the histogram of empirical errors results in a bound that, while still
tighter than traditional bounds, is looser than the uncomputable bound
even in the large sample asymptotic limit.
One of the goals of learning theory is to give generalization
guarantees that are predictive of actual generalization error. It is
well known that the actual generalization error can exhibit phase
transitions --- as the sample size increases the expected
generalization error can jump essentially discontinuously in sample
size. So accurate true error bounds should also exhibit phase
transitions. Shell bounds exhibit these phase transitions while
other bounds such as VC dimension results do not.
The phase transitions can also be interpreted as a statement about the
bound as a function of the confidence parameter $\delta$. As the
value of $\delta$ is varied the bound may shift essentially
discontinuously. To put this another way, let $h^*$ be the hypothesis
of minimal training error on a large sample. Near a phase transition
in true generalization error (as opposed to a phase transition in the
bound) we may have that with probability $1-\delta$ the true error of
$h^*$ is near its training error but with probability $\delta/2$, say,
the true error of $h^*$ can be far from its training error. More
traditional bounds do not exhibit this kind of sensitivity to
$\delta$. Bounds that exhibit phase transitions seem to bring the
theoretical analysis of generalization closer to the actual
phenomenon.
{\bf Acknowledgments:}
Yoav Freund, Avrim Blum, and Tobias Scheffer all provided useful
discussion in forming this paper.
\begin{thebibliography}{1}
\bibitem[Bartlett et al(2002)]{LRad}P. Bartlett, O. Bousquet and S. Mendelson. Localized Rademacher complexities. \emph{Proceedings of the 15th Annual Conference on Computational Learning Theory}, 44--58, 2002.
\bibitem[Chernoff(1952)]{Ch52}H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. \emph{Annals of Mathematical Statistics}, 23:493--507, 1952.
\bibitem[Cover & Thomas(1991)]{TnC}T. M. Cover and J. A. Thomas. \emph{Elements of Information Theory}, Wiley, 1991.
\bibitem[Freund(1998)]{Freund} Y. Freund, Self bounding algorithms, \emph{Computational Learning Theory (COLT)}, 1998.
\bibitem[Haussler et al(1996)]{HKST96}D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from statistical mechanics. \emph{Machine Learning}, 25:195--236, 1996.
\bibitem[Langford(2002)]{thesis}J. Langford. Quantitatively tight sample complexity bounds. PhD thesis, Carnegie Mellon University, 2002.
\bibitem[Langford(2003)]{JCL_tutorial}J. Langford. Practical prediction theory for classification. ICML 2003 tutorial, available at http://hunch.net/~jl/projects/prediction\_bounds/tutorial/tutorial.ps.
\bibitem[Langford and Blum(1999)]{Microchoice}J. Langford and A. Blum, Microchoice and self-bounding algorithms, \emph{Computational Learning Theory (COLT)}, 1999.
\bibitem[Mansour and McAllester(2000)]{Mc1}Y. Mansour and D. McAllester, Generalization bounds for decision trees, \emph{Computational Learning Theory (COLT)}, 2000.
\bibitem[Moore]{AWM_tutorial}A. Moore, VC dimension for characterizing classifiers, Tutorial at http://www-2.cs.cmu.edu/~awm/tutorials/vcdim08.pdf
\bibitem[McAllester(1999)]{PACBayes2}D. McAllester, Pac-Bayesian model averaging, \emph{Computational Learning Theory (COLT)}, 1999.
\bibitem[McAllester and Schapire(2000)]{Mc2}D. McAllester and R. Schapire, On the convergence rate of good-turing estimators, \emph{Computational Learning Theory (COLT)}, 2000.
\bibitem[Scheffer and Joachims(1999)]{Tobias}T. Scheffer and T. Joachims, Expected error analysis for model selection, \emph{International Conference on Machine Learning (ICML)}, 1999.
\bibitem[van de Geer(1999)]{Peel}S. van de Geer, \emph{Empirical Process in M-Estimation}, Cambridge University Press, 1999.
\end{thebibliography}
\end{document}