\documentclass{article}
\usepackage[accepted]{icml2005}
\usepackage{epsf}
\usepackage{mlapa}
\usepackage{url,makeidx,latexsym,amssymb,amsmath,stmaryrd,comment,graphicx}
\icmltitlerunning{A Comparison of Tight Generalization Error Bounds}
\begin{document}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\E}{\mathbb{E}}
\renewcommand{\P}{\mathbb{P}}
\newcommand{\dash}{\,---\,}
\newcommand{\generr}[1]{\ensuremath{\epsilon(#1)}}
\newcommand{\emperr}[1]{\ensuremath{\hat{\epsilon}(#1)}}
\newcommand{\fvote}{\ensuremath{c_{\text{vote}}}}
\newcommand{\ffinal}{\ensuremath{c_{\text{final}}}}
\newcommand{\bintail}[1]{\ensuremath{\overline{\mathrm{Bin}}\left(#1\right)}}
\newcommand{\Prob}{\ensuremath{\mathbb{P}}}
\newtheorem{definition}{Definition}
\newtheorem{lemma}{Lemma}
\newtheorem{theorem}{Theorem}
\newenvironment{proof}{\emph{Proof.}}{\hfill $\Box$ \vspace{2ex}
\vspace{-.3cm}
}
\renewcommand{\topfraction}{.9}
\renewcommand{\textfraction}{.1}
\renewcommand{\floatpagefraction}{.75}
\renewcommand{\baselinestretch}{0.7}
\twocolumn[
\icmltitle{A Comparison of Tight Generalization Error Bounds}
\icmlauthor{Matti K{\"a}{\"a}ri{\"a}inen}{matti.kaariainen@cs.helsinki.fi}
\icmladdress{International Computer Science Institute, 1947 Center St Suite 600, Berkeley, CA 94704, USA}
\icmlauthor{John Langford}{jl@tti-c.org}
\icmladdress{TTI-Chicago, 1427 E 60th Street, Chicago, IL 60637, USA}
\vskip 0.3in
]
\begin{abstract}
We investigate the empirical applicability of several bounds (a number
of which are new) on the true error rate of learned classifiers. The
bounds hold whenever the examples are chosen independently at random
from a fixed distribution.
The collection of tricks we use includes:
\begin{enumerate}
\item A technique using unlabeled data for a tight derandomization of
randomized bounds.
\item A tight form of the progressive validation bound.
\item The exact form of the test set bound.
\end{enumerate}
The bounds are implemented in the \texttt{semibound} package and are
freely available.
\end{abstract}
%
\section{Introduction}
%
When we learn a classifier $c$, it is very natural to wonder ``How often
will $c$ be wrong in the future?'' This question cannot be answered in
general, because any source of future examples might know $c$ and either
choose $c(x)=y$ or not. However, this question can be answered when
assumptions are made about the data.

A basic principle of system design is that if a system has few components
then it can break in fewer ways. For this problem, components are
assumptions which may or may not be true on any individual problem.
Learning theory shows us that using \emph{only} the assumption that data
is drawn IID, we can construct confidence intervals on the error rate of
learned classifiers. Thus, we make only the IID assumption.

There are several common techniques for attempting to predict the future
error rate of a classifier. For example, sometimes people compute the
standard deviation of the error rate on a test set. This is statistically
questionable because 0/1 errors are inherently not drawn from a normal
distribution. This leads to odd or misleading results like reporting
``accuracy $0.9 \pm 0.3$'' and ``$1.00 \pm 0.0$'' (the latter is especially
problematic because it is a form of overconfidence). Another approach is
to use $k$-fold cross-validation on a dataset and then use similar methods
to transform the cross-validation estimate into an estimate or heuristic
confidence interval of the error of the final classifier learned from all
examples. This, again, is statistically questionable because the runs of
cross validation are not independent, the error rates are not Gaussian,
and nothing in general guarantees that the error rate of the final classifier
learned from all data is close to the error rates of the classifiers learned
in cross-validation.

In this paper, we test several different styles of error bounds
that lead to confidence intervals which hold based only on the IID
assumption. The baseline version of this approach is to simply use a
test set and compute a Binomial confidence interval. This technique is
sometimes unsatisfactory because it ``wastes examples'' which might be
critical to successful learning, so we test other approaches, including:
\begin{enumerate}
\item Transforming cross-validation and bagging based error estimates into
error bounds for randomized classifiers.
\item Using unlabeled data to get tight training set
bounds~\cite{kaariainen05} based on derandomization of bounds for
randomized classifiers, including, e.g., the bounds in item 1 above and the
PAC-Bayesian margin bound.
\item Using progressive validation~\cite{blum99} over a test set to improve
performance so that a portion of the test set can be used for training
and validation.
\item Combinations of the above.
\end{enumerate}
%
All of the bounds are stated with respect to a classification problem
defined by a probability measure $D$ over a space $X \times Y$, where
$X$ is some arbitrary feature space and $Y$ a set of labels. The IID
assumption we make means that we assume that all examples are drawn
independently at random from $D$ (for labeled examples) or its marginal
$D_X$ (for unlabeled examples). With this notation, the error rate
$e(c,D)$ of a classifier $c$ is $e(c,D) = \Pr_{(x,y)\sim D} (c(x) \neq y)$.

The organization of the paper is straightforward. We describe the
techniques that we apply, the settings that they apply in,
and then test them with a wide array of experiments.
%
\section{Test Set Bound}
\label{sec:test}
%
When a classifier $c$ is trained on some training set and then tested on an
independent test set $S$ sampled from $D$, the distribution of the test set
error
\[ \hat{e}(c,S) = \frac{1}{|S|} \sum_{(x,y)\in S} I(c(x) \neq y) \]
is Binomial with parameters $e(c,D)$ and $|S|$. The following definition
of the inverse binomial tail lets us compute an exact $1-\delta$
confidence interval for $e(c,D)$.
%
\begin{definition}\cite{langford03}
The inverse binomial tail $\bintail{\hat{p},m,\delta}$ is the $q$ for which
\[ \sum_{i=0}^{\lceil \hat{p}m \rceil} \binom{m}{i} q^i (1-q)^{m-i} = \delta. \]
\end{definition}
%
The interpretation of the definition is that getting an error rate of
$\hat{p}$ or smaller from a binomial distribution with bias larger than
$\bintail{\hat{p},m,\delta}$ has probability less than $\delta$.
Therefore, the true error rate $p$ has to be below
$\bintail{\hat{p},m,\delta}$ with confidence $1-\delta$. Besides stating
this formally, the next lemma is a robust baseline and a frequently used
building block in the bounds that follow.
%
\begin{lemma}(Test Set Bound) \label{lem:test} For all $D$ and $c$,
\[
\Pr_{S\sim D^m}\left(\bintail{\hat{e}(c,S),m,\delta} \geq e(c,D)\right)
\geq 1-\delta .\]
\end{lemma}
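Numerically, the inverse binomial tail can be found by bisection, since the
binomial tail probability is monotonically decreasing in the bias $q$. The
following is a minimal sketch (the function names are ours, not those of any
particular package):

```python
from math import comb

def binom_cdf(k, m, q):
    """P(X <= k) for X ~ Binomial(m, q); decreasing in the bias q."""
    return sum(comb(m, i) * q**i * (1 - q)**(m - i) for i in range(k + 1))

def inv_binom_tail(p_hat, m, delta):
    """Bisect for the bias q at which observing an error rate of p_hat
    or smaller has probability exactly delta."""
    k = int(round(p_hat * m))  # p_hat * m is an integer error count up to rounding
    lo, hi = 0.0, 1.0
    for _ in range(60):  # 60 halvings give precision far below 1e-15
        mid = (lo + hi) / 2
        if binom_cdf(k, m, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return lo
```

For instance, with zero observed errors the defining equation reduces to
$(1-q)^m = \delta$, whose closed-form solution $q = 1 - \delta^{1/m}$ the
bisection recovers.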
%
\section{Semi-Supervised Bounds}
%
There is some tension between what we can prove and what we want to
prove. For example, the (empirically) tightest learning theory bounds
for many applications tend to be for randomized classifiers (see
\cite{langford02} for one example). However, a randomized classifier is
inconvenient and counterintuitive for many practical purposes.
The safe derandomization technique we discuss here relieves this
tension using a spare unlabeled dataset $U$ drawn from $D_X^m$ with
$m$ examples. This makes the bounds semi-supervised.

Randomized classifiers are metaclassifiers defined by a distribution on
a set of deterministic classifiers. A randomized classifier is used by
drawing an independent sample from this distribution of classifiers each
time a classification is given. Standard deterministic classifiers are
the special case in which the distribution concentrates on a single
classifier. Analogous to deterministic classifiers, the generalization
error of a randomized classifier $f$ is its probability of making an error,
that is $e(f,D) = \Pr_{(x,y)\sim D, c \sim f} (c(x) \not= y) =
E_{c \sim f} e(c,D)$.

The basic idea behind safe derandomization is to use the
disagreement probability
$d(f,g) = d(f,g,D) = \Pr_{(x,y)\sim D}(f(x) \not= g(x))$
as a metric in the space of randomized classifiers. This quantity can be
estimated in a semi-supervised setting by
\[ \hat{d}(f,g,U) = \frac{1}{|U|}\sum_{x \in U} I(f(x) \not= g(x)) \]
which can be computed by simply classifying the examples in $U$ by $f$ and
$g$ and counting the number of times they disagree.
Note that $\hat{d}(f,g,U)$ depends both on the randomness in $f$ and $g$ and
the randomness introduced by the choice of $U$ from $D_X^m$.
If neither $f$ nor $g$ depend on $U$, the distribution of $\hat{d}(f,g,U)$
is binomial with parameters $m$ and $d(f,g)$ since $\hat{d}(f,g,U)$ is
simply the proportion of times $f$ and $g$ disagree on the unlabeled
sample. Thus, the same confidence intervals used for test sets in
Section~\ref{sec:test} can be reused here.

All of the semi-supervised bounds (there are several) use the next
theorem as a starting point. Here, $f$ is typically a randomized
classifier for which we have a generalization bound,
while $\ffinal$ is the final deterministic classifier learned based on
all the labeled data.
%
\begin{theorem}(Semi-Supervised Derandomization) \cite{kaariainen05}
\label{thm:base} Let $f$ be a randomized classifier and $\ffinal$ the
final deterministic classifier, both learned based on $S \sim D^n$.
Suppose we have a generalization error bound
$\alpha = \alpha(S, \delta)$ for $f$ satisfying
$\Pr_{S \sim D^n}(e(f,D) \leq \alpha(S,\delta/2)) \geq 1-\delta/2$.
Then with probability at least $1-\delta$ over the draw of
$S \sim D^n$, $U \sim D_X^m$, and the randomness in $f$, we have
\[ e(\ffinal,D) \leq \alpha(S, \delta/2) + \bintail{\hat{d}(f,\ffinal,U),m,\delta/2} . \]
\end{theorem}
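To illustrate how the theorem is applied, the following sketch combines a
given $\alpha$ with the disagreement term. This is our own illustration
(helper names included), not the \texttt{semibound} implementation; the
inverse binomial tail is computed by bisection:

```python
from math import comb

def inv_binom_tail(errors, m, delta):
    """Largest bias consistent with `errors` mistakes out of m at
    confidence delta (inverse binomial tail, found by bisection)."""
    def cdf(q):
        return sum(comb(m, i) * q**i * (1 - q)**(m - i)
                   for i in range(errors + 1))
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if cdf(mid) >= delta else (lo, mid)
    return lo

def derandomized_bound(alpha, f, c_final, U, delta):
    """Bound on e(c_final, D): alpha(S, delta/2) plus the inverse
    binomial tail of the observed disagreement of f and c_final on
    the unlabeled sample U."""
    disagreements = sum(f(x) != c_final(x) for x in U)
    return alpha + inv_binom_tail(disagreements, len(U), delta / 2)
```

Here `f` should be called with fresh internal randomness on each example,
matching how a randomized classifier is used.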
%
Next we present a resampling-related bound using the above theorem.
The resampling methods are based on the following technique for
obtaining $f$ and $\alpha$:
\begin{enumerate}
\item
Generate $k$ subsamples (multisets) of the labeled sample $S$.
Denote these by $S_1,\ldots,S_k$ and their complements in $S$ by
$S'_1,\ldots,S'_k$. The only requirement is that the choice of which
examples to include in the subsamples has to be independent of the
contents of $S$.
\item
For each $S_i$, apply the learning algorithm to get a classifier $c_i$.
\item
Test each $c_i$ on $S'_i$. By Lemma~\ref{lem:test}, with probability
$1 - \delta/(2k)$ we have
\[e(c_i,D) \leq \bintail{\hat{e}(c_i,S'_i),|S'_i|,\delta/(2k)}.\]
\item
Let the randomized classifier $f$ be the classifier obtained by
uniform randomization over the set $\{c_i \mid 1 \leq i \leq k \}$.
The union bound
($\Pr( \cup A_i ) \leq \sum \Pr(A_i)$) tells us that all the
approximations in step 3 hold simultaneously with probability at least
$1-\delta/2$, which implies
\[
\begin{split}
e(f,D) = E_{c \sim f} e(c,D) = \frac{1}{k} \sum_{i=1}^k e(c_i,D) \\
\leq
\frac{1}{k} \sum_{i=1}^k \bintail{\hat{e}(c_i,S'_i), |S'_i|,\delta/(2k)}
\end{split}
\]
with the same probability. This gives us
$\alpha =
\frac{1}{k} \sum_{i=1}^k \bintail{\hat{e}(c_i,S'_i), |S'_i|,\delta/(2k)}$.
\end{enumerate}
After this, the learning algorithm is applied once more to the whole labeled
sample $S$ to get the final deterministic classifier $\ffinal$. Then the
$\alpha$ obtained from step 4 above and the final hypothesis $\ffinal$
are plugged into Theorem~\ref{thm:base}, which gives the following
bound for $\ffinal$.
%
\begin{theorem}
\label{thm:resampling} For all $D$, $f$, and $\ffinal$ as above, with
probability at least $1-\delta$ over the choice of $S\sim D^n$,
$U \sim D_X^m$, and the randomness in $f$, we have
\[
\begin{split}
e(\ffinal,D) \leq
\frac{1}{k} \sum_{i=1}^k \bintail{\hat{e}(c_i,S'_i), |S'_i|,\delta/(2k)} \\
+ \bintail{\hat{d}(f,\ffinal,U),m,\delta/2} .
\end{split}
\]
\end{theorem}
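Given the per-fold error counts, the first term of the bound is cheap to
evaluate. A minimal sketch (function names are ours), with the inverse
binomial tail computed by bisection:

```python
from math import comb

def inv_binom_tail(errors, m, delta):
    """Inverse binomial tail, found by bisection on the binomial CDF."""
    def cdf(q):
        return sum(comb(m, i) * q**i * (1 - q)**(m - i)
                   for i in range(errors + 1))
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if cdf(mid) >= delta else (lo, mid)
    return lo

def resampling_alpha(fold_errors, fold_sizes, delta):
    """alpha = (1/k) * sum_i BinTail(e_i, |S'_i|, delta/(2k)): the bound
    on e(f, D) for uniform randomization over the k subsample classifiers."""
    k = len(fold_errors)
    return sum(inv_binom_tail(e, m, delta / (2 * k))
               for e, m in zip(fold_errors, fold_sizes)) / k
```

Adding the disagreement term $\bintail{\hat{d}(f,\ffinal,U),m,\delta/2}$ to
this $\alpha$ yields the bound of the theorem.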
%
The only variation in the resampling based bounds is the way the samples
$S_i$ are chosen in step 1. We next list a few possibilities.
%
\paragraph{Cross-Validation}
%
In $k$-fold cross-validation, the labeled data is split into $k$ equisized
folds $S'_i$ (for notational simplicity, we assume $k$ divides $n$). The
subsample generation process is then defined by $S_i = \cup_{j \not= i} S'_j$.
This plugged into Theorem~\ref{thm:resampling} gives the cross-validation
bound. The special case $k=n$ is the leave-one-out bound, which is of no
use as $\bintail{0,1,\delta/(2n)} = 1 - \frac{\delta}{2n}$.
%
\paragraph{Bagging}
%
Bagging is a bootstrapping method introduced by Leo
Breiman~\cite{breiman96}. In it each of the $k$ subsamples $S_i$ is
generated by choosing randomly with replacement $n$ examples from $S$.
For each $i$, about a $1 - 1/e \simeq .63$ fraction of the examples
are represented in $S_i$, while a .37 fraction remain for testing
purposes in $S'_i$. Applying Theorem~\ref{thm:resampling} to this
resampling scheme gives the bagging bound.
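The $.63/.37$ split is easy to check empirically: when $n$ indices are drawn
with replacement, each example is left out of $S_i$ with probability
$(1-1/n)^n \approx 1/e$. A small sketch (the function name is ours):

```python
import random

def bootstrap_split(n, rng):
    """One bagging round: a bootstrap multiset S_i of size n and the
    out-of-bag set S'_i of examples never drawn."""
    drawn = [rng.randrange(n) for _ in range(n)]
    in_bag = set(drawn)
    out_of_bag = [i for i in range(n) if i not in in_bag]
    return drawn, out_of_bag

rng = random.Random(0)
n = 100_000
bag, oob = bootstrap_split(n, rng)
frac = len(oob) / n  # close to 1/e for large n
```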
Bagging was originally introduced as an aggregation method that
enhances classification accuracy by replacing $\ffinal$ with the
\emph{voting classifier} $\fvote(x) = \arg \max_{y \in Y} \Pr_{c \sim f}(c(x) = y)$
(ties are broken arbitrarily). The bagging bound holds also
with the substitution $\ffinal \rightarrow \fvote$, thus giving a
generalization error bound for the aggregated hypothesis $\fvote$.
Breiman has introduced out-of-bag \emph{estimates} (different
from the estimate behind $\alpha$ in step 4 above) of the generalization
performance of $\fvote$~\cite{breiman96b}, but the generalization error bounds
for $\ffinal$ and $\fvote$ given by Theorem~\ref{thm:resampling} are new.
%
\paragraph{Semi-Supervised Test Set Bound}
%
The special case $k=1$ corresponds to transforming a test set bound for a
(deterministic) classifier learned based on some fraction $S_1$ of the data
into a semi-supervised training set bound for the classifier $\ffinal$.
This bound with the choice $|S'_1|=n/k$ should be compared with the
semi-supervised cross-validation bound above. Except for the
difference in the confidence parameter (which should have only minor
relevance for large $n/k$), the first term of the above bound and the
sum in the cross-validation bound have the same expectation.
%
\paragraph{Other Variants}
%
Besides the above, one can think of bagging with different bootstrap
sample sizes, bagging where a fraction $c < 1$ of the data is sampled into
each $S_i$ without replacement, weighting the classifiers
non-uniformly, and so on. For each of these subsampling strategies,
Theorem~\ref{thm:resampling} gives a generalization error bound.
%
\section{Online Bounds}
%
Online bounds for generalization error work in a setting where the
learner is given the labeled examples one by one. After seeing $i$
labeled examples, $0 \leq i < n$, the learner outputs a classifier
$c_i$ that is used in predicting the label of the example
number $i+1$. It is then given the correct label, and the process of
prediction and correction repeats. It is also possible to start the
online process after first seeing a batch of training examples that
is used in learning $c_0$ and the subsequent classifiers.
This way, the learner produces a sequence of classifiers
$c_0,\ldots,c_{n-1}$ whose performance is often measured by the
cumulative error
\[ \hat{e}_{A,S} =
\sum_{i=1}^n I(c_{i-1}(x_i) \not= y_i) \] where $c_{i-1} =
A((x_1,y_1),\ldots,(x_{i-1},y_{i-1}))$ is the classifier learned on the first $i-1$
samples. The cumulative error is then used in providing a
generalization error bound for the randomized classifier obtained by
uniform randomization over the classifiers $\{ c_i \mid 0 \leq i < n
\}$. The generalization error is $\bar{e}_{A,S,D}$ (the notation
tracks the dependencies).
This approach has been analyzed for classification
error~\cite{blum99,cesabianchi01} yielding a bound that was recently
tightened~\cite{cesa-bianchi04} (for some values, the constants are worse).
A simple application of Theorem~\ref{thm:base} transforms these into
semi-supervised bounds for $\ffinal = c_n$. However, neither of these
results is maximally tight. A tighter bound can
be constructed by thinking of the learning algorithm and the learning
distribution as an adversary which can pick any probability of error
given any history of past error/not error events in a round by round
fashion. The goal of the adversary is to maximize the probability of
achieving some deviation between the expected and the observed number of
errors, where deviation simply means the first minus the second.
This optimal strategy is defined recursively.
Let $g_n(x)$ be an upper bound for the maximum probability that an adversarial
algorithm/distribution pair can achieve a deviation of size at least $x$ in
$n$ rounds. The recurrence obeys the following property:
\[
g_{n}(x)=\max_{p}\left[p\,g_{n-1}(x+1-p)+(1-p)\,g_{n-1}(x-p)\right]
\]
with the base case $g_{0}(x)=I(x\leq0)$.
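For small $n$ the recurrence can be evaluated directly. The maximization over
$p$ is continuous, so the sketch below grid-searches it; this is an
approximation, though a fine grid recovers, e.g., the exact one-round solution
$g_1(x) = 1 - x$ on $[0,1]$, attained at $p = x$:

```python
def g(n, x, grid=101):
    """Upper bound on the probability that an adversarial algorithm/
    distribution pair achieves a deviation (expected minus observed
    errors) of at least x in n rounds. The maximization over the
    per-round error probability p is approximated on a finite grid."""
    if n == 0:
        return 1.0 if x <= 0 else 0.0
    return max(p * g(n - 1, x + 1 - p, grid)
               + (1 - p) * g(n - 1, x - p, grid)
               for p in (j / (grid - 1) for j in range(grid)))
```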
\begin{figure}
\begin{center}
\includegraphics[angle=270,scale=0.5]{dev_success.ps}
\includegraphics[angle=270,scale=0.5]{max_dev_choice.ps}
\end{center}
\caption{Top: maximum probability of a deviation $x$ given 1, 2, 3, or
4 rounds of progressive validation.
Bottom: optimal choice of the adversary achieving these deviations.}
\label{fig:first}
\end{figure}
Calculating this recurrence, we get the plots in Figure~1 for values of
$g$ at each deviation $x$. The optimal choice of $p$ can be difficult to
calculate analytically as shown by the plot on the bottom. In particular,
note the nonmonotonic behavior.

This can still be improved by noting that the above recurrence does
not take into account the number of observed errors, as the test set
bound (Lemma~\ref{lem:test}) suggests it should. We can state another
recurrence based on both $n$ and the number of errors $\hat{e}_{A,S}$.
\[ g_{n,e}(x)=\max_{p}\left[p\,g_{n-1,e-1}(x+1-p)+(1-p)\,g_{n-1,e}(x-p)\right]\]
The base case is $g_{0,e}(x)=I(x\leq0)$.
Solving this recurrence results in a bound which is somewhat worse
than the Binomial confidence interval, but significantly better than
other approaches as shown in Figure~2.
\begin{figure}
\begin{center}
\includegraphics[angle=270, width=0.45\textwidth]{more_bounds.ps}
\end{center}
\caption{A comparison of bounds on the deviation between the observed
and the true error rates with $\delta = 0.1$ as the number of
examples increases. ``Square Root Chernoff'' is the Chernoff
approximation. ``Binomial Tail'' uses Lemma~\ref{lem:test} with either
0 or 1/2 errors. The line ``any error progressive'' is the result of
the first recursion, which does not take the number of errors into
account. All other lines use the second recursion, which takes the
errors into account.}
\label{fig:two}
\end{figure}
Many learning algorithms are designed for the online setting, and this
approach is the most natural in combination with them. However,
every batch mode learning algorithm can be used to define an online
learning algorithm by the following simple strategy: Feed the batch
algorithm the labeled examples seen so far, and use the classifier
output by the algorithm to classify the next example before its label
is revealed. Then, add the newly seen labeled example to the set of
labeled examples, and continue till all $n$ examples have been
processed. A potential problem is that unless the batch learning
algorithm is incremental in the sense it can efficiently update its
hypothesis when a new example is seen, the learning algorithm has to
be run $n$ times. For large $n$ this may take ages. To reduce
the time complexity, we can be lazy and update the hypothesis only
once every $0.1 n$ examples, say.

This strategy can be taken into account to tighten the bound
calculation. Let $k$ be the number in the batch for each update and
$m = n/k$ be the number of batches. The new recurrence equations are:
\[
g_{m,k,e}(x)= \max_p
\sum_{i\leq\min(e,k)}{k \choose i}(1-p)^{k-i}p^{i}\,g_{m-1,k,e-i}(x+i(1-p)-p(k-i))
\]
with base case $g_{0,k,e}(x)=I(x\leq0)$.
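The batched recurrence can be transcribed directly, again grid-searching the
continuous maximization over $p$ (a sketch only; function and parameter
names are ours):

```python
from math import comb

def g(m, k, e, x, grid=51):
    """Probability upper bound for an adversary achieving a deviation
    (expected minus observed errors) of at least x over m batches of
    k examples while making at most e errors; the maximization over
    the per-batch error probability p is grid-searched."""
    if m == 0:
        return 1.0 if x <= 0 else 0.0
    best = 0.0
    for j in range(grid):
        p = j / (grid - 1)
        val = sum(comb(k, i) * (1 - p)**(k - i) * p**i
                  * g(m - 1, k, e - i, x + i*(1 - p) - p*(k - i), grid)
                  for i in range(min(e, k) + 1))
        best = max(best, val)
    return best
```

With $k=1$ and a large error budget $e$, this reduces to the first
recurrence above.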
Computing the results of these recurrences is computationally
intensive. We use an anytime approach which discretizes $x$ at
successive scales and carefully maintains the upper bound property.
When this proves too intensive (as it does for very large datasets),
we fall back on the looser analytic expressions above.
%
\begin{theorem} (Tight Progressive Validation) For all $D$, $n=mk$, and all
learning algorithms, with probability at least $1-\delta$ over the draw of $S
\sim D^n$, we have:
\[ \bar{e}_{A,S,D} \leq \hat{e}_{A,S} + \max \{x : g_{m,k,\hat{e}_{A,S}}(x) \geq \delta\}\]
\end{theorem}
\begin{proof}
We first use induction to prove that for all $A,x,k,m,d$:
\[ \Pr_{S\sim D^n} \left( \bar{e}_{A,S,D} \geq \hat{e}_{A,S} + x \mbox{ and } \hat{e}_{A,S} \leq d \right) \leq g_{m,k,d}(x)\]
The proof is for any fixed $A,x,k,d$ and proceeds by induction over $m$. The
base case is $m=0$, in which no deviation above $0$ can be achieved and $A$
can do nothing.
For the inductive step, we have
\[
\begin{split}
\Pr_{S'\sim D^{n+k}} \left( \bar{e}_{A,S',D} \geq \hat{e}_{A,S'} + x \mbox{ and } \hat{e}_{A,S'} \leq d \right) \\
= \sum_{i\leq\min(d,k)}{k \choose i}(1-p)^{k-i}p^{i}\,
\Pr_{S\sim D^n} \big( \bar{e}_{A,S,D} \geq \hat{e}_{A,S} + x + i(1-p)-p(k-i) \\
\mbox{ and } \hat{e}_{A,S} \leq d-i \big)
\end{split}
\]
where $p$
is the probability of failure of the first classifier chosen by $A$
(which depends on $0$ examples). Using the inductive assumption, we have for
all $A$:
\[
\begin{split}
\leq \sum_{i\leq\min(d,k)}{k \choose i}(1-p)^{k-i}p^{i}\, g_{m,k,d-i}(x+i(1-p)-p(k-i)) \\
\leq g_{m+1,k,d}(x) .
\end{split}
\]
Now, note that we can set $g_{m,k,d}(x)=\delta$ and solve for $x$ to get:
\[
\begin{split}
\Pr_{S\sim D^n} \big( \bar{e}_{A,S,D} \geq \hat{e}_{A,S} + \max \{x : g_{m,k,d}(x) \geq \delta\} \\
\mbox{ and } \hat{e}_{A,S} \leq d \big) \leq \delta .
\end{split}
\]
To complete the proof, note that any run with $\hat{e}_{A,S} \in
\{0,\ldots,d-1\}$ that achieves the deviation $\max \{x :
g_{m,k,d}(x) \geq \delta\}$ is included in the event above.
Similarly, for any $\hat{e}_{A,S} \in \{d+1,\ldots,n\}$ the deviation is
smaller than $\max \{x : g_{m,k,d}(x) \geq \delta\}$.
\end{proof}
%
\section{PAC-Bayesian Bounds}
%
The PAC-Bayesian theorems~\cite{mcallester03a} are a relatively recent
method of providing generalization error bounds for randomized
classifiers. The idea is to control the generalization performance by
the KL-divergence between a ``prior'' and a ``posterior'' on a set of
classifiers. The ``prior'' may encode beliefs about which classifiers
perform well, while the ``posterior'' describes the randomized
classifier we learn.
%
\subsection{A PAC-Bayesian Margin Bound}
%
We work with \emph{unbiased linear classifiers}. This representation
is employed by many important learning algorithms, including, e.g.,
(unbiased) support vector machines and boosting. We assume that
$X = \{ x \in \R^d \mid \|x\| = 1 \}$ for some $d \in \N$ and that $Y
= \{-1,+1\}$. A linear classifier is represented by a \emph{weight
vector} $w \in X$, which defines a classifier $f_w \colon X \to Y$ by
$f_w(x) = \mathrm{sign}(w \cdot x)$.

The ``prior'' used in the PAC-Bayesian margin
bounds~\cite{langford02,langford03,mcallester03b} is a
unit-variance isotropic Gaussian on $\R^d$. The ``posterior''
$Q(w,\mu)$, where $\|w\|=1$ and $\mu > 0$, is the normal distribution
$N(\mu,1)$ in the direction of $w$ and standard normal in every
orthogonal direction. Thus, $Q(w,\mu)$ is the prior shifted by $\mu
w$.
%
\begin{theorem} (PAC-Bayesian margin bound)
\label{thm:randpacbayes} For all $D$, with probability at least $1-\delta/2$ over the choice of the labeled data
and the randomness in $Q(w,\mu)$, we have for all $w \in \R^d$,
$\|w\|=1$, $\mu \in [0,\infty)$ that
\[
\mathop{KL}\left(\hat{e}(Q(w,\mu),S)\,||\,e(Q(w,\mu),D)\right)
\leq \frac{\frac{\mu^{2}}{2}+\ln\frac{n+1}{\delta/2}}{n} .
\]
\end{theorem}
%
To derandomize the above, let us choose $w$ to be the normalized
weight vector learned by the learning algorithm, and let $\alpha$ be
the smallest upper bound for $e(Q(w,\mu),D)$ that can be obtained from
the above by the optimal choice $\mu^*$ for $\mu$. Note that greedily
picking $\mu^*$ for $\mu$ may not be the best choice for the
derandomized bound, but we still get the following.
%
\begin{theorem}
\label{thm:pacbayes}
With probability at least $1-\delta$ over the choice of $S \sim D^n$,
$U \sim D_X^m$, and the randomness in $Q(w,\mu)$, we have
\[ e(f_w,D) \leq \alpha + \bintail{\hat{d}(Q(w,\mu^*),f_w,U),m,\delta/2} .\]
\end{theorem}
%
For details on how the bound of Theorem~\ref{thm:randpacbayes} can be
evaluated and how $\mu^*$ can be found efficiently, the reader is referred
to~\cite{langford03}. The remaining difficulty in evaluating the bound
of Theorem~\ref{thm:pacbayes} is computing $\hat{d}(Q(w,\mu^*),f_w,U)$
which can be done with techniques similar to those presented
in~\cite{langford03}.
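Under the Gaussian posterior, the per-example disagreement probability has a
closed form: for a unit-length $x$, the margin of a weight vector drawn from
$Q(w,\mu)$ is $\mu (w \cdot x)$ plus standard normal noise, so the drawn
classifier disagrees with $f_w$ on $x$ with probability $\Phi(-\mu |w \cdot x|)$,
where $\Phi$ is the standard normal CDF. A sketch of the resulting estimate
(our own illustration, not the \texttt{semibound} implementation):

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def d_hat(w, mu, U):
    """Average over the unlabeled sample U of the probability that a
    classifier drawn from Q(w, mu) disagrees with f_w; w and each x in
    U are assumed to be unit-length tuples."""
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))
    return sum(std_normal_cdf(-mu * abs(dot(w, x))) for x in U) / len(U)
```

Note that this averages the exact per-example disagreement probability
rather than the disagreement of a single draw from $Q(w,\mu)$.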
%
\section{Empirical Results}
%
In this section we summarize results of experiments obtained by
\texttt{semibound} (available at \url{http://hunch.net/semibound}),
a package of scripts that makes computing these bounds easy. The
package currently supports
\texttt{libsvm}, \texttt{C4.5}, \texttt{svmlight}~\cite{joachims99},
and the classification algorithms in \texttt{Weka}, but can easily
be extended to work with other learning algorithms as well. As
datasets we used benchmark datasets from the UCI repository~\cite{UCI}
(for C4.5) and from the libsvm tools page~\cite{libsvmtools}
(for \texttt{libsvm} and \texttt{svmlight}). The problem \texttt{mnist0}
is the ``0 versus rest'' version of \texttt{mnist}, and
\texttt{mnist0-10000} is a 10000 example subsample of it. Unlabeled data
was obtained by forgetting the labels of 10\% of the original labeled
data set. This ``unlabeled'' data was also used in computing empirical
error rates for the classifiers.
\begin{table*}[t]
\vskip -0.1in
\centering
\caption{Results of experiments with bounds for randomized classifiers
on various (algorithm, data set) combinations. Each cell contains an
``error rate''/``bound'' pair.}
\label{table:experiments_rand}
\vskip 0.1in
\begin{small}
\begin{sc}
\input{table_rand}
\end{sc}
\end{small}
\vskip -0.1in
\end{table*}
\begin{table*}[t]
\vskip -0.1in
\centering
\caption{Results of experiments with semi-supervised bounds for $\ffinal$ on
various (algorithm, data set) combinations.}
\label{table:experiments_semi}
\vskip 0.1in
\begin{small}
\begin{sc}
\input{table_semi}
\end{sc}
\end{small}
\vskip -0.1in
\end{table*}
The results of the experiments are summarized in
Tables~1 and~2. In Table~1, we report the observed error rates
(on the ``unlabeled'' data set) and error bounds for the randomized
classifiers, whereas Table~2 summarizes the corresponding semi-supervised
bounds for $\ffinal$. All numbers in the tables are averages over ten runs
of \texttt{semibound}, where the randomization of the split of the data to
labeled and unlabeled parts as well as the randomization inside the bounds
was done anew each time. In all bounds we chose $\delta = 0.01$.

The columns correspond to bounds presented in the text with the prefix
{\sc R-} indicating a randomized\footnote{The exception is {\sc R-Test}, which
is the standard test set bound for a deterministic classifier.} and {\sc S-}
a semi-supervised bound. In the bounds {\sc Test} and {\sc Prog}
we used a 90\%/10\% train/test split and in {\sc CV} and {\sc Bag} we chose
$k=10$. In {\sc Prog}, the hypothesis was updated only 10 times
during the validation phase to keep the computation time under control.
For the same reason, the bound in the table is the minimum\footnote{It
is not, normally, valid to take the minimum of two bounds in this manner
without increasing the value of $\delta$. It is valid here because the
recursion dominates the other bounds.} of the analytic approximation and
the upper bound to the solution of the recurrence obtained in one hour
of computation. The PAC-Bayesian margin bound and algorithm are
applicable to unbiased linear classifiers only, which explains the NAs
in the {\sc P-B} columns. With the exception of {\sc Prog}, evaluating
the bounds is relatively fast when compared to the time spent in the
learning algorithm.

After committing ourselves to the above parameter combinations we
experimented with others to see how stable the bounds are with respect
to the parameters, e.g., the $k$ in the bagging and cross-validation
bounds. The randomized bounds perform best when $k$ is small, because
the slack introduced through the inverse binomial tail increases with
$k$. In the semi-supervised bounds this increase is compensated by
the decrease of $d(f,\ffinal)$ with $k$, so the optimal value of the
bound is attained with $k$ around 5 or 10.

From Table~1 we see that {\sc R-Bag} seems to produce the best bounds
(even better than the baseline test set bound {\sc R-Test}), whereas
the randomized classifier related to the bound is seldom the one with the
best error rate. In bagging, about 37\% of the labeled data remains
for testing purposes, so the test set bound on which the bagging bound
is built is close to the true error. However, the larger test set means
less data for training, which shows up in increased error rates on the
unlabeled data. In summary, {\sc R-Test} and {\sc R-Prog} appear to
provide the best error rates, whereas {\sc R-Bag} provides the best
bounds. Still, it is hard to draw any definitive conclusions about the
relative performance of the methods, with the exception of {\sc R-P-B}
being clearly worst.

The case of {\sc R-Bag} is an example of a more general trade-off: For
good prediction performance, we would like to use all data in
training, while the easiest way to get good bounds is to leave a part
of the data aside for testing. When prediction accuracy is the
most important consideration, one can use unlabeled data to transform the
bounds for randomized classifiers in Table~1 into bounds for $\ffinal$.
The results of this transformation are presented in Table~2, in which
the column {\sc Error} is the error rate of $\ffinal$. The cost of
the transformation is roughly the distance $d(f,\ffinal)$ between $f$
and $\ffinal$, which varies significantly from problem to problem.
The results are impressive as compared to traditional training set
bounds for $\ffinal$, but still not as good as those of the randomized
bounds. Here {\sc S-Test} appears to win, but the relative order of
the other bounds is again not clear.
%
\section{Discussion}
%
What these experiments illustrate is that there is a natural trade-off
between the goal of good prediction and the goal of good confidence about
prediction ability. We can state and use bounds which operate over
the range of this trade-off, but no bound appears to dominate for both
goals simultaneously.

The good news is that some of these nonstandard trade-off points are
entirely viable for common practice. Walking over the trade-off, we
get the following prescription for common practice:
\begin{enumerate}
\item Prediction ability marginally important, confidence very
important: Use a large test set bound or the randomized bagging bound.
\item Prediction ability somewhat important, confidence very important:
Use some other randomized bound, e.g., the test set bound or the
Progressive Validation bound.
\item Prediction ability very important, confidence somewhat
important, unlabeled data available: Use the semi-supervised versions
of these bounds.
\end{enumerate}
If randomness in a classifier presents problems, then the test set bound
(in case confidence is more important than accuracy) or the semi-supervised
bounds (when tightness of the bound is less important than the accuracy of
the final hypothesis) should be preferred.
%
\section*{Acknowledgements}
%
We would like to thank Tom Hayes who helped substantially with early
discussions that turned into the improved progressive validation
bound.
%
\bibliography{catalog}
\bibliographystyle{mlapa}
\end{document}