\documentclass{article}
\usepackage[accepted]{icml2006}
%\usepackage{epsf}
%\usepackage{mlapa,eepic,epic}
\usepackage{mlapa}
\usepackage{amsmath,amssymb,rotating,graphicx,floatflt,float}
%\usepackage{pict2e}
%\usepackage[usenames]{color}
\newcommand{\shrink}[1]{}
\floatstyle{ruled}
\newfloat{algorithm}{h}{loa}
\floatname{algorithm}{Algorithm}
\newcommand{\mycomment}[1]{}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
%% \theoremstyle{plain}
%%\newtheorem{thm}{Theorem}[section]
\numberwithin{equation}{section} %% Comment out for sequentially-numbered
\numberwithin{figure}{section} %% Comment out for sequentially-numbered
%% \theoremstyle{plain}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\N}[1]{{\bf [[N: #1]]}}
\newcommand{\A}[1]{{\bf [[A: #1]]}}
\newcommand{\reals}{{\mathbb R}}
\newcommand{\nats}{{\mathbb N}}
\renewcommand{\Pr}{{\mathbf{Pr}}}
\newcommand{\target}{{h^*}}
\newcommand{\distr}{D}
\newcommand{\prin}{\sim}
\newcommand{\distrxy}{P}
\newcommand{\alg}{\mathcal{A}}
\newcommand{\err}{\mbox{err}}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{corollary}[theorem]{Corollary}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{definition}{Definition}
\newtheorem{note}{Note}
\newcommand{\dimens}{d}
\newcommand{\qedblob}{\mbox{\rule[-1.5pt]{5pt}{10.5pt}}}
\newenvironment{proof}{\noindent{\bf Proof:~~}}{~\qedblob}
\newcommand{\conf}{\delta}
\newcommand{\hclass}{H}
\newcommand{\vcdim}{V}
\newcommand{\hyp}{h}
\newcommand{\ERM}{ERM}
\newcommand{\complex}{H} %%% maybe change this???
%%\newcommand{\err}{err}
\newcommand{\herr}{\widehat{err}}
\newcommand{\rhoh}{\hat{\rho}}
\icmltitlerunning{Agnostic Active Learning}
\begin{document}
\twocolumn[
\icmltitle{Agnostic Active Learning}
\icmlauthor{Maria-Florina Balcan}{ninamf@cs.cmu.edu} \icmladdress{
School of Computer Science, Carnegie Mellon University,
Pittsburgh, PA 15213-3891}
\icmlauthor{Alina Beygelzimer}{beygel@us.ibm.com} \icmladdress{
IBM T. J. Watson Research Center, Hawthorne, NY 10532 }
\icmlauthor{John Langford}{jl@tti-c.org} \icmladdress{ Toyota
Technological Institute at Chicago, Chicago, IL 60637} ]
\begin{abstract}
We state and analyze the first active learning algorithm which
works in the presence of arbitrary forms of noise. The algorithm,
$A^2$ (for \emph{A}gnostic \emph{A}ctive), relies only upon the
assumption that the samples are drawn \emph{i.i.d.} from a fixed
distribution. We show that $A^2$ achieves an exponential
improvement (i.e., requires only $O\left(\ln{
\frac{1}{\epsilon}}\right)$ samples to find an $\epsilon$-optimal
classifier) over the usual sample complexity of supervised
learning, for several settings considered before in the realizable
case. These include learning threshold classifiers and learning
homogeneous linear separators with respect to an input
distribution which is uniform over the unit sphere. \shrink{ The
$A^2$ analysis also achieves an almost contradictory property: for
some sets of classifiers, we can choose an $\epsilon$-optimal
classifier with fewer samples than are needed to estimate the
error rate of the chosen classifier with precision
$\epsilon$.}\end{abstract}
\section{Introduction}
What distinguishes active learning from the more typical batch
learning is that the algorithm initially sees only the unlabeled
portion of a pool of examples drawn from some underlying
distribution. The algorithm can then pay for the label of any
example in the pool, and the hope is that a good classifier can be
learned with significantly fewer labels by actively directing the
queries to informative examples. There is a significant practical
interest in minimizing the number of labeled examples in settings
where there is no shortage of unlabeled data but labels are
expensive.
%and such settings are increasingly common.
%goal is to minimize the number of labels requested while
%maximizing the accuracy of the chosen classifier.
Most active learning strategies
are \emph{noise seeking} on many natural learning problems.
In particular, the process of trying to
find an optimal separation between one class and another often
involves label queries for examples with a large conditional noise
rate. Thus the most informative examples are also the ones that
are typically the most noise-prone.
Consider the active learning algorithm %in Figure~\ref{fig:thresh}
which searches for the optimal threshold on an interval using
binary search. This example is often used to demonstrate the
potential of active learning in the noise-free case when there is
a perfect threshold separating the classes~\cite{CAL}. Binary
search needs $O(\ln{\frac{1}{\epsilon}})$ labeled examples to
learn a threshold with error less than $\epsilon$, while learning
passively requires $O\left(\frac{1}{\epsilon} \right)$ labels. A
fundamental drawback of this algorithm is that a small amount of
adversarial noise can force the algorithm to behave badly.
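The binary-search strategy just described can be sketched as follows (an illustrative implementation, not taken from the paper); note that a single adversarially flipped label misdirects the search, which is exactly the brittleness at issue:

```python
def active_threshold(pool, label):
    """Binary-search active learning of a threshold on a sorted pool.

    pool  : sorted unlabeled points in [0, 1]
    label : noise-free oracle, -1 below the true threshold, +1 at or above
    Returns the leftmost pool point labeled +1, using O(log |pool|)
    label queries instead of the |pool| labels needed passively.
    """
    lo, hi = 0, len(pool) - 1
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label(pool[mid]) == -1:
            lo = mid + 1          # threshold is to the right of pool[mid]
        else:
            hi = mid              # threshold is at or left of pool[mid]
    return pool[lo], queries
```

With a pool of size $1/\epsilon$, roughly $\log_2(1/\epsilon)$ queries suffice, matching the $O(\ln\frac{1}{\epsilon})$ claim above.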
Is this extreme brittleness to small amounts of noise essential?
Can we still achieve an exponential decrease in sample complexity?
Can we avoid making assumptions about
the mechanism producing noise?
These are the questions addressed here.
\paragraph{Previous Work on Active Learning} There has been
substantial work on active learning under additional
assumptions. \shrink{that typically include realizability (i.e., there
exists a perfect classifier in a known set), a correct Bayesian
prior, and/or assumptions about the form of noise.}
For example, the Query by
Committee analysis~\cite{QBC} assumes realizability (i.e., there exists
a perfect classifier in a known set) and a
correct Bayesian prior on the set of hypotheses. Recent work by
\singleemcite{sanjoy} has identified sufficient and
semi-necessary conditions for active learning given only the
additional realizability assumption. \cite{dkm,greedy}
also assume realizability.
If there exists a perfect separator in our class,
we can use any informative querying strategy to direct the learning process
without the need to worry about the distribution
it induces---any
inconsistent hypothesis can be eliminated based
on a \emph{single} query, {regardless} of which distribution
this query comes from. In the agnostic case, however, a hypothesis
that performs badly on the query distribution may well be the
optimal hypothesis with respect to the input distribution.
This is the main challenge in agnostic active learning that is
not present in the non-agnostic case.
\singleemcite{BZ} allow noise, but
require a correct Bayesian prior on threshold functions.
Other assumptions include specific noise models.
For example, \singleemcite{CWN} have analyzed
active learning in a setting with a constant noise rate
everywhere.
The \emph{membership-query} setting~\cite{angluin,angluin2,BE,mq}
is similar to active learning considered here except that no
unlabeled data is given. Instead, the learning algorithm is
allowed to query examples of its own choice. This is problematic
in several applications because natural oracles, such as hired
humans, have difficulty labeling synthetic examples~\cite{baum}.
Ulam's Problem (quoted in \cite{ulam}), where the goal is to find a
distinguished element in a set by asking subset membership
queries, is also related. The quantity of interest is the
smallest number of such queries required to find the element, given
a bound on the number of queries that can be answered incorrectly.
%Solutions to this problem can be viewed as robust binary search
%procedures.
Neither type of result applies here, however, since an
active learning strategy can only buy labels of the examples it
observes.
%quering anywhere is not allowed in active learning.
For example, a membership query algorithm can be used to quickly find a
separating hyperplane in a high-dimensional space. An active
learning algorithm cannot do so when the data distribution
does not support queries close to the decision boundary.
\paragraph{When Active Learning Can Help}
It is important to keep in mind that the
speedups achievable with active learning depend on the match
between the distribution over example-label pairs and the
hypothesis class, and therefore on the {target hypothesis} in the
class. %, thus one should expect the results
%to be distribution-dependent.
There are simple examples where active learning does not help at all,
even if there is no noise (see, for example, \cite{sanjoy}).
Obviously, all such lower bounds apply in
our setting as well.
It is also important to note that we cannot hope to achieve speedups when
the noise rate $\eta$ is large, due to a lower bound
of $\Omega(\frac{\eta^2}{\epsilon^2})$ on the sample complexity
of any active learner~\cite{matti}.
\paragraph{Summary of Results}
In Section~\ref{sec:The--A^{2}}, we present an \emph{A}gnostic
{\emph{A}}ctive
learning algorithm, $A^2$. The only assumption we rely upon is
that samples are drawn \emph{i.i.d.} from some underlying
distribution. In particular, we make no assumptions about the
mechanism producing noise (e.g., class/target misfit, fundamental
randomization, adversarial situations). As far as we know, this is
the first result of this form.
\shrink{
noise.
We show that the answer is ``no'' by presenting an algorithm
that can yield an exponential decrease in sample complexity over
the more typical batch learning regardless of the mechanism
producing noise (e.g., class/target misfit, fundamental
randomization, adversarial situations). As far as we know, this is
the first result of this form.
}
Section~\ref{sec:Correctness} proves that $A^2$ is correct,
and Section~\ref{sec:Fallback-Analysis} shows that $A^2$ is never
harmful, i.e., it never requires significantly more samples than
batch learning. Section~\ref{sec:Exponential-Speedup} shows the
potential of $A^2$ by establishing exponential speedups in several
settings previously analyzed without noise. In particular, we show
that we can achieve exponential speedups for the simple case of
learning threshold functions; the result holds for arbitrary
distributions provided that the noise rate is sufficiently small
with respect to the desired error $\epsilon$. We also show that
our algorithm achieves an exponential improvement if the
hypothesis class consists of homogeneous (through the origin)
linear separators and the data is distributed uniformly over the
unit sphere in $\reals^d$.
The last example has been the most encouraging theoretical
result so far in the realizable case~\cite{dkm}.
The $A^2$ analysis also achieves an almost contradictory property: for
some sets of classifiers, we can choose an $\epsilon$-optimal
classifier with fewer samples than are needed
to estimate the error rate of the chosen classifier with precision
$\epsilon$.
\section{Preliminaries}
\label{prelim}
%We consider a binary agnostic learning problem specified as follows.
%We consider the following problem.
Let $X$ be an instance space and
$Y=\{-1,1\}$ be the set of possible labels.
Let $H$ be the hypothesis class, a set of functions mapping from $X$
to $Y$. We assume there is a distribution $\distr$ over instances
in $X$, and that the instances are labeled by a possibly
randomized oracle $O$. The \emph{error rate} of a hypothesis $h$ with
respect to a distribution $P$ over $X\times Y$ is defined
as $\err_{P}(h) = \Pr_{x,y\prin P}[ h(x)\neq y ]$.
Let $\eta = \min\limits_{h \in H}{\left({\err_{D, O}(h)}\right)}$
denote the minimum error rate of any hypothesis in $H$ with respect
to the distribution $(D,O)$ induced by $D$ and the labeling oracle $O$.
The goal is to find a hypothesis $h\in H$ with
$\err_{D, O}(h)$ within $\epsilon$ of $\eta$, where $\epsilon$ is some
target error.
%%Note that $h^*$ is not necessarily the minimum error hypothesis
%after the corruption ($\eta>0$).
The algorithm $A^2$ relies on a subroutine that computes a
lower bound $\textrm{LB}(S,h, \delta)$ and an upper bound
$\textrm{UB}(S,h,\delta)$ on the true error rate
$\err_{\distrxy}(h)$ of $h$ by using a sample $S$ of examples
drawn \emph{i.i.d.} from $\distrxy$.
Each of these bounds must hold for all $h$
simultaneously with probability at least $1-\delta$.
The subroutine is formally defined below.
\begin{definition} A subroutine for computing
$\textrm{LB}(S,h, \delta)$ and
$\textrm{UB}(S,h,\delta)$ is said to be \emph{legal} if for all distributions
$P$ over $X \times Y$, and for all $m\in \nats$,
$$\textrm{LB}(S,h, \delta) \leq \err_{\distrxy}(h)
\leq \textrm{UB}(S,h,\delta)$$
holds for all $h\in H$ simultaneously, with probability $1-\delta$ over
the draw of $S$ according to $P^m$.
\shrink{
\mbox{$\Pr_{S \prin
{\distrxy}^m}{[\forall h~ \textrm{LB}(S,h, \delta) \leq
\emph{err}_{\distrxy}(h) \leq \textrm{UB}(S,h,
\delta)]} \geq 1
-\delta$.}
}
\end{definition}
\shrink{
We now define the sample complexity $m(\varepsilon, \delta, H)$
required by a batch algorithm that uses a subroutine for
computing $\textrm{LB}(S,h, \delta)$ and $\textrm{UB}(S,h,\delta)$
as the minimum number of samples $m$ such that
for all $S \in X^m$,
we have
$| \textrm{UB}(S,h, \delta) - \textrm{LB}(S,h, \delta) | \leq
\varepsilon$
for all $h\in H$.
}
Examples of such subroutines are the VC bound~\cite{VC} and the
Occam's Razor bound~\cite{occam}. \mycomment{ We use
the following bound on $m(\epsilon,\delta,H)$ stated as
Theorem~\ref{VCbound} in Appendix~\ref{app:batch}:
$$
m (\epsilon, \delta, H) =
\frac{64}{\epsilon^2} \left(2\vcdim \ln {\left(
\frac{12}{\epsilon} \right)}+ \ln {\left( \frac{4}{\delta}
\right)} \right)
$$
Here $V$ is the VC-dimension of $H$.}
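For a finite hypothesis class, one hypothetical way to realize a legal subroutine in the sense of the definition above is empirical error plus a deviation term; the sketch below uses the standard Hoeffding-plus-union-bound constants, which are illustrative and not the paper's VC or Occam's Razor constants:

```python
import math

def empirical_error(S, h):
    """Fraction of labeled examples (x, y) in S that h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def make_bounds(H_size):
    """Occam-style bounds for a finite class of H_size hypotheses.

    With probability at least 1 - delta over an i.i.d. sample S,
    |err(h) - empirical_error(S, h)| <= sqrt(ln(2 H_size / delta) / (2|S|))
    holds simultaneously for all h, so LB <= err(h) <= UB is legal."""
    def LB(S, h, delta):
        slack = math.sqrt(math.log(2 * H_size / delta) / (2 * len(S)))
        return max(0.0, empirical_error(S, h) - slack)
    def UB(S, h, delta):
        slack = math.sqrt(math.log(2 * H_size / delta) / (2 * len(S)))
        return min(1.0, empirical_error(S, h) + slack)
    return LB, UB
```

The interval widens as $\delta$ shrinks, matching the requirement that the bounds hold for all $h$ simultaneously with probability $1-\delta$.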
%-------------->changes begin
\section{The $A^{2}$ Agnostic Active Learner}
\label{sec:The--A^{2}}
At a high level, $A^2$ can be viewed as a robust version of
the selective sampling algorithm of \singleemcite{CAL}.
Selective sampling is a sequential process that
keeps track of two spaces---the current \emph{version space} $H_i$, defined
as the set of hypotheses in $H$ consistent
with all labels revealed so far, %and thus still under consideration,
and the current \emph{region of uncertainty} $R_i$, defined as the set
of all $x \in X$, for which there exists a pair of hypotheses in $H_i$
that disagrees on $x$. In round $i$, the algorithm picks a random
unlabeled example from $R_i$ and
queries it, eliminating
all hypotheses in $H_i$ inconsistent with the received label.
The algorithm then eliminates those $x\in R_i$ on which
all surviving hypotheses agree, and recurses.
%This may, in turn,
%allow the algorithm to reduce the region of uncertainty, because all surviving
%hypotheses may now agree on some parts of the region.
This process fundamentally relies on the assumption that there exists
a consistent hypothesis in $H$. In the agnostic case,
we cannot eliminate a hypothesis based on its disagreement with a single
example. We need to be more conservative, or
we risk eliminating the best hypotheses in the class.
%because a
%hypothesis that disagrees on such a query may well be the
%optimal hypothesis with respect to the input distribution $D$.
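For a finite hypothesis class, the selective-sampling scheme just described can be sketched as follows (a simplified, hypothetical rendering of the realizable-case algorithm):

```python
import random

def selective_sampling(H, pool, label, rounds):
    """CAL-style selective sampling in the realizable case.

    H     : list of hypotheses (functions x -> +1/-1), assumed to
            contain a perfect classifier
    pool  : unlabeled examples drawn from the input distribution
    label : deterministic labeling oracle
    """
    version_space = list(H)
    for _ in range(rounds):
        # Region of uncertainty: examples on which some surviving
        # pair of hypotheses disagrees.
        region = [x for x in pool
                  if len({h(x) for h in version_space}) > 1]
        if not region:
            break            # survivors agree everywhere on the pool
        x = random.choice(region)
        y = label(x)
        # A single label discards every inconsistent hypothesis.
        version_space = [h for h in version_space if h(x) == y]
    return version_space
```

Each query from the region of uncertainty eliminates at least one hypothesis, which is precisely the step that becomes unsafe in the agnostic case.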
A formal specification of $A^2$ is given in Algorithm~\ref{robust_active}.
Let $H_i$ be the set
of hypotheses still under consideration by $A^2$ in round $i$.
%Let $H_i$ be the version space of $A^2$ in round $i$,
%the set of hypotheses in $H$ still under consideration in round $i$.
If all hypotheses in
$H_i$ agree on some region of the instance space, this region can
be safely eliminated.
To help us keep track of progress in decreasing the region of uncertainty,
define $\textsc{Disagree}_{D}(H_i)$ as the probability that
there exists a pair of hypotheses in $H_i$ that disagrees on a
random example drawn from $D$:
$$\textsc{Disagree}_{D}(H_i)=\mathbf{Pr}_{x \prin D}[ \exists
h_{1},h_{2}\in H_i: h_{1}(x)\neq h_{2}(x)].$$
Hence $\textsc{Disagree}_{D}(H_i)$ is the volume of the current
region of uncertainty with respect to $D$.
Let $D_i$ be the distribution $D$ restricted to the current region of
uncertainty. Formally, $D_i = D(x\mid \exists
h_{1},h_{2} \in H_i:h_{1}(x)\neq h_{2}(x))$.
In round $i$, $A^2$ samples a set of
examples $S_i$ from $D_i,O$, and uses it to
compute upper and lower bounds for all hypotheses in $H_i$.
It then eliminates all hypotheses whose lower bound is
greater than the minimum upper bound. Figure~\ref{a2} shows the algorithm
in action for the case when the data lie in the $[0,1]$ interval
on the real line, and $H$ is the set of thresholding functions.
The horizontal axis denotes both
the instance space and the hypothesis space, superimposed.
The vertical axis shows the error rates.
Round $i$ completes when $S_i$ is large enough
to eliminate at least half of the current region of uncertainty.
%(Otherwise, we keep doubling the size of $S_i$.)
Since we eliminate
only those examples on which the surviving hypotheses agree, an optimal
hypothesis in $H_i$ with respect to $D_i$ remains an optimal hypothesis
in $H_{i+1}$ with respect to $D_{i+1}$.
Since each round $i$ cuts $\textsc{Disagree}_D(H_i)$ down by half,
the number of rounds is bounded by $\log\frac{1}{\epsilon}$.
Section~\ref{sec:Exponential-Speedup} gives examples of distributions
and hypothesis classes for which $A^2$ requires only a small
number of labeled examples to transition between rounds, yielding an
exponential improvement in sample complexity.
%savings in the number of labeled examples needed for learning.
When evaluating bounds during
the course of Algorithm~\ref{robust_active}, we choose a schedule
of $\delta$ according to the following rule: we evaluate bound $k$
with confidence $\delta_k=\frac{\delta}{k(k+1)}$, for $k\geq 1$.
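This schedule spends the total confidence budget via a telescoping series, $\sum_{k\geq 1}\frac{1}{k(k+1)}=1$; a quick numerical sketch (illustrative only):

```python
def delta_schedule(delta, K):
    """Confidence values delta_k = delta / (k (k+1)) for k = 1..K.

    The series telescopes: sum_{k=1}^{K} 1/(k(k+1)) = 1 - 1/(K+1),
    which approaches 1, so the total failure probability over all
    bound evaluations is at most delta (the union-bound budget)."""
    return [delta / (k * (k + 1)) for k in range(1, K + 1)]

budget_used = sum(delta_schedule(0.05, 100000))
# Mathematically the partial sum is 0.05 * (1 - 1/100001),
# i.e., just under the full budget of 0.05.
```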
\begin{algorithm} \caption{\label{robust_active} $A^{2}$ (allowed error rate
$\epsilon$, sampling oracle for $D$, labeling oracle $O$,
hypothesis class $H$)}
Set $i=1$, $\distr_i=D$, $H_i=H$, $S_i=\emptyset$, and $k=1$.
\vskip .1in {\bf while} $\
\textsc{Disagree}_{D}(H_i)(\min\limits_{h\in
H_i}\textrm{UB}(S_i,h, \conf_k)-\min\limits_{h\in
H_i}\textrm{LB}(S_i,h, \conf_k))
>\epsilon$
\begin{enumerate}
\item set $S_i=\emptyset$, $H'_i=H_i$, $k \leftarrow k+1$. \item
\vskip .1in {\bf while} $\ \textsc{Disagree}_{D}(H'_i) \geq
\frac{1}{2}\textsc{Disagree}_{D}(H_i)$
\begin{enumerate}
\item {\bf if} $\ \textsc{Disagree}_{D}(H'_i)(\min\limits_{h\in
H'_i}\textrm{UB}(S_i,h, \conf_k)-\min\limits_{h\in
H'_i}\textrm{LB}(S_i,h, \conf_k))\leq\epsilon$ \item \hskip .2in
return $h={\rm arg}\!\!\min \limits_{h \in H'_i}
{\textrm{UB}(S_i,h, \conf_k)}$. \item {\bf else}
\begin{enumerate}
\item $S'_i=$ Rejection sample $2|S_i|+1$ samples $x$ from $D$
satisfying:
$$\exists h_{1},h_{2}\in H_i:h_{1}(x)\neq h_{2}(x)$$ \item $S_i
\leftarrow S_i \cup \{(x,O(x)):x\in S'_i\}$; $k \leftarrow k+1$;
\item $H'_i=\{ h\in
H_i:\textrm{LB}(S_i,h,\conf_k)\leq\min\limits_{h'\in
H_i}\textrm{UB}(S_i,h', \conf_k)\}$, $k \leftarrow k+1$.
\end{enumerate}
{\bf end if}
\end{enumerate}
{\bf end while}
\item \vskip .1in $H_{i+1}\leftarrow H'_i$;
$D_{i+1}\leftarrow D_{i}$ conditioned on the disagreement $\exists
h_1, h_2 \in H_{i}: h_1(x) \neq h_2(x)$;
$i \leftarrow i+1$.
\end{enumerate}
{\bf end while}
\vskip .01in Return $h={\rm arg}\!\!\min \limits_{h \in H_i}
{\textrm{UB}(S_i,h, \conf_k)}$.
\end{algorithm}
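The main loop of Algorithm~\ref{robust_active} can be sketched for a finite class of thresholds on $[0,1]$; this is a deliberately simplified, hypothetical rendering (fixed per-round sample sizes, a finite-class Hoeffding bound in place of the LB/UB subroutine), not the exact specification above:

```python
import math, random

def a2_sketch(thresholds, label, eps, delta, pool, max_rounds=10):
    """Simplified A^2-style loop: sample from the region of
    uncertainty, bound each hypothesis's error, and eliminate
    hypotheses whose lower bound exceeds the minimum upper bound."""
    H = list(thresholds)
    evals = 0                      # bound-evaluation counter (delta schedule)
    err = {t: 0.5 for t in H}      # placeholder before the first round

    def classify(t, x):
        return 1 if x >= t else -1

    def region(hyps):
        # Points on which some pair of surviving thresholds disagrees.
        lo, hi = min(hyps), max(hyps)
        return [x for x in pool if lo <= x < hi]

    for _ in range(max_rounds):
        R = region(H)
        if not R:
            break
        # Label a fresh sample drawn from the region of uncertainty.
        S = [(x, label(x)) for x in random.choices(R, k=200)]
        evals += 1
        conf = delta / (evals * (evals + 1))
        slack = math.sqrt(math.log(2 * len(H) / conf) / (2 * len(S)))
        err = {t: sum(1 for x, y in S if classify(t, x) != y) / len(S)
               for t in H}
        min_ub = min(err[t] + slack for t in H)
        # Keep hypotheses whose lower bound is at most the min upper bound.
        H = [t for t in H if err[t] - slack <= min_ub]
        # Stop once the bound gap, rescaled by the region's mass, is small.
        if 2 * slack * len(R) / len(pool) <= eps:
            break
    return min(H, key=lambda t: err[t])
```

Because eliminations are based on bounds rather than single labels, a hypothesis with minimum true error survives every round with high probability, even under label noise.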
%------------------------- A^2 FIGURE BEGINS ------------------
\begin{figure}
%\begin{minipage}[h]{\linewidth}
\centering
\scalebox{0.28}{\includegraphics{template.png}}
\scalebox{0.28}{\includegraphics{slide1.png}}
\scalebox{0.28}{\includegraphics{slide2.png}}
\vskip -1.5in
%\begin{picture}(300,115)
\begin{picture}(300,100)
\put(5,95){\small{$0$}} \put(5,195){\small{$0$}}
\put(5,-5){\small{$0$}} \put(221,95){\small{$1$}}
\put(221,195){\small{$1$}} \put(221,-5){\small{$1$}}
\put(5,180){\small{$\frac{1}{2}$}}
\put(5,280){\small{$\frac{1}{2}$}}
\put(5,80){\small{$\frac{1}{2}$}}
%
\put(21,90){\scriptsize{$1$}}\put(31,90){\scriptsize{$1$}}
\put(37,90){\scriptsize{$-{\hskip -.01in}1$}}\put(59,90){\scriptsize{$1$}}
\put(67,90){\scriptsize{$1$}}\put(81,90){\scriptsize{$-{\hskip -.01in}1$}}
\put(95,90){\scriptsize{$1$}}\put(103,90){\scriptsize{$-{\hskip -.01in}1$}}
\put(114,90){\scriptsize{$1$}}\put(128,90){\scriptsize{$-{\hskip -.01in}1$}}
\put(140,90){\scriptsize{$-{\hskip -.01in}1$}}\put(186,90){\scriptsize{$-{\hskip -.01in}1$}}
%
\put(21,190){\scriptsize{$1$}}\put(31,190){\scriptsize{$1$}}
\put(37,190){\scriptsize{$-{\hskip -.01in}1$}}\put(59,190){\scriptsize{$1$}}
\put(67,190){\scriptsize{$1$}}\put(81,190){\scriptsize{$-{\hskip -.01in}1$}}
\put(95,190){\scriptsize{$1$}}\put(103,190){\scriptsize{$-{\hskip -.01in}1$}}
\put(114,190){\scriptsize{$1$}}\put(128,190){\scriptsize{$-{\hskip -.01in}1$}}
\put(140,190){\scriptsize{$-{\hskip -.01in}1$}}\put(186,190){\scriptsize{$-{\hskip -.01in}1$}}
%%%%%%%%%%%%%%%
\shrink{
\put(20,90){\scriptsize{$-1$}}\put(30,90){\scriptsize{$-1$}}
\put(40,90){\scriptsize{$1$}}\put(60,90){\scriptsize{$-1$}}
\put(67,90){\scriptsize{$-1$}}\put(86,90){\scriptsize{$1$}}
\put(95,90){\scriptsize{$-1$}}\put(109,90){\scriptsize{$1$}}
\put(114,90){\scriptsize{$-1$}}\put(135,90){\scriptsize{$1$}}
\put(144,90){\scriptsize{$1$}}\put(191,90){\scriptsize{$1$}}
%
\put(22,190){\scriptsize{$-1$}}\put(32,190){\scriptsize{$-1$}}
\put(40,190){\scriptsize{$1$}}\put(60,190){\scriptsize{$-1$}}
\put(67,190){\scriptsize{$-1$}}\put(86,190){\scriptsize{$1$}}
\put(95,190){\scriptsize{$-1$}}\put(109,190){\scriptsize{$1$}}
\put(114,190){\scriptsize{$-1$}}\put(135,190){\scriptsize{$1$}}
\put(144,190){\scriptsize{$1$}}\put(191,190){\scriptsize{$1$}}
}
%
%\put(42,-7){\small{thresholds / input space superimposed}}
\put(80,245){\small{true error rates}}
\put(100,150){\small{upper bounds}}
\put(35,105){\small{lower bounds}}
\put(91,55){\small{min upper bound}}
\thinlines
\put(125,50){\vector(-1,-4){5}}
\put(64.3,26){\circle*{5}}
\put(191,26){\circle*{5}}
\end{picture}
\caption{$A^2$ in action: Sampling, Bounding, Eliminating.
%The horizontal axis shows the instance space and the hypothesis space
%superimposed. The vertical axis shows the error rates of the corresponding
%hypotheses.
}\label{a2}
%\end{minipage}
\end{figure}
%------------------------- A^2 FIGURE ENDS ------------------
%-------------->changes end
%------------------------- A^2 FIGURE ENDS ------------------
\subsection{Correctness} \label{sec:Correctness}
\begin{theorem}\emph{(Correctness)}
For all $H$, for all $(D,O)$, for all legal subroutines for computing
$UB$ and $LB$, with probability $1-\delta$, $A^{2}$ returns an
$\epsilon$-optimal hypothesis or does not terminate.
\end{theorem}
\begin{note}For most ``reasonable'' subroutines for computing
$UB$ and $LB$, $A^{2}$ terminates with probability at least
$1-\delta$. For more discussion and a proof of this fact see
Section~\ref{sec:Fallback-Analysis}.
\end{note}
\begin{proof}
We first prove that all bound evaluations are valid simultaneously
with probability at least $1-\delta$, and then show that the
procedure produces an $\epsilon$-optimal hypothesis upon termination.
To prove the first claim, notice that the samples on which each
bound is evaluated are drawn \emph{i.i.d.} from some distribution over $X
\times Y$. This can be verified by noting that the distribution
$D_i$ used in round $i$ is precisely that given by drawing $x$ from the
underlying distribution $D$ conditioned on the disagreement
$\exists h_{1},h_{2} \in H_i :h_{1}(x)\neq h_{2}(x)$, and then
labeling according to the oracle $O$. %Formally, we can specify the
%distribution over $x$ used in round $i$ as $D_i = D(x\mid \exists
%h_{1},h_{2} \in H_i:h_{1}(x)\neq h_{2}(x)$.
The $k$-th bound evaluation fails with probability at most
$\frac{\delta}{k(k+1)}$. By the union
bound, the probability that any bound fails is less than the sum
of the probabilities of individual bound failures. This sum is
bounded by $\sum_{k=1}^{\infty}\frac{\delta}{k(k+1)}=\delta$.
%%Note also that the probability that any of the bounds fails is at
%%most $\delta$.
%%We argue now that the procedure terminates and it outputs an
%%$\epsilon$-optimal hypothesis upon termination. The termination
%%proof is essentially the Fallback analysis in
%%Section~\ref{sec:Fallback-Analysis} (see
%Theorem~\ref{generic-fallback-labeled}).
To prove the second claim, notice first that since every bound
evaluation is correct, we can be certain that step 2(c)(iii) never
eliminates a hypothesis that has minimum error rate with respect
to $D$.
%%Moreover, notice that the
%%algorithm procedure as proven in Theorem~\ref{fallback-labeled}.
Let us now introduce the following notation. For a hypothesis $h
\in H$ and $G \subseteq H$ define:
$$e_{D,G,O}(h)= \mathbf{Pr}_{x,y\prin D,O \mid \exists
h_{1},h_{2}\in G : h_{1}(x)\neq h_{2}(x)} [h(x)\ne y],$$% and
$$f_{D,G,O}(h) = \mathbf{Pr}_{x,y\sim D,O \mid \forall
h_{1},h_{2}\in G : h_{1}(x) = h_{2}(x)}[h(x)\ne y].$$ Notice that
$e_{D,G,O}(h)$ is in fact $\err_{D_{G},O}(h)$, where $D_{G}$ is $D$
conditioned on the disagreement $\exists h_1, h_2 \in G: h_1(x)
\neq h_2(x)$. Moreover, given any $G \subseteq H$, we can
decompose the error rate of every hypothesis $h$ into two parts as
follows:
\begin{align*}
\err_{D,O}(h)
&= e_{D,{G},O}(h)\cdot\textsc{Disagree}_D({G})\\
&\quad + f_{D,{G},O}(h)\cdot (1-\textsc{Disagree}_D({G}))\\
&= \err_{D_{{G}},O}(h)\cdot\textsc{Disagree}_D({G})\\
&\quad + f_{D,{G},O}(h)\cdot (1-\textsc{Disagree}_D({G})).
\end{align*}
Notice that the only term that varies
with $h \in G$ in this decomposition is $e_{D,{G},O}(h)$.
Consequently, to find an $\epsilon$-optimal hypothesis
it suffices to estimate the term
$\err_{D_{{G}},O}(h)\cdot\textsc{Disagree}_D({G})$ to precision
$\epsilon$. But this is exactly what the negation of the main
while-loop guard does, and this is also the condition we use in
Step 2(a) of the algorithm. In other words, upon termination we
have $\ \textsc{Disagree}_{D}(H_i)(\min_{h\in
H_i}\textrm{UB}(S_i,h,\delta_k)-\min_{h\in H_i}\textrm{LB}(S_i,h,
\delta_k))\leq\epsilon$, which proves the desired result.
\end{proof}
\subsection{Fallback Analysis} \label{sec:Fallback-Analysis}
This section shows that $A^{2}$ is never much worse than a
standard batch, bound-based algorithm\footnote{A
standard example of a bound-based learning algorithm is
Empirical Risk Minimization (\ERM).} in terms of
the number of samples required in order to learn, assuming that
UB and LB are ``sane''.
Define the sample complexity $m(\varepsilon, \delta, H)$
required by a batch algorithm that uses a subroutine for computing
$\textrm{LB}(S,h, \delta)$ and $\textrm{UB}(S,h,\delta)$ as the
minimum number of samples $m$ such that for all $S \in X^m$, we
have $| \textrm{UB}(S,h, \delta) - \textrm{LB}(S,h, \delta) | \leq
\varepsilon$ for all $h\in H$.
%Examples of such a subroutine is the Occam Razor bound or the VC
%bound (see, for example, \singleemcite{VC,occam}).
We use the following bound on $m(\epsilon,\delta,H)$ stated as
Theorem~\ref{VCbound} in Appendix~\ref{app:batch}:
$$
m (\epsilon, \delta, H) = \frac{64}{\epsilon^2} \left(2\vcdim \ln
{\left( \frac{12}{\epsilon} \right)}+ \ln {\left( \frac{4}{\delta}
\right)} \right)
$$
Here $V$ is the VC-dimension of $H$. Assume that $m(2\epsilon,
\delta, \complex) \leq \frac{m(\epsilon, \delta, \complex)}{2}$,
and also that the function $m$ is monotonically increasing in
$1/\delta$. These conditions are satisfied by many subroutines for
computing UB and LB, including those based on the
VC-bound~\cite{VC} and the Occam's Razor bound~\cite{occam}.
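The displayed bound can be evaluated numerically; the sketch below (illustrative, with the formula copied from the displayed equation) makes it easy to check the two assumptions, $m(2\epsilon, \delta, \complex) \leq \frac{m(\epsilon, \delta, \complex)}{2}$ and monotonicity in $1/\delta$:

```python
import math

def m_batch(eps, delta, V):
    """Classic VC-type sample complexity bound used in the text:
    m(eps, delta, H) = (64 / eps^2) (2 V ln(12/eps) + ln(4/delta)),
    where V is the VC-dimension of H."""
    return (64 / eps**2) * (2 * V * math.log(12 / eps)
                            + math.log(4 / delta))
```

For any $V$, halving $\epsilon$ multiplies the leading $64/\epsilon^2$ factor by 4 while the logarithmic factor only shrinks, so $m(2\epsilon,\delta,H) \leq m(\epsilon,\delta,H)/4 \leq m(\epsilon,\delta,H)/2$ holds.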
\begin{theorem}
\label{generic-fallback-labeled} For all $H$,
for all $(D,O)$, for all UB and LB satisfying the assumption above,
the algorithm $A^{2}$
makes at most $2 m(\epsilon, \delta', \complex)$ calls to the
oracle $O$, where $\delta'= \frac{\delta }{N(\epsilon, \delta,
H)(N(\epsilon, \delta, H)+1) }$ and $N(\epsilon, \delta, H) $
satisfies: $N(\epsilon, \delta, H) \geq \ln{\frac{1}{\epsilon}}
\ln{m (\epsilon, \frac{\delta}{N(\epsilon, \delta, H)(N(\epsilon,
\delta, H)+1)}, H)}$. Here
$m(\epsilon, \delta, \complex)$ is the sample complexity of UB and LB.
\end{theorem}
\begin{proof}
Let $\delta_k=\frac{\delta}{k(k+1)}$ be the confidence parameter
used in the $k$-th application of the
subroutine for computing UB and LB. We will
determine an upper bound $N(\epsilon, \delta, H)$ on the number of
bound evaluations throughout the life of the algorithm. This will
imply that the confidence parameter $\delta_k$ will always be
greater than $ \delta'= \frac{\delta}{N(\epsilon, \delta,
H)(N(\epsilon, \delta, H)+1)}$.
Consider $i=1$. If condition $2$ of Algorithm $A^{2}$ is
repeatedly satisfied then after labeling $m (\epsilon, \delta',
H)$ examples from $D_1$ we have that, uniformly, for all
hypotheses $h \in H_1$,
$ \left| \textrm{UB}(S_1,h, \delta') - \textrm{LB}(S_1,h, \delta')\right|
\leq \epsilon$. Under these conditions, the algorithm safely halts.
Notice also that the number of bound evaluations during this
process is at most $\ln{m (\epsilon, \delta', H)}$.
On the other hand, if loop (2) ever completes and $i$ increases,
then it is enough to have uniformly for all $h \in H_2$, $ \left|
\textrm{UB}(S_2,h, \delta') - \textrm{LB}(S_2,h, \delta')\right|
\leq \ 2\epsilon$. (This follows from the exit conditions we use
in the outer while-loop and in Step 2(a) of $A^2$.)
Clearly, to have the upper and lower bounds within $2\epsilon$ of
each other uniformly over all hypotheses $h \in H_2$, we only need
$m (2\epsilon, \delta', H) \leq \frac{m (\epsilon, \delta', H)}{2}$
labeled examples from $D_2$, and the number of bound evaluations in
round $i=2$ is at most $\ln{m (\epsilon, \delta', H)}$.
In general, in round $i$ it is enough to have uniformly for all
$h \in H_i$, $ \left| \textrm{UB}(S_i,h, \delta') -
\textrm{LB}(S_i,h, \delta')\right| \leq 2^{i-1}\epsilon$, and in
order to obtain this we only need $m (2^{i-1}\epsilon, \delta', H)
\leq \frac{m (\epsilon, \delta', H)}{2^{i-1}}$ labeled examples
from $D_i$. Also, the number of bound evaluations in round $i$ is
at most $\ln{m (\epsilon, \delta', H)}$.
Since the number of rounds is bounded by
$\ln{\frac{1}{\epsilon}}$, it follows that the maximum number of
bound evaluations throughout the life of the algorithm is at most
$\ln{\frac{1}{\epsilon}} \ln{m (\epsilon, \delta', H)}$. This
implies that in order to determine an upper bound $N(\epsilon,
\delta, H)$ we just need to find a solution of the following
inequality: $N(\epsilon, \delta, H) \geq \ln{\frac{1}{\epsilon}}
\ln{m (\epsilon, \frac{\delta}{N(\epsilon, \delta, H)(N(\epsilon,
\delta, H)+1) }, H)}$.
%%This ensure that a confidence parameter of
%%$\frac{\delta}{N(\epsilon, \delta, H)}$ suffices.
Finally, adding up the number of calls to the oracle in all
rounds, we get that the number of calls to the oracle throughout
the life of the algorithm is at most $2 m (\epsilon, \delta', H)$.
\end{proof}
Let $\vcdim$ denote the VC-dimension of
$H$, and let $m (\epsilon, \delta, H)$ be the number of examples
required by the ERM algorithm. As we state in
Theorem~\ref{VCbound} in Appendix~\ref{app:batch}, a classic bound
on $m (\epsilon, \delta, H)$ is $m (\epsilon, \delta,H) =
\frac{64}{\epsilon^2} \left(2\vcdim \ln {\left(
\frac{12}{\epsilon} \right)}+ \ln {\left( \frac{4}{\delta}
\right)} \right)$. Then using
Theorem~\ref{generic-fallback-labeled}, we can show the following corollary.
\begin{corollary}
\label{VCbound-fallback-labeled}
For all hypothesis classes $H$ of VC-dimension $V$,
for all distributions $(D,O)$ over $X\times Y$, the algorithm $A^2$
requires at most $\tilde{O}\left(\frac{1}{\epsilon^2} (V \ln \frac{1}{\epsilon}
+ \ln \frac{1}{\delta})\right)$
labeled examples drawn \emph{i.i.d.} from $(D,O)$.\footnote{
Here and in the rest of the paper, the $\tilde{O}(\cdot)$ notation is used
to hide factors logarithmic in the quantities shown explicitly.
}
\shrink{The number of calls to the oracle
$O$ that $A^{2}$ makes is always at most $ 2 m(\epsilon, \delta',
\complex)$, where $\delta'= \frac{\delta }{N(\epsilon, \delta,
H)^2}$ and $N(\epsilon, \delta, H)=\frac{256}{\epsilon^2} \ln
{\left( \frac{1}{ \epsilon}\right)} \vcdim \ln {\left(
\frac{12}{\epsilon} \right)}+ \frac{128}{\epsilon^2} \ln {\left(
\frac{4}{\delta} \right)}+ \frac{256}{\epsilon^2} \ln {\left(
\frac{1}{ \epsilon}\right)}\ln{\left( \frac{256}{\epsilon^2} \ln
{\left( \frac{1}{ \epsilon}\right)} \right)}$.}
\end{corollary}
\begin{proof}
We use the form of $m (\epsilon, \delta, \complex)$ and
Theorem~\ref{generic-fallback-labeled} to upper
bound $N=N (\epsilon, \delta, \complex)$. It is enough to find the smallest $N$
satisfying $N \geq \ln {\left( \frac{1}{\epsilon}
\right)} \ln { \left( \frac{64}{\epsilon^2} \left(2\vcdim \ln
{\left( \frac{12}{\epsilon} \right)}+ \ln {\left(
\frac{4N^2}{\delta} \right)} \right)\right)}$. Using the
inequality $\ln a \leq ab-\ln b -1 $ for all $a, b >0$ and some
simple algebraic manipulations, we get the desired upper bound on
$N(\epsilon, \delta, H)$. The result then follows from
Theorem~\ref{generic-fallback-labeled}.
\end{proof}
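The self-referential bound in the proof above can also be solved numerically. The following sketch (our own illustration; the function names are ours) iterates the recurrence $N \geq \ln\frac{1}{\epsilon}\,\ln m(\epsilon, \delta/(N(N+1)), H)$ from $N=1$ until it stabilizes, using the form of $m(\epsilon, \delta, H)$ from Theorem~\ref{VCbound}:

```python
import math

def m_vc(eps, delta, d):
    # Classic VC sample-complexity bound (the theorem in the appendix):
    # m(eps, delta, H) = (64/eps^2) * (2 d ln(12/eps) + ln(4/delta)).
    return (64.0 / eps ** 2) * (2 * d * math.log(12.0 / eps) + math.log(4.0 / delta))

def solve_N(eps, delta, d, iters=100):
    # Fixed-point iteration for the self-referential bound
    # N >= ln(1/eps) * ln(m(eps, delta / (N (N + 1)), H)).
    # The right-hand side grows only doubly-logarithmically in N,
    # so iterating it from N = 1 converges rapidly.
    N = 1.0
    for _ in range(iters):
        N = math.log(1.0 / eps) * math.log(m_vc(eps, delta / (N * (N + 1)), d))
    return N

N = solve_N(eps=0.01, delta=0.05, d=10)
```

Because the dependence of the right-hand side on $N$ is so weak, a handful of iterations already agrees with the closed-form upper bound obtained via $\ln a \leq ab - \ln b - 1$.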
\section{Active Learning Speedups}
\label{sec:Exponential-Speedup} In this section, we show that
exponential improvements in sample complexity are achievable for
some hypothesis classes even in the presence of arbitrary noise.
\subsection{Learning Threshold Functions}
We begin by analyzing the simple class of threshold functions. As
mentioned in the introduction, it turns out that even for this
simple class of functions, an exponential reduction in sample
complexity is \emph{not} achievable when the noise rate $\eta$ is
large~\cite{matti}.
%%%I'm not sure this is right!!!
%%as a moment of
%%introspection confirms.
Therefore we prove two results: one shows an exponential sample
complexity improvement when the noise rate is small, while the
other shows only a slower improvement when the noise rate is
large. In the extreme where the noise rate is $1/2$, there is no
improvement.
\begin{theorem}%%\emph{(Non-triviality I)}
\label{treshold-small-noise}
Let $H$ be the set of thresholds on
an interval with $\textrm{LB}$ and $\textrm{UB}$ the VC bound.
For all distributions $(D,O)$, for any
$\epsilon<\frac{1}{2}$ and $\eta<\frac{\epsilon}{16}$,
the algorithm $A^{2}$ makes $O\left( \ln{ \left(
\frac{1}{\epsilon} \right)} \ln{ \left( \frac{\ln{ \left(
\frac{1}{\epsilon \delta} \right)}}{\delta}\right)} \right)$
calls to the oracle $O$ on examples drawn \emph{i.i.d.} from $D$,
with probability $1-\delta$. \mycomment{ $O\left(\ln{\left(
\frac{1}{\delta'}\right)} \ln{ \left( \frac{1}{\epsilon} \right)}
\right)$, where $\delta'=\frac{\delta}{N^2(\epsilon, \delta, H)}$
and $N(\epsilon,\delta,H)$ satisfies $N(\epsilon,\delta, H) = 128
\ln{ \left( \frac{1}{\epsilon} \right)} \ln{ \left(
\frac{128}{\epsilon \delta} \ln{ \left( \frac{1}{\epsilon}
\right)} \right)} $.}
\end{theorem}
\begin{proof}
Each execution of loop $2$ decreases $\textsc{Disagree}_D(H_i)$ by
at least a factor of $2$, implying that the number of executions
is bounded by $\log\frac{1}{\epsilon}$. Consequently, we are
done if we can show that only $O\left(\ln\frac{1}{\delta'}\right)$ labeled
samples are required per round.\footnote{Notice that the difference
$(\min_{h\in H_i}\textrm{UB}(S_i,h, D_i,O)-\min_{h\in
H_i}\textrm{LB}(S_i,h, D_i, O))$ appearing in Step 2(a) is always
constant.}
%Let $h^*$ denote a minimum error rate hypothesis. % of minimum error
%rate with respect to the original distribution $D$.%\footnote{As
%mentioned in the proof of Theorem~\ref{generic-fallback-labeled}
%we never eliminate from $H_i$ hypotheses that are optimal with
%respect to $D$.}
Let $h^*$ be any minimum error rate hypothesis.
For $h_1,h_2 \in H_i$, let
$d_i(h_1,h_2)$ be the probability that $h_1$ and $h_2$ predict
differently on a random example drawn according to $D_i$,
i.e., $d_i(h_1,h_2)= \Pr_{x \prin D_i} {\left(h_1(x) \neq
h_2(x)\right)}$.
Consider $i \geq 1$. Let $[lower_i, upper_i]$ be the support of
$D_i$. Note that for any hypothesis $h \in H_i$ we have
$\err_{D_i,O}(h) \geq d_i(h,h^*) -\err_{D_i,O}(h^*)$. We also
clearly have $\err_{D_i,O}(h^*) \leq {\eta}/Z_i$, where $Z_i=
\Pr_{x \prin D} {\left( x \in \left[ lower_i, upper_i \right]
\right) }$. (So $Z_i$ is a shorthand for
$\textsc{Disagree}_D(H_i)$.)
Now notice that at least a $\frac{1}{2}$-fraction (measured with
respect to $D_i$) of thresholds in $H_i$ satisfy $d_i(h,h^*)
\geq \frac{1}{4}$, and these thresholds are located at the
``ends'' of the interval $[lower_i, upper_i]$. Formally, assume first that
both $d_i(h^*, lower_i) \geq \frac{1}{4}$ and $d_i(h^*, upper_i)
\geq \frac{1}{4}$; in this case, let $l_i$ and $u_i$ be the hypotheses to the
left and to the right of $h^*$, respectively, that satisfy
$d_i(h^*,l_i) = \frac{1}{4}$ and $d_i(h^*,u_i) = \frac{1}{4}$. We
clearly have that all $h \in [lower_i,l_i] \cup [u_i, upper_i]$
satisfy $d_i(h^*, h) \geq \frac{1}{4}$ and moreover $\Pr_{x \prin
D_i}{\left(x
\in [lower_i,l_i] \cup [u_i, upper_i]\right)} \geq
\frac{1}{2}$.
Otherwise, assume without loss of generality that $d_i(h^*, lower_i) \leq
\frac{1}{4}$. Let $u_i$ be the hypothesis to the right of $h^*$
with $d_i(u_i, upper_i) = \frac{1}{2}$. Then we clearly have that
all $h \in [u_i, upper_i]$ satisfy $d_i(h^*, h) \geq \frac{1}{4}$
and moreover $\Pr_{x \prin D_i}{\left(x \in [u_i, upper_i]\right)}
\geq \frac{1}{2}$.
Using the VC bound, we get that with probability $1-\delta'$ if
$$|S_i|= O\left(\frac{\ln\frac{1}{\delta'}}
{\left(\frac{1}{8}-\frac{\eta}{Z_i}\right)^{2}} \right),$$ then
uniformly for all hypotheses $h\in H_i$, we have
$\left| \textrm{UB}(S_i,h, \delta) - \textrm{LB}(S_i,h,
\delta)\right| \leq \frac{1}{8}-\frac{\eta}{Z_i}$.
Consider a hypothesis $h\in H_i$ with $d_i(h,h^*) \geq \frac{1}{4}$.
For any such $h$ we have
$\err_{D_i,O}(h) \geq d_i(h,h^*)
-\err_{D_i,O}(h^*) \geq \frac{1}{4} - \frac{\eta}{Z_i}$, and so
$\textrm{LB}(S_i,h, \delta) \geq \frac{1}{4} - \frac{\eta}{Z_i} -
(\frac{1}{8}-\frac{\eta}{Z_i})= \frac{1}{8}$. On the other hand,
$\err_{D_i,O}(h^*) \leq \frac{\eta}{Z_i}$, and so
$\textrm{UB}(S_i,h^*, \delta) \leq \frac{\eta}{Z_i}+
\frac{1}{8}-\frac{\eta}{Z_i}= \frac{1}{8}$.
Thus $A^2$ eliminates all $h\in H_i$
with $d_i(h,h^*) \geq \frac{1}{4}$.
But that means
we have $\ \textsc{Disagree}_{D}(H'_i) \leq
\frac{1}{2}\textsc{Disagree}_{D}(H_i)$, terminating round $i$.
\footnote{The assumption in the theorem statement can
clearly be weakened to $\eta <
\frac{\epsilon}{8+\Delta}$ for any constant $\Delta>0$.
}
Finally notice that $A^2$ makes $O\left(\ln{\left(
\frac{1}{\delta'}\right)} \ln{ \left( \frac{1}{\epsilon} \right)}
\right)$ calls to the oracle, where
$\delta'=\frac{\delta}{N^2(\epsilon, \delta, H)}$ and
$N(\epsilon,\delta,H)$ is an upper bound on the number of bound
evaluations throughout the life of the algorithm. We clearly have
that the number of bound evaluations required in round $i$ is
$O\left(\frac{\ln{\frac{1}{\delta'}}}
{\left(\frac{1}{8}-\frac{\eta}{Z_i}\right)^{2}}\right)$. This
implies that the number of bound evaluations throughout the life
of the algorithm $N(\epsilon, \delta,H)$ should satisfy $c
\ln{\left( \frac{N^2(\epsilon, \delta,H)}{\delta}\right)} \ln{
\left( \frac{1}{\epsilon} \right)} \leq N(\epsilon, \delta,H)$,
for some constant $c$. Solving this inequality, we get the desired
result.
\end{proof}
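To make the halving dynamics above concrete, here is a minimal sketch of an active learner for thresholds in the noise-free (realizable) special case. It is \emph{not} the full $A^2$ algorithm (no UB/LB bound evaluations, no noise handling); the function name and the batch size are our own choices, used only to illustrate why the label complexity scales with $\ln\frac{1}{\epsilon}$ rather than $\frac{1}{\epsilon}$:

```python
import random

def active_learn_threshold(target, eps, batch=32, seed=0):
    # Illustrative sketch only (not the full A^2 algorithm): realizable
    # thresholds on [0, 1] under the uniform distribution, no noise.
    # Each round queries labels only inside the current disagreement
    # interval [lo, hi] and then shrinks it, so the total number of
    # labels grows with log(1/eps) rather than 1/eps.
    rng = random.Random(seed)
    lo, hi, labels = 0.0, 1.0, 0
    while hi - lo > eps:
        xs = [lo + (hi - lo) * rng.random() for _ in range(batch)]
        labels += batch                        # every sampled point is queried
        pos = [x for x in xs if x >= target]   # oracle label: sign(x - target)
        neg = [x for x in xs if x < target]
        if pos:
            hi = min(pos)
        if neg:
            lo = max(neg)
    return lo, hi, labels

lo, hi, labels = active_learn_threshold(target=0.37, eps=1e-4)
```

A passive learner would need on the order of $1/\epsilon$ labels to localize the threshold to width $\epsilon$; the sketch above uses a number of labels proportional to the number of rounds, which is logarithmic in $1/\epsilon$.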
\begin{theorem}
\label{treshold-high-noise} %%\emph{(Non-triviality II)}
Let $H$ be the set of thresholds on an interval with $\textrm{LB}$ and
$\textrm{UB}$ the VC bound. Suppose that $\epsilon<\frac{1}{2}$
and $\eta>\epsilon$. For all $D$, with probability $1-\delta$, the
algorithm $A^{2}$ will require at most
$\tilde{O}\left(\frac{\eta^{2}\ln\frac{1}{\delta}}{\epsilon^{2}}\right)$
labeled samples.
\end{theorem}
\textbf{Proof Sketch:}\quad
The proof is similar to the previous one.
Theorem~\ref{treshold-small-noise} implies
that loop (2) will complete $\log\frac{1}{\eta} - 4$ times. At
this point, the noise rate is large enough that the algorithm can
only halt via the return step in loop (2). In this case, we have
$\textsc{Disagree}_D(H)=\Theta(\eta)$ implying that the number of
samples required is
$\tilde{O}
\left(\frac{\eta^{2}\ln\frac{1}{\delta}}{\epsilon^{2}}\right)$.
%\left(\frac{\eta^{2}\ln\frac{\ln\frac{1}{\eta}}{\delta}}{\epsilon^{2}}\right)$.
~\qedblob
Note that Theorem~\ref{treshold-high-noise} asymptotically matches
a lower bound of Kaariainen~\cite{matti}.
\subsection{Linear Separators under the Uniform Distribution}
\label{ls} A commonly analyzed case for which active learning is
known to give exponential savings in the number of labeled
examples is when the data is drawn uniformly from the unit sphere
in $\reals^{\dimens}$, and the labels are consistent with a linear
separator going through the origin. We show that $A^2$ gives
exponential savings in this case even in the presence of noise.
Let $X=\{ x\in \reals^{d} : \Vert x\Vert = 1\}$, the unit sphere
in $\reals^{\dimens}$. Assume that $D$ is uniform over $X$, and
let $H$ be the class of linear separators through the origin. Any
$h\in H$ is a homogeneous hyperplane represented by a unit vector
$w\in X$ with the classification rule $h(x)=\mbox{sign}(w\cdot
x)$.
%Thus the hypothesis space is also the unit
%sphere in $\reals^{\dimens}$.
The distance between two hypotheses $u$ and $v$ in $H$ with
respect to a distribution $D$ (i.e., the probability that they
predict differently on a random example drawn from $D$) is given
by $d_D(u,v) = \frac{\mbox{arccos}(u\cdot v)}{\pi}$.
Finally, let $\theta(u,v)=\mbox{arccos}(u\cdot v)$. Thus $d_D(u,v)
= \frac{\theta(u,v)}{\pi}$.
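The identity $d_D(u,v) = \theta(u,v)/\pi$ is easy to sanity-check numerically. The following sketch (our own illustration, not part of the paper's algorithm) estimates the disagreement probability by Monte Carlo, using the fact that a standard Gaussian vector is uniform in direction:

```python
import math, random

def disagreement_mc(u, v, n=100_000, seed=1):
    # Monte Carlo estimate of Pr_{x uniform on the sphere}[sign(u.x) != sign(v.x)].
    # A standard Gaussian vector is uniform in direction, so there is no
    # need to normalize x before checking the sign of a dot product.
    rng = random.Random(seed)
    d = len(u)
    diff = 0
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        su = sum(ui * xi for ui, xi in zip(u, x)) >= 0.0
        sv = sum(vi * xi for vi, xi in zip(v, x)) >= 0.0
        diff += (su != sv)
    return diff / n

theta = math.pi / 5                         # angle between the two separators
u = [1.0, 0.0, 0.0]
v = [math.cos(theta), math.sin(theta), 0.0]
est = disagreement_mc(u, v)                 # should be close to theta / pi = 0.2
```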
\begin{theorem}
\label{circle-unif} Let $X$, $H$, and $D$ be as defined above, and
let $\mbox{LB}$ and $\mbox{UB}$ be the VC bound. Then for any $0<
\epsilon<\frac{1}{2}$, \mbox{$0<
\eta<\frac{\epsilon}{16\sqrt{d}}$}, and $\delta>0$, with
probability $1-\delta$, $A^{2}$ requires
$$O\left(d\left(d\ln d + \ln\frac{1}{\delta'}\right) \ln\frac{1}{\epsilon}\right)$$
%$$O\left(d^2\log d
%\log\frac{\log\frac{1}{\epsilon}}{\delta}\log\frac{1}{\epsilon}\right)$$
calls to the labeling oracle, where
$\delta'=\frac{\delta}{N^2(\epsilon, \delta, H)}$ and
$N(\epsilon,\delta, H) = O\left( \ln{ \frac{1}{\epsilon}
}\left( d^2 \ln d + \ln { \frac{d \ln{
\frac{1}{\epsilon} }}{\delta}} \right)\right)$.
\end{theorem}
\begin{proof}
Let $w^*\in H$ be a hypothesis with the minimum error rate $\eta$.
Denote the region of uncertainty in round $i$ by $R_i$. Thus
$\mathbf{Pr}_{x\sim D}[x\in R_i] = \textsc{Disagree}_{D}(H_i)$.
Consider round $i$ of $A^2$. Initially $i=1$, corresponding to
$H_i=H$, $D_i=D$, and $\textsc{Disagree}_D(H_i)=1$.
Theorem~\ref{VCbound} says that it suffices to query the oracle on
a set $S$ of $O(d^2 \ln d + d \ln \frac{1}{\delta'})$ examples
from $D_i$ to guarantee, with probability $1-\delta'$, that for
all $w\in H_i$,
$$|\err_{D_i,O}(w)-\widehat{\err}_{D_i,O}(w)| <
\frac{1}{2}\left(\frac{1}{8\sqrt{d}} - \frac{\eta}{r_i}\right),$$
where $r_i$ is a shorthand for $\textsc{Disagree}_D(H_i)$. (By
assumption, $\eta < \frac{\epsilon}{16\sqrt{d}} $. We also have
$\textsc{Disagree}_D(H_i)\geq \epsilon$. Thus the precision above
is at least $\frac{1}{16\sqrt{d}}$.)\footnote{ The assumption in
the theorem statement can clearly be weakened to $\eta <
\frac{\epsilon}{(8+\Delta)\sqrt{d}}$ for any constant $\Delta>0$.
} This implies that $ \mbox{UB}(S,w,\delta') - \err_{D_i,O}(w) <
\frac{1}{8\sqrt{d}}- \frac{\eta}{r_i}, $ and $ \err_{D_i,O}(w) -
\mbox{LB}(S,w,\delta') < \frac{1}{8\sqrt{d}}- \frac{\eta}{r_i}. $
Consider any $w\in H_i$ with $d_{D_i}(w,w^*)\geq
\frac{1}{4\sqrt{d}}$. For any such $w$, we have
$\mbox{err}_{D_i,O}(w)\geq \frac{1}{4\sqrt{d}}-\frac{\eta}{r_i}$,
and so $$\mbox{LB}(S,w,\delta') >
\frac{1}{4\sqrt{d}}-\frac{\eta}{r_i}-\frac{1}{8\sqrt{d}} +
\frac{\eta}{r_i} = \frac{1}{8\sqrt{d}}.$$ But we also know that
$\mbox{err}_{D_i,O}(w^*)=\frac{\eta}{r_i}$, and thus
$\mbox{UB}(S,w^*,\delta') < \frac{\eta}{r_i}+\frac{1}{8\sqrt{d}} -
\frac{\eta}{r_i}= \frac{1}{8\sqrt{d}}$, so $A^2$ will eliminate
$w$ in step 2(c)(iii).
Thus round $i$ eliminates all hypotheses $w\in H_i$ with
$d_{D_i}(w,w^*)\geq \frac{1}{4\sqrt{d}}$. Since all hypotheses in
$H_i$ agree on every $x\not\in R_i$, we have $$d_{D_i}(w,w^*) =
\frac{1}{r_i}d_D(w,w^*) = \frac{\theta(w,w^*)}{\pi r_i}.$$ Thus
round $i$ eliminates all hypotheses $w\in H_i$ with $\theta(w,w^*)
\geq \frac{\pi r_i}{4\sqrt{d}}$. But since $2 \theta/\pi \leq \sin
\theta$, for $\theta\in (0,\frac{\pi}{2}]$, it certainly
eliminates all $w$ with $\sin\theta(w,w^*) \geq
\frac{r_i}{2\sqrt{d}}$.
Consider any $x\in R_{i+1}$ and the value $|w^*\cdot x| =
\cos\theta(w^*,x)$. There must exist a hypothesis $w\in H_{i+1}$
that disagrees with $w^*$ on $x$; otherwise $x$ would not be in
$R_{i+1}$. But then we must have $\cos\theta(w^*,x) \leq
\cos(\frac{\pi}{2}-\theta(w,w^*)) = \sin \theta(w,w^*) <
\frac{r_i}{2\sqrt{d}}$, where the last inequality is due to the
fact that $A^2$ eliminated all $w$ with $\sin\theta(w,w^*) \geq
\frac{r_i}{2\sqrt{d}}$. Thus any $x\in R_{i+1}$ must satisfy
$|w^*\cdot x| < \frac{r_i}{2\sqrt{d}}$.
Using the fact that $\mathbf{Pr}(A\, |\, B) =
\frac{\mathbf{Pr}(AB)}{\mathbf{Pr}(B)} \leq
\frac{\mathbf{Pr}(A)}{\mathbf{Pr}(B)}$ for any $A$ and $B$, we can
write
\begin{align*}
\mathbf{Pr}_{x\sim D_i} \left[ x\in R_{i+1}\right] \leq
\mathbf{Pr}_{x\sim D_{i}} \left[
|w^*\cdot x| \leq \frac{r_i}{2\sqrt{d}} \right] \\
\leq \frac{\mathbf{Pr}_{x\sim D}\left[
|w^*\cdot x| \leq \frac{r_i}{2\sqrt{d}}\right]}{\mathbf{Pr}_{x\sim
D} [x\in R_{i}]} \leq \frac{r_i}{2r_i} = \frac{1}{2},
\end{align*}
where the last inequality follows from Lemma~\ref{lemma1}. Thus
$\textsc{Disagree}_D(H_{i+1})\leq
\frac{1}{2}\textsc{Disagree}_D(H_{i})$, as desired.
\begin{figure}[h]
\begin{center}
%\scalebox{0.6}{\includegraphics{sphere-filled.png}}
\scalebox{0.55}{\includegraphics{sphere-complex.png}}
\vskip -2in
\begin{picture}(170,120)
\setlength{\unitlength}{1.2pt} \thinlines
\put(71,63){\line(0,1){55}}
%\put(72.5,68){\line(-7,20){21.7}}
%\put(72.5,68){\line(7,20){21.7}}
\put(68,125){\large $w^*$} \put(-20,60){\Large
$\frac{1}{2\sqrt{d}}$}
\end{picture}
\caption{The region of uncertainty after the first iteration
(schematic).}
\end{center}
\end{figure}
To finish the argument, it suffices to notice that since every
round cuts $\textsc{Disagree}_D(H_i)$ at least in half, the total
number of rounds is bounded by $\log\frac{1}{\epsilon}$. Notice
also that $A^2$ makes $O \left( \left(d^2 \ln d + d \ln \frac{1}{\delta'}
\right) \ln{ \frac{1}{\epsilon} } \right)$ calls to the
oracle, where $\delta'=\frac{\delta}{N^2(\epsilon, \delta, H)}$
and $N(\epsilon,\delta,H)$ is an upper bound on the number of
bound evaluations throughout the life of the algorithm. We clearly
have that the number of bound evaluations required in round $i$ is
$O \left(d^2 \ln d + d \ln \frac{1}{\delta'} \right) $. This
implies that the number of bound evaluations throughout the life
of the algorithm $N(\epsilon, \delta,H)$ should satisfy $ c \left(
d^2 \ln d + d \ln{\left( \frac{N^2(\epsilon,
\delta,H)}{\delta}\right)} \right) \ln{ \left( \frac{1}{\epsilon}
\right)} \leq N(\epsilon, \delta,H)$ for some constant $c$.
Solving this inequality, we get the desired result.
\end{proof}
For comparison, the query complexity of the Perceptron-based
active learner of \singleemcite{dkm} is $O(d\ln
\frac{1}{\epsilon\delta} (\ln
\frac{d}{\delta}+\ln\ln\frac{1}{\epsilon}))$, for the same $H$,
$X$, and $D$, but only for the realizable case when $\eta=0$.
\section{Discussion and Open Questions}
The results here should be regarded as a first
proof of possibility. $A^2$
suggests a number of interesting open
questions:
\begin{enumerate}
\item For what other (hypothesis space, distribution) pairs can we
observe exponential speedups? Is there an algorithm that is more
sample efficient than $A^2$? Does $A^2$
always achieve speedups comparable to the ones achieved
by the selective sampling algorithm~\cite{CAL},
but in the presence of (limited) noise?
\item Are there concept classes for which $A^{2}$ (or some other algorithm)
can be made computationally efficient? Checking for
disagreement amongst all remaining classifiers can be very
computationally intensive.
\item
What conditions are sufficient and necessary for active learning to succeed
in the agnostic case?
\end{enumerate}
\shrink{
\paragraph{Relation to Splittability}
\singleemcite{sanjoy} identified sufficient and semi-necessary conditions
for active learning to succeed in the realizable case.
We introduce some terminology.
A subset $Q\subseteq H\times H$ is said to be $r$-\emph{split} by an example $x$
if at least an $r$-fraction of pairs of hypotheses
in $Q$ disagree on how to label $x$.
Let $Q_{\epsilon}=\{(h,h')\in Q : d_D(h,h')>\epsilon\}$.
A subset $H'$ of $H$ is called
$(\rho,\epsilon,\tau)$-\emph{splittable} if for all $Q\subset H'\times H'$,
$
\Pr_{x\sim D} [x\ \rho\mbox{-splits}\ Q_{\epsilon}] \geq \tau.
$
Let $B(h^*,r) = \{ h \in H : d_D(h,h^*) \leq r \}$ be a ball of radius
$r$ around an optimal hypothesis $h^*$.
The upper bound in \cite{sanjoy} says that if
for all $\Delta \geq \epsilon/2$, $B(h^*,4\Delta)$ is
$(\rho,\Delta,\tau)$-splittable, then it takes
$O(\frac{1}{\rho}\ln\frac{1}{\epsilon}\ln\frac{1}{\epsilon\tau})$
labeled examples to output an $\epsilon$-close hypothesis,
assuming realizability.
Let $H$ be any hypothesis space which is $(\rho,\Delta,\tau)$-splittable.
Consider
$H'$ which for every $x$, adds a hypothesis $h_x$
satisfying for all $x'\not= x$,
$h_x (x') = h^*(x')$ and $h_x(x) = 1-h^*(x)$.
As the probability of any one $x$ goes to zero, this does not alter the
splittability, but it implies $A^2$ must query for the label of every
example.
So, while splittability may be sufficient in the realizable case, it is
not sufficient for $A^2$. A natural question is: ``What is sufficient in the
agnostic case?''
}
{\small
\begin{thebibliography}{}
\bibitem[Angluin, 1988][Angluin][1988]{angluin}
Angluin, D. (1988).
\newblock Queries and concept learning.
\newblock {\em Machine Learning}, {\em 2}, 319--342.
\bibitem[Angluin, 2001][Angluin][2001]{angluin2}
Angluin, D. (2001).
\newblock Queries revisited.
\newblock {\em Proceedings of the 12th International Conference on Algorithmic
Learning Theory}.
\bibitem[Anthony \& Bartlett, 1999][Anthony and Bartlett][1999]{AB99}
Anthony, M., \& Bartlett, P. (1999).
\newblock {\em {Neural Network Learning: Theoretical Foundations}}.
\newblock Cambridge University Press.
\bibitem[Baum \& Lang, 1992][Baum and Lang][1992]{baum}
Baum, E., \& Lang, K. (1992).
\newblock Query learning can work poorly when a human oracle is used.
\newblock {\em International Joint Conference on Neural Networks}.
\bibitem[Blumer et~al.\/, 1987][Blumer et~al.\/][1987]{occam}
Blumer, A., Ehrenfeucht, A., Haussler, D., \& Warmuth, M. (1987).
\newblock Occam's razor.
\newblock {\em Information Processing Letters}, {\em 24}, 377--380.
\bibitem[Bshouty \& Eiron, 2002][Bshouty and Eiron][2002]{BE}
Bshouty, N.~H., \& Eiron, N. (2002).
\newblock Learning monotone dnf from a teacher that almost does not answer
membership queries.
\newblock {\em Journal of Machine Learning Research}, {\em 3}, 49--57.
\bibitem[Burnashev \& Zigangirov, 1974][Burnashev and Zigangirov][1974]{BZ}
Burnashev, M., \& Zigangirov, K. (1974).
\newblock An interval estimation problem for controlled observations.
\newblock {\em Problems in Information Transmission}, {\em 10}, 223--231.
\bibitem[Castro et~al.\/, 2005][Castro et~al.\/][2005]{CWN}
Castro, R., Willett, R., \& Nowak, R. (2005).
\newblock Fast rates in regression via active learning.
\newblock {\em University of Wisconsin Technical Report ECE-05-03}.
\bibitem[Cohn et~al.\/, 1994][Cohn et~al.\/][1994]{CAL}
Cohn, D., Atlas, L., \& Ladner, R. (1994).
\newblock Improving generalization with active learning.
\newblock {\em Machine Learning}, {\em 15(2)}, 201--221.
\bibitem[Czyzowicz et~al.\/, 1989][Czyzowicz et~al.\/][1989]{ulam}
Czyzowicz, J., Mundici, D., \& Pelc, A. (1989).
\newblock Ulam's searching game with lies.
\newblock {\em Journal of Combinatorial Theory, Series A}, {\em 52}, 62--76.
\bibitem[Dasgupta, 2004][Dasgupta][2004]{greedy}
Dasgupta, S. (2004).
\newblock Analysis of a greedy active learning strategy.
\newblock {\em Advances in Neural Information Processing Systems (NIPS)}.
\bibitem[Dasgupta, 2005][Dasgupta][2005]{sanjoy}
Dasgupta, S. (2005).
\newblock Coarse sample complexity bounds for active learning.
\newblock {\em Neural Information Processing Systems}.
\bibitem[Dasgupta et~al.\/, 2005][Dasgupta et~al.\/][2005]{dkm}
Dasgupta, S., Kalai, A., \& Monteleoni, C. (2005).
\newblock Analysis of perceptron-based active learning.
\newblock {\em COLT}.
\bibitem[Freund et~al.\/, 1993][Freund et~al.\/][1993]{QBC}
Freund, Y., Seung, H.~S., Shamir, E., \& Tishby, N. (1993).
\newblock Information, prediction, and query by committee.
\newblock {\em Neural Information Processing Systems}.
\bibitem[Jackson, 1997][Jackson][1997]{mq}
Jackson, J. (1997).
\newblock An efficient membership-query algorithm for learning dnf with respect
to the uniform distribution.
\newblock {\em Journal of Computer and System Sciences}, {\em 55(3)}.
\bibitem[Kaariainen, 2005][Kaariainen][2005]{matti}
Kaariainen, M. (2005).
\newblock On active learning in the non-realizable case.
\newblock {\em NIPS Workshop on Foundations of Active Learning}.
\bibitem[Vapnik \& Chervonenkis, 1971][Vapnik and Chervonenkis][1971]{VC}
Vapnik, V., \& Chervonenkis, A. (1971).
\newblock On the uniform convergence of relative frequencies of events to their
probabilities.
\newblock {\em Theory of Probability and its Applications}, {\em 16(2)},
264--280.
\end{thebibliography}
}
\appendix
\section{Appendix}
\label{app:batch}
We use the following standard sample complexity bound from
\singleemcite{AB99}.
\begin{theorem}
\label{VCbound} Suppose that $\hclass$ is a set of functions from
$X$ to $\{-1,1\}$ with finite VC-dimension $\vcdim \geq 1$. Let
$\distr$ be an arbitrary, but fixed probability distribution over
$X \times \{-1,1\}$. For any $\epsilon$, $\delta>0$, if we draw a
sample from $\distr$ of size $$m (\epsilon, \delta, \vcdim) =
\frac{64}{\epsilon^2} \left(2\vcdim \ln {\left(
\frac{12}{\epsilon} \right)}+ \ln {\left( \frac{4}{\delta}
\right)} \right),$$ then with probability at least $1-\delta$, we
have $\left|\err(h) - \herr(h)\right| \leq \epsilon$ for all
%hypotheses
$\hyp \in \hclass$.
\end{theorem}
In section~\ref{ls} we make use of the following lemma. For a
proof see, for example, \singleemcite{dkm}.
\begin{lemma}\label{lemma1}
\label{sphere} For any fixed unit vector $w$ and any $0< \gamma
\leq 1$,
$$
\frac{\gamma}{4} \leq \mathbf{Pr}_{x} \left[ |w\cdot x| \leq
\frac{\gamma}{\sqrt{d}} \right] \leq \gamma,
$$
where $x$ is drawn uniformly from the unit sphere.
\end{lemma}
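As a numerical sanity check of Lemma~\ref{lemma1} (our own illustration; the function name and the particular values of $d$ and $\gamma$ are arbitrary), one can estimate the band probability by Monte Carlo:

```python
import math, random

def band_probability(d, gamma, n=100_000, seed=2):
    # Estimate Pr_{x uniform on the unit sphere in R^d}[|w . x| <= gamma/sqrt(d)]
    # for w = e_1; by symmetry, w . x is the first coordinate of a
    # normalized standard Gaussian vector.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(gi * gi for gi in g))
        hits += (abs(g[0] / norm) <= gamma / math.sqrt(d))
    return hits / n

p = band_probability(d=20, gamma=0.5)
# The lemma predicts gamma/4 = 0.125 <= p <= gamma = 0.5.
```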
\end{document}