\documentclass{article}
\usepackage{icml2005}
%\usepackage[accepted]{icml2005}
\usepackage{epsf}
\usepackage{mlapa}
\usepackage{pst-node}
\usepackage{bibspacing}
\setlength{\bibspacing}{\baselineskip}
%\usepackage{times}
\usepackage[small,compact]{titlesec}
\usepackage{algorithm,algorithmic}
\providecommand{\tabularnewline}{\\}
\newcommand{\lossfunc}{{\ell}}
%\newcommand{\task}{{\Gamma}}
\newcommand{\task}{{\cal T}}
\newcommand{\bE}{{\mathbf E}}
\newcommand{\cD}{{\cal D}}
\newcommand{\shrink}[1]{}
\def\blackslug
{\hbox{\hskip 1pt\vrule width 5pt height 8pt depth 1.5pt\hskip 1pt}}
\def\qed{\hfill\quad\blackslug\lower 8.5pt\null\par}
\newcommand{\dt}{{\mathrm{d}t}}
\usepackage{graphicx,latexsym,amsmath,amssymb,pstricks,amsthm}
%\configure[pdfgraphic]
\setlength\textheight{9.1in}
\theoremstyle{plain}
\newtheorem{theorem}{Theorem}
\newtheorem{proposition}{Proposition}[section]
\newtheorem{lemma}{Lemma}
\newtheorem{definition}{Definition}
\makeatletter
\numberwithin{equation}{section} %% Comment out for sequentially-numbered
\numberwithin{figure}{section} %% Comment out for sequentially-numbered
\DeclareMathOperator*{\Exp}{E}
%\addtolength{\textwidth}{3.0in}
%\addtolength{\oddsidemargin}{-1.7in}
%\addtolength{\evensidemargin}{-1.5in}
\renewcommand{\baselinestretch}{0.96}
%\renewcommand{\baselinestretch}{0.93}
\begin{document}
\twocolumn[
\icmltitle{Error Limiting Reductions Between Classification Tasks}
\icmlauthor{}{}
\icmladdress{Keywords: supervised learning, reductions, classification}
\vskip 0.3in
]
\shrink{
\author{Alina Beygelzimer$^1$ \and Varsha Dani$^2$ \and Tom Hayes$^3$ \and John Langford$^3$ \and \\ Bianca Zadrozny$^1$}
\institute{IBM TJ Watson \and University of Chicago \and TTI-Chicago}
}
\begin{abstract}
We introduce a reduction-based model for analyzing supervised learning
tasks. We use this model to devise a new reduction
from cost-sensitive classification to binary classification with the following
guarantee: If the learned binary classifier has error rate at most $\epsilon$
then the cost-sensitive classifier has cost at most $2\epsilon$ times the
expected sum of costs of all choices.
Since cost-sensitive classification can embed any bounded loss finite choice
supervised learning task, this result shows that \emph{any} such task
can be solved using a binary classification oracle.
Finally, we present experimental results showing
that our new reduction outperforms existing
algorithms for multi-class cost-sensitive learning.
\end{abstract}
\section{Introduction}
%{\bf The Problem}.
Supervised learning involves learning a function from supplied
examples. There are many natural supervised learning tasks;
classification alone comes in a number of different forms (e.g., binary,
multi-class, cost-sensitive, importance weighted). Generally, there are two
ways to approach a new task.\footnote{We use ``task'' instead of
the more common ``problem''
because we distinguish between them formally.}
The first is to solve it independently from other tasks.
The second is to reduce it to a task that has already
been thoroughly analyzed, automatically transferring existing theory
and algorithms from well-understood tasks to new tasks. We
investigate (and advocate) the latter approach.
{\bf Motivation}.
Informally, a reduction is a learning machine that solves one task
using oracle (i.e., black-box) access to a solver for another task.
%Thus, if the latter task can be solved well then so can the former.
Reductions present a powerful tool with a number of desirable properties:
\begin{enumerate}\addtolength{\itemsep}{-0.5\baselineskip}
\item
\vskip -.1in
They provide a formal way of comparing how relatively easy different tasks are.
If task $A$ reduces to task $B$, it is reasonable to say that
$A$ is not much harder than $B$. If in addition
$B$ also reduces to $A$, we can conclude that $A$ and $B$ are similar.
The more restricted the notion of a reduction, the stronger the conclusions.
\item
If $A$ can be reduced to $B$, then techniques for solving
$B$ can be used to solve $A$, transferring results from one learning
task to another. This saves both research and development effort.
\item
Reductions allow one to study an entire class of tasks by focusing on
a single primitive (they also help to identify such primitives).
Any progress on the primitive immediately implies
progress for every task in the class. It also simplifies
the evaluation of different learning algorithms for effectiveness.
\item
Viewed as negative statements, reductions imply relative lower bounds
stating that progress on one task cannot be achieved without making
progress on some primitive.
\end{enumerate}
\vskip -.1in There is an essential discrepancy between theory and
practice of supervised learning. On one hand, it is impossible to
prove, without making any assumptions, that we can learn to predict
well. (For example, in PAC-learning we might assume that all examples
are sampled independently and the problem is predicting an unknown
decision list.) On the other hand, learning algorithms are often
successfully used in practice.
Reductions give us a way to cope with this discrepancy.
They provide us with a useful tool to show \emph{relative} ability to learn
-- relative to basic tasks that are better understood.
In other words, reductions allow us to make statements
of the form ``Good performance of the oracle on subproblems generated by the
reduction translates into good performance on the original problem''.
Such statements can be made without any assumptions
that cannot be verified, or do not generally hold in practice.
We propose an analysis framework allowing us to prove such error transformation
guarantees. Attempts to understand this framework in terms of sample
complexity based models of learning \cite{vapnik,valiant} will
fail; it is important to understand that this is a different model of analysis.
To test the model, we consider learning machines which have oracle
access to a solver for classification problems, and ask what can
be done with such a machine.
Section~\ref{weighted-all-pairs}
presents a new reduction from cost-sensitive
classification to binary classification and quantifies the way in which errors
by the binary classification oracle induce errors in the original
problem. Since cost-sensitive classification (see the definitions
in Section~\ref{defi}) can express any
bounded loss finite choice
supervised learning task, this result shows that \emph{any} such task
can be solved using a binary classification oracle.
To give further evidence that the proposed
framework allows for a tractable analysis,
we show how to analyze several existing algorithms in this model (see
Section~\ref{section:examples}).
In addition to the above theoretical motivations, there is
significant empirical evidence (see \cite{costing}, \cite{Boosting},
\cite{ECOC}) that this style of analysis often produces learning
algorithms that perform well in practice. Section~\ref{experiments} presents
experimental results showing that our new reduction outperforms existing
algorithms for multi-class cost-sensitive learning.
\shrink{
{\bf Results}. This paper makes the following (definitional and
constructive) contributions.
\begin{enumerate}
\item
We define a class of supervised learning tasks, and present a new,
reduction-based model for the analysis of
such tasks.
\item We show how to analyze several existing reductions in this model.
\item
We present a new reduction from \emph{all} bounded loss finite choice
supervised learning tasks to binary classification. This new reduction
is empirically tested and shown to be sound for real-world problems.
\end{enumerate}
{\bf Outline}. We present in the order (1), (3), (2), and close with a
discussion, including relevant work.
}
%\section{The Error-Limiting Reduction Model: Preliminaries}\label{defi}
\section{Basic Definitions}\label{defi}
This section defines the basic concepts.
%\subsection{Basic Definitions}\label{defi}
\begin{definition}
A \emph{supervised learning task} is a tuple $(K,Y,\lossfunc)$, where
$K$ is the space of supervised advice available at training time,
$Y$ is a prediction space, and
$\lossfunc:K\times Y \to [0,\infty)$ is a loss function.
% (nonconstant in both arguments) loss function.
\end{definition}
We assume that $\lossfunc$ is nonconstant in both arguments,
which is satisfied by any reasonable loss function.
To test the definition, we instantiate it for common
classification tasks. For {\it binary classification},
$K=\{0,1\}$ (the space of labels given for training examples),
$Y=\{0,1\}$, and $\ell(k,y)=I(k\not=y)$,
where $I(\cdot)$ is the indicator function which is $1$ when the argument
is true and $0$ otherwise. A simple generalization
to more than two labels
gives \emph{multiclass classification} with
$K=Y=\{1,\ldots,r\}$ and $\ell(k,y)=I(k\not=y)$.
\shrink{
\begin{center}
\renewcommand{\arraystretch}{1.1}
\begin{tabular}{|c|c|c|c|}
\hline
Classification Task&
$K$&
$Y$&
$\lossfunc$ \\
\hline \hline
Binary &
$\{0,1\}$ &
$\{0,1\}$ &
$I(k \neq y)$ \\
\hline
Multiclass &
$\{1,\ldots,r\}$ &
$\{1,\ldots,r\}$ &
$I(k \neq y)$ \\
\hline
\hskip .1in Importance Weighted \hskip .1in &
$\,\{1,\ldots,r\} \times [0,\infty)\,$ &
$\,\{1,\ldots,r\}\,$ &
$\,k_2 I(k_1 \neq y)\,$ \\
\hline
Subset &
$2^{\{1,\ldots,r\}}$ &
$\{1,\ldots,r\}$ &
$I(y \not \in k)$
\tabularnewline
\hline
Cost-Sensitive &
$[0,\infty)^r$ &
$\{1,\ldots,r\}$ &
$k_y$
\tabularnewline
\hline
\end{tabular}
\end{center}
}
In general, the space $K$ of advice provided at training time need
not equal the prediction space $Y$. We sometimes need this additional
flexibility (also used in \cite{Haussler}) to provide some information
to the learning algorithm about the loss of different choices. For example, in
\emph{importance weighted classification}, we want to specify
that predicting some examples correctly is more important than
predicting others, which is done by letting
$K=\{1,\ldots,r\}\times [0,\infty)$ and
$\lossfunc(\langle k_1,k_2\rangle,y)=k_2 I(k_1\not= y)$.
In \emph{cost-sensitive classification}, $K$ is used
to associate a cost to each prediction $y \in Y$ by setting
$K=[0,\infty)^r$ and $\lossfunc(\langle k_1,\ldots,k_r\rangle ,y)=k_y$.
In both examples above, the prediction space
$Y$ is $\{1,\ldots,r\}$.
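These loss functions transcribe directly into code. The following Python sketch (our naming, not part of the paper; labels are 1-based as in the text, except for binary) instantiates the three examples above:

```python
def binary_loss(k, y):
    # K = Y = {0, 1}; loss is the indicator I(k != y)
    return int(k != y)

def importance_weighted_loss(k, y):
    # K = {1,...,r} x [0, inf); k = (label, importance)
    label, importance = k
    return importance * int(label != y)

def cost_sensitive_loss(k, y):
    # K = [0, inf)^r; k = (k_1, ..., k_r) assigns a cost to each y
    return k[y - 1]
```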
A task is (implicitly) a set of learning problems, each of which
requires more information to be fully specified.
\begin{definition}
A \emph{supervised learning problem} is a tuple $(D,X,\task)$, where
$\task =(K,Y,\lossfunc)$
is a supervised learning task, $X$ is an arbitrary feature space, and
$D$ is a distribution over $X\times K$.
\end{definition}
The goal in solving a supervised learning problem is to find a
hypothesis $h:X\rightarrow Y$ minimizing the expected loss
$E_{(x,k)\sim D}\lossfunc(k,h(x))$. Finding $h$ is difficult because
$D$ is generally not known at training time. For any particular
hypothesis $h$, the loss rate \[h_D = E_{(x,k) \sim D}
\lossfunc(k,h(x))\] is a fundamental quantity of interest.
\begin{definition}
A \emph{supervised learning algorithm for task $(K,Y,\lossfunc)$}
is a procedure mapping any finite set of examples in $(X\times K)^*$ to a
hypothesis $h:X\rightarrow Y$.
\end{definition}
We do not require that training and test examples are drawn from the
same distribution. We also do not require the samples to agree with
some hypothesis in a predetermined class, as in the PAC learning
model.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We use these definitions in Section~\ref{reductions} to define
the error limiting reductions model, mentioned informally in
the introduction.
Ultimately, the real test of a learning model is whether it allows for
tractable analysis leading to useful learning algorithms.
Thus to motivate the formulations of the model, we first show an example
of what can be done with this model.
In particular, we present a new reduction
from cost-sensitive classification to binary classification and
prove its error-transformation bounds.
%After presenting the model in Section~\ref{reductions},
To give further evidence,
Section~\ref{section:examples} shows that
the model captures several existing machine learning algorithms.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Reduction from Cost Sensitive Classification to Binary Classification}
\label{weighted-all-pairs}
We present a new reduction, Weighted All-Pairs, from $r$-class cost sensitive
classification to importance weighted binary classification. It can
be composed with the Costing reduction \cite{costing} to yield a
reduction to binary classification. Since cost sensitive
classification can express any supervised learning task we consider,
this implies that \emph{any} such task can be solved using binary
classification. (The reduction is defined for any feature space $X$, and
we assume that $X$ has been fixed and all distributions are over
$X\times [0,\infty)^r$.)
\shrink{Before starting, note that we can transform any cost
sensitive example $(x,k_1, \dots, k_r)$ according to $k_i \leftarrow
k_i - \min_{i'} k_{i'}$. Optimizing the loss of the second problem is
equivalent to optimizing the loss of the first problem (with an
additive offset). For simplicity, we only consider cost sensitive
problems after this canonicalization step.}
\subsection{The Algorithm}
For a given cost-sensitive example $(x,k_1, \dots, k_r)$, let $L(t)$
be the function $L(t) = | \{j \mid k_j \le t\} |$, for $t \in [0,
\infty)$. By shifting, we may assume that the minimum cost $k_{\min}
= 0$, so that $t \ge 0$ implies $L(t) \ge 1$.
Optimizing the loss of the shifted problem is
equivalent to optimizing the loss of the original problem.
To define the reduction, we need values $v_i$ defined by
\[
v_i = \int_0^{k_i} \frac{1}{L(t)} \, \dt.
\]
Note that the values are
order-preserving: $k_i < k_j$ iff $v_i < v_j$ for all $i$ and $j$.
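For example (our illustration), with $r = 3$ and costs $(k_1,k_2,k_3) = (0,1,3)$,
\[
L(t) = \begin{cases} 1, & 0 \le t < 1,\\ 2, & 1 \le t < 3,\\ 3, & t \ge 3,\end{cases}
\qquad \mbox{so} \qquad
v_1 = 0,\quad v_2 = \int_0^1 \frac{\dt}{1} = 1,\quad v_3 = 1 + \int_1^3 \frac{\dt}{2} = 2.
\]
The resulting pair importances $|v_2 - v_1| = 1$, $|v_3 - v_1| = 2$, and
$|v_3 - v_2| = 1$ sum to $\sum_i k_i = 4$.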
The algorithms specifying the reduction are given below.
\begin{algorithm}
\caption{{\bf WAP-Train} (Set of $r$-class cost sensitive examples $S$,
importance weighed binary classifier learning algorithm $B$)\vskip .05in}
\vskip .05in
Set $S'=\emptyset$.

\noindent {\bf for all} examples $(x,k_1, \dots, k_r)$ in $S$ {\bf do}

\noindent \hskip .3in {\bf for all} pairs $(i,j)$ with $1\leq i < j\leq r$ {\bf do}

\noindent \hskip .7in Add an importance weighted example
$(\langle x,i,j\rangle, I(k_i < k_j), |v_j - v_i|)$ to $S'$.

\noindent \hskip .3in {\bf end for}

\noindent {\bf end for}

\noindent Return $h=B(S')$.
\vskip .05in
\end{algorithm}
We say that label $i$ \emph{beats} label $j$ for input $x$
if either $i < j$ and $h(\langle x,i,j\rangle) = 1$, or $i > j$
and $h(\langle x,j,i\rangle) = 0$.
\begin{algorithm}
\caption{{\bf WAP-Test} (classifier $h$, example $x$)\vskip .05in}
{\bf for all} pairs $(i,j)$ with $1\leq i < j\leq r$ {\bf do}

\noindent \hskip .3in Evaluate $h(\langle x,i,j\rangle)$

\noindent {\bf end for}

\noindent Output $\mbox{argmax}_i |\{j \mid i\ \mbox{beats}\ j\}|$.
\vskip .05in
\end{algorithm}
Note that if $h$ makes no errors and $k_i \ne k_j$, then
label $i$ beats label $j$ exactly when $k_i < k_j$. {\bf WAP-Test}
outputs the label which beats the maximum number of other labels,
with ties broken arbitrarily.
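For concreteness, here is a minimal Python sketch of the reduction (our illustrative names, not the authors' implementation): `v_of` computes the values $v_i$ by integrating $1/L(t)$, and `wap_train_pairs` and `wap_test` mirror {\bf WAP-Train} on a single example and the tournament decoding of {\bf WAP-Test}:

```python
from itertools import combinations

def v_of(k, costs):
    # v(k) = integral over [0, k) of 1/L(t) dt, with
    # L(t) = |{j : costs[j] <= t}|; costs are assumed shifted so
    # that min(costs) == 0, hence L(t) >= 1 on the integration range.
    pts = sorted(set(costs) | {k})
    total, prev = 0.0, 0.0
    for p in pts:
        if p > k:
            break
        if p > prev:
            L = sum(1 for c in costs if c <= prev)  # L constant on [prev, p)
            total += (p - prev) / L
        prev = p
    return total

def wap_train_pairs(x, costs):
    # WAP-Train on one cost-sensitive example: emit one importance
    # weighted binary example per label pair (labels are 1-based).
    costs = [c - min(costs) for c in costs]
    v = [v_of(c, costs) for c in costs]
    return [((x, i + 1, j + 1), int(costs[i] < costs[j]), abs(v[j] - v[i]))
            for i, j in combinations(range(len(costs)), 2)]

def wap_test(h, x, r):
    # Tournament decoding: i beats j if i < j and h(x,i,j) = 1, or
    # i > j and h(x,j,i) = 0; output a label beating the most others.
    beats = {i: 0 for i in range(1, r + 1)}
    for i, j in combinations(range(1, r + 1), 2):
        beats[i if h(x, i, j) == 1 else j] += 1
    return max(beats, key=beats.get)
```

On the example with costs $(0,1,3)$, the three pair importances are $1$, $2$, and $1$, summing to $\sum_i k_i = 4$, and an error-free classifier makes `wap_test` return the minimum cost label.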
Before stating the error transform, we must first define the
distribution induced by the reduction on the importance-weighted
binary classifier (to which the reduction makes a single call).
To draw a sample from this distribution, we first draw
a cost sensitive sample $(x,k_1,\ldots,k_r)$ from the input distribution
$D$ and then apply {\bf WAP-Train} to the singleton
set $\{(x,k_1,\ldots,k_r)\}$
to get a sequence of ${r\choose 2}$ examples for the binary classifier.
Now we just sample uniformly from this set. Overloading notation, we denote the
induced distribution by $\mbox{\bf WAP-Train}(D)$.
Recall that the loss rate of classifier $h$ on distribution
$D$ is denoted by $h_D$.
\begin{theorem}\label{thm:CSCtoIW}
\emph{(WAP error efficiency)}
For all importance weighted binary learning algorithms $B$ and
cost-sensitive datasets $S$, let $h=\mbox{\bf WAP-Train}(S,B)$. Then
for all cost-sensitive test distributions $D$,
\[ \mbox{\bf WAP-Test}(h)_D \leq 2 h_{\mbox{\bf WAP-Train}(D)}.\]
\end{theorem}
This theorem states that the cost sensitive loss is bounded by
twice the importance weighted loss on the
induced importance weighted learning problem.
Theorem \ref{thm:CSCtoIW} and the Costing reduction
(Theorem~\ref{thm:costing} in Section~\ref{costing})
imply the following efficiency guarantee:
\begin{theorem}
For all cost sensitive test distributions $D$, let $D'$ be the binary
classification problem induced by \emph{\bf WAP-Train} and Costing, and $h$
be the learned binary classifier. Then
\vskip -.2in
\[ \mbox{\bf WAP-Test}(h)_D \leq 2 h_{D'} E_{\langle x,k_1,\ldots,k_r\rangle\sim D}
\sum_{i=1}^r k_i. \]
\end{theorem}
\vskip -.1in
For notational simplicity, assume that $k_1 \le
\dots \le k_r$. Note that no generality is lost since
the algorithm does not distinguish between the labels. The following
insight is the key to the proof.
\begin{lemma}\label{lem:crossings}
Suppose label $i$ is the winner. Then, for every $j \in \{1, \ldots,
i-1\}$, there must be at least $\lceil j/2 \rceil$ pairs $(a,b)$,
where $a \le j < b$, and $b$ beats $a$.
\end{lemma}
\begin{proof}
\vskip -.1in
(Lemma \ref{lem:crossings})
Consider the restricted tournament on $\{1, \dots, j\}$.
\par\noindent {\bf Case 1:} Suppose that some $w$ beats at least $\lceil
j/2 \rceil$ of the others. If no label $b > j$ beats any label $a \le
j$, then $w$ would beat at least $\lceil j/2 \rceil + 1$ more labels
than any $b > j$; in particular, $w$ would beat at least
$\lceil j/2 \rceil + 1$ more labels than $i$. Thus, in order to have
label $i$ beat as many labels as $w$, at least $\lceil j/2 \rceil$
edges of the form $(w,b), b>j$ or $(a, i), a \le j$ must be reversed.
\par\noindent{\bf Case 2:} There is no label $w \in \{1, \dots, j\}$
beating $\lceil j/2 \rceil$ of the rest of $\{1, \dots, j\}$. This can
only happen if $j$ is odd and there is a $j$-way tie with $(j-1)/2$
losses per label in $\{1,\ldots,j\}$. In this case, although every
label beats $(j+1)/2$
more labels than any $b > j$, in particular $i$, it is still necessary
to reverse at least $(j+1)/2 \ge \lceil j/2 \rceil$ edges, in order to
ensure that $i > j$ beats as many labels as each of $\{1, \dots, j\}$.
\end{proof}
\begin{proof} (Theorem \ref{thm:CSCtoIW})
Suppose that our algorithm
chooses the wrong label $i$ for a particular example $(x, k_1, \dots, k_r)$.
We show that this requires the adversary to
incur a comparable loss.
Lemma~\ref{lem:crossings} and the definition of $v_i$ imply that the
penalty incurred to make label $i$ win is at least
\[
\int_{0}^{k_i} \frac{\left\lceil L(t) / 2 \right\rceil}{L(t)} \dt
\ge \int_{0}^{k_i} \frac12 \dt = \frac{k_i}{2}.
\]
On the other hand, the total importance assigned to queries for this
instance equals
\begin{align*}
\sum_{i<j} (v_j - v_i)
&= \sum_{i<j} \int_{k_i}^{k_j} \frac{1}{L(t)} \, \dt
= \int_0^\infty \frac{L(t)\,|\{j \mid k_j > t\}|}{L(t)} \, \dt \\
&= \int_0^\infty |\{j \mid k_j > t\}| \, \dt
= \sum_{i=1}^r k_i,
\end{align*}
where the second equality follows by writing, for each $t$, the number of
pairs $(i,j)$ with $k_i \le t < k_j$ as $L(t)\,|\{j \mid k_j > t\}|$,
and then switching the order of summation.
Consequently, for every example $(x,k_1,\ldots,k_r)$, the total
importance assigned to queries for $x$ equals $\sum_i k_i$, and the
cost incurred by our algorithm on instance $x$ is at most twice the
importance of errors made by the binary classifier on instance $x$.
Averaging over the choice of $x$ shows that the cost is at most twice
the importance weighted loss rate on the induced problem.
\end{proof}
\vskip -.1in
This method of assigning importances is provably near-optimal.
\begin{theorem}
\emph{(Lower bound)} For any other assignment of
importances $w_{i,j}$ to the points $(x,i,j)$ in the above
algorithm, there exists a distribution and a binary classifier with
importance weighted loss $\epsilon$ for which the expected cost is at
least $\epsilon/2$.
\end{theorem}
\begin{proof}
\vskip -.2in
Consider examples $\left(x,0,\frac{1}{r-1},\ldots,\frac{1}{r-1} \right)$.
Suppose that we run our algorithm using some $w_{i,j}$ as the importance
for the query $(x,i,j)$. Any classifier which errs on $(x,1,i)$ and
$(x,1,j)$, where $i \ne j$, causes our algorithm to choose label $2$
as the winner, thereby giving a cost of $1/(r-1)$, out of the total cost
of $1$. The importance of these two errors is $w_{1,i} + w_{1,j}$,
out of the total importance of $\sum_{i,j} w_{i,j}$. Choosing $i$ and $j$ so
that $w_{1,i} + w_{1,j}$ is minimal, the adversary's penalty is at
most $2 \sum_{i=2}^r w_{1,i}/(r-1)$, and hence less than $2/(r-1)$
times the total importance for $x$. This shows that the cost of the
reduction cannot be reduced below $1/2$ merely by improving the choice
of weights.
\end{proof}
\subsection{Experiments}\label{experiments}
We conducted experiments comparing the weighted
all-pairs reduction to the original all-pairs reduction
\cite{All-pairs} (which ignores the costs and attempts to predict the
label with minimal cost) and to a state-of-the-art multi-class
cost-sensitive learning algorithm called MetaCost \cite{MetaCost}. We
used Boosted Naive Bayes~\cite{elkan}
as the oracle classifier learner and applied
the methods to five UCI multiclass datasets. Since these datasets do
not have costs associated with them, we generated artificial costs in
the same manner that was done in the MetaCost paper (except that we
fixed the cost of classifying correctly at zero). We repeated the
experiments for 20 different settings of the costs. The average and
standard error of the test set costs obtained with each method are
shown in the table below. Note that we often observe
order-of-magnitude improvements in performance.
We present results for two versions of this reduction. One is the
exact version used in the proof above. In the other version,
we learn one classifier per pair as is done in the original all-pairs
reduction. More specifically, for each pair $(i,j)$ with $i < j$, we
train a separate classifier to predict whether label $i$ or label $j$
has the lower cost, and decode with {\bf WAP-Test} as before.
%(Table of experimental results and figure omitted. Figure caption:
%The ``easy'' direction for reductions between the learning problems
%discussed here.)
\section{Error-Limiting Reductions}
\label{reductions}
%\vskip -.08in
Note that all supervised learning tasks we consider can be embedded
into cost-sensitive classification using the mapping $K' =
\lossfunc(K,Y)$ for the task $(K,Y,\lossfunc)$. \shrink{Consequently,
cost-sensitive classification is (vacuously) complete for the set
of supervised learning tasks.} Section~\ref{weighted-all-pairs} shows
that cost-sensitive classification is reducible to binary
classification. Thus binary classification is
\emph{complete} for the set of supervised learning tasks considered here in
the sense that a learning machine with an oracle for
binary classification learning can solve any supervised
learning task.
\shrink{Stated
another way, a learning machine consisting of a turing machine with a
generalizing classification oracle can solve any supervised learning
task.}
\includegraphics[scale=0.49]{directions}
%\vskip -.04in
\subsection{Properties of Reductions}
We show that the definition of error limiting reductions
is non-trivial (in the sense defined below).
{\bf Non-triviality of the definition}\quad
\shrink{
Definition~\ref{definition:reduction} is only significant when the
oracle learner's loss rate is small.
We can show that trivial
reductions that either never call the oracle or create ``hard'' subproblems
with arbitrarily large informational lower bounds on errors are not
allowed by the definition. \shrink{The proposition below states that every
reduction must make a non-trivial use of its oracle
(although it does not forbid such %almost always suboptimal
behaviors as encrypting $x$).}
}
The proposition below states
that every reduction must make a non-trivial use of its oracle.
\begin{proposition}
For every error-limiting reduction $R$, every training set $S$, and
test example $(x,k)$, there exists an oracle $A$ such that
$$\bE_R [\lossfunc(k,h(x))] = \min_{y \in Y} \lossfunc(k,y)=0,$$ where $h$
is the hypothesis output by $R(S,A)$, and the expectation
is over the randomization used by $R$ in forming $h$.
\end{proposition}
\begin{proof}
\vskip -.1in
Recall that the error limitation property holds for all $D$, in particular for
the deterministic $D$ with all probability mass on $(x,k)$.
Thus if $R$ just ignored its oracle
we would have $\lossfunc(k,h(x)) \leq g(\max_{\emptyset}) = g(0) = 0$.
The fact that $\lossfunc(k,y)$ is nonconstant (in both arguments) implies
that there exists $(x,k)$ such that any
\shrink{\footnote{Note that a machine with state is required rather
than a (stateless) function to cope with the situation where the
reduction uses the same $x'$ twice.}} $h$ must satisfy $\lossfunc(k,h(x))
> 0$, a contradiction. Consequently, at least one call to the oracle must be
made.
Next note that for all $R$, $S$, and $(x,k)$,
there exists an $h'$ (and thus an oracle $A$ outputting
$h'$) consistent with all examples induced by $R$ on $(x,k)$.
If $R$ is randomized, there
exists a deterministic reduction $R'$ taking one
additional argument (a randomly chosen string) with the same output
behavior as $R$. Consequently, there exists an oracle $A$ dependent
on this random string which achieves error rate $0$. Since $g(0) =
0$, this must imply that $\lossfunc(k,h(x)) = \min_{y \in Y}
\lossfunc(k,y) = 0$.
\shrink{
Next note that for all $R$, $S$, and $(x,k)$,
there exists an $h'$ (and thus an oracle outputting
$h'$) such that $h'(x') = \arg \min_{y' \in Y'} \lossfunc(k',y') = 0$.
This is obvious if the reduction is deterministic. Since $h'$ is a
prediction machine, this is sufficient to predict perfectly for every
element $(x',k') \in S'$.
Note that if $R$ is randomized, there
exists a deterministic reduction $R'$ taking one
additional argument (a randomly chosen string) with the same output
behavior as $R$. Consequently, there exists an oracle $A$ dependent
on this random string which achieves error rate $0$. Since $g(0) =
0$, this must imply that $h(x) = \arg \min_{y \in Y} \lossfunc(k,y) =
0$.
}
\end{proof}
Note, however, that while the notion of error limiting reductions is non-trivial,
it does not prevent a reduction from creating hard problems for the oracle.
% ECOC example
\shrink{
While the notion of error limiting reductions is
non-trivial, it does not prevent the reduction of an
``easy'' problem to a ``hard'' problem. Alternative definitions
enforcing this do exist, but it is unclear what can be proved under
these much stronger definitions.
}
{\bf Composition}\quad
Since our motivation for studying reductions is to reduce the number of
distinct problems which must be considered, it is natural to want to
compose two or more reductions to obtain new ones.
It is easy to see that reductions defined above compose.
\begin{proposition}
If $R_{12}$ is a reduction from task ${\task}_1$ to ${\task}_2$
with error limitation $g_{12}$ and $R_{23}$ is a reduction from task
${\task}_2$ to ${\task}_3$ with error limitation $g_{23}$, then the
composition $R_{12}\circ R_{23}$ is a reduction from ${\task}_1$ to
${\task}_3$ with error limitation $g_{12} \circ g_{23}$.
\end{proposition}
\begin{proof}
Given oracle access to any learning algorithm $A$ for ${\task}_3$,
reduction $R_{23}$ yields a learning algorithm for ${\task}_2$ such that for
any $X_2$ and $D_2$, it produces a hypothesis with loss rate
$h'_{D_2} \leq g_{23}(h''_{D_3})$,
where
$h''_{D_3}$ is the maximum loss rate
of a hypothesis $h''$ output by $A$ (over all invocations).
We also know that $R_{12}$ yields a
learning algorithm for ${\task}_1$ such that for all $X_1$ and $D_1$ we have
$h_{D_1}\leq g_{12}(h'_{D_2}) \leq
g_{12}(g_{23}(h''_{D_3}))$, where $h'_{D_2}$ is the maximum
loss rate of a hypothesis $h'$ output by $R_{23}(A)$.
Finally, notice that $g_{12}\circ g_{23}$ is continuous, monotone
nondecreasing, and $g_{12}(g_{23}(0))=0$.
\end{proof}
\section{Examples of Reductions}
\label{section:examples}
Here, we show that the notion of an error-limiting reduction is both
tractable for analysis and describes several standard machine learning
algorithms.
\subsection{Reduction from Importance Weighted Classification to Classification}
\label{costing}
\shrink{
\begin{center}
\renewcommand{\arraystretch}{1.1}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
Name& $g(\epsilon)$& % Learn Preserve&
Time& \#Calls
\tabularnewline
\hline
Costing& $\epsilon E_{(x,k_1,k_2)\sim D} k_2$& % Yes&
$O(m)$& O(1) parallel
\tabularnewline
\hline
\end{tabular}
\end{center}
}
In importance weighted classification, every example comes with an
extra ``importance'' value which measures how (relatively) important a
decision might be. This extra flexibility turns out to be useful in
several applications, and as an intermediate step in several
reductions.
The ``Costing'' \cite{costing} algorithm is a (very simple) reduction
from importance weighted classification to classification based on
rejection sampling. Given an importance weighted dataset $S$ of
examples $(x,k_1,k_2)$, an unweighted dataset is produced using
rejection sampling:
$$\{ (x,k_1) : (x,k_1,k_2) \in S \mbox{ and } t \leq k_2,\ t \sim U(0, w)
\}$$ where $t \sim U(0,w)$ is a uniform random draw from the interval
$[0,w]$ (drawn independently for each example) and $w$ is an upper
bound on $k_2$ over all examples. The
classifier learner is applied to produce a classifier $c$ for this new
dataset\footnote{The actual costing algorithm repeats this process and
predicts according to a majority vote over the classifiers learned on
the repeated draws.}.
Although this algorithm is remarkably simple, composing costing with a
decision tree has been shown to yield superior performance (again, see
\cite{costing} for details) on championship prediction problems.
\begin{theorem}\label{thm:costing}
(Costing error efficiency) For all importance weighted problems
$(D,X,\task)$, if the classifiers have error rate $\epsilon$, costing
has loss rate at most \vskip -.18in
\[ \epsilon E_{(x,k_1,k_2)\sim D} k_2.\]
\end{theorem}
\begin{proof} \vskip -.12in Trivial. See \cite{costing}.
\end{proof}
\subsection{Reductions from Multiclass Classification to Binary Classification}
In this section we analyze several well known reductions for
efficiency. Before we start, it is important to note that there are
boosting algorithms for
solving multiclass classification given weak binary importance
weighted classification (see \cite{adaboost.oc}, \cite{adaboost.m2}).
Some of these methods require extra powerful
classification algorithms, but others are genuine reductions, as
defined here. A reductionist approach for creating a multiclass
boosting algorithm is to simply compose any one of the following
reductions with \emph{binary} boosting.
\shrink{
\begin{center}
\renewcommand{\arraystretch}{1.1}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
Name& $g(\epsilon)$& % Learn Preserve&
Time& \#Calls
\tabularnewline
\hline
Tree& $\epsilon \log_2 r$& % Yes&
$O(|S| r)$&\hskip .05in $r - 1$ parallel \hskip .1in
\tabularnewline
\hline
ECOC& $4 \epsilon $& % No&
$O(|S|r)$& $2 r$ parallel
\tabularnewline
\hline
\end{tabular}
\end{center}
}
\noindent
{\bf The Tree Reduction}\quad
In the tree reduction (which is well known, see chapter 15 of
\cite{Tree}), we construct $r-1$ binary classifiers distinguishing
between the labels using a binary tree. The tree has depth
$\log_{2}r$, and each internal node is associated with a set of labels
that can reach the node, assuming that each classifier above it
predicts correctly. The set of labels at that node is split in half,
and a classifier is trained to distinguish between the two sets of
labels. The root node starts with the set of all labels $\{1,\ldots,r\}$
which are all indistinguishable at that point.
%In particular, we start with the set of all labels
%$\{1,\ldots,r\}$ at the top most node, and split the labels into
%ordered intervals $\{1,\ldots,r/2 - 1\}, \{r/2, r\}$ at each node.
Predictions are made by following a chain of classifications from the root
down to the leaves, each of which is associated with a unique label.
Note that instead of $r - 1$ parallel queries, one could use $\log_2 r$ adaptive queries.
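The prediction chain can be sketched as follows (a minimal Python illustration with our naming, treating the trained node classifiers as a black box `classify` already applied to the input $x$):

```python
def tree_predict(labels, classify):
    # Descend the label tree: at each internal node the current label
    # set is split in half, and a binary classifier chooses a side
    # (1 = right half, 0 = left half).  classify(left, right) stands
    # in for the trained node classifier evaluated on the input.
    while len(labels) > 1:
        mid = len(labels) // 2
        left, right = labels[:mid], labels[mid:]
        labels = right if classify(left, right) else left
    return labels[0]
```

With error-free node classifiers this recovers the true label using $\log_2 r$ adaptive calls, matching the remark above.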
%Note that the trick of combining many calls into one described in
%Section~\ref{sec:call} applies here as well.
%The time efficiency of both reductions is $O(|S| r)$.
\begin{theorem}
\emph{(}Tree error efficiency\emph{)}
For all multiclass problems $(D,X,\task)$, if the
binary classifiers have loss rate $\epsilon$, the tree reduction has
loss rate at most $\epsilon\log_{2}r$.
\end{theorem}
\begin{proof}
\vskip -.1in
A multiclass error occurs only if some binary classifier on
the path from the root to the leaf errs.
\end{proof}
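To make the construction concrete, here is a minimal Python sketch of the tree reduction. The binary oracle \texttt{learn\_binary} (any function mapping a labeled binary dataset to a classifier) is an illustrative assumption, not part of the formal model:

```python
def train_tree(examples, labels, learn_binary):
    """Train r - 1 binary classifiers arranged in a binary tree over
    the label set.  `examples` is a list of (x, y) pairs;
    `learn_binary` is an arbitrary binary classification oracle."""
    if len(labels) == 1:
        return labels[0]                     # leaf: a unique label
    mid = len(labels) // 2
    left, right = labels[:mid], labels[mid:]
    # Binary subproblem: does the true label fall in the right half?
    data = [(x, y in right) for (x, y) in examples if y in labels]
    classifier = learn_binary(data)
    return (classifier,
            train_tree(examples, left, learn_binary),
            train_tree(examples, right, learn_binary))

def predict_tree(node, x):
    """Follow a chain of log2(r) classifications from root to leaf."""
    while isinstance(node, tuple):
        classifier, left_child, right_child = node
        node = right_child if classifier(x) else left_child
    return node
```

Only the $\log_2 r$ classifiers on the root-to-leaf path are consulted at prediction time, matching the adaptive-query remark above.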
\noindent
{\bf The Error Correcting Output Codes (ECOC) Reduction}\quad
Let $C \subseteq \{0,1\}^n$ be an error-correcting code with $r$
codewords and minimum distance $d$ (any two codewords differ in at
least $d$ bit positions). Let $C_{i,j}$ denote the
$i$'th bit of the $j$'th codeword of $C$. The ECOC reduction
corresponding to $C$ learns $n$ classifiers: the $i$'th classifier
predicts $C_{i,y}$ where $y$ is the correct label given input $x$.
The loss rate for this reduction is at most $2n\epsilon/d$, as we
shall prove in Theorem~\ref{thm:ECOC} below.
We mention three codes of particular interest. The first is when the
codewords form a subset of the rows of a Hadamard matrix (an $n \times
n$ binary matrix with any two rows differing in exactly $n/2$ places).
Such matrices exist and are easy to construct when $n$ is a power of
$2$. Thus, for Hadamard codes, the number of classifiers needed is
less than $2 r$ (since a power of $2$ exists between $r$ and $2r$),
and the loss rate is at most $4 \epsilon$. When the codewords
correspond to the binary representations of $0, 1, \dots, r-1$, the ECOC
reduction is similar to (though, due to decoding differences, not
identical to) the tree reduction above. Finally, if the codewords form
the $r \times r$ identity matrix, the ECOC reduction corresponds to
the one-against-all reduction.
%Also note that the ``many calls to one call'' trick in section
%\ref{sec:call} (call efficiency) can be applied here.
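The following Python sketch illustrates the ECOC reduction, using the Sylvester construction for the Hadamard codes mentioned above. As before, \texttt{learn\_binary} stands in for an arbitrary binary classification oracle and is an assumption for illustration:

```python
def hadamard(k):
    """Sylvester construction: a 2^k x 2^k 0/1 matrix whose rows
    pairwise differ in exactly half of their positions."""
    H = [[0]]
    for _ in range(k):
        H = ([row + row for row in H] +
             [row + [1 - b for b in row] for row in H])
    return H

def train_ecoc(examples, code, learn_binary):
    """Learn n classifiers; the i'th one predicts bit i of the
    codeword code[y] of the true label y."""
    n = len(code[0])
    return [learn_binary([(x, code[y][i]) for (x, y) in examples])
            for i in range(n)]

def predict_ecoc(classifiers, code, x):
    """Decode by outputting the label whose codeword is closest in
    Hamming distance to the vector of binary predictions."""
    bits = [int(c(x)) for c in classifiers]
    return min(range(len(code)),
               key=lambda y: sum(b != cb for b, cb in zip(bits, code[y])))
```

With \texttt{code = hadamard(k)} and $r \le 2^k$ labels, this instantiates the Hadamard case: $n = 2^k < 2r$ classifiers with minimum distance $d = n/2$.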
\begin{theorem} \label{thm:ECOC}
\emph{(}ECOC\emph{)}
For all multiclass problems $(D,X,\task)$, if the
binary classifiers have loss rate $\epsilon$, the ECOC reduction has
loss rate less than
$\frac{2n}{d} \epsilon$.
\end{theorem}
This theorem is identical to Lemma 2 of \cite{Guruswami}
(which was generalized in \cite{ASS}), except that the statement here
applies to any distribution $D$ generating (test) examples rather than
only to the loss rate on a training set.
\begin{proof}
\vskip -.1in
When no binary classifier errs, the correct label is consistent with
all $n$ classifiers, while any wrong label is consistent with at most
$n - d$ of them, since $C$ has minimum distance $d$. Consequently, to
wrongly predict the multiclass label, at least $d/2$ binary classifiers
must err simultaneously. Since every multiclass error implies that at
least a $d/2n$ fraction of the binary classifiers erred, a binary loss
rate of $\epsilon$ can yield a multiclass loss rate of at most
$2n\epsilon/d$.
\end{proof}
\vskip -.1in
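As a check, instantiating Theorem~\ref{thm:ECOC} with a Hadamard code, for which $d = n/2$, recovers the constant claimed earlier:
\[
\frac{2n}{d}\,\epsilon = \frac{2n}{n/2}\,\epsilon = 4\epsilon.
\]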
\noindent Sometimes ECOC is analyzed under the assumption that the
binary errors are independent. This yields much better bounds, but
seems less justifiable in practice: in any setting with label noise,
for example, the errors are highly \emph{dependent}.
We note that all reductions in this section preserve independence.
Many common learning algorithms are built under the assumption that
examples are drawn independently from some distribution, and
it is a desirable property for reductions to preserve this independence
when constructing datasets for the oracle.
\section{Prior Work}
There has been much work on reductions in learning. One of our goals
is to give a unifying framework, allowing a better understanding of
strengths and weaknesses in forms of reductions.
Boosting algorithms \cite{Boosting,noise_boost} can be thought of as
reductions from binary (or sometimes multiclass) classification with a
small error rate to importance weighted classification with a loss
rate of nearly $\frac{1}{2}$. Given this, it is natural to ask:
``Can't a learning machine apply boosting to solve any classification
problem perfectly?'' The answer is ``No,'' since it is impossible to
always predict correctly whenever the correct prediction given the
features is not deterministic (i.e., is probabilistic). For a learning
machine, this implies that after some number of rounds of boosting, the
loss rate is no longer bounded away from $\frac{1}{2}$.
Bagging \cite{Bagging} can be seen as a self-reduction of
classification to classification, which turns a learning algorithm
with high ``variance'' (i.e., dependence on the exact examples seen)
into a classifier with lower ``variance''. Bagging performance
sometimes suffers because the examples fed to the classifier
learning algorithm are not necessarily independently distributed
according to the original distribution.
The ECOC reduction~\cite{ECOC} generalizes
the tree reduction \cite{Tree} and the one-against-all reduction
\cite{One-vs-all} from multiclass to binary classification.
The ``Costing'' algorithm \cite{costing} is a reduction of importance
weighted classification to classification.
The ``Probing'' algorithm \cite{probing} is a reduction of class
probability estimation to binary classification.
Pitt and Warmuth \cite{PW} also introduced a form of
a reduction between classification problems called
\emph{prediction-preserving reducibility}. There are several
important differences between this notion and the one presented here:
(1) prediction-preserving reductions are typically used to show negative,
{\it non}-predictability results, while we show positive results
transferring learning algorithms between tasks;
(2) the reductions presented here are representation independent;
(3) prediction-preserving reducibility is a \emph{computational} notion,
while the reductions here are \emph{informational};
(4) the reductions presented here are also more general in the sense
that they hold between arbitrary learning \emph{tasks} rather than
between restricted sets of classification problems.
\shrink{
(5) prediction-preserving reductions
make only one call to the oracle (i.e., problem they reduce to) and
output the oracle's answer as its own answer. We enforce no
restriction on how the oracle is used, making the setting directly
applicable to ECOC, Boosting, and other common
transformations. However, Abe and Watanabe~\cite{naoki} defined a
Turing analogue of the prediction preserving reducibility.
}
\section{Conclusion}
We introduced the analysis model of error-limiting reductions in
learning. This model has several nice properties (often not shared
with other methods of learning analysis). First,
the analysis in this model does
not depend upon \emph{any} unprovable (or unobservable) assumptions
such as independence.
Error-limiting reductions often have tractable analysis and
capture several common machine learning algorithms. They also
often work well in practice.
To test the model, we constructed, proved, and empirically tested
an error-limiting reduction from multiclass cost-sensitive
classification to binary classification.
Since cost-sensitive classification can express any supervised
learning task considered here, this reduction
can be used to reduce \emph{any} such task
to binary classification.
\shrink{
We constructed, proved, and empirically tested an error-limiting
reduction from cost sensitive classification to binary classification.
Since cost sensitive classification can express any supervised
learning task considered here, this means that \emph{any} such task
can be reduced to binary classification.
}
\shrink{
Several important problems remain to be addressed.
\begin{enumerate}
\item (scope) There are several learning problems (for example,
outlier detection and clustering) that are not captured by the notion
of supervised learning defined here and are not known to be reducible
to binary classification. A model capable of capturing all of these
would be of significant interest. An obvious question is whether
there is a complete task for these problems and whether this
primitive is fundamentally different from binary classification.
\item (refinement) While the notion of error limiting reductions is
non-trivial, it does not (intuitively) prevent the reduction of an
``easy'' problem to a ``hard'' problem. Alternative definitions
enforcing this do exist, but it is unclear what can be proved under
these much stronger definitions.
\end{enumerate}
}
\shrink{
\begin{table}
\centering
\begin{tabular}{|c|c|c|}
\hline
\hskip .05in Dataset \hskip .05in & All-Pairs & MetaCost \\ \hline
splice & 59.8 $\pm$ 24 & 49.8 $\pm$ 3.05 \\
solar & \hskip .03in 3989 $\pm$ 1415 \hskip .03in & \hskip .03in 5317 $\pm$ 390 \hskip .03in \\
anneal & 310 $\pm$ 205 & 207 $\pm$ 43 \\
kdd-99 & 234 $\pm$ 62 & 49.4 $\pm$ 9.3 \\
satellite & 137 $\pm$ 33 & 104 $\pm$ 6.4 \\ \hline
& WAP v.1 & WAP v.2 \\ \hline
splice & 197 $\pm$ 23 & 46.5 $\pm$ 3.5 \\
solar & 24.7 $\pm$ 2.0 & 36.8 $\pm$ 11 \\
anneal & 29.9 $\pm$ 3.0 & 164 $\pm$ 43 \\
kdd-99 & 0.956 $\pm$ 0.12 & 1.76 $\pm$ 0.66 \\
satellite & 428 $\pm$ 21 & 148 $\pm$ 8.5 \\ \hline
\end{tabular}
\caption{The weighted all pairs reduction: Experimental results}
\label{tb:wap}
\end{table}
}
\shrink{
\begin{table*}
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
\hskip .05in Dataset \hskip .05in & All-Pairs & MetaCost & Weighted All-Pairs v.1 & Weighted All-Pairs v.2 \\ \hline
splice & 59.8 $\pm$ 24 & 49.8 $\pm$ 3.05 & 197 $\pm$ 23 & 46.5 $\pm$ 3.5 \\
solar & \hskip .03in 3989 $\pm$ 1415 \hskip .03in & \hskip .03in 5317 $\pm$ 390 \hskip .03in & 24.7 $\pm$ 2.0 & 36.8 $\pm$ 11 \\
anneal & 310 $\pm$ 205 & 207 $\pm$ 43 & 29.9 $\pm$ 3.0 & 164 $\pm$ 43 \\
kdd-99 & 234 $\pm$ 62 & 49.4 $\pm$ 9.3 & 0.956 $\pm$ 0.12 & 1.76 $\pm$ 0.66 \\
satellite & 137 $\pm$ 33 & 104 $\pm$ 6.4 & 428 $\pm$ 21 & 148 $\pm$ 8.5 \\ \hline
\end{tabular}
\caption{The weighted all pairs reduction: Experimental results}
\label{tb:wap}
\end{table*}
}
\shrink{
\begin{thebibliography}{}
\bibitem{naoki}Naoki Abe and Osamu Watanabe. Polynomially Sparse Variations and Reducibility among Prediction Problems,
{\em IEICE Transactions on Communications, Electronics, Information, and Systems}, E75-D(4):449--458, 1992.
\bibitem{ASS}Erin Allwein, Robert Schapire, and Yoram Singer. Reducing
Multiclass to Binary: A Unifying Approach for Margin Classifiers,
{\em Journal of Machine Learning Research}, 1:113--141, September 2001.
\bibitem{Bagging}Leo Breiman. Bagging predictors,
{\em Machine Learning}, 26(2):123--140, 1996.
\bibitem{ECOC}Thomas G. Dietterich and Ghulum Bakiri. Solving
Multiclass Learning Problems via Error-correcting Output Codes,
{\em Journal of Artificial Intelligence Research}, 2:263--286, January 1995.
\bibitem{MetaCost}Pedro Domingos. MetaCost: A general method for making classifiers cost-sensitive.
{\em Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining},
pages 155-164. ACM Press, 1999.
\bibitem{Boosting}Yoav Freund and Robert E. Schapire.
A decision-theoretic generalization
of on-line learning and an application to boosting. {\em Journal of Computer
and System Sciences}, 55(1):119--139, 1997.
\bibitem{Tree}John Fox. {\em Applied Regression Analysis, Linear Models, and Related Methods}, Sage Publications, 1997.
\bibitem{Guruswami}Venkatesan Guruswami and Amit Sahai. Multiclass Learning,
Boosting, and Error-correcting codes, {\em Proceedings of the 12th Annual Conference on Computational Learning Theory (COLT)}, 145--155, 1999.
\bibitem{All-pairs}Trevor Hastie and Robert Tibshirani.
Classification by pairwise
coupling. {\em Proceedings of the 1997 Conference on
Advances in Neural Information Processing Systems}, 507--513, MIT
Press, 1998.
\bibitem{Haussler}David Haussler. Decision Theoretic Generalizations
of the PAC Model for Neural Net and Other Learning Applications,
{\em Information and Computation}, 100:78--150, September 1992.
\bibitem{noise_boost}Adam Kalai and Rocco Servedio.
Boosting in the Presence of Noise, {\em
Proceedings of the 35th Annual ACM Symposium
on the Theory of Computing (STOC)}, 195--205, 2003.
\bibitem{probing}John Langford and Bianca Zadrozny, Estimating class
membership probabilities using classifier learners, AISTAT 2005.
\bibitem{PW}Lenny Pitt and Manfred Warmuth. Prediction-Preserving
Reducibility, {\em Journal of Computer and System Sciences}, 41:430--467,
1990.
\bibitem{One-vs-all} Ryan Rifkin and Aldebaro Klautau. In Defense of One-Vs-All
Classification, {\em Journal of Machine Learning Research}, 5:101--141, 2004.
\bibitem{adaboost.m2}Robert E. Schapire and Yoram Singer. Improved
boosting algorithms using confidence-rated predictions. {\em Machine
Learning}, 37(3):297--336, 1999.
\bibitem{adaboost.oc}Robert E. Schapire. Using output codes to boost
multiclass learning problems, {\em Proceedings of
the Fourteenth International Conference on Machine Learning (ICML)}, 313--321,
1997.
\bibitem{valiant}L. G. Valiant. A Theory of the
Learnable. {\em Communications of the ACM}, 27(11):1134--1142, November 1984.
\bibitem{vapnik}V. N. Vapnik and A. Y. Chervonenkis. On the uniform
convergence of relative frequencies of events to their
probabilities. {\em Theory of Probability and Its Applications},
16(2):264--280, 1971.
\bibitem{costing}Bianca Zadrozny, John Langford, and Naoki Abe. Cost
Sensitive Learning by Cost-Proportionate Example Weighting, {\em Proceedings of
the 3rd IEEE International Conference on Data Mining (ICDM)}, 2003.
\end{thebibliography}
}
\renewcommand{\baselinestretch}{1.12}
\begin{small}
\bibliography{wap}
\bibliographystyle{mlapa}
\end{small}
\end{document}