\documentclass{llncs}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\pagestyle{headings} % switches on printing of running heads
\newcommand{\shrink}[1]{}
\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
\newcommand{\noun}[1]{\textsc{#1}}
\usepackage{float}
\floatstyle{ruled}
\newfloat{algorithm}{tbp}{loa}
\floatname{algorithm}{Algorithm}
\usepackage{latexsym,amsmath,amssymb,graphicx}
\begin{document}
\title{Sensitive Error Correcting Output Codes}
\titlerunning{Sensitive Error Correcting Output Codes}
\author{John Langford\inst{1} \and Alina Beygelzimer\inst{2}}
\authorrunning{Langford and Beygelzimer}
\tocauthor{John Langford (TTI-Chicago),
Alina Beygelzimer (IBM T. J. Watson Research Center)}
%
\institute{Toyota Technological Institute, Chicago IL 60637, USA\\
\email{jl@tti-c.org}
\and
IBM T. J. Watson Research Center, Hawthorne NY 10532, USA\\ \email{beygel@us.ibm.com}}
\maketitle
\begin{abstract}
We present a reduction from cost-sensitive classification to binary
classification based on (a modification of) error correcting output
codes. The reduction satisfies the property that $\epsilon$ regret
for binary classification implies $l_{2}$-regret of at most $2\epsilon$
for cost estimation. This has several implications:
\begin{enumerate}
\item
Any regret-minimizing online algorithm for 0/1 loss is (via the
reduction) a regret-minimizing online cost-sensitive algorithm.
In particular, this means that online learning can be made to work for
arbitrary (i.e., totally unstructured) loss functions.
\item
The output of the reduction can be thresholded so that $\epsilon$ regret
for binary classification implies at most $4\sqrt{\epsilon Z}$ regret
for cost-sensitive classification where $Z$ is the expected sum of
costs.
\item
For multiclass problems, $\epsilon$
binary regret translates into $l_{2}$-regret of at most $4\epsilon$ in the
estimation of class probabilities. For classification, this implies
at most $4\sqrt{\epsilon}$ multiclass regret.
\end{enumerate}
\end{abstract}
\section{Introduction}
\label{sec:Introduction}
\noun{Background:} The goal of classification is to predict labels
on test examples given a set of labeled training examples. Binary
classification, where the number of labels is two, is the most basic
learning task as it involves predicting just a single bit for each
example. Due to this simplicity, binary learning is (perhaps) better
understood theoretically than any other prediction task, and several
empirically good binary learning algorithms exist.
Practical machine learning, on the other hand, often requires predicting
more than one bit. Furthermore, each prediction
may generally have a different associated loss, and ignoring this
information can make
the problem more difficult.\footnote{
For example, consider the following
problem: Given some relevant information, we must predict which
of several routes to take. %The
%goal here is to minimize the expected time of travel. We could
%think of this as a multiclass problem (predict which route is fastest),
%but this can make the problem more difficult:
It is easy to design a problem where there are several
paths that are typically good but occasionally very slow. If there
is another path that is never the best but never slow, it may provide
the best expected time of all choices. If the problem is altered into
a multiclass problem of predicting the best route, this path
will never be taken. %the correct label.
} %due to throwing away information.
\noun{Motivation:} Reductions allow us to translate performance
on well-studied binary problems into performance on the more general
problems arising in practice. We provide a reduction (called SECOC) from
cost-sensitive classification to binary classification with the property
that small regret on the created binary problem(s) implies small regret
on the original cost-sensitive problem. This is particularly
compelling because \emph{any} loss function on single examples can
be expressed with cost-sensitive classification. Therefore, this reduction
can be used (at least theoretically) to solve a very broad set of
learning problems. In addition, there is convincing empirical evidence
that SECOC works well in practice, which gives further support to this
style of analysis.
Experiments in Section~\ref{experiments} show that SECOC results
in superior performance on all tested multiclass learning algorithms and
problems, compared to several other commonly used algorithms.
The basic SECOC reduction can be reused in several ways.
\begin{enumerate}
\item \noun{General Online Learning:} Any regret-minimizing online algorithm
for 0/1 loss is (via the reduction) a regret-minimizing online cost
sensitive algorithm. In particular, this means that online learning
can be made to work for arbitrary (i.e., totally unstructured) loss
functions.
\item \noun{Cost Sensitive Classification:} The output of the reduction
can be thresholded so that a small regret for binary classification
implies a small regret for cost-sensitive classification. This implies
that any consistent binary classifier is a consistent cost-sensitive
classifier.
\item \noun{Multiclass Problems:} Using the canonical embedding
of multiclass classification into cost-sensitive classification, this
reduction implies that small binary regret translates into small $l_{2}$
error in the estimation of class probabilities.
By thresholding the estimates, we get a bound on multiclass regret.
Note that this implies that \emph{any} consistent binary classifier
is (via the reduction) a consistent multiclass classifier. This is
particularly important because some generalizations of SVMs to multiple
classes have been shown to be inconsistent \cite{Wahba}.
\end{enumerate}
These applications are discussed in Section~\ref{apps}.
\noun{General Comment:} It is important to understand that analysis here
is orthogonal to the sample complexity analysis in, for example,
PAC learning \cite{valiant}
or uniform convergence \cite{VC}. We consider measures
over sets of examples and analyze the transformation of losses under
mappings between these measures induced by the algorithms.
This allows us to avoid making
assumptions (which cannot be verified or simply do not hold in practice)
necessary to prove sample complexity bounds. Instead, we show relative
guarantees in an assumption-free setting -- we bound the performance on
general problems arising in practice in terms of performance on basic
problems that are better understood.
\noun{Context:} SECOC is a variation of error-correcting output
codes (ECOC) \cite{ecoc}. Later, in Section \ref{sub:Inconsistency-of-ECOC},
we show that this variation is necessary in order to satisfy a regret
transform. The ECOC reduction works by learning a binary classifier,
which decides membership of a label in subsets of labels. Given a
sequence of subsets, each label corresponds to a binary string (or
a \emph{codeword}) defined by the inclusion of this label in the sequence
of subsets. A multiclass prediction is made by finding the codeword
closest in Hamming distance to the sequence of binary predictions
on the test example.
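The decoding step described above can be sketched in a few lines. The coding matrix below is a toy example of ours, not one from the paper:

```python
import numpy as np

# Illustrative sketch of ECOC decoding (toy matrix, not from the paper).
# M is an n x k binary coding matrix; column j is the codeword of label j.
M = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]])

def ecoc_predict(binary_predictions):
    """Return the label whose codeword is closest in Hamming distance
    to the vector of binary predictions on a test example."""
    b = np.asarray(binary_predictions)
    hamming = np.sum(M != b[:, None], axis=0)  # distance to each codeword
    return int(np.argmin(hamming))
```

For instance, the prediction vector $(0,1,1)$ matches the codeword of label 1 exactly, so label 1 is returned.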
For the ECOC reduction, a basic statement \cite{Venkat} can be made:
with a good code, for all training sets, the error rate of the multiclass
classifier on the training set is at most $4$ times the average error rate
of the individual binary classifiers. The proof of this statement
is essentially the observation that there exist codes in which any
two codewords differ in at least half of the bit positions. Consequently,
at least $\frac{1}{4}$ of the classifiers must err to induce a multiclass
classification error, implying the theorem.
This theorem can be generalized \cite{wap} to quantify {}``for all
measures'' rather than {}``for all training sets''. This generalization
is not as significant as it might at first seem, because the measure
implicit in a very large training set can approximate other measures
(and it can do so arbitrarily well when the feature space is finite).
Nevertheless, it is convenient to quantify over all measures, so that the
statement holds for the process generating each individual example.
Since there is always some process generating examples, the result
is applicable even to adversarial processes.
The weighted all pairs algorithm \cite{wap} intuitively guarantees
that a small error rate on created classification problems implies
a small cost-sensitive loss. The core result here is similar except
that small binary \emph{regret} implies small cost-sensitive \emph{regret}.
Regret is the error rate minus the minimum error rate. Consequently,
the results here can have important implications even when, for example,
the binary error induced from a multiclass problem is $0.25$. SECOC
does not supersede this result, however, because the regret bounds
are weaker: roughly, an error rate of $\epsilon$ becomes a regret of
$\sqrt{\epsilon}$.
ECOC was modified \cite{marginecoc} to consider margins of the binary
classifiers---numbers internal to some classification algorithms that
provide a measure of confidence in a binary prediction. Decoding proceeds
in the same way as for ECOC except a {}``loss''-based%
\footnote{{}``Loss'' is in quotes because the notion of loss is a specification
of the optimization used by the binary learning algorithm rather than
the loss given by the problem, as is used in the rest of the paper.%
} distance is used instead of the Hamming distance. Roughly speaking,
SECOC uses the motivation behind this approach although not the approach
itself. Instead of working with margins, we define binary classification
problems, for which the optimal solution computes the relative expected
cost (rather than the margin) of choices. This approach allows us
to accomplish several things:
First, we can use arbitrary classifiers rather than margin-based classifiers.
We also avoid the mismatch between margins and probability estimates.
Optimizing hinge loss (for SVMs) or exponential loss (for AdaBoost)
distorts the margin: the optimization increases the margin of small-margin
examples at the expense of the margin on large-margin examples. The
efficacy of Platt scaling \cite{Platt} (i.e., fitting a sigmoid to
the margin to obtain a probabilistic prediction) can be regarded as strong
empirical evidence of the deficiency of margins as probability estimates.
Finally, we can generalize the approach to tackle all cost-sensitive problems
rather than just multiclass problems. This generalization comes at
no cost in multiclass performance.
\shrink{
\noun{Outline:} The next sections of this paper do the following:
\begin{enumerate}
\item We define the basic SECOC reduction.
\item We define and then prove the main theorem.
\item We discuss applications and corollaries of the main theorem.
\item We show that ECOC cannot achieve a similar regret transform.
\item We conclude with some discussion of variants.
\end{enumerate}
}
\section{The SECOC Reduction}
We work in an assumption-free learning setting. The SECOC reduction
reduces cost-sensitive classification
to importance weighted binary classification,
which in turn can be reduced to binary classification using the Costing
reduction~\cite{costing}. We first define all the problems involved.
\begin{definition}
%\emph{(Importance Weighted Binary Classification)}
An \emph{importance weighted
binary classification problem} is defined by a measure $D$ on a set
$X\times\{0,1\}\times[0,\infty)$, where $X$ is some arbitrary feature
space, $\{0,1\}$ is the binary label, and $[0,\infty)$ is the importance
of correct classification. The goal is to find a binary classifier
$b:X\rightarrow\{0,1\}$ which minimizes the expected importance weighted
loss, $E_{(x,y,i)\sim D}\left[iI(b(x)\neq y)\right]$, where $I(\cdot)$
is $1$ when the argument is true and $0$ otherwise.
\end{definition}
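As a concrete illustration of this definition (a sketch of ours, with made-up data), the empirical importance weighted loss of a classifier on a sample is just a weighted error count:

```python
# Sketch: empirical importance weighted loss of a binary classifier b
# on a sample of (x, y, i) triples drawn from D (data here is made up).
def importance_weighted_loss(b, examples):
    """Average of i * I(b(x) != y) over the sample."""
    return sum(i * (b(x) != y) for x, y, i in examples) / len(examples)

examples = [(0.2, 1, 3.0), (0.7, 0, 1.0), (0.9, 1, 2.0), (0.4, 0, 4.0)]
always_one = lambda x: 1
# always_one pays the importances of the 0-labeled examples: (1.0 + 4.0) / 4
```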
Cost-sensitive classification, defined below, is sufficiently
general to express any loss function on a finite set of choices.
\begin{definition}
%(Cost Sensitive Classification)
A \emph{cost-sensitive $k$-class problem} is defined
by a measure $D$ on a set $X\times[0,\infty)^{k}$, where $X$ is
some arbitrary feature space, and the additional information $[0,\infty)^{k}$
is the cost of each of the $k$ choices.
The goal is to find a classifier $h:X\rightarrow\{1,...,k\}$ which
minimizes the expected cost, $E_{(x,\vec{c})\sim D}\left[c_{h(x)}\right]$.
\end{definition}
A cost-sensitive learning algorithm typically takes as input a sequence
of training examples in $(X\times[0,\infty)^{k})^{*}$ as advice in
constructing $h(x)$.
The SECOC reduction is a cost-sensitive learning algorithm that uses
a given binary learning algorithm as a black box. As with the ECOC
reduction, SECOC uses a code defined by an $n\times k$ binary coding
matrix $M$ with columns corresponding to multiclass labels. % Each
%of the $n$ rows defines several importance-weighted binary
%classification problems by specifying
%a subset of labels and asking whether the label is in the subset.
For example, the columns can be $k$ distinct codewords of a Hadamard
code of length $n$, which has the property that any two distinct
codewords differ in at least $n/2$ bit positions. Such codes are
easy to construct when $n$ is a power of $2$. Thus, for Hadamard
codes, the number $n$ of classification problems needed is less than
$2k$.
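Such a matrix can be obtained by Sylvester's doubling construction. The sketch below (our illustration, not code from the paper) builds a $2^{m}\times2^{m}$ $\{0,1\}$ matrix whose distinct columns differ in exactly half the positions:

```python
import numpy as np

# Sylvester's construction of a {0,1} Hadamard matrix: start from [[0]]
# and repeatedly form [[H, H], [H, 1-H]]. Any two distinct columns
# (codewords) of the result differ in exactly half the bit positions.
def hadamard_code(m):
    H = np.array([[0]])
    for _ in range(m):
        H = np.block([[H, H], [H, 1 - H]])
    return H
```

For $k$ labels, one takes $k$ columns of the smallest such matrix with $2^{m}\geq k$, so fewer than $2k$ rows suffice.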
For each subset $s$ of labels, corresponding to
a row of $M$, we create a set of importance weighted classification
problems parameterized by $t\in[t_{\textrm{min}},t_{\textrm{max}}]$, where
$t_{\textrm{min}}=0$ and $t_{\textrm{max}}=1$ are always correct,
but significant efficiency improvements arise from appropriately chosen
smaller ranges. Intuitively,
the problem defined by the pair $(s,t)$ is to answer the question
{}``Is the cost of $s$ greater than $t$ times the total cost?''
From the optimal solution of these problems we can compute the expected
relative cost of each subset $s$. SECOC-Train (Algorithm \ref{fig:SECOC})
has the complete specification.
We write $E_{s}$ to denote an expectation over $s$ drawn uniformly
from the rows of $M$, and $E_{t}$ to denote expectation over $t$
drawn uniformly from the interval $[t_{\textrm{min}},t_{\textrm{max}}]$.
To estimate the cost of a label, SECOC-Predict (Algorithm \ref{fig:SECOC_predict})
averages the predictions $b_{st}(x)$ over $s$ and $t$, complementing the
prediction for subsets not containing the label, and rescales the result.
%
\begin{algorithm}
\caption{\label{fig:SECOC} \textbf{SECOC-Train} (Set of $k$-class cost-sensitive
examples $S$, importance weighted binary classifier learning algorithm
$B$, range $\left[t_{\min},t_{\max}\right]$ for $t$)}
\begin{enumerate}
\item For each subset $s$ defined by the rows of $M$:
\begin{enumerate}
\item For $(x,\,\vec{c})\in S$, let $|\vec{c}|=\sum_{y}c_{y}$ and $c_{s}=\sum_{y\in s}c_{y}$.
\item For each $t$ in $\left[t_{\min},t_{\max}\right]$:
Let $b_{st}=B\left(\{(x,\; I(c_{s}\geq t|\vec{c}|),\;|c_{s}-|\vec{c}|t|):(x,\,\vec{c})\in S\}\right)$.
\end{enumerate}
\item return $\{ b_{st}\}$
\end{enumerate}
\end{algorithm}
%
\begin{algorithm}
\caption{\label{fig:SECOC_predict} \textbf{SECOC-Predict} (classifiers $\{ b_{st}\}$,
example $x\in X$, label $y$)}
return $2\left(t_{\textrm{min}}+(t_{\textrm{max}}-t_{\textrm{min}})E_{s}E_{t}\left[I(y\in s)b_{st}(x)+I(y\not\in s)(1-b_{st}(x))\right]\right)-1$
\end{algorithm}
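A minimal sketch of the two procedures may help fix ideas. It discretizes $t$ to a finite grid (in the paper, $t$ ranges over an interval) and assumes an oracle learner $B$ mapping importance weighted examples to a classifier; both simplifications are ours, not the paper's:

```python
import numpy as np

# Sketch of SECOC-Train / SECOC-Predict with t on a finite grid.
# M: n x k binary matrix (rows = subsets s); labels are 0-indexed here.
# B: assumed oracle taking (x, label, importance) triples, returning x -> {0,1}.
def secoc_train(S, B, M, t_grid):
    classifiers = {}
    for row in range(M.shape[0]):
        s = np.flatnonzero(M[row])               # labels in this subset
        for t in t_grid:
            data = [(x, int(c[s].sum() >= t * c.sum()),
                     abs(c[s].sum() - c.sum() * t)) for x, c in S]
            classifiers[(row, t)] = B(data)
    return classifiers

def secoc_predict(classifiers, M, t_grid, x, y):
    t_min, t_max = min(t_grid), max(t_grid)
    # Average b_st(x), complementing predictions for subsets not containing y.
    votes = [b(x) if M[row, y] else 1 - b(x)
             for (row, t), b in classifiers.items()]
    return 2 * (t_min + (t_max - t_min) * float(np.mean(votes))) - 1
```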
\noun{Single Classifier Trick:} %It might appear that SECOC-Train
%requires several invocations of the oracle learning algorithm $B$,
Multiple invocations of the oracle learning algorithm $B$
can be collapsed into a single call using
a standard trick \cite{wap,marginecoc}.
The trick is just to augment
the feature space with the name of the call, and then learn a classifier $b$
on (a random subset of) the union of all training data. With this
classifier, we can define $b_{st}(x)\equiv b(\langle x,s,t\rangle)$, and all of
our results hold for this single invocation classifier.
The implication of this observation is that we can view SECOC-Train
as a machine that maps cost-sensitive examples to importance weighted
binary examples. We denote the learned binary classifier by
$B(\textrm{SECOC-Train}(S))$.
%Thus, we can regard SECOC-Train as having type $\textrm{SECOC-Train}:(X\times[0,\infty)^{k})^{*}\rightarrow(X'\times\{0,1\}\times[0,\infty))^{*}$,
%and the learned binary classifier is simply $B(\textrm{SECOC-Train}(S))$.
\section{The Main Theorem}
Before stating the theorem, we need to define loss and regret. Given
any distribution $D$ on examples $X\times\{0,1\}\times[0,\infty)$,
the importance weighted error rate of a binary classifier $b$ is
given by\[
e(D,b)=E_{(x,y,i)\sim D}\left[iI(b(x)\neq y)\right].\]
Similarly, given any distribution $D$ on examples $X\times[0,\infty)^{k}$,
the cost-sensitive loss of a multiclass classifier $h$ is given by\[
e(D,h)=E_{(x,\vec{c})\sim D}\left[c_{h(x)}\right].\]
For each of these notions of loss, the regret is the difference between
the achieved performance and best possible performance:
\[
r(D,h)=e(D,h)-\min_{h'}e(D,h').\]
(Note that we mean the minimum over \emph{all} classifiers $h'$ not
over some class.) The minimum loss classifier is also known as the
``Bayes optimal classifier''.
We must also define how SECOC transforms its distribution $D$ into
a distribution on the learned binary classifier.
%the distribution that SECOC induces
%on the learned binary classifier.
To draw a sample from this distribution, we first
draw a cost-sensitive sample $(x,\vec{c})$ from $D$, and then apply
SECOC-Train to the singleton set $\{(x,\vec{c})\}$ to get a sequence
of importance weighted binary examples, one for each $(s,t)$ pair.
Now, we just sample uniformly from this set, adding the index in the
sequence as a feature. We overload and denote the induced distribution
by $\textrm{SECOC-Train}(D)$.
Throughout the paper, for a cost vector $\vec{c}\in[0,\infty)^{k}$
and a subset of labels $s\subseteq\left\{ 1,\ldots,k\right\} $, let
$c_{s}=\sum_{y\in s}c_{y}$. Also let $|\vec{c}|=\sum_{y=1}^{k}c_{y}$.
%Given $s$ and $t$, it will be useful to talk about the distribution
%$D_{st}$ over $X\times\left\{ 0,1\right\} \times[0,\infty)$ induced
%by the process in Algorithm \ref{fig:SECOC}. To draw from $D_{st}$,
%we draw $(x,\vec{c})$ from $D$ and output $(x,I(c_{s}\geq t|\vec{c}|),\left|c_{s}-t|\vec{c}|\right|)$.
The distribution $D|x$ is defined as $D$ conditioned on $x$.
\begin{theorem}
\label{thm:(SECOC-Regret-Transform)}\emph{(SECOC Regret Transform)}
For all importance weighted binary learning algorithms $B$ and cost-sensitive
datasets $S$ in $(X\times[0,\infty)^{k})^{*}$, let
$b=B(\mbox{\emph{SECOC-Train}}(S))$.
Then for all test distributions $D$ on $X\times[0,\infty)^{k}$,
for all labels $y\in\left\{ 1,\ldots,k\right\} :$\[
E_{(x,\vec{c})\sim D}\left({\mbox{\emph{SECOC-Predict}}}(b,x,y)E_{\vec{c}'\sim D|x}\left[|\vec{c}'|\right]-E_{\vec{c}'\sim D|x}\left[c_{y}'\right]\right)^{2}\]
\[
\leq8(t_{\max}-t_{\min})^{2}r(\mbox{\emph{SECOC-Train}}(D),b),\]
where $t_{\max}=\max_{(x,\vec{c}):D(x,\vec{c})>0}\max_{s}({c_{s}}/|\vec{c}|)$
and $t_{\min}=\min_{(x,\vec{c}):D(x,\vec{c})>0}\min_{s}({c_{s}}/|\vec{c}|)$.
\end{theorem}
This theorem relates the average regret of the created binary importance
weighted classifier to the relative estimation error.
For the proof, note that the dependence on $B$ and $S$ can be removed
by proving the theorem for all $b$, which is of equivalent generality.
\begin{proof}
We first analyze what happens when no regret is suffered, and then
analyze the case with regret. For any $s$ and $t$, let $D(s,t)$ be
the distribution on $X\times\left\{ 0,1\right\} \times[0,\infty)$
induced by drawing $(x,\vec{c})$ from $D$ and outputting
$(x,I(c_{s}\geq t|\vec{c}|),\left|c_{s}-t|\vec{c}|\right|)$.
For any choice of $s$ and $t$, the
optimal classifier is given by
\[
b_{st}^{*}=\arg\min_{b}E_{(x,y,i)\sim D(s,t)}\left[iI(b(x)\neq y)\right]\]
\[
=\arg\min_{b}E_{(x,\vec{c})\sim D}\left[\left|c_{s}-|\vec{c}|t\right|\cdot I\left(b(x)\not=I(c_{s}\geq t|\vec{c}|)\right)\right].\]
For any $x$, the optimal value of $b(x)$ is either $0$ or $1$.
When it is $0$, the expected cost is\begin{equation}
E_{\vec{c}\sim D|x}\max{\left\{ (c_{s}-t|\vec{c}|),0\right\} }.\label{eq:foo}\end{equation}
Otherwise, it is\begin{equation}
E_{\vec{c}\sim D|x}\max{\left\{ (t|\vec{c}|-c_{s}),0\right\} }.\label{eq:foo2}\end{equation}
To simplify notation, let $Z_{x}=E_{\vec{c}\sim D|x}|\vec{c}|$. Equations
\ref{eq:foo} and \ref{eq:foo2} are continuous in $t$; the first
decreases while the second increases monotonically with $t$, so we
need only find the single equality point to describe the optimal behavior
for all $t$. This equality point is given by \[
E_{\vec{c}\sim D|x}\max{\left\{ (c_{s}-t|\vec{c}|),0\right\} }=E_{\vec{c}\sim D|x}\max{\left\{ (t|\vec{c}|-c_{s}),0\right\} },\]
or\[
E_{\vec{c}\sim D|x}(c_{s}-t|\vec{c}|)=0,\]
yielding\[
t=\frac{E_{\vec{c}\sim D|x}\left[c_{s}\right]}{Z_{x}},\]
and thus $b_{st}^{*}(x)=I(E_{\vec{c}\sim D|x}\left[c_{s}\right]\geq tZ_{x})$.
For any choice of $s$, we have \[
E_{t}\left[b_{st}^{*}(x)\right]=E_{t}I\left(E_{\vec{c}\sim D|x}\left[c_{s}\right]\geq tZ_{x}\right)=\frac{\frac{E_{\vec{c}\sim D|x}\left[c_{s}\right]}{Z_{x}}-t_{\textrm{min}}}{t_{\textrm{max}}-t_{\textrm{min}}}\]
since $E_{t\in\left[t_{\min},t_{\max}\right]}I(K\geq t)=\frac{K-t_{\textrm{min}}}{t_{\textrm{max}}-t_{\textrm{min}}}$
for all $K\in\left[t_{\min},t_{\max}\right]$.
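The identity used above is easy to confirm numerically; the following check (ours, for illustration) approximates the expectation over $t$ with a fine midpoint grid:

```python
# Numeric check of E_{t ~ U[t_min, t_max]} I(K >= t) = (K - t_min)/(t_max - t_min).
t_min, t_max, K = 0.2, 0.8, 0.5
n = 100000
ts = [t_min + (t_max - t_min) * (j + 0.5) / n for j in range(n)]  # midpoint grid
empirical = sum(K >= t for t in ts) / n
exact = (K - t_min) / (t_max - t_min)
# empirical agrees with exact up to the grid resolution
```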
Since decoding is symmetric with respect to all labels, we need only
analyze one label $y$. Furthermore, since SECOC-Predict (Algorithm \ref{fig:SECOC_predict})
is symmetric with respect to set inclusion and complement set inclusion,
we can assume that $y$ is in every subset (complementing all
subsets not containing $y$ does not change the decoding properties
of the code). Consequently, \[
\hat{c}_{y}=E_{s}E_{t}\left[b_{st}^{*}(x)\right](t_{\textrm{max}}-t_{\textrm{min}})+t_{\textrm{min}}=E_{s}\frac{E_{\vec{c}\sim D|x}\left[c_{s}\right]}{Z_{x}}\]
\[
=\frac{\frac{{1}}{2}E_{\vec{c}\sim D|x}(c_{y}+|\vec{c}|)}{Z_{x}}=\frac{{1}}{2}\left(\frac{E_{\vec{c}\sim D|x}\left[c_{y}\right]}{Z_{x}}+1\right),\]
where the third equality follows from the fact that every label other
than $y$ appears in $s$ half the time in expectation over $s$.
Consequently, SECOC-Predict outputs $\frac{1}{Z_{x}}E_{\vec{c}\sim D|x}\left[c_{y}\right]$
for each $y$, when the classifiers are optimal.
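This conclusion can be checked numerically. The sketch below (our illustration; the cost distribution and code are made up) plugs the optimal classifiers $b_{st}^{*}(x)=I(E_{\vec{c}\sim D|x}[c_{s}]\geq tZ_{x})$ into the SECOC-Predict formula with $[t_{\min},t_{\max}]=[0,1]$ and recovers $E_{\vec{c}\sim D|x}[c_{y}]/Z_{x}$ for every label:

```python
import numpy as np

# With Bayes optimal classifiers, SECOC-Predict returns E[c_y | x] / Z_x.
# One fixed x; c is one of two cost vectors; the code is the full 4x4
# Sylvester Hadamard matrix (rows = subsets, including the empty subset).
costs = np.array([[1.0, 0.0, 2.0, 3.0],
                  [0.0, 2.0, 2.0, 1.0]])
probs = np.array([0.4, 0.6])                  # P(c | x)
M = np.array([[0, 0, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [0, 1, 1, 0]])
c_bar = probs @ costs                         # E[c | x]
Z = c_bar.sum()                               # Z_x = E[|c| | x]
ts = (np.arange(10000) + 0.5) / 10000         # fine grid for t ~ U[0, 1]

estimates = []
for y in range(4):
    vals = []
    for row in M:
        cs = (row * c_bar).sum()              # E[c_s | x]
        b_star = cs >= ts * Z                 # optimal b_st(x), all t at once
        vals.append(np.mean(b_star if row[y] else ~b_star))
    estimates.append(2 * float(np.mean(vals)) - 1)
# estimates is approximately c_bar / Z, i.e., the relative expected costs
```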
Now we analyze the regret transformation properties. The remainder
of this proof characterizes the most efficient way that any adversary
can induce estimation regret with a fixed budget of importance weighted
regret.
Examining equations \ref{eq:foo} and \ref{eq:foo2}, notice that
the importance weighted loss grows linearly with the distance of $tZ_{x}$
from $E_{\vec{c}\sim D|x}\left[c_{s}\right]$, but on the other hand,
each error has equal value in disturbing the expectation in SECOC-Predict
(Algorithm \ref{fig:SECOC_predict}). There are two consequences for
an adversary attempting to disturb the expectation the most while
paying the least importance weighted cost.
1) It is {}``cheapest'' for an adversary to err on the $t$ closest
to $\frac{{1}}{Z_{x}}E_{\vec{c}\sim D|x}\left[c_{s}\right]$ first.
(Any adversary can reduce the importance weighted regret by swapping
errors at larger values of $|t-\frac{{1}}{Z_{x}}E_{\vec{c}\sim D|x}\left[c_{s}\right]|$
for errors at smaller values without altering the estimation regret.)
2) It is {}``cheapest'' to have a small equal disturbance for each
$s$ rather than a large disturbance for a single $s$. (The cost
any adversary pays for disturbing the overall expectation can be monotonically
decreased by spreading errors uniformly over subsets $s$.)
Consequently, the optimal strategy for an adversary wanting to disturb
the output of SECOC-Predict by $\Delta$ is to disturb the expectation
for each $s$ by $\frac{\Delta}{2(t_{\textrm{max}}-t_{\textrm{min}})}$.
The importance weighted regret of erring (with a {}``$1$'') at
$t=\frac{\Delta}{2(t_{\textrm{max}}-t_{\textrm{min}})}+\frac{{1}}{Z_{x}}E_{\vec{c}\sim D|x}\left[c_{s}\right]$
can be found by subtracting equation \ref{eq:foo} from equation
\ref{eq:foo2}:\[
E_{\vec{c}\sim D|x}\left[t|\vec{c}|-c_{s}\right]=\frac{\Delta Z_{x}}{2(t_{\textrm{max}}-t_{\textrm{min}})}.\]
Integrating this regret over the disturbed range of $t$ and averaging
over $s$ relates the squared estimation error $\Delta^{2}$ to the induced
importance weighted regret, yielding the stated bound.
$\Box$
\end{proof}
\section{Inconsistency of ECOC}
\label{sub:Inconsistency-of-ECOC}
\begin{theorem}
\emph{(ECOC Inconsistency)} For all $k>2$, there exists a distribution
$D$ over multiclass test examples $(x,y)$ such that for all codes
$M$, with $c^{*}=\arg\min_{c}r(\mbox{\emph{ECOC-Train}}(D),c)$, \[
r(D,\mbox{\emph{ECOC-Predict}}(c^{*}))>\frac{1}{8},\]
where $\mbox{\emph{ECOC-Train}}$ and $\mbox{\emph{ECOC-Predict}}$ are as defined
in the introduction.
\end{theorem}
\begin{proof}
The proof is constructive. We choose a $D$ that places positive probability
on only three labels: `1', `2', and `3'.
A few observations about symmetry simplify the proof. First, since
only three labels have positive probability, we can rewrite any code $M$ as
a new \emph{weighted} code $M'$ over the three labels, where each
subset has a weight $w_{s}$ equal to the number of times that subset
of the three labels occurs in $M$ after projection. The
second observation is that symmetry with respect to complementation
implies that each row (and each codeword) can be assumed to have exactly one `1' in it.
These observations imply that ECOC essentially uses the binary classifier
to ask, ``Is the probability of label $i>0.5$?'' for each $i\in\{1,2,3\}$.
These answers are then combined with a weighted sum. If we let the
probability of one label be $0.5-\epsilon$ and the probability of
the other two labels be $0.25+\frac{\epsilon}{2}$ each, the answer to every
question will be ``no''.
Since we have a weighted sum, the exact weighting determines the outcome
(possibly with randomization to break ties). The exact distribution
therefore picks a label at random to have probability $0.5-\epsilon$,
encodes that choice in the $x$ value, and then draws the label from
this associated distribution.
Under any code, the probability of predicting the label with greatest
probability is at most $\frac{1}{3}$, implying a regret of $\frac{2}{3}(0.5-\epsilon-(0.25+\frac{\epsilon}{2}))$,
which approaches $\frac{1}{6}$ as $\epsilon\rightarrow0$ and, in particular,
exceeds $\frac{1}{8}$ for sufficiently small $\epsilon$.
$\Box$
\end{proof}
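A tiny numeric illustration of the construction (the specific value of $\epsilon$ is ours): with these label probabilities, every question of the form {}``is the probability of label $i$ greater than $1/2$?'' is answered ``no'', so the Bayes optimal binary classifiers carry no information about which label received the extra mass.

```python
# Label probabilities from the construction above, for a small eps.
eps = 0.01
p = [0.5 - eps, 0.25 + eps / 2, 0.25 + eps / 2]
# Each binary question "is p_i > 1/2?" has the same answer regardless of
# which label was chosen to carry probability 0.5 - eps.
answers = [pi > 0.5 for pi in p]
```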
A margin-based version of ECOC \cite{marginecoc} has the same lower
bound whenever the coding matrices are limited to {}``-1'' and {}``1''
entries. This is because consistent binary classifiers might have
margin $1$ or $-1$ for each example, and the proof above holds.
However, this version of ECOC also allows {}``don't cares'' in the
coding matrix. The existence of {}``don't cares'' allows questions
of the form, {}``Is the probability of this label greater than that
label?'' In general, these are sufficiently powerful to support regret
transformation consistency, with the exact quantity of regret transformation
efficiency dependent on the coding matrix. We are not aware of any
margin based code with {}``don't cares'' with a better regret transform
than SECOC with the Hadamard Code.
Take as an example the all-pairs code, which is consistent with margin-based
ECOC. The all-pairs code creates a classifier for every pair of classes
deciding (for an optimal classifier) {}``Is class $i$ more probable
than class $j$?'' The problem with this question is that the classifier
is applied when class $i$ and class $j$ each have zero probability.
In this situation, an adversarial classifier can choose to report
either $p_{i}>p_{j}$ or $p_{j}>p_{i}$ without paying any regret.
Consequently, an adversarial binary classifier could make some label
with $0$ conditional probability beat all labels except for the correct
label for free. This is not robust, because one error in one classifier
(out of $k-1$ active classifiers) can alter the result. Consequently,
the regret transform for this code scales with $k$.
\section{Discussion}\label{sub:Varying-interval}
%This is the final section where we discuss various details and modifications
%of the core algorithm.
\subsubsection*{Variants}
There are several variants of the basic SECOC algorithm.
\paragraph{Random Code}
One simple variant code is {}``pick a random subset $s$ and pick
a random $t$''. In expectation over the random choices, this code has
essentially the same analysis as the Hadamard code presented here.
\paragraph{Optimal codes}
For small values of $k$ (the number of classes), it is possible to
derive a better regret transform with other codes. For example, when
$k=2$ there is only one useful subset (up to symmetry in complementation),
so the prediction algorithm can simply output the cost estimate for
that one subset rather than 2 times the average predicted cost, minus
1. This removes a factor of $2$ loosening in the last paragraph of
the proof of the main theorem.
%
When used for class probability prediction, the above observation improves
on the regret transform analysis of the probing algorithm \cite{Probing}
by a factor of $\sqrt{2}$. The reason for this improvement is (essentially)
the use of a unified measure over the classification problem rather
than many different measures for different problems.
\paragraph{Varying interval}
The range of $t$ can be made dependent on $s$. This is useful when
embedding multiclass classification into cost-sensitive classification
for $k$ not a power of $2$. Roughly speaking, allowing the range
of $t$ to vary with $s$ eliminates the use of classifiers for which
the correct prediction is ``always 0'' or ``always 1''. Eliminating
these classifiers improves the regret transform by reducing the size
of the set over which regret is averaged.
\paragraph{Clipping}
Our prediction algorithms can output relative costs or probability
estimates above ``1'' or below ``0''. In such cases, clipping
the prediction to the interval $[0,1]$ always reduces regret.
\shrink{
In practical application several concerns other than regret transforms
arise.
\subsubsection*{Randomized Integration to Reduce Computation}
When $k$ is very large, we can instead use a code defined by picking
a random subset of $O(\frac{1}{\epsilon^{2}}\log\frac{2k}{\delta})$
codewords. Hoeffding bounds imply that the estimates vary by at most
$\epsilon$ with probability $1-\delta$. Choosing random values of
$t$ rather than all values of $t$ further reduces computation without
decreasing the accuracy bound of $\epsilon$.
}
\subsubsection*{Why regret isn't everything}
Other work \cite{wap} defines a reduction with the property that
small \emph{error} on the subproblem implies small \emph{error} on
the original problem. The definition of regret we use here is superior
because the theorems apply nontrivially even to problems with
large inherent noise. However, the mathematical form of the regret
transforms is weaker: typically, an $\epsilon$ loss bound becomes a
$\sqrt{\epsilon}$ regret bound. Tightening the regret transform by removing
the square root seems to require a different construction.
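To see the cost of the square root concretely, instantiate the thresholded
cost-sensitive bound of $4\sqrt{\epsilon Z}$ with $Z=1$:
\[
\epsilon = 0.01 \quad\Longrightarrow\quad 4\sqrt{\epsilon Z} = 4\sqrt{0.01} = 0.4,
\]
so a binary regret of $1\%$ only guarantees a cost-sensitive regret of
$40\%$, whereas an additive transform would guarantee regret on the order
of $\epsilon$ itself.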
\subsubsection*{Difficulty of created learning problems}
A natural concern when using any reduction is that it
may create hard problems for the oracle.
Indeed, learning to distinguish a random subset of labels may be significantly
harder than learning to distinguish (say) one label from all other
labels, as in the One-Against-All (OAA) reduction; this was observed
empirically in \cite{marginecoc,Venkat}.
The best choice of code is a subtle affair.
The method here is general and can be used with sparser coding matrices as
well.
Nevertheless, there is some empirical evidence in support of SECOC with
Hadamard codes (presented in the next section).
\section{Experimental Results}\label{experiments}
We compared the performance of SECOC, ECOC and One-Against-All (OAA)
on several multiclass datasets from the UCI Machine Learning
Repository~\cite{uci} (ecoli, glass, pendigits, satimage, soybean, splice,
vowel, and yeast). Hadamard matrices were used for both SECOC and ECOC.
As oracles, we used a decision tree learner (J48), a (linear) support
vector machine learner (SMO) and logistic regression (denoted LR), all available
from Weka~\cite{weka}. Default parameters were used for all three learners
in all experiments.
For datasets that do not have a standard train/test split,
we used a random split with $2/3$ for training and $1/3$ for testing.
The figures below show test error rates of SECOC plotted against those
of ECOC and OAA (the axes are labeled).
\begin{figure}
\centering
\includegraphics[scale=1]{plot-ECOC.ps}
\hfil
\includegraphics[scale=1]{plot-OAA.ps}
\caption{Test error rates of SECOC plotted against those of ECOC (left)
and OAA (right).}
\end{figure}
\noindent
For SECOC, we used six thresholds for each row of the matrix.
SECOC resulted in superior (or equal)
performance on every dataset tested, for every
learner used. We do not report any statistical significance tests because
the assumptions they are based on are not satisfied by the datasets.
Instead we report \emph{all} experiments performed;
we believe that the observed consistency across different
datasets and learners gives sufficient empirical evidence in support of SECOC.
The code is available from the authors.
\begin{thebibliography}{10}
\bibitem{marginecoc}Erin Allwein, Robert Schapire, and Yoram Singer. Reducing multiclass
to binary: A unifying approach for margin classifiers. \emph{Journal
of Machine Learning Research}, 1:113--141, 2000.
\bibitem{wap}Alina Beygelzimer, Varsha Dani, Tom Hayes, and John Langford. Reductions
between classification tasks. \emph{Electronic Colloquium on Computational
Complexity}, TR04-077, 2004.
\bibitem{ecoc}Thomas Dietterich and Ghulum Bakiri. Solving multiclass learning problems
via error-correcting output codes. \emph{Journal of Artificial Intelligence
Research}, 2:263--286, 1995.
\bibitem{adaboost}Yoav Freund and Robert Schapire. A decision-theoretic generalization
of on-line learning and an application to boosting. \emph{Journal of
Computer and System Sciences}, 55(1):119--139, 1997.
\bibitem{Venkat}Venkat Guruswami and Amit Sahai. Multiclass learning, boosting, and
error-correcting codes. In \emph{Proceedings of the 12th Annual Conference
on Computational Learning Theory (COLT)}, 145--155, 1999.
\bibitem{KS}Adam Kalai and Rocco Servedio. Boosting in the presence of noise.
In \emph{Proceedings of the 35th Annual ACM Symposium on Theory of Computing
(STOC)}, 195--205, 2003.
\bibitem{Probing}John Langford and Bianca Zadrozny. Estimating class membership probabilities
using classifier learners. In \emph{Proceedings of the 10th International
Workshop on Artificial Intelligence and Statistics}, 2005.
\bibitem{Wahba}Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector
machines: Theory and application to the classification of microarray
data and satellite radiance data. \emph{Journal of the American Statistical
Association}, 99(465):67--81, 2004.
\bibitem{weighted majority}Nick Littlestone and Manfred Warmuth. The weighted majority
algorithm. \emph{Information and Computation}, 108(2):212--261, 1994.
\bibitem{Platt}John Platt. Probabilistic outputs for support vector machines and
comparisons to regularized likelihood methods. In A. Smola, P. Bartlett,
B. Schölkopf, and D. Schuurmans, editors, \emph{Advances in Large
Margin Classifiers}, 61--74, 1999.
\bibitem{uci}C. Blake and C. Merz. UCI Repository of Machine Learning Databases.
University of California, Irvine. {\small\texttt{http://www.ics.uci.edu/$\sim$mlearn/MLRepository.html}}
\bibitem{valiant}Leslie Valiant. Learning disjunctions of conjunctions. In \emph{Proceedings
of the 9th International Joint Conference on Artificial Intelligence (IJCAI)},
560--566, 1985.
\bibitem{Vapnik}Vladimir Vapnik. \emph{The Nature of Statistical Learning Theory}.
Springer, 1995.
\bibitem{VC}Vladimir Vapnik and Alexey Chervonenkis. On the uniform convergence
of relative frequencies of events to their probabilities. \emph{Theory
of Probability and its Applications}, 16(2):264--280, 1971.
\bibitem{weka}Ian H. Witten and Eibe Frank. \emph{Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations}. Morgan Kaufmann, 2000. {\small\texttt{http://www.cs.waikato.ac.nz/ml/weka/}}
\bibitem{costing}Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning
by cost-proportionate example weighting. In \emph{Proceedings of the
3rd IEEE International Conference on Data Mining (ICDM)}, 435--442,
2003.
\end{thebibliography}
\end{document}