%\documentclass{acm_proc_article-sp}
\documentclass{sig-alternate}
\newtheorem{theorem}{Theorem}
\newtheorem{corollary}{Corollary}
\newtheorem{lemma}{Lemma}
\newtheorem{proposition}{Proposition}
\newtheorem{define}{Definition}
\newtheorem{conjecture}{Conjecture}
\DeclareMathOperator*{\Exp}{E}
\DeclareMathOperator*{\Exphat}{\hat{E}}
\begin{document}
\title{An Iterative Method for Multi-class Cost-sensitive Learning}
\numberofauthors{3}
\author{
\alignauthor Naoki Abe\\
\affaddr{Mathematical Sciences Dept.}\\
\affaddr{IBM T.~J.~Watson Res. Ctr.}\\
\affaddr{Yorktown Heights, NY 10598}\\
\email{nabe@us.ibm.com}
\alignauthor Bianca Zadrozny\\
\affaddr{Mathematical Sciences Dept.}\\
\affaddr{IBM T.~J.~Watson Res. Ctr.}\\
\affaddr{Yorktown Heights, NY 10598}\\
\email{zadrozny@us.ibm.com}
\alignauthor John Langford\\
\affaddr{Toyota Technological Institute at Chicago}\\
\affaddr{1427 East 60th Street}\\% - Press Building}\\
\affaddr{Chicago, IL 60637}\\
\email{jl@tti-c.org}
}
\date{26 May 2004}
\maketitle
\begin{abstract}
Cost-sensitive learning addresses the issue of classification in the
presence of varying costs associated with different types of
misclassification. In this paper, we present a method for solving
multi-class cost-sensitive learning problems using any binary
classification algorithm.
This algorithm is derived using three key ideas:
1) iterative weighting; 2) expanding data space; and
3) gradient boosting with stochastic ensembles.
%This algorithm is based on a number of
%techniques: iterative weighting, data space expansion, and gradient
%boosting with stochastic ensembles.
We establish some theoretical guarantees concerning
the performance of this method.
In particular, we show that a certain variant possesses
the {\em boosting property}, given a form of weak learning assumption
on the component binary classifier.
We also empirically evaluate the performance of the proposed method
using benchmark data sets and verify that our method generally
achieves better results than representative methods for cost-sensitive
learning, in terms of predictive performance (cost minimization) and,
in many cases, computational efficiency.
%concerning the convergence and optimality of a variant of the method
%and present empirical evidence of effectiveness
%for real-world problems
%in terms of predictive performance (cost minimization).
\end{abstract}
% A category with only the three required fields
\category{I.2.6}{Artificial Intelligence}{Learning}
%\category{H.4.m}{Information Systems}{Miscellaneous}
%\category{D.2}{Software}{Software Engineering}
%A category including the fourth, optional field follows...
%\category{D.2.8}{Software Engineering}{Metrics}[complexity measures,
%performance measures]
%
\terms{Algorithms}
%
\keywords{cost-sensitive learning, multi-class classification, boosting}
% NOT required for Proceedings
\section{Introduction}
Classification in the presence of varying costs associated with
different types of misclassification is important for practical
applications, including many data mining problems such as
targeted marketing and fraud and intrusion detection.
The body of work on this subject in machine learning and
data mining has become known as {\em cost-sensitive learning}.
Research in cost-sensitive learning falls into three main
categories. The first category is concerned with making particular
classifier learners cost-sensitive, including methods specific for
decision trees \cite{Knoll,Bradford}, neural networks \cite{GeWy03}
and support vector machines \cite{FuRo02}. The second category uses
Bayes risk theory to assign each example to its lowest expected cost
class \cite{Domingos,ZaEl01b}. This requires classifiers to
output class membership probabilities and sometimes requires
estimating costs \cite{ZaEl01b} (when the costs are unknown at
classification time). The third category concerns methods that modify
the distribution of training examples before applying the classifier
learning method, so that the classifier learned from the modified
distribution is cost-sensitive. We call this approach {\em
cost-sensitive learning by example weighting}. Work in this area
includes stratification methods \cite{ChSt98,Breiman84} and the {\em
costing} algorithm \cite{ZLA03}. This approach is very general since
it reuses arbitrary classifier learners and does not require accurate
class probability estimates from the classifier. Empirically, this
approach attains cost-minimization performance comparable to or better
than that of the other two approaches.
Unfortunately, current methods in this category suffer from a major
limitation: they are well-understood only for two-class problems.
In the two-class case,
it is easy to show that each example should be weighted
proportionally to the difference in cost between predicting correctly
or incorrectly \cite{ZLA03}. However, in the multi-class case there is
more than one way in which a classifier can make a mistake, breaking
the application of this simple formula. Heuristics, such as weighting
examples by the average misclassification cost, have been proposed
\cite{Breiman84,MargThesis}, but they are not well-motivated
theoretically and do not seem to work very well in practice when
compared to methods that use Bayes risk minimization \cite{Domingos}.
In this paper, we propose a method for multi-class cost-sensitive
learning based on an iterative scheme for example weighting. There are a
number of key techniques we employ in this method:
%\begin{enumerate}
%\item
1) an iterative process which empirically adjusts the example weighting according to the performance of the learning algorithm;
%\item
2) data space expansion for multi-class labels;
%\item
3) gradient boosting \cite{Mason00} with stochastic ensembles.
%\end{enumerate}
The first two ideas are combined in a unifying framework
given by the third.
We establish a theoretical performance guarantee for a variant of
our algorithm. In particular, we show that this variant possesses the
so-called {\em boosting property}, given that the component
classification algorithm satisfies a certain {\em weak learning assumption}.
We test our method on
several multi-class data sets and discover
excellent predictive performance (i.e. cost minimization) as compared to
existing cost-sensitive algorithms. Moreover, our results show that when the
distribution of costs is skewed (as is common in many data mining
applications) our method has the added advantage that it
uses drastically smaller sample sizes and hence requires
far fewer computational resources.
% deleted for now
%performance both in terms of predictive performance
%(cost minimization) and computational efficiency.
\section{Preliminaries}
\label{sec:prelim}
We begin by introducing some general concepts and notation we use in
the rest of the paper.
\subsection{Cost-sensitive learning and related problems}
\label{subsec:problems}
A popular formulation of the cost-sensitive learning problem is via
the use of a cost matrix. A cost matrix, $C(y_1,y_2)$, specifies how
much cost is incurred when an example is predicted to belong to class $y_1$ when its
correct label is $y_2$, and the goal of a cost-sensitive learning method is to minimize
the expected cost. Zadrozny and Elkan \cite{ZaEl01b} noted that this
formulation is not applicable in situations in which misclassification
costs depend on particular instances, and proposed a more general form
of cost function, $C(x,y_1,y_2)$, that allows dependence on the
instance $x$. Here we adopt a formulation based on this cost function (though slightly more general).
Once we allow the costs to depend on each example, it is natural to
assume that the costs are generated according to some distribution,
along with the examples, which leads to the following formulation. In
(multiclass) cost sensitive classification, examples of the form $(x,
\vec{C})$ are drawn from a distribution $D$ over a domain $X \times
(R^+)^k$. (Throughout the paper, we will let $k$ denote $|Y|$.)
Here, for each label $y \in Y$, $C_y$ equals the cost of misclassifying
instance $x$ as $y$, i.e.
$C_y = C(x,y,y^*)$ where $y^*$ is the minimum cost label for $x$.
Given a set of examples, $S = (x, \vec{C})^m$, the goal is to find a
classifier $h: X \rightarrow \{1,...,k\}$ which minimizes the expected
cost of the classifier:
\begin{equation}
\arg \min_{h} \Exp_{(x,\vec{C}) \sim D} [C_{h(x)}]
\label{eq:pre1}
\end{equation}
We can assume without loss of generality that the costs are
normalized so that
\[
\forall x \in X \: \min_{y \in Y} C_y = 0.
\]
Note that with this normalization,
the above formulation is equivalent to the common
formulation in terms of misclassification cost, i.e.,
\[
\arg \min_{h} \Exp_{(x,\vec{C}) \sim D} [C_{h(x)}I(h(x) \neq \arg\min_y C_y)]
\]
where we used $I(\cdot)$ to denote the indicator function
which takes on the value 1 whenever the statement is true,
and the value 0 otherwise.
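As a concrete illustration of this normalization (a minimal sketch, not part of the paper's method), shifting each cost vector so its minimum is zero leaves the best label and all cost differences unchanged:

```python
# Illustrative sketch: normalize a per-label cost vector so that
# min_y C_y = 0, as assumed in the text. Cost differences between
# labels (and hence the argmin) are preserved.

def normalize_costs(C):
    """Shift a cost vector so the minimum-cost label has cost zero."""
    m = min(C)
    return [c - m for c in C]

costs = [3.0, 1.0, 6.0]
print(normalize_costs(costs))  # [2.0, 0.0, 5.0]
```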
Normally a learning method attempts to do this by minimizing the empirical
cost in the given training data, given some hypothesis class ${\cal H}$:
\begin{equation}
\arg \min_{h \in {\cal H}}
\Exphat_{(x,\vec{C}) \sim S} [C_{h(x)}] =
\frac{1}{|S|} \sum_{(x,\vec{C}) \in S} C_{h(x)}
\label{eq:csfun}
\end{equation}
Here the empirical expectation notation, $\Exphat$, refers to
the averaged empirical cost.
As a building block of our method, we make use of methods for solving
importance weighted classification problems, which we define below.
In importance weighted classification, examples of the form $(x,y,c)$
are drawn from a distribution $D$ over a domain $X \times Y \times R^+$.
Given a set of examples $S= (x,y,c)^m$,
the goal is to find a classifier $h: X \rightarrow Y$
having minimum importance-weighted misclassification error:
\[
\arg \min_h \Exp_{(x,y,c)\sim D}\left[ c \cdot I(h(x)\neq y)\right]
\]
Again, a learning method usually attempts to meet this goal by
minimizing the empirical weighted error over
some hypothesis class ${\cal H}$:
\begin{equation}
\arg \min_{h \in {\cal H}} \Exphat_{(x,y,c)\sim S}\left[ c \cdot I(h(x)\neq y)\right]
\label{eq:iwc}
\end{equation}
We emphasize that the importance weighted formulation critically
differs from the per example formulation of
multi-class cost-sensitive learning in that there is a single weight
associated with each instance $x$, whereas in multi-class cost-sensitive learning
there is a weight (misclassification cost) associated with each label $y$.
We note that importance weighted classification can be solved
very well with a classifier learning method, by use of weighted
rejection sampling techniques \cite{ZLA03}.
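The weighted rejection sampling idea can be sketched as follows (a hypothetical illustration in the spirit of the costing algorithm of \cite{ZLA03}, not the authors' implementation): each weighted example is kept with probability proportional to its importance, producing an unweighted sample for an ordinary classifier learner.

```python
# Hypothetical sketch of weighted rejection sampling: keep each
# importance-weighted example (x, y, c) with probability c / c_max,
# yielding an unweighted sample drawn from the reweighted distribution.

import random

def rejection_sample(weighted_examples, rng=random.Random(0)):
    c_max = max(c for _, _, c in weighted_examples)
    return [(x, y) for (x, y, c) in weighted_examples
            if rng.random() < c / c_max]

sample = [("x1", 0, 4.0), ("x2", 1, 1.0), ("x3", 0, 2.0), ("x4", 1, 0.0)]
kept = rejection_sample(sample)
# The maximum-weight example is always kept; zero-weight examples never are.
```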
\subsection{Hypothesis representations and other notation}
\label{subsec:hypo}
In the above, we assumed that the hypothesis output by a cost-sensitive
learner is a functional hypothesis $h$, i.e.
\(
h: X \rightarrow Y.
\)
It is also possible to allow hypotheses that are {\em stochastic}, namely
\[
h: X \times Y \rightarrow [0, 1]
\]
subject to the stochastic condition:
\[
\forall x \in X \: \sum_{y \in Y} h(y|x) = 1.
\]
With stochastic hypotheses, {\em stochastic} cost-sensitive learning is
defined as the process of finding a hypothesis minimizing
the following expected cost:
\[
\arg \min_{h} \Exp_{(x,\vec{C})\sim D} \Exp_{y\sim h(y|x)} [ C_y ]
\]
We sometimes use the following shorthand notation for the
expected cost of a stochastic hypothesis $h$ on an instance $x$:
\[
C_{h(x)} \equiv \Exp_{y\sim h(y|x)} [ C_y ]
\]
Note that in the special case that $h$ is deterministic,
the above formulation is equivalent to the definition given in
Eq.~\ref{eq:pre1}.
Also, this is a {\em convexification} of the standard objective
function that we usually expect a stochastic cost-sensitive learner
to minimize, i.e.
\[
\Exp_{(x,\vec{C}) \sim D} [C_{\arg\max_y h(y|x)}]
\]
We also consider a variant of cost-sensitive learning
in which {\em relational} hypotheses are allowed.
Here relational hypotheses $h$ are relations over $X \times Y$, i.e.
$h: X \times Y \rightarrow \{ 0, 1 \}$.
In general, $h$ is neither functional nor stochastic; in
particular, it may violate the stochastic condition
$\sum_{y \in Y} h(x,y) = 1$.
\section{The Methodology}
Our methodology can be interpreted as a {\em reduction},
which translates a multi-class cost-sensitive learning problem
to a classifier learning problem.
This methodology is derived using three key ideas:
1) iterative weighting; 2) expanding data space; and
3) gradient boosting with stochastic ensembles.
The first two ideas are combined in a unifying framework
given by the third.
Below we will explain the first two key ideas by exhibiting
a prototypical method based on each, and then derive our main
learning method that makes use of them in a gradient boosting framework.
\subsection{Iterative cost weighting}
\label{subsec:method1}
We note that the weighting scheme proposed in \cite{ZLA03}, called
{\em costing}, exploits the following observation: For the binary
class case, the above formulation in terms of per example cost for
each class can be further reduced to a formulation in terms of a
single importance number per example. This is possible by associating
a number indicating the importance of an example $(x,y)$, given by
$|C_0 - C_1|$. This conversion allows us to reduce the
cost-sensitive learning problem to a weighted classifier learning
problem, but it is not immediately obvious how that would be done for
the multi-class scenario. It is therefore natural to consider
iterative weighting schemes, in which example weights are iteratively
modified in search for the optimal weighting. The technique, which we
call {\em IW (Iterative Weighting)}, presented in
Figure~\ref{fig:method1}, is an example of such a scheme.
We can show that the final weights of IW are the optimal weights
in some sense, {\em provided} the algorithm converges.
\begin{figure}[t]
\fbox{~\\
\begin{minipage}{3.2in}
\begin{itemize}
\item An input sample
$S = (x, \vec{C})^m$.
\item A component learner $A$ that takes an importance-weighted sample
$S'= (x,y,c)^m$ and outputs a functional hypothesis $h$,
possibly by weighted sampling.
% which tries to minimize the weighted misclassification error.
\item An integer $T$ specifying the number of iterations to be performed.
\end{itemize}
\begin{enumerate}
\item Let $S' =$
\[
\{ (x, \arg\min_y C_y, \max_y C_y) | (x, \vec{C}) \in S \}\]
\item {\bf For} $t:= 1$ to $T$ {\bf do}
\begin{enumerate}
\item Let $h_t := A(S')$.
\item Choose $\alpha_t \in (0,1]$, for example $\alpha_t := 1/t$.
\item For each example $(x,y,c) \in S'$ with an error $h_t(x) \neq y$,
update the importance weight by
\[
c := \alpha_t C_{h_t(x)} + (1 - \alpha_t) c
\]
\end{enumerate}
\item Return $h_{T}$.
\end{enumerate}
\end{minipage}}
\caption{Method {\em IW (Iterative Weighting)}}
\label{fig:method1}
\end{figure}
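The IW loop can be sketched in a few lines (an illustrative sketch, assuming labels are $0,\dots,k-1$ and costs are normalized; the toy constant-predictor learner stands in for an arbitrary component learner $A$):

```python
# Minimal sketch of the IW (Iterative Weighting) loop. The component
# learner here is a toy stand-in; any importance-weighted multi-class
# learner could be plugged in.

def iw(S, learner, T):
    """S: list of (x, C) with C a per-label cost list."""
    # Initial weighted sample: the best label, weighted by the max cost.
    Sp = [(x, min(range(len(C)), key=C.__getitem__), max(C))
          for x, C in S]
    h = None
    for t in range(1, T + 1):
        h = learner(Sp)
        alpha = 1.0 / t
        # Re-weight only the examples the current hypothesis errs on:
        # c := alpha * C_{h_t(x)} + (1 - alpha) * c
        Sp = [(x, y, alpha * S[i][1][h(x)] + (1 - alpha) * c)
              if h(x) != y else (x, y, c)
              for i, (x, y, c) in enumerate(Sp)]
    return h

def constant_learner(Sp):
    """Toy component learner: for every x, predict the single label
    with the largest total weight in the sample."""
    totals = {}
    for x, y, c in Sp:
        totals[y] = totals.get(y, 0.0) + c
    best = max(totals, key=totals.get)
    return lambda x: best

S = [("a", [0.0, 5.0]), ("b", [3.0, 0.0])]
h = iw(S, constant_learner, 5)
```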
\begin{theorem}
\label{johnthm}
Assume IW converges and the final hypothesis is $h$. Then, the following holds:
\[
\Exphat_{(x,y,c) \sim S'} [c \cdot I(h(x) \neq y)]
= \Exphat_{(x,\vec{C})\sim S} [C_{h(x)}].
\]
\end{theorem}
This theorem says that, if the iterative algorithm converges,
then the loss of the component learner with respect to the final weights
(left hand side) is equal to the expected cost for the problem that
we wish to solve (right hand side).
{\bf Proof}
At convergence, for every example $(x,y,c)$ on which $h$ errs, we
must have
\[
c = \alpha_t C_{h(x)} + (1 - \alpha_t) c
\]
and thus
\[
c = C_{h(x)}
\]
When $h$ does not err on $x$, we have $I(h(x) \neq y) = 0$; moreover,
since $y$ is the least-cost label for $x$ by the construction of $S'$,
the normalization gives $C_{h(x)} = C_y = 0$ in this case.
It follows that
\[
\Exphat_{(x,y,c)\sim S'} [c \cdot I(h(x) \neq y)] = \Exphat_{(x,\vec{C})\sim S}
[C_{h(x)}]
\]
{\bf Q.E.D}
\subsection{Data space expansion}
\label{subsec:methodb}
One drawback to iterative weighting is an inability to directly take
into account the different costs associated with multiple ways of
misclassifying examples. This translates to non-convergence of the
method in practice. We address this issue by the technique of
expanding data space, as we describe below.
%The objective of minimizing the empirical cost on the original
%training sample is equivalent to minimization on the following {\em
%expanded} sample.
Given a labeled sample $S$ consisting of
$(x,\vec{C})$ of size $m$, we define an {\em expanded} sample
$S'$ of size $m k$ for weighted classification,
where $k$ is the size of the label set, i.e. $k = |Y|$, as follows.
\[
S' = \{ (x, y, \max_{y'} C_{y'} - C_y) | (x, \vec{C}) \in S, y \in Y \}
\]
Note here that the newly defined weights are more like {\em benefits}
than costs, since larger costs are mapped to smaller weights.
It turns out that minimizing the importance weighted loss,
\begin{equation}
\Exphat_{(x,y,c) \sim S'} c \cdot I(h(x) \neq y)
\nonumber
\end{equation}
on this new data set also minimizes the cost on our original sample.
The algorithm DSE (Data Space Expansion), shown in
Figure~\ref{fig:methodb}, is based on this observation,
which is stated as the theorem below.
\begin{figure}[t]
\fbox{~\\
\begin{minipage}{3.2in}
\begin{itemize}
\item An input sample
$S = (x, \vec{C})^m$.
\item A component learner $A$ that takes an importance-weighted sample
$S'= (x,y,c)^m$ and outputs a functional hypothesis $h$,
possibly by weighted sampling.
\end{itemize}
\begin{enumerate}
\item Set $S' = \{ (x,y,\max_{y'} C_{y'} - C_y) |
(x, \vec{C}) \in S, y \in Y \}$
\item Return $h := A(S')$
\end{enumerate}
\end{minipage}}
\caption{Method {\em DSE (Data Space Expansion)}}
\label{fig:methodb}
\end{figure}
\begin{theorem}
\label{biancathm}
With the definitions given in Figure~\ref{fig:methodb},
a hypothesis $h$ minimizing the weighted classification error on
the expanded weighted sample $S'$,
\begin{equation}
\Exphat_{(x,y,c) \sim S'} [c \cdot I(h(x) \neq y)]
\nonumber
\end{equation}
also minimizes the cost on the original sample $S$,
\begin{equation}
\Exphat_{(x,\vec{C}) \sim S} [C_{h(x)}]
\nonumber
\end{equation}
\end{theorem}
{\bf Proof}
\begin{eqnarray}
& &
\arg\min_h \Exphat_{(x,y,c) \sim S'} [c \cdot I(h(x) \neq y)] \nonumber \\
& = &
\arg\min_h \Exphat_{(x,\vec{C}) \sim S} \sum_{y \in Y}
[(\max_{y'\in Y} C_{y'} - C_y) \cdot I(h(x) \neq y)] \nonumber \\
& = &
\arg\max_h \Exphat_{(x,\vec{C}) \sim S} \sum_{y \in Y}
[ C_y \cdot I(h(x) \neq y)] \nonumber \\
& = &
\arg\max_h \Exphat_{(x,\vec{C}) \sim S}
[ (\sum_{y \in Y} C_y) - C_{h(x)}] \nonumber \\
& = &
\arg\min_h \Exphat_{(x,\vec{C}) \sim S} [C_{h(x)}] \nonumber
\end{eqnarray}
Here the second equality holds because
$\sum_{y \in Y} I(h(x) \neq y) = k-1$ regardless of $h$, so the term
involving $\max_{y' \in Y} C_{y'}$ is a constant; the third uses
$\sum_{y \in Y} C_y \cdot I(h(x) \neq y) = (\sum_{y \in Y} C_y) - C_{h(x)}$.
{\bf Q.E.D.}
\subsection{Gradient Boosting with Stochastic Ensembles}
\label{subsec:gradient}
Having described two key ideas, namely iterative weighting and
data space expansion, we now apply them together to arrive at our main method.
We do so by casting the {\em stochastic} multiclass cost-sensitive
learning in the framework of gradient boosting \cite{Mason00},
with the objective function defined as
the expected cost of the {\em stochastic ensemble},
obtained as a mixture of individual hypotheses,
on the {\em expanded} data set.
As we stated in Section~\ref{sec:prelim},
a functional hypothesis of the form $h: X \rightarrow Y$
can be viewed as a special case of a stochastic hypothesis.
We then define a stochastic ensemble hypothesis $H$, given
multiple functional hypotheses, $h_t, t=1,...,T$,
as the conditional distribution
defined as the mixture of the component hypotheses, namely,
\[
\forall x \in X, \forall y \in Y,
H(y|x) = \frac{1}{T}\sum_{t=1}^T I(h_t(x)=y)
\]
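This uniform-mixture construction can be illustrated directly (a sketch with functional hypotheses as Python callables; the argument order $H(y, x)$ mirrors the conditional $H(y|x)$):

```python
# Sketch of the uniform-mixture stochastic ensemble: H(y|x) is the
# fraction of component hypotheses voting for label y on instance x.

def ensemble(hypotheses):
    T = len(hypotheses)
    def H(y, x):
        return sum(1 for h in hypotheses if h(x) == y) / T
    return H

def expected_cost(H, x, C):
    # C_{H(x)} = E_{y ~ H(y|x)} [C_y]
    return sum(H(y, x) * c for y, c in enumerate(C))

hs = [lambda x: 0, lambda x: 0, lambda x: 1, lambda x: 2]
H = ensemble(hs)
# H(0|x) = 1/2, H(1|x) = 1/4, H(2|x) = 1/4 for any x.
```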
Let $H_t$ denote the mixture hypothesis of the learning
procedure at round $t$.
The procedure is to update its current combined hypothesis
by the mixture of the previous combined hypothesis and
a new hypothesis, i.e. by setting
\begin{equation}
H_t(y|x) = (1-\beta)H_{t-1}(y|x) + \beta I(h(x)=y) \nonumber
\label{eq:update}
\end{equation}
Thus, the expected cost of $H_t$ on $x$ is
\begin{equation}
\Exp_{y\sim H_t(y|x)} C_y = (1-\beta)\Exp_{y \sim H_{t-1}(y|x)} C_y + \beta C_{h_t(x)}
\nonumber
\end{equation}
If we now take a derivative of this function with respect to $\beta$,
we get:
\begin{equation}
\frac{\partial \Exp_{y\sim H_t(y|x)} C_y}{\partial \beta} = C_{h_t(x)} - \Exp_{y \sim H_{t-1}(y|x)} C_y
\nonumber
\label{eq:derivative2}
\end{equation}
Note that this is the difference between the average
cost of the current {\em ensemble} hypothesis and the new weak
hypothesis assigning probability one to the specified label.
We then take the expectation of this derivative with respect to all
data points $(x,y)$ in the expanded data set $S'$, and thus the
gradient is $mk$-dimensional. The role of the weak learner is to find a
hypothesis $h$ whose inner product with the negative gradient is large;
that is, the weak learner seeks an $h$ maximizing the following sum.
\begin{equation}
\label{eq:weaklearner2}
- \langle h, \nabla C \rangle
= \Exphat_{(x,\vec{C}) \sim S}
[\Exp_{y' \sim H_{t-1}(y'|x)} C_{y'} - C_{h(x)}]
\end{equation}
This leads to the following example weighting on the expanded sample:
\[
w_{x,y} = C_{H_{t-1}(x)} - C_y
\]
where we recall that $C_{H_{t-1}(x)}$ denotes
$\Exp_{y' \sim H_{t-1}(y'|x)} C_{y'}$.
Note that these weight updates are similar to those used in IW and DSE.
In fact, the IW weight update rule is essentially equivalent to the GBSE rule,
except that IW keeps, for each instance $x$, a weight only for the best
(least-cost) label, for which $C_y = 0$ holds under our normalization.
This is because the IW update rule mixes the weights from
earlier iterations, which is equivalent to taking the average
over the stochastic ensemble as is done in GBSE.
The DSE update rule differs from the GBSE rule in that
$\max_{y'} C_{y'}$ is used in place of $C_{H_{t-1}(x)}$, which
ensures that the weight is always non-negative, even though DSE
has weights for all labels.
Thus, the GBSE weights can be viewed as IW weights,
applied on the expanded data set, as in DSE.
As a consequence, the GBSE weights
$w_{x,y} := C_{H_{t-1}(x)} - C_y$ can be negative,
since $y$ is not necessarily the best label.
This means that the weak learner now receives both
positive and negative weights. While the minimization of weighted
misclassification with positive and negative weights
makes perfect sense as an optimization problem, its interpretation
as a classification problem is not immediately clear.
In particular, it prohibits the use of weighted sampling as a means
of realizing the weighted classification problem.
We deal with this problem by converting a relational version of
the weighted multi-class classification problem
(i.e. of finding $h$ to maximize Eq.~\ref{eq:weaklearner2})
in each iteration to a weighted binary classification problem.
Specifically we convert each example pair $(x,y)$
to $((x,y), l)$, and set $l = 1$ if the weight on $(x,y)$ is positive,
and $l = 0$ if the weight is negative.
The output hypothesis of the binary classifier is in general
relational, so it is converted to a stochastic hypothesis
by the procedure {\bf Stochastic} shown in
Figure~\ref{fig:method2sub}.
(The particular way this procedure is defined is motivated
by the theoretical guarantee, which will be shown in the next
subsection.)
The overall process, consisting of multiple iterations of such
a reduction, constitutes a reduction of the {\em stochastic} multi-class
cost-sensitive classification to binary weighted classification.
With the foregoing definitions, we can now state our main method,
{\em GBSE} (Gradient Boosting with Stochastic Ensembles),
which is shown in Figure~\ref{fig:gbse}.
\begin{figure}[t]
\fbox{~\\
\begin{minipage}{3.2in}
\begin{itemize}
\item An input sample $S = (x, \vec{C})^m$.
\item A component learner $A$ for (importance weighted)
binary classification that takes a sample of the form
$((x, y), l, w)^*$,
and outputs a relational hypothesis $h$.
\item A subprocedure {\bf Stochastic}, as specified in
Figure~\ref{fig:method2sub}.
\item An integer $T$ specifying the number of iterations to be performed.
\end{itemize}
\begin{enumerate}
\item Set $S'= \{(x,y) | (x,\vec{C}) \in S, y \in Y\}$.
\item Initialize $H_0$ by $\forall x \in X, y \in Y \: H_0(y|x) = 1/k$.
\item {\bf For} $t:= 1$ to $T$ {\bf Do}
\begin{enumerate}
\item $w_{x,y} := \Exp_{y' \sim H_{t-1}(y'|x)} [C_{y'}] - C_y$ for all $(x,y)$ in $S'$.
\item $S_t = \{ ((x, y),I(w_{x,y}\geq 0), |w_{x,y}|) | (x,y) \in S' \}$.
\item Let $h_t := A(S_t)$
\item $f_t$ := {\bf Stochastic}($h_t, H_{t-1}$).
\item Choose $\alpha_t \in (0,1]$, for example $\alpha_t = \frac{1}{t}$.
\item Set
$H_t := (1-\alpha_t) H_{t-1} + \alpha_t f_t$.
\end{enumerate}
\item {\bf End For}
\item Return $H_{T}$.
\end{enumerate}
\end{minipage}}
\caption{Method {\em GBSE (Gradient Boosting with Stochastic Ensembles)}}
\label{fig:gbse}
\end{figure}
\begin{figure}[t]
\fbox{~\\
\begin{minipage}{3.2in}
{\bf Stochastic}($h$: a relational hypothesis,
$H$: a stochastic hypothesis)
\begin{enumerate}
\item Define $f$ by setting for each $x \in X$:
\begin{itemize}
\item (default) if $|\{ y \in Y | h(x,y) = 1 \}| = 0$ then
define for all $y \in Y$, $f(y|x) = H(y|x)$.
\item else $f(y|x) = \frac{I(h(x,y)=1)}{|\{ y' \in Y \,|\, h(x,y') = 1 \}|}$.
\end{itemize}
\item Output $f$.
\end{enumerate}
\end{minipage}}
\caption{Sub-procedure {\bf Stochastic}}
\label{fig:method2sub}
\end{figure}
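The GBSE loop and the {\bf Stochastic} subprocedure can be sketched compactly as follows. This is an illustrative sketch under stated assumptions: examples are referred to by index $i$, the binary learner $A$ is a toy stand-in that memorizes its training labels (any importance-weighted binary classifier over pairs $(x,y)$ could be plugged in), and $\alpha_t = 1/(t+1)$ is one admissible schedule.

```python
# Compact sketch of GBSE with the Stochastic subprocedure. The
# memorizing learner below is a stand-in for the component learner A.

def stochastic(h, H, k):
    """Convert a relational hypothesis h(i, y) in {0,1} into a
    stochastic one, falling back to H when h selects no label."""
    def f(y, i):
        chosen = [yy for yy in range(k) if h(i, yy) == 1]
        if not chosen:
            return H(y, i)        # default case of the subprocedure
        return h(i, y) / len(chosen)
    return f

def gbse(S, k, A, T):
    """S: list of cost vectors C (one per example); returns H(y, i)."""
    H = lambda y, i: 1.0 / k      # H_0: uniform over labels
    for t in range(1, T + 1):
        train = []
        for i, C in enumerate(S):
            exp_cost = sum(H(y, i) * C[y] for y in range(k))
            for y in range(k):
                w = exp_cost - C[y]            # GBSE weight w_{x,y}
                train.append(((i, y), int(w >= 0), abs(w)))
        h = A(train)
        f = stochastic(h, H, k)
        alpha = 1.0 / (t + 1)
        H = (lambda Hp, fp, a:
             lambda y, i: (1 - a) * Hp(y, i) + a * fp(y, i))(H, f, alpha)
    return H

def memorizing_A(train):
    """Toy binary learner: reproduces the training labels exactly."""
    pos = {xy for xy, label, w in train if label == 1}
    return lambda i, y: 1 if (i, y) in pos else 0

S = [[0.0, 1.0]]                  # one example, k = 2, best label 0
H = gbse(S, 2, memorizing_A, 3)
# The ensemble shifts probability mass toward the zero-cost label.
```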
\subsection{Theoretical Performance Guarantee on a Variant}
It turns out that a strong theoretical performance guarantee can be
proved on a variant of this method. The variant is obtained by simply
replacing the weight updating rule of GBSE by the following:
\begin{equation}
w_{x,y} = \frac{C_{H_{t-1}(x)}}{k} - C_y
\nonumber
\end{equation}
The resulting variant, which we call GBSE-T (Gradient Boosting with Stochastic Ensembles
- Theoretical version), is summarized in Figure~\ref{fig:gbset}.
\begin{figure}
\fbox{~\\
\begin{minipage}{3.2in}
Identical to Method {\em GBSE} of Figure~\ref{fig:gbse}, except
for the following change: \\
3.(a) Set $w_{x,y} := \frac{C_{H_{t-1}(x)}}{k} - C_y$,
for all $(x,y)$ in $S'$.
\end{minipage}}
\caption{Method {\em GBSE-T (Gradient Boosting with Stochastic Ensembles - Theoretical variant)}}
\label{fig:gbset}
\end{figure}
We can show that GBSE-T has a boosting property
given a version of weak learning condition on the
component classifier. This weak learning condition, which we
make precise below, is one that is sensitive to class imbalance.
\begin{define}
We say that an algorithm $A$ for the binary
importance weighted classification problem,
as defined in Section~\ref{sec:prelim},
%by Eq~\ref{eq:iwc},
satisfies the {\em weak learning condition} for a given classification
sample $S = (x,y)^m$, if for all weighted samples $S' = (x,y,c)^m$ for it,
its output $h$ satisfies the following, for some fixed $\gamma > 0$:
\begin{eqnarray}
& & \Exphat_{(x,y,c) \sim S'} c I(h(x) = y) \nonumber \\
&\geq &\Exphat_{(x,y,c) \sim S'} c I(y=0) +
\gamma \Exphat_{(x,y,c) \sim S'} c I(y=1)
\label{eq:weak2}
\end{eqnarray}
\label{def:weaklearning}
\end{define}
Intuitively, this weak learning condition requires that the weak learner
achieve better weighted accuracy than that attainable trivially by
assigning all examples to the negative class.
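The condition can be checked numerically on a given weighted sample (a sketch; the $1/m$ factors in the empirical expectations cancel, so unnormalized sums suffice):

```python
# Sketch of the weak learning condition: the largest gamma for which a
# hypothesis h beats the trivial all-negative predictor on a weighted
# binary sample Sp = [((x, y), l, c), ...].

def weak_gamma(Sp, h):
    lhs = sum(c for xy, l, c in Sp if h(xy) == l)   # weighted accuracy
    base = sum(c for xy, l, c in Sp if l == 0)      # all-negative baseline
    pos = sum(c for xy, l, c in Sp if l == 1)       # positive-class mass
    return (lhs - base) / pos

Sp = [(("a", 0), 1, 2.0), (("a", 1), 0, 1.0), (("b", 0), 1, 1.0)]
h = lambda xy: 1 if xy == ("a", 0) else 0
# h is correct on weight 3.0 of 4.0; baseline is 1.0; gamma = 2/3.
```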
\begin{theorem}
\label{prop2}
Suppose that the component learner $A$ satisfies the weak learning
condition for sample $S'$ as defined by GBSE-T. If we set $\alpha_t =
\alpha$ for all $t$, the output of GBSE-T satisfies:
%whenever $T \geq \frac{1}{\gamma \alpha} \log \frac{1}{\epsilon}$,
\[
\Exphat_{(x,\vec{C}) \sim S} C_{H_T(x)}
\leq \exp{ \{ - \frac{\gamma \alpha}{k} T \}}
\Exphat_{(x,\vec{C}) \sim S} C_{H_0(x)}
\]
\end{theorem}
This theorem shows that the empirical cost of the output hypothesis
of GBSE-T converges exponentially fast, given the weak learning assumption.
{\bf Proof} \\
We first establish the following simple correspondence between
the {\em weak learning conditions} on the relational multi-class
classification problem that we wish to solve in each iteration,
and the weighted binary classification problem that is given to the
component algorithm to solve it.
\iffalse
The following definitions are
a restatement of the definitions given in Section~\ref{sec:prelim}
for the present context.
\fi
\begin{define}
Let $S$ be a weighted sample of the form
$S = (x,y,c)^m$, where weights $c$ can be both
positive and negative. %and $y \in Y$. %with $|Y| > 2$.
Then define a transformed sample $S'$ for weighted classification as
$S' = ((x,y),I(c>0),|c|)^m$.
\begin{enumerate}
\item
The relational weighted multi-class classification problem for $S$
is to find a relational hypothesis $h: X \times Y \rightarrow \{ 0,1 \}$
that maximizes the following sum:
\[
\Exphat_{(x,y,c) \sim S} c \cdot h(x,y)
\]
\item
The weighted binary classification problem for the transformed sample
$S'$ is to find a hypothesis $h: X \times Y \rightarrow \{ 0, 1 \}$
that maximizes the following weighted classification accuracy:
\[
\Exphat_{((x,y),I(c>0),|c|) \sim S'} |c| \cdot I(h(x,y) = I(c>0))
\]
\end{enumerate}
\label{def:reduction1}
\end{define}
Note that, in the relational weighted classification problem of
Definition~\ref{def:reduction1}, the goal of a learner is to
assign 1 to pairs with positive weights and 0 to pairs with
negative weights, to the extent possible.
\begin{lemma}
For all $h$:
\begin{eqnarray}
\Exphat_{(x,y,c) \sim S} c \cdot h(x,y)
& = & \Exphat_{((x,y),I(c>0),|c|) \sim S'}\left[ |c| \, I(h(x,y) = I(c>0)) \right] \nonumber \\
& & -\ \Exphat_{(x,y,c) \sim S} \left[ |c| \, I(c<0) \right] \nonumber
\end{eqnarray}
\label{lemma:weak}
\end{lemma}
{\bf Proof of Lemma~\ref{lemma:weak}}
\begin{eqnarray}
& & \Exphat_{((x,y),I(c>0),|c|) \sim S'} |c|\cdot I(h(x,y) = I(c>0)) \nonumber \\
& = & \Exphat_{(x,y,c) \sim S} |c| \cdot I(h(x,y) = I(c>0)) \nonumber \\
& = & \Exphat_{(x,y,c) \sim S} c \cdot I(h(x,y) = 1 \mbox{ and } c \geq 0) \nonumber \\
& & + \Exphat_{(x,y,c)\sim S} - c \cdot I(h(x,y) = 0 \mbox{ and } c <0) \nonumber \\
& = & \Exphat_{(x,y,c) \sim S} c \cdot h(x,y) I(c \geq 0) + c(h(x,y) -1) I(c <0) \nonumber \\
& = & \Exphat_{(x,y,c) \sim S} c \cdot h(x,y)
+ \Exphat_{(x,y,c)\sim S} |c| I(c<0) \nonumber
%\label{eq:lemmaweak}
\end{eqnarray}
Hence the lemma follows.
{\bf Q.E.D.}
This lemma establishes that attaining positive weighted accuracy on the
original relational weighted multi-class classification problem
is equivalent to satisfying the weak learning condition on the
transformed weighted binary classification problem.
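Since the identity of Lemma~\ref{lemma:weak} holds term by term, it can also be checked numerically; the sketch below (illustrative only, with an arbitrary hypothesis of our own choosing) verifies it on a random weighted sample:

```python
import random

random.seed(0)
# A random weighted sample: weights c may be positive or negative.
S = [((i, i % 3), random.uniform(-1, 1)) for i in range(100)]
h = lambda x, y: (x + y) % 2          # an arbitrary 0/1 hypothesis

# Left-hand side of the lemma: empirical average of c * h(x, y).
lhs = sum(c * h(x, y) for ((x, y), c) in S) / len(S)
# Right-hand side: weighted accuracy on the transformed sample, minus the
# average magnitude of the negative weights.
rhs = (sum(abs(c) * (h(x, y) == (c > 0)) for ((x, y), c) in S)
       - sum(abs(c) for ((x, y), c) in S if c < 0)) / len(S)
assert abs(lhs - rhs) < 1e-12
```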
{\bf Proof of Theorem~\ref{prop2}}\\
First, note that applying {\bf Stochastic} to $h_t$ can
increase the expected cost only for $x$'s such that
$|\{ y | h_t(x,y)=1 \}| = 0$, and for such $x$'s
the cost of the output function $f$
equals that of $H_{t-1}$ by the definition of
{\bf Stochastic}.
Hence the average empirical cost of $f$ on the original sample $S$
satisfies the following:
\begin{eqnarray}
& & \Exphat_{(x,\vec{C}) \sim S} [C_{f(x)} - \sum_y h_t(x,y) C_y]
\nonumber \\
& \leq &
\Exphat_{(x,\vec{C}) \sim S} [C_{H_{t-1}(x)} I(\forall y \: h_t(x,y)=0)]
\label{eq:thm3a}
\end{eqnarray}
Now recall that the expected empirical cost of $H_t$ equals the
following, where we drop the subscript $t$ from $\alpha_t$.
\begin{eqnarray}
\Exphat_{(x,\vec{C}) \sim S} C_{H_t(x)}
& = & \Exphat_{(x,\vec{C}) \sim S} (1 - \alpha) C_{H_{t-1}(x)} + \alpha C_{f(x)}
\label{eq:thm3c}
\end{eqnarray}
Hence, by combining Eq.~\ref{eq:thm3a} and Eq.~\ref{eq:thm3c}, we can
show the following bound on the decrease in empirical cost in each
iteration. Here, we also drop the subscript $t$ on $h$.
\begin{eqnarray}
& & \Exphat_{(x,\vec{C}) \sim S} C_{H_{t-1}(x)} - \Exphat_{(x,\vec{C}) \sim S} C_{H_t(x)} \nonumber \\
& = &
\Exphat_{(x,\vec{C}) \sim S} \alpha (C_{H_{t-1}(x)} - C_{f(x)})
\nonumber \\
& = &
\alpha \Exphat_{(x,\vec{C}) \sim S} [ C_{H_{t-1}(x)} - \sum_y h(x,y) C_y \nonumber \\
& & + \sum_y h(x,y) C_y - C_{f(x)} ] \nonumber \\
& \geq &
\alpha \Exphat_{(x,\vec{C}) \sim S} [ C_{H_{t-1}(x)} \nonumber \\
& & - \sum_y h(x,y) C_y - C_{H_{t-1}(x)} I(\forall y \: h(x,y)=0)]
\nonumber \\
& \geq & \alpha \Exphat_{(x,\vec{C}) \sim S} [\sum_{y:h(x,y)=1} h(x,y)
\left( \frac{C_{H_{t-1}(x)}}{k} - C_y \right) \nonumber \\
& & + \sum_{y:h(x,y)=0} \frac{C_{H_{t-1}(x)}}{k}
- C_{H_{t-1}(x)} I(\forall y \: h(x,y)=0)]
\nonumber \\
& = & \alpha \Exphat_{(x,\vec{C}) \sim S} \sum_y h(x,y)
\left(\frac{C_{H_{t-1}(x)}}{k} - C_y\right) \nonumber \\
& & + C_{H_{t-1}(x)} \left(\frac{|\{y:h(x,y)=0\}|}{k}
- I(\forall y \: h(x,y)=0) \right)
\nonumber \\
& \geq &
\alpha \Exphat_{(x,\vec{C}) \sim S} \sum_y h(x,y)\left( \frac{C_{H_{t-1}(x)}}{k} - C_y \right)
\nonumber \\
& = &
\alpha \Exphat_{(x,\vec{C}) \sim S} \sum_y \left| \frac{C_{H_{t-1}(x)}}{k} - C_y \right| \nonumber \\
& & \left( I\left( h(x,y) = I\left( \frac{C_{H_{t-1}(x)}}{k} > C_y \right) \right) -
I\left( \frac{C_{H_{t-1}(x)}}{k} < C_y \right) \right) \nonumber
\end{eqnarray}
Here Lemma~\ref{lemma:weak} was applied to get the last equality.
Next apply the weak learning assumption on the induced measure
over $(x',y',c')$ defined by: $x' = (x,y)$, $y' =
I\left(\frac{C_{H_{t-1}(x)}}{k} > C_y\right)$, and $c' = \left|
\frac{C_{H_{t-1}(x)}}{k} - C_y \right|$ to get:
\begin{eqnarray}
& \geq &
\alpha \gamma \Exphat_{(x,\vec{C}) \sim S} \sum_y
\left| \frac{C_{H_{t-1}(x)}}{k} - C_y \right|
I\left( \frac{C_{H_{t-1}(x)}}{k} > C_y \right)
\nonumber \\
& \geq &
\frac{\alpha \gamma}{k} \Exphat_{(x,\vec{C}) \sim S} C_{H_{t-1}(x)}
\nonumber
\end{eqnarray}
The last inequality follows because $C_y \geq 0$ for all $y$ and there
exists a $y$ with $C_y = 0$ (by the normalization assumption), so the
sum of non-negative terms is bounded below by the term for that
zero-cost label, which equals $C_{H_{t-1}(x)}/k$.
Since the expected cost is convex (in fact linear), this implies
convergence to the global optimum. Noting that in each iteration the
empirical cost is multiplied by a factor of at most $1 - \frac{\gamma
\alpha}{k}$, the theorem follows. {\bf Q.E.D.}
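To make the rate concrete: the per-iteration contraction factor $1 - \frac{\gamma \alpha}{k}$ implies that after $T$ iterations the empirical cost is at most $(1 - \frac{\gamma \alpha}{k})^T$ times its initial value. A back-of-the-envelope sketch, with illustrative constants of our own choosing (not taken from the experiments):

```python
# Illustrative constants: weak-learning edge gamma, mixture weight alpha,
# and number of classes k.
gamma, alpha, k = 0.1, 0.5, 5
rate = 1 - gamma * alpha / k      # per-iteration contraction factor, here 0.99

# After T iterations the empirical cost is at most rate**T times the
# initial cost, i.e. it decays geometrically.
cost0, T = 100.0, 300
bound = cost0 * rate ** T
assert 0 < bound < cost0 * 0.05   # 0.99**300 is roughly 0.049
```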
Note that at earlier iterations, the binary classifier used as the
component learner is likely to be given a weighted sample with balanced
positive and negative examples. As the number of iterations increases
and progress is made, however, it will receive samples that are
increasingly more negative. (This is because the positive examples
correspond to labels that can further improve the current performance.)
It therefore becomes easier to attain high weighted accuracy
simply by classifying all examples as negative.
The weak learning condition of Eq.~\ref{eq:weak2} appropriately
deals with this issue, as it requires that the weak learner
achieve better weighted accuracy than that attainable by
assigning all examples to the negative class, as we mentioned earlier.
\iffalse
It is important to note that the output $H_T$ of the procedure described
above is a stochastic hypothesis. For the original cost-sensitive
learning problem, this needs to be converted to a deterministic classifier.
The obvious method is to choose, for each $x$, the $y$ for which $H_T$ gives the
maximum value. It is easy to see that the resulting
functional hypothesis will also be optimal at convergence,
in the realizable case, i.e. when the optimal labeling of the
training data can be realized by the hypothesis space at hand.
\begin{corollary}
Define a modified version of GBSE in which the final hypothesis
$H_T$ is translated to a deterministic hypothesis $G_T: X \rightarrow Y$ by
\[
G_T(x) = \arg \max_{y \in Y} H_T(x,y)
\]
Then, in the realizable case, i.e. when the class of linear combinations
of ${\cal F}$, written $LIN({\cal F})$, contains a member $F^*$
which attains the minimum cost on the input sample,
with the same weak learning condition as in Theorem~\ref{prop2},
$G_T$ will approximately minimize the empirical cost
on the original sample $S$, namely as $T$ tends to infinity, we have
\[
\lim_{T \rightarrow \infty}
\sum_{x_i \in S} \tilde{C}(x_i,H_T(x_i)) =
\min_{F \in LIN({\cal F})} \sum_{x_i \in S} \tilde{C}(x_i,F(x_i))
\]
\end{corollary}
{\bf Proof}
In the realizable case, the optimal stochastic hypothesis is one that
classifies all the examples in the expanded training data correctly
and is {\em functional} on the training data.
\iffalse
Hence, its empirical loss is equal to that of the optimal functional hypothesis.
\fi
Thus, at convergence the functional modification $G_T$ is
identical to $H_T$ on the training data,
and its empirical cost equals that of
the empirically optimal functional hypothesis.
Hence the corollary follows from Theorem~\ref{prop2}.
{\bf Q.E.D.}
\fi
\iffalse
\subsection{Variants}
The method just described, IEB, is a natural result of combining
the ideas of iterative weighting and data expansion in a gradient
boosting framework with stochastic ensembles.
Here we also consider slight variants of IEB.
Recall that the weighting schemes used in {\em IWE} and {\em IEB}
make use of the following weight for an example pair $(x,y)$.
\begin{equation}
\label{eq:weight}
w(x,y) = C(x, H_{t-1}(x)) - C(x, y)
\end{equation}
Since, in {\em IWE}, each $x$ is labeled with the ``best'' label $y^*$,
namely that $y$ which has the least cost for $x$, $C(x,y)$, {\em IW}'s
weighting scheme, written $w_1$, is related to that of {\em IEB},
$w_2$, by the following relationship.
\begin{equation}
\label{eq:weight2}
w_1(x,y^*) = \max_{y \in Y} w_2(x,y)
\end{equation}
Now observe that the distribution $D$ on $X \times Y$ can be
expressed as the following product:
\begin{equation}
\label{eq:distribution2}
D(x,y) = V(x) \cdot W(y|x)
\end{equation}
It is thus natural to consider the weighting scheme, in which
IWE's weighting, $w_1$, is used for $V(x)$, and
IEB's weighting, $w_2$, is used for $W(y|x)$.
This informal argument is for the case in which the
weights are probabilities (non-negative), but
we abuse and apply this motivation to our present context
involving positive and negative weights and arrive at the following
weighting scheme.
\begin{equation}
\label{eq:weight3a}
V(x) = w_1(x,y^*) = \max_{y \in Y} C(x, H_{t-1}(x)) - C(x, y)
\end{equation}
\begin{equation}
\label{eq:weight3b}
W(y|x) = \frac{|w_2(x,y)|}{\sum_{y' \in Y} |w_2(x,y')|} =
\frac{|C(x, H_{t-1}(x)) - C(x, y)|}
{\sum_{y' \in Y} |C(x, H_{t-1}(x)) - C(x, y')|}
\end{equation}
We call this variant IEB2 (Iterative Expand Binary 2-level weights).
It is also possible to define a variant in which
$W(y|x) = 1$ for all labels $y$. We call this variant IEB2-A
(Iterative Expand Binary 2-level weights with all labels).
This two-level weight variant is an alternative way of combining
the techniques of iterative weighting and data expansions.
One advantage that this approach has over that of IEB is that
by using the weights, $w_1$, of IWE for the choice of $x$,
it is assigning importance weights to $x$ according to the
maximum progress still possible on that $x$ at each stage.
Note that the weak learning assumption required for the performance
guarantee of IEB becomes harder to meet, as more progress is made
and less further progress is possible. Thus the use of $w_1$ weights
can be viewed as shifting the distribution so as to make it
easier for the weak learner to meet that requirement.
\fi
\iffalse
\begin{figure}[t]
\begin{itemize}
\item An input sample $S = \{ (x_i, y_i) | i=1,...,m \}$.
\item A weak binary classification learner $A$ that takes a sample
of the form
$\{\langle (x_i, y), l \rangle | i=1,...,m, y \in Y, l \in \{ 0, 1 \} \}$,
and outputs a member $h'$ of some given hypothesis space ${\cal H}'$,
which weakly minimizes the misclassification error.
%\item Normalized costs $C(x_i, y) \in [0,1]$ for all data pairs.
\item A constant $\delta \in [0,1)$ determining how to mix current weights
with uniform weights.
\end{itemize}
\begin{enumerate}
\item Define $H_0$ by $\forall x, y\: H_0(x,y) = 1/k$.
\item Initialize
$w_{i,y} := (\frac{1}{k}\sum_{y'\in Y} C(x_i,y')) - C(x_i,y)$,
for $i:= 1$ to $m$, and $y \in Y$.
\item Initialize
$u_{i} := 1/m$, for $i:= 1$ to $m$.
\item Initialize
$w_{i} := \max_{y} w_{i,y}$, for $i:= 1$ to $m$.
\item Initialize
$v_{i} := \delta u_i + (1 - \delta) w_i$.
\item {\bf For} $t:= 1$ to $T$ {\bf Do}
\begin{enumerate}
\item Set $w_{i,y} := C(x_i, H_{t-1}(x_i)) - C(x_i,y)$, for
$i:= 1$ to $m$, and for $y \in Y$.
\item Set
$w_{i} := \max_{y} w_{i,y}$, for $i:= 1$ to $m$.
\item Set
$v_i := \delta u_i + (1 - \delta) w_i$.
\item
Obtain subsample $X_t = \{ x_i | i = 1,...,m_t \}$ by performing
rejection sampling from original sample $S$
with acceptance probability $v_i$ for each ($i$-th) example.
%\item
%Obtain subsample $S_t = \{ (x_i,y_i) \}$ by labeling each $x_i$ in $X_t$
%by a member of $Y$, with probability proportional to $|w_{i,y}|$.
\item Set
$S_t' = \{ \langle (x_i, y), I(w_{i,y}\geq 0) \rangle | i=1,...,m_t, y \in Y \}$.
\item Define $\vec{|w|} = \langle |w_i| \rangle_{i,y}$.
\item Let $h_t := A(S, \vec{|w|})$ (either by weighted sampling or feeding
weights to $A$.)
\item Choose $\alpha_t \in [0,1)$ and set
%\[
%\alpha = \arg \max_{a} - L((1-a)H_{t-1}+ a h_t, S)
%\]
$H_t := (1-\alpha_t) H_{t-1} + \alpha_t h_t$.
\item {\bf End For}
\end{enumerate}
\item Return $H_{T}$.
\end{enumerate}
\caption{Method {\em IEB2 (Iterative Expand Binary 2)}}
\label{fig3}
\end{figure}
\fi
\iffalse
\begin{conjecture}
\label{prop3}
Suppose that the component learner $A$ is a distribution-free strong
learner. Then, in the realizable case,
the output of Method 3 will converge to an ensemble hypothesis
achieving approximately minimal average cost on the expanded sample $S'$
(as defined in the section on Method 2)
among the class of linear combinations of members of ${\cal H}$.
That is, as $T$ tends to infinity,
\[
E[\sum_x \sum_y H_T(x,y)C(x,y)] \approx
\min_{H \in LIN({\cal H})}
E[\sum_x \sum_y H(x,y)C(x,y)]
\]
\end{conjecture}
{\bf Proof Idea}
Note that the proof of Theorem~\ref{prop2} can be generalized
for an arbitrary distribution over $X$.
Applying the strong learning condition on $A$, rather than the
weak learning condition, will imply that Method 2's weighting scheme
will guarantee almost perfect labeling on the training data, with
respect to whatever distribution $D_t$ over $X$ used at each
iteration $t$. Thus, the output hypothesis satisfies
the strong learning condition of Theorem~\ref{prop1},
and hence an argument similar to the proof of Theorem~\ref{prop1}
will show a guaranteed positive progress in each iteration.
{\bf Q.E.D.}
\fi
\section{Experimental Evaluation}
\label{sec:experiments}
We use the C4.5 decision tree learner \cite{Quinlan} as the base classifier
learning method, because it is a standard for empirical comparisons and
it was used as the base learner by Domingos for the {\em MetaCost} method
\cite{Domingos}.
We compare our methods against three representative methods:
Bagging \cite{Breiman},
Averaging cost \cite{ChSt98,Domingos} and MetaCost.
The Averaging cost method was also used for comparison in
\cite{Domingos}. Note that Bagging is a cost-insensitive learning method.
Here we give a brief description of these methods, and
refer the reader to \cite{Breiman,Domingos} for the details.
\begin{itemize}
\item
{\em Bagging} obtains multiple sub-samples from the original training set
by sampling with replacement, feeds them to the base learner (C4.5), and takes the average
over the ensemble of output hypotheses.
\item
{\em Averaging Cost} (AvgCost) obtains a sub-sample by weighted sampling
with weights defined as the average cost for each $x$, and then
feeds it to the base learner (C4.5).
\item {\em MetaCost} uses bagging to obtain an ensemble of hypotheses,
then uses the ensemble to estimate class probabilities, relabels the examples
with the labels that minimize the expected risk according to the probability estimates and
finally runs the base learner (C4.5) to obtain a single classifier that predicts the new labels.
\end{itemize}
There are some deviations from these methods in our implementation, which we clarify below.
The main deviation is that we use {\em rejection} sampling for all methods (including bagging),
while other sampling schemes such as resampling with replacement are
used in the original methods.\footnote{In weighted rejection sampling,
the original data are scanned once (without replacement), and each
example is accepted with probability equal to (or proportional to) its
weight.}
We do this for two reasons: (1) the inadequacy of resampling with replacement (or over-sampling),
especially for C4.5, has been noted by various authors \cite{ZLA03,DH00};
(2) since our proposed methods use rejection sampling,
we do the same for the other methods for fairness of comparison.
We stress that this deviation should only improve their performance.
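For concreteness, the weighted rejection sampling described in the footnote can be sketched as follows (a minimal version with hypothetical weights; it assumes the weights already lie in $[0,1]$):

```python
import random

def rejection_sample(data, weights, rng=random):
    """One pass over the data without replacement: example i is accepted
    independently with probability weights[i] (assumed to lie in [0, 1])."""
    return [x for x, w in zip(data, weights) if rng.random() < w]

random.seed(1)
data = list(range(10000))
# Hypothetical weights: the second half of the data is up-weighted 9x.
weights = [0.1] * 5000 + [0.9] * 5000
sub = rejection_sample(data, weights)
low = sum(1 for x in sub if x < 5000)    # accepted from the 0.1-weight half
high = len(sub) - low                    # accepted from the 0.9-weight half
```

Acceptance rates track the weights: roughly 10\% of the first half and 90\% of the second half survive the pass.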
Another deviation is that we use a variant of MetaCost that skips the
last step of learning a classifier on the relabeled training data set, but
directly minimizes the expected risk on the test data.
It has been observed that this variant performs at least as well
as MetaCost, in terms of cost minimization.
(This variant has been called {\em BagCost} by Margineantu \cite{MargThesis}.)
Also, in our implementation of AvgCost, we perform weighted sampling
multiple times to obtain an ensemble of hypotheses, then output
their average as the final hypothesis. (In the original implementation,
only a single iteration of weighted sampling was performed.)
We note that, due to our
normalization assumption that the minimum cost for each instance
$x$ is always zero, our version of AvgCost is identical to a more
sophisticated variant in which the difference between the average
cost and the minimum cost is used for sampling weights.
Our experience shows that this variant of AvgCost performs better
than the original method.
The methods were applied to five benchmark data sets available from
the UCI machine learning repository \cite{uci} and one data set
from the UCI KDD archive \cite{kdd}.
These data sets were selected on the criteria of being multi-class
problems with approximately 1000 examples or more.
A summary of these data sets is given in Table~\ref{table:datasets}.
Here {\em class ratio} is defined as the class frequency of the least
frequent class divided by that of the most frequent one.
We note that the KDD-99 data set is considerably larger than the others.
We used the so-called 10$\%$ training data set, which consists of roughly
500 thousand instances, and sampled it down further by randomly selecting
40$\%$ of them, yielding the data set of size 197,710 that we
used in our experiments. We emphasize that we used only data from
the original training set,
and not from the test set. We did this because of the idiosyncratic
property of this data set that the test data are generated from
a considerably different data distribution. While this property is
both realistic and interesting for empirical evaluation of a method
for intrusion detection, we judged it not to be desirable for the
current purpose of evaluating general
purpose cost-sensitive classification algorithms.
Except for the KDD-99 data set, these data sets do not have standard
misclassification costs associated with them.
For this reason, we follow Domingos and generate cost matrices according to a model that gives higher
costs for misclassifying a rare class as a frequent one,
and lower costs for the reverse. (Note therefore that our experiments
do not exploit the full generality of the instance-dependent cost
formulation presented in Section~\ref{sec:prelim}.) This reflects
a situation that is found in many practical data mining applications,
including direct marketing and fraud detection,
where the rare classes are the most valuable to identify correctly.
\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|}\hline
Data Set & \# of examples & \# of classes & Class ratio \\ \hline
Annealing & 898 & 5 & 0.01316 \\
KDD-99 & 197710 & 5 & 0.0001278\\
Letter & 20000 & 26 & 0.9028 \\
Satellite & 6435 & 6 & 0.4083\\
Solar flare & 1389 & 7 & 0.002562 \\
Splice & 3190 & 3 & 0.4634 \\ \hline
\end{tabular}
\end{center}
\caption{Data set characteristics: data size, number of classes, and the
class ratio, i.e. the frequency of the least common class divided by that
of the most common.}
\label{table:datasets}
\end{table}
Our cost model is as follows:
\iffalse
Let $C(y_1,y_2)$ be the cost of predicting that an example belongs
to class $y_1$ when its actual label is $y_2$.
\fi
Let $\hat{P}(y_1)$ and $\hat{P}(y_2)$ be the empirical probabilities
of occurrence of classes $y_1$ and $y_2$ in the training data.
We choose the non-diagonal entries of the cost matrix
$C(y_1,y_2)$, $y_1 \neq y_2$ with uniform probability from
the interval $[0,2000 \hat{P}(y_1)/\hat{P}(y_2)]$.
In \cite{Domingos}, the diagonal entries were then chosen
from the interval $[0,1000]$,
which often leads to cost matrices in which the correct label is
not the least costly one. Besides being unreasonable
(see Elkan \cite{El01}), these cost matrices can give
an unfair advantage to cost-sensitive methods over cost-insensitive ones.
We therefore set the diagonal entries to be identically zero,
which is consistent with our normalization assumption.
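The resulting cost model can be sketched as follows (a minimal sketch; the class frequencies below are hypothetical, not those of any benchmark data set, and $C[y_1][y_2]$ follows the convention that $y_1$ is the predicted class and $y_2$ the true class):

```python
import random

def make_cost_matrix(freqs, rng=random, scale=2000.0):
    """C[y1][y2], the cost of predicting y1 when the true class is y2, is
    drawn uniformly from [0, scale * P(y1) / P(y2)] off the diagonal, so
    mistaking a rare true class for a frequent one tends to be costly;
    diagonal entries (correct predictions) are identically zero."""
    k = len(freqs)
    return [[0.0 if y1 == y2 else rng.uniform(0.0, scale * freqs[y1] / freqs[y2])
             for y2 in range(k)] for y1 in range(k)]

random.seed(0)
freqs = [0.90, 0.08, 0.02]   # hypothetical class frequencies
C = make_cost_matrix(freqs)
```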
\begin{table*}[t]
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}\hline
Data Set & Bagging & AvgCost & MetaCost & IW & DSE & GBSE \\ \hline
Annealing & $1059 \pm 174$ & $127.4 \pm 12.2$ & $206.8 \pm 42.8$ & $67.38 \pm 9.22$ & $127.1 \pm 14.9$ & \boldmath$33.72 \pm 4.29$ \\
Solar & $5403 \pm 397$ & $237.8 \pm 37.5$ & $5317 \pm 390$ & $174.2 \pm 32.7$ & $110.9 \pm 28.7$ & \boldmath$48.17 \pm 9.52$ \\
KDD-99 & $319.4 \pm 42.2$ & $42.43 \pm 7.95$ & $49.39 \pm 9.34$ & $50.43 \pm 10.0$ & $46.68 \pm 10.16$ & \boldmath$1.69 \pm 0.78$ \\
letter & $151.0 \pm 2.58$ & $91.90 \pm 1.36$ & $129.6 \pm 2.44$ & $247.7 \pm 4.15$ & $114.0 \pm 1.43$ & \boldmath$84.63 \pm 2.44$ \\
Splice & $64.19 \pm 5.23$ & $60.78 \pm 3.65$ & \boldmath$49.95 \pm 3.05$ &
$67.26 \pm 4.18$ & $ 135.5 \pm 14$ & $57.50 \pm 4.38$ \\
Satellite & $189.9 \pm 9.57$ & $107.8 \pm 5.95$ & $104.4 \pm 6.43$ & $140.1 \pm 18.2$ & $116.8 \pm 6.28$ & \boldmath$93.05 \pm 5.57$ \\ \hline
\end{tabular}
\end{center}
\caption{Experimental results: the average cost and standard error.}
\label{table:results}
\end{table*}
\begin{table*}[t]
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}\hline
Data Set & Bagging & AvgCost & MetaCost & IW & DSE & GBSE \\ \hline
Annealing & $11991 \pm 13.1$ & $1002.8 \pm 183$ & $11987 \pm 9.84$ & $ - $ & $3795.5 \pm 688$ & $1260.2 \pm 224$ \\
Solar & $18499 \pm 20.4$ & $334.80 \pm 37.5$ & $18510 \pm 14.4$ & $ - $ & $2112.8 \pm 276$ & $486.45 \pm 53.3$ \\
KDD-99 & $395310 \pm 143$ & $2551.9 \pm 428.6$ & $395580 \pm 143$ & $ - $ & $12512 \pm 2450$ & $4181 \pm 783.6$ \\
letter & $40037 \pm 44.3$ & $159720 \pm 2028$ & $40052 \pm 41$ & $ - $ & $479130 \pm 2710$ & $363001 \pm 5557$ \\
Splice & $42515 \pm 26.6$ & $33658 \pm 1697$ & $42501 \pm 21$ & $ - $ & $52123 \pm 592$ & $50284 \pm 3659$ \\
Satellite & $86136 \pm 123$ & $60876 \pm 1641$ & $85984 \pm 127$ & $ - $ & $218870 \pm 6516$ & $140810 \pm 3335$ \\ \hline
\end{tabular}
\end{center}
\caption{Experimental results: the average data size
used by each method in 30 iterations, and standard error.}
\label{table:datasizes}
\end{table*}
In all experiments, we randomly select two thirds of the examples in
the data set for training and use the remaining one third for testing.
Also, for each training/test split we generate a
different cost matrix according to the rules above.
Thus, the standard errors that we report
reflect both variation in the data and in the misclassification costs.
We remark on certain implementation details of the proposed
learning methods in our experimentation.
First, we note that in all of the methods used for comparison,
except IW, C4.5 was used as the component algorithm with
weighted rejection sampling, and the final hypothesis is expressed
essentially as an ensemble of output decision trees of C4.5.
IW, as a meta-method, does not use ensembles; as its component
algorithm we instead used the ensemble method {\em costing} \cite{ZLA03}
applied to C4.5. Its output hypothesis is therefore also an
ensemble of decision trees.
DSE, as stated in its definition in Figure~\ref{fig:methodb},
is not an ensemble method, but analogously to AvgCost, we performed
multiple iterations of weighted sampling according to the weighting
scheme of DSE and averaged the resulting hypotheses to define the final
hypothesis. Finally, the choice of the mixture weight $\alpha_t$ was
set at $1/t$ for all methods.
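One reason $\alpha_t = 1/t$ is a natural choice is that it makes the ensemble $H_t = (1 - \alpha_t) H_{t-1} + \alpha_t h_t$ the uniform average of $h_1, \ldots, h_t$; a quick numerical check of this identity (with scalars standing in for the hypotheses' outputs on a fixed $(x,y)$):

```python
# With alpha_t = 1/t, the update H_t = (1 - 1/t) H_{t-1} + (1/t) h_t
# unrolls to the plain average (h_1 + ... + h_t) / t.
h = [0.3, 0.9, 0.1, 0.7, 0.5]     # hypothesis outputs on some fixed (x, y)
H = h[0]                          # alpha_1 = 1 makes H_1 = h_1
for t in range(2, len(h) + 1):
    alpha = 1.0 / t
    H = (1 - alpha) * H + alpha * h[t - 1]
assert abs(H - sum(h) / len(h)) < 1e-12
```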
The results of these experiments are summarized in
Table~\ref{table:results} and Table~\ref{table:datasizes}.
Table~\ref{table:results} lists the average costs attained by
each of these methods on the 6 data sets, and their standard errors.
These results were obtained by averaging over 20 runs, each run
consisting of 30 iterations of the respective learning method.
These results appear quite convincing:
GBSE outperforms all comparison methods on all data sets,
except on Splice, for which it ranks second after MetaCost.
Also, GBSE is the best performing among the proposed methods in
the paper, confirming our claim
that the combination of various techniques involved is
indeed necessary to attain this level of performance.
Table~\ref{table:datasizes} lists the average total data size used by
each of the methods in 30 iterations. The data size for IW is not
listed, since IW performs 30 iterations of 10 rounds of costing each,
and a direct comparison of total data size is therefore less meaningful
for this method.
Examining these results in conjunction with the data characteristics
in Table~\ref{table:datasets} reveals a definite trend.
First, note that the data sets are divided into two groups:
those having very large skews, or very low class ratios
(Annealing, KDD-99 and Solar flare), and
those having moderate skews (Satellite, Splice and Letter).
It is evident that
the methods based on example weighting (AvgCost, GBSE, DSE) use
data sizes that are orders of magnitude smaller for the 3 data sets in the first
group (i.e. those with large skews), as compared to the other methods (Bagging and MetaCost).
The performance of GBSE is especially impressive on this group,
achieving much lower cost while requiring very small data sizes.
It is worth mentioning that it is precisely these heavily skewed data
sets in the first group that require {\em cost-sensitive} learning the most.
\section{Discussion}
This is not the first time that the issue of incorporating
cost-sensitivity into boosting has been addressed. For example, AdaCost
\cite{Adacost} suggested a way of modifying AdaBoost's exponential
loss using a function (called a {\em cost adjustment function}) of the
cost and confidence. How to choose this cost adjustment function in a
principled way, however, appears not to be well understood. The stochastic
ensemble that we employ in the present paper provides a
straightforward but reasonable way of incorporating cost and
confidence, i.e. in terms of {\em expected cost}. An interesting future
direction is to investigate the relationship between these alternative
approaches to cost-sensitive boosting. Also note that AdaCost, being
a modification of AdaBoost, is restricted to two-class problems.
Comparing and studying possible relationships between GBSE and
other (both cost-sensitive and insensitive) multi-class extensions of
boosting, such as AdaBoost.M2 \cite{Adaboost}, is
another interesting topic.
Finally, GBSE can also be viewed as a reduction from multi-class
classification to binary classification. Comparison with existing
methods for such reductions (e.g.\ \cite{AlScSi00}) is another
important research issue.
\section{Acknowledgments}
We thank Saharon Rosset of IBM Research
for fruitful discussions on related topics.
\newpage
\bibliographystyle{abbrv}
\bibliography{kdd04multi}
\balancecolumns
\end{document}