\documentclass[english]{article}
\usepackage{proceed2e}
\usepackage{float}
\usepackage{graphicx}
%\makeatletter
\floatstyle{ruled}
\newfloat{algorithm}{tbp}{loa}
\floatname{algorithm}{Algorithm}
\newtheorem{thm}{Theorem}
\newtheorem{cor}[thm]{Corollary}
\newenvironment{theorem}{\begin{thm}\begin{em}}%
{\end{em}\end{thm}}
\newcommand{\junk}[1]{}
% deluxe proof environment
\def\qsym{\vrule width0.6ex height1em depth0ex}
\newcount\proofqeded
\newcount\proofended
\def\qed{{\hspace{2pt}\rule[-1pt]{3pt}{9pt}}
\end{rm}\addtolength{\parskip}{-0pt}
\setlength{\parindent}{0pt}
\global\advance\proofqeded by 1 }
\newenvironment{proof}%
{\proofstart}%
{\ifnum\proofqeded=\proofended\qed\fi \global\advance\proofended by 1
\medskip}
\makeatletter
\def\proofstart{\@ifnextchar[{\@oprf}{\@nprf}}
\def\@oprf[#1]{\begin{rm}\protect\vspace{0pt}\noindent{\bf Proof of #1:\
}%
\addtolength{\parskip}{0pt}\setlength{\parindent}{0pt}}
\def\@nprf{\begin{rm}\protect\vspace{0pt}\noindent{\bf Proof:\ }%
\addtolength{\parskip}{0pt}\setlength{\parindent}{0pt}}
\makeatother
\usepackage{babel}
\title{Estimating Class Membership Probabilities using Classifier Learners}
\author{ {\bf John Langford} \\
Toyota Technological Institute at Chicago\\
Chicago, IL 60637 \\
\texttt{jl@tti-c.org}\\
\And
{\bf Bianca Zadrozny} \\
IBM T.J. Watson Research Center \\
Yorktown Heights, NY 10598\\
\texttt{zadrozny@us.ibm.com} \\
}
\renewcommand{\baselinestretch}{0.95}
\begin{document}
\maketitle
\begin{abstract}
We present an algorithm, ``Probing'', which reduces learning an
estimator of class membership probabilities to learning binary
classifiers. The reduction comes with a theoretical guarantee: a
small error rate for binary classification implies accurate estimation
of class membership probabilities. We tested our reduction on several
datasets with several classifier learning algorithms. The results show
strong performance as compared to other common methods for obtaining
class membership probability estimates from classifiers.
\end{abstract}
\section{Introduction}
{\bf Background.} Classifier learning is the problem of inducing a
predictor for distinguishing between two (or a few) classes given a
set of labeled training examples. There are many reasons for
considering classifier learning as the most fundamental learning task.
Predicting one bit (or a few bits) is as simple as non-trivial
prediction can be. Because of this simplicity, classifier learning is
often more amenable to study than other learning problems. On the
practical side, several fast learning algorithms (such as SVM
\cite{SVM}, boosting \cite{boosting} on decision trees \cite{DT}, and
neural networks \cite{nn}) exist which generally lead to classifiers
with excellent predictive performance.

{\bf The Problem.} On the other hand, there are several real-world
applications requiring probabilistic answers instead of discrete class
labels. For example, in medical domains, doctors often prefer to make
decisions based upon probabilistic predictions of the patient state,
rather than being given a discrete chosen action for each patient.
Even when the actual probability estimates are not required, a ranking
of examples from the most likely member to the least likely member of
a class may be desirable. This is the case, for example, in document
categorization, where users might like to see a list of documents
ranked by their relevance to a particular topic.
Probability estimates are also important when the classification
outputs are not used in isolation but are combined with information
from other components in a system. For example, in handwritten
character recognition the outputs from the classifier are used as
input to a high-level system which incorporates domain information,
such as a language model. In this case, probabilities provide a
standard language for summarizing information from multiple sources
into a single decision. When the components of the system are
distributed, bandwidth constraints may make this summarization
necessary as well as standard.

{\bf Solution Sketch.} Our method for answering the needs of these
real-world applications can be thought of as a reduction of class
probability estimation to classifier learning. The reduction (which we
name ``Probing'') satisfies strong optimality guarantees: good
performance with respect to classification implies good performance
with respect to class probability estimation. The essential
observation underlying the Probing reduction is that probability
estimates (in a Bayesian sense) can be extracted from preferences over
bets at different odds ratios. The machinery that applies this
observation to classifier learning algorithms is the Probing
reduction.
\section{Related Work}
There are several existing approaches for obtaining class probability
estimates from classifiers.

{\bf Bagging.} One common approach uses bagging
\cite{bag_prob}. Essentially, the learning algorithm is run on many
training sets formed by sampling with replacement from the original
dataset, and the estimated probability that any sample $x$ has label
$y=1$ is the proportion of learned classifiers that output $1$ for
$x$. The results of the bagging approach may have nothing to do with
$\Pr_{(x,y)\sim D}(y=1|x)$, the probability according to the data
generation process that $y=1$. As has been observed before, the
probability estimates obtained through bagging are measuring the
instability of the classifier learning method over data perturbations
\cite{dragos} rather than $\Pr_{(x,y)\sim D}(y=1|x).$ Indeed, for
larger datasets, bagging with certain learning algorithms might result
in the same classifier on every run, causing the bagging probability
to be always $0$ or $1$ even when the fundamental process is much less
certain. In contrast, our Probing reduction has a simple
guarantee---if the classifiers solve their associated classification
problems optimally, then the reported probability is precisely
$\Pr_{(x,y)\sim D}(y=1|x)$. In addition, unlike bagging, Probing
preserves independence in the original dataset%
\footnote{Resampling with replacement provides samples from the correct distribution
in a \emph{dependent} manner. %
}, which is quite important for many learning algorithms (see \cite{costing}
for further discussion).

{\bf Margin Sigmoid.} Another approach for extracting probabilities
from classifiers is to take an internal representation such as the
margin of a support vector machine and transform it into a value
$v\in[0,1]$ to achieve calibration on the training set (for example,
using a sigmoid function \cite{Platt-prob}). The Probing reduction is
more general (since it applies to classifiers without an internal
margin such as decision trees) and less ad-hoc. In particular, in the
Probing reduction a small classification error rate necessarily
implies good probability estimates over all distributions generating
the data. No guarantee of this kind can be made for any
transformation of a margin into the $[0,1]$ interval, essentially
because the margin is not a ``sufficient statistic'' for probability
estimation. See also \cite{bartlett} for a discussion of sparsity
versus probability estimation issues.

{\bf Direct Prediction.} The third standard approach is to simply
predict probabilities directly, typically with some form of learned
probabilistic model. Our results can be viewed from this approach
as (1) providing compatibility between non-probabilistic and
probabilistic models and (2) enriching the set of probabilistic models
with new and often better performing models.
We discovered that a reduction similar to Probing had been published
previously by Halck \cite{Halck}. The main difference is that Probing
reweights the classes appropriately to obtain preferences over bets at
different odds ratios (see next section), while Halck's reduction
changes the labels of examples randomly for the same purpose. We
provide a theoretical performance guarantee and present a
comprehensive experimental evaluation of Probing. Halck's method may
be analyzable and may have good empirical performance, but the theory
and evaluation have not been done yet.
\section{The Probing Reduction}
\label{sec:probing-reduction}
Assume we have examples $(x,y)$ drawn from an unknown distribution $D$ with
domain $\mathcal{X}\times \mathcal{Y}$ where $\mathcal{X}$ is the feature space and
$\mathcal{Y}=\{0,1\}$ is the label space.

{\bf Basic Observations.} The Probing reduction is based upon the
observation that a binary classifier which predicts $1$ for an example
$x$ is implicitly predicting that the probability of class 1 given $x$
is greater than the probability of class 0 given $x$, that is, $D(y=1|x) > D(y=0|x)$.
Conversely, a classifier which predicts 0 for $x$ suggests that
$D(y=0|x) > D(y=1|x)$.
By weighting the training examples such that positive examples are $w$
times more costly to classify incorrectly than negative examples, it
is possible to learn classifiers which predict 1 if
\begin{small}
$$D(y=1|x)>\frac{1}{1+w}$$
\end{small}
and 0 otherwise.
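As a quick numeric sanity check of this observation (a hypothetical sketch, not part of the reduction itself): with positives weighted $w$ times negatives, minimizing the expected weighted error means predicting 1 exactly when $D(y=1|x) > \frac{1}{1+w}$.

```python
# Hypothetical check of the weighted-threshold observation.  Predicting
# 0 on an example with q = D(y=1|x) incurs expected weighted cost w*q
# (missed positives), while predicting 1 incurs (1 - q) (false
# positives); the optimum flips exactly at q = 1/(1+w).

def weighted_bayes_prediction(q, w):
    cost_if_predict_0 = w * q    # each positive error costs w
    cost_if_predict_1 = 1 - q    # each negative error costs 1
    return int(cost_if_predict_1 < cost_if_predict_0)

w = 3.0                          # threshold 1/(1+w) = 0.25
assert weighted_bayes_prediction(0.30, w) == 1   # 0.30 > 0.25
assert weighted_bayes_prediction(0.20, w) == 0   # 0.20 < 0.25
```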
\begin{cor}(Minimal Error Probing)
Let $i_p(y) = \frac{1-p}{p} y + (1-y)$. Then
\begin{small}
$$\arg \min_{c : X \rightarrow Y} E_{(x,y)\sim D} ~i_p(y)~I(c(x)\neq y) $$
$$ = I\left( D(y=1|x) > p \right)$$
\end{small}
except on a set of measure $0$.
\label{thm:minimal-error-probing}
\end{cor}
This corollary only discusses \emph{minimal} error classifiers, so it
only applies in the asymptotic limit of a Bayes consistent classifier.
However, the Probing algorithm can be proved sound even for
classifiers with nonminimal error rates. We prove this more general
theorem in section \ref{sec:Analysis-of-the} (theorem
\ref{thm:error-transformation}). The corollary states that a
classifier which minimizes an importance weighted loss can be used to
determine if the conditional distribution $D(y=1|x)$ is above or below
a threshold for a given value of $x$, where the threshold depends on
the relative weight given to positive and negative examples.

{\bf The Algorithm.} Now, suppose that we learn a classifier $c_p$ for
every choice of $p$ and each of these classifiers has a minimal
loss. Then, for an input $x$, each of the classifiers answers whether
$D(y=1|x)$ is greater than the threshold $p$. For $p = 0$, we
necessarily have $c_p(x)=1$ because $D(y=1|x)$ cannot be less than 0.
As $p$ grows, the predictions of the subsequent classifiers are
monotone in $p$ with one point $p^* = D(y=1|x)$ where the prediction
changes from $1$ to $0$. Therefore, the Probing reduction reports:
\begin{equation}
D(y=1|x)=p^*.
\label{eq:probing}
\end{equation}
Three issues remain:
\begin{enumerate}
\item Not all classifier learners automatically take the weight $w$ as
input. We address this by using the costing reduction \cite{costing}
from importance weighted classification to binary classification,
which applies to any classifier learner.
\item Learning a classifier for every choice of $w$ is too
computationally intense. We resolve this by discretizing the set of
$w$ and reporting a probability in the discrete interval.
\item Independently learned sub-optimal classifiers may not be
consistent, meaning that the sequence of predictions for a given
example is not necessarily monotonic. We simply sort the results of
the classifiers to make the sequence monotonic (all zeros before all
ones). This gives us the solution consistent with the largest
possible number of classifiers. Although this may appear to be a
hack, the proof of theorem \ref{thm:error-transformation}
fundamentally relies upon the sort.
\end{enumerate}
\begin{algorithm}
\caption{\label{alg:Probing-Learner}\textbf{\underline{Probing-Learner} (classifier learner $A$,
dataset $S=(x,y)^{m}$, number of iterations $t$)}}
\begin{enumerate}
\item Let $I = \{[0,1]\}$
\item For $i = 1$ to $t$
\begin{enumerate}
\item Let $[a,b] = $ minimax interval $\in I$
\item Let $p$ = minimax optimal $p \in [a,b]$
\item Let $I = (I - \{[a,b]\}) \cup \{[a,p],[p,b]\}$
\item Let $S_{p}=\{(x,y,i_p(y)):(x,y)\in S\}$
\item Let $c_{p}=A(S_{p})$
\end{enumerate}
\item Return all $c_{p}$\end{enumerate}
\end{algorithm}
\begin{algorithm}
\caption{\label{alg:Probing-Predictor}\textbf{\underline{Probing-Predictor}
(set of weighted classifiers $c_{p}$, input $x$)}}
\begin{enumerate}
\item Let $n=\sum_{p} c_{p}(x)$
\item Let $[a,b]$ = the $n$th interval
\item Return minimax optimal $p \in [a,b]$.\end{enumerate}
\end{algorithm}

{\bf Learning.} The algorithms for learning and making predictions
using the Probing reduction are given in Algorithms
\ref{alg:Probing-Learner} and
\ref{alg:Probing-Predictor}. The Probing-Learner algorithm takes as
input a classifier learning algorithm, a set of training examples and
some number of iterations. On each iteration, it refines the
discretization of the interval with the largest potential gain from
refinement (this is loss dependent, see below for details). For each
chosen interval a minimax choice of probability and the
corresponding weight (obtained by inverting the threshold relation
$p=\frac{1}{1+w}$) are calculated and a weighted classifier is learned.

{\bf Prediction.} The Probing-Predictor algorithm takes as input the
set of weighted classifiers and an input $x$. It sums the predictions
of each classifier on $x$ to determine the number of classifiers that
predict 1 for $x$. It then picks the minimax probability in the $n$th
interval (counting from $0$). This is equivalent to sorting the
classifier predictions to obtain monotonicity and reporting a probability
from the interval in which the prediction changes from 1 to 0.
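As an illustrative sketch (hypothetical Python, not the authors' implementation), the two procedures can be written down directly, here instantiated with the squared-error choices described under Loss Details below: the minimax probability is the interval midpoint and the minimax interval is the widest one (the $|S_a|$ factor is dropped for brevity). The name \texttt{learn} stands in for an arbitrary classifier learner accepting importance-weighted examples.

```python
# Sketch of Probing-Learner / Probing-Predictor, squared-error variant:
# minimax probability = interval midpoint, minimax interval = widest
# interval (the |S_a| factor is dropped for brevity).

def probing_learner(learn, S, t):
    intervals, classifiers = [(0.0, 1.0)], {}
    for _ in range(t):
        a, b = max(intervals, key=lambda iv: iv[1] - iv[0])  # widest interval
        p = (a + b) / 2.0                                    # minimax optimal p
        intervals.remove((a, b))
        intervals += [(a, p), (p, b)]
        # importance i_p(y): positives weighted (1-p)/p, negatives weighted 1
        S_p = [(x, y, (1 - p) / p if y == 1 else 1.0) for (x, y) in S]
        classifiers[p] = learn(S_p)
    return classifiers, sorted(intervals)

def probing_predictor(classifiers, intervals, x):
    n = sum(c(x) for c in classifiers.values())  # classifiers predicting 1
    a, b = intervals[n]          # n-th interval, counting from 0
    return (a + b) / 2.0         # minimax probability in that interval

# Toy check: every x has D(y=1|x) = q; an "oracle" learner recovers the
# threshold p from the positive importance weight and returns the
# Bayes-optimal weighted classifier.
q = 0.7
def oracle_learner(S_p):
    i_pos = next(i for (_, y, i) in S_p if y == 1)
    p = 1.0 / (1.0 + i_pos)      # invert i_pos = (1-p)/p
    return lambda x: int(q > p)

S = [(0, 1), (1, 0)]             # example contents are irrelevant to the oracle
cs, ivs = probing_learner(oracle_learner, S, 7)
assert abs(probing_predictor(cs, ivs, 0) - q) <= 1 / 16
```

With seven oracle classifiers the prediction is already localized to a width-$1/8$ interval around the true probability.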

{\bf Loss Details.} Steps 2(a) and 2(b) in Probing-Learner are dependent
on the probabilistic loss function of interest. Here we describe these
steps for two of the most commonly used probabilistic loss functions: cross
entropy and squared error. In order to determine the
minimax interval and the minimax probability for these loss
functions, we make the somewhat unjustified but convenient assumption that all
classifiers are correct.
\begin{enumerate}
\item Cross entropy is given by
$$
E_{(x,y)\sim D}I(y=1)\ln\frac{1}{\hat{p}(x)}+I(y=0)\ln\frac{1}{1-\hat{p}(x)}.
$$
The minimax optimal $\hat{p} \in [a,b]$ is given by:
$$
\begin{array}{l}
\min_{\hat{p}\in [a,b]} \max_{p\in[a,b]}[\textrm{loss on } \hat{p} - \textrm{loss on } p] \\
= \min_{\hat{p}\in [a,b]} \max_{p\in[a,b]} \left[p \ln \frac{1}{\hat{p}} + (1-p) \ln \frac{1}{1-\hat{p}}\right. \\
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\left.- p \ln \frac{1}{p} - (1 - p) \ln \frac{1}{1-p}\right] \\
\end{array}
$$
This implies that
$$\hat{p} = \frac{1}{1+e^{\frac{H(b)-H(a)}{b-a}}},$$
where $H(a)$ is Shannon entropy (in nats) of a coin with bias $a$.
The minimax interval is the interval which creates the largest
potential change in loss when refined on the training set. Suppose we
let $S_a$ be the set of training examples whose current probabilistic
prediction is $\hat{p} \in [a,b]$. The chosen \footnote{Note that the
definition of $\hat{p}$ implies this is equivalent to $\arg
\max_{[a,b]\in I} |S_a| \left[(1-b)\ln \frac{1-b}{1 -
\hat{p}} + b \ln \frac{b}{\hat{p}} \right]$} interval is:
$$ \arg \max_{[a,b]\in I} |S_a| \left[(1-a)\ln
\frac{1-a}{1 - \hat{p}} + a \ln \frac{a}{\hat{p}} \right]$$
\item Mean squared error is given by
$$
E_{(x,y)\sim D}[(y-\hat{p}(x))^2].
$$
The minimax optimal $\hat{p} \in [a,b]$ is given by:
$$
\begin{array}{l}
\min_{\hat{p}\in [a,b]} \max_{p\in[a,b]}[\textrm{loss on } \hat{p} - \textrm{loss on } p] \\
= \min_{\hat{p}\in [a,b]}\max_{p\in[a,b]} \left[ p (1-\hat{p})^2 +(1-p) \hat{p}^2 \right.\\
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\left. - p (1-p)^2 - (1-p) p^2 \right] \\
\end{array}
$$
This implies that
$$\hat{p} = (a+b)/2.$$
The minimax interval is similarly given by
$$ \arg \max_{[a,b]\in I} |S_a| (b-a).$$
\end{enumerate}
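The two closed forms above can be sketched in a few lines of hypothetical Python; the function names are illustrative, and the cross-entropy formula is checked by verifying that it equalizes the regret $KL(a\|\hat{p}) = KL(b\|\hat{p})$ at the two endpoints of the interval.

```python
import math

# Minimax-optimal point estimates on an interval [a, b], following the
# closed forms derived above: the midpoint for squared error, and the
# entropy-based expression for cross entropy.

def entropy_nats(a):
    """Shannon entropy (in nats) of a coin with bias a."""
    if a in (0.0, 1.0):
        return 0.0
    return -a * math.log(a) - (1 - a) * math.log(1 - a)

def minimax_p_squared_error(a, b):
    return (a + b) / 2.0

def minimax_p_cross_entropy(a, b):
    return 1.0 / (1.0 + math.exp((entropy_nats(b) - entropy_nats(a)) / (b - a)))

def kl_nats(p, q):
    """KL divergence (in nats) between coins of bias p and q."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# The cross-entropy choice equalizes the regret at the endpoints.
p_hat = minimax_p_cross_entropy(0.2, 0.6)
assert abs(kl_nats(0.2, p_hat) - kl_nats(0.6, p_hat)) < 1e-9
```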
In a transductive setting, weighting the minimax interval by the test
set distribution is better than using $|S_a|$.

{\bf Multiclass Case.} For the multiclass case with $k > 2$ labels, we
simply apply the binary method to $k-1$ binary problems defined by
the mapping:
$$ (x,y) \rightarrow (x,I (l = y))$$ for $k-1$ choices of $l$. This
method gives us probability estimates $\hat{p}_1,...,\hat{p}_{k-1}$
for $k-1$ classes, so we can estimate the probability of the last
class according to $\hat{p}_k = 1 - \sum_{i=1}^{k-1} \hat{p}_{i}$.
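The multiclass mapping can be sketched as follows (hypothetical helper names; the binary probability estimates would come from running binary Probing on each relabeled dataset):

```python
# Sketch of the multiclass reduction: k-1 one-vs-rest binary problems,
# with the last class taking the leftover probability mass.

def multiclass_datasets(S, labels):
    """S: list of (x, y) pairs; labels: the k class labels.
    Builds one binary dataset (x, I(y == l)) per label except the last."""
    return {l: [(x, int(y == l)) for (x, y) in S] for l in labels[:-1]}

def combine_estimates(p_hats):
    """Append the estimate for the last class: 1 minus the rest."""
    return p_hats + [1.0 - sum(p_hats)]

S = [("a", 0), ("b", 1), ("c", 2)]
datasets = multiclass_datasets(S, [0, 1, 2])
assert datasets[1] == [("a", 0), ("b", 1), ("c", 0)]   # one-vs-rest relabeling
assert combine_estimates([0.5, 0.25]) == [0.5, 0.25, 0.25]
```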
\section{Analysis of the Probing Reduction}
\label{sec:Analysis-of-the}
The following theorem shows that the probing reduction is inherently
robust against classification errors. For the theorem, remember
$i_p(y) = \frac{1-p}{p}y + (1-y)$ and let the importance weighted loss
of a classifier $c$ under distribution $D$ be $$l_p(c) =
\frac{1}{Z_p}E_{(x,y)\sim D} i_p(y) I(c(x) \neq y)$$ where $Z_p =
E_{(x,y)\sim D}~i_p(y)$ is a normalization constant.
This theorem analyzes probing in the limit as the number of
classifiers approaches $\infty$ uniformly over the unit interval.
Discretization to intervals of size $d$ adds at most $d + d^2/4$ to
the expected squared error.
\begin{theorem}
(Probing Error Transformation) If the classifiers $c_p$ learned under
Probing have average relative importance weighted loss
$$
\epsilon = E_{p \sim U(0,1)} \left[ l_p(c_p) - \min_c l_p (c) \right],
$$
then
$$
E_{x \sim D}(D(1|x)-\hat{p}(x))^2 \leq 2 \max\{D(1),D(0)\} \epsilon \leq 2 \epsilon
$$
where $\hat{p}(x)$ is the probing prediction.
\label{thm:error-transformation}
\end{theorem}
This is a strong theorem in the sense that it ties the error in the
probability predictions to the \emph{average} \emph{relative}
importance weighted loss of the classifiers. Using the average loss
over $w$ results in a more powerful statement than using the maximal
loss, because it is easier to obtain a set of classifiers which have
small average loss than to obtain a set of classifiers, all with a
small loss. Using a loss that is relative to the loss of the Bayes
optimal classifier means that the theorem applies even when the
fundamental noise rate is large.

Note that when the proportion of positive and negative examples is
balanced ($D(1)=D(0)=0.5$), the bound on the error in the probability
predictions is exactly the average relative importance weighted loss:
$$E_{x \sim D}(D(1|x)-\hat{p}(x))^2\leq\epsilon.$$
\begin{proof}
Let $c_p^* = \arg\min_c l_p (c)$, then
\begin{small}
$$\begin{array}{l}
\epsilon = E_{p \sim U(0,1)} \left[ l_p(c_p) - l_p (c_p^*) \right] \\
= E_p \left[\frac{1}{Z_p}E_{(x,y)\sim D} [i_p(y) (I(c_p(x) \neq y) - I(c_p^*(x) \neq y))]\right] \\
= E_p \left[\frac{1}{Z_p}E_x [D(1|x) \frac{1-p}{p} (c_p^*(x) - c_p(x))\right. \\
~~~~~~~~~~~~~~~~~~~~~\left.+ D(0|x) (c_p(x) - c_p^*(x))]\right] \\
= E_p \left[\frac{1}{Z_p}E_x [D(1|x) \frac{1-p}{p} (c_p^*(x) - c_p(x)) \right.\\
~~~~~~~~~~~~~~~~~~~~~\left.+ (1 - D(1|x)) (c_p(x) - c_p^*(x))]\right]\\
= E_p \left[\frac{1}{Z_p}E_x [(c_p^*(x) - c_p(x))(\frac{1}{p}D(1|x) - 1)]\right].\\
\end{array}$$
\end{small}
Rewriting with $D(1|x) > p \Rightarrow c_p^*(x) = 1$ and
$D(1|x) < p \Rightarrow c_p^*(x) = 0$, we have
\begin{small}
$$\epsilon = E_p \left[\frac{1}{p Z_p} E_x [I(c_p^*(x) \neq c_p(x)) \left| D(1|x) - p \right|]\right].$$
\end{small}
Notice that for all $p$, $Z_p = D(0) + D(1)\frac{1-p}{p}$. Hence
$$p Z_p = p D(0) + (1-p) D(1) \leq \max\{D(1),D(0)\}.$$
Therefore, we have
$$E_p E_x [I(c_p^*(x) \neq c_p(x)) |D(1|x) - p|] \leq \max\{D(1),D(0)\}\epsilon, $$
which can be re-written as
\begin{equation}
E_{x\sim D} \int_0^1 I(c_p^*(x) \neq c_p(x)) |D(1|x) - p|\,dp \leq \max\{D(1),D(0)\}\epsilon.
\label{eq:integral}
\end{equation}
Given a fixed budget of $\epsilon$ for binary errors, the scenario
which maximizes probability error estimates has all classifiers erring
in one direction (either for $p < D(1|x)$ or $p > D(1|x)$) because two
errors in different directions are canceled in the sorting phase of
Probing-Predictor. Furthermore, errors by classifiers nearer to
$D(1|x)$ are preferred to errors far from $D(1|x)$, since the
importance weighted loss paid by the adversary increases
monotonically with distance from $D(1|x)$. Therefore, when the
Probing prediction is $\hat{p}(x)$, all the classifiers $c_p$ for
which the corresponding $p$ are between $\hat{p}(x)$ and the correct
probability $D(1|x)$ give incorrect predictions whereas the other
classifiers give correct predictions. As a result, the integrand of
equation \ref{eq:integral} is non-zero only for $p$ between
$\hat{p}(x)$ and $D(1|x)$.