\documentclass{sig-alternate}
%\documentclass{acm_proc_article-sp}
%\documentclass[11pt,english]{article}
%\usepackage[T1]{fontenc}
%\usepackage[latin1]{inputenc}
\usepackage{graphicx,psfig,enumerate}
\usepackage{amsmath,amssymb,theorem,float,crop,algorithmic,algorithm}
%\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
%\providecommand{\LyX}{L\kern-.1667em\lower.25em\hbox{Y}\kern-.125emX\@}
%% Bold symbol macro for standard LaTeX users
%\newcommand{\boldsymbol}[1]{\mbox{\boldmath $#1$}}
%\usepackage{babel}
%\makeatother
\newtheorem{theo}{Theorem}
\newtheorem{prop}{Proposition}
\newtheorem{lemm}{Lemma}
\newtheorem{corr}{Corollary}
\newtheorem{clm}{Claim}
\newcommand{\bucket}{{\rm Bucket}}
\newcommand{\bmax}{{\rm bmax}}
\newcommand{\myref}[1]{(\ref{#1})}
%\renewcommand{\baselinestretch}{0.95}
\begin{document}
\conferenceinfo{KDD'04,} {August 22--25, 2004, Seattle, Washington, USA.}
\CopyrightYear{2004}
\crdata{1-58113-888-1/04/0008}
\title{An Objective Evaluation Criterion for Clustering}
\numberofauthors{2}
\author{
\alignauthor Arindam Banerjee\\
\affaddr{Dept of ECE}\\
\affaddr{University of Texas at Austin}\\
\affaddr{Austin, TX, USA}\\
\email{abanerje@ece.utexas.edu}
\alignauthor John Langford\\
\affaddr{Toyota Technological Institute}\\
\affaddr{Chicago, IL, USA}\\
\email{jl@tti-c.org}
}
\maketitle
\vspace*{-15mm}
\begin{abstract}
We propose and test an objective criterion for evaluating
clustering performance: how well does a clustering algorithm, run on
unlabeled data, aid a classification algorithm? The accuracy is
quantified using the PAC-MDL bound~\cite{blla03} in a semisupervised
setting. Clustering algorithms which naturally separate the data
according to (hidden) labels with a small number of clusters perform
well. A simple extension of the argument leads to an objective model
selection method. Experimental results on text analysis datasets
demonstrate that this approach empirically results in very competitive
bounds on test set performance on natural datasets.
\end{abstract}
\vspace{1mm}
\noindent
{\bf Categories and Subject Descriptors:} I.5.3 [Pattern Recognition]: Clustering
\vspace{1mm}
\noindent
{\bf Keywords:} Clustering, Evaluation, PAC bounds, MDL
\section{Motivation}
Clustering is perhaps the most widely used exploratory data analysis
tool in data mining. Many clustering algorithms have been proposed in
the literature based on various requirements.
Each one optimizes some unsupervised internal quality measure such as
minimizing the intra-cluster distances, maximizing the inter-cluster
distances, maximizing the log-likelihood of a parametric mixture
model, minimizing the cut-value of a pairwise similarity graph,
etc.
%However, it is not clear how to objectively evaluate and compare
%performances of various algorithms and how to choose the best
%algorithm for a task.
%\par Almost all clustering algorithms optimize an internal quality
%measure to come up with a clustering.
The differences between these internal quality measures make it
practically impossible to objectively compare clustering
algorithms. Due to the lack of an objective measure, choosing the
``right'' algorithm for a particular task is confusing.
%When it is time to use a clustering algorithm, we quite often have yet
%another metric of optimality in mind. These many metrics have resulted
%in confusions.
%Should I use kmeans? Should I use EM on some mixture model? Should I
%use my favorite similarity measure to come up with a pairwise
%similarity graph and partition it?
% Should I invent my own algorithm
%because the thing that I want to optimize has not yet been exactly
%optimized?
\par
We propose a method which eliminates some of this confusion in some
settings. Clustering is often used as an intermediate exploratory data
analysis step for a prediction problem. Hence, the answer to {\em what
is a good clustering algorithm for my problem?} often
depends on how much prediction power the clustering step
provides, or, more directly, {\em what is the predictive accuracy of a
clustering algorithm?} In other words,
since clustering is often used as a method for gaining insight
into a dataset, our method here evaluates the degree to which clustering
aids the understanding of a dataset \emph{for the purpose of classification}.
On natural semi-supervised learning datasets, if the clustering ``agrees
well'' with a set of hidden labels using a small number of clusters,
we say that the clustering algorithm is good for prediction. The precise
definition of ``agrees well'' is mediated by the PAC-MDL bound~\cite{blla03}
on the accuracy of a test set.
%\par
Our proposed criterion has several natural properties:
\begin{enumerate}[1.]
\vspace*{-2mm}
\item It applies to every clustering algorithm.
\vspace*{-2mm}
\item It is inherently normalized to the interval $[0,1]$.
\vspace*{-2mm}
\item All possible values are exercised on natural datasets.
\vspace*{-2mm}
\item The metric can flexibly incorporate prior information by proper design of
a {\em description language}.
\vspace*{-2mm}
\item It can be used for model selection.
\vspace*{-2mm}
\item It is directly related to a concrete goal (good prediction).
\end{enumerate}
\vspace*{-1mm}
It is important to note that this is not the only possible objective
criterion for evaluation of clustering algorithms. Clustering
algorithms may be used for other purposes, and when so used, other
measures may be more appropriate. Our results here are only directly
applicable when the goal in clustering is related to prediction. \par
The rest of the paper is organized as follows. First, we explain the
PAC-MDL bound~\cite{blla03} in section~\ref{sec:pacmdl}. In
section~\ref{sec:cluster}, we discuss how the PAC-MDL bound can be
applied to a clustering setting for performance evaluation and model
selection. We present experimental results on benchmark text datasets
in section~\ref{sec:results} to demonstrate the proposed approach.
Related work is presented in section~\ref{sec:related}.
We end with a discussion on the proposed criterion in section~\ref{sec:discuss}.
\section{The PAC-MDL Bound}
\label{sec:pacmdl}
The PAC-MDL bound is the core mechanism used to trade off between a
sufficiently rich representation to capture the data and over-fitting
on the data. Clustering algorithms with a small bound must have a
small number of clusters which agree well with a set of
(hidden) labels.
Consider the following learning setting: Let $D$ be any distribution
over $(X,Y)$ where $X$ denotes the input and $Y$ denotes the label. We
assume $Y$ can take one of $\ell > 1$ possible values, i.e., $Y \in
\{1,\cdots,\ell\}$. Consider a train set $S = (X^m,Y^m)$ and a test
set $S' = (X^n,Y^n)$, where $(X^i,Y^i)$ denotes the set of $i$
independently drawn samples from the (unknown) joint distribution $D$
over $(X,Y)$. The PAC-MDL bound applies to a transductive learning
algorithm, $T: (X \times Y)^m \times X^n \mapsto \sigma$, where
$\sigma: X^{m+n} \mapsto \hat{Y}^{m+n}$ is a transductive classifier
which produces a simultaneous labeling $\hat{Y}^{m+n}$ for all of the
data points $X^{m+n}$. Given the description complexity of the
transductive classifier measured by its bit description length
$|\sigma|$, and the prediction error count on the train set,
$\hat{\sigma}_S = |\{Y^m \neq \hat{Y}^m\}|$, the bound limits the
error count on the test set $\hat{\sigma}_{S'} = |\{Y^n \neq
\hat{Y}^n\}|$.
The precise bound is defined in terms of a cumulative hypergeometric
distribution. To understand this distribution, imagine a bucket with
$m$ red balls and $n$ blue balls, from which $(a+b)$ balls are drawn
without replacement. Now, define $\bucket(m,n,a,b)$ to be the
probability that at least $b$ blue balls are drawn. That is,
\begin{eqnarray*}
\bucket(m,n,a,b) & = & \sum_{t=b}^{a+b} \frac{{n \choose t}{m \choose
a+b-t}}{{n+m \choose a+b}}.
\end{eqnarray*}
The bound is stated in terms of a ``worst case'' over the
value of $b$, defined as:
\begin{eqnarray*}
\bmax(m,n,a,\delta) & = & \max \{b: \bucket(m,n,a,b) \geq \delta\}
\end{eqnarray*}
Thus, for any $b > \bmax(m,n,a,\delta)$, if $(a+b)$ balls are drawn
out of the $(m+n)$ balls, the probability of getting at least $b$ blue
balls is less than $\delta$. In the PAC-MDL bound, $a$ plays the role
of $\hat{\sigma}_S$, the number of errors on the train set, and $b$
plays the role of $\hat{\sigma}_{S'}$, the number of errors on the
test set, which is exactly the number we wish to bound.
%The PAC-MDL bound can be stated as follows:
%\vspace*{-3mm}
\begin{theo}[PAC-MDL bound~\cite{blla03}]
For any distribution $D$, for any label description language $L = \{
\sigma \}$, with probability at least $(1-\delta)$ over the draw of
the train and test sets $S,S'\sim D^{m+n}$:
%\[
~~$\forall \sigma, \quad
\hat{\sigma}_{S'} \leq \bmax(m,n,\hat{\sigma}_S,2^{-|\sigma|}\delta)$
%\]
\label{thm:PAC-MDL}
\end{theo}
%\vspace*{-5mm}
Intuitively, the theorem says that if a transductive classifier with a
short description length achieves few errors on the train set, then
the number of errors on the test set is small with high probability.
For details on this theorem, its proof, and its connection to other
PAC bounds, see~\cite{blla03}. For now, it is important to note that
the applicability of this theorem is only limited by the assumption
that the train and test sets are each drawn independently from the
distribution $D$, and that $L = \{ \sigma \}$ is a ``valid'' description
language.
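As a concrete numerical sketch (ours, not part of the original presentation), the quantities $\bucket$, $\bmax$, and the resulting bound of the theorem can be computed exactly with integer arithmetic; the function names below are assumptions for illustration:

```python
from math import comb

def bucket(m, n, a, b):
    """P(at least b blue) when (a+b) balls are drawn without replacement
    from a bucket of m red and n blue balls (cumulative hypergeometric)."""
    return sum(comb(n, t) * comb(m, a + b - t)
               for t in range(b, a + b + 1)) / comb(n + m, a + b)

def bmax(m, n, a, delta):
    """Largest b with bucket(m, n, a, b) >= delta."""
    best = 0
    for b in range(n + 1):
        if bucket(m, n, a, b) >= delta:
            best = b
    return best

def pac_mdl_bound(m, n, train_errors, sigma_bits, delta):
    """Test-set error bound of the theorem: bmax(m, n, errs, 2^{-|sigma|} delta)."""
    return bmax(m, n, train_errors, delta * 2.0 ** (-sigma_bits))
```

Since $\bucket$ is non-increasing in $b$, the scan in `bmax` could stop at the first failure; the linear scan keeps the sketch simple.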
\section{Application to Clustering}
\label{sec:cluster}
In this section, we discuss the application of the PAC-MDL bound to
clustering. Two important issues are relevant for clustering bounds:
\begin{enumerate}[A.]
%\vspace*{-2mm}
\item How does a clustering algorithm produce a prediction?
%\vspace*{-2mm}
\item What is a ``valid'' description language for transductive classifiers?
%\vspace*{-2mm}
\end{enumerate}
For A, recall that the PAC-MDL bound is applicable to a transductive
classifier. It turns out that {\em any} clustering can be
converted into a transductive classifier. Given the entire train set
$(X^m,Y^m)$, and $X^n$ from the test set, consider $X^{m+n}$ as the
input to the clustering algorithm. Say the clustering algorithm
partitions $X^{m+n}$ into $k$ subsets. To find the labels
$\hat{Y}^{m+n}$ on all points, first compute the most common label in
each cluster using $Y^m$. Then,
\begin{enumerate}[i.]
\vspace*{-2mm}
\item if the most common label is of class $i$, $i \in \{1,\cdots,\ell\}$, label
{\em all} points in that cluster as class $i$;
\vspace*{-2mm}
\item if there is a tie between two or more class labels,
pick one of them uniformly at random
and label {\em all} points in that cluster with that label;
\vspace*{-2mm}
\item if there are no labeled points
in a cluster, choose a label $i \in \{1, \cdots,\ell \}$ uniformly at
random and label {\em all} points in that cluster with that label.
\vspace*{-2mm}
\end{enumerate}
After the assignment of the new labels, all points in each cluster
have the same label.
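The three rules above can be sketched as follows (a minimal illustration; the function and variable names are ours, not from the paper):

```python
import random
from collections import Counter

def transductive_labels(cluster_ids, train_labels, num_labels):
    """Assign one label per cluster: the majority train label (rule i),
    a random choice among tied majority labels (rule ii), or a uniformly
    random label for clusters with no labeled points (rule iii).

    cluster_ids  : cluster index for each of the m+n points, train points first
    train_labels : labels Y^m for the first m points, each in 1..num_labels
    """
    m = len(train_labels)
    label_of = {}
    for c in set(cluster_ids):
        counts = Counter(train_labels[i] for i in range(m) if cluster_ids[i] == c)
        if not counts:                                   # rule iii: no labeled points
            label_of[c] = random.randint(1, num_labels)
        else:                                            # rules i and ii
            top = max(counts.values())
            label_of[c] = random.choice(
                [lab for lab, cnt in counts.items() if cnt == top])
    return [label_of[c] for c in cluster_ids]            # constant label per cluster
```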
For issue B above, note that the description $\sigma$ of a classifier
can be considered a binary code that contains all the relevant
information for generating the labels $\hat{Y}^{m+n}$. Hence,
formally speaking, there is a Turing machine that takes $\sigma$
%and $X^{m+n}$
as an input, produces the labels $\hat{Y}^{m+n}$ as
output and halts.
%Since $X^{m+n}$ is fixed for a problem,
Therefore, the description language $L = \{ \sigma \}$ must be a
prefix-free code (by the halting property) and hence satisfy Kraft's
inequality~(see~\cite{coth91}, Theorem 5.2.1, Lemma 7.3.1, for
details), i.e.,
%\[
$\sum_{\sigma \in L} 2^{-|\sigma|} \leq 1$.
%\]
This is the {\em only} condition a label description language $L = \{
\sigma \}$ must satisfy for the proposed bound to hold.
%All we need to decide in any given setting is an appropriate
%description language $L =\{ \sigma \}$. In the most basic setting, a
%particular clustering algorithm is applied to a particular dataset
%with a fixed number of clusters, and a fixed
%initialization\footnote{Note that most parametric clustering
%algorithm, e.g., EM, Kmeans, etc., need an initialization.}. In more
%general settings (of higher description complexities) the best
%performance over a set of random initializations, a varying number of
%clusters, and even a set of clustering algorithms may be considered.
\subsection{PAC-MDL Bound for Clustering}
Consider the problem of clustering a dataset $X^{m+n}$, where $m$ is
the train-set size and $n$ is the test-set size. For any
fixed\footnote{We mean fixed initialization and fixed cluster number
here.} clustering algorithm, we can construct a description language
$L = \{\sigma\}$ by letting each description $\sigma$ have a constant
label on each cluster, as discussed earlier. Given the clustering,
this description is sufficient to generate the new labels
$\hat{Y}^{m+n}$ over the entire data. Now, if $c$ is the number of
clusters and $\ell$ is the number of labels, then the set of
descriptions, i.e., the language $L = \{ \sigma \}$, has size at most
$\ell^c$ which can be indexed using only $|\sigma| = c \log \ell$ bits
(``fractional bits'' are ok here). Since $|L| \leq \ell^c$, it is
straightforward to see that Kraft's inequality is indeed satisfied,
and hence $L$ is a valid description language. On a more general note,
if one assigns a probability measure $p(\sigma)$ to all $\sigma \in
L$, since
\[
\sum_{\sigma \in L} p(\sigma) = 1 \quad \Rightarrow \quad \sum_{\sigma \in L} 2^{-\log
\left( \frac{1}{p(\sigma)} \right)} =1~,
\]
$|\sigma| = \log \left( \frac{1}{p(\sigma)} \right)$ always gives a
valid bit description complexity for $\sigma$. Here, since $|L| =
\ell^c$, we have essentially assigned a uniform probability of
$p(\sigma) = \frac{1}{\ell^c}, ~\forall \sigma \in L$.
We call the above description language {\sf Simple}. Using a direct
application of the PAC-MDL bound, we get: With
probability at least $(1-\delta)$ over the draw of the train and test
sets $S,S' \sim D^{m+n}$, the test-set error is bounded by:
\begin{equation}
\hat{\sigma}_{S'} \leq \bmax \left( m,n,\hat{\sigma}_S,\frac{\delta}{\ell^c} \right)
\label{eqn:bound}
\end{equation}
\par
In practice, clustering algorithms are highly dependent upon random
initializations, so the algorithm is run multiple times with the
best\footnote{``Best'' might be defined with respect to some
algorithm-specific metric or with respect to bound performance,
depending on what you want to evaluate.} performing run chosen. Note
that the description length of this scheme is higher since the
description language must specify which random initialization to use.
If the best initialization is chosen from $r$ random initializations,
the description length is $|\sigma| = c \log \ell + \log r$. We call
this language {\sf Init}. Applying the PAC-MDL bound for {\sf Init},
we get with probability at least $(1-\delta)$ over the draw of the
train and test sets:
\begin{equation}
\hat{\sigma}_{S'}\leq \bmax \left( m,n,\hat{\sigma}_S,\frac{\delta}{r\ell^c} \right)
\label{eqn:randBound}
\end{equation}
\subsection{The Right Number of Clusters}
Most clustering algorithms need the number of clusters as an input to
the algorithm. The PAC-MDL bound provides a natural mechanism for
cluster number selection when some label information is available.
Consider running a clustering algorithm on a dataset over a range of
cluster numbers and picking the cluster number with the tightest bound
on the test-set error. This process increases the size of our
set of descriptions again. We use a description language for the
cluster number $c$ requiring $\log (c(c+1))$ bits. This is ``legal''
because it does not violate the Kraft inequality, since the sum
telescopes:
%\[
$\sum_{c=1}^\infty \frac{1}{c(c+1)} = \sum_{c=1}^\infty \left( \frac{1}{c} - \frac{1}{c+1} \right) = 1$.
%\]
One slight optimization is possible here: the nature of our metric
disallows the $c=1$ case, implying that we can substitute
$c\rightarrow c-1$. With this description language, the length of our
description is $|\sigma| = c\log \ell + \log r + \log( c(c-1) )$ bits.
We call the language {\sf Cluster}. Applying the PAC-MDL bound for
{\sf Cluster}, we get that with probability at least $(1-\delta)$ over
the draw of the train and test sets, the optimal cluster number $c^*$
achieves a test-set error of:
\begin{equation}
\hat{\sigma}_{S'} \leq \bmax \left( m,n,\hat{\sigma}_S,\frac{\delta}{r\ell^{c^*}c^*(c^*-1)} \right)
\label{eqn:clusRandBound}
\end{equation}
\subsection{The Right Algorithm}
In practice, for a given dataset, it is not clear which clustering
algorithm is appropriate to use. Typically, an algorithm is chosen
using domain knowledge, and there is normally no objective way to
verify whether the choice made was good or bad. To cope with this, we
can extend our description language to specify one of multiple
algorithms. More precisely, if we are choosing the best among $s$
clustering algorithms, an extra $\log s$ bits are required to send the
index of the clustering algorithm that performs the best. Hence,
$|\sigma| = c^*\log \ell + \log r + \log( (c^*-1)c^* ) + \log s$.
Being the best over all algorithms, we call this language {\sf Algo}.
Applying the PAC-MDL bound, we see that with probability at least
$(1-\delta)$ over the draw of the train and test sets, the best
algorithm with the optimal cluster number and optimal initialization
achieves a test-set error of:
\begin{equation}
\hat{\sigma}_{S'} \leq \bmax\left(m,n,\hat{\sigma}_{S},
\frac{\delta}{r \ell^{c^*} c^*(c^*-1)s} \right)~.
\label{eqn:bestBound}
\end{equation}
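The description lengths of the four languages build on one another, which a short sketch makes explicit (function names are ours, for illustration only):

```python
from math import log2

def bits_simple(c, ell):
    """Simple: index one of ell^c constant-per-cluster labelings."""
    return c * log2(ell)

def bits_init(c, ell, r):
    """Init: add log r bits to name the best of r initializations."""
    return bits_simple(c, ell) + log2(r)

def bits_cluster(c, ell, r):
    """Cluster: add log(c(c-1)) bits to name the cluster number (c >= 2)."""
    return bits_init(c, ell, r) + log2(c * (c - 1))

def bits_algo(c, ell, r, s):
    """Algo: add log s bits to name the best of s algorithms."""
    return bits_cluster(c, ell, r) + log2(s)
```

Each richer language pays a small additive cost in $|\sigma|$, which tightens or loosens the corresponding $\bmax$ bound accordingly.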
\section{Empirical results}
\label{sec:results}
We present an empirical study of the proposed evaluation technique on
the problem of text clustering for several benchmark datasets using
various algorithms.
\subsection{Datasets}
\label{subsec:datasets}
The datasets that we used for empirical validation and comparison of
our algorithms were carefully selected to represent some typical
clustering problems: (a) {\sf classic3} is a well known collection of documents
that contains 3893
documents, among which 1400 {\sc Cranfield} documents are from
aeronautical system papers, 1033 {\sc Medline} documents are from
medical journals, and 1460 {\sc Cisi} documents are from information
retrieval papers; (b) {\sf classic2} is a subset of 2860 documents from the {\sf classic3}
collection formed with the 1400 {\sc Cranfield} documents and the 1460
{\sc Cisi} documents; (c) {\sf cmu-newsgroup-clean-1000}, or, the CMU
20 newsgroups dataset is a widely used text analysis dataset that is a
collection of approximately 20,000 messages from 20 different USENET
newsgroups, with approximately 1000 messages per group;
(d) {\sf cmu-newsgroup-clean-100} was formed by sampling 100 messages per
group from the full 20 newsgroup dataset; (e) {\sf cmu-different-1000} is a subset of the original
20 newsgroups dataset consisting of 3 groups on very different topics:
{\em alt.atheism}, {\em rec.sport.baseball},
{\em sci.space};
(f) {\sf cmu-different-100} is a subset of (e) formed by sampling 100
documents per topic;
(g) {\sf cmu-similar-1000} is a subset of the original 20 newsgroups
dataset consisting of 3 groups on similar topics: {\em
talk.politics.guns}, {\em talk.politics.mideast}, {\em
talk.politics.misc};
(h) {\sf cmu-similar-100} is a subset of (g)
formed by taking 100 documents per topic;
(i) {\sf cmu-same-1000} is a subset of the original 20 newsgroups
dataset consisting of 3 groups on the same broad topic (computers) with
different subtopics: {\em comp.graphics},
{\em comp.os.ms-windows}, {\em
comp.windows.x};
(j) {\sf cmu-same-100} is a subset of (i) formed by sampling 100
documents per topic;
and, (k) {\sf yahoo}, or, the Yahoo News (K-series) dataset that has
2340 Yahoo news articles from 20 different categories.
\subsection{Algorithms}
We experiment with 6 algorithms that have been applied to text
datasets with varying degrees of success. Since the motivation behind
the experiments is to establish the efficacy of the proposed criterion
in evaluation, comparison and model selection for clustering, we have
not tried to be exhaustive in the list of algorithms that have been
used. However, we have chosen algorithms that represent the state-of-the-art
and have been applied to text clustering in the literature.
The algorithms we consider are:
%\begin{enumerate}[1.]
%\item
{\tt SPKMeans}~\cite{dhmo01}, better known as spherical kmeans, which
% the algorithm
employs the widely used cosine similarity;
% in the kmeans problem. With
%$L_2$ normalization of both the data points and the cluster centers, a
%greedy update algorithm was derived in~\cite{dhmo01}. In the
%literature, this algorithm has been shown to perform well on benchmark
%text datasets.
%\item
{\tt FSKMeans}~\cite{bagh04}, a frequency-sensitive version of spherical
kmeans;
% that attempts to maintain clusters of approximately equal size
%employing a frequency sensitive competitive learning mechanism.
%\item
{\tt Hard-moVMF}~\cite{bdgs03}, a generative model-based clustering algorithm
that uses a mixture of von Mises-Fisher (vMF) distributions to model the data;
%The algorithm employs a hard-assignment version of the EM scheme.
%\item
{\tt Soft-moVMF}~\cite{bdgs03}, which also uses a mixture
of von Mises-Fisher distributions to model the data
with soft-assignments that are finally converted
to hard assignments
% the difference being that this performs full-fledged EM iterations using soft
%assignments. Conversion to hard assignments is achieved
by the standard method of assigning a data point to the highest-probability
cluster;
%\footnote{Random assignments following posterior probabilities
%are reasonable~\cite{kemn97}, but are not explored in this paper.};
%\item
{\tt KMeans}~\cite{jadu88}, the standard kmeans clustering
algorithm;
% Though it has not been particularly useful for clustering
%high-dimensional text data, it is definitely the most widely
%used clustering algorithm.
%\item
and {\tt KLKMeans}~\cite{dhmk03}, better known as information-theoretic
clustering, which uses KL-divergence between $L_1$ normalized
document vectors instead of squared Euclidean distance in the kmeans
framework.
%It has been applied for word clustering for text
%classification in the literature. However, we use it to perform text
%clustering directly.
%\end{enumerate}
%%\vspace*{-5mm}
%Note that generative model based algorithms that use mixtures of
%Bernoulli or multinomial distributions, which have been shown to
%perform well for text datasets, have not been included in the
%experiments since they have been empirically shown to perform very
%similarly to vMF models by extensive experimentation~\cite{zhgh03}.
\subsection{Methodology}
Before performing any experiments, an independent train-test split
needs to be made. All experiments reported in this paper were
performed on 5 different train-test splits: 10-90, 30-70, 50-50,
70-30, 90-10. On each train-test split, we performed 4 sets of experiments
for each dataset corresponding to the 4 bounds discussed in section~\ref{sec:cluster}:
\begin{enumerate}[(a)]
\vspace*{-2mm}
\item Experiments for a particular algorithm, with a fixed cluster
number, with a particular initialization. The test-set error-rate
bound is computed using~\myref{eqn:bound}.\footnote{
{\small Code for computing
the PAC-MDL bound is available at
http://hunch.net/$\sim$jl/projects/prediction\_bounds/info\_theory
\_n\_learning\_theory/pac-mdl\_bound.tar.gz}}
% With $s$ algorithms,
%$c_{\max}$ clusters, and $r$ different initializations, this
%corresponds to a total of $sc_{\max}r$ experiments.
\vspace*{-2mm}
\item The best results for a particular algorithm with a fixed
cluster number, where the best is computed over all
the $r$ possible initializations. The test-set error-rate bound is
computed using~\myref{eqn:randBound}.
\vspace*{-2mm}
\item The best results for a particular algorithm,
where the best is computed over all the $r$ possible initializations
and all the $c_{\max}$ possible cluster numbers. The test-set
error-rate bound is computed using~\myref{eqn:clusRandBound}.
\vspace*{-2mm}
\item The best performance for a given dataset, where the best
is computed over all algorithms over all possible cluster sizes
and all possible initializations. The bound on the test-set error-rate
is computed using~\myref{eqn:bestBound}.
\end{enumerate}
\begin{figure}[t]
\vspace*{-2mm}
\centering
\epsfig{file=figs/special.Init.cmu-different-1000.cmu-same-1000.SPKMeans.KMeans.0.5.eps, width=0.95\linewidth}
\caption{Test-set error-rate bounds for {\tt KMeans} and {\tt SPKMeans} on {\sf cmu-different-1000} and {\sf cmu-same-1000}:
10 runs with different initializations for 3 clusters.}
\label{diffInit}
\vspace*{-4mm}
\end{figure}
\subsection{Results}
Now we are ready to present results on the various datasets. We start
with the simplest language and using~\myref{eqn:bound} present
representative results comparing individual runs of a particular
algorithm for a fixed cluster number with different random
initializations (Fig~\ref{diffInit}). Taking the best over 10
initializations for each cluster number, we then compare performance
of a particular algorithm over a range of cluster numbers
(Fig~\ref{diffClus}) using~\myref{eqn:randBound}. Next, by taking the
best over the entire cluster number range considered, we compare the
performance of various algorithms (Fig~\ref{diffAlgo})
using~\myref{eqn:clusRandBound}. Finally, by taking the best over the
best of all the algorithms considered, using~\myref{eqn:bestBound}, we
present the best results on particular datasets for various train-test
splits (Fig~\ref{classic}-\ref{20news}). Unless otherwise stated,
all results are on a 50-50 train-test split. \par
%\begin{figure}[t]
%\centering
%\epsfig{file=figs/special.Init.cmu-same-1000.SPKMeans.KMeans.0.5.eps, width=0.98\linewidth}
%\caption{Test set error rate bounds for {\tt KMeans} and {\tt SPKMeans} on {\sf cmu-same-1000}:
%10 runs with different initializations for 3 clusters.}
%\label{sameInit}
%\end{figure}
In Fig~\ref{diffInit}, we present test-set error-rate bounds for {\tt
KMeans} and {\tt SPKMeans} on {\sf cmu-different-1000} and {\sf
cmu-same-1000} for 10 runs with different initializations for 3
clusters. All bounds are computed using~\myref{eqn:bound}. {\sf
cmu-different-1000} is a relatively easy dataset in that its labels
are reasonably separated, being samples from 3 quite different
newsgroups. As a result, both algorithms achieve low bounds on the
error-rate. {\tt SPKMeans} performs particularly well since it was
designed to be a text clustering algorithm~\cite{dhmo01}. Over the 10
runs, {\tt KMeans} achieves a lowest bound of 34.13 \% with
probability 0.9 in run 4, and {\tt SPKMeans} achieves a lowest bound
of 4.93 \% with probability 0.9 in run 6. The best constant classifier has
error-rate 66.67 \%. {\sf cmu-same-1000} is a relatively difficult
dataset since the true labels have significant overlaps. Again, {\tt
SPKMeans} achieves lower bounds than {\tt KMeans} in most runs,
although the bounds are in general higher than those for {\sf
cmu-different-1000}. Over the 10 runs, {\tt KMeans} achieves a lowest
bound of 50.8 \% with probability 0.9 in run 7, and {\tt SPKMeans}
achieves a lowest bound of 31 \% with probability 0.9 in run 3; the
best constant classifier has error-rate 66.67~\%.
Next, we consider the best performance over all the 10 initializations for
each cluster number and compare them over a range of cluster numbers.
The bound calculation is done using~\myref{eqn:randBound}. In
Fig~\ref{diffClus}, we present test-set error-rate bounds for {\tt
KMeans} and {\tt SPKMeans} on {\sf cmu-different-1000} and {\sf
cmu-same-1000} over a range of cluster numbers (2 to 20). We observe
that {\tt SPKMeans} achieves a lower bound than {\tt KMeans} for most
cluster numbers for reasons described above. Over the entire range,
{\tt SPKMeans} achieves a lowest bound of 5.46 \% with a probability
of 0.9 for 3 clusters. This is marginally higher than the lowest
bound of 4.93 \% in Fig~\ref{diffInit}, since the bound calculation
uses~\myref{eqn:randBound} after incorporating the extra $\log 10$
bits required to index the best over the 10 runs with different random
initializations. This is the extra cost for not knowing upfront which
random initialization is going to perform the best. This demonstrates
the trade-off between improvement in prediction accuracy and
considering more runs with different random initializations. On the
other hand, {\tt KMeans} achieves a lowest bound of 17.2 \% with a
probability of 0.9 for 13 clusters.
%In Fig~\ref{sameClus}, we present similar comparisons
For {\sf cmu-same-1000}, over the entire range, {\tt SPKMeans} achieves a
lowest bound of 32.2 \% with probability 0.9 for 3 clusters, which is
marginally higher than the lowest of 31 \% in Fig~\ref{diffInit} due
to extra description complexity of $\log 10$ bits in indexing the
best. {\tt KMeans} achieves a lowest bound of 40.3 \% with
probability 0.9 for 10 clusters.
\begin{figure}[t]
\vspace*{-2mm}
\centering
\epsfig{file=figs/special.cmu-different-1000.cmu-same-1000.SPKMeans.KMeans.0.5.eps, width=0.95\linewidth}
\caption{Test-set error-rate bounds for {\tt KMeans} and {\tt SPKMeans} on {\sf cmu-different-1000} and {\sf cmu-same-1000}:
Best over 10 runs with different initializations over a range of cluster numbers (2 to 20).}
\label{diffClus}
\vspace*{-6mm}
\end{figure}
%\begin{figure}[bh]
%\vspace*{-2mm}
%\centering
%\epsfig{file=figs/special.cmu-same-1000.SPKMeans.KMeans.0.5.eps, width=0.98\linewidth}
%\caption{Test-set error-rate bounds for {\tt KMeans} and {\tt SPKMeans} on {\sf cmu-same-1000}:
%Best over 10 runs with different initializations over a range of cluster numbers (2 to 20).}
%\label{sameClus}
%\end{figure}
\par
From the results in Fig~\ref{diffClus}, we make an
interesting observation.
%Clearly, {\tt SPKMeans} has a better
%performance than {\tt KMeans} on the both the datasets considered.
Note that for {\tt SPKMeans}, the {\em optimal number of clusters}, as
dictated by the lowest bound over the entire range of cluster numbers
considered, is 3 for both the datasets. Interestingly, the number of
true labels in both the datasets is 3. This demonstrates how the
proposed criterion can be used for {\em model-selection}, the
``right'' number of clusters in this particular case, for a given
algorithm and a dataset.
%\begin{figure}[ht]
%\vspace*{-2mm}
%\centering
%\epsfig{file=figs/special.classic2.classic3.trainFrac.eps, width=0.98\linewidth}
%\caption{Test-set error-rate bounds on {\sf classic2} and {\sf classic3}:
%Best over all cluster numbers and initializations.}
%\label{classicAlgo}
%\end{figure}
\par
\begin{figure}[htbp]
\vspace*{-4mm}
\centering
\rotatebox{270}{\includegraphics[scale=0.5]{figs/alg_perf.ps}}
\vspace*{-3mm}
\caption{Test-set error-rate bounds for each algorithm on all the 11 datasets:
{\sf classic2}, {\sf classic3}, {\sf cmu-different-1000}, {\sf
cmu-similar-1000}, {\sf cmu-same-1000}, {\sf cmu-different-100}, {\sf
cmu-similar-100}, {\sf cmu-same-100}, {\sf cmu-newsgroup-clean-1000},
{\sf cmu-newsgroup-clean-100}, {\sf yahoo}: Best over all cluster
numbers and initializations.}
\label{diffAlgo}
\vspace*{-2mm}
\end{figure}
Next, we compare the best performance of each of the algorithms, with
best taken over all cluster numbers and initializations, on various
datasets. This comparison is of great practical interest since this
determines the appropriateness of an algorithm for a given
dataset.
Since the best performance of each algorithm over cluster
numbers and initializations is considered,
\myref{eqn:clusRandBound} is used to compute the bounds in
Fig~\ref{diffAlgo}. In Fig~\ref{diffAlgo}, we compare the test-set
error-rate bounds for all the 6 algorithms on
%{\sf cmu-different-100}, {\sf cmu-different-1000}, {\sf cmu-similar-100},
%and {\sf cmu-similar-1000} are presented in Fig~\ref{diffAlgo}.
all the 11 datasets under consideration. For datasets such as {\sf
cmu-different-100}, {\sf cmu-similar-100}, and {\sf cmu-same-100}, the low
number of samples in high dimensions makes the clustering problem hard
for most algorithms. Among the algorithms considered, {\tt Soft-moVMF}
performs quite well, e.g., it achieves the lowest test-set error-rate
bound of 18.6 \% with a probability of 0.9 on {\sf
cmu-different-100}. On the other hand, datasets such as {\sf
cmu-different-1000} have more samples from the same distribution, which
makes the clustering problem reasonably simple for quite a few
algorithms.
We note that 4 algorithms have comparable performance, with {\tt
Soft-moVMF} achieving the lowest bound of 5.67\% with a probability
of 0.9. It is interesting to note that {\tt SPKMeans} achieves a bound
of 5.87\%, which is marginally higher than the bound of 5.46\% we
observed in Fig~\ref{diffClus}. The marginal increase is due to the
extra description length of $\log ((3-1)3)$ bits used to describe the
fact that the optimal cluster number is 3.
%Again, this demonstrates
%the tradeoff between considering more options, different cluster
%numbers in this case, and improvement in prediction accuracy.
As for the relative performance of the algorithms, as expected, there
is no clear winner across all datasets, although {\tt Soft-moVMF}
appears to win quite often.
%(e.g., {\sf cmu-same-100} (Fig~\ref{sameAlgo}),
%{\sf cmu-similar-1000} (Fig~\ref{simAlgo})), and never lose by much
%(e.g., {\sf cmu-same-1000} (Fig~\ref{sameAlgo}), {\sf yahoo} (see
%Fig~\ref{yahoo})).
%\begin{figure}[ht]
%\vspace*{-2mm}
%\centering
%\epsfig{file=figs/special.cmu-same-100.cmu-same-1000.trainFrac.eps, width=0.98\linewidth}
%\caption{Test-set error-rate bounds on {\sf cmu-same-100} and {\sf cmu-same-1000}:
%Best over all cluster numbers and initializations.}
%\label{sameAlgo}
%\end{figure}
\par
We now present best bounds on the test-set error-rate by taking
the best over all algorithms, cluster numbers and initializations. We
use~\myref{eqn:bestBound} to compute the bound. Results over 5
different train-test splits on all the datasets considered are
presented in Figs~\ref{classic}-\ref{20news}.
\par
{\sf classic2} is a relatively simple dataset with only 2
reasonably well-separated classes. As we see in Fig~\ref{classic}, for all
train-test splits, the bound on the test-set error-rate is very
low. For the 50-50 train-test split, with probability 0.9 we get a PAC
bound of 1.68\% on the test-set error-rate. (The best constant
classifier has an error-rate of 48.95\%.) This is a remarkably low
error-rate bound by PAC standards\footnote{Note that increasing the
train-set fraction need not increase prediction accuracy since the
clustering algorithms never actually look at the labels.}.
%We shall return to this point later in this section and again
%in section~\ref{sec:discuss}.
{\sf classic3} is also a relatively simple dataset with 3 classes. As
shown in Fig~\ref{classic}, for the 50-50 train-test split, with
probability 0.9 we get a bound of 2\% on the test-set
error-rate. (The best constant classifier has an error-rate of 62.50\%.)
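The constant-classifier baselines quoted throughout are simply one minus the largest class proportion, since the best constant classifier always predicts the majority class. A minimal sketch (the function name is ours, not from the paper):

```python
def constant_classifier_error(class_counts):
    """Error rate of the best constant classifier: always predict the
    majority class, so the error is 1 minus the largest class
    proportion."""
    total = sum(class_counts)
    return 1.0 - max(class_counts) / total

# Three balanced classes, as in the cmu-different collections:
# 1 - 1/3 = 66.67% error for the best constant classifier.
```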
\begin{figure}[h]
\vspace*{-2mm}
\centering
\epsfig{file=figs/classic2.classic3.cmu-different-100.cmu-different-1000.eps, width=0.95\linewidth}
%\vspace*{-10mm}
\caption{Test set error rate bounds for {\sf classic2}, {\sf classic3},
{\sf cmu-different-100}, {\sf cmu-different-1000}: Best over all algorithms,
cluster numbers, initializations}
\label{classic}
\vspace*{-4mm}
\end{figure}
\par
%\begin{figure}[t]
%\centering
%\epsfig{file=figs/cmu-different-100.cmu-different-1000.eps, width=0.98\linewidth}
%%\vspace*{-10mm}
%\caption{Test set error rate bounds for {\sf cmu-different-100} and {\sf cmu-different-1000}: Best over all algorithms,
%cluster numbers, initializations}
%\label{different}
%\end{figure}
%\par
Although {\sf cmu-different-100} consists of samples from 3
relatively different classes, the small number of samples and high
dimensionality make the problem difficult. As shown in
Fig~\ref{classic}, with a probability of 0.9, we get a bound of
20.6\% on the test-set error-rate for the 50-50 train-test split.
(The best constant classifier has an error-rate of 66.67\%.) In {\sf
cmu-different-1000}, the extra samples make finding structure in the
data easier
%as shown in Fig~\ref{classic}
--- a bound of 6\% is obtained.
\par
%{\sf cmu-similar-100} is a relatively difficult dataset to learn to
%predict by clustering due to overlapping content and the small number
%of samples. A bound of 56.67\% is achieved. (The best constant
%classifier has error-rate 66.67\%.) The extra samples in {\sf
%cmu-similar-1000}, provides a bound of 25.2\% (see Fig~\ref{similar}).
%\begin{figure}[ht]
%\centering
%\epsfig{file=figs/cmu-similar-100.cmu-similar-1000.eps, width=0.98\linewidth}
%\caption{Test set error rate bounds for {\sf cmu-similar-100} and {\sf cmu-similar-1000}: Best over
%all algorithms, cluster numbers, initializations}
%\label{similar}
%\end{figure}
\par
Due to a large overlap between the underlying labels, both {\sf
cmu-same-100} and {\sf cmu-same-1000} are difficult datasets on which
to obtain good predictions by clustering alone. As shown in
Fig~\ref{20news}, with a probability of 0.9, we achieve error-rate
bounds of 64\% and 28.4\% respectively, while the best constant
classifier has an error-rate of 66.67\%.
%\begin{figure}[th]
%\centering
%\epsfig{file=figs/cmu-same-100.cmu-same-1000.eps, width=0.98\linewidth}
%\caption{Test set error rate bounds for {\sf cmu-same-100} and {\sf cmu-same-1000}: Best over all algorithms,
%cluster numbers, initializations}
%\label{same}
%\end{figure}
\par
{\sf cmu-newsgroup-clean-100} is a difficult dataset since there are
20 underlying classes with significant overlaps and a small number of
samples per class. As Fig~\ref{20news} shows, the lowest bound on the
test-set error-rate is 84\% with probability 0.9 on a 50-50
train-test split, whereas the best constant classifier has an
error-rate of 95\%. For {\sf cmu-newsgroup-clean-1000}, with an
increased number of samples from the same problem, a lowest bound of
47.94\% is achieved with probability 0.9.
\begin{figure}[hbt]
\vspace*{-2mm}
\centering
\epsfig{file=figs/cmu-same-100.cmu-same-1000.cmu-newsgroup-clean-100.cmu-newsgroup-clean-1000.eps, width=0.95\linewidth}
\caption{Test set error rate bounds for {\sf cmu-same-100}, {\sf
cmu-same-1000}, {\sf cmu-newsgroup-clean-100} and
{\sf cmu-newsgroup-clean-1000}: Best over all algorithms,
cluster numbers, initializations}
\label{20news}
\vspace*{-3mm}
\end{figure}
\par
{\sf yahoo} is a dataset with 20 underlying classes and 2340
examples. The class proportions are highly skewed, ranging from as low
as 0.0038 to as high as 0.2111. Naturally, unsupervised prediction is
nontrivial.
%As shown in Fig~\ref{yahoo},
The best performance (not shown) achieves a bound of 73.84\% with
probability 0.9 on the 50-50 train-test split. (The best constant
classifier has an error-rate of 78.89\%.)
%\begin{figure}[t]
%\centering
%\epsfig{file=figs/yahoo.best12.eps, width=0.98\linewidth}
%\caption{Test-set error-rate bounds for {\sf yahoo}: Over all best for each train-test
%split, and, best for each algorithm over cluster numbers, initializations}
%\label{yahoo}
%\end{figure}
\par
At this point, we make two observations:
%\begin{enumerate}[1)]
%\item
(i) on all the datasets considered, the test-set error-rate bound
is {\em always} better than the error-rate of the best constant classifier; and, more interestingly,
%\item
(ii) for most datasets, the bound on the test-set error-rate for the
best clustering algorithm is {\em comparable to}, perhaps even {\em
better than}, many supervised learning bounds.
%\end{enumerate}
Clustering appears to capture the structure of labeling in natural
datasets.
\par
Finally, we present a comparison between the bounds obtained from
the 4 language families considered: {\sf Simple}, {\sf Init},
{\sf Cluster} and {\sf Algo}, corresponding to
Eqns~\myref{eqn:bound}-\myref{eqn:bestBound}. It is difficult to
directly compare languages because each optimizes over a different set
of possibilities. For example, it would be unfair to compare the best
bound (across all possibilities) for {\sf Init} to the best bound
(across all possibilities) for {\sf Cluster}, since {\sf Cluster} takes into
account the optimization over all number of clusters while {\sf Init} does
not. To make the comparison fair, we compare the average bound
of {\sf Init} to that for {\sf Cluster}.
%\footnote{A reasonable alternative could
%have been comparing the expected bound from language 3 to the best language 4 bound.}.
%the ``average best'' language 3 bound to the best language 4 bound. By ``average best'',
%we mean taking the expectation over all possibilities not considered
%by language 3 of the best bound within language 3.
Similar arguments apply to other languages. Fig~\ref{lang} displays
comparisons of the 4 languages on all the datasets under consideration.
%, where a language's number is the language used in the corresponding equation number. Hence, lower numbers indicate simpler languages.
As shown in Fig~\ref{lang}, there seems to be some advantage to using
a more complicated language, i.e., optimizing over several
initializations, cluster numbers, and algorithms. In practice, using a
more complicated language of course implies more computational effort.
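The fair-comparison procedure described above can be sketched as follows. This is a hypothetical illustration: \texttt{bounds} maps (algorithm, cluster number, initialization) triples to bound values, and neither function name comes from the paper. The richer {\sf Cluster} language optimizes over cluster numbers, so the simpler {\sf Init} language is summarized by averaging over that degree of freedom rather than taking its best.

```python
from statistics import mean

def cluster_language_bound(bounds, algo):
    """Cluster-language summary for one algorithm: the best bound over
    all (cluster number, initialization) pairs."""
    return min(v for (a, k, i), v in bounds.items() if a == algo)

def init_language_bound(bounds, algo):
    """Init-language summary for one algorithm: best over
    initializations for each fixed cluster number, then averaged over
    cluster numbers, since Init does not optimize over k."""
    ks = {k for (a, k, i) in bounds if a == algo}
    best_per_k = [min(v for (a, k2, i), v in bounds.items()
                      if a == algo and k2 == k)
                  for k in ks]
    return mean(best_per_k)
```

Comparing the best {\sf Init} bound directly against the best {\sf Cluster} bound would be unfair, since the latter has already optimized over cluster numbers; averaging removes that advantage.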
\begin{figure}[t]
\vspace*{-2mm}
\centering
\rotatebox{270}{\includegraphics[scale=0.38]{figs/lang_perf.ps}}
%\epsfig{file=figs/special.language.eps, width=0.95\linewidth}
\caption{Test-set error-rate bounds on 11 datasets for
4 languages: {\sf Simple}, {\sf Init}, {\sf Cluster}, and {\sf Algo}.}
\vspace*{-4mm}
\label{lang}
\end{figure}
\vspace*{-4mm}
\section{Related Work }
\label{sec:related}
%Almost all unsupervised clustering algorithms have an internal quality
%measure.
% i.e., one that does not depend on supervision such as labels,
%feedback in terms of rewards etc. Examples of internal measures
%include
%such as average squared loss from a cluster
%representative~\cite{jadu88}, some variant of a normalized edge-cut
%value of a pairwise similarity graph~\cite{kahk99},
%category utility function~\cite{fish87} etc.
A clustering algorithm typically tries to optimize its
internal quality measure, thereby making objective comparison of
various algorithms practically impossible. Note that several
unsupervised methods for comparing clusterings, e.g., Jaccard index,
Rand index, Fowlkes-Mallows index, Mirkin metric, variation of
information etc., exist in the literature (for details, see
\cite{jadu88,meil03} and references therein). These completely
unsupervised methods are incapable of measuring performance in
supervised tasks, such as prediction.
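For concreteness, the Rand index mentioned above compares two clusterings by counting the point pairs on which they agree. A minimal sketch (quadratic in the number of points; the function name is ours):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree, i.e.
    both place the pair in the same cluster or both separate it."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)
```

Note that the index is invariant to relabeling the clusters, so two identical partitions with permuted cluster ids still score 1.0; as the text observes, such indices say nothing about prediction performance.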
Since
the predictive ability of clustering algorithms is often key to
their successful application,
external prediction-related quality
measures are often appropriate. Several supervised measures such as
purity, entropy, normalized mutual information, supervised F-measure
etc. have been used in the literature (see~\cite{ghos03} for
details).
%Although the main goal of these measures is to give an idea
%of the prediction ability of a clustering algorithm,
However, it is not clear how these measures relate to the error-rate
of the clustering algorithm when used for prediction. Furthermore,
none of these supervised measures helps with cluster number
selection, which remains a significant issue.
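As an example of such a supervised measure, purity assigns each cluster its majority label and reports the fraction of points covered. A minimal sketch (the function name is ours); it also illustrates why purity cannot guide cluster-number selection: it increases trivially toward 1 as clusters shrink to singletons.

```python
from collections import Counter

def purity(cluster_ids, true_labels):
    """Fraction of points whose label matches the majority label of
    their cluster.  Purity approaches 1 as clusters shrink to
    singletons, so it cannot select the number of clusters."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, []).append(y)
    correct = sum(Counter(ys).most_common(1)[0][1]
                  for ys in by_cluster.values())
    return correct / len(true_labels)
```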
An information theoretic external validity measure motivated by the
{\em minimum description length} (MDL) principle has been recently
proposed~\cite{dom01}.
%This measure correctly captures the predictive
%ability of a clustering algorithm on a train-set from the lossless
%compression and MDL viewpoint. Starting from a given clustering, the
%measure computes the number of extra bits required to exactly get the
%class label information.
In spite of having several desirable
properties, this measure has a few drawbacks with respect to the
measure proposed here:
(i) the measure is not normalized to the interval $[0,1]$ (and cannot
easily be normalized to that interval), whereas a bounded range is
desirable in several settings and is satisfied by many other quality
measures; and,
(ii) the measure does not provide guarantees of prediction ability
for test data.
On a more general note, by attempting to measure the bits required to
precisely specify all the class labels in the dataset, the
method~\cite{dom01} overlooks the more general possibility of lossy
compression, which appears useful and is heavily utilized in our
proposed criterion.
\section{Discussion}
\label{sec:discuss}
The PAC-MDL bound provides an objective criterion for evaluation,
comparison and model selection for clustering that is applicable when
the goal of clustering is related to prediction. Experimental results
show that this criterion is practically and flexibly useful.
It is particularly striking (and perhaps even shocking) that the
test-set error-rate bounds achieved by the best clustering algorithms
are very competitive with various supervised learning bounds on the
true error rate of learned classifiers. This good performance
suggests that clustering algorithms are doing something fundamentally
``right'' for prediction purposes on natural datasets.
\small
\begin{thebibliography}{10}
\vspace*{2mm}
\bibitem{bdgs03}
A.~Banerjee, I.~Dhillon, J.~Ghosh, and S.~Sra.
\newblock Generative model-based clustering of directional data.
\newblock In {\em KDD}, pages 19--28, 2003.
\bibitem{bagh04}
A.~Banerjee and J.~Ghosh.
\newblock Frequency sensitive competitive learning for balanced clustering on
high-dimensional hyperspheres.
\newblock {\em IEEE Transactions on Neural Networks}, May 2004.
\bibitem{blla03}
A.~Blum and J.~Langford.
\newblock {PAC-MDL} bounds.
\newblock In {\em COLT}, 2003.
\bibitem{coth91}
T.~M. Cover and J.~A. Thomas.
\newblock {\em Elements of Information Theory}.
\newblock Wiley-Interscience, 1991.
\bibitem{dhmk03}
I.~Dhillon, S.~Mallela, and R.~Kumar.
\newblock A divisive information-theoretic feature clustering algorithm for
text classification.
\newblock {\em Journal of Machine Learning Research}, 3(4):1265--1287, 2003.
\bibitem{dhmo01}
I.~S. Dhillon and D.~S. Modha.
\newblock Concept decompositions for large sparse text data using clustering.
\newblock {\em Machine Learning}, 42(1):143--175, 2001.
\bibitem{dom01}
B.~E. Dom.
\newblock An information-theoretic external cluster-validity measure.
\newblock Technical Report RJ 10219, IBM Research Report, 2001.
\bibitem{ghos03}
J.~Ghosh.
\newblock Scalable clustering.
\newblock In Nong Ye, editor, {\em The Handbook of Data Mining}, pages
247--277. Lawrence Erlbaum Assoc., 2003.
\bibitem{jadu88}
A.~K. Jain and R.~C. Dubes.
\newblock {\em Algorithms for Clustering Data}.
\newblock Prentice Hall, New Jersey, 1988.
\bibitem{meil03}
M.~Meil\u{a}.
\newblock Comparing clusterings by the variation of information.
\newblock In {\em COLT}, 2003.
\end{thebibliography}
\end{document}