\documentclass{llncs}
\usepackage{epic,eepic}
\usepackage{latexsym}
\usepackage{amssymb,amsmath}
\usepackage{color}
\usepackage{epsfig}
\usepackage{float}
\usepackage{url}
\input{bibs/bibmacs.tex}
\renewcommand{\note}[1]{\newline{\bf{#1}}\newline}
\newcommand{\mnote}[1]{\marginpar{#1}}
\newcommand{\bnote}[1]{{\bf{#1}}}
\floatstyle{ruled}
\newfloat{algorithm}{tbp}{loa}
\floatname{algorithm}{Algorithm}
\def\clip{\mathrm{clip}}
\def\bin{\{0,1\}}
\def\int{[0,1]}
\def\yh{\hat{y}}
\def\S{\mathcal{S}}
\def\R{\mathbb{R}}
\def\E{\mathbb{E}}
\def\St{\widetilde{\S}}
\def\binning{\mathtt{Binning}}
\def\optpred{\mathtt{OptPredict}_k}
\def\s{\mathbf{s}}
\def\r{\mathbf{r}}
\def\q{\mathbf{q}}
\def\b{\mathbf{b}}
\def\e{\mathbf{e}}
\def\z{\mathbf{z}}
\def\l{\ell}
\def\h{\mathit{h}}
\def\lt{\widetilde{\ell}}
\def\yb{{\bar{y}}}
\def\zero{\mathbf{0}}
\def\half{\frac{1}{2}}
\def\Lt{\mathcal{L}}
\def\sry{\s^{\r,y}}
\def\sone{\s^{\r,1}}
\def\szero{\s^{\r,0}}
\def\zone{\z^{\q,1}}
\def\zzero{\z^{\q,0}}
\def\dmax{\mathsf{dm}}
\title{Continuous Experts and the Binning Algorithm}
\author{Jacob Abernethy\inst{1}
\and John Langford\inst{1}
\and Manfred K. Warmuth\inst{2}\inst{3}}
\institute{Toyota Technological Institute,
Chicago
\and University of California at Santa Cruz
\and Supported by NSF grant CCR 9821087
\\
\email{\{jabernethy,jl\}@tti-c.org, manfred@cse.ucsc.edu}
}
\begin{document}
\maketitle
\begin{abstract}
We consider the design of online master algorithms for combining the
predictions from a set of experts where the absolute loss of the
master is to be close to the absolute loss of the best expert. For
the case when the master must produce binary predictions, the
Binomial Weighting algorithm is known to be optimal when the number
of experts is large. It has remained an open problem how to design
master algorithms based on binomial weights when the predictions of
the master are allowed to be real valued. In this paper we provide
such an algorithm and call it the Binning algorithm because it maintains
experts in an array of bins. We show that this algorithm is
optimal in a relaxed setting in which we consider experts as
continuous quantities. The algorithm is efficient and near-optimal
in the standard experts setting.
\end{abstract}
\section{Introduction}
A large number of on-line learning algorithms have been developed for
the so-called {\em expert setting} \cite{\wm,\vovk,\experts,\bw,\hkw}:
learning proceeds in trials; at each trial the master algorithm
combines the predictions from the experts to form its own prediction;
finally a label is received and both the experts and master incur a
loss that quantifies the discrepancy between the predictions and the
label. The goal of the master is to predict as well as the best
expert. In this paper we focus on the absolute loss when the
predictions of the experts and the labels are binary in $\bin$, but
the prediction of the master can be continuous in the range
$\int$.\footnote{The loss $|\yh-y|$ (for prediction $\yh \in \int$ and
label $y\in \{0,1\}$) equals $\E(|Z-y|)$ for a binary valued
prediction random variable $Z$ for which $\E(Z)=\yh$.}
Perhaps the simplest expert algorithm is the Halving algorithm: the
master predicts with the majority of experts and, whenever the
majority is wrong, the incorrect experts are eliminated. If there is
at least one expert that never errs then this algorithm makes at most
$\log_2 n$ mistakes, where $n$ is the number of experts.
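To make the protocol concrete, here is a minimal Python sketch of the Halving algorithm (all function and variable names are ours, not from the paper): the master predicts with the majority of the surviving experts and, after each master mistake, eliminates the experts that were wrong.

```python
def run_halving(expert_predictions, labels):
    """Run the Halving algorithm and return the master's mistake count.

    expert_predictions[t][i] is expert i's binary prediction at trial t;
    labels[t] is the true label.  Assumes some expert never errs.
    """
    alive = set(range(len(expert_predictions[0])))   # still-consistent experts
    mistakes = 0
    for preds, y in zip(expert_predictions, labels):
        ones = sum(preds[i] for i in alive)
        y_hat = 1 if 2 * ones >= len(alive) else 0   # majority vote (ties -> 1)
        if y_hat != y:
            mistakes += 1
            alive = {i for i in alive if preds[i] == y}  # drop the wrong majority
    return mistakes
```

Each master mistake at least halves the set of survivors, which is exactly why the bound of $\log_2 n$ mistakes holds when a perfect expert is present.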
Master algorithms often maintain a weight for each
expert that represents
the ``belief'' that the expert is best. In the Halving
algorithm the weights of all consistent experts are uniform and the
weights of inconsistent experts immediately drop to zero. When there
is no consistent expert for the sequence of trials, then a more
gradual decay of the weights is needed. Most expert algorithms (such
as the Weighted Majority (WM) algorithm \cite{\wm} and Vovk's
Aggregate algorithms \cite{\vovk}) use exponentially decaying weights,
i.e. the weights are proportional to $\exp(-\eta L_i)$, where $L_i$
is the total loss of expert $i$ and $\eta$ a non-negative learning
rate. If we assume that there is an expert that makes at most $k$
mistakes, then a large $k$ (high noise) requires a small $\eta$ and a
small $k$ (low noise) a large $\eta$. In the Halving algorithm, $k=0$ and
$\eta=\infty$.
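A minimal sketch of one trial of such an exponential-weights master follows (Python, names ours; for illustration we use the plain weighted-average prediction, whereas WM and Vovk's algorithm use related but somewhat different prediction functions).

```python
import math

def exp_weights_predict(losses, predictions, eta):
    """One trial of an exponential-weights master.

    losses[i] is expert i's cumulative absolute loss, predictions[i] its
    binary prediction this trial; returns a master prediction in [0, 1],
    here the weighted average with weights proportional to exp(-eta * L_i).
    """
    weights = [math.exp(-eta * L) for L in losses]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, predictions)) / total
```

With equal losses the master hedges at $.5$; as the loss gap grows (or $\eta$ grows), the prediction concentrates on the currently best expert.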
A superior algorithm, Binomial Weighting (BW), uses binomial weights
for the experts \cite{\bw}. These weights are motivated by a version
space argument: if an expert has $k'\leq k$ mistakes left, then it is
expanded into $q^* + 1 \choose \leq k'$ experts\footnote{ This number of
expansions is the current binomial weight of the expert. Exponential
weights always change by a fixed factor $\exp(-\eta)$ in case of a
mistake. However the update factors to the binomial weights ``cool
down'' in subtle ways as $k'$ gets close to $k$ and $q^*$ decreases.},
where $q^*$ is a bound on the number of remaining mistakes of the
master. In each of the expansions, at most $k'$ of the $q^*$ trials are
chosen in which the expanded expert negates its prediction. We now
can run the Halving algorithm on the set of all expanded experts.
However this argument requires that the number of mistakes $q^*$ of
the master is bounded. This is easily achieved when the master makes
binary predictions and incurs $0/1$ loss. In that case, all trials
in which the master predicts correctly can be ignored and in trials
when the master makes a mistake, at least half of the expanded experts
are eliminated and there can't be too many such trials.
Restricting the master to use binary predictions is a significant
handicap as it does not allow the algorithm to hedge effectively when
the experts produce a relatively even vote. In this case, the master
prefers to predict $.5$ instead of predicting\footnote{As discussed
before, predicting $.5$ is equivalent to predicting randomly.} 0 or 1.
The main open problem posed in \cite{\bw} is the question of how the
fancier binomial weights can be used in the case when the master's
predictions lie in $\int$. In that case there are no good bounds on
the number of trials because now all trials in which the master
incurs any loss need to be counted.
In this paper we provide such a prediction strategy, called
$\binning$, and we show that this strategy is essentially optimal. We
generalize the standard experts setting to consider experts as
continuous quantities: we allow each expert to split itself into parts
$r$ and $1-r$, where part $r$ of the expert predicts 1 and part $1-r$
predicts 0. Intuitively, the relaxation to continuous quantities of
experts removes the discretization effects that make the computation
of the optimal strategy difficult.
In our approach we consider an associated game where the master plays
against an adversary who controls the predictions of the experts and
the outcomes of every round to maximize the master's total loss. We
show that for this relaxed setting the adversary can always play
optimally by splitting all remaining experts in half.
$\binning$ is very similar to the Binomial Weighting algorithm (BW) in
that it implicitly uses binomial weights. In the case of exponential
weights, the bound for the algorithm with predictions in $\int$
(i.e. Vovk's aggregating algorithm for the absolute loss) is half of
the bound for the algorithm with binary predictions (i.e. the WM
algorithm). Essentially the same happens for binomial weights. The
bound we prove for the new Binning algorithm (predictions in $\int$)
is a constant plus half
of the bound of the BW algorithm (predictions
in $\bin$) and the additive constant is well behaved. It is already
known that for large $n$, BW is optimal when
the prediction of the master must be binary.
Since no algorithm with predictions
in $\int$ can be better than half of the best binary
prediction algorithm, our new $\binning$ algorithm is essentially optimal.
{\bf Summary of paper:} We begin in Section \ref{s:opt} by introducing
our notation and formulating the optimal prediction strategy for the
experts' game when the master uses predictions in $\int$. In Section
\ref{s:relaxed} we define the continuous game in which we allow continuous
quantities of experts. We show the following is always an
optimal split in each trial of the continuous game:
all experts split themselves in half
(resulting in experts of size $\frac{1}{2^i}$).
We also relate the optimal
algorithm for this game (i.e. the new Binning algorithm) to the BW
algorithm. In Section \ref{s:exp} we give some experimental results
showing $\binning$'s performance and its worst case bound on real
world datasets. Finally, in Section \ref{s:concl} we discuss various
open problems as well as high level goals that might be attainable
with our new continuous-experts technique.
\section{The Optimal Prediction Strategy}
\label{s:opt}
We have $n$ experts that make
binary predictions and we are to design a master algorithm that
combines the predictions of the experts with the goal of performing
well compared to the best expert. Our on-line learning model can be
viewed as a game that proceeds in trials. At each trial the following
occurs: first, each expert $i$ produces a prediction $x_{i} \in \bin$;
then the master produces a prediction $\yh \in \int$ and finally, the
true label $y\in \bin$ is received and both the experts and the
master incur a loss: expert $i$ incurs loss $|x_i-y|$ and the master
loss $|\yh-y|$. Recall that the experts' predictions and the labels
are binary, but the prediction of the master lies in $\int$.
(Generalizations are discussed in the conclusion).
\iffalse
Note the experts' predictions are binary, but the prediction of the
master lies in $\int$.%
\footnote{The loss $|\yh-y|$ of the master is the probability of
disagreeing with $y$ if the master flips a binary coin with
probability $\yh$.} \fi
The two parties of the game are nature and the master: nature provides
the predictions of the experts, the master gives a prediction in each
trial, and finally nature provides the true label. We must restrict
the adversary, since an unrestricted adversary can continue to inflict
loss at least $\half$ in each round. A common restriction is the
following: the true labels and the choices of the experts'
predictions have to be such that at least one expert has total loss
at most $k$. We call this restriction the {\em $k$-mistake rule}. It
is assumed that $k$ is known to both the master and the adversary
before the game starts.
Notice that, with the addition of the $k$-mistake rule, the game is
essentially finite. If, after many rounds, all but one expert has made
more than $k$ mistakes, and the last expert has made exactly $k$, then
this expert is required to predict correctly from this point on. In
this case, the master can simply mimic the prediction of this expert,
and the adversary cannot allow this expert to err. Since this simple
strategy of the master assures that the master incurs no future loss,
the game has ended.
Observe that, after a number of trials, the only relevant information
about the past is the number of experts that made $0,1,\ldots,k$
mistakes. The experts can then be partitioned into $k+1$ bins and the
past can therefore be summarized by the state vector
$\s=(s_0,\ldots,s_k)$, where $s_i$ is the number of experts that have
made $i$ mistakes. We also use the notation $| \s | := | \s |_1 =
\sum_{i=0}^k s_i$ for the total number of experts.
It is possible that $|\s|< n$, as some experts
might already have incurred more than $k$ mistakes and are therefore
ignored. Our state space is therefore the set $\S=\{\s\in
\{0,1,\ldots, n\}^{k+1}:|\s|\leq n\}$.
By choosing binary predictions for the experts,
the adversary splits the state vector $\s$ into
$\r$ and $\s-\r$ such that $\r \leq \s$ and $\r\in\S$. The
vector $\r$ represents the experts that predict one and the vector
$\s-\r$ the experts that predict zero. After the adversary provides
the binary label $y$, the experts that predict wrongly advance by one
bin. For any state vector $\z\in \S$, we use $\z^+$ to denote the
shifted vector $(0,z_0,z_1,\ldots,z_{k-1})$. When $y=0$, then $\r$
advances, and when $y=1$, then $\s-\r$ does,
and the successor state is
$$\sry=\left\{
\begin{array}{ll}
\r^+ + (\s-\r)&\text{if }y=0\\
\r + (\s-\r)^+&\text{if }y=1.
\end{array}
\right.
$$
Let's give an example of one round of our game. Assume the
mistake bound $k$ is 2 and the number of experts $n$
is $6$. Initially, our state vector is
$(6,0,0)$. However, after several rounds, 2 experts have made no
mistakes, 1 expert has made 2 mistakes, and 3 experts have made more
than 2 mistakes. Our state vector is now $(2,0,1)$. On the next round
we receive predictions from the experts, and we find that the only
expert to predict 1 was one of the experts with no mistakes. In this
case, the split of the state vector is
\begin{equation*}
\s = (2,0,1) = \overbrace{(1,0,0)}^{\r} +
\overbrace{(1,0,1)}^{\s - \r},
\end{equation*}
and the two resulting possible states would be
\begin{eqnarray*}
\szero & = & \overbrace{(0,1,0)}^{\r^+} +
\overbrace{(1,0,1)}^{\s - \r} = (1,1,1)\\
\sone & = & \underbrace{(1,0,0)}_{\r} + \underbrace{(0,1,0)}_{(\s -
\r)^+} = (1,1,0).
\end{eqnarray*}
\subsection{The Value of the Game} \label{sec:gamevalue}
We define the value of our game at a state $\s$ as the total loss the
adversary can force the master to incur if its choices satisfy the
$k$-mistake rule. With the above notation we can express the value of
the game as:
\begin{equation}
\label{e:rec}
\l(\s):=
\left\{
\begin{array}{ll}
-\infty&\mbox{if }\s=\zero
\\
0&\mbox{if } \s = (0,\ldots,0,1)
\\
{\displaystyle \max_{\r \leq \s,\r\in\S}
\min_{\yh\in \int}
\max_{y\in \bin}}
\left(
|\yh-y|+\l(\sry)
\right)
& \s \in \;\mbox{rest of } \S.
\end{array}
\right.
\end{equation}
At various points in the paper we induct on the {\em mistake budget}
$B(\s)$ of a state, which we define as the total number of mistakes
that can be made by all of the experts before arriving in the final
state $\b_k=(0,\ldots,0,1)$. Explicitly, $B(\s) := -1 + \sum_{i=0}^k
(k-i+1)s_i$. Notice that $B(\zero)=-1$, and if $B(\s) = 0$ then $\s$ must
be the final state $\b_k$.
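For concreteness, the mistake budget is a one-line computation (Python sketch, names ours).

```python
def mistake_budget(s, k):
    """B(s) = -1 + sum_i (k - i + 1) * s_i : the total number of mistakes
    the experts can still absorb before the final state b_k is reached."""
    return -1 + sum((k - i + 1) * si for i, si in enumerate(s))
```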
Some care needs to be taken to assure that the above game is well
defined because of the possibility that all of the experts make the
same prediction. If a split is {\em unanimous}, i.e. all experts
predict $u=1$ and $\r=\s$, or all experts predict $u=0$ and $\r=\zero$,
then the master must choose $\yh$ as the unanimous label $u$. If the
master chose $\yh\ne u$, the adversary would simply choose $y = u$,
inflicting positive loss $|\yh-u|$ on the master while the experts
incur no mistakes. Therefore whenever all experts predict with some
unanimous label $u$, the optimal choice of the master is $\yh = u$
also.
How should the adversary choose its label $y$ when all experts predict
with some unanimous label $u$ and $\yh=u$? If $y=u$ then the current
trial is vacuous because $\sry=\s$ and none of the parties incur any
loss. We need to make the mild assumption that such vacuous trials are
disallowed. Therefore $y \ne u$ in unanimous trials and in this case
the successor state is $\sry=\s^+$. In summary,
$$ \r \in \{ \zero, \s \} \Longrightarrow
{\displaystyle \min_{\yh\in \int}
\max_{y\in \bin}} \left( |\yh-y|+\l(\sry) \right)
= 1+\l(\s^+).$$
We now expand the recurrence slightly by rewriting (\ref{e:rec}) as
follows. As before, $\l(\zero) = -\infty$ and $\l(\b_k) = 0$, and for
every other state $\s \in \S$,
\begin{equation}
\label{e:bigrecursion}
\l(\s) =
{\displaystyle
\max_{\scriptsize\begin{array}{c}
{\r \leq \s}\\
\r\in\S
\end{array}
}
}
\begin{cases}%{ll}
1 + \l(\s^+) &
\text{if } \r \in \{ \zero,\s \} \\
\max\{\l(\szero),\l(\sone)\} &
\text{if } \r \notin \{ \zero,\s \}, |\l(\sone) - \l(\szero)| > 1 \\
\frac{\l(\sone) + \l(\szero) + 1 }{2} &
\text{if } \r \notin \{ \zero,\s \}, |\l(\sone) - \l(\szero)| \leq 1
\end{cases}
\end{equation}
\iffalse
The unanimous case (i.e. when $\r=0,\s$) follows from the above
discussion. By dropping the loss term $|\yh-y|$ from the original
recurrence, one sees that the value of the $\max\min\max$ is at least
$\max \{ \l(\szero), \l(\sone) \}.$ The master tries to minimize the
value of the game. When the gap $|\l(\szero) - \l(\sone)|$ is more
than 1, then the master can achieve this lower bound by setting $\yh$
to 1 when $\l(\sone) \geq \l(\szero)+1$ and to 0 when $\l(\szero) \geq
\l(\sone)+1$. When the gap is at most 1,
then by choosing
$\yh= \frac{\l(\sone) - \l(\szero) + 1 }{2}\in \int$,
the loss of the master is the same no matter which
label $y$ is chosen by the adversary, i.e.
$$\yh + \l(\szero) = 1-\yh + \l(\sone) =
\frac{\l(\sone) + \l(\szero) + 1 }{2} \geq \max \{ \l(\szero),
\l(\sone) \}.$$
\fi
The unanimous case (i.e. when $\r\in\{\zero,\s\}$) follows from the above
discussion. The remaining two cases arise when we consider how the
game is played once $\r$ has been chosen. The master algorithm wants
to choose $\yh$, while knowing that the adversary can simply choose the
larger of $|\yh - y| + \l(\sry)$ for $y\in\bin$. Thus it would like
to minimize $L(\r,\yh) := \max \{\yh + \l(\szero), 1 - \yh + \l(\sone)
\}$. It can accomplish this by making these two quantities as close as
possible. When $|\l(\szero) - \l(\sone)| \leq 1$, it can make them
exactly equal by setting $\yh= \frac{\l(\sone) - \l(\szero) + 1
}{2}$. However, when $\l(\sone) > \l(\szero)+1$, the master should
choose $\yh = 1$ and $L(\r,\yh) = \l(\sone)$. Similarly when
$\l(\szero) > \l(\sone) + 1$ then $\yh = 0$ and $L(\r,\yh) =
$\l(\szero)$. The latter two cases are summarized in the second line of
the above recursion, completing the argument that the above recursion is
equivalent to (\ref{e:rec}).
A further and more subtle simplification of the above recurrence
is provided by the following lemma which rules out the first two lines
of (\ref{e:bigrecursion}). In particular, it implies two useful facts:
(a) unanimous splits only occur when there is a single
expert and (b)
when $|\s| > 1$, the adversary will never choose $\r$ so that
$|\l(\sone) - \l(\szero)| > 1$.
Note that when there is exactly one expert left
which has made $i\leq k$ mistakes,
i.e. $\s$ is the standard basis vector $\b_i$,
then all splits must be unanimous and $\l(\b_i)=k-i$.
\begin{lemma}
\label{l:ugly}
For any state $\s\in \S$ s.t. $|\s| > 1$,
\begin{equation}
\label{e:rec2}
\l(\s) = \max_{\zero < \r < \s} \left\{
\frac{\l(\szero) + \l(\sone) + 1}{2} \right\}.
\end{equation}
\end{lemma}
The proof requires a rather technical induction and is given in the
appendix.
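For small $n$ and $k$ the recursion of the lemma can be checked directly by brute-force dynamic programming. Below is a memoized Python sketch (names ours; the split enumeration is exponential and intended only for tiny instances). For $k=1$ it gives, e.g., $\l(2,0)=7/4$ and $\l(3,0)=67/32$.

```python
from functools import lru_cache
from itertools import product

def game_value(s, k):
    """l(s): value of the discrete game via the recursion of Lemma 1.

    Base cases: l(0) = -infinity and l(b_i) = k - i for a single
    remaining expert with i mistakes.  For |s| > 1, maximize over all
    strict splits 0 < r < s.
    """
    def shift(z):                       # z -> z^+
        return (0,) + z[:-1]

    @lru_cache(maxsize=None)
    def l(state):
        total = sum(state)
        if total == 0:
            return float("-inf")
        if total == 1:
            return k - state.index(1)   # l(b_i) = k - i
        best = float("-inf")
        for r in product(*(range(si + 1) for si in state)):
            if sum(r) == 0 or r == state:   # the lemma rules out unanimous splits
                continue
            s_r = tuple(a - b for a, b in zip(state, r))
            szero = tuple(a + b for a, b in zip(shift(r), s_r))
            sone  = tuple(a + b for a, b in zip(r, shift(s_r)))
            best = max(best, (l(szero) + l(sone) + 1) / 2)
        return best

    return l(tuple(s))
```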
% Note that the depth of the recurrence of the lemma is bounded by
% $n(k+1)$. In Figure \ref{f:value} we give the value function for the
% case of $k=1$ and $n=3$. The $k$-mistake rule is realized by the base
% case $\l(0,\ldots,0,1):=0$. Since the adversary aims to maximize the
% loss, it will always avoid state $\zero$. Initially all experts have
% loss zero. Therefore the value of the game is $\l(n,0,\ldots,0)$.
% \begin{figure}[h]\label{f:value}
% \begin{center}
% \begin{tabular}{c|ccccc}
% $\l((s_0,s_1))$ & $s_0=0$ & 1 & 2 & 3 & 4 \\ \hline
% $s_1=0$ & $-\infty$ & 1 & 7/4 & 67/32 & 19/8 \\
% 1 & 0 & 5/4 & 29/16 & 69/32 & \\
% 2 & 1/2 & 11/8 & 15/8 & & \\
% 3 & 3/4 & 3/2 & & & \\
% 4 & 1 & & & & \\
% \end{tabular}
% \end{center}
% \caption{The values of $\l(\s)$ for all possible states $\s =
% (s_0,s_1)$ when $k=1$, $n=4$. The recurisve forumla (\ref{e:rec2})
% implies that the loss value will always be of the form
% $\frac{a}{2^b}$.}
% \end{figure}
We now have a very simple recursion for computing $\l$, and we can
easily define the optimal prediction strategy $\optpred$, as in
(\ref{e:opt}), with oracle access to this function. Unfortunately,
$\l$ is still too expensive to compute: it requires the solution of a
dynamic programming problem over $O(|\S|) = O((n+1)^{k+1})$ states.
\begin{algorithm}
\caption{$\optpred$}
At round $t$, the numbers of past mistakes
of the experts are tallied in the state vector $\s$
and the experts that predict 1 (respectively 0) are tallied
in the split vector $\r$ (respectively $\s-\r$).
The master algorithm $\optpred$ outputs prediction:
\begin{equation}
\label{e:opt}
\yh =
\clip\left(
\frac {\l(\s^{\r,1}) - \l(\s^{\r,0}) + 1} {2}
\right),
\end{equation}
where $\clip(x)$ is the point in $\int$ closest to $x$.
\end{algorithm}
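Given oracle access to the value function, the prediction of $\optpred$ is a one-liner. A Python sketch follows (the caller-supplied `value` oracle for $\l$ and the tuple encoding of states are our assumptions).

```python
def clip(x):
    """The point in [0, 1] closest to x."""
    return min(1.0, max(0.0, x))

def opt_predict(s, r, value):
    """Optimal prediction clip((l(s^{r,1}) - l(s^{r,0}) + 1) / 2).

    s is the state vector, r the experts voting 1, and `value` an
    oracle returning l at a given state (e.g. a lookup table)."""
    def shift(z):                       # z -> z^+
        return (0,) + z[:-1]
    s_r = tuple(a - b for a, b in zip(s, r))
    szero = tuple(a + b for a, b in zip(shift(r), s_r))   # successor for y = 0
    sone  = tuple(a + b for a, b in zip(r, shift(s_r)))   # successor for y = 1
    return clip((value(sone) - value(szero) + 1) / 2)
```

When the two successor values coincide (an even split), the prediction is $.5$; when they differ by more than 1, clipping pushes the prediction to 0 or 1.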
\section{The Continuous Game}
\label{s:relaxed}
Computing the value function $\l$ using (\ref{e:rec2}) is too
difficult: at every round we must consider all possible splits $\r$.
However, empirical observations have suggested that splits $\r$ close
to $\frac{\s}{2}$ are optimal for the adversary. In particular, we
have observed that, whenever $\s$ has only even entries, the split $\r
= \frac{\s}{2}$ is always optimal. This evidence leads one to believe
that an optimal adversarial strategy is to evenly divide the experts,
thus balancing the loss value of the two successor states as much as
possible. Unfortunately, this is not always possible when the experts
come in discrete units.
We therefore develop a continuous game that always
allows for such ``even splits'' and show that the value of this
continuous game is easy to compute and tightly upper bounds the value
function of the original game. The continuous game follows the same
on-line protocol given in the first paragraph of Section \ref{s:opt},
however now each expert has a mass in $\int$ and at each trial each
expert is allowed to split itself into two parts which predict with
opposite labels. The total mass of the two parts must equal the original
mass.
\subsection{The Continuous Experts Setting}
As in the discrete game,
the state of the new game is again summarized by a vector,
i.e. it is not necessary to keep track of the identities of
the individual experts, each of which now has a mass in $\int$.
The state space of the new game is $\St= \{\s\in [0,n]^{k+1}:|\s|\leq
n\}$. The initial state vector is again $(n,0,\ldots,0)$, but the
split $\r$ may have non-negative real valued components.
We now define a new value function $\lt$ on $\St$. We would like
the definition of $\lt$ to mimic the recursive definition of $\l$ in
(\ref{e:rec2}), yet we must be careful to define the ``base case'' of
this recursion. In particular, how do we redefine the $k$-mistake rule
within this ``continuous experts'' setting? We require the following
constraints on $\lt$: when $|\s|< 1$, which we call an {\em infeasible}
state, we define $\lt(\s) := -\infty$; when $|\s|\geq 1$, i.e. at least a
total of one unit of experts remains, then $\lt(\s)$ should be
non-negative.
These two requirements are not sufficient: we must consider the case
when we are at a feasible state $\s$ in which, given any split $\r$
that the adversary chooses, at least one of the successor states is
infeasible. This would be a state where the game has effectively
ended, and we will therefore consider it a base case of our
recursion. That is, the recursive definition in (\ref{e:rec2}) would
not be appropriate on such a state, since for every $\r$
either $\lt(\szero)$ or $\lt(\sone)$ is $-\infty$,
and we must therefore fix the value $\lt(\s)$. We let
$S_0$ denote the set of such base-case states:
\begin{eqnarray*}
S_0 & = & \left\{\s :
|\s|\geq 1 \text{ and }\forall
\zero < \r < \s:
|\szero| < 1 \text{ or } |\sone| < 1 \right\} \\
& = & \left\{\s :
|\s| \geq 1 \text{ and } \forall \zero < \r < \s:
|\s|-r_k < 1\text{ or } |\s|-s_k+r_k < 1 \right\} \\
& = & \left\{ \s : |\s| \geq 1 \text{ and } |\s|-\frac{s_k}{2} <
1 \right\} \\
& = & \text{convex-hull} \{\b_0, \ldots, \b_k, 2\b_k\}
\setminus \left\{\s : |\s| - \frac{s_k}{2} = 1 \right\}
\end{eqnarray*}
We obtain the convex hull representation because $S_0$ is described by
$k+3$ linear constraints: $s_0 \geq 0$, \ldots, $s_k \geq 0$, $\sum
s_i \geq 1$ and $\frac{s_k}{2} + \sum_{i < k} s_i < 1$. The resulting
polytope has corners $\b_0, \ldots, \b_k$ and $2\b_k$. The region is
not exactly the convex hull since the last constraint is a strict
inequality, and thus we must subtract one face. Notice that, while
$\b_k$ lies within $S_0$, the remaining states $\b_0,\ldots,\b_{k-1},$
and $2\b_k$ all lie on the subtracted face.
\subsection{The Value of the Continuous Game}
We are now in position to define our recursion for $\lt$. For any
state $\s\in \St$,
\begin{equation}
\label{e:rect}
\lt(\s):=
\begin{cases}
\lt_0(\s) & \text{ if } \s \in S_0 \\
{\displaystyle \max_{\scriptsize
\begin{array}{c}
\r \leq \s,\r\in\St \\
\frac{1}{2} \leq |\r|\leq |\s|-\frac{1}{2}
\end{array}}
\frac{\lt(\szero) + \lt(\sone) + 1}{2}} & \s \in \;\mbox{rest of } \St.
\end{cases}
\end{equation}
Note that it is crucial that in the above recurrence for $\lt$ we
bound\footnote{$\lt$
does not change if we use instead the constraint $\epsilon \leq |\r|
\leq |\s| - \epsilon$ for any $\epsilon$ with $0 < \epsilon \leq
\half$.} the split $\r$ away from $\zero$ and $\s$. Thus whenever we
recurse, at least a total of half a unit of experts is advanced and
the depth of the recurrence is bounded by $2n(k+1)$.
We still need a natural definition of $\lt_0(\s)$ for states $\s \in
S_0$. The simplest definition would be to define $\lt_0$ as
zero. However, this would cause the value function $\lt$ to be
discontinuous at the corners of the base region $S_0$. Intuitively,
we want $\lt$ to mimic $\l$ as much as possible. Thus, the two
properties that we require of $\lt_0$ are (a) that it agrees with $\l$
on the discrete corners of $S_0$, and (b) that it is continuous on the
``in between'' states. We therefore define $\lt_0$ on the interior of
$S_0$ as the linear interpolation of $\l$ on the corner states
$\{\b_0,\ldots,\b_k,2\b_k\}$. We can explicitly define $\lt_0(\s)$ as
follows: write $\s = \alpha_0 \b_0 + \ldots + \alpha_k \b_k +
\alpha_{k+1} (2\b_k)$, where $\alpha_i \in \int$ and $\sum \alpha_i =
1$. Let
\begin{eqnarray*}
\lt_0(\s) := \left(\sum_{i=0}^k \alpha_i \l(\b_i)\right)
+ \alpha_{k+1}\l(2\b_k) \notag
&=& \left( \sum_{i=0}^k \alpha_i (k-i) \right)
+ \frac{\alpha_{k+1}}{2}
\notag \\
&=& \sum_{i=0}^{k-1} s_i \left( k - i \right) +
\frac{|\s|-1}{2}.\label{e:lt0}
\end{eqnarray*}
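Membership in $S_0$ and the interpolated base value translate directly into code (Python sketch, names ours).

```python
def in_S0(s):
    """Base region: |s| >= 1 and |s| - s_k / 2 < 1 (strict, so b_0, ...,
    b_{k-1} and 2*b_k on the subtracted face are excluded)."""
    total = sum(s)
    return total >= 1 and total - s[-1] / 2 < 1

def lt0(s, k):
    """Linear interpolation of l on the corners b_0, ..., b_k, 2*b_k:
    lt_0(s) = sum_{i<k} s_i * (k - i) + (|s| - 1) / 2."""
    return sum(si * (k - i) for i, si in enumerate(s[:-1])) + (sum(s) - 1) / 2
```

As a sanity check, `lt0` returns $k-i$ on each corner $\b_i$ and $\half$ on $2\b_k$, matching $\l$ there.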
We chose $S_0$ and $\lt_0(\s)$ so that the following lemma holds:
\begin{lemma}
For all $\s\in \S$, $\lt(\s) \geq \l(\s)$.
\end{lemma}
\begin{proof}
The continuous game is identical to the discrete game, except that we
have given the adversary a larger space to choose a split
$\r$. Increasing the set of strategies available to the adversary can only
increase the value of the game.
\iffalse \mnote{iffalse this} We
induct on $B(\s)$. For the base case $\s = \b_k$, note that $\l(\b_k)
= \lt(\b_k) = 0$. For any other $\s \in \S$, let $\r \in \S$ be an
optimal split of $\s$. While $\r \in \St$ may not be an optimal split
of $\s$ in the continuous game, the definition implies that $\lt(\s)
\geq \half (\lt(\szero) + \lt(\sone) + 1)$. By induction, $\l(\sry)
\leq \lt(\sry)$ for $y \in \bin$. Combining these last two statements
we have,
\begin{equation*}
\l(\s) = \half(\l(\szero) + \l(\sone) + 1) \leq
\half(\lt(\szero) + \lt(\sone) + 1) \leq \lt(\s),
\end{equation*}
and we are done.
\fi
\qed \end{proof}
% Surprisingly, the continuous loss $\lt$, defined using a much larger
% domain $\St$ on which to recurse, differs only very slightly from
% $\l$. In figure \ref{f:diff} we see some examples of the gap $\lt(\s) -
% \l(\s)$ between the two functions. We were able to compute $\l$ (for
% small values of $n$ and $k$) using dynamic programming, and we see in
% section \ref{sec:bin} how we can efficiently compute $\lt$.
% \begin{figure}[h]
% \begin{center}
% \begin{tabular}{c|c|c|c|c|c|c|c}
% \hline
% & n = 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ \hline \hline
% k = 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
% 1 & 0 & 0.0313& 0 & 0.0156& 0 & 0 & 0 \\
% 2 & 0 & 0.0430& 0.0156& 0.0137& 0.0078& 0.0254& 0 \\
% 3 & 0 & 0.0376& 0.0332& 0.0149& 0.0215& 0.0049& 0.0098\\
% 4 & 0 & 0.0385& 0.0222& 0.0326& 0.0078& 0.0185& 0.0081\\
% %5 & 0 & 0.0427& 0.0218& 0.0298& 0.0157& 0.0207& 0.0061\\
% %6 & \; 0 \; & \; 0.0487\; & \; 0.0276\; & \; 0.0245\; & \; 0.0337\; &
% %\; 0.0126\; & \; 0.0194\\
% \hline
% \end{tabular}
% \end{center}
% \caption{\label{f:diff} The gap $\lt(\s) - \l(\s)$ between the
% continuous and discrete loss for the $k+1$-dimensional state
% vector$\s = {(n,0,\ldots,0)}$.}
% \end{figure}
\subsection{An Optimal Adversarial Strategy}
We now show that for the value function $\lt$ of the continuous game, $\r
= \s - \r = \s/2$ is always an optimal split for the adversary. In
this case the successor state is $\h(\s):=\frac{\s + \s^+}{2}$ no
matter how $y\in \bin$ is chosen.
Define a function $\Lt$ as follows:
\begin{equation}
\label{e:half}
\Lt(\s):=
\left\{
\begin{array}{ll}
-\infty &\mbox{if } |\s|< 1\\
\lt_0(\s) &\mbox{ if }\s \in S_0
\\
\half + \Lt(\h(\s)) & \mbox{ if }\s \mbox{ rest of }\St.
\end{array}
\right.
\end{equation}
We prove $\lt = \Lt$ in two steps. The first is the following crucial lemma.
\begin{lemma} \label{lem:concave}
$\Lt$ is concave.
\end{lemma}
\begin{proof}
We have defined our base region $S_0$ above. Notice that we can
rewrite $S_0$ as $\{ \s \in \St : |\s| \geq 1, |h(\s)| < 1 \}$. Now
define $S_n$ for $n \geq 1$ as $\{ \s : \s \notin S_{n-1} \text{ and }
h(\s) \in S_{n-1} \} = \{\s : |h^n(\s)| \geq 1, |h^{n+1}(\s)| < 1
\}$. We will show that we can effectively reduce the concavity of
$\Lt$ to the concavity of $\Lt$ on the region $S_0 \cup S_1$.
It suffices to show that $\Lt$ is concave on a set of convex open
neighborhoods that cover $\St$, since a failure of concavity can always
be witnessed locally. Let $R_0 := S_0$, and for $n > 0$, define $R_n$ as the
interior of $S_{n-1} \cup S_n$, i.e. $\{ \s \in \St : |h^{n-1}(\s)| >
1, |h^{n+1}(\s)| < 1 \}$. Let $h^0(\s) := \s$ by convention. Notice
that $\cup R_n = \St$ and each $R_n$ is open and convex.
Let $\Lt$ restricted to $R_n$ be denoted $\Lt \arrowvert_{R_n}$. We
show that $\Lt \arrowvert_{R_n}$ is concave for every $n$. $\Lt
\arrowvert_{R_0}$ is certainly concave, for $\Lt$ is defined linearly
on $R_0 = S_0$.
% \begin{figure}[htbp]
% \begin{center}
% \input{regions.pstex_t}
% \caption{The partition of $\St$ into the regions $S_i$ for the
% 2-dimensional case when $k=1$.}
% \label{figure:yourreferencename}
% \end{center}
% \end{figure}
We show that $\Lt \arrowvert_{R_1}$ is concave by first noting that
\begin{equation}
\Lt \arrowvert_{R_1} =
\begin{cases}
\lt_0(\s) = \sum s_i \left( k - i \right) +
\frac{|\s|-1}{2} & \text{if $\s \in S_0$} \\
\lt_1(\s) = \lt_0(h(\s)) + \half = \sum s_i \left( k - i \right) +
\frac{s_k}{4}& \text{if $\s \in S_1$}.
\end{cases}
\end{equation}
These two linear functions are equal when $\frac{|\s|-1}{2} =
\frac{s_k}{4}$, i.e. when $|\s| - \frac{s_k}{2} = 1$, which is
exactly the border between $S_0$ and $S_1$. Since $S_0$ is defined by
the constraint $|\s| - s_k/2 < 1$, we see that $\lt_0(\s) < \lt_1(\s)$
when $\s \in S_0$. These last two statements imply that
$\Lt \arrowvert_{R_1}(\s) = \min \{ \lt_0(\s), \lt_1(\s) \}$ and the
minimum of two linear functions is always concave.
Assume $n > 1$. Notice that $\s \in R_n$ implies that $h(\s) \in
R_{n-1}$. Thus $\Lt \arrowvert_{R_n} = \Lt\arrowvert_{R_{n-1}} \circ h
+ \half$. Note that: (a) $h$ is an orientation-preserving linear
function ($\det(h) > 0$), (b) addition by a constant preserves
concavity, and (c) $\Lt \arrowvert_{R_{n-1}}$ is concave by
induction. Therefore, $\Lt \arrowvert_{R_{n}}$ is concave as well.
\qed \end{proof}
We are now in position to prove the following theorem.
\begin{theorem}\label{thm:half}
For all $\s\in \St$,
\begin{equation}
\label{e:half2}
\lt(\s)=
\left\{
\begin{array}{ll}
-\infty &\mbox{if } |\s|< 1\\
\lt_0(\s) &\mbox{ if }\s \in S_0
\\
\half + \lt(\h(\s)) & \mbox{ if }\s \mbox{ rest of }\St.
\end{array}
\right.
\end{equation}
\end{theorem}
\begin{proof}
We show that, for all $\s \in \St$, $\lt(\s) = \Lt(\s)$. We induct
on the mistake budget $B(\s)$. When $B(\s) < \half$, then $\s \in
S_0$, and $\lt$ and $\Lt$ are defined identically on $S_0$. Now
assume that $\frac{p}{2} \leq B(\s) < \frac{p+1}{2}$ for some
positive integer $p$. It is possible that $\s \in S_0$, in which
case certainly $\lt(\s) = \Lt(\s)$. Otherwise,
\begin{equation}\label{e:lrecdef}
\lt(\s) =
{\displaystyle \max_{\scriptsize
\begin{array}{c}
\r \leq \s,\r\in\St \\
\frac{1}{2} \leq |\r|\leq |\s|-\frac{1}{2}
\end{array}}
\frac{\lt(\szero) + \lt(\sone) + 1}{2}}.
\end{equation}
However, since we may only choose $\r$ such that $\frac{1}{2} \leq
|\r|\leq |\s|-\frac{1}{2}$, it must be that $B(\sone) < \frac{p}{2}$
and $B(\szero) < \frac{p}{2}$. By induction, $\lt(\szero) =
\Lt(\szero)$ and $\lt(\sone) = \Lt(\sone)$. However, by the concavity
of $\Lt$, we see that for any $\r$,
\begin{eqnarray*}
\frac{\lt(\szero) + \lt(\sone) + 1}{2}
& = & \frac{\Lt(\szero) + \Lt(\sone) + 1}{2}
\leq \Lt \left( \frac{\szero + \sone}{2}\right) + \half \\
& = &\Lt \left( \frac{(\s - \r) + (\r)^+ +
\r + (\s - \r)^+}{2}\right) + \half \\
& = &\Lt \left( \frac{\s + \s^+}{2}\right) + \half =
\Lt(h(\s)) + \half = \Lt(\s)
\end{eqnarray*}
and thus $\Lt$ is an upper bound on $\lt$. On the other hand,
notice also that $B(h(\s)) < \frac{p}{2}$, so $\lt(h(\s)) =
\Lt(h(\s))$. Thus, for the choice $\r = \frac{\s}{2}$, we see that
\begin{equation*}
\lt(\s) \geq \frac{\lt(\szero) + \lt(\sone) + 1}{2} =
\frac{\lt(h(\s)) + \lt(h(\s)) + 1}{2} =
\Lt(h(\s)) + \half = \Lt(\s),
\end{equation*}
and thus $\Lt$ is also a lower bound on $\lt$. Therefore, $\lt(\s) =
\Lt(\s)$ and we are done. \qed \end{proof}
\begin{corollary}
The split $\r = \frac{\s}{2}$ is always an\footnote{In general,
there are multiple optimal splits.} optimal choice for the adversary
in the continuous game.
\end{corollary}
\subsection{The Binning Algorithm} \label{sec:bin}
We can now define our new algorithm $\binning_k$
by simply replacing $\l$ in equation (\ref{e:opt}) with the new
function $\lt$. We will reason below that $\lt$ can be
computed efficiently.
Theorem~\ref{thm:half} tells us that, to get the value of $\lt(\s)$,
we apply the function $h$ to $\s$ several times until we are in the base region
$S_0$. Let
\begin{equation}\label{e:qs}
q^\s := \min \{ q : \h^q(\s) \in S_0 \} = \max \{ q : |\h^q(\s)|
\geq 1 \}.
\end{equation}
This allows us to write $\lt(\s)=\frac{q^\s}{2} + \lt_0(\h^{q^\s}(\s)).$
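This evaluation can be carried out directly by iterating $h$ until the state enters the base region. The following minimal sketch (ours, for illustration only; it uses the update $s_i \mapsto (s_i + s_{i-1})/2$ for $h$ and $\lt_0$ as defined in (\ref{e:lt0})) computes $\lt(\s)$ exactly this way:

```python
def apply_h(s):
    """One application of h: new s_i = (s_i + s_{i-1}) / 2, with s_{-1} = 0."""
    return [(x + prev) / 2.0 for x, prev in zip(s, [0.0] + s[:-1])]

def lt0(s):
    """lt_0(s) = sum_i s_i (k - i) + (|s| - 1) / 2, where k = len(s) - 1."""
    k = len(s) - 1
    return sum(x * (k - i) for i, x in enumerate(s)) + (sum(s) - 1) / 2.0

def lt(s):
    """lt(s) = q/2 + lt_0(h^q(s)) with q = max { q : |h^q(s)| >= 1 }."""
    q = 0
    while sum(apply_h(s)) >= 1.0:  # advance while the next state still has |.| >= 1
        s, q = apply_h(s), q + 1
    return q / 2.0 + lt0(s)

# For k = 3: a lone perfect expert has value k, and two experts that have
# both exhausted their budget cost the master exactly one more half point.
print(lt([1, 0, 0, 0]))  # 3.0
print(lt([0, 0, 0, 2]))  # 0.5
```

The two printed values match the facts used elsewhere in the paper: $\l(\b_i) = k - i$, and a pair of experts in the last bin forces one more prediction of $\half$.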
The function $h$ is linear on the state space $\St$
and can be represented as a square matrix of dimension
$k+1$ (The matrix for $k=3$ is given below).
Thus $h^n$ corresponds to an $n$-fold power of this
matrix and leads to binomial coefficients:
\begin{eqnarray*}
{h =
\left[\begin{smallmatrix}
1/2 & 0 & 0 & 0 \\
1/2 & 1/2 & 0 & 0 \\
0 & 1/2 & 1/2 & 0 \\
0 & 0 & 1/2 & 1/2
\end{smallmatrix}\right]}
& \hspace{30pt}
h^n = \frac{1}{2^n}\left[ {n \choose i - j} \right]_{i,j}.
\end{eqnarray*}
From this observation, we see that, for any state $\s$,
$$\left| h^n(\s) \right| = \frac{1}{2^n}\sum_{i=0}^k {n \choose \leq
(k - i) } s_i,$$ where we define ${a \choose \leq b} := \sum_{0 \leq
b' \leq b} {a \choose b'}$. This notation allows us to rewrite
(\ref{e:qs}) as (\ref{e:qs2}) and concisely define $\binning_k$.
\begin{algorithm}
\label{alg:bin}
\caption{$\binning_k$}
Summarize the expert performance with $\s$, and current predictions
with $\r$. For both $y \in \bin$, compute
\begin{equation}
\label{e:qs2}
q_y = q^{\sry} = \max \left\{ q : \frac{1}{2^q}\sum_{i=0}^k {q \choose \leq (k -
i) } (\sry)_i \geq 1 \right\}.
\end{equation}
Using the functions $\clip(\cdot)$ defined in equation (\ref{e:opt}) and
$\lt_0(\cdot)$ defined in (\ref{e:lt0}), output the prediction
\begin{equation}
\label{e:qs3}
\yh = \clip \left( \frac{q_1 - q_0}{4} +
\frac{\lt_0(h^{q_1}(\sone)) - \lt_0(h^{q_0}(\szero)) + 1}{2}
\right)
\end{equation}
\end{algorithm}
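For concreteness, one round of $\binning_k$ can be sketched in code as a direct transcription of equations (\ref{e:qs2}) and (\ref{e:qs3}). This is a minimal illustration, not the authors' implementation: it iterates $h$ naively instead of using binomial tails, and the successor states $\sone = \r + (\s-\r)^+$, $\szero = (\s-\r) + \r^+$ follow our reading of the proof of Theorem~\ref{thm:half}.

```python
def shift(s):
    """s^+: every erring expert moves up one bin; bin-k experts drop out."""
    return [0.0] + s[:-1]

def apply_h(s):
    return [(x + prev) / 2.0 for x, prev in zip(s, shift(s))]

def lt0(s):
    k = len(s) - 1
    return sum(x * (k - i) for i, x in enumerate(s)) + (sum(s) - 1) / 2.0

def q_and_base(s):
    """Return q^s = max { q : |h^q(s)| >= 1 } together with h^{q^s}(s)."""
    q = 0
    while sum(apply_h(s)) >= 1.0:
        s, q = apply_h(s), q + 1
    return q, s

def binning_predict(s, r):
    """One round of Binning_k: s counts experts per mistake bin, r is the
    sub-vector of experts currently predicting 1."""
    diff = [x - y for x, y in zip(s, r)]            # experts predicting 0
    s1 = [a + b for a, b in zip(r, shift(diff))]    # state if y = 1
    s0 = [a + b for a, b in zip(diff, shift(r))]    # state if y = 0
    q1, b1 = q_and_base(s1)
    q0, b0 = q_and_base(s0)
    yhat = (q1 - q0) / 4.0 + (lt0(b1) - lt0(b0) + 1) / 2.0
    return min(1.0, max(0.0, yhat))                  # clip to [0, 1]

# Two experts, k = 1: an even split among equally bad experts yields 1/2,
# while the better expert's vote pulls the prediction toward its side.
print(binning_predict([0, 2], [0, 1]))  # 0.5
print(binning_predict([1, 1], [1, 0]))  # 0.75
```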
The naive complexity of solving \eqref{e:qs2} at every round is $O(k
q^\s) = O(k^2 + k \log |\s|)$, and that of computing $\lt_0 \circ \h^q$
in \eqref{e:qs3} is $O(k q^\s)$.
The overall computation for one round therefore requires
$O(k^2 + k \log |\s| + n)$,
where $n$ is the initial number of experts. Using
bookkeeping one can reduce the per-round complexity to $O(n)$
by maintaining binomials ${q \choose k - i}$, binomial tails ${q \choose
\leq k - i}$, and the function $\lt_0 \circ h^q(\b_i)$ for an expert
with $i$ mistakes. These values can be updated in constant time using
recurrences.
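For illustration, the tail update can be sketched as follows (this is our own minimal version of the bookkeeping, not the authors' data structure). It advances the whole row of binomial tails from $q$ to $q+1$ in $O(k)$ time via the standard identity ${q+1 \choose \leq b} = {q \choose \leq b} + {q \choose \leq b-1}$:

```python
from math import comb

def tails_next(tails):
    """Advance the row [C(q, <=0), ..., C(q, <=k)] to q+1 in O(k) time,
    using C(q+1, <=b) = C(q, <=b) + C(q, <=b-1)."""
    return [tails[b] + (tails[b - 1] if b > 0 else 0)
            for b in range(len(tails))]

# Check the incremental rows against direct summation for k = 4.
k = 4
row = [1] * (k + 1)  # q = 0: C(0, <=b) = 1 for every b >= 0
for q in range(12):
    assert row == [sum(comb(q, m) for m in range(b + 1)) for b in range(k + 1)]
    row = tails_next(row)
```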
\subsection{Binning and Binomial Weighting}
In the introduction we discussed the algorithm Binomial Weighting
which makes deterministic predictions ($\yh \in \bin$) when a mistake
bound $k$ is given. BW finds a bound, $q^*$, on the number of times
the master errs, and considers a set of virtual experts of size
$\sum_j{q^* + 1 \choose \leq k-m_j}$ where the sum is taken over all
experts $j$, and $m_j$ denotes the number of mistakes of
expert $j$. In some
sense, $q^*$ is computed in hindsight: we make $q^*$ big enough so
that we produce ``enough'' virtual experts, i.e. so that we don't
halve the set of virtual experts too many times. It is chosen as
\begin{equation}
q^* = \max \left\{ q : q \leq \log_2 \sum_j{q \choose \leq k -
m_j} \right\}.
\end{equation}
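One simple way to compute $q^*$ (our sketch; not necessarily how BW is implemented): since Pascal's rule gives ${q+1 \choose \leq b} \leq 2{q \choose \leq b}$, the ratio $2^{-q}\sum_j {q \choose \leq k-m_j}$ is non-increasing in $q$, so a linear scan may stop at the first $q$ that violates the inequality.

```python
from math import comb

def tail(a, b):
    """C(a, <= b), the binomial tail; zero when b < 0."""
    return sum(comb(a, m) for m in range(b + 1)) if b >= 0 else 0

def q_star(k, mistakes):
    """Largest q with 2^q <= sum_j C(q, <= k - m_j).  The ratio sum/2^q is
    non-increasing in q (Pascal's rule), so stop at the first violation.
    Assumes at least one expert with m_j <= k, so q = 0 qualifies."""
    q = 0
    while 2 ** (q + 1) <= sum(tail(q + 1, k - m) for m in mistakes):
        q += 1
    return q

print(q_star(1, [0]))      # 1: a single perfect expert, k = 1
print(q_star(0, [0] * 8))  # 3: eight perfect experts, k = 0, q* = log2(8)
```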
Recall that, if we summarize the past performance of our experts with
a state vector $\s$, then $s_i$ is the number of experts $e$ that have
made $i$ mistakes and therefore $\sum_j {q^* \choose \leq k - m_j } =
\sum_{i=0}^k {q^* \choose \leq (k - i)} s_i$. Interestingly, if we
exponentiate the above equation and divide by $2^{q^*}$ we arrive at
equation (\ref{e:qs2}) and thus $q^\s = q^*$.
The loss bound on $\binning_k$ is $\frac{q^\s}{2} +
\lt_0(h^{q^\s}(\s))$. Notice that the binomial nature of $h^{q^\s}$ forces
the majority of the weight in $\s$ to be collected in $s_k$, yet this
term has coefficient $\half$ in the function $\lt_0$. However, the
term $\lt_0(h^{q^\s}(\s))$ quickly drops to a constant $c$ independent
of $k$ as the number of experts goes to $\infty$. Thus, the loss
$\lt(\s) \leq \frac{q^*}{2} + c$ for large enough $n$. (The exact
bound on $\lt_0(h^{q^\s}(\s))$ requires some computation and is
discussed in the full version of this paper.)
The factor of $\half$ is to be expected: the deterministic algorithm
BW suffers loss 1 at every round in the worst case, while $\binning_k$
will be forced to predict $\yh = \half$ against an optimal adversary,
thus suffering loss $\half$.
\section{Experiments}
%
We ran experiments with several real-world datasets (see the table in
Figure \ref{fig:experiments}) obtained from the UCI Machine Learning
Repository. We chose rather simple experts: real-valued features were
replaced with an {}``above or below median'' expert. Categorical
features were
replaced with a set of experts, one for each category value. When
category value $v$ was encountered, expert $v$ would be set to $1$ and
all others would be set to $0$. We also included the constant expert
and for every expert, the complement expert. The original ordering of
the datasets was preserved.
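The expert construction just described can be sketched as follows (a hypothetical transcription: the function name \texttt{make\_experts}, the dictionary-based row format, and the argument names are ours; the paper specifies only the expert definitions):

```python
def make_experts(rows, medians, categories):
    """Binary predictions of the expert pool described above: one
    above-or-below-median expert per real feature, one indicator expert
    per category value, a constant expert, plus all complements."""
    def base(row):
        preds = [1 if row[j] > med else 0 for j, med in medians.items()]
        for j, values in categories.items():
            preds.extend(1 if row[j] == v else 0 for v in values)
        preds.append(1)  # the constant expert
        return preds
    return [base(row) + [1 - x for x in base(row)] for row in rows]

rows = [{"f": 5.0, "c": "x"}, {"f": 1.0, "c": "y"}]
experts = make_experts(rows, medians={"f": 3.0}, categories={"c": ["x", "y"]})
print(experts[0])  # [1, 1, 0, 1, 0, 0, 1, 0]
```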
\begin{figure}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline
data & echo & bupa & hep & wpbc & dia & bcw & Aust & ion & breast &
wdbc & kr-kp & cat & a-l
\tabularnewline
\hline
\hline
rounds & 131 & 345 & 155 & 198 & 768 & 699 & 690 & 351 & 286 & 569 &
3196 & 873 & 8124
\tabularnewline
\hline
experts & 28 & 14 & 70 & 72 & 18 & 20 & 30 & 70 & 100 & 64 & 220 &
1234 & 280
\tabularnewline
\hline
mistakes & 39 & 145 & 32 & 47 & 237 & 65 & 184 & 122 & 79 & 83 & 1012 &
3 & 920
\tabularnewline
\hline
\end{tabular}
\end{center}
\includegraphics[%
scale=0.5,
angle=270]{figs/226-binning.ps}\includegraphics[%
scale=0.5,
angle=270]{figs/226-algorithms.ps}
\caption{\label{fig:experiments}The table above gives the number of rounds, the number
of experts, and the number of mistakes of the best expert. The
datasets are ordered above by byte size. The graph on the
lower left shows how the performance of Binning compares to the
upper bound on Binning performance, the best single expert, and
the bound on Vovk's algorithm. The right graph shows how the
performance of Binning compares to Binomial Weighting (BW) (2
entries missing), Weighted Majority (WM), and Vovk's algorithm.}
\end{figure}
All of the algorithms require the value $k$ (the mistake bound) for
tuning. Since we want to compare the algorithms when they are
optimally tuned we precomputed this value and provided it to each
algorithm. The results are graphed in Figure
\ref{fig:experiments}. The left graph shows that the Binning bound is
tighter than the bound for Vovk's algorithm and often quite close to
the actual Binning performance. The right graph is surprising: the
performance appears to worsen with the tighter bound. In fact, BW and
WM, both deterministic algorithms, performed better than the best
expert on 2 datasets. Perhaps a tighter bound has an associated cost
when, in actuality, we are not in an adversarial setting.
\label{s:exp}
% \section{Expert's predictions in $\int$ and labels still in
% $\bin$}
% \note{MW: Is seems that the following approach
% will generalize the Binning algorithm.
% Assume that the expert's predictions lie in
% $[\epsilon,1-\epsilon] \cup \bin$ for $0<\epsilon \leq \half$.
% Now the unanimous trials can be handled as before and the
% depth of the recurrences is bounded for any fixed $\epsilon$.
% Everything should go through and by taking the limit of
% $\epsilon\rightarrow 0$, we can argue that the binning algorithm
% generalizes to the case when the expert's predictions lie
% in $\int$.
% Now there are no bins any more. Each live expert sits at
% position $l_i\leq k$, where $l_i$ is its loss so far.
% The recursion are still computable because each experts spawns a trail
% of half splits anchored at $l_i$ of limited depth.}
% \section{Expert's predictions and labels in $\int$}
% \note{MW: This is much more tricky. Following the
% analysis done for the Vee algorithm, we first would
% need to show that the worst-case label is always binary
% or agrees with one of the predictions of the experts,
% i.e. $y \in {x_1,\ldots,x_n} \cup \bin$.}
\section{Conclusion}
\label{s:concl}
We discovered a new technique that replaces exponential weights by an
optimal algorithm for a continuous game. The key idea is to allow partial
experts. Our method uses binomial weights, which are more refined than
exponentially decaying weights. Note that the latter weights can be
derived using a relative entropy as a divergence
\cite{\expgrad,\dotpred}. This is reminiscent of the use of
entropies in statistical mechanics, where the entropy approximates
exact counts that are based on binomial tails. In on-line
learning, the entropy-based algorithms came first; they use
approximate counting by approximating the binomial tails with
exponentials. More recently, refined algorithms based on binomial
counting have been emerging (see also \cite{\yoavboost,fo-dgbm-02} for
related work on Boosting).
The algorithms based on exponential weights are clearly
simpler, but they also require knowledge of $k$ for
tuning the learning rate $\eta$ well. The algorithms
that use binomial weights always make use of $k$. If
no tight upper bound on the true $k$ is known,
then simple doubling tricks can be employed
(see e.g. Section 4.6 of \cite{\experts}).
Much work has been done in the realm of exponentially decaying
weights: shifting experts \cite{\trackexp}, multi-armed bandit problems
\cite{\bandit} and so forth. In general, the question is whether
other cases in which exponential weights have been used are amenable
to our new technique of splitting experts. Also, in some settings
\cite{\decgenboost} the algorithm needs to commit to a probability
vector $(w_i)$ over the experts at the beginning of each trial. It
then receives a loss vector $(L_i)\in [0,1]^n$ and incurs a loss
$\sum_i w_i L_i$. The question is whether weights can be extracted
from our Binning algorithm and whether the optimal algorithm can be
found for this modified setting, in which the experts are allowed to
be continuous quantities.
For a more immediate goal, note the following. We assumed that the
predictions of the experts and the labels were binary. However, in the
realm of exponentially decaying weights, Vovk's aggregating algorithm
for the absolute loss \cite{\vovk,\experts} can handle experts'
predictions in $\int$, and the Vee algorithm of \cite{\hkw} can in
addition handle labels in $\int$. We believe that with some
additional effort our methods will generalize to handle
these cases as well.
\appendix
\section{Proof of Lemma \ref{l:ugly}}
%% This proof has been omitted for space. The full version can be found
%% at the following site: \url{http://hunch.net/~jl/projects/projects.html}.
To prove the lemma it suffices to prove the following statements:
\begin{itemize}
\item[(a)] For all splits $\zero < \r < \s$
and $y\in\bin$,
$\l(\s) > \l(\sry)$.
\item[(b)] For all $|\s| > 1$, $\l(\s) > 1 + \l(\s^+)$
and unanimous splits are not optimal.
\end{itemize}
Notice that (a) implies that $\l(\s) > \max\{\l(\szero),\l(\sone)\}$,
which rules out line 2 of the recursion (\ref{e:bigrecursion}).
Also, statement (b) rules out line 1; together, the two statements
imply the lemma.
We prove the above two statements by induction on the
mistake budget
$B(\s)$. Statements (a) and (b) trivially hold for the base case
$\s=\zero$, i.e. when $B(\s) = - 1$. Assume that (a) and (b) hold
for all states $\s'$ where $B(\s') < B(\s)$. We now show that the
statements hold for state $\s$.
For (a), let $\r$ be any non-unanimous split of $\s$ and let
$\z := \szero$.
Consider two cases depending on the value of $|\z|$.
Assume $|\z| = 1$; then $\z = \b_i$. This implies that $\s = \b_i +
m\b_k$ and $\r = m\b_k$ for some $m > 0$. Then $\szero = \b_i$ and
$\sone = \b_{i+1} + m\b_k$, and by (\ref{e:bigrecursion}), we see that
$$\l(\s) \geq \frac{\l(\b_{i}) + \l(\b_{i+1} + m\b_k) + 1}{2} =
\frac{k-i + \l(\b_{i+1} + m\b_k) + 1}{2}.$$ We now show that
$\l(\b_{i+1} + m\b_k) > k - i - 1$ which implies that $\l(\s) > k - i
= \l(\z) $. If $i = k$ then $\b_{i+1} = \zero$ and $\b_{i+1} + m\b_k =
m\b_k$. Therefore, we can see by induction that $\l(m\b_k) \geq
\l(\b_k) = 0>-1$.
%If $i = k-1$, then $\b_{i+1} + m\b_k = (m+1)\b_k$, and by
%induction $\l((m+1)\b_k) > \l(m\b_k) \geq 0 = k - i - 1$.
Also if $i < k$ then by a similar induction, $\l(\b_{i+1} + m\b_k) >
\l(\b_{i+1}) = k-i-1$. This implies that $\l(\s) > k - i$ as
desired.
Assume $|\z| \geq 2$. Since $B(\z) < B(\s)$,
it follows by induction that there is some non-unanimous
split $\q$ of $\z$ s.t.
$$\l(\z) = \frac{ \l(\z^{\q,1}) + \l(\z^{\q,0}) + 1}{2}
= \frac{ \l((\szero)^{\q,1}) + \l((\szero)^{\q,0}) + 1}{2}.$$
Since $\r$ is non-unanimous it follows by induction that the
above is less than
$\frac{ \l(\s^{\q,1}) + \l(\s^{\q,0}) + 1}{2} \leq \l(\s).$
For the proof of (b), let $\z := \s^+$. We consider two cases
depending on the value of $|\z|$.
Assume $|\z| = 1$. Then $\z = \b_{i+1}$ for some $i$, and thus $\s =
\b_i + m\b_k$ for some $m \geq 1$. Notice that we already proved
above that, when $\s$ is of this form, $\l(\s) > k - i = \l(\s^+) + 1$,
as desired.
Assume $|\z| \geq 2$. We prove the statement by induction
on the mistake
budget $B(\s)$. For the base case $B(\s) = 1$, we must be in state
$(0,\ldots,0,2) = 2\b_k$ and therefore $\z = \zero$, so the statement
is true. We now proceed to the induction step. Notice that $B(\z) <
B(\s)$ and $|\z| \geq 2$. By induction, we can therefore find a
non-unanimous split $\q'$ of $\z$ where $\l(\z) = \frac{ \l(\z^{\q',1})
+ \l(\z^{\q',0}) + 1}{2}$. We now choose $\q$ so that $\q^+ =
\q'$. Observe that $(\s^{\q,y})^+ = \z^{\q',y}$ for $y=0,1$. Also,
$B(\s^{\q,y}) < B(\s)$, and thus by induction we can apply (b), giving
us that $\l(\s^{\q,y}) > \l((\s^{\q,y})^+) + 1 = \l(\z^{\q',y}) +
1$. Combining these statements, we see that
\begin{equation*}
\l(\s) \geq \frac{\l(\s^{\q,0}) + \l(\s^{\q,1}) + 1}{2}
> \frac{\l(\z^{\q',0}) + \l(\z^{\q',1}) + 3}{2} = \l(\z) + 1 = \l(\s^+) + 1,
\end{equation*}
as desired.
This completes the proof of the lemma. \qed
\bibliographystyle{alpha}
\bibliography{bibs/coltlocal,bibs/colt,bibs/additional}
\end{document}