I'm rewriting the abstract again to shift the emphasis towards
implications over means.  Also, I disagree with some portion of
the current abstract.  It is not true that the improved margin bound
tells you: "...that averaging basic hypotheses in a data-dependent way
can drastically improve classification."  The improved margin bound
says something about the _gap_ between empirical and true error and
does not make a direct statement about the empirical error.
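To be concrete about the distinction: margin bounds have the rough shape
(the complexity term here is a placeholder, not the real statement)

```latex
\Pr_D[y f(x) \le 0] \;\le\;
  \underbrace{\Pr_S[y f(x) \le \theta]}_{\text{empirical margin error}}
  \;+\; \underbrace{\epsilon(m, \theta, \delta)}_{\text{gap term}}
```

and the improvement shrinks the gap term $\epsilon$; it says nothing
about the first term.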

I don't like the term 'mixture discriminant'.  It is too jargonized.
Not every person at ICML will know what a 'discriminant' is and
'mixture' is too overloaded to be unambiguous.  There are other terms
which are more transparent to the average ICML person.  I suggest
'average hypotheses' or 'averaging classifier'.  I like 'averaging
classifier'.  It appears that the earlier paper used 'combined
classifier' and 'voted classifier'.

I'm adding back the first paragraph of the old introduction on the
theory that it provides motivation lacking in the current
introduction. 

I like the little {\em ... } statements.

With an improved first paragraph, the old first paragraph repeats too
much.  Nuking.

'MC Sampling' isn't clear enough.  In general, it seems like you are
defining a bit more in the introduction.  I'm not against defining,
but it makes getting the point across harder.  

I like saying "Schapire, Freund, Bartlett and Lee" rather than
"Schapire et al" even though it's a little hokey to have four authors.
This is the principal prior work and it feels a little bit clearer to
name all four authors.

Merging old paragraph 2 and new paragraph 2.  I like the history
lesson in old paragraph 2 - it gives a sense of perspective and
context while also foreshadowing.

The authors do _not_ conclude that "The averaging distribution should
be chosen s.t.\ a large margin is achieved on most of the training
set."  This is an important little land-mine.  Other people concluded
this and were disappointed.

Modifying the LaTeX so it is self-contained.  You seem to have added
reliance on a private macro file and on nips for some reason.

It isn't a "proposed PAC-Bayesian technique".  It's a "PAC-Bayesian
bound".  Actually, we haven't defined 'PAC' and we shouldn't just
assume ICML people know it.

The discussion of the importance of a prior seems like a distraction
in the introduction.  The important improvement arises from
considering the "posterior", not the "prior".  I'm using quote marks,
because neither the "posterior" nor the "prior" occurring in the bound
are necessarily Bayesian posteriors or priors.

Actually, I'm nuking the whole paragraph based upon a "means vs. ends"
argument.  It's trying to explain what we will do rather than the
importance of what we do.  I could imagine a revised version sitting
in the discussion section.

I like the older first 'setting' paragraph a bit more because it does
things more incrementally.  For example, it defines X before defining
H.  This is more natural.  Also, Q should not be defined before the
averaging classifier because Q is not a part of the setting - it is
just a part of the analysis.

Ornamenting with \cal seems good.  It makes things stand out a bit
more.

suffix -> subscript

It isn't good to have an example which doesn't aid in understanding
the material.  Altered to aid in understanding the material.

I'm not sure what "Large margin" is supposed to mean in the title.

I'm restructuring into two subsections 'setting' and 'derived
quantities for analysis' because it makes things vaguely clearer.

Nimrod, I'm swapping \theta(x,y) -> t(x,y) and t -> \theta so we
can preserve similarity with the original margin paper.

PAC is used without definition.  I'm defining it.

There is some style difference between the way Nimrod writes and the
way Matthias writes.  Nimrod tends to use a more broken up word
sequence while Matthias just states things.  I think I like Nimrod's
style more - it makes reading a bit less monotonous.  But there is
room for argument.  I'm going with Nimrod's definitions at the moment.

Matthias did a mapping a_i -> q_i which I'm not sure I like because it
departs from the earlier margin paper and I don't see a justification.
Switching back, but feel free to argue.

<rant>Matthias: you need to include the _motivation_ of a section.  As an
example, the beginning of your earlier results is:

"In this subsection we restrict ourselves, unless otherwise stated, to
a finite hypothesis space ${\cal H} = \{h_1,\dots,h_k\}$. The margin
bound of \cite{bartlett:98b}, already mentioned in section
\ref{sec-intro}, is stated here for reference."

The older version was: 

"The improved margin bound arises from an improvement in a critical
step in the proof of the original margin bound, which we state here
for reference.  We denote by $\pr{D}$ the probability measure of the
distribution $D$ defined above.  For any set $S$ of examples, we
denote by $\pr{S}$ the uniform probability distribution over the set
$S$."

Especially for ICML, every section must have motivation.  
</rant>

It's generally best to avoid negative-laden sentences like: "we employ
a result which is neither a margin bound nor deals with averaging
classifiers..."  Mutating.

Nimrod, I know you like \Delta to denote the KL-divergence.  I'm
sticking with D( || ) for the moment because it is more standard.  I
expect you changed it to avoid confusion between D the distribution
and D the pseudo-metric, which is a worthy goal.  Maybe D the
distribution should be shifted.
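For concreteness, the D( || ) I mean is the standard relative entropy,

```latex
D(Q \,\|\, P) \;=\; \sum_i q_i \ln \frac{q_i}{p_i}
```

for distributions $Q$, $P$ over the hypothesis space.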

Shifting from McAllester -> PAC-Bayes because it is a more explanatory
name.

This: "The important novelty in McAllester's technique is that prior
information about the classification task can be incorporated in a
Bayesian style, by specifying a {\em prior distribution} $P$ over
${\cal H}$." is not correct.  The Occam's razor bound from 1985 can
incorporate a prior in a meaningful way.

I'm shifting the statement of the PAC-Bayes theorem away from the
original and towards the language of our paper.  In particular, we
don't want to futz with arbitrary loss functions.  Let's just
specialize to what is required for our proof.
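For reference, the specialization I have in mind is roughly the
following (modulo whatever final form we settle on): with probability
at least $1-\delta$ over the $m$-sample, simultaneously for all
"posteriors" $Q$,

```latex
D\!\left( \hat{e}_Q \,\middle\|\, e_Q \right) \;\le\;
  \frac{D(Q \| P) + \ln \frac{m+1}{\delta}}{m}
```

where $\hat{e}_Q$ and $e_Q$ are the empirical and true expected errors
of a $Q$-average and the left-hand $D(\cdot\|\cdot)$ is a KL-divergence
between Bernoulli distributions.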

Nimrod - Matthias's statement of the Occam's razor theorem seems a bit
clearer.  Take a look.

Matthias and Nimrod - the Occam's razor bound isn't really limited to
a finite hypothesis space.  It just isn't very interesting for most of
the hypotheses in an infinite hypothesis space.

I like avoiding the '-' in the definition of entropy because it's
easier to understand an equation where every term is positive.
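I.e., write the entropy with every term positive,

```latex
H(Q) \;=\; \sum_i q_i \log \frac{1}{q_i}
```

rather than $-\sum_i q_i \log q_i$.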

Matthias - the reference to N was just obsolete.  It should have been
log(m).

Nimrod - I'm going with Matthias's statement of the main theorem
because it seems a bit clearer to me.

The note in the old draft about how to apply this to various machine
learning algorithms is important.  Reincluding.

I don't like: "We note that there is a non-asymptotic version of
theorem \ref{th-main} whose less accessible form, however, might
obscure its practical meaning." because it assumes too much about the
mind-state of the reader and seems sort of patronizing.

I think it's important to state explicitly how Hoeffding's bound
applies to the 'g'.
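Concretely, the statement to spell out is Hoeffding's bound for i.i.d.
bounded variables (I'm assuming the $g$ values land in $[0,1]$; the
actual range of $g$ should be checked):

```latex
\Pr\!\left[ \frac{1}{m} \sum_{i=1}^m g_i - \mathbf{E}[g] \ge \epsilon \right]
  \;\le\; e^{-2 m \epsilon^2}
```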

I'm changing Q_N -> Q^N because it really is a cross product
distribution.

Is it: \sum_N \delta/(N(N+1)) = \delta
or \sum_N \delta/(N(N+1)) \leq \delta
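Partial fractions settle it: the sum telescopes,

```latex
\sum_{N=1}^{\infty} \frac{\delta}{N(N+1)}
  = \delta \sum_{N=1}^{\infty} \left( \frac{1}{N} - \frac{1}{N+1} \right)
  = \delta,
```

so it's equality when the union bound runs over all $N \ge 1$, and
'$\leq$' for any partial range.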

Matthias - in general I like the changes you made to the proof.  

Referring to the original proof for getting the asymptotic bound is
weak.  We want the paper to be self-contained.  Adding.

Matthias - I'm removing your sections 4, 5, and 6 because no effort
was made to work with the old section 5 and 6 which cover
substantially the same material in a manner better for ICML.  If you
think something is missing, then I encourage you to add it in rather
than attempting a wholesale replacement.

You also cut the old section 4.  I'm going to leave it out for the
moment because I like the idea of placing it into a supporting tech
report.  I think including an example of the bound on a problem is
more important.

Nimrod - I'm reworking the Maximum entropy section a bit.  See if that
clears things up for you.

With section 4 gone, we should seriously contemplate pushing the proof
into the appendix.

Added example.  I think I will actually program it up - it might give
us some pretty results which might help us get into ICML.
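Since I plan to program it up anyway, here is a rough sketch of what
the computation might look like.  Everything here is an assumption to
check against the final theorem statement: the bound form
$(D(Q\|P) + \ln\frac{m+1}{\delta})/m$, the inversion of the binary KL
by binary search, and all the names.

```python
import math

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_discrete(Q, P):
    """D(Q||P) over a finite hypothesis space (P assumed full-support)."""
    return sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)

def pac_bayes_bound(emp_err, Q, P, m, delta):
    """Largest p with kl(emp_err || p) <= (D(Q||P) + ln((m+1)/delta))/m,
    found by binary search (kl is increasing in p for p >= emp_err)."""
    rhs = (kl_discrete(Q, P) + math.log((m + 1) / delta)) / m
    lo, hi = emp_err, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if kl_bernoulli(emp_err, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return hi  # upper end of the bracket: a valid upper bound
```

With a uniform "posterior" over a few hypotheses and a uniform "prior",
the divergence term vanishes and the bound is driven by the
$\ln\frac{m+1}{\delta}$ term, which is a useful sanity check.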

Added a little discussion about the algorithm directly motivated by
our bound.

Getting rid of the dependence on the long mybib.bib file.  Not being
self-contained is a problem, especially when attempting to work
remotely.
