I'm rewriting the abstract again to shift the emphasis towards implications over means. Also, I disagree with some portion of the current abstract. It is not true that the improved margin bound tells you: "...that averaging basic hypotheses in a data-dependent way can drastically improve classification." The improved margin bound says something about the _gap_ between empirical and true error and does not make a direct statement about the empirical error.

I don't like the term 'mixture discriminant'. It is too jargonized. Not every person at ICML will know what a 'discriminant' is and 'mixture' is too overloaded to be unambiguous. There are other terms which are more transparent to the average ICML person. I suggest 'average hypotheses' or 'averaging classifier'. I like 'averaging classifier'. It appears that the earlier paper used 'combined classifier' and 'voted classifier'.

I'm adding back the first paragraph of the old introduction on the theory that it provides motivation lacking in the current introduction. I like the little {\em ... } statements. With an improved first paragraph, the old first paragraph repeats too much. Nuking.

'MC Sampling' isn't clear enough. In general, it seems like you are defining a bit more in the introduction. I'm not against defining, but it makes getting the point across harder.

I like saying "Schapire, Freund, Bartlett and Lee" rather than "Schapire et al." even though it's a little hokey to have 4 authors. This is the principal prior work and it feels a little bit clearer to do all four authors.

Merging old paragraph 2 and new paragraph 2. I like the history lesson in old paragraph 2 - it gives a sense of perspective and context while also foreshadowing.

The authors do _not_ conclude that "The averaging distribution should be chosen s.t.\ a large margin is achieved on most of the training set." This is an important little land-mine. Other people concluded this and were disappointed.

Modifying the LaTeX so it is self-contained.
You seem to have added reliance on a private macro file and on nips for some reason.

It isn't a "proposed PAC-Bayesian technique". It's a "PAC-Bayesian bound". Actually, we haven't defined 'PAC' and we shouldn't just assume ICML people know it.

The discussion of the importance of a prior seems like a distraction in the introduction. The important improvement arises from considering the "posterior", not the "prior". I'm using quote marks, because neither the "posterior" nor the "prior" occurring in the bound are necessarily Bayesian posteriors or priors. Actually, I'm nuking the whole paragraph based upon a "means vs. ends" argument. It's trying to explain what we will do rather than the importance of what we do. I could imagine a revised version sitting in the discussion section.

I like the older first 'setting' paragraph a bit more because it does things more incrementally. For example, it defines X before defining H. This is more natural. Also, Q should not be defined before the averaging classifier because Q is not a part of the setting - it is just a part of the analysis.

Ornamenting with \cal seems good. It makes things stand out a bit more.

suffix -> subscript

It isn't good to have an example which doesn't aid in understanding the material. Altered to aid in understanding the material.

I'm not sure what "Large margin" is supposed to mean in the title.

I'm restructuring into two subsections 'setting' and 'derived quantities for analysis' because it makes things vaguely clearer.

Nimrod, I'm swapping \theta(x,y) -> t(x,y) and t -> \theta so we can preserve similarity with the original margin paper.

PAC is used without definition. I'm defining it.

There is some style difference between the way Nimrod writes and the way Matthias writes. Nimrod tends to use a more broken up word sequence while Matthias just states things. I think I like Nimrod's style more - it makes reading a bit less monotonous. But there is room for argument.
I'm going with Nimrod's definitions at the moment. Matthias did a mapping a_i -> q_i which I'm not sure I like because it departs from the earlier margin paper and I don't see a justification. Switching back, but feel free to argue.

Matthias: you need to include the _motivation_ of a section. As an example, the beginning of your earlier results is: "In this subsection we restrict ourselves, unless otherwise stated, to a finite hypothesis space ${\cal H} = \{h_1,\dots,h_k\}$. The margin bound of \cite{bartlett:98b}, already mentioned in section \ref{sec-intro}, is stated here for reference." The older version was: "The improved margin bound arises from an improvement in a critical step in the proof of the original margin bound, which we state here for reference. We denote by $\pr{D}$ the probability measure of the distribution $D$ defined above. For any set $S$ of examples, we denote by $\pr{S}$ the uniform probability distribution over the set $S$." Especially for ICML, every section must have motivation.

It's generally best to avoid negative-laden sentences like: "we employ a result which is neither a margin bound nor deals with averaging classifiers..." Mutating.

Nimrod, I know you like \Delta to denote the KL-divergence. I'm sticking with D( || ) for the moment because it is more standard. I expect you changed it to avoid confusion between D the distribution and D the pseudo-metric, which is a worthy goal. Maybe D the distribution should be shifted.

Shifting from McAllester -> PAC-Bayes because it is a more explanatory name.

This: "The important novelty in McAllester's technique is that prior information about the classification task can be incorporated in a Bayesian style, by specifying a {\em prior distribution} $P$ over ${\cal H}$." is not correct. The Occam's razor bound from 1985 can incorporate a prior in a meaningful way.

I'm shifting the statement of the PAC-Bayes theorem away from the original and towards the language of our paper.
In particular, we don't want to futz with arbitrary loss functions. Let's just specialize to what is required for our proof.

Nimrod - Matthias's statement of the Occam's razor theorem seems a bit clearer. Take a look.

Matthias and Nimrod - the Occam's razor bound isn't really limited to a finite hypothesis space. It just isn't very interesting for most of the hypotheses in an infinite hypothesis space.

I like avoiding the '-' in the definition of entropy because it's easier to understand an equation where every term is positive.

Matthias - the reference to N was just obsolete. It should have been log(m).

Nimrod - I'm going with Matthias's statement of the main theorem because it seems a bit clearer to me.

The note in the old draft about how to apply this to various machine learning algorithms is important. Reincluding.

I don't like: "We note that there is a non-asymptotic version of theorem \ref{th-main} whose less accessible form, however, might obscure its practical meaning." because it assumes too much about the mind-state of the reader and seems sort of patronizing.

I think it's important to state explicitly how Hoeffding's bound applies to the 'g'.

I'm changing Q_N -> Q^N because it really is a cross product distribution.

Is it:
\sum_N \delta/(N(N+1)) = \delta
or
\sum_N \delta/(N(N+1)) \leq \delta ?

Matthias - in general I like the changes you made to the proof.

Referring to the original proof for getting the asymptotic bound is weak. We want the paper to be self-contained. Adding.

Matthias - I'm removing your sections 4, 5, and 6 because no effort was made to work with the old sections 5 and 6 which cover substantially the same material in a manner better for ICML. If you think something is missing, then I encourage you to add it in rather than attempting a wholesale replacement. You also cut the old section 4. I'm going to leave it out for the moment because I like the idea of placing it into a supporting tech report.
I think including an example of the bound on a problem is more important.

Nimrod - I'm reworking the Maximum entropy section a bit. See if that clears things up for you.

With section 4 gone, we should seriously contemplate pushing the proof into the appendix.

Added example. I think I will actually program it up - it might give us some pretty results which might help us get into ICML.

Added a little discussion about the algorithm directly motivated by our algorithm.

Getting rid of the dependence on the long mybib.bib file. It's not good to not be self-contained, especially when attempting to work remotely.
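
On the earlier question of whether \sum_N \delta/(N(N+1)) equals \delta or is only bounded by it: a short worked check, assuming the sum runs over all integers N \geq 1 (the range is not stated in these notes, so that is an assumption). Each term splits as a difference of reciprocals and the partial sums telescope:

\[
\sum_{N=1}^{M} \frac{\delta}{N(N+1)}
  = \delta \sum_{N=1}^{M} \left( \frac{1}{N} - \frac{1}{N+1} \right)
  = \delta \left( 1 - \frac{1}{M+1} \right)
  \;\longrightarrow\; \delta
  \quad \text{as } M \to \infty .
\]

So under that assumption it is an equality over the full range, and a strict inequality for any finite range of N. Writing \leq \delta is correct in both cases, which may be the safer form for the union bound step.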