#This file was created by Fri May 17 07:38:10 2002 #LyX 1.0 (C) 1995-1999 Matthias Ettrich and the LyX Team \lyxformat 2.15 \textclass amsbook \begin_preamble \usepackage{hyperref} \end_preamble \language default \inputencoding default \fontscheme default \graphics default \paperfontsize default \spacing single \papersize Default \paperpackage a4 \use_geometry 0 \use_amsmath 0 \paperorientation portrait \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 1 \papersides 2 \paperpagestyle default \layout Title Quantitatively Tight Sample Complexity Bounds \layout Author John Langford \layout Standard I present many new results on sample complexity bounds (bounds on the future error rate of arbitrary learning algorithms). Of theoretical interest are qualitative and quantitative improvements in sample complexity bounds as well as some techniques and criteria for judging the tightness of sample complexity bounds. \layout Standard On the practical side, I show quantitative results (with true error rate bounds sometimes less than \begin_inset Formula \( 0.01 \) \end_inset ) for decision trees and neural networks with these sample complexity bounds applied to real world problems. I also present a technique for using both sample complexity bounds and (more traditional) holdout techniques. \layout Standard Together, the theoretical and practical results of this thesis provide a well-founded practical method for evaluating learning algorithm performance based upon both training and testing set performance. \layout Standard Code for calculating these bounds is provided. \layout Standard \begin_inset LatexCommand \tableofcontents{} \end_inset \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 297 235 file graph.eps width 4 100 flags 9 \end_inset \layout Part Introductory Learning Theory \layout Standard The introduction is broken into 4 parts. The first part is informal introduction to the learning model used here. The goal of the first part is to make explicit the design choices made in selecting our model. This is particularly important in machine learning because no model seems perfect. \layout Itemize The second chapter formally introduces the model, the goals of the analysis, and provides context to related work. \layout Itemize The third chapter states the fundamental statistical results upon which all of our techniques rest. \layout Itemize The fourth chapter discusses basic learning theory results. \layout Chapter Informal Introduction \layout Standard What is a sample complexity bound? Informally, it is a bound on the number of examples required to learn a function. Therefore, in order to motivate the use of a sample complexity bound, we must first motivate the learning problem. \layout Section The learning problem \layout Standard What is learning? Learning is the process of discovering relationships between events. Knowledge of the relationships between events is of great importance in coping with the environment. Naturally, humans tend to be quite good at the process of learning---so good that we sometimes do not even realize when it is hard. \layout Standard The learning problem which we focus on is learning for computers. In particular, \begin_inset Quotes eld \end_inset How can a computer learn? \begin_inset Quotes erd \end_inset A few examples are illustrative: \layout Enumerate How can we make a computer with a microphone output a text version of what is being spoken? \layout Enumerate How can we make a computer with a camera recognize John? \layout Enumerate How can we make a computer controlling a robot arrive at some location? \layout Standard We will work on learning a function from some input space to some output space - the supervised learning model. \layout Section The problem with the learning problem \layout Standard The learning problem, as stated, is somewhat ill-posed. There are some very obvious ways for a computer to learn---for example by memorization. The difficulty arises when memorization is too expensive. Expense here typically has to do with acquiring enough experience so that future prediction problems have already been encountered. The real learning problem becomes, \begin_inset Quotes eld \end_inset Given incomplete information, how can a computer learn? \begin_inset Quotes erd \end_inset This formulation of the learning problem gives rise to a new problem - quantifying the amount of information required to learn. Sample complexity bounds address this second question: \begin_inset Quotes eld \end_inset When can a computer learn? \begin_inset Quotes erd \end_inset \layout Standard In some cases learning is essentially hopeless. There are two notions of \begin_inset Quotes eld \end_inset hopeless \begin_inset Quotes erd \end_inset : information theoretic and computational. Information theoretic difficulties arise when it is simply not possible to predict the output given the input. For example, predicting when a radioactive nucleus will decay is \emph on always \emph default difficult no matter what observations are made according to current physics \begin_inset LatexCommand \cite{QM} \end_inset . Even when simple relations between inputs and outputs exist, the computation required to discover the simple relation can be formidable. A fine example of this is provided by cryptography \begin_inset LatexCommand \cite{Oded} \end_inset where a common task is to work out functions for which it is not feasible to predict the input given the output. \layout Section A plethora of learning models \layout Standard There are several possible learning models which can be divided along several axes. Our first axis is the type of information given to the learning algorithm. There are several possibilities: \layout Enumerate Labeled Examples: Vectors of observations. \layout Enumerate Partial relations: partial relations between events such as might be provided by experts. This could include constraints or partial functions. \layout Enumerate Other forms of input \layout Standard We will assume that just examples (vectors of observations) are available as a lowest common denominator amongst learning problems. It is worth noting that this does not preclude the use of other forms of information which could be much more powerful than mere examples. \layout Standard Another important axis is the difficulty of the learning problem. Do we have an opponent trying to minimize learning? Is someone helping us learn? Or is the world oblivious? \layout Enumerate Teacher: The teacher model is a \begin_inset Quotes eld \end_inset best case \begin_inset Quotes erd \end_inset model. Here, we assume that someone is providing the best examples possible in order to learn a relationship. \layout Enumerate Oblivious: The oblivious model is an \begin_inset Quotes eld \end_inset in between \begin_inset Quotes erd \end_inset model where we assume that the world doesn't oppose or help us learn. Examples are picked in some neutral manner. \layout Enumerate Opponent: The opponent model is a \begin_inset Quotes eld \end_inset worst case \begin_inset Quotes erd \end_inset model. Here, we assume that world is choosing examples in way which minimize our chance of learning. \layout Standard Clearly, the strongest form of learning is learning in the opponent model, because if something is learnable in the opponent model, then it is learnable in the oblivious model. The same relationship also holds for oblivious and teacher models. We will work in an oblivious model where examples are chosen in a neutral manner. Why the oblivious model? Aside from the intractability of analysis in an opponent model, we expect that most learning problems actually are oblivious: we have neither an active teacher nor an active opponent. Thus an analysis in the oblivious model will be directly applicable to many learning problems. \layout Standard We have committed to an oblivious model with examples as our source of informati on. With these two questions decided all the remaining questions will essentially be decided in favor of simplicity. There are two more very important questions to decide. The first is: does our algorithm get to pick the examples or are the examples picked for us? \layout Enumerate Active learning: The learning algorithm chooses a partial example and the remainder is filled in by nature. \layout Enumerate Passive learning: The learning algorithm is simply given examples. \layout Standard Active learning (aka experimental science) is inherently more powerful than passive learning. As an example, consider the problem of predicting whether or not it will rain or snow on any given day. By observation, we can eventually discover the \begin_inset Quotes eld \end_inset right \begin_inset Quotes erd \end_inset threshold temperature, but this might take many days. If we instead can control the temperature and make observations, it should be possible to narrow in on the threshold temperature very quickly - with exponentially fewer experiments than days of observation. \layout Standard Despite the power of active learning, we will choose to work with an inactive learning model, because opportunities for passive learning are typically more common than opportunities for active learning since passive learning only requires observation while active learning requires experimentation. Analyzing the active learning setting in a generic manner also appears very difficult. \layout Standard Our plan is to focus on an oblivious model with examples chosen by the world. The remaining question is: Do we know which relation we want to learn? The two possibilities are: \layout Enumerate Supervised learning: We want to learn to model an output in terms of an input. \layout Enumerate Unsupervised learning: We want to learn to model an arbitrary subset of observations in terms of other observations. \layout Standard We will focus on supervised learning and our exact setting will be defined next. \layout Standard The question we want to answer is, \begin_inset Quotes eld \end_inset When is supervised learning in an oblivious model with examples chosen by the world feasible? \begin_inset Quotes erd \end_inset . \layout Section The oblivious passive supervised learning model \layout Standard Oblivious will be modeled by an unknown distribution \begin_inset Formula \( D \) \end_inset over examples. Here, an \begin_inset Quotes eld \end_inset example \begin_inset Quotes erd \end_inset is just a vector of observations. Since this is a supervised learning model, all of our examples will split into two parts, \begin_inset Formula \( (x,y) \) \end_inset where \begin_inset Formula \( x \) \end_inset is the \begin_inset Quotes eld \end_inset input \begin_inset Quotes erd \end_inset and \begin_inset Formula \( y \) \end_inset is the \begin_inset Quotes eld \end_inset output \begin_inset Quotes erd \end_inset (the thing we wish to predict). A quick example is predicting whether precipitation will be in the form of rain or snow ( \begin_inset Quotes eld \end_inset \begin_inset Formula \( y \) \end_inset \begin_inset Quotes erd \end_inset value) given the temperature ( \begin_inset Quotes eld \end_inset \begin_inset Formula \( x \) \end_inset \begin_inset Quotes erd \end_inset value). \layout Standard For simplicity, we will typically work with theorems for binary valued \begin_inset Formula \( y \) \end_inset . We can remove this choice by generalizing sample complexity bounds---but we do not do so for simplicity of presentation. \layout Standard The fundamental assumption we will make in all of our sample complexity bounds is that all examples are drawn independently from the unknown distributi on \begin_inset Formula \( D \) \end_inset . This assumption must be stated explicitly and always kept in mind when considering the relevance of sample complexity bounds. \layout Axiom All examples are drawn independently from an unknown distribution \begin_inset Formula \( D \) \end_inset . \layout Standard With the exception of this assumption, all of the other parameters in our bounds will be verifiable at the time the bound is applied. \layout Standard Note that we use a distribution over \emph on labeled \emph default examples and not a combination of a distribution over the input space along with a function from the input space to the output space as in many other formulations. This choice is made because it is both more general and mathematically simpler. \layout Standard The number of samples, \begin_inset Formula \( m \) \end_inset , required for learning is the fundamental quantity we will be concerned with. In particular, we will not be concerned with the time complexity or the space complexity of learning algorithms. This choice is made for the purposes of simplicity and implies that the relationship between sample complexity bounds and learning algorithms will be similar to the difference between information theory and coding information for transmission across a noisy channel. \layout Standard Any learning algorithm must output some hypothesis, \begin_inset Formula \( h \) \end_inset , for predicting the output given the input. This hypothesis is essentially a program which, given the input, predicts the output. The hypothesis may or may not be randomized---it might choose an output deterministically or according to some randomization. \layout Standard The next item to quantify is learning. When has learning occurred? We will say that learning has occurred when the \emph on true error \emph default is significantly less than a uniform random prediction. The true error \begin_inset Formula \( e_{D}(h) \) \end_inset is defined in the following way: \begin_inset Formula \[ e_{D}(h)=\Pr _{D}(h(x)\neq y)\] \end_inset \layout Standard Unfortunately, the true error is not an observable quantity in our model because the distribution, \begin_inset Formula \( D \) \end_inset , is unknown. However, there is a related quantity which is observable. Given a sample set \begin_inset Formula \( S \) \end_inset of \begin_inset Formula \( (x,y) \) \end_inset pairs \begin_inset Formula \( \{(x_{1},y_{1}),...,(x_{m},y_{m})\} \) \end_inset , the \emph on empirical error \emph default , \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset is defined similarly as: \begin_inset Formula \[ \hat{e}_{S}(h)=\Pr _{S}(h(x)\neq y)=\frac{1}{m}\sum _{i=1}^{m}I(h(x_{i})\neq y_{i})\] \end_inset where \begin_inset Formula \( I() \) \end_inset is a function which maps \begin_inset Quotes eld \end_inset true \begin_inset Quotes erd \end_inset to \begin_inset Formula \( 1 \) \end_inset and \begin_inset Quotes eld \end_inset false \begin_inset Quotes erd \end_inset to \begin_inset Formula \( 0 \) \end_inset . Here \begin_inset Formula \( \Pr _{S}(...) \) \end_inset is a probability taken with respect to the uniform distribution over the set of examples, \begin_inset Formula \( S \) \end_inset . \layout Section Questions we can answer \layout Standard Our real goal in learning theory is to answer the question \begin_inset Quotes eld \end_inset When can we learn? \begin_inset Quotes erd \end_inset Unfortunately, there is no good answer to this question given only the assumption of independence. In particular, it may be impossible to learn. The simplest example of such a learning problem is the case of a distribution \begin_inset Formula \( D \) \end_inset which always flips a coin in deciding the value of the output \begin_inset Formula \( y \) \end_inset . The bias of the coin is the same irrespective of the input \begin_inset Formula \( x \) \end_inset . Since the value of the input is explicitly independent of the output we surely can not hope to learn a useful relation between the input and output. \layout Standard Considerable work has been done elsewhere to answer \begin_inset Quotes eld \end_inset When can we learn? \begin_inset Quotes erd \end_inset Typically this is done in models with stronger assumptions---for example, under the additional assumption that the output is related to the input by an \begin_inset Quotes eld \end_inset OR \begin_inset Quotes erd \end_inset of a subset of the input bits. \layout Standard We will instead focus on a different question: \begin_inset Quotes eld \end_inset Have we learned? \begin_inset Quotes erd \end_inset This question \emph on is \emph default answerable in a probabilistic manner. In particular we can make a statement such as \begin_inset Quotes eld \end_inset With high probability over samples drawn from \begin_inset Formula \( D \) \end_inset we have learned if the empirical error is less than some value. \begin_inset Quotes erd \end_inset In practice, we will want to know how \emph on much \emph default we have learned which we can do by providing a high confidence bound on the true error rate of the learned hypothesis. \layout Chapter Formal Model and Context \layout Section Formal Model \layout Standard Let \begin_inset Formula \( X \) \end_inset be the space of the input to a predictor and \begin_inset Formula \( Y \) \end_inset be the space of the output. A labeled example \begin_inset Formula \( (x,y) \) \end_inset consists of an input, \begin_inset Formula \( x \) \end_inset and the desired output, \begin_inset Formula \( y \) \end_inset . Our formal model starts with the assumption that all labeled examples are drawn independently from a distribution \begin_inset Formula \( D \) \end_inset over the space \begin_inset Formula \( X\times Y \) \end_inset . This is strictly more general than the 'target concept' model which assumes that there exists some function \begin_inset Formula \( f:X\rightarrow Y \) \end_inset used to generate the label \begin_inset LatexCommand \cite{Valiant} \end_inset . In particular we can model probabilistic learning problems which do not have a particular \begin_inset Formula \( Y \) \end_inset value for each \begin_inset Formula \( X \) \end_inset value. This generalization is essentially \begin_inset Quotes eld \end_inset free \begin_inset Quotes erd \end_inset in the sense that it does not add to the complexity of presenting the results. \layout Standard The set of \begin_inset Formula \( m \) \end_inset independently drawn samples presented to a learning algorithm will be denoted as \begin_inset Formula \( S \) \end_inset . The learning algorithm will output a hypothesis \begin_inset Formula \( h:X\rightarrow Y \) \end_inset which has some unobservable true error rate \begin_inset Formula \( e_{D}(h) \) \end_inset and an observable empirical error rate, \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset . \layout Definition (True error) The true error \begin_inset Formula \( e_{D}(h) \) \end_inset of a hypothesis \begin_inset Formula \( h \) \end_inset is defined in the following way: \begin_inset Formula \[ e_{D}(h)=\Pr _{D}(h(x)\neq y)\] \end_inset \layout Standard Unfortunately, the true error is not an observable quantity in our model because the distribution, \begin_inset Formula \( D \) \end_inset , is unknown. However, there is a related quantity which is observable. \layout Definition (Empirical Error) Given a sample set \begin_inset Formula \( S \) \end_inset , the \emph on empirical error \emph default , \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset is defined as: \begin_inset Formula \[ \hat{e}_{S}(h)=\Pr _{S}(h(x)\neq y)=\frac{1}{m}\sum _{i=1}^{m}I(h(x_{i})\neq y_{i})\] \end_inset where \begin_inset Formula \( I() \) \end_inset is a function which maps \begin_inset Quotes eld \end_inset true \begin_inset Quotes erd \end_inset to \begin_inset Formula \( 1 \) \end_inset and \begin_inset Quotes eld \end_inset false \begin_inset Quotes erd \end_inset to \begin_inset Formula \( 0 \) \end_inset . Here \begin_inset Formula \( \Pr _{S}(...) \) \end_inset is a probability taken with respect to the uniform distribution over the set of examples, \begin_inset Formula \( S \) \end_inset . \layout Section Relationship to Prior Work \layout Subsection Distribution free learning \layout Standard The question answered here differs significantly from much prior learning theory, including the results of Vapnik \begin_inset LatexCommand \cite{Vapnik} \end_inset , Valiant \begin_inset LatexCommand \cite{Valiant} \end_inset , Devroye \begin_inset LatexCommand \cite{Devroye} \end_inset , and many others. See \begin_inset LatexCommand \cite{Haussler} \end_inset for a good summary. The principle difference is the question we address: \begin_inset Quotes eld \end_inset Have we learned? \begin_inset Quotes erd \end_inset \layout Standard Much of the prior work in learning theory addresses the question: \begin_inset Quotes eld \end_inset How many examples are needed in order to guarantee that I will choose (nearly) the best hypothesis from some fixed hypothesis set? \begin_inset Quotes erd \end_inset \layout Standard In particular, suppose that we have a hypothesis set \begin_inset Formula \( H \) \end_inset . If we guarantee that: \layout Standard \begin_inset Formula \[ \Pr _{S\sim D^{m}}(\exists h\in H:\, \, |e_{D}(h)-\hat{e}_{S}(h)|>\epsilon )<\delta \] \end_inset then if our learning algorithm choses the hypothesis with minimum empirical error (known as \begin_inset Quotes eld \end_inset empirical risk minimization \begin_inset Quotes erd \end_inset in the language of \begin_inset LatexCommand \cite{V2} \end_inset ), we will be guaranteed that the chosen hypothesis satisfies: \begin_inset Formula \[ e_{D}(h)-e^{*}_{D}\leq 2\epsilon \] \end_inset where \begin_inset Formula \( e^{*}_{D}=\inf _{h\in H}e_{D}(h) \) \end_inset . \layout Standard There are several difficulties inherent in this approach which we will avoid. \layout Enumerate Results in this model apply only for the empirical risk minimization (ERM) algorithm. The ERM algorithm is known to be NP-complete \begin_inset LatexCommand \cite{AR} \end_inset for some hypothesis spaces and, in general, is essentially dependent upon the axiom of choice. These results will approximately apply to approximate ERM algorithms, but it is unclear how \begin_inset Quotes eld \end_inset approximate \begin_inset Quotes erd \end_inset typical learning algorithms are. By answering \begin_inset Quotes eld \end_inset Have we learned? \begin_inset Quotes erd \end_inset this complexity is avoided. \layout Enumerate There is no natural notion of preference (or \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset ) amongst the hypotheses, \begin_inset Formula \( h \) \end_inset , in the hypothesis space, \begin_inset Formula \( H \) \end_inset . This is very important for practical application as is shown in Figure \begin_inset LatexCommand \ref{fig-simple-holdout} \end_inset . \layout Enumerate Answers to this question are generally insensitive to the final result, \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset . This is again important in practice (shown here \begin_inset LatexCommand \ref{fig-bounds} \end_inset ) because the variance of the distribution of the empirical error, \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset , changes significantly with different true error rates, \begin_inset Formula \( e_{D}(h) \) \end_inset . \layout Standard These drawbacks can be alleviated (but not removed) to some extent. For example, many people apply results in this model to arbitrary learning algorithms by simply noticing that the deviation of the empirical and true error rates is small. This, in turn, implies that whatever hypothesis your algorithm learns (empirica l minimum or not), it's true error is within \begin_inset Formula \( \epsilon \) \end_inset of the empirical error. This is one approach to addressing the question \begin_inset Quotes eld \end_inset have we learned? \begin_inset Quotes erd \end_inset - others will be presented here. \layout Standard The second drawback can be alleviated using \begin_inset Quotes eld \end_inset structural risk minimization \begin_inset Quotes erd \end_inset (as in \begin_inset LatexCommand \cite{V2} \end_inset ). Structural risk minimization removes most of drawback (2), although it is awkward for specifying very fine-grained preferences. We will use an arbitrary measure \begin_inset Formula \( P \) \end_inset over the hypothesis space \begin_inset Formula \( h \) \end_inset . This prior \begin_inset Formula \( P \) \end_inset need not (necessarily) be a Bayesian prior - all that is formally required is that this \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset be specified without using information from the examples. The notion of measure \begin_inset Formula \( P \) \end_inset is more general than structural risk minimization because for \emph on every \emph default \begin_inset Quotes eld \end_inset structure \begin_inset Quotes erd \end_inset \begin_inset Formula \( T \) \end_inset on which structural risk minimization is done, we can produce a \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset \begin_inset Formula \( P_{T} \) \end_inset dependent upon the structure \begin_inset Formula \( T \) \end_inset and derive the same (or tighter) results. \layout Standard The third drawback can be alleviated using \begin_inset Quotes eld \end_inset relative risk \begin_inset Quotes erd \end_inset as in \begin_inset LatexCommand \cite{V2} \end_inset , but this still leaves some slack in the bounds. \layout Standard The work in this thesis can be thought of as directly addressing the altered question which alleviates problem 1. Since our goal is addressing this altered question rather than deriving the answer from other results, we will be able to state and prove tighter results. Furthermore, because we address the question people encounter in practice, the bounds presented here will be more directly applicable. In particular, they will apply to arbitrary learning algorithms (although not necessarily \emph on tightly \emph default to arbitrary learning algorithms) rather than just the empirical risk minimizati on algorithm. \layout Subsection Bayesian Analysis \layout Standard The basic result of Bayesian analysis is that Bayes rule: \begin_inset Formula \[ \Pr (f|S)=\frac{\Pr (S|f)\Pr (f)}{\Pr (S)}\] \end_inset is the \emph on optimal \emph default learning algorithm when the learning problem is drawn from the distribution \begin_inset Formula \( \Pr (f) \) \end_inset . Note that the \begin_inset Quotes eld \end_inset hypothesis \begin_inset Quotes erd \end_inset learned by the Bayesian learning algorithm is the weighted average predictor, \begin_inset Formula \[ h(x)=\textrm{sign}(\int \Pr (f|S)f(x)df)\] \end_inset This rather strong statement is difficult to utilize in practice because specifying and using arbitrary distributions \begin_inset Formula \( \Pr (f) \) \end_inset is unwieldy. In practice, many people use approximations and there is considerable question about whether or not the specified \begin_inset Formula \( \Pr (f) \) \end_inset is \begin_inset Quotes eld \end_inset right enough \begin_inset Quotes erd \end_inset after approximations. A chapter is later devoted to analyzing hypotheses of this weighted-average form and a theorem \begin_inset LatexCommand \ref{th-averaging} \end_inset about their accuracy is proved. \layout Standard Some work has been done to analyze the robustness of Bayesian algorithms under approximation errors. There are two common traits of Bayesian-related analysis: \layout Enumerate All statements are parameterized by a prior, \begin_inset Formula \( P \) \end_inset . \layout Enumerate The analysis is typically an \begin_inset Quotes eld \end_inset average case \begin_inset Quotes erd \end_inset (w.r.t. the prior \begin_inset Formula \( P \) \end_inset and a fixed Bayesian or approximate Bayesian algorithm, \begin_inset Formula \( A \) \end_inset ). \layout Standard An example of this sort of analysis can be found in \begin_inset LatexCommand \cite{HKS} \end_inset . The work in this thesis adopts parameterization by a measure \begin_inset Formula \( P \) \end_inset , but is \emph on not \emph default an average case analysis. Our analysis is \begin_inset Quotes eld \end_inset worst-case \begin_inset Quotes erd \end_inset in the sense that it applies to \emph on all \emph default learning problems whether or not the measure \begin_inset Formula \( P \) \end_inset is a \begin_inset Quotes eld \end_inset correct \begin_inset Quotes erd \end_inset prior or not. This approach is strongly similar to the work of McAllester \begin_inset LatexCommand \cite{PB} \end_inset , and a later chapter is devoted to a refinement of this result. Despite this, it is interesting to note that our bounds are minimized (in some sense) when the measure \begin_inset Formula \( P \) \end_inset is in fact a \begin_inset Quotes eld \end_inset correct \begin_inset Quotes erd \end_inset Bayesian prior. \layout Standard There have been other attempts to connect Bayesian average case settings with worst case settings. One interesting example is \begin_inset LatexCommand \cite{Ypredicting} \end_inset which discusses the connection between Bayesian setting and the mistake bound model. This is especially interesting because the mistake bound model is, in some sense, more \begin_inset Quotes eld \end_inset worst-case \begin_inset Quotes erd \end_inset than the model we consider here as no assumption of example independence is made. Further work \begin_inset LatexCommand \cite{Grun} \end_inset has occurred in the \begin_inset Quotes eld \end_inset Minimum Description Length \begin_inset Quotes erd \end_inset (MDL) community. \layout Section Overview of the document \layout Standard This document is primarily about the theory of sample complexity for answering the question \begin_inset Quotes eld \end_inset Have we learned? \begin_inset Quotes erd \end_inset . However, we do not neglect the experimental side. In particular, following the theory we will present results for application of sample complexity bounds to machine learning problems. These results are the 'best known results' in terms of bound tightness and should be considered as a guide and challenge to others working on sample complexity bounds. \layout Standard All of the sample complexity bounds presented here will fall within the paradigm of classical (non-Bayesian) statistics. Despite this, Bayesians may be interested in the results. In particular, it is worth noting that we will consider the use of a 'prior' and a 'posterior' (in a \emph on classical \emph default manner) within these bounds. \layout Standard In order to make this thesis more coherent, previous work (of which there is quite a bit) will be integrated into the presentation rather than separated into a section of its own. Credit will be given at the time the work is introduced. \layout Standard The document is organized into 3 parts: \layout Enumerate Introductory material. \layout Enumerate New results on Sample Complexity. \layout Enumerate Experimental results of applying Sample Complexity. \layout Standard What follows is a brief chapter-by-chapter summary of the theoretical results in this thesis. Much of the work can be summarized as approaches which use extra information (in the algorithm, in error rates of hypotheses which are \emph on not \emph default chosen, in test sets, etc...) in order to construct tighter bounds on the future error rate of a hypothesis. \layout Standard The principal \emph on practical \emph default result of this thesis is the construction of a program 'bound' which automatica lly uses any of several theorems in calculating a true error bound on the future error rate of a particular hypothesis. The use of this program will be demonstrated as the bounds are presented. \layout Subsection Microchoice Bounds \layout Standard The Microchoice technique allows for online construction of a \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset which can be used in the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset ). Empirical testing (presented in chapter 11) shows that this approach is practical and yields useful results on decision trees for real-world problems drawn from the UCI database of problems. The Adaptive Microchoice bounds further extend this approach and can result in functional improvements over the Occam's Razor bound. \layout Subsection Pac-Bayes Bounds \layout Standard Pac-Bayes bounds are a new approach (first presented by David McAllester \begin_inset LatexCommand \cite{PB} \end_inset ) for for dealing with continuously parameterized classifiers such as (stochasti c) neural networks. This chapter refines and improves the PAC-Bayes bound, giving it an information theoretic interpretation. Empirical results (presented in chapter 12) show that refined PAC-Bayes bound works to produce \emph on nonvacuous \emph default bounds on realistic learning problems. Nonvacuous bounds for continuous-valued classifiers are currently rare. \layout Subsection Averaging Bounds \layout Standard Averaging bounds deal with classifiers formed by picking according to the weighted majority on some other set of classifiers. Averaging is a very common technique in practice (see \begin_inset LatexCommand \cite{SFBL} \end_inset for examples), so results specialized for averaging classifiers are useful. This chapters states and proves a bound on averaging classifiers which shows that \begin_inset Quotes eld \end_inset hypothesis space complexity \begin_inset Quotes erd \end_inset \emph on decreases \emph default as averaging becomes more uniform. Prior theoretical work \begin_inset LatexCommand \cite{SFBL} \end_inset was of the form \begin_inset Quotes eld \end_inset averaging does not increase the hypothesis space complexity much \begin_inset Quotes erd \end_inset . \layout Subsection Shell Bounds \layout Standard Shell bounds are a new approach which trades extra information for tighter bounds. Empirical results in (presented in chapter 11) show this can be a useful approach. In order to ameliorate the increased information requirements, a sampled version of the bound is stated which allows for smooth interpolation between simpler bounds and the full shell bound. In addition, the shell bound has been extended to continuous spaces with an approach similar to PAC-Bayes bounds. \layout Subsection Bracketing Covering Number Bounds \layout Standard The results of this chapter are entirely theoretical. They point out an approach which may lead to useful technique for bounds on continuous valued hypothesis spaces. Using a stronger notion of cover (the bracketing cover), simplified bounds on the future error rate of continuous classifiers can be constructed. This approach can be significantly tighter than the standard covering number approach. More work in calculation of bracketing covers is needed before these results can be applied. \layout Subsection Progressive Validation \layout Standard Progressive Validation is a variant of the holdout bound ( \begin_inset LatexCommand \ref{th-hb} \end_inset ) which is roughly twice as efficient about how it uses examples. Progressive Validation is not used in the experimental results chapter(s) because more work is required in order to prove the bound without a Hoeffding-l ike approximation. \layout Subsection Combining training and test sets \layout Standard The idea behind this chapter is that it should be possible to combine \emph on any \emph default test set bound with \emph on any \emph default training set based bound in order to derive a bound with more robust behavior. A general technique is stated and proved to work. Empirical results (in chapter 11) show that this approach can work well in practice. \layout Chapter Basic Observations \layout Standard The principle observable quantity is the empirical error rate ( \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset ) of a hypothesis. What is the distribution of the empirical error rate for a fixed hypothesis? For each example, we know that the probability that the hypothesis will err is given by true error rate, \begin_inset Formula \( e_{D}(h) \) \end_inset . This can be modeled by a biased coin flip: heads if you are wrong and tails if you are right. \layout Standard Let us call the bias of the coin \begin_inset Formula \( p=e_{D}(h) \) \end_inset . What then is the probability of observing \begin_inset Formula \( k \) \end_inset heads out of \begin_inset Formula \( m \) \end_inset coin flips? This is a very familiar distribution in statistics called the Binomial distribution. Let \begin_inset Formula \( \hat{p} \) \end_inset be the observed rate of heads. \layout Definition (Binomial Distribution) The Binomial distribution is given by: \begin_inset Formula \[ \Pr _{\{0,1\}^{m}}\left( \hat{p}=\frac{k}{m}|p\right) =\binom{m}{k}p^{k}(1-p)^{m-k}\] \end_inset Here we use 'choose' notation defined by \begin_inset Formula \( \binom{m}{k}=\frac{m!}{(m-k)!k!} \) \end_inset . \layout Section The Basic Building Block \layout Standard Our real interest will be captured by Binomial tails because we wish to bound the probability of observing a misleadingly small event. The probability of a Binomial tail is just the cumulative distribution function: \layout Definition (Binomial Tail) \begin_inset Formula \[ \textrm{Bin}(m,k,p)\equiv \Pr _{\{0,1\}^{m}}\left( \hat{p}\leq \frac{k}{m}|p\right) =\sum _{j=1}^{k}\binom{m}{j}p^{j}(1-p)^{m-j}\] \end_inset = the probability that \begin_inset Formula \( m \) \end_inset coins with bias \begin_inset Formula \( p \) \end_inset produce \begin_inset Formula \( k \) \end_inset or fewer heads. \layout Standard For the learning problem, we will always choose \begin_inset Formula \( p=e_{D}(h) \) \end_inset and \begin_inset Formula \( \hat{p}=\hat{e}_{S}(h) \) \end_inset . With these definitions, we can interpret the Binomial tail as the probability of an empirical error less than or equal to \begin_inset Formula \( \frac{k}{m} \) \end_inset . \layout Section Approximation techniques \layout Standard \begin_inset LatexCommand \label{sec-approximation} \end_inset \layout Standard Exact calculation of \begin_inset Formula \( \textrm{Bin}(m,k,p) \) \end_inset (covered in the next subsection) can require computation at least proportional to \begin_inset Formula \( m \) \end_inset , which is often too expensive. For the bounds in this thesis, we will only need to calculate an upper bound on the quantity \begin_inset Formula \( \textrm{Bin}(m,k,p) \) \end_inset . There are several inequalities which are often used. The first of these is the Hoeffding inequality \begin_inset LatexCommand \cite{Hoeffding} \end_inset . Assume that \begin_inset Formula \( \frac{k}{m}

0 \) \end_inset , \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, e(h)\geq \hat{e}(h)+\sqrt{\frac{\ln |H|+\ln \frac{1}{\delta }}{2m}}\right) \leq \delta \] \end_inset \layout Proof Loosen corollary ( \begin_inset LatexCommand \ref{th-REDHSCP} \end_inset ) with the bound \begin_inset Formula \( \textrm{KL}(q||p)\geq 2(q-p)^{2} \) \end_inset . \layout Standard The agnostic form of this bound has the advantage that it explicitly shows that we are forcing the convergence (in high probability) of the empirical error rate to the true error rate. Graphically, we are forcing every empirical error to be near to its true error. This is a picture which represents the connection \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 89 12 file convergence.eps width 4 30 flags 9 \end_inset \layout Standard Later bounds will have more complicated convergence conditions with more complicated graphs. These more complicated bounds are necessary in order to avoid the limitations of the holdout bound. See figure \begin_inset LatexCommand \ref{fig-simple-holdout} \end_inset for a comparison with the holdout set. \layout Section Lower Bounds \layout Standard It is important to study lower bounds in order to discover how much room for improvement exists in our upper bounds. \layout Standard There are two forms of lower bounds: a lower bounds on the true error rate and lower \emph on upper \emph default bounds which lower bound the value our upper bound could return (given available information). For lower bounds essentially the upper bound theorem applies with the bound reversed. This form of lower bound allow us to construct a two-sided confidence interval on the location of the true error rate. Typically, such lower bounds can be formed by simply changing the sign in upper bound arguments. \layout Theorem \begin_inset LatexCommand \label{th-dhlb} \end_inset (Discrete Hypothesis Lower Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\leq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \min _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset \layout Proof For every individual hypothesis, we know that: \begin_inset Formula \[ \Pr _{D^{m}}(e(h)\leq \bar{e}(m,\hat{e}(h),\delta ))\leq \delta \] \end_inset Applying the union bound for every hypothesis gives us the result. \layout Section Lower Upper Bounds \layout Standard \begin_inset LatexCommand \label{sec-lower_upper} \end_inset \layout Standard The second form of lower bound is a lower bound on the upper bound (similar to the results of \begin_inset LatexCommand \cite{AHKV} \end_inset ). If we can lower bound the upper bound (as a function of its observables), then we can be confident that the upper bound is no looser than the gap between the lower upper bound and the upper bound. \layout Standard How tight is the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset )? The answer is \emph on sometimes \emph default tight. In particular, we can exhibit a set of learning problem where the discrete hypotheses bound can not be made significantly lower as a function of the observables, \begin_inset Formula \( m \) \end_inset , \begin_inset Formula \( \delta \) \end_inset , \begin_inset Formula \( |H| \) \end_inset , and \begin_inset Formula \( \hat{e}(h) \) \end_inset . Fix the value of these quantities and then we will construct a learning problem for which a lower upper bound can not be stated. \layout Standard Our learning problem will be defined over the input space \begin_inset Formula \( X=\{0,1\}^{|H|} \) \end_inset . The hypotheses will be \begin_inset Formula \( h_{i}(x)=x_{i} \) \end_inset where \begin_inset Formula \( x_{i} \) \end_inset is the \begin_inset Formula \( i \) \end_inset th component of the vector \begin_inset Formula \( x \) \end_inset . This construction allows us to vary the true error rate of each hypothesis independent of the other hypotheses. In fact, we can pick any true error rate for any hypothesis by simply adjusting the probability that \begin_inset Formula \( x_{i}=y \) \end_inset . Our learning problem can therefore generate problems according to the following algorithm: \layout Algorithm \begin_inset LatexCommand \label{alg-lp} \end_inset Draw_Sample(float \begin_inset Formula \( e \) \end_inset ) \layout Enumerate Pick \begin_inset Formula \( y\in \{0,1\} \) \end_inset uniformly \layout Enumerate For \begin_inset Formula \( i \) \end_inset , pick \begin_inset Formula \( x_{i}\neq y \) \end_inset with probability \begin_inset Formula \( e \) \end_inset . \layout Standard By construction, the true error rate of each hypothesis will be \begin_inset Formula \( e(h_{i}) \) \end_inset . Now, we can prove the following theorem: \layout Theorem \begin_inset LatexCommand \label{th-dhlub} \end_inset (Discrete Hypothesis lower upper bound) For all true error rates \begin_inset Formula \( e(h)>\frac{\ln |H|+\ln \frac{1}{\delta }}{m} \) \end_inset there exists a learning problem and algorithm such that: \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-e^{-\delta }\] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset \layout Standard Intuitively, this theorem implies that we can not improve significantly on the results of theorem \begin_inset LatexCommand \ref{th-DHSCP} \end_inset without using extra information about our learning problem. Some of our later results do exactly this - they use extra information. \layout Proof Using the family of learning problems implicitly defined by algorithm \begin_inset LatexCommand \ref{alg-lp} \end_inset , we know that \begin_inset Formula \[ \forall h,\forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq \frac{\delta }{|H|}\] \end_inset (negation) \begin_inset Formula \[ \Rightarrow \forall h,\forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( e(h)<\bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) <1-\frac{\delta }{|H|}\] \end_inset (independence) \begin_inset Formula \[ \Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \forall h\, \, e(h)<\bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) <\left( 1-\frac{\delta }{|H|}\right) ^{|H|}\] \end_inset (negation) \begin_inset Formula \[ \Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \exists h\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-\left( 1-\frac{\delta }{|H|}\right) ^{|H|}\] \end_inset (approximation) \begin_inset Formula \[ \Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \exists h\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-e^{-\delta }\] \end_inset \layout Standard For small \begin_inset Formula \( \delta \) \end_inset , we get that \begin_inset Formula \( 1-e^{-\delta }\simeq \delta \) \end_inset which implies that, no significantly better true error bound can be stated for all learning algorithms. In particular, the empirical risk minimization algorithm (which chooses the hypothesis with minimal empirical error) will have a significant probabilit y of a large deviation between the empirical and true error. \layout Section Structural Risk Minimization \layout Standard Structural Risk Minimization \begin_inset LatexCommand \cite{Vapnik} \end_inset (SRM) is a technique used in the learning theory community to avoid the difficulties associated with convergence on hypothesis sets that are too \begin_inset Quotes eld \end_inset large \begin_inset Quotes erd \end_inset . SRM works with a sequence of nested hypothesis sets, \begin_inset Formula \( H_{1}\subset H_{2}\subset ....\subset H_{l} \) \end_inset . For each hypothesis set, a discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ) on the difference between empirical and true error exists. For \begin_inset Quotes eld \end_inset small \begin_inset Quotes erd \end_inset hypothesis sets, this bound may be tight while for large hypothesis sets it may be inherently loose. However, we also expect that the best hypothesis in the hypothesis set improves as the hypothesis set becomes larger. This naturally induces a trade-off: there will be some hypothesis set \begin_inset Formula \( H_{i} \) \end_inset for which the true error bound is minimized. \layout Standard We can't simply apply the discrete hypothesis bound to the meta-algorithm which picks the algorithm (and associated hypothesis space) with the smallest true error bound since this meta-algorithm could, potentially, output any hypothesis in \begin_inset Formula \( H_{l} \) \end_inset . The simplest way to retrofit the bound to include all hypothesis sets is with a simple theorem which essentially states that we can guarantee \emph on nearly \emph default the same bound as would apply on the smallest hypothesis space \begin_inset Formula \( H_{i} \) \end_inset containing the output hypothesis, \begin_inset Formula \( h \) \end_inset . \layout Theorem \begin_inset LatexCommand \label{th-SRM} \end_inset (Structural Risk Minimization) Let \begin_inset Formula \( p(i) \) \end_inset be some measure across the \begin_inset Formula \( l \) \end_inset hypothesis sets with \begin_inset Formula \( \sum _{i=1}^{l}p(i)=1 \) \end_inset . Then: \begin_inset Formula \[ \forall p(i):\, \, \, \Pr _{D^{m}}\left( \exists h\in H_{i}\in \{H_{1},...,H_{l}\}:\, \, e(h)\leq \bar{e}\left( m,\hat{e}(h),\frac{p(i)\delta }{|H_{i}|}\right) \right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \min _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset . \layout Proof Apply the union bound to the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ). \layout Standard The SRM bound is slightly inefficient in the sense that the bound for all hypotheses in \begin_inset Formula \( H_{2} \) \end_inset includes a bound for every hypothesis in \begin_inset Formula \( H_{1} \) \end_inset . This effect is typically small because the size of the hypothesis sets usually grows exponentially, implying that the extra confidence given to a hypothesis \begin_inset Formula \( h \) \end_inset in \begin_inset Formula \( H_{1} \) \end_inset by the bounds used on hypothesis set \begin_inset Formula \( H_{2},H_{3},... \) \end_inset is small relative to the confidence given by the bound for \begin_inset Formula \( H_{1} \) \end_inset . One can remove this slack in Structural Risk Minimization bound by \begin_inset Quotes eld \end_inset cutting out \begin_inset Quotes erd \end_inset the nested portion of each hypothesis set in the formulation of \begin_inset Formula \( H_{1},...,H_{l} \) \end_inset . We will call this Disjoint Structural Risk Minimization (also mentioned in \begin_inset LatexCommand \cite{MC_Journal} \end_inset ). \layout Section Incorporating a \begin_inset Quotes eld \end_inset Prior \begin_inset Quotes erd \end_inset \layout Standard In constructing the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ), it is notable that an arbitrary choice was made. We decided to give the same error allowance to every hypothesis. This is an arbitrary choice which, in practice, we will wish to make differentl y. The next theorem is essentially a restatement of the discrete hypothesis bound with this arbitrary choice made explicit. The same bound using the Hoeffding approximation has appeared elsewhere \begin_inset LatexCommand \cite{BEHW} \end_inset \begin_inset LatexCommand \cite{PB} \end_inset . \layout Theorem \begin_inset LatexCommand \label{th-ORB} \end_inset (Occam's Razor Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h) \) \end_inset over the hypothesis space, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \forall p(h)\, \, \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\delta p(h)\right) \right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset \layout Standard It is very important to notice that the \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h) \) \end_inset must be selected \emph on before \emph default seeing the examples. \layout Proof The proof again starts with the basic observation that: \begin_inset Formula \[ \Pr _{D^{m}}\left( e(h)\geq \bar{e}(m,\hat{e}(h),\delta )\right) \leq \delta \] \end_inset then, we apply the union bound in a \emph on nonuniform \emph default manner. In particular, we allocate confidence \begin_inset Formula \( \delta p(h) \) \end_inset to hypothesis \begin_inset Formula \( h \) \end_inset . Since \begin_inset Formula \( p \) \end_inset is normalized, we know that \begin_inset Formula \[ \sum _{h}\delta p(h)=\delta \] \end_inset which implies that the union bound completes the proof. \layout Standard Once again, we can relax the Occam's Razor bound (theorem \begin_inset LatexCommand \ref{th-ORB} \end_inset ) with the relative entropy Chernoff bound ( \begin_inset LatexCommand \ref{eq-recb} \end_inset ) to get a somewhat more tractable expression. \layout Corollary \begin_inset LatexCommand \label{th-reorb} \end_inset (Relative Entropy Occam's Razor Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h) \) \end_inset over the hypothesis space, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))\geq \frac{\ln \frac{1}{p(h)}+\ln \frac{1}{\delta }}{m}\right) \leq \delta \] \end_inset \layout Proof approximate the Binomial tail with ( \begin_inset LatexCommand \ref{eq-recb} \end_inset ) and solve for the minimum. \layout Standard The Occam's razor bound is often nonvacuous for discrete learning algorithms such as decision lists and decision trees. The next chapter will discuss a particular motivated choice of the measure \begin_inset Formula \( p(h) \) \end_inset which can lead to much tighter bounds in practice. \layout Standard The application of the Occam's Razor bound is somewhat more complicated then the application of the test set bound. Pictorially, the protocol for bound application is given in figure \begin_inset LatexCommand \ref{fig-training-set} \end_inset . \begin_float fig \layout Standard \align center \begin_inset Figure size 267 159 file thesis-presentation/training_set.eps width 4 90 flags 9 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-training-set} \end_inset In order to apply this bound it is necessary that the choice of \begin_inset Quotes eld \end_inset Prior \begin_inset Quotes erd \end_inset be made before seeing any training examples. Then, the bound is calculated based upon the chosen hypothesis. Note that it \emph on is \emph default \begin_inset Quotes eld \end_inset legal \begin_inset Quotes erd \end_inset to chose the hypothesis based upon the prior \begin_inset Formula \( p(h) \) \end_inset as well as the empirical error \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset . \end_float \layout Standard Examples of calculation of these bounds is detailed in appendix section \begin_inset LatexCommand \ref{sec-train-bound-calc} \end_inset . \layout Standard The \begin_inset Quotes eld \end_inset Occam's Razor bound \begin_inset Quotes erd \end_inset is strongly related to compression. In particular, for any self-terminating description language, \begin_inset Formula \( d(h) \) \end_inset , we can associate a \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h)=2^{-d(h)} \) \end_inset with the property that \begin_inset Formula \( \sum _{h}p(h)\leq 1 \) \end_inset . Consequently, short description length hypotheses will tend to have a tighter convergence and the penalty term, \begin_inset Formula \( \ln \frac{1}{p(h)} \) \end_inset is the number of \begin_inset Quotes eld \end_inset nats \begin_inset Quotes erd \end_inset (bits base e). \layout Part New Techniques \layout Chapter Microchoice Bounds (the algebra of choices) \layout Standard \begin_inset LatexCommand \label{sec-MC} \end_inset \layout Standard The work in this chapter is joint with Avrim Blum and was first presented at ICML \begin_inset LatexCommand \cite{MC} \end_inset and then in the Machine Learning Journal \begin_inset LatexCommand \cite{MC_journal} \end_inset . The presentation here generalizes, unifies, and improves the earlier work. Microchoice bounds can be thought of as unifying \begin_inset Quotes eld \end_inset Self-bounding Learning Algorithms \begin_inset Quotes erd \end_inset \begin_inset LatexCommand \cite{SB} \end_inset and the Occam's razor bound, \begin_inset LatexCommand \cite{BEHW} \end_inset . \layout Standard The bounds of the previous chapter do not use much structure on the hypothesis space. Yet learning algorithms induce a natural structure. In particular, many learning algorithms work by an iterative process in which they take a sequence of steps each from a small set of choices (small in comparison to the overall hypothesis set size). Local optimization algorithms such as hill-climbing or simulated annealing, for example, work in this manner. Each step in a local optimization algorithm can be viewed as making a choice from a small set of possible steps to take. If we take into account the number of choices made and the size (and other properties) of each choice set, can we produce a tighter bound? This chapter introduce microchoice bounds which use this type of information to construct high confidence bounds on the future error rate. \layout Standard The microchoice bounds can be thought of in several ways. The simple microchoice bound (given in section \begin_inset LatexCommand \ref{sec:mc} \end_inset ) can be thought of as a well motivated application of the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset ). The idea is to use the learning algorithm itself to define a description language for hypotheses, so that the description length of the hypothesis actually produced gives a bound on the estimation error. The adaptive microchoice theorem (in section \begin_inset LatexCommand \ref{sec:query} \end_inset ) can be thought of as a computationally tractable adaptation of Self-bounding Learning algorithms \begin_inset LatexCommand \cite{SB} \end_inset . Microchoice bounds tie together and show the relationship between these different sample complexity bounds. \layout Standard Microchoice bounds also provide insight into the nature of choices. In general, we know that choice is \begin_inset Quotes eld \end_inset bad \begin_inset Quotes erd \end_inset for the purposes of creating a uniform bound on the true error rate. The microchoice bounds give a quantitative understanding of how much choice is \begin_inset Quotes eld \end_inset bad \begin_inset Quotes erd \end_inset . In particular, the log of the choice space size is the natural measure of \begin_inset Quotes eld \end_inset badness \begin_inset Quotes erd \end_inset . This is directly related to the log of the hypothesis space size in the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ). There is also an indirect relationship with information theory where the log of the alphabet size is an important parameter for specifying the number of bits required to send a message. \layout Standard Viewed as an interactive proof of learning, microchoice bounds can be described pictorially as in figure \begin_inset LatexCommand \ref{fig-microchoice-protocol} \end_inset . \layout Standard \begin_float fig \layout Standard \align center \begin_inset Figure size 267 176 file thesis-presentation/microchoice.eps width 4 90 flags 9 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-microchoice-protocol} \end_inset Microchoice bounds collapse the two-round Occam's Razor style protocol into a one round protocol. This is (essentially) done by providing a compiler for the verifier which takes a learning algorithm as input, and extracts a choice of \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset . \end_float \layout Standard Important early work developing approximately self-bounding learning algorithms was also done by Domingos \begin_inset LatexCommand \cite{Domingos} \end_inset . \layout Section A Motivating Observation \layout Standard \begin_inset LatexCommand \label{sec:motivating} \end_inset \layout Standard Imagine, for the moment, that we know the (unknown) problem distribution, \begin_inset Formula \( D \) \end_inset . For a given learning algorithm \begin_inset Formula \( A \) \end_inset , a distribution \begin_inset Formula \( D \) \end_inset on labeled examples induces a distribution \begin_inset Formula \( q(h) \) \end_inset over the possible hypotheses \begin_inset Formula \( h\in H \) \end_inset produced by algorithm \begin_inset Formula \( A \) \end_inset after \begin_inset Formula \( m \) \end_inset examples \begin_float footnote \layout Standard \begin_inset Formula \( q(h) \) \end_inset is the probability over draws of examples and any internal randomization of the algorithm. \end_float . A natural choice for the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset ) is the measure \begin_inset Formula \( p(h)=q(h) \) \end_inset . Is this choice optimal? The answer is \begin_inset Quotes eld \end_inset yes \begin_inset Quotes erd \end_inset , given the right notion of optimal. In particular, if we start with the relative entropy Occam's razor bound ( \begin_inset LatexCommand \ref{th-reorb} \end_inset ), we can show that \begin_inset Formula \( p(h)=q(h) \) \end_inset minimizes the expected value of the bound on the Kullback-Leibler divergence between the empirical error and true error. \layout Theorem (KL divergence minimization) \begin_inset Formula \( p(h)=q(h) \) \end_inset minimizes the expected value of the KL divergence in the relative entropy Occam's Razor bound ( \begin_inset LatexCommand \ref{th-reorb} \end_inset ). \layout Proof We need to show that \begin_inset Formula \[ q(h)=\textrm{argmin}_{p(h)}\sum _{h}q(h)\frac{\ln \frac{1}{p(h)}+\ln \frac{1}{\delta }}{m}\] \end_inset removing terms which the minimum does not depend on, we get: \begin_inset Formula \[ \textrm{argmin}_{p(h)}\sum _{h}q(h)\ln \frac{1}{p(h)}\] \end_inset adding a constant, we get: \begin_inset Formula \[ \textrm{argmin}_{p(h)}\sum _{h}q(h)\ln \frac{q(h)}{p(h)}\] \end_inset This is equivalent to minimizing the Kullback-Leibler divergence between the distribution of \begin_inset Formula \( q(h) \) \end_inset and \begin_inset Formula \( p(h) \) \end_inset which is minimized for \begin_inset Formula \( q(h)=p(h) \) \end_inset . \layout Standard Using the KL divergence as our notion of loss is somewhat non-intuitive. However, it is mathematically simple and not irrational. For all true errors, the KL divergence will be upper and lower bounded between an \begin_inset Formula \( l_{1} \) \end_inset and an \begin_inset Formula \( l_{2} \) \end_inset metric. Since these are two of the most common metrics, the choice of KL divergence based metric should behave similarly well. \layout Standard The point of these observations is to notice that if the structure of the learning algorithm produces a choice \begin_inset Formula \( p(h) \) \end_inset that approximates \begin_inset Formula \( q(h) \) \end_inset , the result should be better estimation bounds. \layout Section The Simple Microchoice Bound \layout Standard \begin_inset LatexCommand \label{sec:mc} \end_inset \layout Standard The simple microchoice bound is essentially a compelling and easy way to select a measure \begin_inset Formula \( p(h) \) \end_inset for learning algorithms that operate by making a series of small choices. In particular, consider a learning algorithm that works by making a sequence of choices, \begin_inset Formula \( c_{1},...,c_{d} \) \end_inset , from a sequence of choice sets, \begin_inset Formula \( C_{1},...,C_{d} \) \end_inset , finally producing a hypothesis, \begin_inset Formula \( h\in H \) \end_inset . Specifically, the algorithm first looks at the choice set \begin_inset Formula \( C_{1} \) \end_inset and the data \begin_inset Formula \( z^{N} \) \end_inset to produce choice \begin_inset Formula \( c_{1}\in C_{1} \) \end_inset . The choice \begin_inset Formula \( c_{1} \) \end_inset then determines the next choice set \begin_inset Formula \( C_{2} \) \end_inset (different initial choices produce different choice sets for the second level). The algorithm again looks at the data to make some choice \begin_inset Formula \( c_{2}\in C_{2} \) \end_inset . This choice then determines the next choice set \begin_inset Formula \( C_{3} \) \end_inset , and so on. These choice sets can be thought of as nodes in a \emph on choice tree \emph default , where each node in the tree corresponds to some internal state of the learning algorithm, and a node containing some choice set \begin_inset Formula \( C \) \end_inset has branching factor \begin_inset Formula \( |C| \) \end_inset . Pictorially, we can draw the tree as follows: \layout Standard \begin_inset Figure size 270 112 file thesis-presentation/microchoice_tree.eps width 4 91 flags 9 \end_inset \layout Standard Depending on the learning algorithm, sub-trees of the overall tree may be identical. We address optimization of the bound for this case later. Eventually there is a final choice leading to a leaf, and a single hypothesis is output. \layout Standard For example, the decision list algorithm of Rivest \begin_inset LatexCommand \cite{Rivest} \end_inset , applied to a set of \begin_inset Formula \( n \) \end_inset features, uses the data to choose one of \begin_inset Formula \( 4n+2 \) \end_inset rules (e.g., \begin_inset Quotes eld \end_inset if \begin_inset Formula \( \bar{x}_{3} \) \end_inset then \begin_inset Formula \( - \) \end_inset \begin_inset Quotes erd \end_inset ) to put at the top. Based on the choice made, it moves to a choice set of \begin_inset Formula \( 4n-2 \) \end_inset possible rules to put at the next level, then a choice set of size \begin_inset Formula \( 4n-6 \) \end_inset , and so on, until eventually it chooses a rule such as \begin_inset Quotes eld \end_inset else \begin_inset Formula \( + \) \end_inset \begin_inset Quotes erd \end_inset leading to a leaf. \layout Standard The microchoice bound calculation program is as follows: \layout Algorithm Calculate_Microchoice \layout Enumerate set \begin_inset Formula \( p\leftarrow 1 \) \end_inset \layout Enumerate while learning algorithm has not halted. \begin_deeper \layout Enumerate \begin_inset Formula \( |C|\leftarrow \) \end_inset number of possible data-dependent choices \layout Enumerate \begin_inset Formula \( p\leftarrow \frac{p}{|C|} \) \end_inset \end_deeper \layout Enumerate return \begin_inset Formula \( p \) \end_inset \layout Standard Pictorially, this algorithm can be thought of as taking a \begin_inset Quotes eld \end_inset supply \begin_inset Quotes erd \end_inset of probability at the root of the choice tree. \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 267 139 file thesis-presentation/microchoice_alg.eps width 4 90 flags 9 \end_inset \layout Standard The root takes its supply and splits it equally among all its children. Recursively, each child then does the same: it takes the supply it is given and splits it evenly among its children, until all of the supplied probability is allocated among the leaves. If we examine some leaf containing a hypothesis \begin_inset Formula \( h \) \end_inset , we see that this method gives at least probability \begin_inset Formula \( p(h)=\prod _{i=1}^{d(h)}\frac{1}{|C_{i}(h)|} \) \end_inset to each \begin_inset Formula \( h \) \end_inset for any path of depth \begin_inset Formula \( d(h) \) \end_inset reaching the hypothesis \begin_inset Formula \( h \) \end_inset . \layout Standard Note it is possible that several leaves will contain the same hypothesis \begin_inset Formula \( h \) \end_inset , and in that case one should really add the allocated measures together. However, the microchoice bound neglects this issue, implying that it will be unnecessarily loose for learning algorithms which can arrive at the same hypothesis in multiple ways. The reason for neglecting this is that now, \begin_inset Formula \( p(h) \) \end_inset is something the learning algorithm itself can calculate by simply keeping track of the sizes of the choice sets it has encountered so far. It is important to notice that this construction is defined before observing any data. Consequently, every hypothesis has some bound associated with it before the data is used to pick a particular hypothesis and its corresponding bound. \layout Standard Another way to view this process is that we cannot know in advance which choice sequence the algorithm will make. However, a distribution \begin_inset Formula \( D \) \end_inset on labeled examples induces a probability distribution over choice sequences, inducing a probability distribution \begin_inset Formula \( q(h) \) \end_inset over hypotheses. Ideally we would like to use \begin_inset Formula \( p(h)=q(h) \) \end_inset in our bounds as noted above. However, we cannot calculate \begin_inset Formula \( q(h) \) \end_inset (since the distribution \begin_inset Formula \( D \) \end_inset is unknown), so instead, our choice of \begin_inset Formula \( p(h) \) \end_inset will be just an estimate. We hope that the algorithm designer has chosen a \begin_inset Quotes eld \end_inset good \begin_inset Quotes erd \end_inset leaning algorithm which induces a distribution \begin_inset Formula \( p(h) \) \end_inset over the final hypotheses which is near to \begin_inset Formula \( q(h) \) \end_inset . Our estimate \begin_inset Formula \( p(h) \) \end_inset is the probability distribution resulting from picking each choice uniformly at random from the current choice set at each level (note: this is different from picking a final hypothesis uniformly at random). I.e., it can be viewed as the measure associated with the assumption that at each step, all choices are equally likely. \layout Standard We immediately find the following theorem: \layout Theorem \begin_inset LatexCommand \label{th-smb} \end_inset (Microchoice Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)>\bar{e}(m,\hat{e}(h),\frac{\delta }{\prod _{i=1}^{d(h)}|C_{i}(h)|})\right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset \layout Standard \noindent \series bold Proof. \series default Specialization of the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset ). \layout Standard Once again, it will be worthwhile to slightly loosen this bound with the following corollary: \layout Corollary \begin_inset LatexCommand \label{th-remb} \end_inset (Relative Entropy Microchoice Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))>\frac{\sum _{i=1}^{d(h)}\ln |C_{i}(h)|+\ln \frac{1}{\delta }}{m}\right) \leq \delta \] \end_inset \layout Standard The point of the microchoice bound is that the quantity \begin_inset Formula \( \bar{e}(...) \) \end_inset is something the algorithm can calculate as it goes along, based on the sizes of the choice sets encountered. To see this, note that the hypothesis dependent term is \begin_inset Formula \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \) \end_inset . We can calculate \begin_inset Formula \( d(h) \) \end_inset by noting the number of choices made. The choice sets, \begin_inset Formula \( C_{i}(h) \) \end_inset , can often be easily deduced by reasoning about the possible microchoices the algorithm could have made given different datasets. \layout Standard In many natural cases, a \begin_inset Quotes eld \end_inset fortuitous distribution and target concept \begin_inset Quotes erd \end_inset corresponds to a shallow leaf or a part of the tree with low branching, resulting in a better bound. For instance, in the decision list case, \begin_inset Formula \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \) \end_inset is roughly \begin_inset Formula \( d\ln n \) \end_inset where \begin_inset Formula \( d \) \end_inset is the length of the list produced and \begin_inset Formula \( n \) \end_inset is the number of features. Notice that \begin_inset Formula \( d\ln n \) \end_inset is also the description length of the final hypothesis produced in the natural encoding, thus in this case these theorems yield similar bounds to a simple application of Occam's razor or SRM. \layout Standard More generally, the microchoice bound is similar to Occam's razor or SRM bounds when each \begin_inset Formula \( k \) \end_inset -ary choice in the tree corresponds to \begin_inset Formula \( \log k \) \end_inset bits in the natural encoding of the final hypothesis \begin_inset Formula \( h \) \end_inset . However, sometimes this may not be the case. Consider, for instance, a local optimization algorithm in which there are \begin_inset Formula \( n \) \end_inset parameters and each step adds or subtracts 1 from one of the parameters. Suppose in addition the algorithm knows certain constraints that these parameters must satisfy (perhaps a set of linear inequalities) and the algorithm restricts itself to choices in the legal region. In this case, the branching factor, at most \begin_inset Formula \( 2n \) \end_inset , might become much smaller if we are \begin_inset Quotes eld \end_inset lucky \begin_inset Quotes erd \end_inset and head toward a highly constrained portion of the solution space. One could always reverse-engineer an encoding of hypotheses based on the choice tree, but the microchoice approach is much more natural. \layout Standard There is also an opportunity to use \emph on a \protected_separator priori \emph default knowledge in the choice of \begin_inset Formula \( p(h) \) \end_inset . In particular, instead of splitting our confidence equally at each node of the tree, we could split it unevenly, according to some heuristic function \begin_inset Formula \( g \) \end_inset . If \begin_inset Formula \( g \) \end_inset is \begin_inset Quotes eld \end_inset good \begin_inset Quotes erd \end_inset it may produce error bounds similar to the bounds when \begin_inset Formula \( p(h)=q(h) \) \end_inset . In fact, the method of section ( \begin_inset LatexCommand \ref{sec:query} \end_inset ) where we combine these results with Freund's query-tree \begin_inset LatexCommand \cite{SB} \end_inset approach can be thought of as an attempt to do exactly this. \layout Subsection Examples \layout Standard It is difficult to create a bound which is universally better than previous bounds. The microchoice bound can be much better than the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ) and can be slightly worse. To develop some understanding of how they compare we consider several cases. \layout Subsubsection Greedy Set Cover \layout Standard Consider a greedy set cover algorithm for learning an OR function over \begin_inset Formula \( F \) \end_inset Boolean features. The algorithm begins with a choice space of size \begin_inset Formula \( F+1 \) \end_inset (one per feature or halt) and chooses the feature that covers the most positive examples while covering no negative ones. It then moves to a choice space of size \begin_inset Formula \( F \) \end_inset (one per feature remaining or halt) and chooses the best remaining feature and so on until it halts. If the number of features chosen is \begin_inset Formula \( k \) \end_inset then the microchoice bound is: \layout Standard \begin_inset Formula \[ \epsilon (h)=\frac{1}{m}\left( \ln \frac{1}{\delta }+\sum _{i=1}^{k}\ln (F-i+2)\right) \leq \frac{1}{m}\left( \ln \frac{1}{\delta }+k\ln (F+1)\right) \] \end_inset \layout Standard The bound of ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ) is: \layout Standard \begin_inset Formula \[ \epsilon =\frac{1}{m}\left( \ln \frac{1}{\delta }+F\ln 2\right) .\] \end_inset \layout Standard If \begin_inset Formula \( k \) \end_inset is small, then the microchoice bound is a lot better, but if \begin_inset Formula \( k=O(F) \) \end_inset then the microchoice bound is slightly worse than the discrete hypothesis bound. Notice that in this case the microchoice bound is essentially the same as the standard Occam's razor analysis when one uses \begin_inset Formula \( O(\ln F) \) \end_inset bits per feature to describe the hypothesis. \layout Subsubsection Decision Trees \layout Standard Decision trees over discrete sets (say, \begin_inset Formula \( \{0,1\}^{F} \) \end_inset ) are another natural setting for application of the microchoice bound. \layout Standard A decision tree differs from a decision list in that the size of the available choice set is larger due to the fact that there are multiple nodes where a new test may be applied. In particular, for a decision tree with \begin_inset Formula \( K \) \end_inset leaves at an average depth of \begin_inset Formula \( d \) \end_inset , the choice set size is \begin_inset Formula \( K(F-d) \) \end_inset , giving a bound noticeably worse than the bound for the decision list. This motivates a slightly different decision algorithm which considers only one leaf node at a time. The algorithm adds a new test or decides to never add a new test at this node. In this case, there are \begin_inset Formula \( (F-d(v)+1) \) \end_inset choices for a node \begin_inset Formula \( v \) \end_inset at depth \begin_inset Formula \( d(v) \) \end_inset , implying the bound: \begin_inset Formula \begin{equation} \label{eqn:dectreemc} \textrm{KL}(\hat{e}(h)||e(h))\leq \frac{1}{m}\left( \ln \frac{1}{\delta }+\sum _{v}\ln (F-d(v)+1)\right) \end{equation} \end_inset where \begin_inset Formula \( v \) \end_inset ranges over the nodes of the decision tree. Once again, this is very similar to what might be produced by an Occam's Razor Bound with an appropriate choice of prior. This result is again sometimes much better than the Discrete Hypothesis bound and sometimes slightly worse. \layout Subsection Pruning \layout Standard Decision tree algorithms for real-world learning problems often have some form of \begin_inset Quotes eld \end_inset pruning \begin_inset Quotes erd \end_inset as in \begin_inset LatexCommand \cite{Quinlan} \end_inset and \begin_inset LatexCommand \cite{Mingers} \end_inset . The tree is first grown to full size producing a hypothesis with minimum empirical error. Then the tree is \begin_inset Quotes eld \end_inset pruned \begin_inset Quotes erd \end_inset starting at the leaves and progressing up through the tree toward the root node using some test for the significance of an internal node. An internal node is not significant if the reduction in total error is small in comparison to the complexity of its children. Insignificant internal nodes are replaced with a leaf resulting in a smaller tree. \layout Standard Microchoice bounds have the property that they incidentally prove a bound for every decision tree which can be found by pruning internal nodes. In particular, one of the choices available when constructing a node is to make the node a leaf. Therefore, if we begin with the tree \begin_inset Formula \( T \) \end_inset and then prune to the smaller tree \begin_inset Formula \( T' \) \end_inset , we can apply the bound ( \begin_inset LatexCommand \ref{eqn:dectreemc} \end_inset ) to \begin_inset Formula \( T' \) \end_inset \emph on as if \emph default the algorithm had constructed \begin_inset Formula \( T' \) \end_inset directly rather than having gone first through the tree \begin_inset Formula \( T \) \end_inset . This suggests another possible pruning criterion: prune a node if the pruning would result in an improved microchoice bound. That is, prune if the increase in empirical error is less than the decrease in \begin_inset Formula \( \epsilon (h) \) \end_inset . This pruning criteria is a \begin_inset Quotes eld \end_inset pessimistic criteria \begin_inset Quotes erd \end_inset \begin_inset LatexCommand \cite{Yishay} \end_inset . \layout Standard The similarities to SRM are discussed next. \layout Subsection Microchoice and Structural Risk Minimization \layout Standard The microchoice bound is essentially a compelling application of the Disjoint SRM bound \begin_inset LatexCommand \ref{th-SRM} \end_inset where the description language for a hypothesis is the sequence of data-depende nt choices which the algorithm makes in the process of deciding upon the hypothesis. The hypothesis set \begin_inset Formula \( H_{i} \) \end_inset is all hypotheses with the same description length in this language. \layout Standard As an example, consider a binary decision tree with \begin_inset Formula \( F \) \end_inset Boolean features and a Boolean label. The first hypothesis set, \begin_inset Formula \( H_{1} \) \end_inset will consist of \begin_inset Formula \( 2 \) \end_inset hypotheses; always false and always true. In general, we will have one hypothesis set for every legal configuration of internal nodes. The size of a hypothesis set where every tree contains \begin_inset Formula \( k \) \end_inset internal nodes will be \begin_inset Formula \( 2^{k+1} \) \end_inset because there are \begin_inset Formula \( k+1 \) \end_inset leaves each of which can take \begin_inset Formula \( 2 \) \end_inset values. The weighting \begin_inset Formula \( p(i) \) \end_inset across the different hypothesis sets is defined by the microchoice allocation of confidence. \layout Standard The principle disadvantage of the microchoice bound is that the sequence of data-dependent choices may contain redundancy. A different SRM bound with a different set of disjoint hypothesis sets might be able to better avoid redundancy. As an example, assume that we are working with a decision tree on \begin_inset Formula \( F \) \end_inset binary features. There are \begin_inset Formula \( F+2 \) \end_inset choices (any of \begin_inset Formula \( F \) \end_inset features or \begin_inset Formula \( 2 \) \end_inset labels) at the top node. At the next node down there will be \begin_inset Formula \( F+1 \) \end_inset choices in both the left and right children. Repeat until a maximal decision tree is constructed. There will be \begin_inset Formula \( \prod _{i=0}^{F}(F-i+2)^{2^{i}} \) \end_inset possible trees. This number is somewhat larger than the number of Boolean functions on \begin_inset Formula \( F \) \end_inset features: \begin_inset Formula \( 2^{2^{F}} \) \end_inset . \layout Section Combining Microchoice with Freund's Query Tree approach \layout Standard \begin_inset LatexCommand \label{sec:query} \end_inset \layout Standard The next section is devoted to an improvement of the microchoice bound called adaptive microchoice, which arises from synthesizing Freund's query trees \begin_inset LatexCommand \cite{SB} \end_inset with the microchoice bound. This improvement is not easily expressed as a simplification of Structural Risk Minimization. In essence, the adaptive microchoice bound can gain from dependence on the learning problem distribution \begin_inset Formula \( D \) \end_inset and can take advantage of an \begin_inset Quotes eld \end_inset easy \begin_inset Quotes erd \end_inset distribution. \layout Standard First we require some background material in order to state and understand Freund's bound. \layout Subsection Preliminaries and Definitions \layout Standard The statistical query framework introduced by Kearns \begin_inset LatexCommand \cite{SQ} \end_inset restricts learning algorithm to only access the data using statistical queries. A statistical query takes as input a binary predicate, \begin_inset Formula \( \chi \) \end_inset , mapping examples to a binary output: \begin_inset Formula \( (X,Y)\rightarrow \{0,1\} \) \end_inset . The output of the statistical query is the average of \begin_inset Formula \( \chi \) \end_inset over the examples seen. Let \begin_inset Formula \( z_{i} \) \end_inset be the \begin_inset Formula \( i \) \end_inset th labeled example, then: \begin_inset Formula \[ \hat{D}_{\chi }=\frac{1}{m}\sum _{i=1}^{m}\chi (z_{i})\] \end_inset \layout Standard The output is an empirical estimate of the true value \begin_inset Formula \( D_{\chi }={\textbf {E}}_{D}[\chi (z)]=\Pr _{z\sim D}(\chi (z)=1) \) \end_inset of the query under the distribution \begin_inset Formula \( D \) \end_inset \begin_float footnote \layout Standard In the real SQ model there is no set of examples. The algorithm asks a query \begin_inset Formula \( \chi \) \end_inset and is given a response \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset that is guaranteed to be near to the true value \begin_inset Formula \( D_{\chi } \) \end_inset . That is, the true SQ model is an abstraction of the scenario described here where \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset is computed from an observed sample. \end_float . One simple example of a predicate \begin_inset Formula \( \chi \) \end_inset is \begin_inset Quotes eld \end_inset the first bit is \begin_inset Formula \( 1 \) \end_inset \begin_inset Quotes erd \end_inset . A more complicated predicate might be \begin_inset Quotes eld \end_inset the third bit xor the 4th bit is \begin_inset Formula \( 0 \) \end_inset \begin_inset Quotes erd \end_inset . Naturally, the distribution of \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset will be the familiar Binomial distribution. \layout Standard It is convenient to define \begin_inset Formula \[ \bar{I}_{D_{\chi }}(\delta )=\frac{1}{m}\max _{k}\left\{ k:\, \, 1-\textrm{Bin}\left( m,k,D_{\chi }\right) \geq \frac{\delta }{2}\right\} \] \end_inset and \begin_inset Formula \[ \underline{I}_{D_{\chi }}(\delta )=\frac{1}{m}\min _{k}\left\{ k:\, \, \textrm{Bin}\left( m,k,D_{\chi }\right) \geq \frac{\delta }{2}\right\} \] \end_inset and let \begin_inset Formula \[ I_{\chi }(\delta )=\left[ \underline{I}_{D_{\chi }}(\delta ),\bar{I}_{D_{\chi }}(\delta )\right] \] \end_inset Intuitively, \begin_inset Formula \( I_{\chi }(\delta ) \) \end_inset is a (fixed) interval in which the random variable \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset will fall with high probability. In other words, we know that: \begin_inset Formula \[ \Pr \left[ \hat{D}_{\chi }\not \in I_{\chi }(\delta )\right] \leq \delta \] \end_inset Now, we want to construct a confidence interval based upon the high confidence interval \begin_inset Formula \( I_{\chi }(\delta ) \) \end_inset . We can do this using the inversion lemma ( \begin_inset LatexCommand \ref{lem-inversion} \end_inset ) to get: \begin_inset Formula \[ \bar{D}_{\chi }(\delta )=\max _{p}\left\{ p:\underline{I}_{p}(\delta )=\hat{D}_{\chi }\right\} \] \end_inset and \begin_inset Formula \[ \underline{D}_{\chi }(\delta )=\min _{p}\left\{ p:\bar{I}_{p}(\delta )=\hat{D}_{\chi }\right\} \] \end_inset \layout Standard The random interval defined here contains the \begin_inset Quotes eld \end_inset real \begin_inset Quotes erd \end_inset answer \begin_inset Formula \( D_{\chi } \) \end_inset with high probability. In other words, we have: \begin_inset Formula \[ \Pr \left[ D_{\chi }\not \in [\underline{D}_{\chi }(\delta ),\bar{D}_{\chi }(\delta )]\right] \leq \delta \] \end_inset \layout Subsection Background and Summary \layout Standard Freund \begin_inset LatexCommand \cite{SB} \end_inset considers choice algorithms that at each step perform a Statistical Query on the sample, using the result to determine which choice to take. For an algorithm \begin_inset Formula \( A \) \end_inset , tolerance \begin_inset Formula \( \alpha \) \end_inset (defined next), and distribution \begin_inset Formula \( D \) \end_inset , Freund defines the query tree \begin_inset Formula \( T_{A}(D,\alpha ) \) \end_inset as the choice tree created by considering only those choices resulting from answers \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset to queries \begin_inset Formula \( \chi \) \end_inset such that \begin_inset Formula \( \left| \hat{D}_{\chi }-D_{\chi }\right| \leq \alpha \) \end_inset . The idea is that if a particular predicate, \begin_inset Formula \( \chi \) \end_inset , is true with probability \begin_inset Formula \( .9 \) \end_inset (for example) on a random sample it is very unlikely that the empirical result of the query will be \begin_inset Formula \( .1 \) \end_inset . More generally, the chance the answer to a given query is off by more than \begin_inset Formula \( \alpha \) \end_inset is at most \begin_inset Formula \( 2e^{-2m\alpha ^{2}} \) \end_inset by Hoeffding's inequality. So, if the entire tree contains a total of \begin_inset Formula \( |Q(T_{A}(D,\alpha ))| \) \end_inset queries in it, the probability \emph on any \emph default of these queries is off by more than \begin_inset Formula \( \alpha \) \end_inset is at most \begin_inset Formula \( 2\cdot |Q(T_{A}(D,\alpha ))|\cdot e^{-2m\alpha ^{2}} \) \end_inset . In other words, this is an upper bound on the probability the algorithm ever \begin_inset Quotes eld \end_inset falls off the tree \begin_inset Quotes erd \end_inset and makes a low probability choice. The point of this is that we can allocate half (say) of the confidence parameter \begin_inset Formula \( \delta \) \end_inset to the event that the algorithm ever falls off the tree, and then spread the remaining half evenly on the hypotheses in the tree (which hopefully is a much smaller set than the entire hypothesis set). \layout Standard Unfortunately, the query tree suffers from the same problem as the \begin_inset Formula \( q(h) \) \end_inset distribution considered in section ( \begin_inset LatexCommand \ref{sec:motivating} \end_inset ), namely that to compute it, one needs to know \begin_inset Formula \( D \) \end_inset . So, Freund proposes an algorithmic method to find a super-set approximation of the tree. The idea is that by analyzing the results of queries, it is possible to determine which outcomes were unlikely given that the query is close to the desired outcome. In particular, each time a query \begin_inset Formula \( \chi \) \end_inset is asked and a response \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset is received, if it is true that \begin_inset Formula \( |\hat{D}_{\chi }-D_{\chi }|\leq \alpha \) \end_inset , then the range \begin_inset Formula \( \left[ \hat{D}_{\chi }-2\alpha ,\hat{D}_{\chi }+2\alpha \right] \) \end_inset contains the range \begin_inset Formula \( \left[ D_{\chi }-\alpha ,D_{\chi }+\alpha \right] \) \end_inset . Thus, under the assumption that no query in the \emph on correct \emph default tree is answered badly, a super-set of the correct tree can be produced by exploring all choices resulting from responses within \begin_inset Formula \( 2\alpha \) \end_inset of the response actually received. By applying this method to every node in the query tree we can generate an empirically observable super-set of the query tree: that is, the original query tree is a pruning of the empirically constructed tree. \layout Standard A drawback of this method is that it can easily take exponential time to produce the approximate tree, because even the smaller correct tree can have a size exponential in the running time of the learning algorithm. Instead, we would much rather simply keep track of the choices actually made and the sizes of the nodes actually followed, which is what the microchoic e approach allows us to do. As a secondary point, given \begin_inset Formula \( \delta \) \end_inset , computing a good value of \begin_inset Formula \( \alpha \) \end_inset for Freund's approach is not trivial, see \begin_inset LatexCommand \cite{SB} \end_inset ; we will be able to finesse that issue and use the tighter bound \begin_inset Formula \( \hat{D}_{\chi }\in I_{\chi }(\delta ) \) \end_inset . \layout Standard In order to apply the microchoice approach, we modify Freund's query tree so that different nodes in the tree receive different confidence, \begin_inset Formula \( \delta \) \end_inset , much in the same way that different hypotheses \begin_inset Formula \( h \) \end_inset in our choice tree receive different values of \begin_inset Formula \( \delta (h) \) \end_inset . \layout Subsection Microchoice Bounds for Query Trees \layout Standard The manipulations of the choice tree are now reasonably straightforward. We begin by describing the \emph on true \emph default microchoice query tree and then give the algorithmic approximation. As with the choice tree in section ( \begin_inset LatexCommand \ref{sec:mc} \end_inset ), one should think of each node in the tree as representing the current internal state of the algorithm. \layout Standard We incorporate Freund's approach into the choice tree construction by having each internal node allocate a portion, \begin_inset Formula \( p' \) \end_inset of its \begin_inset Quotes eld \end_inset supply \begin_inset Quotes erd \end_inset of failure probability to the event that \begin_inset Formula \( \hat{D}_{\chi }\not \in I_{\chi }(\delta *p') \) \end_inset . The node then splits the remainder of its supply evenly among the children corresponding to choices that result from answers \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset with \begin_inset Formula \( \hat{D}_{\chi }\in I_{\chi }(\delta *p') \) \end_inset . Choices that would result from \begin_inset Quotes eld \end_inset bad \begin_inset Quotes erd \end_inset answers to the query are \emph on pruned away \emph default from the tree and get nothing. This continues down the tree to the leaves. Pictorially, this looks like: \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 267 116 file thesis-presentation/amicro_tree.eps width 4 90 flags 9 \end_inset \layout Standard How should \begin_inset Formula \( p' \) \end_inset be chosen? Smaller values of \begin_inset Formula \( p' \) \end_inset result in larger intervals \begin_inset Formula \( I_{\chi }(\delta *p') \) \end_inset leading to more children in the pruned tree and less confidence given to each. Larger values of \begin_inset Formula \( p' \) \end_inset result in less left over to divide among the children. Unfortunately, our algorithmic approximation (which only sees the empirical answers \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset and needs to be efficient) will not be able to make this optimization. Therefore, we define \begin_inset Formula \( p' \) \end_inset in the \emph on true \emph default microchoice query tree to be \begin_inset Formula \( \frac{p}{d+1} \) \end_inset where \begin_inset Formula \( d \) \end_inset is the depth of the current node. This choice will imply that the adaptive microchoice bound is never much worse than the Microchoice bound, and sometimes much better. \layout Standard \added_space_bottom 0.3cm Since a particular query value \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset implies a particular choice \begin_inset Formula \( c \) \end_inset , we can think of the interval \begin_inset Formula \( I_{\chi }(\delta ) \) \end_inset as containing \emph on choices \emph default rather than query results. After all, we only care about the choices the algorithm makes. We can calculate the probability assigned to a hypothesis in the \emph on true \emph default adaptive microchoice query tree according to the following algorithm: \layout Algorithm True_Adaptive_Microchoice( \begin_inset Formula \( \delta \) \end_inset ) \layout Enumerate set \begin_inset Formula \( p\leftarrow 1 \) \end_inset \layout Enumerate set \begin_inset Formula \( d=1 \) \end_inset \layout Enumerate while learning algorithm has not halted. \begin_deeper \layout Enumerate \begin_inset Formula \( d\leftarrow d+1 \) \end_inset \layout Enumerate \begin_inset Formula \( p'\leftarrow \frac{p}{d} \) \end_inset \layout Enumerate Let \begin_inset Formula \( C \) \end_inset = the current set of possible data-dependent choices. \layout Enumerate \begin_inset Formula \( |\hat{C}|\leftarrow |\{c\in C|c\in I_{\chi }(\delta *p')\}| \) \end_inset \layout Enumerate \begin_inset Formula \( p\leftarrow \frac{p-p'}{|\hat{C}|} \) \end_inset \end_deeper \layout Enumerate return \begin_inset Formula \( p \) \end_inset \layout Standard There are two important things to note about this algorithm. First of all, we could plug the value \begin_inset Formula \( p(h) \) \end_inset it returns into the Occam's Razor Bound \begin_inset LatexCommand \ref{th-ORB} \end_inset and receive a bound on the true error rate of our chosen classifier. \layout Standard Second, this algorithm can not be executed. The essential problem is determining whether or not \begin_inset Formula \( c\in I_{\chi }(\delta *p') \) \end_inset which cannot be done without knowledge of the underlying distribution \begin_inset Formula \( D \) \end_inset . However, we \emph on can \emph default calculate an approximate version of this algorithm which, with high probability, returns a value \begin_inset Formula \( p \) \end_inset which is smaller. Since a smaller value is pessimistic, we can use it in our bounds. \layout Standard The algorithmic approximation uses the idea in \begin_inset LatexCommand \cite{SB} \end_inset of including all choices within the \emph on double \emph default confidence interval of the observed value \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset . Unlike \begin_inset LatexCommand \cite{SB} \end_inset , however, we do not actually create the tree; instead we just follow the path taken by the learning algorithm, and argue that the \begin_inset Quotes eld \end_inset supply \begin_inset Quotes erd \end_inset probability remaining at the leaf is no greater than the amount that would have been there in the original tree. Finally, the algorithm outputs a bound calculated with \begin_inset Formula \( p(h) \) \end_inset . \layout Standard Specifically, the algorithm is as follows. Suppose we are at a node of the tree containing statistical query \begin_inset Formula \( \chi \) \end_inset at depth \begin_inset Formula \( d(\chi ) \) \end_inset and we have a \begin_inset Formula \( p \) \end_inset supply of parameter. (If the current node is the root, then \begin_inset Formula \( p=1 \) \end_inset and \begin_inset Formula \( d(\chi )=1 \) \end_inset ). We choose \begin_inset Formula \( p'=p/(d+1) \) \end_inset , ask the query \begin_inset Formula \( \chi \) \end_inset , and receive \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset . Let \begin_inset Formula \[ \bar{\bar{D}}_{\chi }(\delta )=\min _{k}\left\{ \frac{k}{m}:\, \, 1-\textrm{Bin}\left( m,k,\bar{D}_{\chi }(\delta )\right) \leq \frac{\delta }{2}\right\} \] \end_inset and \begin_inset Formula \[ \underline{\underline{D}}_{\chi }(\delta )=\max _{k}\left\{ \frac{k}{m}:\, \, \textrm{Bin}\left( m,k,\bar{D}_{\chi }(\delta )\right) \leq \frac{\delta }{2}\right\} \] \end_inset with \begin_inset Formula \[ \hat{I}_{\chi }(\delta )=\left[ \underline{\underline{D}}_{\chi }(\delta ),\bar{\bar{D}}_{\chi }(\delta )\right] \] \end_inset \layout Standard We now let \begin_inset Formula \( k \) \end_inset be the number of children of our node corresponding to answers in the range \begin_inset Formula \( \hat{I}_{\chi }(p') \) \end_inset . We then go to the child corresponding to the answer \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset that we received, giving this child a confidence parameter supply of \begin_inset Formula \( (p-p')/k \) \end_inset . This is the same as we would have given it had we allocated \begin_inset Formula \( p-p' \) \end_inset to the children equally. We then continue from that child. Finally, when we reach a leaf, we output the probability left for the hypothesi s. Pictorially, this looks like: \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 267 137 file thesis-presentation/amcicro_alg.eps width 4 90 flags 9 \end_inset \layout Standard Notice that the second choice set is larger than in the true adaptive microchoic e set tree. This can easily happen and it makes our results somewhat more pessimistic. The approximate adaptive microchoice algorithm is specified as follows: \layout Algorithm Approximate_Adaptive_Microchoice( \begin_inset Formula \( \delta \) \end_inset ) \layout Enumerate set \begin_inset Formula \( p\leftarrow 1 \) \end_inset \layout Enumerate set \begin_inset Formula \( d=1 \) \end_inset \layout Enumerate while learning algorithm has not halted. \begin_deeper \layout Enumerate \begin_inset Formula \( d\leftarrow d+1 \) \end_inset \layout Enumerate \begin_inset Formula \( p'\leftarrow \frac{p}{d} \) \end_inset \layout Enumerate Let \begin_inset Formula \( C \) \end_inset = the current set of possible data-dependent choices. \layout Enumerate \begin_inset Formula \( |\hat{C}|\leftarrow |\{c\in C|c\in \hat{I}_{\chi }(\delta *p')\}| \) \end_inset \layout Enumerate \begin_inset Formula \( p\leftarrow \frac{p-p'}{|\hat{C}|} \) \end_inset \end_deeper \layout Enumerate return \begin_inset Formula \( p \) \end_inset \layout Standard Let \begin_inset Formula \( d(h) \) \end_inset be the depth of some hypothesis \begin_inset Formula \( h \) \end_inset in the empirical path and \begin_inset Formula \( \hat{C}_{1}(h) \) \end_inset , \begin_inset Formula \( \hat{C}_{2}(h) \) \end_inset , \begin_inset Formula \( \ldots \) \end_inset , \begin_inset Formula \( \hat{C}_{d}(h) \) \end_inset be the sequence of choice sets resulting in \begin_inset Formula \( h \) \end_inset in the algorithmic construction; i.e., \begin_inset Formula \( \hat{C}_{i}(h) \) \end_inset is the number of unpruned children of the \begin_inset Formula \( i \) \end_inset -th node. Then, the confidence placed on \begin_inset Formula \( h \) \end_inset will be: \layout Standard \begin_inset Formula \begin{equation} \label{eqn:adaptiveMC} p(h)=\prod _{i=1}^{d(h)}\left( \frac{i}{i+1}\frac{1}{|\hat{C}_{i}(h)|}\right) =\frac{1}{d(h)+1}\prod _{i=1}^{d(h)}\frac{1}{|\hat{C}_{i}(h)|} \end{equation} \end_inset \layout Theorem \begin_inset LatexCommand \label{th-amb} \end_inset (Adaptive Microchoice Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)>\bar{e}(m,\hat{e}(h),p(h)*\delta )\right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset , and \begin_inset Formula \( p(h) \) \end_inset is as defined in equation \begin_inset LatexCommand \ref{eqn:adaptiveMC} \end_inset . \layout Proof \noindent By design, with probability \begin_inset Formula \( 1-\delta \) \end_inset all queries in the true microchoice query tree receive good answers, \emph on and \emph default all hypotheses in that tree have their true errors within their estimates. \layout Proof \noindent We will prove that in the high probability case, the output of the Approximate_A daptive_Microchoice algorithm is less than the output of the True_Adaptive_Micro choice algorithm. Since a smaller \begin_inset Formula \( p(h) \) \end_inset makes the bound more pessimistic, we will prove the bound. \layout Proof \noindent Assume inductively that at the current node of our empirical path the supply \begin_inset Formula \( p_{\mathrm{emp}} \) \end_inset is no greater than the supply \begin_inset Formula \( p_{\mathrm{true}} \) \end_inset given to that node in the true tree. This is clearly satisfied in the base case when \begin_inset Formula \( p_{emp}=p_{true}=1 \) \end_inset . \layout Proof \noindent Under the assumption that the response \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset falls in the interval \begin_inset Formula \( I_{\chi }(p_{\mathrm{true}}/(d+1)) \) \end_inset , it must be the case that the interval \begin_inset Formula \( \hat{I}_{\chi }(p_{\mathrm{emp}}/(d+1)) \) \end_inset contains the interval \begin_inset Formula \( I_{\chi }(p_{\mathrm{true}}/(d+1)) \) \end_inset . Therefore, the supply given to any child in the empirical path is no greater than the supply given in the true tree. \layout Standard The corresponding relative entropy corollary is: \layout Corollary \begin_inset LatexCommand \label{th-reamb} \end_inset (Relative Entropy Microchoice Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset , \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))>\frac{\sum _{i=1}^{d(h)}\ln |\hat{C}_{i}(h)|+\ln (d(h)+1)+\ln \frac{1}{\delta }}{m}\right) \leq \delta \] \end_inset \layout Standard The bound in theorem ( \begin_inset LatexCommand \ref{th-amb} \end_inset ) is very similar to ( \begin_inset LatexCommand \ref{th-smb} \end_inset ) except that the choice complexity is slightly worsened with the \begin_inset Formula \( \ln (d(h)+1) \) \end_inset term but improved by replacing \begin_inset Formula \( C_{i}(h) \) \end_inset with the smaller \begin_inset Formula \( \hat{C}_{i}(h) \) \end_inset . \layout Subsection Allowing batch queries \layout Standard Most natural Statistical Query algorithms make each choice based on responses to a \emph on set \emph default of queries, not just one. For instance, to decide what variable to put at the top of a decision tree, we ask \begin_inset Formula \( F \) \end_inset queries, one for each feature; we then choose the feature whose answer was most \begin_inset Quotes eld \end_inset interesting \begin_inset Quotes erd \end_inset . This suggests generalizing the query tree model to allow each tree node to contain a set of queries, executed in batch. Requiring each node in the query tree to contain just a single query as in the above construction would result in an unfortunately high branching factor just for the purpose of \begin_inset Quotes eld \end_inset remembering \begin_inset Quotes erd \end_inset the answers received so far. \begin_float footnote \layout Standard Consider a decision tree algorithm attempting to find the right feature for a node. If the first query returns a value of \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset with a confidence of \begin_inset Formula \( \delta \) \end_inset then the branching factor would be approximately \begin_inset Formula \( m\cdot \hat{I}_{\chi }(\delta ) \) \end_inset . This branching factor would be approximately the same for further queries required by the algorithm to make a decision about what feature to use. This results in a total multiplied choice space size of approximately \begin_inset Formula \( \left[ 4m\cdot \hat{I}_{\chi }(\delta )\right] ^{F} \) \end_inset which can be reduced to \begin_inset Formula \( F \) \end_inset or less using a batch query. \end_float \layout Standard Extending the algorithmic construction to allow for batch queries is easily done. If a node has \begin_inset Formula \( q \) \end_inset queries \begin_inset Formula \( \chi _{1},\ldots ,\chi _{q} \) \end_inset , we choose the query confidence \begin_inset Formula \( \delta *p' \) \end_inset as before, but we now split the mass evenly among all \begin_inset Formula \( q \) \end_inset queries. We then let \begin_inset Formula \( k \) \end_inset be the number of children corresponding to answers to the queries \begin_inset Formula \( \chi _{1},\ldots ,\chi _{q} \) \end_inset in the ranges \begin_inset Formula \( \hat{I}_{\chi _{1}}(\delta p'/q),\ldots ,\hat{I}_{\chi _{q}}(\delta p'/q) \) \end_inset respectively. We then go to the child corresponding to the answers we actually received, and as before give the child a probability supply of \begin_inset Formula \( (p-p')/k \) \end_inset . Theorem ( \begin_inset LatexCommand \ref{th-amb} \end_inset ) holds exactly as before; the only change is that \begin_inset Formula \( |\hat{C}_{i}(h)| \) \end_inset means the size of the \begin_inset Formula \( i \) \end_inset -th choice set in the batch tree rather than the size in the single-query-per-no de tree. \layout Subsection Example: Batch Queries for Decision trees \layout Standard When growing a decision tree, it is natural to make a batch of queries and then make a decision about which feature to place in a node. The process is then repeated to grow the full tree structure. As in the decision tree example described in the simple microchoice section, if we have \begin_inset Formula \( F \) \end_inset features and are considering adding a node at depth \begin_inset Formula \( d(v) \) \end_inset , there are \begin_inset Formula \( F-d(v)+1 \) \end_inset possible features that could be chosen for placement in a particular node. The decision of which feature to use is made by comparing the results of \begin_inset Formula \( F-d(v)+1 \) \end_inset queries to pick the best feature according to some criteria, such as informatio n gain. We can choose \begin_inset Formula \( p'=p/(d+1) \) \end_inset , then further divide \begin_inset Formula \( p' \) \end_inset into confidences of size \begin_inset Formula \( p'/(F-d(v)+1) \) \end_inset , placing each divided confidence on one of the \begin_inset Formula \( F-d(v)+1 \) \end_inset statistical queries. We now may be able to eliminate some of the \begin_inset Formula \( F-d(v)+1 \) \end_inset choices from consideration, allowing the remaining confidence, \begin_inset Formula \( p-p' \) \end_inset to be apportioned evenly amongst the remaining choices. Depending on the underlying distribution this could substantially reduce the size of the choice set. The best case occurs when one feature partitions all examples reaching the node perfectly and all other features are independent of the target. In this case the choice set will have size \begin_inset Formula \( 1 \) \end_inset if there are enough examples. \layout Subsection Adaptive Microchoice vs. \protected_separator Basic Microchoice \layout Standard The adaptive microchoice bound is a significant improvement over the simple microchoice bound when the distribution is such that each choice is clear. For example, consider \begin_inset Formula \( F \) \end_inset Boolean features and \begin_inset Formula \( N=O(F) \) \end_inset examples. Suppose that one feature is identical to the label and all the rest of the features are determined with a coin flip independent of the label. \layout Standard When we apply a decision tree to a data set generated with this distribution, what will be the resulting bound? Given enough examples, with high probability there will only be one significant choice for the first batch query: the feature identical to the label. The second and third batch queries, corresponding to the children of the root feature, will also have a choice space of size \begin_inset Formula \( 1 \) \end_inset with very high probability. The \begin_inset Quotes eld \end_inset right \begin_inset Quotes erd \end_inset choice will be the label value. Each choice set has size \begin_inset Formula \( 1 \) \end_inset resulting in a complexity of \begin_inset Formula \( \ln 4 \) \end_inset due to allocation of confidence to the statistical queries necessary for learning the decision tree. \begin_inset Formula \( \ln 4 \) \end_inset is considerably better than \begin_inset Formula \( \ln (F+2)+2\ln (F+1) \) \end_inset which the simple version of the microchoice bound provides. Note that the complexity reduction only occurs with a large enough number of examples \begin_inset Formula \( m \) \end_inset implying that the value of \begin_inset Formula \( \bar{e} \) \end_inset calculated can improve faster than (inverse) linearly in the number of examples. \layout Standard The adaptive microchoice bound is never much looser than the simple microchoice bound because under the assumption that choice sets are of size at least \begin_inset Formula \( 2 \) \end_inset , the penalty for using the adaptive version, \begin_inset Formula \( \ln d \) \end_inset , is always small compared to the complexity term for the simple microchoice bound, \begin_inset Formula \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \) \end_inset . \layout Subsection Other Adaptive Microchoice \layout Standard The adaptive microchoice bound provides a simple scheme for dividing confidence between choices and queries. There are other choices which may be useful in some settings. Any scheme which \emph on a priori \emph default divides the confidence between queries and choices at every node will generally work. Here are two schemes which may be useful: \layout Itemize Assign a constant proportion of confidence to the query. This scheme is more aggressive than the one used in the adaptive microchoice bounds and may result in a lower complexity when many choices are eliminatable. The drawback is we no longer get the telescoping in equation ( \begin_inset LatexCommand \ref{eqn:adaptiveMC} \end_inset ) and so the term logarithmic in \begin_inset Formula \( d(h) \) \end_inset in theorem ( \begin_inset LatexCommand \ref{th-reamb} \end_inset ) becomes linear in \begin_inset Formula \( d(h) \) \end_inset . \layout Itemize For a decision tree, assign a portion dependent on the depth of the node in the decision tree that the choice set is over. It is unlikely that choices are eliminatable from nodes not near the root because the number of examples available at a node typically decays exponential ly with the (decision tree) depth. A progressive scheme which allocates less confidence to queries for deep nodes will probably behave better in practice. \layout Subsection Comparison with Freund's Self-Bounding algorithms \layout Standard Freund's approach for self-bounding learning algorithms can require exponentiall y more computation then the microchoice approach. In its basic form, it requires explicit construction of every path in the state space of the algorithm not pruned in the tree. There exist some learning algorithms where this process can be done implicitly making the computation feasible. However, in general this does not appear to be possible. \layout Standard The adaptive microchoice bound only requires explicit construction of the size of each subset from which a choice is made. Because many common learning algorithms work by a process of making choices from small subsets, this is often computationally easy. The adaptive microchoice bound does poorly, however, when Freund's query tree has a high degree of sharing; for example, when many nodes of the tree correspond to the same query, or many leaves of the tree have the same final hypothesis. Allowing batch queries alleviates the most egregious examples of this. It is also possible to interpolate between the adaptive microchoice bound and Freund's bound by a process of conglomerating the subsets of the microchoic e bound. \layout Subsection Choice Set Conglomeration \layout Standard The mechanism of choice set conglomeration is a similar to the batch query technique. It allows you to trade increased computation for a tighter bound. When starting with the simple microchoice bound, this technique can smoothly interpolate with the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ). When starting with the adaptive microchoice bound, we can interpolate with Freund's bound. \layout Standard Consider a particular choice set, \begin_inset Formula \( \hat{C}_{i} \) \end_inset , with elements \begin_inset Formula \( c_{i} \) \end_inset . Each \begin_inset Formula \( c_{i} \) \end_inset indexes another choice set, \begin_inset Formula \( \hat{C}_{i+1}(c_{i}) \) \end_inset . If the computational resources exist to calculate the union \begin_inset Formula \( \hat{C}_{i,i+1}=\bigcup _{c_{i}\in \hat{C}_{i}}\hat{C}_{i+1}(c_{i}) \) \end_inset , then \begin_inset Formula \( |\hat{C}_{i,i+1}| \) \end_inset can be used in place of \begin_inset Formula \( |\hat{C}_{i}|\cdot |\hat{C}_{i+1}| \) \end_inset in the adaptive microchoice bound. The conglomeration can be done repeatedly to build large choice sets and also applies to the simple microchoice bound ( \begin_inset LatexCommand \ref{th-smb} \end_inset ). Conglomeration can be useful for tightening the bound when there are multiple choice sequences leading to the same hypothesis. However, choice set conglomeration is not always helpful because it trades away the fine granularity of the microchoice bound. The extreme case where all choice sets are conglomerated into one choice set and every hypothesis and query have the same weight is equivalent to Freund's bound. \layout Standard When the choices of the attached choice sets are all different, conglomeration will have little use because the size of the union of the choice sets is the sum of the sizes of each choice set \begin_inset Formula \( |\hat{C}_{i,i+1}|=\sum _{c_{i}\in \hat{C}_{i}}|\hat{C}_{i+1}(c_{i})| \) \end_inset . If the child sets each have the same size \begin_inset Formula \( |\hat{C}_{i+1}| \) \end_inset then this simplifies to \begin_inset Formula \( |\hat{C}_{i,i+1}|=|\hat{C}_{i}|\cdot |\hat{C}_{i+1}| \) \end_inset which results in the same confidence applied to each choice whether conglomerat ing or not. The best case for conglomeration is equivalent to the batch query case: every sub-choice set contains the same elements. Then we have \begin_inset Formula \( |\hat{C}_{i,i+1}|=|\hat{C}_{i+1}| \) \end_inset and can pay no cost for the choice set \begin_inset Formula \( |\hat{C}_{i}| \) \end_inset . \layout Section Microchoice discussion \layout Standard Microchoice bounds can be used in practice and do yield results comparable with holdout set based techniques (see figure \SpecialChar \@. \begin_inset LatexCommand \ref{fig-microchoice-holdout} \end_inset and the surrounding section for details). There are two significant insights which the microchoice bounds provide. \layout Enumerate The \begin_inset Quotes eld \end_inset cost \begin_inset Quotes erd \end_inset of choices is made very explicit. The cost of a choice (in terms of sample complexity) is the log of the number of choices. \layout Enumerate It is possible to improve upon the Occam's Razor Bound (theorem \begin_inset LatexCommand \ref{th-ORB} \end_inset ) by using information from the sample set to infer properties (such as the choice tree) of the distribution \begin_inset Formula \( D \) \end_inset . This can be done \emph on without \emph default any explicit knowledge of the distribution \begin_inset Formula \( D \) \end_inset . \layout Problem (Open) Is there a satisfying, natural bound for the continuous case? Preliminary work and thoughts by several people has occurred, but nothing has yet come of it. \layout Chapter PAC-Bayes bounds \layout Standard \begin_inset LatexCommand \label{sec-PB} \end_inset \layout Standard The work presented here is also published in \begin_inset LatexCommand \cite{averaging_tech} \end_inset . \layout Standard PAC-Bayes bounds are a generalization of the Occam's razor bound for algorithms which output a \emph on distribution \emph default over classifiers rather than just a single classifier. This includes the possibility of a distribution over a single classifier, so it is a generalization. Most classifiers do not output a distribution over base classifiers. Instead, they output either a classifier, or an average over base classifiers. Nonetheless, PAC-Bayes bounds are interesting for several reasons: \layout Enumerate PAC-Bayes bounds are much tighter (in practice) than most common VC-related \begin_inset LatexCommand \cite{Vapnik} \end_inset approaches on continuous classifier spaces. This can be shown by application to stochastic neural networks (see section \begin_inset LatexCommand \ref{SNN} \end_inset ) as well as other classifiers. It also can be seen by observation: when specializing the PAC-Bayes bounds on discrete hypothesis spaces, only \begin_inset Formula \( O(\ln m) \) \end_inset sample complexity is lost. \layout Enumerate Due to the achievable tightness, the result motivates new learning algorithms which strongly limit the amount of overfitting that a learning algorithm will incur. \layout Enumerate The result found here will turn out to be useful for averaging hypotheses. \layout Standard PAC-Bayes bounds were first introduced by McAllester \begin_inset LatexCommand \cite{PB} \end_inset . \layout Standard There are three relatively independent observations in this chapter: \layout Enumerate A quantitative improvement of the PAC-Bayes by retrofit with relative entropy Chernoff bound \begin_inset LatexCommand \ref{eq-recb} \end_inset . This retrofit is not as trivial as might be expected, but it can be done. The result is the tightest known PAC-Bayes bound. In addition to the quantitative improvements, this tightening simplifies the proof and adds to our qualitative understanding of the bound. \layout Enumerate A method for (partially) derandomizing the PAC-Bayes stochastic hypothesis \layout Enumerate A method for stochastic evaluation of the empirical error. \layout Standard The first observation is the most important. Observation (3) is important for many practical applications because it is safely avoids a (sometimes) very complicated evaluation problem. Observation (2) is of little theoretical interest, but it might interest some people who feel reassured when every classifier randomized over has a low empirical error rate. \layout Standard Figure \begin_inset LatexCommand \ref{fig-pb-protocol} \end_inset shows what the PAC-Bayes bound looks like as an interactive proof of learning. \layout Standard \begin_float fig \layout Standard \align center \begin_inset Figure size 267 152 file thesis-presentation/pac-bayes.eps width 4 90 flags 9 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-pb-protocol} \end_inset The PAC-Bayes bound can be viewed as a new style for a proof of learning. The learner must commit to a \begin_inset Quotes eld \end_inset Prior \begin_inset Quotes erd \end_inset as in the Occam's Razor Bound \begin_inset LatexCommand \ref{th-ORB} \end_inset before seeing examples, but it does not commit to a single hypothesis. Instead, it commits to a distribution over hypotheses, \begin_inset Formula \( q(h) \) \end_inset and the bound applies to a randomization with respect to the distribution \begin_inset Formula \( q(h) \) \end_inset . \end_float \layout Section PAC-Bayes Basics \layout Standard In the PAC-Bayes setting, a classifier is defined by a distribution \begin_inset Formula \( q(h) \) \end_inset over the hypothesis space. Each classification is carried out according to a hypothesis \emph on sampled \emph default from \begin_inset Formula \( q(h) \) \end_inset . We are interested in the gap between the \emph on expected \emph default generalization error \begin_inset Formula \( e_{q}(h)\equiv E_{q}\left[ e(h)\right] \) \end_inset and the \emph on expected \emph default empirical error \begin_inset Formula \( \hat{e}_{q}(h)=E_{q}\left[ \hat{e}(h)\right] \) \end_inset , where both expectations are taken with respect to \begin_inset Formula \( q(h) \) \end_inset . The gap will be parameterized by the Kullback-Leibler divergence (see \begin_inset LatexCommand \cite{Cover} \end_inset ). Recall that: \layout Standard \begin_inset Formula \begin{equation} \label{eq-relent} \textrm{KL}(q||p)=E_{h\sim q(h)}\ln \frac{q(h)}{p(h)} \end{equation} \end_inset If the support is finite, we have \begin_inset Formula \begin{equation} \label{eq-frelent} \textrm{KL}(q||p)=\sum _{h}q(h)\ln \frac{q(h)}{p(h)} \end{equation} \end_inset The relative entropy is an asymmetric distance measure between probability distributions, with \begin_inset Formula \( \textrm{KL}(q||p)=0\Leftrightarrow q=p \) \end_inset almost everywhere. \layout Theorem \begin_inset LatexCommand \label{th-pbb} \end_inset (PAC-Bayes \begin_inset LatexCommand \cite{PB} \end_inset ) For all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h) \) \end_inset and for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \layout Theorem \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists q(h):\, \, e_{q}(h)\geq \hat{e}_{q}(h)+\sqrt{\frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }+2}{2m-1}}\right) \leq \delta \] \end_inset \layout Proof Given in \begin_inset LatexCommand \cite{PB} \end_inset . \layout Standard This PAC-Bayes bound is almost the same as the Occam's Razor bound (theorem \begin_inset LatexCommand \ref{th-ORB} \end_inset ) when the distribution is peaked on a single hypothesis and the Occam's razor bound is proved using the looser Hoeffding inequality. This can be seen by noting that the KL-divergence when \begin_inset Formula \( q \) \end_inset is all on one hypothesis, \begin_inset Formula \( h \) \end_inset satisfies \begin_inset Formula \( \textrm{KL}(q||p)=\log \frac{1}{p(h)} \) \end_inset . Comparing with the Occam's Razor bound, we see that a (small) extra term of size \begin_inset Formula \( \frac{\ln m}{m} \) \end_inset has been introduced in return for the capability to average with respect to \emph on any \emph default posterior \begin_inset Formula \( q(h) \) \end_inset . It is unclear yet that this \begin_inset Formula \( \frac{\ln m}{m} \) \end_inset term needs to be there. \layout Problem (Open) Remove the \begin_inset Formula \( \frac{\ln m}{m} \) \end_inset term from the sample complexity. \layout Standard The real power of the PAC-Bayes bound occurs when the average is over many hypotheses. This might occur if the distribution \begin_inset Formula \( q(h) \) \end_inset is picked using Bayes law or a Gibbs distribution. One of the most interesting aspects of the PAC-Bayes bound is that it holds for finite \emph on and \emph default infinite hypothesis spaces. \layout Section A Tighter PAC-Bayes Bound \layout Standard We can tighten this bound by employing a more accurate tail bound on the Binomial distribution. The proof of this improved lemma is not as straightforward as a simple substitution of the Hoeffding bound with the relative entropy Chernoff bound \begin_inset LatexCommand \ref{eq-recb} \end_inset but it can be worked out nonetheless. \layout Theorem \begin_inset LatexCommand \label{th-repbb} \end_inset (Relative Entropy PAC-Bayes bound) For all binary loss functions, \begin_inset Formula \( l(h,(x,y)) \) \end_inset , for all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h) \) \end_inset and for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \layout Theorem \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists q(h):\, \, \textrm{KL}(\hat{e}_{q}(h)||e_{q}(h))\geq \frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}\right) \leq \delta \] \end_inset \layout Standard This bound is always at least as tight as the original PAC-Bayes bound \begin_inset LatexCommand \cite{PB} \end_inset and sometimes much tighter, such as when the average empirical error is near \begin_inset Formula \( 0 \) \end_inset . In particular, when the average empirical error is zero ( \begin_inset Formula \( \hat{e}_{q}(h)=0 \) \end_inset ) the bound can be significantly tighter as shown in figure \begin_inset LatexCommand \ref{fig-bounds} \end_inset on page \begin_inset LatexCommand \pageref{fig-bounds} \end_inset . \layout Standard One interesting new feature of this PAC-Bayes bound is \begin_inset Quotes eld \end_inset dimensional consistency \begin_inset Quotes erd \end_inset \begin_float footnote \layout Standard My thanks to Patrick Haffner for pointing this out. \end_float . In particular, each side of the equation is an expectation of log probabilities --- \begin_inset Quotes eld \end_inset nats \begin_inset Quotes erd \end_inset . Note, this does not hold when the Hoeffding approximation is used. Rewriting, we get that with high probability, approximately the following holds: \begin_inset Formula \[ m\textrm{KL}(\hat{e}_{q}(h)||e_{q}(h))\leq \textrm{KL}(q||p)\] \end_inset There is a coding theory interpretation of KL divergence: \begin_inset Formula \( \textrm{KL}(q||p)= \) \end_inset the expected number of \emph on extra \emph default bits required to encode symbols drawn from \begin_inset Formula \( q \) \end_inset given a code designed for symbols drawn from \begin_inset Formula \( p \) \end_inset rather than from \begin_inset Formula \( q \) \end_inset . \layout Standard Using the coding theory interpretation of KL divergence, this says approximately : \begin_inset Quotes eld \end_inset With high probability the number of \emph on extra \emph default bits required to encode the empirical errors is less than the number of \emph on extra \emph default bits required to encode hypotheses drawn from the posterior. \begin_inset Quotes erd \end_inset \layout Standard The retrofit of the PAC-Bayes bound is accomplished by reproving a technical lemma about distributions. The proof relies upon two lemmas. The first is Lemma 22 from \begin_inset LatexCommand \cite{PB} \end_inset which is given by: \layout Lemma \begin_inset LatexCommand \label{lem-opt} \end_inset For all \begin_inset Formula \( \beta >0,K>0 \) \end_inset and \begin_inset Formula \( Q,P,y\in R^{n} \) \end_inset satisfying \begin_inset Formula \( P_{i}>0,Q_{i}>0 \) \end_inset and \begin_inset Formula \( \sum _{i}Q_{i}=1 \) \end_inset , if \begin_inset Formula \[ \sum _{i=1}^{n}P_{i}e^{\beta y_{i}}\leq K\] \end_inset then \layout Lemma \begin_inset Formula \[ \sum _{i=1}^{n}Q_{i}y_{i}\leq \frac{\textrm{KL}(Q||P)+\ln K}{\beta }\] \end_inset \layout Proof given in \begin_inset LatexCommand \cite{PB} \end_inset . \layout Standard We will need to prove the following lemma in order to tighten the PAC-Bayes bound. It is analogous to Lemma 17 from \begin_inset LatexCommand \cite{PB} \end_inset . \layout Lemma \begin_inset LatexCommand \label{lem-tech} \end_inset For all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h) \) \end_inset , for all \begin_inset Formula \( \delta ,\alpha \in (0,1) \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( E_{p}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}>\frac{1}{\alpha \delta }\right) \leq \delta \] \end_inset \layout Proof For any given hypothesis \begin_inset Formula \( h \) \end_inset we will prove the following. \begin_inset Formula \begin{equation} \label{eq-expectation} \forall h\, \, E_{D^{m}}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{1}{\alpha } \end{equation} \end_inset The Lemma then follows from the sequence: \begin_inset Formula \[ \Rightarrow \forall p\, \, E_{p}E_{D^{m}}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{1}{\alpha }\] \end_inset \begin_inset Formula \[ \Rightarrow \forall p\, \, E_{D^{m}}E_{p}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{1}{\alpha }\] \end_inset \begin_inset Formula \[ \Rightarrow \forall p\, \, \Pr _{D^{m}}\left( E_{p}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\geq \frac{1}{\alpha \delta }\right) \leq \delta \] \end_inset \layout Proof Consequently, we must only prove equation \begin_inset LatexCommand \ref{eq-expectation} \end_inset . Given the hypothesis, we have a fixed true error rate, \begin_inset Formula \( e(h) \) \end_inset , and the empirical error rate \begin_inset Formula \( \hat{e}(h) \) \end_inset will be distributed like a Binomial. Let \begin_inset Formula \( R(e(h)) \) \end_inset be the random variable with a cumulative distribution given by the relative entropy Chernoff bound for a hypothesis with true error \begin_inset Formula \( e(h) \) \end_inset . In other words, define a cumulative distribution function on \begin_inset Formula \( [0,1] \) \end_inset according to: \begin_inset Formula \[ e^{-m\textrm{KL}\left( R||p\right) }\] \end_inset (note that we defined \begin_inset Formula \( \textrm{KL}() \) \end_inset so that it is always \begin_inset Formula \( 0 \) \end_inset when \begin_inset Formula \( \frac{k}{m}>p \) \end_inset ). Note that the relative entropy Chernoff bound implies \begin_inset Formula \( R \) \end_inset satisfies: \begin_inset Formula \[ \textrm{Bin}(m,mR,p)\leq e^{-m\textrm{KL}\left( R||p\right) }\] \end_inset whenever \begin_inset Formula \( mR \) \end_inset is integer. \layout Proof Since \begin_inset Formula \( e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))} \) \end_inset increases monotonically with decreasing \begin_inset Formula \( \hat{e}(h) \) \end_inset , the probability distribution function of \begin_inset Formula \( R(e(h)) \) \end_inset will have a larger expected value. In other words: \begin_inset Formula \[ E_{D^{m}}e^{(1-\alpha )m\textrm{KL}(\hat{e}(h)||e(h))}\leq E_{R}e^{(1-\alpha )m\textrm{KL}(R||e(h))}\] \end_inset The probability distribution function of \begin_inset Formula \( R \) \end_inset is given by: \begin_inset Formula \[ f(x)=\left\{ \begin{array}{ll} 0 & \textrm{for}\; x\geq p\\ & \\ -m\frac{\partial \textrm{KL}(x||p)}{\partial x}e^{-m\textrm{KL}(x||p)} & \textrm{for}\; x\leq p \end{array}\right. \] \end_inset Taking the expectation with respect to this distribution gives us: \begin_inset Formula \begin{eqnarray*} E_{R}\left[ e^{(1-\alpha )m\textrm{KL}(R||e(h))}\right] & = & \int _{0}^{1}e^{(1-\alpha )m\textrm{KL}(x||e(h))}f(x)dx\\ & & \\ & = & \int _{0}^{e(h)}-m\frac{\partial \textrm{KL}(x||e(h))}{\partial x}e^{-\alpha m\textrm{KL}(x||e(h))}dx\\ & = & \left. \frac{1}{\alpha }e^{-\alpha m\textrm{KL}(x||e(h))}\right| _{0}^{e(h)}\\ & \leq & \frac{1}{\alpha } \end{eqnarray*} \end_inset This finishes the proof of the technical lemma. \layout Standard Now, we can prove the relative entropy PAC-Bayes bound \begin_inset LatexCommand \ref{th-repbb} \end_inset . \layout Proof First, we can specialize lemma \protected_separator \begin_inset LatexCommand \ref{lem-tech} \end_inset with \begin_inset Formula \( \alpha =\frac{1}{m} \) \end_inset to get that with probability \begin_inset Formula \( 1-\delta \) \end_inset \begin_inset Formula \[ E_{p}e^{(m-1)\textrm{KL}(\hat{e}(h)||e(h))}\leq \frac{m}{\delta }\] \end_inset Apply lemma \begin_inset LatexCommand \ref{lem-opt} \end_inset with \begin_inset Formula \( K=\frac{m}{\delta } \) \end_inset , \begin_inset Formula \( \beta =m-1 \) \end_inset , \begin_inset Formula \( Q_{i}=q(h_{i}) \) \end_inset and \begin_inset Formula \( y_{i}=\textrm{KL}(\hat{e}(h)||e(h)) \) \end_inset to get: \begin_inset Formula \[ E_{q}\textrm{KL}(\hat{e}(h)||e(h))=\sum _{i=1}^{n}Q_{i}\textrm{KL}(\hat{e}(h_{i})||e(h_{i}))\leq \frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}\] \end_inset Jensen's inequality, gives us: \begin_inset Formula \[ \textrm{KL}(\hat{e}_{q}(h)||e_{q}(h))\leq E_{q}\textrm{KL}(\hat{e}(h)||e(h))\leq \frac{\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1}\] \end_inset which proves the theorem for the finite case. For the infinite case, a sequence of limits can be defined just as in \begin_inset LatexCommand \cite{PB} \end_inset . \layout Section PAC-Bayes Approximations \layout Subsection Approximating the empirical error \layout Standard \begin_inset LatexCommand \label{pb-approx} \end_inset \layout Standard In practice, it is not always easy to calculate some of the observable variables in the PAC-Bayes bound. In particular, \begin_inset Formula \( \hat{e}_{q}(h) \) \end_inset is not necessarily easy to calculate when \begin_inset Formula \( q \) \end_inset is some continuous distribution. We can avoid the need for a direct evaluation by a Monte Carlo evaluation and a bound on the tail of the Monte Carlo evaluation. Let \begin_inset Formula \( \hat{e}_{\hat{q}}(h)\equiv \Pr _{\hat{q},S}(h(x)\neq y) \) \end_inset be the observed rate of failure of \begin_inset Formula \( n \) \end_inset random hypotheses drawn according to \begin_inset Formula \( q(h) \) \end_inset and applied to a random training example. Once again, we have a familiar Binomial distribution. Direct calculation will give us: \layout Theorem (Monte Carlo Sampling Bound) For all \begin_inset Formula \( \delta \in (0,1] \) \end_inset \begin_inset Formula \[ \Pr _{q^{n}}\left( \, \, \hat{e}(h)>\bar{e}\left( n,\hat{e}_{\hat{q}}(h),\delta \right) \right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( n,\frac{k}{n},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(n,k,p)\}\geq \delta \) \end_inset \layout Proof Observer that the Monte Carlo estimate is distributed like a Binomial distributi on and apply the Binomial Tail bound. \layout Standard In order to calculate a bound on the expected true error rate, we can first bound the expected empirical error rate \begin_inset Formula \( \hat{e}_{q}(h) \) \end_inset with confidence \begin_inset Formula \( \frac{\delta }{2} \) \end_inset then bound the expected true error rate \begin_inset Formula \( e_{q}(h) \) \end_inset with confidence \begin_inset Formula \( \frac{\delta }{2} \) \end_inset , using our bound on \begin_inset Formula \( \hat{e}_{q}(h) \) \end_inset . Since the total probability of failure is only \begin_inset Formula \( \frac{\delta }{2}+\frac{\delta }{2}=\delta \) \end_inset our bound will hold with probability \begin_inset Formula \( 1-\delta \) \end_inset . \layout Subsection Derandomizing the PAC-Bayes bound \layout Standard It is sometimes desirable to derandomize the PAC-Bayes bound. There are several ways to do this. The next chapter will talk about replacing the randomization over \begin_inset Formula \( q(h) \) \end_inset with a thresholded average. Another technique is to simply pick a hypothesis according to \begin_inset Formula \( q(h) \) \end_inset . While this would probably be effective in practice, the theoretical guarantees that can be made for this technique are weak. Strong theoretical guarantees \emph on can \emph default be made for a similar technique. \layout Standard Suppose we make \begin_inset Formula \( n \) \end_inset draws form \begin_inset Formula \( q(h) \) \end_inset . Let the drawn hypotheses be \begin_inset Formula \( \{h_{1},...,h_{n}\} \) \end_inset . We can form a new distribution \begin_inset Formula \( \hat{q}(h) \) \end_inset which is uniform over the \begin_inset Formula \( n \) \end_inset draws. The true error rate of this distribution can be bound with high probability according to the following theorem. \layout Theorem (PAC-Bayes Derandomization) For all \begin_inset Formula \( \delta \in (0,1] \) \end_inset \begin_inset Formula \[ \Pr _{q^{n}}\left( \, \, e_{\hat{q}}(h)>\bar{e}\left( n,e_{q}(h),\delta \right) \right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( n,\frac{k}{n},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(n,k,p)\}\geq \delta \) \end_inset \layout Proof Observer that the distribution of \begin_inset Formula \( e_{\hat{q}}(h) \) \end_inset is distributed like a Binomial around \begin_inset Formula \( e_{q}(h) \) \end_inset and apply the Binomial Tail bound. \layout Standard Note the this theorem and the last theorem are essentially the same theorem. \layout Standard This theorem allows us to do an (incomplete) derandomization. Instead of drawing from \begin_inset Formula \( q(h) \) \end_inset in order to evaluate an input, we can draw from \begin_inset Formula \( \hat{q}(h) \) \end_inset which requires a fixed finite number of bits. This may allow for more efficient algorithms, and some people may find it reassuring that every hypothesis in \begin_inset Formula \( \{h_{1},...,h_{n}\} \) \end_inset has a low empirical error. The same confidence splitting trick of the last section can be used in order to guarantee \begin_inset Formula \( e_{q}(h) \) \end_inset is bounded and \begin_inset Formula \( e_{\hat{q}}(h) \) \end_inset is bounded given that \begin_inset Formula \( e_{q}(h) \) \end_inset is bounded. \layout Standard It is worth mentioning that \emph on no \emph default assumption of independence applies to either this theorem or the last theorem since we explicitly control (and create) the independence ourselves. These theorems hold for totally verifiable preconditions. \layout Section Application of the PAC-Bayes bound \layout Standard The goal of this chapter is making PAC-Bayes bounds more applicable. This is done by tightening the analysis from a Hoeffding-like to a Chernoff-lik e statement, and by noting that we can use monte-carlo evaluation to safely bound the stochastic empirical error rate quickly. \layout Standard In work detailed in chapter \begin_inset LatexCommand \ref{SNN} \end_inset , results for the application of PAC-Bayes bounds to stochastic neural networks is presented. PAC-Bayes bounds are one of very few approaches capable of producing nonvacuous learning theory bounds on continuous valued classifiers for real-world problems. \layout Chapter Averaging Bounds (Improved margin) \layout Standard \begin_inset LatexCommand \label{sec-average} \end_inset \layout Standard The work in this chapter is joint with Matthias Seeger and Nimrod Megiddo. It was first presented at ICML \begin_inset LatexCommand \cite{averaging_icml} \end_inset . \layout Standard Averaging bounds are specialized for averaging classifiers. An average has the form \begin_inset Formula \[ f(x)=\int q(h)h(x)dx\textrm{ or }f(x)=\int _{i=1}^{N}q(h_{i})h_{i}(x)dx\] \end_inset where \begin_inset Formula \( q(h_{i})\geq 0 \) \end_inset and \begin_inset Formula \( \int q(h)dh=1 \) \end_inset . Averaging classifiers have the form: \begin_inset Formula \[ c(x)=\textrm{sign}\left( f(x)\right) \] \end_inset Averaging bounds are especially interesting because there are many learning algorithms which use averaging. These techniques include: \layout Enumerate Boosting \begin_inset LatexCommand \cite{Boosting} \end_inset \layout Enumerate Bayes-Optimal learning (see section 6.7 of \begin_inset LatexCommand \cite{Tom} \end_inset ) \layout Enumerate Support Vector Machines \begin_inset LatexCommand \cite{SVM} \end_inset \layout Enumerate Bagging \begin_inset LatexCommand \cite{Bagging} \end_inset \layout Enumerate Maximum Entropy classification \begin_inset LatexCommand \cite{ME} \end_inset \layout Standard Viewed as an interactive proof of learning (see figure \begin_inset LatexCommand \ref{fig-averaging-protocol} \end_inset ) the bound presented here is almost the same as the PAC-Bayes bound except that it applies to the \emph on average \emph default over the posterior rather than to stochastic choices over the posterior. \layout Standard \begin_float fig \layout Standard \align center \begin_inset Figure size 267 152 file thesis-presentation/averaging.eps width 4 90 flags 9 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-averaging-protocol} \end_inset For the averaging bound, the learning must commit to some measure \begin_inset Formula \( p(h) \) \end_inset , receive examples, and then commit to another measure \begin_inset Formula \( q(h) \) \end_inset . The true error bound applies to the \emph on average \emph default over \begin_inset Formula \( q(h) \) \end_inset rather than stochastic choices as in the PAC-Bayes bound \begin_inset LatexCommand \ref{sec-PB} \end_inset . \end_float \layout Standard The bound in this section is a qualitative improvement on prior results for averaging bounds. For the average learning algorithms listed above, the form of the improvements is most interesting for Maximum Entropy Classification and Bayes-Optimal classification. All (currently known) specialized averaging bounds use as a parameter the \begin_inset Quotes eld \end_inset margin \begin_inset Quotes erd \end_inset . For this section only, suppose the label and the hypotheses have value \begin_inset Formula \( -1 \) \end_inset or \begin_inset Formula \( 1 \) \end_inset . ( \begin_inset Formula \( y\in \{-1,1\} \) \end_inset and \begin_inset Formula \( h(x)\in \{-1,1\} \) \end_inset ). Then, the margin will be defined as \begin_inset Formula \[ t(x,y)=yf(x)\] \end_inset Some simple observations are immediate: \layout Enumerate The margin is bounded. \begin_inset Formula \( t(x,y)\in [-1,1] \) \end_inset \layout Enumerate If the classifier is correct, the margin is positive. \begin_inset Formula \( c(x)=y\Rightarrow t(x,y)\in (0,1] \) \end_inset \layout Standard The \emph on error \emph default at some margin is the quantity actually used in the averaging bounds. The empirical error at margin \begin_inset Formula \( \theta \) \end_inset is defined by: \begin_inset Formula \[ \hat{e}_{\theta }(c)=\Pr _{S}(t(x,y)\leq \theta )\] \end_inset and the true error at margin \begin_inset Formula \( \theta \) \end_inset is defined by: \begin_inset Formula \[ e_{\theta }(c)=\Pr _{D^{m}}(t(x,y)\leq \theta )\] \end_inset Note that \begin_inset Formula \( e_{0}(c)=e_{D}(c) \) \end_inset (the true error rate of \begin_inset Formula \( c \) \end_inset ). \layout Standard The \begin_inset Quotes eld \end_inset margin \begin_inset Quotes erd \end_inset is a useful way to parameterize our learning algorithms. It will turn out that the sample complexity will be low (and the guarantees we can make better) when the margin of most of the training examples is large. There is, however, a price associated with using the margin: some hypotheses have no notion of a margin. Thus the theory in this chapter is less general than appears elsewhere. \layout Section Earlier Results \layout Standard \begin_inset LatexCommand \label{sec-earlier} \end_inset \layout Standard The improved averaging bound arises from improving one critical step in the proof of the original margin bound \begin_inset LatexCommand \cite{Margin} \end_inset which is stated next. \layout Theorem (Margin Bound \begin_inset LatexCommand \cite{Margin} \end_inset ) \begin_inset LatexCommand \label{th-margin} \end_inset For all \begin_inset Formula \( \delta \in (0,1] \) \end_inset , for all base hypothesis spaces, \begin_inset Formula \( H \) \end_inset , \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists c,\theta \in (0,1]:\, \, e(c)>\hat{e}_{\theta }(c)+O\left( \sqrt{\frac{\frac{\ln |H|}{\theta ^{2}}\ln m+\ln \frac{1}{\delta }}{m}}\right) \right) \leq \delta \] \end_inset \layout Proof Given in \begin_inset LatexCommand \cite{Margin} \end_inset . A simplification of the improved averaging bound proof. \layout Standard Here, the notation \begin_inset Formula \( b(m)=O(a(m)) \) \end_inset means there exists a constant \begin_inset Formula \( C \) \end_inset such that \begin_inset Formula \( b(m)\leq C\cdot a(m) \) \end_inset for all \begin_inset Formula \( m \) \end_inset . This margin bound implies that if most training examples have a large margin \begin_inset Formula \( \theta \) \end_inset (i.e. \begin_inset Formula \( t(x,y)>\theta \) \end_inset for most \begin_inset Formula \( (x,y)\in S \) \end_inset ) and the hypothesis space is not too large, then the generalization error cannot be large. This theorem can only be non-vacuous when the base hypothesis space is finite. There are various extensions (see \begin_inset LatexCommand \cite{Margin} \end_inset ) of this bound for continuous hypothesis spaces based upon VC dimension and covering number techniques. However, the extensions tend to result in extremely loose guarantees and are not relevant to the discussion here. One of the advantages of the improved averaging bound is that it \emph on can \emph default apply in a non-vacuous way to infinite hypothesis spaces. This generalization comes about with essentially zero loosening of the underlying bound. \layout Section A generalized averaging bound \layout Standard Before discussing the main theorem, it is important to notice that the averaging classifier, \begin_inset Formula \( c(x) \) \end_inset \emph on implies \emph default a distribution over the base hypothesis space \begin_inset Formula \( H \) \end_inset . This implied distribution is \begin_inset Formula \( q_{c}(h) \) \end_inset where \begin_inset Formula \[ c(x)=\textrm{sign}(\int h(x)q_{c}(h)dh)\] \end_inset The distribution \begin_inset Formula \( q_{c} \) \end_inset is used in the following theorem. \layout Theorem \begin_inset LatexCommand \label{th-averaging} \end_inset (Relative Entropy Averaging Theorem) For all distribution \begin_inset Formula \( p(h) \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \layout Theorem \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists c,\theta \in (0,1]:\, \, \textrm{KL}\left( \hat{e}_{\theta }(c)||e(c)\right) \geq O\left( \frac{\frac{\textrm{KL}\left( q_{c}||p\right) }{\theta ^{2}}\ln m+\ln m+\ln \frac{1}{\delta }}{m}\right) \right) \leq \delta \] \end_inset \layout Proof Given in the next section. \layout Standard The main theorem uses a KL-divergence based pseudodistance which is a bit hard to understand intuitively. In order to gain intuition, we can relax the tightness of the proof with an inequality. \layout Standard \begin_inset Formula \[ \textrm{KL}(\hat{e}_{\theta }(c)||e(c))\geq 2(\hat{e}_{\theta }(c)-e(c))^{2}\] \end_inset \layout Standard This relaxation gives us an immediate corollary. \layout Corollary (Relative Entropy Averaging Theorem) For all distribution \begin_inset Formula \( p(h) \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \layout Theorem \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists c,\theta \in (0,1]:\, \, e(c)\geq \hat{e}_{\theta }(c)+O\left( \sqrt{\frac{\frac{\textrm{KL}\left( q_{c}||p\right) }{\theta ^{2}}\ln m+\ln m+\ln \frac{1}{\delta }}{m}}\right) \right) \leq \delta \] \end_inset \layout Standard This theorem improves upon theorem \begin_inset LatexCommand \ref{th-margin} \end_inset because \begin_inset Formula \( \textrm{KL}(q_{c}||p) \) \end_inset is used instead of \begin_inset Formula \( \ln |H| \) \end_inset . For the case of a uniform distribution on \begin_inset Formula \( |H| \) \end_inset different base classifiers, these results will agree when the average is over just one classifier. As the average becomes \begin_inset Quotes eld \end_inset broader \begin_inset Quotes erd \end_inset the results will improve. In the limit when the average is over nearly all classifiers, the term \begin_inset Formula \( \textrm{KL}(q_{c}||p) \) \end_inset will be nearly \begin_inset Formula \( 0 \) \end_inset . \layout Standard The theorems are stated in an asymptotic fashion which is not be very useful in practical applications. Section \begin_inset LatexCommand \ref{sec-tighten} \end_inset gives some ideas of how to tighten the result, and the non-asymptotic form ( \begin_inset LatexCommand \ref{eq-finalbound} \end_inset ) given at the end of the proof can be used directly in practice. \layout Standard The improved averaging bound applies to averages over continuous hypothesis spaces. In this setting, the average needs to be an integral over an uncountably-infini te set of hypotheses or the KL-divergence will not converge to a finite value. It is exactly because of this limitation that the improvements of this bound are most applicable to Bayes Optimal and Maximum Entropy classifiers. \layout Standard In practice, the limitation may not be a significant problem because machine learning algorithms over large hypothesis spaces typically have some parameter stability. In other words, a small shift in the parameters of the learned model produces a small change in the prediction of the hypothesis. With hypothesis stability, we can convert any average over a finite set of hypotheses into an average over an infinite set of hypotheses without significantly altering the predictions of the average. This technique is explored in chapter \begin_inset LatexCommand \ref{SNN} \end_inset with positive results. \layout Section Proof of main theorem \layout Standard \begin_inset LatexCommand \label{sec-main-proof} \end_inset \layout Subsection Definitions and observations \layout Standard The proof has the same structure as the original margin bound ( \begin_inset LatexCommand \ref{th-margin} \end_inset ) proof with one step replaced by the application of the relative entropy PAC-Bayes theorem ( \begin_inset LatexCommand \ref{th-repbb} \end_inset ). \layout Standard Let \begin_inset Formula \( N \) \end_inset be any natural number; later, the choice of \begin_inset Formula \( N \) \end_inset will be optimized. In the first part of the proof, we regard \begin_inset Formula \( \theta \) \end_inset and \begin_inset Formula \( N \) \end_inset as fixed. Later we generalize this so that they may depend on the sample \begin_inset Formula \( S \) \end_inset . \layout Standard We construct the distribution \begin_inset Formula \( q_{N} \) \end_inset as follows. Draw \begin_inset Formula \( N \) \end_inset hypotheses \begin_inset Formula \( h_{i}\sim q(h) \) \end_inset independently. \begin_inset Formula \( Q_{N} \) \end_inset might therefore be viewed as the product distribution \begin_inset Formula \begin{equation} q_{c}(h)^{N}. \end{equation} \end_inset Given the \begin_inset Formula \( h_{i} \) \end_inset we define \begin_inset Formula \begin{equation} \label{eq-sample} g(x)=\frac{1}{N}\sum _{i=1}^{N}h_{i}(x) \end{equation} \end_inset \layout Standard For fixed \begin_inset Formula \( x,\, y \) \end_inset , the value of \begin_inset Formula \( yh(x) \) \end_inset are i.i.d. \latex latex \latex default Bernoulli variables with the mean equal to the expected margin: \begin_inset Formula \begin{equation} E_{q\sim h}yg(x)=yf(x) \end{equation} \end_inset Therefore, \begin_inset Formula \( \forall x,y\, \, E_{g\sim Q_{N}}[yg(x)]=yf(x) \) \end_inset and \begin_inset Formula \( yg(x) \) \end_inset is distributed according to the familiar Binomial distribution. Later, we will apply a Binomial tail bound on this quantity. \layout Standard Before writing out the proof mathematically, it is helpful to consider a graphical view of what we will prove. We will force convergence between three quantities. \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 148 19 file averaging.eps width 4 50 flags 9 \end_inset \layout Standard The convergences are then tied together with triangle inequalities resulting in the overall proof. The critical improvement of this paper is a refined version of the second convergence. \layout Subsection The Proof \layout Standard For every \begin_inset Formula \( \theta \in (0,1] \) \end_inset and for every (fixed) \begin_inset Formula \( g \) \end_inset , the following simple inequality holds: \begin_inset Formula \begin{equation} \label{eq-simple} e(f)=\Pr _{D}[yf(x)\leq 0] \end{equation} \end_inset \begin_inset Formula \[ =\Pr _{D}[yg(x)\leq \frac{\theta }{2},\, yf(x)\leq 0]+\Pr _{D}[yg(x)>\frac{\theta }{2},\, yf(x)\leq 0]|\, yf(x)\leq 0]\] \end_inset \begin_inset Formula \[ \leq \Pr _{D}[yg(x)\leq \frac{\theta }{2}]+\Pr _{D}[yg(x)>\frac{\theta }{2}\, |\, yf(x)\leq 0]\] \end_inset Note that the left-hand side does not depend on \begin_inset Formula \( g \) \end_inset . By taking the expectation over \begin_inset Formula \( g\sim Q_{N} \) \end_inset (and exchanging the order of expectations in the second term on the right-hand side), we arrive at \begin_inset Formula \begin{equation} e(f)\leq E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] +E_{D}\left[ P_{g\sim Q_{N}}[yg(x)>\frac{\theta }{2}\, |\, yf(x)\leq 0]\right] . \end{equation} \end_inset For the rightmost term, we can apply a Binomial tail bound to get: \layout Standard \begin_inset Formula \begin{equation} e(f)\leq E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] +E_{D}\left[ 1-\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,0\right) \right] \end{equation} \end_inset \begin_inset Formula \begin{equation} =E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] +1-\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,0\right) \end{equation} \end_inset We would like to apply the PAC-Bayes theorem \begin_inset LatexCommand \ref{th-repbb} \end_inset to the right-hand term. Here we use the loss function \begin_inset Formula \( I_{\{yg(x)\leq \theta /2\}} \) \end_inset . The PAC-Bayes theorem applies for any \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset distribution. We use as the \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset the distribution \begin_inset Formula \( P_{N}=p(h)^{N} \) \end_inset where \begin_inset Formula \( p(h) \) \end_inset is the \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset over the base hypothesis space. The choice of this prior allows us to use the identity: \begin_inset Formula \[ \textrm{KL}(Q_{N}||P_{N})=N\textrm{KL}\{q(h)||p(h))\] \end_inset It follows from Theorem \begin_inset LatexCommand \ref{th-repbb} \end_inset that: with probability at least \begin_inset Formula \( 1-\delta \) \end_inset over random choices of \begin_inset Formula \( S \) \end_inset , for every \begin_inset Formula \( Q \) \end_inset , \layout Standard \begin_inset Formula \begin{equation} \label{eq-usemcall} \Pr _{D}\left( \exists q(h):\, \, \textrm{KL}\left( \hat{e}_{\frac{\theta }{2}}(g)||e_{\frac{\theta }{2}}(g)\right) \leq \frac{\textrm{KL}(Q_{N}||P_{N})+\ln \frac{m}{\delta }}{m-1}\right) \end{equation} \end_inset where \begin_inset Formula \( e_{\frac{\theta }{2}}(g)=1 \) \end_inset with probability \begin_inset Formula \( E_{g\sim Q_{N}}\left[ \Pr _{D}[yg(x)\leq \frac{\theta }{2}]\right] \) \end_inset and \begin_inset Formula \( 0 \) \end_inset otherwise, and \begin_inset Formula \( \hat{e}_{\frac{\theta }{2}}(g)=1 \) \end_inset with probability \begin_inset Formula \( E_{g\sim Q_{N}}\left[ \Pr _{S}[yg(x)\leq \frac{\theta }{2}]\right] \) \end_inset and \begin_inset Formula \( 0 \) \end_inset otherwise. \layout Standard By the same argument as in ( \begin_inset LatexCommand \ref{eq-simple} \end_inset ), we get: \begin_inset Formula \begin{equation} \hat{e}_{\frac{\theta }{2}}(g)=\Pr _{S}[yg(x)\leq \frac{\theta }{2}]\leq \Pr _{S}[yg(x)\leq \frac{\theta }{2},\; yf(x)>\theta ]+\Pr _{S}[yf(x)\leq \theta ] \end{equation} \end_inset \begin_inset Formula \begin{equation} \leq \Pr _{S}[yg(x)\leq \frac{\theta }{2}\, |\, yf(x)>\theta ]+\Pr _{S}[yf(x)\leq \theta ] \end{equation} \end_inset \begin_inset Formula \begin{equation} =\Pr _{S}[yg(x)\leq \frac{\theta }{2}\, |\, yf(x)>\theta ]+\hat{e}_{\theta }(f) \end{equation} \end_inset Again, we take expectations over \begin_inset Formula \( g\sim Q_{N} \) \end_inset on both sides, interchange the order of the expectations and apply the Binomial tail bound to get: \begin_inset Formula \begin{equation} \label{eq-secondhoeff} E_{S}\left[ P_{g\sim Q_{N}}[yg(x)\leq \frac{\theta }{2}]\right] \leq \textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1+\theta }{2}\right) +\Pr _{S}\left[ yf(x)\leq \theta \right] . \end{equation} \end_inset Combining our inequalities, we conclude that with probability at least \begin_inset Formula \( 1-\delta \) \end_inset , for every \begin_inset Formula \( q(h) \) \end_inset \begin_inset Formula \begin{equation} \textrm{KL}(q_{S}||p_{D})\leq \frac{N\textrm{KL}(q||p)+\ln \frac{m}{\delta }}{m-1} \end{equation} \end_inset where \begin_inset Formula \( q_{S}=1 \) \end_inset with probability \begin_inset Formula \( \textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1+\theta }{2}\right) +\Pr _{S}\left[ yf(x)\leq \theta \right] \) \end_inset and \begin_inset Formula \( 0 \) \end_inset otherwise, and \begin_inset Formula \( p_{D}=1 \) \end_inset with probability \begin_inset Formula \( \Pr _{D}[yf(x)\leq 0]-1+\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,0\right) \) \end_inset and \begin_inset Formula \( 0 \) \end_inset otherwise. \layout Standard This bound holds for any fixed \begin_inset Formula \( N \) \end_inset and \begin_inset Formula \( \theta \) \end_inset , which is not yet what we need here, since we want to allow these to depend on the data \begin_inset Formula \( S \) \end_inset . In essence, the bound we proved so far is a statement about certain events, parameterized by \begin_inset Formula \( N \) \end_inset and \begin_inset Formula \( \theta \) \end_inset , namely the probability of each event is smaller than \begin_inset Formula \( \delta \) \end_inset . However, we need to prove that the probability of the \emph on union \emph default of all these events is smaller than \begin_inset Formula \( \delta \) \end_inset . To this end, we first observe that this union is contained in the union over only a \emph on countable \emph default number of events. Note that \begin_inset Formula \( g(x)\in \{(2k-N)/N|k=0,1,\ldots ,N\} \) \end_inset . Thus, even with all the possible (positive) values of \begin_inset Formula \( \theta \) \end_inset , there are no more than \begin_inset Formula \( N+1 \) \end_inset events of the form \begin_inset Formula \( \{yg(x)\leq \theta /2\} \) \end_inset . Denote by \begin_inset Formula \( k(\theta ,N) \) \end_inset the largest integer \begin_inset Formula \( k \) \end_inset such that \begin_inset Formula \( k/N\leq \theta /2 \) \end_inset . We observe that for every \begin_inset Formula \( \theta >0 \) \end_inset , every \begin_inset Formula \( g \) \end_inset and every distribution over \begin_inset Formula \( (x,y) \) \end_inset : \begin_inset Formula \begin{equation} \Pr \left[ yg(x)\leq \theta /2\right] =\Pr \left[ yg(x)\leq \frac{k(\theta ,N)}{N}\right] . \end{equation} \end_inset This means that the middle step in the proof above, i.e. \latex latex \latex default the application of theorem \begin_inset LatexCommand \ref{th-repbb} \end_inset , depends on \begin_inset Formula \( (N,\theta ) \) \end_inset only through \begin_inset Formula \( (N,k) \) \end_inset . Since the other steps are true with probability one, we see that we can restrict ourselves to the union of countably many events, indexed by \begin_inset Formula \( (N,k) \) \end_inset . Now, we \begin_inset Quotes eld \end_inset allocate \begin_inset Quotes erd \end_inset parts of the confidence quantity \begin_inset Formula \( \delta \) \end_inset to each of these events, namely \begin_inset Formula \( (N,k) \) \end_inset receives \begin_inset Formula \( \delta _{N,k}=\delta /(N(N+1)^{2}),\; N=1,2,\dots ;\, k=0,\dots ,N \) \end_inset . It follows easily that the union of all these events has probability at most \begin_inset Formula \( \sum _{N,k}\delta _{N,k}=\delta \) \end_inset . Therefore we have proved that with probability at least \begin_inset Formula \( 1-\delta \) \end_inset over random choices of \begin_inset Formula \( S \) \end_inset it holds true that for \emph on all \emph default \begin_inset Formula \( N \) \end_inset and \emph on all \emph default \begin_inset Formula \( \theta \in (0,1] \) \end_inset , \begin_inset Formula \begin{equation} \label{eq-finalbound} \textrm{KL}(q_{S}||p_{D})\leq \frac{N\textrm{KL}(Q||P)+\ln \frac{2m}{\delta _{N,k}}}{m-1}\leq \frac{N\textrm{KL}(Q||P)+\ln \frac{2m}{\delta }+3\ln N+1}{m-1} \end{equation} \end_inset We can now choose \begin_inset Formula \( N \) \end_inset to minimize this bound. \begin_inset Formula \( N \) \end_inset may depend on \begin_inset Formula \( \theta ,\, Q \) \end_inset and the sample \begin_inset Formula \( S \) \end_inset . \layout Standard The asymptotic bound stated in the theorem can be derived by choosing \begin_inset Formula \( N \) \end_inset (with respect to \begin_inset Formula \( \theta \) \end_inset and \begin_inset Formula \( Q \) \end_inset ) so as to \emph on approximately \emph default minimize the bound we have derived above. The first step in this minimization is the replacement of the Binomial tail with a looser approximation such as the Hoeffding bound which gives us: \begin_inset Formula \[ 1-\textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,0\right) \leq e^{-2N\frac{\theta ^{2}}{4^{2}}}=e^{-N\frac{\theta ^{2}}{8}}\] \end_inset \begin_inset Formula \[ \textrm{Bin}\left( N,\left\lceil N\frac{1+\theta /2}{2}\right\rceil ,\frac{1+\theta }{2}\right) \leq e^{-2N\frac{\theta ^{2}}{4^{2}}}=e^{-N\frac{\theta ^{2}}{8}}\] \end_inset \layout Standard We can then choose \begin_inset Formula \[ N=\left\lceil 8\frac{\ln \frac{m}{\textrm{KL}(q||p)}}{\theta ^{2}}\right\rceil .\] \end_inset \layout Standard This choice gives us: \layout Standard \begin_inset Formula \[ e^{-\frac{N\theta ^{2}}{8}}=\frac{\textrm{KL}(q||p)}{m}\] \end_inset \layout Standard Which implies we have an equation of the form: \layout Standard \begin_inset Formula \begin{equation} \textrm{KL}\left( q+\frac{\textrm{KL}(q||p)}{m}||p-\frac{\textrm{KL}(q||p)}{m}\right) \leq \frac{\theta ^{-2}\textrm{KL}(q||p)\ln m+\ln m+\ln \frac{1}{\delta }}{m} \end{equation} \end_inset In order to prove the result, we must show that: \begin_inset Formula \begin{equation} \textrm{KL}\left( q+\frac{\textrm{KL}(q||p)}{m}||p-\frac{\textrm{KL}(q||p)}{m}\right) =O\left( \textrm{KL}\left( q||p\right) \right) \end{equation} \end_inset \layout Standard We can do this by taking the difference to get an equation of the form: \layout Standard \begin_inset Formula \( (q+k)\ln \frac{q+k}{p-k}+(1-q-k)\ln \frac{1-q-k}{1-p+k}-q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \) \end_inset \layout Standard If \begin_inset Formula \( p-q>2k \) \end_inset and \begin_inset Formula \( k<\frac{1}{2} \) \end_inset then we have: \layout Standard \begin_inset Formula \( \simeq (q+k)\ln \frac{q+k+\frac{k}{p}}{p}+(1-q-k)\ln \frac{1-q-k-\frac{k}{1-p}}{1-p}-q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \) \end_inset \layout Standard \begin_inset Formula \( \simeq (q+k)[\ln \frac{q}{p}+\frac{k+\frac{k}{p}}{\frac{q}{p}}+(1-q-k)\ln \frac{1-q}{1-p}-\left[ \frac{k+\frac{k}{1-p}}{\frac{1-q}{1-p}}\right] -q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \) \end_inset \layout Standard \begin_inset Formula \( \simeq (q+k)[\ln \frac{q}{p}+k(\frac{p+1}{q})]+(1-q-k)[\ln \frac{1-q}{1-p}-k(\frac{2-p}{1-q})]-q\ln \frac{q}{p}-(1-q)\ln \frac{1-q}{1-p} \) \end_inset \layout Standard \begin_inset Formula \( \simeq k(1+p)-k(2-p) \) \end_inset \layout Standard \begin_inset Formula \( \simeq k \) \end_inset \layout Standard Since we have that the difference \begin_inset Formula \begin{equation} \textrm{KL}\left( q+\frac{\textrm{KL}(q||p)}{m}||p-\frac{\textrm{KL}(q||p)}{m}\right) -\textrm{KL}\left( q||p\right) \simeq [\textrm{KL}\left( q||p\right) ] \end{equation} \end_inset the main theorem follows immediately. \layout Section Methods for tightening \layout Standard \begin_inset LatexCommand \label{sec-tighten} \end_inset \layout Standard The previous section showed a bound in asymptotic form which is good for understanding the trade-offs between the number of examples ( \begin_inset Formula \( m \) \end_inset ), the size of the hypothesis space ( \begin_inset Formula \( |H| \) \end_inset ), the margin ( \begin_inset Formula \( \theta \) \end_inset ) and the entropy of the average ( \begin_inset Formula \( H(Q) \) \end_inset ). However, it is not a good form for those interested in quantitative application of the bound to specific problems. We state improvements which aid in the development of a quantitatively applicable bound. We can tighten the bound above through several techniques: \layout Enumerate Parameterizing and then optimizing the parameterization of arbitrary choices within the proof. In the (improved) margin bound proof, we arbitrarily decided to work with the margin of the randomly produced function \begin_inset Formula \( g(x) \) \end_inset at \begin_inset Formula \( \frac{\theta }{2} \) \end_inset . This is a good heuristic, but not the optimal choice when we use the improved tail bounds. Since the decision of the margin for the random function \begin_inset Formula \( g(x) \) \end_inset is a parameter of the proof, we are free to optimize it. \layout Enumerate Tighter argument within the proof. The optimal value of \begin_inset Formula \( N \) \end_inset is a function of \begin_inset Formula \( \theta ,m,\textrm{KL}(q||p) \) \end_inset and \begin_inset Formula \( \delta \) \end_inset . All of these are known in advance except for \begin_inset Formula \( \textrm{KL}(q||p) \) \end_inset . If we can estimate in advance the value of \begin_inset Formula \( \textrm{KL}(q||p) \) \end_inset , then it becomes possible to optimize the value of \begin_inset Formula \( N \) \end_inset in a data-independent manner. Consequently, it becomes unnecessary to split our confidence over the possible values of \begin_inset Formula \( N \) \end_inset and we need only split the confidence over the values of \begin_inset Formula \( \theta \) \end_inset in proving the bound. The effect of this improvement is reducing \begin_inset Formula \( 1/\delta _{N,k}=1/(N(N+1)^{2}) \) \end_inset to \begin_inset Formula \( 1/\delta _{N,k}=1/(N(N+1)) \) \end_inset giving us a small improvement in the low order terms of the improved averaging bound. \layout Section Final thoughts for Averaging Bounds \layout Standard The practical implication of this more-refined analysis of averaging bounds is more support for the practice of averaging. Sufficient averaging over the hypothesis space can reduce the \begin_inset Quotes eld \end_inset complexity \begin_inset Quotes erd \end_inset allowing tight estimates on the true error rate---possibly even tighter than could be guaranteed with a single hypothesis. \layout Standard The improved averaging bound is not yet the tightest possible and it appears there are several possible theoretical improvements. \layout Enumerate Remove the low order terms from the bound to make it more quantitatively applicable. \layout Enumerate Improve the argument to take into account the \emph on distribution \emph default of the margin rather than the margin at some point. \layout Enumerate Prove a lower bound which corresponds to the upper bound given here. Since no good lower bound yet exists, we do not know that large improvements in the upper bound are not possible. \layout Standard Open Problem: In practice, is the averaging bound ( \begin_inset LatexCommand \ref{eq-finalbound} \end_inset ) ever better than the PAC-Bayes bound \begin_inset LatexCommand \ref{th-repbb} \end_inset ? The extra triangle inequality applications in the averaging bound may make it worse than the PAC-Bayes bound in practice. \layout Chapter Computable Shell bounds \layout Standard \begin_inset LatexCommand \label{sec-shell} \end_inset \layout Standard The first shell bound paper was joint work with David McAllester and was presented at Colt \begin_inset LatexCommand \cite{Shell} \end_inset . The work presented here incorporates significant refinement, generalization, and simplification of the first Colt paper. \layout Standard Roughly speaking, the shell bound (usually) provides much tighter true error rate upper bounds on learned hypotheses than conventional Occam's Razor bound (theorem \begin_inset LatexCommand \ref{th-ORB} \end_inset ) or PAC-Bayes bounds (theorem s \begin_inset LatexCommand \ref{th-repbb} \end_inset ). It does this without violating lower upper bounds \begin_inset LatexCommand \ref{sec-lower_upper} \end_inset by incorporating \emph on much \emph default more information into the bound calculation. \layout Standard The inspiration behind the work on Shell bounds rests on two pieces of work. In \begin_inset LatexCommand \cite{old_shell} \end_inset by Haussler, Kearns, Seung, and Tishby, learning theory curves are investigated from an omniscient point of view where the true error rates of various hypotheses are known. The principle improvement in this paper is that our bounds are reduced to \emph on observable \emph default quantities. Put another way, we do not need to know the underlying learning distribution, \begin_inset Formula \( D \) \end_inset . In \begin_inset LatexCommand \cite{Tobias} \end_inset , an analysis was made assuming some distribution over true error rates. Our analysis does not rely on any assumption about the distribution of true error rates---only the independence assumption is made. Despite using only observable information and making no extra assumptions, the results here are quite tight and yield practical results. \layout Standard We start with the distribution of empirical errors over hypotheses and subtract a small amount from the empirical error rates to create a pessimistic distribut ion. With high probability, the cumulative of the pessimistic distribution will lower bound the cumulative distribution of hypothesis true error rates. Given this, we can directly calculate a bound on the probability that a \begin_inset Quotes eld \end_inset large \begin_inset Quotes erd \end_inset hypothesis will produce a misleadingly small error. This bound can be \emph on much \emph default tighter than standard union bound techniques although the quantity of improvemen t is highly problem dependent. \layout Standard After presenting the first bound, we will transform it into a bound on continuou s hypothesis spaces using a PAC-Bayes like approach \begin_inset LatexCommand \cite{PB} \end_inset . \layout Standard Viewed as an interactive proof of learning (figure \begin_inset LatexCommand \ref{fig-shell-protocol} \end_inset ), the stochastic shell bound is much like the PAC-Bayes bound. \layout Standard \begin_float fig \layout Standard \align center \begin_inset Figure size 267 153 file thesis-presentation/shell.eps width 4 90 flags 9 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-shell-protocol} \end_inset The stochastic shell bound, as an interactive proof of learning, has the same general outline as the PAC-Bayes bound except that \emph on much \emph default more information is required in order to calculate the bound. The shell bound (proved first below) is a simplification which is somewhat tighter when the \begin_inset Quotes eld \end_inset Posterior \begin_inset Quotes erd \end_inset places all mass on one hypothesis. \end_float \layout Standard The strongest criticism of shell bounds is, in fact, that too much information is required. While this information is always theoretically observable, it may not be tractable to collect. There are two answers to this criticism given here. The first is an empirical employment on decision tree learning algorithms which shows that in practice, there are often shortcuts which make the information gathering feasible. The second answer is to construct a sampled version of the shell bound which approximate versions of the information required by the shell bound. We will: \layout Enumerate Present the discrete shell bound \layout Enumerate Present the sampled version of the shell bound. \layout Enumerate Extend the discrete shell bound to continuous spaces \layout Section The Discrete Shell Bound \layout Subsection Knowledge of learning distribution \layout Standard Let \begin_inset Formula \( B(m,K,p)=\binom{m}{K}e(h)^{K}(1-e(h))^{m-K} \) \end_inset be the probability that hypothesis with true error \begin_inset Formula \( p \) \end_inset produces \begin_inset Formula \( K \) \end_inset errors on \begin_inset Formula \( m \) \end_inset independent examples. \layout Standard The discrete shell bound works directly with the probability that there will be a confusingly small empirical error. Let \begin_inset Formula \[ P(\epsilon ,K)\equiv \sum _{h:e(h)>K+\epsilon }B(m,K,e(h))\] \end_inset \layout Standard Intuitively, \begin_inset Formula \( P(\epsilon ,K) \) \end_inset is a bound on the probability that a hypotheses with a true error rate larger than \begin_inset Formula \( \frac{K}{m}+\epsilon \) \end_inset will have an empirical error rate of \begin_inset Formula \( \frac{K}{m} \) \end_inset . The contribution to the sum will fall off exponentially as the true error, \begin_inset Formula \( e(h) \) \end_inset , increases. \layout Standard Our first step is stating a shell bound which requires unknown information. The purpose of this bound is motivational - it provides incite into why we can expect a large improvement. Later, we will remove the unknown information requirements and recover a useful bound. \layout Theorem \begin_inset LatexCommand \label{Full_knowledge} \end_inset (Full knowledge theorem) For all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}(\exists h\in H\, \, e(h)\geq \hat{e}(h)+\epsilon (\delta ,\hat{e}(h)))\leq \delta \] \end_inset where \begin_inset Formula \( \epsilon (\delta ,\frac{K}{m})=\min \left\{ \epsilon :\, \, P(\epsilon ,K)=\frac{\delta }{m}\right\} \) \end_inset \layout Standard The full knowledge theorem relies on unobservable information---the true error rates of all hypotheses. This theorem is not (quite) trivial because it does not rely upon information about \emph on which \emph default hypothesis has a particular true error. \layout Proof For every hypothesis with a true error rate of \begin_inset Formula \[ e(h)>\frac{K}{m}+\epsilon (\delta ,K)\] \end_inset the probability of producing an empirical error of \begin_inset Formula \[ \hat{e}(h)=\frac{K}{m}\] \end_inset is \begin_inset Formula \( B(K,e(h),m) \) \end_inset . Applying the union bound over all hypotheses and all \begin_inset Formula \( m \) \end_inset possible nontrivial values of \begin_inset Formula \( K \) \end_inset completes the proof. \layout Standard There are a couple things to note about this theorem. First, for most balanced machine learning problems most hypotheses typically have a true error rate near to \begin_inset Formula \( 0.5 \) \end_inset . Given this, and noticing that Binomial tails fall of exponentially, dramatic improvements in the bound are feasible. \layout Standard Second, we must use \begin_inset Formula \( \frac{\delta }{m} \) \end_inset rather than \begin_inset Formula \( \delta \) \end_inset in order to make the proof work. It is possible that theorem holds without splitting \begin_inset Formula \( \delta \) \end_inset \begin_inset Quotes eld \end_inset \begin_inset Formula \( m \) \end_inset -ways \begin_inset Quotes erd \end_inset . Removing the factor of \begin_inset Formula \( m \) \end_inset is an open problem. For the special case of empirical risk minimization algorithms, McDiarmid's inequality \begin_inset LatexCommand \cite{McD} \end_inset implies that the range of hypotheses with minimum empirical error is of size \begin_inset Formula \( O(\sqrt{m}) \) \end_inset with high probability. Therefore, we need only apply the union bound to \begin_inset Formula \( O(\sqrt{m}) \) \end_inset possible minimum empirical error rates. \layout Subsection No knowledge of learning distribution \layout Standard Applying the full knowledge theorem ( \begin_inset LatexCommand \ref{Full_knowledge} \end_inset ) is not practical in almost all learning settings because we do not know the distribution of true error rates. Therefore, it is necessary to construct a slightly looser theorem which relies upon only observable quantities. Surprisingly, this is possible while introducing only slightly more slack. \layout Standard First, we need a couple of definitions. \layout Standard \begin_inset Formula \[ \bar{e}(\epsilon ,K,\delta ,\frac{i}{m})=\textrm{max }\left\{ \frac{K}{m}+\epsilon ,\, \, \textrm{min}\, \left\{ p:\, \, 1-\textrm{Bin}(m,i,p)\geq \delta \right\} \right\} \] \end_inset Intuitively, \begin_inset Formula \( \bar{e}(\epsilon ,K,\delta ,\frac{i}{m}) \) \end_inset is a \emph on lower \emph default bound on the true error rate of the hypothesis with an empirical error of \begin_inset Formula \( \frac{i}{m} \) \end_inset . \layout Standard \begin_inset Formula \[ \hat{P}(\epsilon ,K,\delta )\equiv 2\sum _{h}B(m,K,\bar{e}(\epsilon ,K,\delta ,\hat{e}(h)))\] \end_inset The quantity \begin_inset Formula \( \hat{P}(\epsilon ,K,\delta ) \) \end_inset is an upper bound on the probability that one of the hypotheses with a true error rate of \begin_inset Formula \( \bar{e} \) \end_inset or more could produce \begin_inset Formula \( K \) \end_inset empirical errors. \layout Standard Noting that there are only \begin_inset Formula \( m+1 \) \end_inset possible empirical errors, we can first let \begin_inset Formula \[ c\left( \frac{i}{m}\right) =\left| \left\{ h:\, \hat{e}(h)=\frac{i}{m}\right\} \right| \] \end_inset be the count of empirical errors at \begin_inset Formula \( \frac{i}{m} \) \end_inset . Then we can redefine: \begin_inset Formula \[ \hat{P}(\epsilon ,K,\delta )\equiv 2\sum _{i=0}^{m}c\left( \frac{i}{m}\right) B(m,K,\bar{e}(\epsilon ,K,\delta ,\frac{i}{m}))\] \end_inset \layout Standard Later, we will prove that with high probability, \begin_inset Formula \( \hat{P}(\epsilon ,K,\delta )\geq P(\epsilon ,K) \) \end_inset . Given that this is so, we can prove a theorem which \emph on only \emph default relies on observable quantities. \layout Theorem \begin_inset LatexCommand \label{Observable} \end_inset (Observable Shell Bound) For all \begin_inset Formula \( \delta >0 \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}(\exists h\in H\, \, e(h)\geq \hat{e}(h)+\epsilon (\delta ,\hat{e}(h)))\leq \delta \] \end_inset where \begin_inset Formula \( \epsilon (\delta ,\frac{K}{m})=\min \left\{ \epsilon :\, \, \hat{P}\left( \epsilon ,K,\frac{\delta }{2}\right) \leq \frac{\delta }{2m}\right\} \) \end_inset \layout Standard The observable shell bound preserves the important locality property of the full knowledge shell bound. In particular, when most of the true error rates are \begin_inset Quotes eld \end_inset far \begin_inset Quotes erd \end_inset from the empirical error rate (and \begin_inset Formula \( m \) \end_inset is large enough), we expect to make large (functional) improvements on the discrete hypothesis bound \begin_inset LatexCommand \ref{th-DHSCP} \end_inset . \layout Standard The proof rests upon a technical lemma which allows us to bound the unobservable \begin_inset Quotes eld \end_inset probability of a misleading event \begin_inset Quotes erd \end_inset with an observable event. \layout Lemma \begin_inset LatexCommand \label{unobservable bound} \end_inset (Unobservable bound) For all empirical errors, \begin_inset Formula \( K \) \end_inset , for all distributions \begin_inset Formula \( Q(h) \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( 2\sum _{h}Q(h)B(m,K,\bar{e}(\epsilon ,K,\delta ,\hat{e}(h)))\geq \sum _{h:e(h)>K+\epsilon }Q(h)B(m,K,e(h))\right) \leq \delta \] \end_inset \layout Standard This lemma is powerful because it bounds the unobservable right hand side in terms of the observable left hand side. \layout Proof Let \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \min _{p}\{p:\, \, 1-\textrm{Bin}(m,k,p)\geq \delta \} \) \end_inset \begin_inset Formula \[ \forall \alpha \in (0,1]\, \, \forall h:\, \, E_{D^{m}}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\leq \alpha \] \end_inset \begin_inset Formula \[ \Rightarrow \forall P(h)\, \, E_{P}E_{D^{m}}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\leq \alpha \] \end_inset \begin_inset Formula \[ \Rightarrow \forall P(h)\, \, E_{D^{m}}E_{P}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\leq \alpha \] \end_inset \begin_inset Formula \[ \Rightarrow \forall P(h)\, \, \Pr _{D^{m}}\left( E_{P}I(e(h)<\bar{e}(m,\hat{e}(h),\alpha ))\geq \frac{\alpha }{\delta }\right) \leq \delta \] \end_inset \begin_inset Formula \[ \Rightarrow \forall P(h)\, \, \Pr _{D^{m}}\left( \Pr _{P}\left( e(h)<\bar{e}(m,\hat{e}(h),\alpha )\right) \geq \frac{\alpha }{\delta }\right) \leq \delta \] \end_inset Let \begin_inset Formula \( P(h)\propto Q(h)B(m,K,e(h)) \) \end_inset . Then: \begin_inset Formula \[ \Rightarrow \forall Q(h)\, \, \Pr _{D^{m}}\left( \frac{\sum _{h:e(h)>\frac{K}{m}+\epsilon \wedge e(h)>\bar{e}(m,\hat{e}(h),\alpha )}Q(h)B(m,K,e(h))}{\sum _{h:e(h)>\frac{K}{m}+\epsilon }Q(h)B(m,K,e(h))}\geq \frac{\alpha }{\delta }\right) \leq \delta \] \end_inset set \begin_inset Formula \( \alpha =\frac{\delta }{2} \) \end_inset and replace \begin_inset Formula \( e(h) \) \end_inset with \begin_inset Formula \( \bar{e}(m,\hat{e}(h),\alpha ) \) \end_inset to achieve the result. \layout Standard We now have all the tools required to prove the theorem. \layout Proof (of theorem \begin_inset LatexCommand \ref{Observable} \end_inset ) Choose \begin_inset Formula \( Q(h) \) \end_inset = the uniform distribution on a our hypothesis space. Then, we know that with probability \begin_inset Formula \( 1-\delta \) \end_inset , \begin_inset Formula \( \hat{P}(\epsilon ,K,\delta )\geq P(\epsilon ,K) \) \end_inset . Therefore, we can (arbitrarily) allocate a \begin_inset Formula \( \frac{\delta }{2} \) \end_inset probability of failure to the unobservable bound \begin_inset LatexCommand \ref{unobservable bound} \end_inset and a \begin_inset Formula \( \frac{\delta }{2} \) \end_inset probability of failure to the full knowledge bound \begin_inset LatexCommand \ref{Full_knowledge} \end_inset . Assuming \begin_inset Formula \( \hat{P}(\epsilon ,K,\delta )\geq P(\epsilon ,K) \) \end_inset , the observable bound will be more pessimistic than the full knowledge bound. \layout Standard The Observable Shell bound behaves in a strange manner which is unlike other true error bounds. In particular, the true error bound can be discontinuous in the value of \begin_inset Formula \( \delta \) \end_inset . This discontinuity implies that relatively small improvements in the shell bounds can result in dramatic improvements in the value of the true error bound. While dramatic improvements can happen with small improvements in the shell bound, we expect that in practice, such large improvements will not be too common, simply because a small improvement is unlikely (amongst all learning problems) to shift the bound across one of these discontinuous transitions. \layout Section Sampling Shell Bound \layout Standard The Shell bound relies upon the distribution of empirical errors across the entire hypothesis space. Calculating \begin_inset Formula \( c\left( \frac{i}{m}\right) \) \end_inset , while theoretically possible, is often practically intractable. For example, the space of all binary functions on \begin_inset Formula \( n \) \end_inset features has size \begin_inset Formula \( 2^{2^{n}} \) \end_inset and for a decision tree with a number of nodes \begin_inset Formula \( k=O(m) \) \end_inset there are more than \begin_inset Formula \( 2^{k} \) \end_inset hypotheses with the same number of nodes. In order to avoid this difficulty, we will use sampling which is made possible by noticing that the Shell bound does not require exact knowledge of the empirical error distribution. Instead, we can safely count a hypothesis twice because over-counting monotonically worsens the bound. Assume that we have an oracle which can be used to sample uniformly from the set of all hypotheses. Then, we can bin the sampled hypotheses according to their error rate on the example set. After repeating \begin_inset Formula \( k \) \end_inset times we will arrive at an empirical distribution over error rates which can be altered into a bound on the true distribution of error rates by upper bounding the count \begin_inset Formula \( c\left( \frac{i}{m}\right) \) \end_inset . Let \begin_inset Formula \( \hat{c}\left( \frac{i}{m}\right) \) \end_inset be the observed number of hypotheses out of \begin_inset Formula \( k \) \end_inset uniform choices with empirical error \begin_inset Formula \( \frac{i}{m} \) \end_inset and define: \begin_inset Formula \[ \bar{c}\left( k,\frac{\hat{c}}{k},\delta ,|H|\right) \equiv |H|*\max _{p}\left\{ p:\, \, \textrm{Bin}(k,\hat{c},p)=\frac{\delta }{m}\right\} \] \end_inset Intuitively, \begin_inset Formula \( \bar{c} \) \end_inset will be a high confidence upper bound on the number of hypotheses with empirical error rate \begin_inset Formula \( \hat{c} \) \end_inset . Given these quantities, we can calculate an approximate \begin_inset Formula \( \hat{P} \) \end_inset according to the formula: \begin_inset Formula \[ \hat{\hat{P}}(\epsilon ,K,\delta )\equiv 2\sum _{i=0}^{m}\bar{c}\left( k,\frac{\hat{c}\left( \frac{i}{m}\right) }{k},\frac{\delta }{2m},|H|\right) B\left( m,K,\bar{e}\left( \epsilon ,K,\frac{\delta }{2},\frac{i}{m}\right) \right) \] \end_inset \layout Theorem \begin_inset LatexCommand \label{th-sample_shell} \end_inset (Sampling Shell Bound) For all \begin_inset Formula \( \delta >0 \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}(\exists h\in H\, \, e(h)\leq \hat{e}(h)+\epsilon (\delta ,\hat{e}(h)))\leq \delta \] \end_inset where \begin_inset Formula \( \epsilon (\delta ,K)=\min \left\{ \epsilon :\, \, \hat{\hat{P}}\left( \epsilon ,K,\frac{\delta }{2}\right) \leq \frac{\delta }{2m}\right\} \) \end_inset \layout Proof For every \begin_inset Formula \( i \) \end_inset , we know that \begin_inset Formula \( \hat{\hat{P}}\left( \epsilon ,K,\delta \right) \geq \hat{P}\left( \epsilon ,K,\frac{\delta }{2}\right) \) \end_inset with probability at least \begin_inset Formula \( 1-\frac{\delta }{2} \) \end_inset . This implies that \begin_inset Formula \( \hat{\hat{P}}\left( \epsilon ,K,\delta \right) \geq P\left( \epsilon ,K\right) \) \end_inset with probability at least \begin_inset Formula \( 1-\delta \) \end_inset . Applying the union bound with the full knowledge theorem gives us the result. \layout Standard The Sampling Shell bound is still relatively fast to calculate given the sampling results, but it is worth noting that \emph on many \emph default samples are required---the bound will typically tighten linearly with \begin_inset Formula \( \ln k \) \end_inset . In other words, an exponentially large \begin_inset Formula \( k \) \end_inset is required to realize all of the gains of the shell bound. Thus, the sampling shell bound will interpolate between the discrete hypothesis bound \begin_inset LatexCommand \ref{th-DHSCP} \end_inset and the shell bound as \begin_inset Formula \( \ln k \) \end_inset moves from \begin_inset Formula \( 1 \) \end_inset to \begin_inset Formula \( \ln |H| \) \end_inset . \layout Section Lower Bounds \layout Standard The lower upper bound \begin_inset LatexCommand \ref{sec-lower_upper} \end_inset does not apply to shell bounds because we are using more information than just the empirical error rate of a learned hypothesis. In particular, we are using the empirical error rates of \emph on all \emph default the hypotheses in calculating the bound. Is there a lower upper bound which applies for the information used by the shell bound? The same independent hypothesis technique will allow us to lower bound the full knowledge theorem \begin_inset LatexCommand \ref{Full_knowledge} \end_inset . In particular, assume that we are given a set of independent hypotheses, each with some true error \begin_inset Formula \( e(h) \) \end_inset . What is a lower bound on the probability that one of these hypotheses will have an empirical error of \begin_inset Formula \( \frac{K}{m} \) \end_inset ? \layout Standard If \begin_inset Formula \( A \) \end_inset and \begin_inset Formula \( B \) \end_inset are independent events, then: \begin_inset Formula \[ \Pr (A\textrm{ or }B)=\Pr (A)+\Pr (B)-\Pr (A\textrm{ and }B)\] \end_inset \begin_inset Formula \[ =\Pr (A)+\Pr (B)-\Pr (A)\Pr (B)\] \end_inset \begin_inset Formula \[ =(1-\Pr (B))\Pr (A)+\Pr (B)\] \end_inset This implies that we can \begin_inset Quotes eld \end_inset add \begin_inset Quotes erd \end_inset the independent probabilities together as long as we rescale. In particular, \begin_inset Formula \( B \) \end_inset might be the probability of a \begin_inset Quotes eld \end_inset bad \begin_inset Quotes erd \end_inset hypothesis in some set of hypotheses and \begin_inset Formula \( A \) \end_inset might be the probability that some new hypothesis with a large true error rate has a small empirical error. \layout Standard Using this fact, we get the following theorem: \layout Theorem (Lower Upper Shell Bound) For all true errors \begin_inset Formula \( e(h) \) \end_inset , \begin_inset Formula \( K \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h:e(h)>\frac{K}{m}+\epsilon \textrm{ and }\hat{e}(h)=\frac{K}{m}\right) \geq P(\epsilon ,K)(1-P(\epsilon ,K))\] \end_inset \layout Proof The proof is by finite induction on the set of hypotheses with a large true error rate. Let \begin_inset Formula \[ P_{H}(\epsilon ,K)=\sum _{h\in H}B(m,K,e(h))\] \end_inset be the sum of the probabilities that each hypothesis in \begin_inset Formula \( H' \) \end_inset produces an empirical error of \begin_inset Formula \( \frac{K}{m} \) \end_inset . Now, we want to prove that: \begin_inset Formula \[ \forall H\, \, \Pr _{D^{m}}\left( \exists h\in H:\, \, \hat{e}(h)=\frac{K}{m}\right) \geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K)\] \end_inset This is true for the base case of \begin_inset Formula \( |H|=1 \) \end_inset . Assuming that it is true for the case of \begin_inset Formula \( |H|=n \) \end_inset , we need to prove it for the case of \begin_inset Formula \( H'=H\cup |h| \) \end_inset . In particular, we have assumed \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, \hat{e}(h)=\frac{K}{m}\right) \geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K)\] \end_inset Using the earlier independent principle, we get that: \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H':\, \, \hat{e}(h)=\frac{K}{m}\right) \] \end_inset \begin_inset Formula \[ \geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K))+(1-P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K)))B(K,e(h),m)\] \end_inset \begin_inset Formula \[ \geq P_{H}(\epsilon ,K)(1-P_{H}(\epsilon ,K))+(1-P_{H}(\epsilon ,K))B(K,e(h),m)\] \end_inset \begin_inset Formula \[ =P_{H'}(\epsilon ,K)(1-P_{H}(\epsilon ,K))\] \end_inset \begin_inset Formula \[ \geq P_{H'}(\epsilon ,K)(1-P_{H'}(\epsilon ,K))\] \end_inset By induction, this property therefore holds for the set of all hypotheses with a large true error rate. \layout Standard Assuming that \begin_inset Formula \( \delta <\frac{1}{2} \) \end_inset , the lower upper shell bound is tight to within a factor (in \begin_inset Formula \( \delta \) \end_inset ) of \begin_inset Formula \( 2m \) \end_inset with the full knowledge bound \begin_inset LatexCommand \ref{Full_knowledge} \end_inset . Given the exponential behavior of Binomial tails, this usually (but not always) implies a small impact on the true error bound. One important question remains: how does this bound compare to the observable shell bound \begin_inset LatexCommand \ref{Observable} \end_inset ? In the observable shell bound, the distribution of true errors is replaced with a pessimistic distribution based upon the observed empirical errors. The \begin_inset Quotes eld \end_inset size \begin_inset Quotes erd \end_inset of this pessimism in terms of the true error bound is, in general, of size \begin_inset Formula \( \frac{1}{\sqrt{m}} \) \end_inset . Thus, the gap between the lower upper shell bound and the upper shell bound is typically of size \begin_inset Formula \( \frac{1}{\sqrt{m}} \) \end_inset . \layout Section Shell Bounds for Continuous Spaces \layout Standard Applying Shell bounds to continuous hypothesis spaces is not easy. In fact, upon first inspection, this appears to be impossible since shell bounds require knowledge of the number of hypotheses with a particular empirical error. We can avoid these difficulties, while introducing a small amount of slack, by always being concerned with the \emph on measure \emph default rather than the count of hypotheses. In particular, we will keep track of the \emph on measure \emph default of hypotheses with a confusingly small error and the \emph on measure \emph default of the hypotheses that we pick. The approach here is similar to the approach in section \begin_inset LatexCommand \ref{sec-PB} \end_inset although more simplistic. \layout Standard First assume that there is some measure \begin_inset Formula \( Q \) \end_inset over the hypothesis space \begin_inset Formula \( H \) \end_inset . Suppose that we choose some subset, \begin_inset Formula \( U \) \end_inset , of the hypotheses with measure \begin_inset Formula \( Q(U) \) \end_inset . We will be concerned with the \emph on largest \emph default empirical error rate \begin_inset Formula \( \hat{e}_{U}(h)=\max _{h\in U}\hat{e}(h) \) \end_inset and the \emph on average \emph default true error rate, \begin_inset Formula \( e_{U}(h)=E_{Q_{U}}e(h) \) \end_inset . A bound on the gap between the largest empirical error rate and the average true error rate will imply a true error bound for the stochastic classifier which chooses randomly and evaluates. \layout Standard We will need a different definition of \begin_inset Formula \( P \) \end_inset . Let: \layout Standard \begin_inset Formula \[ P_{s}(\epsilon ,\, \, K)\equiv \int _{h:e(h)>K+\epsilon }Q(h)\textrm{Bin}(K,e(h),m)dh\] \end_inset We will also need the concept of rounding. Choose \begin_inset Formula \( i \) \end_inset such that \begin_inset Formula \( \frac{1}{m^{i}}\geq Q(U)\geq \frac{1}{m^{i+1}} \) \end_inset then, define: \begin_inset Formula \[ \left\lfloor Q(U)\right\rfloor =\frac{1}{m^{i+1}}\] \end_inset Now, the following theorem holds: \layout Theorem (Stochastic Full knowledge) \begin_inset LatexCommand \label{Stochastic Full Knowledge} \end_inset For all distributions \begin_inset Formula \( Q \) \end_inset , For all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists U\, \, e_{U}(h)\leq \hat{e}_{U}(h)+\frac{1}{m}+\epsilon (\delta ,\hat{e},U)\right) \] \end_inset where \begin_inset Formula \( \epsilon \left( \delta ,\frac{K}{m},U\right) =\min \left\{ \epsilon :\, \, P_{s}(\epsilon ,K)\leq \frac{\delta \left\lfloor Q(U)\right\rfloor }{m^{3}}\right\} \) \end_inset \layout Proof Call a hypothesis with a large true error ( \begin_inset Formula \( e(h)>\frac{K}{m}+\epsilon \) \end_inset ) and small empirical error ( \begin_inset Formula \( \hat{e}(h)\leq \frac{K}{m} \) \end_inset ) a \begin_inset Quotes eld \end_inset bad \begin_inset Quotes erd \end_inset hypothesis. \begin_inset Formula \( P_{s}(\epsilon ,K) \) \end_inset is the expected measure of the bad hypotheses. We will use Markov's inequality to bound the actual measure of bad hypotheses. Then, given that the quantity is bounded, we can bound the expected true error by assuming that we included every bad hypothesis in the set \begin_inset Formula \( U \) \end_inset . \layout Proof Let \begin_inset Formula \[ \epsilon _{Ki}=\min \left\{ \epsilon :\, \, P_{s}(\epsilon ,K)\leq \frac{\delta }{m^{3+i}}\right\} \] \end_inset Intuitively, \begin_inset Formula \( \epsilon _{Ki} \) \end_inset is the value we will use when \begin_inset Formula \( \hat{e}=\frac{K}{m} \) \end_inset and \begin_inset Formula \( \left\lfloor Q(\hat{e})\right\rfloor =\frac{1}{m^{i}} \) \end_inset . Also let \begin_inset Formula \[ \hat{P}_{s}(\epsilon ,K)=\int _{h:e(h)>K+\epsilon \wedge \hat{e}(h)\leq K}p(h)dh\] \end_inset be the actual measure of bad hypotheses. Then, Markov's inequality tells us: \begin_inset Formula \[ \forall \delta >0\, \, \Pr _{D}(\hat{P}_{s}(\epsilon _{Ki},K)\geq \frac{m^{2}}{\delta }P_{s}(\epsilon _{Ki},K))\leq \frac{\delta }{m^{2}}\] \end_inset Taking the union bound over all values of \begin_inset Formula \( \epsilon _{Ki} \) \end_inset , we get: \begin_inset Formula \[ \forall \delta >0\, \, \Pr _{D}(\forall K,i\, \, \hat{P}_{s}(\epsilon _{Ki},K)\geq \frac{m^{2}}{\delta }P_{s}(\epsilon _{Ki},K))\leq \delta \] \end_inset So, with probability \begin_inset Formula \( 1-\delta \) \end_inset for all values of \begin_inset Formula \( K \) \end_inset and \begin_inset Formula \( i \) \end_inset , we have: \begin_inset Formula \( \hat{P}_{s}(\epsilon _{Ki},K)\leq \frac{1}{m^{1+i}} \) \end_inset . Therefore, if \begin_inset Formula \( Q(U)\geq \frac{1}{m^{i}} \) \end_inset , we know that \begin_inset Formula \( \frac{Q(U)}{\hat{P}_{s}(\epsilon _{Ki},K)}\geq m \) \end_inset . Assuming that all of the bad hypotheses have true error \begin_inset Formula \( 1 \) \end_inset and all the rest have true error at most \begin_inset Formula \( \frac{K}{m}+\epsilon _{Ki} \) \end_inset , we get the following true error bound: \begin_inset Formula \[ e_{U}(h)\leq \hat{e}_{U}(h)+\frac{1}{m}+\epsilon _{\hat{e}i}\] \end_inset \layout Standard The stochastic full knowledge bound is loose when applied to the full knowledge setting by \begin_inset Formula \( \delta \rightarrow \frac{\delta }{m^{2}} \) \end_inset . Typically, this results in only a \begin_inset Formula \( \frac{2\ln m}{m} \) \end_inset increase in the size of \begin_inset Formula \( \epsilon \) \end_inset , although the increase can sometimes be much larger near phase transitions. These factors of \begin_inset Formula \( \frac{\ln m}{m} \) \end_inset may be removable with improved argumentation. Naturally, the stochastic full knowledge bound can be used to prove a stochasti c observable bound. \layout Standard The next theorem is the observable analog of the stochastic full knowledge bound. Here, we eliminate the unobservable quantities to produce a stochastic observable shell bound. The observable quantity we will use is: \layout Standard \begin_inset Formula \[ \hat{P}_{s}(\epsilon ,K,\delta )\equiv 2\sum _{h}Q(h)\textrm{Bin}(K,\bar{e}(\epsilon ,K,\delta ,\hat{e}(h)),m)\] \end_inset where \begin_inset Formula \[ \bar{e}(\epsilon ,K,\delta ,\frac{i}{m})=\textrm{max }\left\{ K+\epsilon ,\, \, \textrm{min}\, \left\{ p:\, \, 1-\textrm{Bin}(m,K,p)\geq \delta \right\} \right\} \] \end_inset is a slightly pessimal estimate of the true error rate given the empirical error rate as before. \layout Theorem (Stochastic Observable Shell Bound) For all distributions \begin_inset Formula \( Q \) \end_inset , For all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists U\, \, e_{U}(h)\leq \hat{e}_{U}(h)+\frac{1}{m}+\epsilon (\delta ,\hat{e},U)\right) \] \end_inset where \begin_inset Formula \( \epsilon \left( \delta ,\frac{K}{m},U\right) =\min \left\{ \epsilon :\, \, \hat{P}_{s}(\epsilon ,K,\frac{\delta }{2})\leq \frac{\delta \left\lfloor Q(U)\right\rfloor }{2m^{3}}\right\} \) \end_inset \layout Proof First note that the unobservable bound lemma \begin_inset LatexCommand \ref{unobservable bound} \end_inset implies that with probability \begin_inset Formula \( 1-\frac{\delta }{2} \) \end_inset , we have \begin_inset Formula \( \hat{P}_{s}(\epsilon ,K,\frac{\delta }{2})\geq P_{s}(\epsilon ,K) \) \end_inset . Given that this is the case, our choice of \begin_inset Formula \( \epsilon \) \end_inset will be at least as pessimistic as the choice defined by the Stochastic Full Knowledge bound \begin_inset LatexCommand \ref{Stochastic Full Knowledge} \end_inset . We thus have two sources of failure: the unobservable bound lemma will fail with probability at most \begin_inset Formula \( \frac{\delta }{2} \) \end_inset and the stochastic full knowledge bound will fail with probability \begin_inset Formula \( \frac{\delta }{2} \) \end_inset . The union bound then implies that the Stochastic Observable Shell Bound holds with probability \begin_inset Formula \( 1-\frac{\delta }{2} \) \end_inset . \layout Standard The information requirements for the continuous space shell bound are even more severe than the requirements for the discrete space shell bound. In particular, we need to know the exact measure of the hypotheses with any particular empirical error. Clearly, this is unrealistic. We can relax this requirement to observations computable in a finite amount of time using the sampling techniques of theorem \begin_inset LatexCommand \ref{th-sample_shell} \end_inset . In particular, given a well-behaved distribution \begin_inset Formula \( Q \) \end_inset , we can make a bounded estimate of the measure of hypotheses with some empirical error rate. Given this bounded estimate, we can then calculate a pessimistic shell bound for the continuous space. \layout Section Conclusion \layout Standard Details for calculating the shell bound are given in appendix section \begin_inset LatexCommand \ref{sec-shell-bound-calc} \end_inset . \layout Standard Shell bounds are easily calculable and can provide large improvements when we can afford to enumerate the hypotheses. Their application is less clear when it is not possible to enumerate the hypotheses because the computational burden may become too large. Via sampling techniques, it is possible to smoothly improve from earlier techniques to the best achievable shell bound result in an anytime fashion. \layout Standard There remain several important open questions: \layout Enumerate Can we remove the remaining division of \begin_inset Formula \( \delta \) \end_inset by \begin_inset Formula \( m \) \end_inset ? This would make shell bounds a bit more elegant and tight. \layout Enumerate Is there a natural lower bound on the true error rate which uses the shell approach? This improvement is of principally theoretical interest. \layout Enumerate For the continuous space shell bounds, two additional divisions by \begin_inset Formula \( m \) \end_inset were introduced. Is it possible to remove these factors with an improved argument? This improvement would clean up the continuous shell bound. \layout Enumerate Our extension to the continuous case was done in the style of PAC-Bayes bounds, but a more common technique for extending to the continuous case is via the use of covering numbers. Is there a natural way to extend Shell bounds to the continuous case using the concept of a covering number? This is another approach which might yield a result requiring less calculation. \layout Chapter Tight covering number bounds \layout Standard \begin_inset LatexCommand \label{sec-cover} \end_inset \layout Section Introduction \layout Standard Covering number bounds are used to bound the true error rate of classifiers chosen from an infinite hypothesis space using \begin_inset Formula \( m \) \end_inset examples \begin_inset LatexCommand \cite{Haussler} \end_inset . A cover is a finite set of hypotheses which satisfies the following property: every hypothesis in the infinite space is \begin_inset Quotes eld \end_inset near \begin_inset Quotes erd \end_inset some element in the finite cover. When a Lipschitz condition holds on the hypothesis space, it is generally possible to construct these covers and the existence of a cover is required for learnability \begin_inset LatexCommand \cite{BI} \end_inset . Alternatively, Sauer's lemma (see \begin_inset LatexCommand \cite{Pollard} \end_inset or \begin_inset LatexCommand \cite{Vapnik} \end_inset ) bounds the size of the cover in terms of the VC dimension which is defined combinatorially: the VC dimension is the largest number of examples which the hypothesis space can classify in an arbitrary manner. \layout Standard The principal disadvantage of covering number results is that they are notorious ly loose, to the point that they are often useless when applied in practice (see \begin_inset Quotes eld \end_inset criticisms \begin_inset Quotes erd \end_inset in \begin_inset LatexCommand \cite{Haussler} \end_inset ). Here, \begin_inset Quotes eld \end_inset useless \begin_inset Quotes erd \end_inset means that the bound on the true error rate is \begin_inset Quotes eld \end_inset always wrong \begin_inset Quotes erd \end_inset . The amount of \begin_inset Quotes eld \end_inset looseness \begin_inset Quotes erd \end_inset can be quantified by comparison with other bounds in the regimes where other bounds hold. On a finite hypothesis space we have near-perfect agreement between the upper bound \begin_inset LatexCommand \ref{th-DHSCP} \end_inset and the lower upper bound \begin_inset LatexCommand \ref{th-dhlub} \end_inset for independent hypotheses. In fact, as the number of examples goes to infinity, the agreement is perfect, regardless of the size of the hypothesis space. When we apply covering number bounds to this problem such properties do not arise. Since part of the argument involves splitting the examples into \begin_inset Formula \( 2 \) \end_inset sets, the difference between a covering number based upper bound and the lower upper bound can be large even when the number of examples goes to infinity. In practice, the covering number bound at least squares the discrete hypothesis size. Obviously, further loosening of covering number bounds with Sauer's lemma result in even worse bounds. \layout Standard Can we construct a calculable true error upper bound for continuous hypothesis spaces which is at least asymptotically tight? In a sense, this has already been done with PAC-Bayes bounds in chapter \begin_inset LatexCommand \ref{sec-PB} \end_inset , but there are drawbacks in applicability to that approach since PAC-Bayes bounds do not apply in a meaningful way to a single hypothesis drawn from an infinite hypothesis space. A covering number argument would hopefully apply in a meaningful way to a singly hypothesis. A covering number bound which is asymptotically tight on \emph on some \emph default learning problems does exist and is covered next. \layout Section The Setting and Prior Results \layout Standard We will first discuss standard covering number bounds. Define a \begin_inset Quotes eld \end_inset distance \begin_inset Quotes erd \end_inset in terms of how often hypotheses disagree according to: \begin_inset Formula \[ d_{D}(h,f)=\Pr _{D}(h(x)\neq f(x))\] \end_inset Now, start with an epsilon net defined by: \begin_inset Formula \[ N(H,\epsilon ,d_{D})=\textrm{inf}_{F}\left| F:\, \, \forall h\in H\exists f\in F:\, \, d_{D}(h,f)\leq \epsilon \right| \] \end_inset An epsilon net is the minimum size of a set which contains an element \begin_inset Quotes eld \end_inset near \begin_inset Quotes erd \end_inset to every element in \begin_inset Formula \( H \) \end_inset . \layout Standard Then a covering number is defined as: \begin_inset Formula \[ C(H,\epsilon )=\textrm{sup}_{D}N(H,\epsilon ,d_{D})\] \end_inset The covering number is the worst epsilon net. \layout Theorem \begin_inset LatexCommand \label{th-covering} \end_inset (Covering number bound) For all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \forall H\, \, \Pr _{D^{m}}\left( e(h)\geq \hat{e}(h)+4\sqrt{\frac{\ln 4C(H,\epsilon )+\ln \frac{1}{\delta }}{m}}\right) \leq \delta \] \end_inset \layout Proof In \begin_inset LatexCommand \cite{Haussler} \end_inset . \layout Standard How tight is this bound when applied to a finite independent hypothesis space? We can improve the constants by using an argument with fewer triangle inequalities in the discrete case and get the following results: \begin_inset Formula \[ \Pr _{D}\left( e(h)-\hat{e}(h)\geq 2\sqrt{\frac{\ln 4|H|+\ln \delta }{m}}\right) \leq \delta \] \end_inset Comparing this with a very loose application of the discrete hypothesis bound \begin_inset LatexCommand \ref{th-adhscb} \end_inset we see that the penalty term in the covering number bound is worse by factor of \begin_inset Formula \( 2\sqrt{2} \) \end_inset . Put another way, dividing the number of samples by \begin_inset Formula \( 8 \) \end_inset or increasing the hypothesis space size to \begin_inset Formula \( |H|^{8} \) \end_inset and then applying a sloppy discrete hypothesis bound is about equivalent to applying a very specialized covering number bound. We seek a covering number bound which does not divide the effective value of a hypothesis by \begin_inset Formula \( 8 \) \end_inset . \layout Section Bracketing Covering Number Bound \layout Standard There is an alternative version of the covering number bounds which has been little used in learning theory. This is mentioned as the \begin_inset Quotes eld \end_inset direct method \begin_inset Quotes erd \end_inset in \begin_inset LatexCommand \cite{Haussler} \end_inset and Section II.2 of \begin_inset LatexCommand \cite{Pollard} \end_inset discusses this approach but is only concerned with asymptotic consistency rather than rates of convergence. \layout Standard We start with a more restricted notion of covering---a covering in which we bracket the hypotheses. Let \begin_inset Formula \( Z \) \end_inset be the space of \begin_inset Formula \( (x,y) \) \end_inset labeled examples and \begin_inset Formula \( h(Z)=\{(x,y):h(x)\neq y\} \) \end_inset in: \begin_inset Formula \[ N_{[]}(H,\gamma ,d_{D})=\textrm{inf}_{F}\left| F:\forall h\in H\, \, \exists (f,f')\in F:\, \, \begin{array}{cc} d_{D}(f,f')\leq \gamma & \\ \textrm{and }f(Z)>h(Z)>f'(Z) & \end{array}\right| \] \end_inset In other words, \begin_inset Formula \( f \) \end_inset and \begin_inset Formula \( f' \) \end_inset are two sets which are \begin_inset Quotes eld \end_inset above \begin_inset Quotes erd \end_inset and \begin_inset Quotes eld \end_inset below \begin_inset Quotes erd \end_inset every hypothesis that they cover. Note that it is very important in this definition that the sets \begin_inset Formula \( f \) \end_inset and \begin_inset Formula \( f' \) \end_inset are not required to correspond to functions in \begin_inset Formula \( H \) \end_inset . In fact, they can simply be understood as arbitrary sets of \begin_inset Formula \( (x,y) \) \end_inset pairs. \layout Standard It is immediately obvious that: \begin_inset Formula \[ N_{[]}(H,\gamma ,d_{D})\geq N(H,\gamma ,d_{D})\] \end_inset However, the relationship between \begin_inset Formula \( N_{[]}(H,\gamma ,d_{D}) \) \end_inset and \begin_inset Formula \( C(H,\gamma ) \) \end_inset is ambiguous: either could be larger. \layout Standard Using this alternative notion of a covering, we have the following theorem: \layout Theorem \begin_inset LatexCommand \label{th-bracketing_cover} \end_inset Let \begin_inset Formula \( f(h) \) \end_inset be the upper bracket of any hypothesis, \begin_inset Formula \( h \) \end_inset . For all \begin_inset Formula \( \gamma \in (0,1] \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, \begin{array}{c} e(h)\geq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right) \textrm{ }\\ \textrm{or }\hat{e}(f(h))-\hat{e}(h)\geq b\left( m,\gamma ,\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right) \end{array}\right) \leq \delta \] \end_inset where \begin_inset Formula \( \begin{array}{c} \bar{e}\left( m,\frac{K}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,K,p)=\delta \}\\ \textrm{and }b\left( m,p,\delta \right) \equiv '\min _{K}\left\{ \frac{K}{m}:\, \, 1-\textrm{Bin}(m,K,p)\leq \delta \right\} \end{array} \) \end_inset . \layout Standard This theorem constrains the distance between \begin_inset Formula \( \hat{e}(f(h)) \) \end_inset and \begin_inset Formula \( e(f(h)) \) \end_inset (which upper bounds \begin_inset Formula \( e(h) \) \end_inset ) and the distance between \begin_inset Formula \( \hat{e}(f(h)) \) \end_inset and \begin_inset Formula \( \hat{e}(h) \) \end_inset . Consequently, it constrains the distance between \begin_inset Formula \( e(h) \) \end_inset and \begin_inset Formula \( \hat{e}(h) \) \end_inset . The proof of the theorem rests on the following two lemmas which each assert a convergence. Graphically, the proof has the following form: \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 148 10 file bracketing_cover.eps width 4 50 flags 9 \end_inset \layout Lemma Let \begin_inset Formula \( f(h) \) \end_inset be the upper bracket of any hypothesis, \begin_inset Formula \( h \) \end_inset . For all \begin_inset Formula \( \gamma \in (0,1] \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, \hat{e}(f(h))-\hat{e}(h)\geq b\left( m,\gamma ,\frac{\delta }{N_{[]}(H,\gamma ,d_{D})}\right) \right) \leq \delta \] \end_inset where \begin_inset Formula \( b\left( m,p,\delta \right) \equiv '\min _{K}\left\{ \frac{K}{m}:\, \, 1-\textrm{Bin}(m,K,p)\leq \delta \right\} \) \end_inset \layout Proof If \begin_inset Formula \( f(h),f'(h) \) \end_inset bracket \begin_inset Formula \( h \) \end_inset , we have: \begin_inset Formula \[ e(f(h))-e(f'(h))\leq \gamma \] \end_inset Therefore a coin which is ``heads'' when \begin_inset Formula \( f(h)(z)\neq f'(h)(z) \) \end_inset has a bias of \begin_inset Formula \( \gamma \) \end_inset or less. Since \begin_inset Formula \( f'(h) \) \end_inset is correct everywhere that \begin_inset Formula \( h \) \end_inset is correct, we also know: \begin_inset Formula \[ \forall \epsilon \, \, \Pr _{D^{m}}(\hat{e}(f(h))-\hat{e}(h)\geq \epsilon )\leq \Pr _{D^{m}}(\hat{e}(f(h))-\hat{e}(f'(h))\geq \epsilon )\] \end_inset Therefore, we have at most \begin_inset Formula \( N_{[]}(H,\gamma ,d_{D}) \) \end_inset Binomial distributions, each with bias at most \begin_inset Formula \( \gamma \) \end_inset , and wish to bound the probability of a large deviation. Using an upper binomial tail calculation, we get the result. \layout Lemma For all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists f,f'\in N_{[]}(H,\gamma ,d_{D}):\, \, e(f(h))\geq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{N_{[]}(H,\gamma ,d_{D})}\right) \right) \] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{K}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,K,p)=\delta \} \) \end_inset \layout Proof Application of the discrete hypothesis space bound \begin_inset LatexCommand \ref{th-DHSCP} \end_inset for \begin_inset Formula \( f \) \end_inset in every \begin_inset Formula \( (f,f') \) \end_inset pair. \layout Standard We can now combine the lemmas to get the theorem. \layout Proof (of theorem \begin_inset LatexCommand \ref{th-bracketing_cover} \end_inset ) \latex latex \latex default Allocate \begin_inset Formula \( \frac{\delta }{2} \) \end_inset confidence to each lemma and use the union bound with both lemmas to get: \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, \begin{array}{c} e(f(h))\geq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right) \\ \hat{e}(f(h))-\hat{e}(h)\geq b\left( m,\gamma ,\frac{\delta }{2N_{[]}(H,\gamma ,d_{D})}\right) \end{array}\textrm{ }\right) \leq \delta \] \end_inset where \begin_inset Formula \( \begin{array}{c} \bar{e}\left( m,\frac{K}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,K,p)=\delta \}\\ \textrm{and }b\left( m,p,\delta \right) \equiv '\min _{K}\left\{ \frac{K}{m}:\, \, 1-\textrm{Bin}(m,K,p)\leq \delta \right\} \end{array} \) \end_inset . Since \begin_inset Formula \( e(h)\leq e(f(h)) \) \end_inset by construction, the proof is complete. \layout Standard The alternative covering number argument has a significant advantage over the standard argument: when restricted to a finite hypothesis space, the argument becomes tight. In particular, on a finite hypothesis space, we can set \begin_inset Formula \( \gamma =0 \) \end_inset and get: \begin_inset Formula \( N_{[]}(H,0,d_{D})\leq |H| \) \end_inset . The bound reduces to the standard discrete hypothesis space bound \begin_inset LatexCommand \ref{th-DHSCP} \end_inset . Consequently, there exist learning problems where this bound is quite tight. \layout Standard The bracketing covering approach has the following advantages and disadvantages: \layout Enumerate Advantage: Covering defined in terms of the actual problem rather than a worst case over all problems. \layout Enumerate Advantage: Asymptotically tight for some learning problems. \layout Enumerate Disadvantage: Less general. There may exist spaces which are not coverable in the alternative approach. \layout Enumerate Disadvantage: \begin_inset Formula \( N_{[]}(H,\gamma ,d_{D}) \) \end_inset is more difficult to compute than \begin_inset Formula \( N(H,\gamma ,d_{D}) \) \end_inset . \layout Standard In a sense, this bound is useless because it requires knowledge of the unknown distribution \begin_inset Formula \( D \) \end_inset in order to calculate a covering number. In the next section, we will see that bounds on the bracketing covering number which hold for all \begin_inset Formula \( D \) \end_inset can often be found. \layout Section Covering number calculations \layout Standard It is important to demonstrate that this covering number is feasible to calculate and gives a better answer than the traditional approach. We will do this by first calculating the bracketing covering number for a very simple continuous classifier and then comparing the results with the traditional covering number approach. \layout Standard Bracketing covering numbers have already been proved for many function classes \begin_inset LatexCommand \cite{Dudley} \end_inset \begin_inset LatexCommand \cite{VW} \end_inset . Here, we will present a proof for the simplest of continuous hypothesis spaces: the step function on a line segment. Each hypothesis will be indexed by a number \begin_inset Formula \( a\in [0,1] \) \end_inset according to: \begin_inset Formula \[ h_{a}(x)=\textrm{sign}(x-a)\] \end_inset What is \begin_inset Formula \( N_{[]}(H,\gamma ,d_{D}) \) \end_inset for this hypothesis set? \layout Theorem Assume that \begin_inset Formula \( D \) \end_inset can be described by a probability density function, then: \layout Theorem \begin_inset Formula \[ N_{[]}(H,\gamma ,d_{D})\leq \frac{1}{\gamma }+1\] \end_inset \layout Proof Consider a range of hypotheses from \begin_inset Formula \( h_{a} \) \end_inset to \begin_inset Formula \( h_{b} \) \end_inset . For this range of hypotheses, we can choose a bracketing pair, \begin_inset Formula \( (f,f') \) \end_inset . In particular, we can choose \begin_inset Formula \( f \) \end_inset and \begin_inset Formula \( f' \) \end_inset which agree on \begin_inset Formula \( [0,a) \) \end_inset and \begin_inset Formula \( [b,1] \) \end_inset and always predict either incorrectly ( \begin_inset Formula \( f \) \end_inset ) or correctly ( \begin_inset Formula \( f' \) \end_inset ) on \begin_inset Formula \( [a,b) \) \end_inset . The distance between these functions satisfies: \begin_inset Formula \[ d_{D}(f,f')=\Pr _{D}(x\in [a,b))\] \end_inset and every hypothesis \begin_inset Formula \( h_{c} \) \end_inset in \begin_inset Formula \( [a,b) \) \end_inset satisfies \begin_inset Formula \( \forall z\, \, f(z)\geq h_{c}(z)\geq f'(z) \) \end_inset . Consequently, if \begin_inset Formula \( a \) \end_inset and \begin_inset Formula \( b \) \end_inset are chosen appropriately, we will observe \begin_inset Formula \( d_{D}(f,f')\leq \gamma \) \end_inset . \layout Proof If \begin_inset Formula \( D \) \end_inset can be described by a probability distribution, then we can simply calculate the marginal distribution, \begin_inset Formula \( D_{x} \) \end_inset , and the cumulative distribution of the margin, \begin_inset Formula \( F_{D}(x) \) \end_inset . Now, find \begin_inset Formula \( a_{i} \) \end_inset for which \begin_inset Formula \( F_{D}(a_{i})=\frac{i}{\gamma } \) \end_inset for \begin_inset Formula \( i<\frac{1}{\gamma } \) \end_inset . Choose \begin_inset Formula \( b_{i}=a_{i+1} \) \end_inset . There are at most \begin_inset Formula \( \frac{1}{\gamma }+1 \) \end_inset intervals, each with a measure (according to \begin_inset Formula \( D \) \end_inset ) of at most \begin_inset Formula \( \gamma \) \end_inset . Consequently, we can cover \begin_inset Formula \( H \) \end_inset with \begin_inset Formula \( \frac{1}{\gamma }+1 \) \end_inset pairs of \begin_inset Formula \( (f_{i},f_{i}') \) \end_inset . \layout Standard Given that the bracketing cover is \begin_inset Formula \( \frac{1}{\gamma }+1 \) \end_inset , we can use theorem \begin_inset LatexCommand \ref{th-bracketing_cover} \end_inset to define a constraint that the true error rate must satisfy with high probability. Setting \begin_inset Formula \( \gamma =\frac{1}{m-1} \) \end_inset , we get: \begin_inset Formula \[ \hat{e}(f(h))\leq \hat{e}(h)+b\left( m,\frac{1}{m-1},\frac{\delta }{2m}\right) \] \end_inset and \begin_inset Formula \[ e(h)\leq \bar{e}\left( m,\hat{e}(f(h)),\frac{\delta }{2m}\right) \] \end_inset To be fair in comparison to the standard covering number approaches, we should relax our theorem to use the Hoeffding approximation. Note that this is a bit unfair because the first inequality is (inherently) a highly biased Binomial with lower variance. Relaxing to the Hoeffding bound, we get: \begin_inset Formula \[ \hat{e}(f(h))\leq \hat{e}(h)+\frac{1}{m-1}+\sqrt{\frac{\ln \frac{2m}{\delta }}{2m}}\] \end_inset and \begin_inset Formula \[ e(h)\leq \hat{e}(f(h))+\sqrt{\frac{\ln \frac{2m}{\delta }}{2m}}\] \end_inset which implies: \begin_inset Formula \[ e(h)\leq \hat{e}(h)+\frac{1}{m-1}+2\sqrt{\frac{\ln 2m+\ln \frac{1}{\delta }}{2m}}\] \end_inset Note again that we are being \begin_inset Quotes eld \end_inset unfair \begin_inset Quotes erd \end_inset to the new approach by using the Hoeffding approximation rather than exact Binomial-tail bounds. The standard covering number approach has not yet been reduced to exact Binomial-tail bounds. Using the standard approach, the covering number, we get \begin_inset Formula \( C(H,\frac{1}{m-1})=m \) \end_inset . This implies a bound of: \begin_inset Formula \[ e(h)\leq \hat{e}(h)+4\sqrt{\frac{\ln 8m+\ln \frac{1}{\delta }}{m}}\] \end_inset Comparing the bounds, we see that the new approach is about \begin_inset Formula \( \frac{16}{2}=8 \) \end_inset times more efficient in the number of examples required to achieve a bound on a given deviation. \layout Subsection Note on the Bracketing Cover proof \layout Standard There are several important things to note about this proof. \layout Enumerate We used the property that a small change in the hypothesis only affected the prediction on a small portion of the input space. \layout Enumerate The bound on \begin_inset Formula \( N_{[]} \) \end_inset holds for all \begin_inset Formula \( D \) \end_inset with a density function, not just the \begin_inset Formula \( D \) \end_inset which we happen to observe. \layout Enumerate The bound on \begin_inset Formula \( N_{[]}(H,\gamma ,D) \) \end_inset is exactly the same as a bound on \begin_inset Formula \( N\left( H,\frac{\gamma }{2},D\right) \) \end_inset . \layout Standard In fact, the proof can be extended to \emph on all \emph default \begin_inset Formula \( D \) \end_inset (even ones with point masses) at the cost of a factor of \begin_inset Formula \( 2 \) \end_inset worsening and a messier argument. Property (2) is desirable because it is often not the case that we know the distribution \begin_inset Formula \( D \) \end_inset when we wish to apply the bound. Property (1) is an essential technique that can be used to prove other covering number bounds for this notion of covering number. \layout Standard Can we show partial order covering number bounds for other classifiers? There is a straightforward extension of the previous proof for classifiers which consist of axis parallel intervals in \begin_inset Formula \( R^{n} \) \end_inset . More work is required to prove partial order covering numbers for the hypothesi s spaces of standard learning algorithms. \layout Section Conclusion and Future Work \layout Standard We presented an alternative covering number argument and showed that the true error rate bounds constructed using this argument are within \begin_inset Formula \( O\left( \frac{\ln m}{m}\right) \) \end_inset of the lower bound on some learning problems. This is a significant improvement over prior results which just bound the ratio of the lower and upper bounds up to a constant. We also presented a simple improvement on PAC-Bayes bounds for stochastic classifiers which achieves a similar \begin_inset Formula \( O\left( \frac{\ln m}{m}\right) \) \end_inset difference between the lower and upper bounds. \layout Standard It is interesting to examine the relationship between the bracketing covering number and the PAC-Bayes bound. With this notion of covering number we can guarantee that all of the hypotheses covered by the same bracketing pair have similar empirical as well as true errors. Thus, we can relate the error rate of an individual hypothesis to a set of hypotheses with a significant measure --- exactly the setting where the PAC-Bayes bound is tight. \layout Standard Much work remains to be done in order to fulfill a quest for quantitatively tight learning bounds. \layout Enumerate Proofs on the size of the partial order covering number need to be made for common learning algorithms. \layout Enumerate Can this alternate form of covering number be related to the VC dimension or to the standard definition of covering number? \layout Enumerate Can we extend the class of problems for which the lower and upper bounds differ by only \begin_inset Formula \( O\left( \frac{\ln m}{m}\right) \) \end_inset to a larger set? \layout Chapter Holdout bounds: Progressive Validation \layout Standard \begin_inset LatexCommand \label{sec-PV} \end_inset \layout Section Progressive Validation Technique \layout Standard Progressive validation is a technique which allows you to use almost \begin_inset Formula \( \frac{1}{2} \) \end_inset of the data in a holdout set for training purposes while still providing the same guarantee as the holdout bound. It first appeared in \begin_inset LatexCommand \cite{Progressive} \end_inset and is discussed in a more refined and detailed form here. \layout Standard Suppose that you have a training set of size \begin_inset Formula \( m_{\textrm{train}} \) \end_inset and test set of size \begin_inset Formula \( m_{\textrm{pv}} \) \end_inset . Progressive validation starts by first learning a hypothesis on the training set and then testing on the first example of the test set. Then, we train on training set plus the first example of the test set and test on the second example of the test set. The process continues \begin_inset Formula \( m_{\textrm{test}} \) \end_inset iterations. Let \begin_inset Formula \( m \) \end_inset abbreviate \begin_inset Formula \( m_{\textrm{pv}} \) \end_inset . Then, we have \begin_inset Formula \( m \) \end_inset hypotheses, \begin_inset Formula \( h_{1},...,h_{m} \) \end_inset and \begin_inset Formula \( m \) \end_inset error observations, \begin_inset Formula \( \hat{e}_{1},...,\hat{e}_{m} \) \end_inset . The hypothesis output by progressive validation is the randomized hypothesis which chooses uniformly from \begin_inset Formula \( h_{1},...,h_{m} \) \end_inset and evaluates to get an estimated output. Note that this protocol is similar to those in \begin_inset LatexCommand \cite{Littlestone89} \end_inset and the new thing here is an analysis of performance. \layout Standard Since we are randomizing over hypotheses trained on \begin_inset Formula \( m_{\textrm{train}} \) \end_inset to \begin_inset Formula \( m_{\textrm{train}}+m_{\textrm{pv}}-1 \) \end_inset examples, the expected number of examples used by any hypothesis is \begin_inset Formula \( m_{\textrm{train}}+\frac{m_{\textrm{pv}}-1}{2} \) \end_inset . Given that training can exhibit phase transitions, the extra few examples can greatly improve the accuracy of the trained example. \layout Standard Viewed as an interactive proof of learning, the progressive validation technique follows the protocol of figure \begin_inset LatexCommand \ref{fig-pv-protocol} \end_inset . \layout Standard \begin_float fig \layout Standard \align center \begin_inset Figure size 267 175 file thesis-presentation/progressive_validation.eps width 4 90 flags 9 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-pv-protocol} \end_inset The progressive validation protocol has a learner repeatedly commit to a hypothesis before it is given a new example. Based upon the test errors, a bound on the true error rate of the metahypothesi s which chooses randomly from each of \begin_inset Formula \( (h_{1},...,h_{m}) \) \end_inset before each evaluation is provided. \end_float \layout Standard The true error rate of this randomized hypothesis will be: \begin_inset Formula \[ e_{\textrm{pv}}=\frac{1}{m_{\textrm{pv}}}\sum _{i=1}^{m_{\textrm{pv}}}e(h_{i})\] \end_inset where \begin_inset Formula \( e(h_{i})=\Pr _{D}(h_{i}(x)\neq y) \) \end_inset and the empirical error estimate of this randomized hypothesis will be: \begin_inset Formula \[ \hat{e}_{\textrm{pv}}=\frac{1}{m_{\textrm{pv}}}\sum _{i=1}^{m_{\textrm{pv}}}\hat{e}_{i}\] \end_inset \layout Section Variance Analysis \layout Standard Bounding the deviation of \begin_inset Formula \( \hat{e}_{\textrm{pv}} \) \end_inset is more difficult than bounding the deviation of a holdout error. To understand this, we can think of two games, the holdout game and the progressive validation game. \layout Standard In the holdout game, your opponent chooses a bias and then nature flips \begin_inset Formula \( m \) \end_inset coins with that bias. If the deviation of the average number of heads is larger than \begin_inset Formula \( \epsilon \) \end_inset , then you lose. Otherwise, you win. \layout Standard In the progressive validation game, the opponent chooses the bias of \emph on each \emph default coin just before it is flipped. The goal of the opponent remains the same, and the opponent wins if a large deviation is observed. \layout Standard The progressive validation opponent is at least as strong as the holdout opponent since the progressive validation opponent could choose the same bias for every coin. Nonetheless, we will see that the progressive validation opponent is \emph on not \emph default much stronger. \layout Standard There are two ways in which we can show that the progressive validation opponent is not much stronger. The first technique will show that the variance of the progressive validation estimate is smaller than might be expected. Then we will show that the \emph on deviations \emph default of the progressive validation opponent behave much like the deviations of an independent opponent. \layout Theorem Suppose we test the progressive validation hypothesis on \begin_inset Formula \( m_{\textrm{test}}=m_{\textrm{pv}} \) \end_inset additional examples. Let \begin_inset Formula \( \hat{e}_{\textrm{test}} \) \end_inset be the empirical error on these examples. Then, we have: \begin_inset Formula \[ E_{(x,y)^{m_{\textrm{pv}}+m_{\textrm{test}}}\sim D^{m_{\textrm{pv}}+m_{\textrm{test}}}}(\hat{e}_{\textrm{test}}-e_{\textrm{pv}})^{2}\] \end_inset \begin_inset Formula \[ \geq E_{(x,y)^{m_{\textrm{pv}}}\sim D^{m_{\textrm{pv}}}}(\hat{e}_{\textrm{pv}}-e_{\textrm{pv}})^{2}\] \end_inset \layout Proof Every example on the left hand side can be though of as a coin with bias \begin_inset Formula \( e_{\textrm{pv}}. \) \end_inset The variance of the LHS is then \begin_inset Formula \( m*e_{\textrm{pv}}(1-e_{\textrm{pv}}) \) \end_inset . The right hand side is: \begin_inset Formula \[ E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{\textrm{pv}}-e_{\textrm{pv}})^{2}\] \end_inset \begin_inset Formula \[ =E_{(x,y)^{m}\sim D^{m}}\left[ \sum _{i=1}^{m}\hat{e}_{i}-e_{i}\right] ^{2}\] \end_inset where \begin_inset Formula \( e_{i}=e(h_{i}) \) \end_inset \begin_inset Formula \[ =E_{(x,y)^{m}\sim D^{m}}\left[ \sum _{i\neq j}(\hat{e}_{i}-e_{i})(\hat{e}_{j}-e_{j})+\sum _{i=1}^{m}(\hat{e}_{i}-e_{i})^{2}\right] \] \end_inset \begin_inset Formula \[ =\sum _{i\neq j}E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{i}-e_{i})(\hat{e}_{j}-e_{j})+\sum _{i=1}^{m}E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{i}-e_{i})^{2}\] \end_inset The cross product term is: \begin_inset Formula \[ E_{(x,y)^{m}\sim D^{m}}(\hat{e}_{i}-e_{i})(\hat{e}_{j}-e_{j})\] \end_inset \layout Proof Without loss of generality, assume that \begin_inset Formula \( i\textrm{Occam}>\textrm{Microchoice }>\textrm{Shell}\simeq \textrm{Holdout}\] \end_inset This ordering is approximately as expected based on theoretical considerations. The Simple bound can never be much better than Occam Bound and the Occam bound can be arbitrarily tighter than the Simple bound. A similar statement holds for the Microchoice Bound and the Shell bound. The Occam bound is only significantly looser than the Microchoice bound because we used the Hoeffding approximation to the Binomial tail. The Shell bound is not always the best, but it does behave well in comparison to the more standard holdout approach. \layout Standard It is interesting to note that the sampling shell bound is \emph on not \emph default better than the Microchoice bound on these learning problems, even with fast sampling techniques. Apparently, the looseness introduced by bounding the sampling error is not countered by the improvement in tightness. \layout Standard Empirically, we can observe a very noticeable behavior. For problems with less than \begin_inset Formula \( 100 \) \end_inset examples the sample complexity bounds are superior to the holdout bound. Between \begin_inset Formula \( 100 \) \end_inset and \begin_inset Formula \( 1000 \) \end_inset examples, the behavior changes with the holdout bound generally winning, although not necessarily by much. Above \begin_inset Formula \( 1000 \) \end_inset examples, the holdout bound is significantly and consistently tighter than the sample complexity bounds. This behavior strongly suggests that the sample complexity bounds are loose. Each of these bounds is ``tight'' in one sense or another, but there may exist some as yet undiscovered observable property prevalent in practical machine learning algorithms which allows us to create a tighter bound. In particular, the problem of correlated hypotheses has yet to be solved in a convincing manner. \layout Standard Also note that the holdout bound is \emph on not \emph default the tightest bound we report. In general, we have the following ordering: \begin_inset Formula \[ \textrm{Holdout}>\textrm{Holdout}+\textrm{Micro}>\textrm{Holdout}+\textrm{Shell}\] \end_inset \layout Standard The combined bounds seem to have the best behavior in practice. \layout Standard There are several directions of future investigation which could further strengthen any of these approaches. For the sample complexity approach, it would be useful to address the non-indep endence of samples in the fast sampling method used for the Sampling Shell bound. We tested the simplest of holdout techniques so another natural extension is to test other holdout techniques. This was not done here, because the theory of these other techniques is lacking. \layout Problem (Open) Address the looseness introduced by hypotheses with a strong correlation. For example, two decision trees which differ in only one leaf probably don't have significantly different error rates. Using the union bound over these decision trees introduces unnecessary slack. Note that VC dimension and covering number analysis address this, but (unfortun ately) the formulas are either unevaluatable or introduce so much slack that the quantitative results are worse rather than better. \layout Chapter Neural Networks \layout Standard \begin_inset LatexCommand \label{SNN} \end_inset This work is joint with Rich Caruana and was published at NIPS \begin_inset LatexCommand \cite{SNN} \end_inset . \layout Standard Estimating the true error rate of a continuous valued classifier can be surprisingly difficult. For example, all known bounds on the true error rate of artificial neural networks tend to be extremely loose and often result in the meaningless bound of \begin_inset Quotes eld \end_inset always err \begin_inset Quotes erd \end_inset (error rate = 1.0). Figure \begin_inset LatexCommand \ref{bound_vs_epoch} \end_inset demonstrates this. \layout Standard The approach here is to \emph on not \emph default bound the true error rate of a neural network. Instead, we bound the true error rate of a related distribution over neural networks which we create by analyzing one neural network. The stochastic bound approach proves much more fruitful than trying to bound the true error rate of an individual network. The best current approaches \begin_inset LatexCommand \cite{Bartlett} \end_inset \begin_inset LatexCommand \cite{Panchenko} \end_inset often require \begin_inset Formula \( 1000 \) \end_inset , \begin_inset Formula \( 10000 \) \end_inset , or more examples before producing a nontrivial bound on the true error rate. We produce nontrivial bounds on the true error rate of a stochastic neural network with less than \begin_inset Formula \( 100 \) \end_inset examples. \layout Standard Our approach uses a PAC-Bayes bound such as theorem ( \begin_inset LatexCommand \ref{th-repbb} \end_inset ). The approach can be thought of as a redivision of the work between the experimenter and the theoretician: we make the experimenter work harder so that the theoretician's true error bound becomes much tighter. This \begin_inset Quotes eld \end_inset extra work \begin_inset Quotes erd \end_inset on the part of the experimenter is significant, but tractable, and the resulting bounds are \emph on much \emph default tighter. \layout Standard An alternative viewpoint is that the classification problem \emph on is \emph default finding a hypothesis with a low upper bound on the future error rate. We present a post-processing phase for neural networks which results in a classifier with a much lower upper bound on the future error rate. The post-processing can be used with any artificial neural net trained with any optimization method; it does not require the learning procedure be modified, re-run, or even that the threshold function be differentiable. In fact, this post-processing step can easily be adapted to other continuous valued learning algorithms. \layout Standard The post-processing step finds a \begin_inset Quotes eld \end_inset large \begin_inset Quotes erd \end_inset distribution over classifiers, which has a small \emph on average \emph default empirical error rate. Given the average empirical error rate, it is straightforward to apply the PAC-Bayes bound in order to find a bound on the \emph on average \emph default true error rate. We can find this large distribution over classifiers by performing a simple noise sensitivity analysis on the learned model. The noise model allows us to generate a distribution of classifiers with a known, small, average empirical error rate. We refer to the distribution of neural nets that results from this noise analysis as a \begin_inset Quotes eld \end_inset stochastic \begin_inset Quotes erd \end_inset neural net model. \layout Standard Why do we expect the PAC-Bayes bound to be a significant improvement over standard covering number and VC bound approaches? There exist learning problems for which the difference between the lower bound and the PAC-Bayes upper bound is tight up to \begin_inset Formula \( O\left( \frac{\ln m}{m}\right) \) \end_inset where \begin_inset Formula \( m \) \end_inset is the number of training examples. This is superior to the guarantees which can be made for typical covering number bounds where the gap is, at best, known up to an (asymptotic) constant. The guarantee that PAC-Bayes bounds are sometimes quite tight encourages us to apply them here. \layout Section Theoretical setup \layout Standard We first present a modern neural network bound (the \begin_inset Quotes eld \end_inset competition \begin_inset Quotes erd \end_inset ), then specialize the PAC-Bayes bound to a stochastic neural network. A stochastic neural network is simply a neural network where each weight in the neural network is drawn from some distribution whenever it is used. The reason for constructing a stochastic neural network is that it will have a \emph on much \emph default lower true error upper bound than the neural network. Furthermore, this will be accomplished without increasing the empirical error rate more than marginally. \layout Subsection Neural Network bound \layout Standard We will compare a specialization of the best current neural network true error rate bound \begin_inset LatexCommand \cite{Panchenko} \end_inset with our approach. The neural network bound is described in terms of the following parameters: \layout Enumerate A margin, \begin_inset Formula \( 0<\theta <1 \) \end_inset . \layout Enumerate A function \begin_inset Formula \( \phi \) \end_inset defined by \begin_inset Formula \( \phi (x)=1 \) \end_inset if \begin_inset Formula \( x<0 \) \end_inset , \begin_inset Formula \( \phi (x)=0 \) \end_inset if \begin_inset Formula \( x>1 \) \end_inset , and linear in between. \layout Enumerate \begin_inset Formula \( A_{i} \) \end_inset , an upper bound on the sum of the magnitude of the weights in the \begin_inset Formula \( i \) \end_inset th layer of the neural network \layout Enumerate \begin_inset Formula \( L_{i} \) \end_inset , a Lipschitz constant which holds for the \begin_inset Formula \( i \) \end_inset th layer of the neural network. A Lipschitz constant is a bound on the magnitude of the derivative. \layout Enumerate \begin_inset Formula \( d \) \end_inset , the size of the input space. \layout Standard With these parameters defined, we get the following bound. \layout Theorem (2 Layer Feed-Forward Neural Network bound) For all \begin_inset Formula \( \delta \in (0,1] \) \end_inset \begin_inset Formula \[ \Pr _{D}\left( \exists h\in H:\, \, e(h)>\inf _{\theta }\, \, b(\theta )\right) \leq \delta \] \end_inset where \begin_inset Formula \( b(\theta )=\frac{1}{m}\sum \phi \left( \frac{yh(x)}{\theta }\right) +\frac{2\sqrt{2\pi }}{\theta }32\sqrt{\frac{d+1}{m}}L_{1}L_{2}A_{1}A_{2}+\frac{\sqrt{\frac{1}{2}\ln \frac{2}{\delta }}+2}{\sqrt{m}} \) \end_inset \layout Proof Given in \latex latex \begin_inset LatexCommand \cite{Panchenko} \end_inset \latex default \layout Standard The theorem is actually only given up to a universal constant. \begin_inset Quotes eld \end_inset \begin_inset Formula \( 32 \) \end_inset \begin_inset Quotes erd \end_inset might be the right choice, but this is just an educated guess by the author \begin_inset LatexCommand \cite{DPP} \end_inset . The neural network true error bound above is (perhaps) the tightest known bound for general feed-forward neural networks and so it is the natural bound to compare with. \layout Standard This 2 layer feed-forward bound is not easily applied in a tight manner because we can't calculate a priori what our weight bound \begin_inset Formula \( A_{i} \) \end_inset should be. This can be patched up using the principle of structural risk minimization. In particular, we can state the bound for \begin_inset Formula \( A_{1}=\alpha ^{j} \) \end_inset where \begin_inset Formula \( j \) \end_inset is some non-negative integer and \begin_inset Formula \( \alpha >1 \) \end_inset is a constant. If the \begin_inset Formula \( j \) \end_inset th bound holds with probability \begin_inset Formula \( \frac{\delta }{j(j+1)} \) \end_inset , then all bounds will hold simultaneously with probability \begin_inset Formula \( 1-\delta \) \end_inset , since \begin_inset Formula \[ \sum _{j=1}^{\infty }\frac{1}{j(j+1)}=1\] \end_inset Applying this approach to the values of both \begin_inset Formula \( A_{1} \) \end_inset and \begin_inset Formula \( A_{2} \) \end_inset , we get the following theorem: \layout Corollary (2 Layer Feed-Forward Neural Network bound) For all \begin_inset Formula \( \delta \in (0,1] \) \end_inset \begin_inset Formula \[ \Pr _{D}\left( \exists h\in H,j,k:\, \, e(h)>\inf _{\theta }\, \, b(\theta ,j,k)\right) \] \end_inset where \begin_inset Formula \( b(\theta ,j,k)=\frac{1}{m}\sum \phi \left( \frac{yh(x)}{\theta }\right) +\frac{2\sqrt{2\pi }}{\theta }32\sqrt{\frac{d+1}{m}}L_{1}L_{2}\alpha ^{j}\beta ^{k}+\frac{\sqrt{\frac{1}{2}\ln \frac{2j(j+1)k(k+1)}{\delta }}+2}{\sqrt{m}} \) \end_inset \layout Proof Apply the union bound to all possible values of \begin_inset Formula \( j \) \end_inset and \begin_inset Formula \( k \) \end_inset as discussed above. \layout Standard In practice, we will use \begin_inset Formula \( \alpha =\beta =1.1 \) \end_inset and report the value of the tightest applicable bound for all \begin_inset Formula \( j,k \) \end_inset . \layout Subsection Stochastic Neural Network bound \layout Standard We will specialize a PAC-Bayes bound ( \begin_inset LatexCommand \ref{th-repbb} \end_inset ) for application to a stochastic neural network with a choice of the \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset . Our \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset will be zero on all neural net structures other than the one we train and a multidimensional isotropic gaussian on the values of the weights in our neural network. The multidimensional gaussian will have a mean of \begin_inset Formula \( 0 \) \end_inset and a variance in each dimension of \begin_inset Formula \( b^{2} \) \end_inset . This choice is made for convenience and happens to provide good results. \layout Standard The optimal value of \begin_inset Formula \( b \) \end_inset is unknown and dependent on the learning problem so we will wish to parameteriz e it in an example dependent manner. We can do this using the same trick as for the original neural net bound. Use a sequence of bounds where \begin_inset Formula \( b=c\alpha ^{j} \) \end_inset for \begin_inset Formula \( c \) \end_inset and \begin_inset Formula \( \alpha \) \end_inset some constants and \begin_inset Formula \( j \) \end_inset a nonnegative number. For the \begin_inset Formula \( j \) \end_inset th bound set \begin_inset Formula \( \delta \rightarrow \frac{\delta }{j(j+1)} \) \end_inset . The union bound will imply that all bounds hold simultaneously with probability at least \begin_inset Formula \( 1-\delta \) \end_inset . \layout Standard Assuming that our \begin_inset Quotes eld \end_inset posterior \begin_inset Quotes erd \end_inset \begin_inset Formula \( Q \) \end_inset is also defined by a multidimensional gaussian with the mean and variance in each dimension defined by \begin_inset Formula \( w_{i} \) \end_inset and \begin_inset Formula \( s_{i}^{2} \) \end_inset , we can specialize to the following corollary: \layout Corollary (Stochastic Neural Network bound) \begin_inset LatexCommand \label{co-snnb} \end_inset Let \begin_inset Formula \( k \) \end_inset be the number of weights in a neural network, \begin_inset Formula \( w_{i} \) \end_inset be the \begin_inset Formula \( i \) \end_inset the weight and \begin_inset Formula \( s_{i} \) \end_inset be the variance of the \begin_inset Formula \( i \) \end_inset th weight. Then, we have: \begin_inset Formula \begin{equation} \label{big_bound} \Pr _{D}\left( \exists q(h):\, \, \textrm{KL}(\hat{e}_{q}(h)||e_{q}(h))\geq \inf _{j}\frac{\sum _{i=1}^{k}\left[ \ln \frac{c\alpha ^{j}}{s_{i}}+\frac{s_{i}^{2}+w_{i}^{2}}{2c^{2}\alpha ^{2j}}-\frac{1}{2}\right] +\ln \frac{j(j+1)m}{\delta }}{m-1}\right) \leq \delta \end{equation} \end_inset \layout Proof Analytic calculation of the KL divergence between two multidimensional Gaussians and the union bound applied for each value of \begin_inset Formula \( j \) \end_inset . \layout Standard We will choose \begin_inset Formula \( \alpha =1.1 \) \end_inset and \begin_inset Formula \( c=0.2 \) \end_inset as reasonable default values. \layout Standard One more step is necessary in order to apply this bound. The essential difficulty is evaluating \begin_inset Formula \( \hat{e}_{q}(h) \) \end_inset . This quantity is observable although calculating it to high precision is difficult. We will use the Monte Carlo sampling technique of section \begin_inset LatexCommand \ref{pb-approx} \end_inset in order to bound the value of \begin_inset Formula \( \hat{e}_{q}(h) \) \end_inset and then use the bound on this value in the PAC-Bayes bound. We use \begin_inset Formula \( n=1000 \) \end_inset evaluations of the empirical error rate of the stochastic neural network. \layout Subsection Distribution Construction algorithm \layout Standard One critical step is missing in the description: How do we calculate the multidimensional gaussian, \begin_inset Formula \( Q \) \end_inset ? The variance of the posterior gaussian needs to be dependent on each weight in order to achieve a tight bound since we want any \begin_inset Quotes eld \end_inset meaningless \begin_inset Quotes erd \end_inset weights to not contribute significantly to the overall sample complexity. We use a simple greedy algorithm to find the appropriate variance in each dimension. \layout Enumerate Train a neural net on the examples \layout Enumerate For every weight, \begin_inset Formula \( w_{i} \) \end_inset , search for the variance, \begin_inset Formula \( s_{i}^{2} \) \end_inset , which reduces the empirical accuracy of the trained network by \begin_inset Formula \( 1\% \) \end_inset (for example) while holding all other weights fixed. \layout Enumerate The stochastic neural network defined by \begin_inset Formula \( \{w_{i},\, \, s_{i}^{2}\} \) \end_inset will generally have a too-large empirical error. Therefore, we calculate a global multiplier \begin_inset Formula \( \lambda <1 \) \end_inset such that the stochastic neural network defined by \begin_inset Formula \( \{w_{i},\, \, \lambda s_{i}^{2}\} \) \end_inset decreases the empirical accuracy by only \begin_inset Formula \( 1\% \) \end_inset . \layout Enumerate Then, we evaluate the empirical error rate of the resulting stochastic neural net with \begin_inset Formula \( 1000 \) \end_inset samples from the stochastic neural network. \layout Section Experimental Results \layout Standard \begin_float fig \layout Standard \begin_inset Figure size 142 99 file bound_vs_epoch.ps width 4 48 angle 270 flags 9 \end_inset \begin_inset Figure size 142 99 file bound_vs_epoch_easy.ps width 4 48 angle 270 flags 11 \end_inset \layout Caption \begin_inset LatexCommand \label{bound_vs_epoch} \end_inset Plot of errors and true error bounds for the neural network (NN) and the stochastic neural network (SNN). The graph exhibits overfitting after approximately 6000 pattern presentations. The slope of the neural network true error bound is positive because the size of the weights is gradually increasing. Note that a true error bound of \begin_inset Quotes eld \end_inset 100 \begin_inset Quotes erd \end_inset implies that a factor of \begin_inset Formula \( 100^{2} \) \end_inset more examples are required in order to make a non-vacuous bound. The graph on the right expands the vertical scale by excluding the poor true error bound. \end_float \layout Standard How well can we bound the true error rate of a stochastic neural network? The answer is \emph on much \emph default better than we can bound the true error rate of a neural network. \layout Standard Our experimental results take place on a synthetic data set which has 25 input dimensions and one output dimension. Most of these dimensions are useless---simply random numbers drawn from a \begin_inset Formula \( N(0,1) \) \end_inset Gaussian. One of the 25 input dimensions is dependent on the label. First, the label \begin_inset Formula \( y \) \end_inset is drawn uniformly from \begin_inset Formula \( \{-1,1\} \) \end_inset , then the special dimension is drawn from a \begin_inset Formula \( N(y,1) \) \end_inset Gaussian. Note that this learning problem can not be solved perfectly because some examples will be drawn from the tail of the gaussian. \layout Standard The \begin_inset Quotes eld \end_inset ideal \begin_inset Quotes erd \end_inset neural net to use in solving this problem is a single node perceptron. We will instead use a 2 layer neural net with 2 hidden nodes. This overly large neural net will result in the potential for significant overfitting which makes the bound prediction problem interesting. It is also somewhat more \begin_inset Quotes eld \end_inset realistic \begin_inset Quotes erd \end_inset if the neural net structure does not exactly fit the learning problem. \layout Standard All of our data sets will use just \begin_inset Formula \( 100 \) \end_inset examples. Constructing a non-vacuous bound for a continuous hypothesis space at \begin_inset Formula \( 100 \) \end_inset examples is quite challenging as indicated by figure \begin_inset LatexCommand \ref{bound_vs_epoch} \end_inset . Conventional bounds are hopelessly loose while the stochastic neural network bound is still not as tight as might be desired. There are several notable things about this figure. \layout Enumerate The SNN upper bound is \emph on 2-3 \emph default orders of magnitude lower than the NN upper bound. \layout Enumerate The SNN performs better than expected. In particular, the SNN true error rate is closer than \begin_inset Formula \( 1\% \) \end_inset of the NN true error rate. This is surprising considering that we fixed the difference in empirical error rates at \begin_inset Formula \( 1\% \) \end_inset . \layout Enumerate The SNN bound has a minimum at 12000 pattern presentations which weakly predicts the overfitting point of 6000 for both the SNN and the NN. \layout Standard The comparison between the neural network bound and the stochastic neural network bound is not quite \begin_inset Quotes eld \end_inset fair \begin_inset Quotes erd \end_inset due to the form of the bound. In particular, the stochastic neural network bound can never return a value greater than \begin_inset Quotes eld \end_inset always err \begin_inset Quotes erd \end_inset . This implies that when the bound is near the value of \begin_inset Quotes eld \end_inset \begin_inset Formula \( 1 \) \end_inset \begin_inset Quotes erd \end_inset , it is difficult to judge how rapidly extra examples will improve the stochasti c neural network bound. We can judge the sample complexity of the stochastic bound by plotting the value of the numerator in equation \begin_inset LatexCommand \ref{big_bound} \end_inset . Figure \begin_inset LatexCommand \ref{complexity_vs_epoch} \end_inset plots the complexity versus the number of pattern presentations in training. \begin_float fig \layout Standard \align center \begin_inset Figure size 267 187 file complexity.ps width 4 90 angle 270 flags 9 \end_inset \layout Caption \begin_inset LatexCommand \label{complexity_vs_epoch} \end_inset We plot the \begin_inset Quotes eld \end_inset complexity \begin_inset Quotes erd \end_inset of the stochastic network model (numerator of \begin_inset LatexCommand \ref{big_bound} \end_inset ) vs. training epoch. Note that the complexity increases with more training as expected and stays below \begin_inset Formula \( 100 \) \end_inset , implying non-vacuous bounds on a training set of size \begin_inset Formula \( 100 \) \end_inset . \end_float \layout Standard The stochastic bound is a radical improvement on the neural network bound but it is not yet a perfectly tight bound. Given that we do not have a perfectly tight bound, one important consideration arises: does the minimum of the stochastic bound predict the minimum of the true error rate (as predicted by a large holdout data set). In particular, can we use the stochastic bound to determine when we should cease training? The stochastic bound depends upon (1) the complexity which increases with training time and (2) the training error which decreases with training time. This dependence results in a minima which for our problem occurs at approximate ly 12000 pattern presentations. The point of minimal true error (for the stochastic and deterministic neural networks) occurs at approximately 6000 pattern presentations indicating that the stochastic bound weakly predicts the point of minimum error. The neural network bound has no such minimum. \layout Standard Is the choice of 5% increased empirical error optimal? In general, the \begin_inset Quotes eld \end_inset optimal \begin_inset Quotes erd \end_inset choice of the extra error rate depends upon the learning problem. Since the stochastic neural network bound (corollary \begin_inset LatexCommand \ref{co-snnb} \end_inset ) holds for all multidimensional gaussian distributions, we are free to optimize the choice of distribution in anyway we desire. Figure \begin_inset LatexCommand \ref{q_opt} \end_inset .2 shows the resulting bound for different choices of \begin_inset Formula \( q \) \end_inset . The bound has a minimum at \begin_inset Formula \( 0.03 \) \end_inset extra error indicating that a slightly lower bound is possible if we accept a larger training error. Also note that the complexity always decreases with increasing entropy in the distribution of our stochastic neural net. The existence of a minimum in Figure \begin_inset LatexCommand \ref{q_opt} \end_inset .2 is the \begin_inset Quotes eld \end_inset right \begin_inset Quotes erd \end_inset behavior: the increased empirical error rate is significant in the calculation of the true error bound. \begin_float fig \layout Standard \align center \begin_inset Figure size 267 187 file error_vs_bound.ps width 4 90 angle 270 flags 9 \end_inset \layout Standard \align center \begin_inset LatexCommand \label{q_opt} \end_inset Plot of the stochastic neural net (SNN) bound for \begin_inset Quotes eld \end_inset posterior \begin_inset Quotes erd \end_inset distributions chosen according to the extra empirical error they introduce. \end_float \layout Section Conclusion \layout Standard PAC-Bayes bounds give excellent results on a stochastic neural network. The stochastic neural network bound is radically tighter ( \begin_inset Formula \( 2-3 \) \end_inset orders of magnitude) bound on the true error rate of a classifier while increasing the empirical and true error rates only a small amount. \layout Standard Although, the stochastic neural net bound is not completely tight, it is not vacuous with just \begin_inset Formula \( 100 \) \end_inset examples and the minima of the bound weakly predicts the point where overtraini ng occurs. \layout Problem (Open) To what extent do these results extend to other learning problems and other continuous learning algorithms? Work is under-way (most notably in \begin_inset LatexCommand \cite{Seeger} \end_inset ) to evaluate both of these questions. \layout Chapter Conclusion & Challenges \layout Standard The basic conclusion of this thesis is that we can achieve bounds tight enough to yield useful results on real learning problems with standard learning algorithms or simple variants on standard learning algorithms. The evidence of this conclusion is reported in the last two chapters. \layout Standard To accomplish this goal, considerable theoretical work was completed. This includes: \layout Enumerate Derivation of the microchoice bounds and adaptive microchoice bounds in Section \begin_inset LatexCommand \ref{sec-MC} \end_inset . \layout Enumerate Improvement and simplification of the PAC-Bayes bound in Section \begin_inset LatexCommand \ref{sec-PB} \end_inset . \layout Enumerate Improvement of the margin bound to achieve benefit from averaging \begin_inset LatexCommand \ref{sec-average} \end_inset . \layout Enumerate Derivation of the Shell bounds \begin_inset LatexCommand \ref{sec-shell} \end_inset . \layout Enumerate An investigation into how to repair the covering number approach \begin_inset LatexCommand \ref{sec-cover} \end_inset . \layout Enumerate An analysis of progressive validation \begin_inset LatexCommand \ref{sec-PV} \end_inset . \layout Enumerate An analysis for the combination of training and test set bounds \begin_inset LatexCommand \ref{sec-combined} \end_inset . \layout Standard In particular, microchoice bounds (Section \begin_inset LatexCommand \ref{sec-MC} \end_inset ), PAC-Bayes bounds (Section \begin_inset LatexCommand \ref{sec-PB} \end_inset ), shell bound (Section \begin_inset LatexCommand \ref{sec-shell} \end_inset ), and combined bounds (Section \begin_inset LatexCommand \ref{sec-combined} \end_inset ) have proved useful for practical application. \layout Standard It is worth emphasizing that all of the bounds reported here rest upon only an assumption of example independence implying wide applicability. \layout Bibliography \bibitem {1} Peter Bartlett, \begin_inset Quotes eld \end_inset The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights is More Important than the Size of the Network \begin_inset Quotes erd \end_inset , IEEE Transactions on Information Theory, Vol. 44, No. 2, March 1998. \layout Bibliography \bibitem {BI} Gyora M. Benedek and Alon Itai, ``Learnability by Fixed Distributions'', Proceedings of the 1988 Workshop on Computational Learning Theory. \layout Bibliography \bibitem {Progressive} Avrim Blum, Adam Kalai, and John Langford 1999. Beating the Holdout: Bounds for KFold and Progressive Cross-Validation. COLT99. http://www.cs.cmu.edu/~jcl/papers/progressive_validation/coltfinal.ps \layout Bibliography \bibitem {AR} Avrim Blum and Ron Rivest, \begin_inset Quotes eld \end_inset Training a 3-Node Neural Network is NP-Complete \begin_inset Quotes erd \end_inset , Neural Networks, 5(1):117-127, 1992. \layout Bibliography \bibitem {BEHW} A. Blumer, A. Ehrenfeucht, D. Haussler, M. Warmuth. \begin_inset Quotes eld \end_inset Occam's Razor. \begin_inset Quotes erd \end_inset Information Processing Letters 24: 377-380, 1987. \layout Bibliography \bibitem {Bagging} L. Breiman, "Bagging Predictors" Machine Learning, Vol. 24, No . 2, pp. 123-140. \layout Bibliography \bibitem {Chernoff} H. Chernoff, \begin_inset Quotes eld \end_inset A Measure of the asymptotic efficiency of tests of a hypothesis based on the sum of observations \begin_inset Quotes erd \end_inset , Annals of Mathematical Statistics, 23:493-507, 1952 \layout Bibliography \bibitem {SVM} N. Christianini, J. Shaw-Taylor, \begin_inset Quotes eld \end_inset Support Vector Machines \begin_inset Quotes erd \end_inset , Cambridge University Press, \protected_separator 2000 ISBN: 0 521 78019 5 \layout Bibliography \bibitem {QM} Claude Cohen-Tannoudji, Bernard Diu, Frank Laloe, Bernard Dui, \begin_inset Quotes eld \end_inset Quantum Mechanics \begin_inset Quotes erd \end_inset \layout Bibliography \bibitem {Cover} Thomas Cover and Joy Thomas, ``Elements of Information Theory'' Wiley, New York 1991. \layout Bibliography \bibitem {Devroye} Luc Devroye, Laszlo Gyorfi, Gabor Lugosi, \begin_inset Quotes eld \end_inset A Probabilistic Theory of Pattern Recognition \begin_inset Quotes erd \end_inset Springer-Verlag New York, 1996. \layout Bibliography \bibitem {Domingos} Pedro Domingos. \begin_inset Quotes eld \end_inset A Process-Oriented Heuristic for Model Selection, Machine Learning Proceedings of the Fifteenth International Conference, "Morgan Kaufmann Publishers, San Francisco, CA, 1998, pp 127-135. \layout Bibliography \bibitem {Dudley} R. M. Dudley, ``A course on empirical processes'', Lecture Notes in Mathematics 1097, 2-141. Springer-Verlag, New York. \layout Bibliography \bibitem {AHKV} A. Ehrenfucht, D. Haussler, M. Kearns, and L. Valiant, \begin_inset Quotes eld \end_inset A General Lower Bound on the Number of Examples Needed for Learning \begin_inset Quotes erd \end_inset , Information and Computation 82(3), pp. 247-261, 1989. \layout Bibliography \bibitem {ET} B. Efron and R. Tibshirani, \begin_inset Quotes eld \end_inset An Introduction to the Bootstrap \begin_inset Quotes erd \end_inset , Chapman & Hall, London, 1993. \layout Bibliography \bibitem {Ypredicting} Yoav Freund. \begin_inset Quotes eld \end_inset Predicting a binary sequence almost as well as the optimal biased coin \begin_inset Quotes erd \end_inset , COLT 1996. http://citeseer.nj.nec.com/freund96predicting.html \layout Bibliography \bibitem {SB} Yoav Freund. Self-Bounding Learning Algorithms. COLT 98. http://www.research.att.com/~yoav/papers/lsearch.ps.gz \layout Bibliography \bibitem {Boosting} Yoav Freund and Robert E. Schapire, ``A Decision Theoretic Generalization of On-line Learning and an Application to Boosting'' Eurocolt 1995 \layout Bibliography \bibitem {Oded} Oded Goldreich, \begin_inset Quotes eld \end_inset The Foundations of Cryptography - Volume 1 \begin_inset Quotes erd \end_inset , ISBN 0-521-79172-3 Cambridge University Press, 2001 \layout Bibliography \bibitem {Haussler} David Haussler, ``Mathematical perspectives on Neural networks '', Lawrence Erlbaum Associates, 1995. \layout Bibliography \bibitem {HKS} David Haussler, Michael Kearns, and Robert Schapire, \begin_inset Quotes eld \end_inset Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC dimension \begin_inset Quotes erd \end_inset , Machine Learning 14, pp. 83-113, 1994. \layout Bibliography \bibitem {old_shell} David Haussler, Michael Kearns, H. Sebastian Seung, and Naftali Tishby, \begin_inset Quotes eld \end_inset Rigorous Learning Curve Bounds from Statistical Mechanics \begin_inset Quotes erd \end_inset Machine Learning,25, 1996, pages 195--236. \layout Bibliography \bibitem {Hoeffding} Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13-30 \layout Bibliography \bibitem {ME} T. Jaakkola, M. Meila, T. Jebara, \begin_inset Quotes eld \end_inset Maximum Entropy D iscrimination \backslash char \begin_inset Quotes erd \end_inset NIPS 1999. \layout Bibliography \bibitem {Kalai} Adam Kalai, \begin_inset Quotes eld \end_inset Probabilistic and On-line methods in Machine Learning \begin_inset Quotes erd \end_inset .thesis, CMU-CS-01-132. \layout Bibliography \bibitem {SQ} Michael Kearns. Efficient Noise-Tolerant Learning From Statistical Queries, Proceedings of the 25th ACM Symposium on the Theory of Computing, pp. 392-401, 1993, ACM Press. http://www.research.att.com/~mkearns/papers/sq-noise.ps.Z \layout Bibliography \bibitem {Sanity} Michael Kearns and Dana Ron, \begin_inset Quotes eld \end_inset Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation. \begin_inset Quotes erd \end_inset Neural Computation 11(6), pages 1427-1453, 1999. Also in Proceedings of the Tenth Annual Conference on Computational Learning Theory, ACM Press, 1997, pages 152--162. \layout Bibliography \bibitem {Panchenko} V. Koltchinskii and D. Panchenko, \begin_inset Quotes eld \end_inset Empirical Margin Distributions and Bounding the Generalization Error of Combined Classifiers \begin_inset Quotes erd \end_inset , preprint, http://citeseer.nj.nec.com/386416.html \layout Bibliography \bibitem {Grun} P.Kontkanen, P. Myllymaki, T. Silander, H.Tirri, and P. Grunwald, Predictive Distributions and Bayesian Networks, Journal of Statistics and Computing 10, pp. 39-54, 2000. \layout Bibliography \bibitem {MC} John Langford and Avrim Blum 1999. Microchoice Bounds and Self Bounding learning algorithms. COLT99. http://www.cs.cmu.edu/~jcl/papers/microchoice/mc.ps \layout Bibliography \bibitem {MC_journal} John Langford and Avrim Blum 1999. Microchoice Bounds and Self Bounding learning algorithms. Machine Learning Journal. http://www.cs.cmu.edu/~jcl/papers/microchoice/journal/journal_final.ps \layout Bibliography \bibitem {SNN} John Langford and Rich Caruana, (Not) Bounding the True Error NIPS2001. \layout Bibliography \bibitem {Shell} John Langford and David McAllester, \begin_inset Quotes eld \end_inset Computable Shell Decomposition Bounds \begin_inset Quotes erd \end_inset COLT 2000. \layout Bibliography \bibitem {averaging_icml} John Langford, Matthias Seeger, and Nimrod Megiddo, \begin_inset Quotes eld \end_inset An Improved Predictive Accuracy Bound for Averaging Classifiers \begin_inset Quotes erd \end_inset ICML2001. \layout Bibliography \bibitem {averaging_tech} John Langford and Matthias Seeger, \begin_inset Quotes eld \end_inset Bounds for Averaging Classifiers. \begin_inset Quotes erd \end_inset CMU tech report, CMU-CS-01-102, 2001. \layout Bibliography \bibitem {Littlestone89} N. Littlestone. \begin_inset Quotes eld \end_inset From on-line to batch learning \begin_inset Quotes erd \end_inset In \emph on Proceedings of the 2nd Annual Workshop on Computational Learning Theory \emph default , pp. 269--284, 1989. \layout Bibliography \bibitem {Tom} Tom Mitchell, \begin_inset Quotes eld \end_inset Machine Learning \begin_inset Quotes erd \end_inset , McGraw Hill, 1997. \layout Bibliography \bibitem {Yishay} Yishay Mansour, ``Pessimistic Decision Tree Pruning Based on Tree Size'', ICML1997. \layout Bibliography \bibitem {PB} David McAllester, ``PAC-Bayesian Model Averaging'' COLT 1999. \layout Bibliography \bibitem {McD} Colin McDiarmid, ``On the method of bounded differences'', In \backslash emph {Surveys in Combinatorics 1989,} pages 148-188. Cambridge University Press, 1989. \layout Bibliography \bibitem {Mingers} Jon Mingers, An Empirical Comparison of Pruning Methods for Decision Tree Induction, Machine Learning, 1989, vol. 4, 227-243. \layout Bibliography \bibitem {DPP} Dimitry Panchenko, personal communications, 2001 \layout Bibliography \bibitem {Pollard} David Pollard, ``Convergence of Stochastic Processes'', Springe r-Verlag 1984. \layout Bibliography \bibitem {Quinlan} J.L. Quinlan, Induction of Decision Trees, Machine Learning, 1986, vol. 1, pp 81-106. \layout Bibliography \bibitem {Rivest} R.L. Rivest, Learning Decision Lists. Machine Learning, 1987, vol. 2, pp 229-246. \layout Bibliography \bibitem {Margin} Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee, ``Boosting the Margin: A new explanation for the effectiveness of voting methods'' The Annals of Statistics, 26(5):1651-1686, 1998. \layout Bibliography \bibitem {Tobias} Tobias Scheffer and Thorsten Joachims, \begin_inset Quotes eld \end_inset Expected Error Analysis for Model Selection \begin_inset Quotes erd \end_inset , ICML 1999. \layout Bibliography \bibitem {Seeger} Matthias Seeger, \begin_inset Quotes eld \end_inset PAC-Bayesian Generalization Error Bounds for Gaussian Processes \begin_inset Quotes erd \end_inset , Tech Report, \protected_separator Division of Informatics report EDI-INF-RR-0094. http://www.dai.ed.ac.uk/homes/seeger/papers/gpmcall-tr.ps.gz \layout Bibliography \bibitem {VW} Aad W. van der Vaart and Jon A. Wellner, ``Weak Convergence and Empirical Processes'', Springer-Verlag 1996. \layout Bibliography \bibitem {Valiant} L.G. Valiant. \begin_inset Quotes eld \end_inset A Theory of the Learnable. \begin_inset Quotes erd \end_inset Communications of the ACM 27(11):1134-1142, November 1984 \layout Bibliography \bibitem {Vapnik} V. N. Vapnik and A. Y. Chervonenkis. \begin_inset Quotes eld \end_inset On the uniform convergence of relative frequencies of events to their probabilit ies. \begin_inset Quotes erd \end_inset Theory of Probab. and its Applications, 16(2):264-280, 1971. \layout Bibliography \bibitem {V2} Vladimir N. Vapnik, \begin_inset Quotes eld \end_inset Statistical Learning Theory \begin_inset Quotes erd \end_inset , Wiley, December 1999. \layout Bibliography \bibitem {LtoL} Jon Baxter. A Model of Inductive Bias Learning. 1998. http://wwwsyseng.anu.edu.au/~jon/papers/ltolchap.ps.gz \layout Chapter Appendix: Definitions \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \LyXTable multicol5 50 2 0 0 -1 -1 -1 -1 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 8 1 0 "" "" 8 1 1 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" 0 8 1 0 0 0 0 "" "" Term \newline Definition \newline \begin_inset Formula \( x \) \end_inset \newline The \begin_inset Quotes eld \end_inset input \begin_inset Quotes erd \end_inset which we can predict with. \newline \begin_inset Formula \( y \) \end_inset \newline The \begin_inset Quotes eld \end_inset output \begin_inset Quotes erd \end_inset which we want to predict. \newline \begin_inset Formula \( (x,y) \) \end_inset \newline An \begin_inset Quotes eld \end_inset example \begin_inset Quotes erd \end_inset or \begin_inset Quotes eld \end_inset sample \begin_inset Quotes erd \end_inset which is an pair \newline \begin_inset Formula \( S \) \end_inset \newline A set of samples. \newline \begin_inset Formula \( m \) \end_inset \newline The number of samples. \newline \begin_inset Formula \( m_{\textrm{test}} \) \end_inset \newline The number of samples in the testing set. \newline \begin_inset Formula \( m_{\textrm{train}} \) \end_inset \newline The number of samples in the training set. \newline \begin_inset Formula \( h \) \end_inset \newline A hypothesis = a function from \begin_inset Formula \( x \) \end_inset to \begin_inset Formula \( y \) \end_inset \newline \begin_inset Formula \( D \) \end_inset \newline An (unknown) distribution over \begin_inset Formula \( (x,y) \) \end_inset pairs. \newline \begin_inset Formula \( \textrm{Bin}(m,k,p) \) \end_inset \newline The probability that a Binomial with \begin_inset Formula \( m \) \end_inset coins and bias \begin_inset Formula \( p \) \end_inset has \begin_inset Formula \( k \) \end_inset or fewer heads. \newline \begin_inset Formula \( \bar{e}(m,k,\delta ) \) \end_inset \newline An upper bound on \begin_inset Formula \( \textrm{Bin}(m,k,p) \) \end_inset . \newline \begin_inset Formula \( p(h) \) \end_inset \newline A distribution over hypotheses not dependent on the training set. \newline \begin_inset Formula \( q(h) \) \end_inset \newline A distribution over hypotheses dependent on the training set. \newline \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset \newline Error rate on the sample set \begin_inset Formula \( S \) \end_inset \newline \begin_inset Formula \( e_{D}(h) \) \end_inset \newline \begin_inset Formula \( \Pr _{(x,y)\sim D}(h(x)\neq y) \) \end_inset = true error rate = future error rate \newline \begin_inset Formula \( \hat{e}_{\textrm{test}}(h) \) \end_inset \newline Error rate on a test set \newline \begin_inset Formula \( \delta \) \end_inset \newline The probability that a bound fails. Typically small. \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \newline \layout Chapter Appendix: Manual \layout Standard This chapter is intended as a manual for the practical application of learning-t heory based bounds. In particular, we introduce a program \begin_inset Quotes eld \end_inset bound \begin_inset Quotes erd \end_inset available at: \layout Standard http://www.cs.cmu.edu/~jcl/programs/bound/bound.html \layout Section Test Error Bound Calculation \layout Standard \begin_inset LatexCommand \label{sec-test-bound-calc} \end_inset We can use the holdout bound to calculate upper and lower true error bounds with the program \begin_inset Quotes eld \end_inset bound \begin_inset Quotes erd \end_inset . Here is an example of the application. \layout Standard \latex latex \backslash begin{verbatim} \newline 11:32PM z-12: echo "test_examples 600 \newline test_errors 192 \newline delta 0.025" | bound \newline Applying varying approximation tail bound \newline delta 0.025 \newline lower_delta 0.5 \newline test_examples 600 \newline test_errors 192 \newline approximation automatic \newline true_error = 0.32 0.358976827934 0.31926717516 \newline \backslash end{verbatim} \layout Standard There are several arguments passed to the program. These include: \layout Enumerate test_examples : the number of test examples. \layout Enumerate test_errors : the number of test errors. \layout Enumerate delta : the probability of failure of the upper bound. \layout Enumerate lower_delta : the probability of failure of the lower bound. \layout Standard The output of the program is several lines stating the used and assumed arguments and a final line of the form: \layout Standard \latex latex \backslash begin{verbatim} \newline true_error = \newline \backslash end{verbatim} \layout Standard The output of this particular application implies that the true error rate of the chosen hypothesis is less than \begin_inset Formula \( 0.35897 \) \end_inset with high confidence (over the drawn sample set). \layout Standard For simplicity, the arguments can also be placed into a file: \layout Standard \latex latex \backslash begin{verbatim} \newline 11:35PM z-14: cat test_error \newline test_examples 800 \newline test_errors 192 \newline delta 0.025 \newline lower_delta 0.025 \newline 11:35PM z-15: bound test_error \newline Applying varying approximation tail bound \newline delta 0.025 \newline lower_delta 0.025 \newline test_examples 800 \newline test_errors 192 \newline approximation automatic \newline true_error = 0.24 0.28252880089 0.20065891277 \newline \backslash end{verbatim} \layout Standard The output of this program can be interpreted as a confidence interval. \layout Section Training Set Bound Calculation \layout Standard \begin_inset LatexCommand \label{sec-train-bound-calc} \end_inset Many of the training set based bounds effectively reduce the value of \begin_inset Formula \( \delta \) \end_inset by some fixed amount. The program 'bound' will automatically reduce the value of \begin_inset Formula \( \delta \) \end_inset with two arguments. \layout Enumerate log_prior : the log (base e) of the \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset (in the sense of theorem \begin_inset LatexCommand \ref{th-ORB} \end_inset ) of the hypothesis. \layout Enumerate log_hypothesis_size : the log of the number of hypotheses (in the sense of theorem \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ) \layout Standard Each of these arguments acts independently to reduce the value of \begin_inset Formula \( \delta \) \end_inset . The value of \begin_inset Formula \( \delta \) \end_inset will only be reduced on \emph on training \emph default example based bounds. There are two more arguments necessary for application of one of the simple training set based bound: \layout Enumerate training_examples : the number of training examples. \layout Enumerate training_errors : the number of training errors. \layout Standard Here is an application of a simple training set based bound: \layout Standard \latex latex \backslash begin{verbatim} \newline 12:07AM z-26: cat train_error \newline train_examples 100 \newline train_errors 3 \newline log_prior -50 \newline delta 0.3 \newline 12:07AM z-27: bound train_error \newline Applying varying approximation tail bound \newline delta 0.3 \newline lower_delta 0.5 \newline train_examples 100 \newline train_errors 3 \newline log_hypothesis_size 0 \newline log_prior -50 \newline approximation automatic \newline true_error = 0.03 0.466502545401 9.31322574615e-10 \newline \backslash end{verbatim} \layout Section Shell Bound Calculation \layout Standard \begin_inset LatexCommand \label{sec-shell-bound-calc} \end_inset The program 'bound' can calculate a shell bound or a sampling shell bound. To do these calculations, two extra parameters must be specified: \layout Enumerate error_log_count : should be the log (base e) of the number of hypotheses with empirical error. \layout Enumerate sample_space_log_size : should be the size of space of hypotheses uniformly sampled from if the evaluation is inexact (and not passed if the evaluation is exact) \layout Standard Suppose wanted to use the shell bound and knew that \begin_inset Formula \( e^{5} \) \end_inset hypotheses had \begin_inset Formula \( 25 \) \end_inset training error and \begin_inset Formula \( e^{30} \) \end_inset had \begin_inset Formula \( 50 \) \end_inset training errors. Then, we might apply the bound as: \layout Standard \latex latex \backslash begin{verbatim} \newline 12:07AM z-28: cat shell_error \newline train_examples 100 \newline train_errors 3 \newline error_log_count 3 1 \newline error_log_count 25 5 \newline error_log_count 50 30 \newline delta 0.3 \newline 12:07AM z-29: bound shell_error \newline Applying varying approximation tail bound \newline Applying shell bound \newline delta 0.3 \newline lower_delta 0.5 \newline train_examples 100 \newline error_log_count 3 1 \newline error_log_count 25 5 \newline error_log_count 50 30 \newline approximation automatic \newline true_error = 0.03 0.144696196541 0.0266506755725 \newline \backslash end{verbatim} \layout Standard Now, suppose that we wanted to use the sampling shell bound and sampled \begin_inset Formula \( e^{5}+e^{10} \) \end_inset times observing \begin_inset Formula \( e^{5} \) \end_inset hypotheses with training error \begin_inset Formula \( 25 \) \end_inset and \begin_inset Formula \( e^{10} \) \end_inset with training error \begin_inset Formula \( 50 \) \end_inset . Then, we might apply 'bound' as follows: \layout Standard \latex latex \backslash begin{verbatim} \newline 12:23AM z-36: cat sampling_shell_error \newline train_examples 100 \newline train_errors 3 \newline error_log_count 3 1 \newline error_log_count 25 5 \newline error_log_count 50 10 \newline sample_space_log_size 30 \newline delta 0.3 \newline 12:23AM z-37: bound sampling_shell_error \newline Applying varying approximation tail bound \newline Applying shell bound \newline delta 0.3 \newline lower_delta 0.5 \newline train_examples 100 \newline sample_space_log_size 30 \newline error_log_count 3 1 \newline error_log_count 25 5 \newline error_log_count 50 10 \newline approximation automatic \newline true_error = 0.03 0.405203092843 0.0266506755725 \newline \backslash end{verbatim} \layout Section Combined Bound Calculation \layout Standard \begin_inset LatexCommand \label{sec-combined-bound-calc} \end_inset The program 'bound' has been instrumented to use technique (3) from above. In order to apply the combined training and test error bound, you must simply supply the necessary information for both the training and test sets. \layout Standard \latex latex \backslash begin{verbatim} \newline 12:25AM z-42: cat train_n_test_error \newline test_examples 10 \newline test_errors 5 \newline delta 0.3 \newline train_examples 20 \newline train_errors 0 \newline log_prior -1 \newline 1:16AM z-43: bound train_n_test_error \newline Applying varying approximation tail bound \newline delta 0.3 \newline lower_delta 0.5 \newline test_examples 10 \newline test_errors 5 \newline train_examples 20 \newline train_errors 0 \newline log_hypothesis_size 0 \newline log_prior -1 \newline approximation automatic \newline true_error = 0.5 0.104335444048 0.369755505584 \newline \backslash end{verbatim} \the_end