#This file was created by Fri May 17 07:38:10 2002 #LyX 1.0 (C) 1995-1999 Matthias Ettrich and the LyX Team \lyxformat 2.15 \textclass amsbook \begin_preamble \usepackage{hyperref} \end_preamble \language default \inputencoding default \fontscheme default \graphics default \paperfontsize default \spacing single \papersize Default \paperpackage a4 \use_geometry 0 \use_amsmath 0 \paperorientation portrait \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \quotes_times 2 \papercolumns 1 \papersides 2 \paperpagestyle default \layout Title Quantitatively Tight Sample Complexity Bounds \layout Author John Langford \layout Standard I present many new results on sample complexity bounds (bounds on the future error rate of arbitrary learning algorithms). Of theoretical interest are qualitative and quantitative improvements in sample complexity bounds as well as some techniques and criteria for judging the tightness of sample complexity bounds. \layout Standard On the practical side, I show quantitative results (with true error rate bounds sometimes less than \begin_inset Formula \( 0.01 \) \end_inset ) for decision trees and neural networks with these sample complexity bounds applied to real world problems. I also present a technique for using both sample complexity bounds and (more traditional) holdout techniques. \layout Standard Together, the theoretical and practical results of this thesis provide a well-founded practical method for evaluating learning algorithm performance based upon both training and testing set performance. \layout Standard Code for calculating these bounds is provided. \layout Standard \begin_inset LatexCommand \tableofcontents{} \end_inset \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 297 235 file graph.eps width 4 100 flags 9 \end_inset \layout Part Introductory Learning Theory \layout Standard The introduction is broken into 4 parts. The first part is informal introduction to the learning model used here. The goal of the first part is to make explicit the design choices made in selecting our model. This is particularly important in machine learning because no model seems perfect. \layout Itemize The second chapter formally introduces the model, the goals of the analysis, and provides context to related work. \layout Itemize The third chapter states the fundamental statistical results upon which all of our techniques rest. \layout Itemize The fourth chapter discusses basic learning theory results. \layout Chapter Informal Introduction \layout Standard What is a sample complexity bound? Informally, it is a bound on the number of examples required to learn a function. Therefore, in order to motivate the use of a sample complexity bound, we must first motivate the learning problem. \layout Section The learning problem \layout Standard What is learning? Learning is the process of discovering relationships between events. Knowledge of the relationships between events is of great importance in coping with the environment. Naturally, humans tend to be quite good at the process of learning---so good that we sometimes do not even realize when it is hard. \layout Standard The learning problem which we focus on is learning for computers. In particular, \begin_inset Quotes eld \end_inset How can a computer learn? \begin_inset Quotes erd \end_inset A few examples are illustrative: \layout Enumerate How can we make a computer with a microphone output a text version of what is being spoken? \layout Enumerate How can we make a computer with a camera recognize John? \layout Enumerate How can we make a computer controlling a robot arrive at some location? \layout Standard We will work on learning a function from some input space to some output space - the supervised learning model. \layout Section The problem with the learning problem \layout Standard The learning problem, as stated, is somewhat ill-posed. There are some very obvious ways for a computer to learn---for example by memorization. The difficulty arises when memorization is too expensive. Expense here typically has to do with acquiring enough experience so that future prediction problems have already been encountered. The real learning problem becomes, \begin_inset Quotes eld \end_inset Given incomplete information, how can a computer learn? \begin_inset Quotes erd \end_inset This formulation of the learning problem gives rise to a new problem - quantifying the amount of information required to learn. Sample complexity bounds address this second question: \begin_inset Quotes eld \end_inset When can a computer learn? \begin_inset Quotes erd \end_inset \layout Standard In some cases learning is essentially hopeless. There are two notions of \begin_inset Quotes eld \end_inset hopeless \begin_inset Quotes erd \end_inset : information theoretic and computational. Information theoretic difficulties arise when it is simply not possible to predict the output given the input. For example, predicting when a radioactive nucleus will decay is \emph on always \emph default difficult no matter what observations are made according to current physics \begin_inset LatexCommand \cite{QM} \end_inset . Even when simple relations between inputs and outputs exist, the computation required to discover the simple relation can be formidable. A fine example of this is provided by cryptography \begin_inset LatexCommand \cite{Oded} \end_inset where a common task is to work out functions for which it is not feasible to predict the input given the output. \layout Section A plethora of learning models \layout Standard There are several possible learning models which can be divided along several axes. Our first axis is the type of information given to the learning algorithm. There are several possibilities: \layout Enumerate Labeled Examples: Vectors of observations. \layout Enumerate Partial relations: partial relations between events such as might be provided by experts. This could include constraints or partial functions. \layout Enumerate Other forms of input \layout Standard We will assume that just examples (vectors of observations) are available as a lowest common denominator amongst learning problems. It is worth noting that this does not preclude the use of other forms of information which could be much more powerful than mere examples. \layout Standard Another important axis is the difficulty of the learning problem. Do we have an opponent trying to minimize learning? Is someone helping us learn? Or is the world oblivious? \layout Enumerate Teacher: The teacher model is a \begin_inset Quotes eld \end_inset best case \begin_inset Quotes erd \end_inset model. Here, we assume that someone is providing the best examples possible in order to learn a relationship. \layout Enumerate Oblivious: The oblivious model is an \begin_inset Quotes eld \end_inset in between \begin_inset Quotes erd \end_inset model where we assume that the world doesn't oppose or help us learn. Examples are picked in some neutral manner. \layout Enumerate Opponent: The opponent model is a \begin_inset Quotes eld \end_inset worst case \begin_inset Quotes erd \end_inset model. Here, we assume that world is choosing examples in way which minimize our chance of learning. \layout Standard Clearly, the strongest form of learning is learning in the opponent model, because if something is learnable in the opponent model, then it is learnable in the oblivious model. The same relationship also holds for oblivious and teacher models. We will work in an oblivious model where examples are chosen in a neutral manner. Why the oblivious model? Aside from the intractability of analysis in an opponent model, we expect that most learning problems actually are oblivious: we have neither an active teacher nor an active opponent. Thus an analysis in the oblivious model will be directly applicable to many learning problems. \layout Standard We have committed to an oblivious model with examples as our source of informati on. With these two questions decided all the remaining questions will essentially be decided in favor of simplicity. There are two more very important questions to decide. The first is: does our algorithm get to pick the examples or are the examples picked for us? \layout Enumerate Active learning: The learning algorithm chooses a partial example and the remainder is filled in by nature. \layout Enumerate Passive learning: The learning algorithm is simply given examples. \layout Standard Active learning (aka experimental science) is inherently more powerful than passive learning. As an example, consider the problem of predicting whether or not it will rain or snow on any given day. By observation, we can eventually discover the \begin_inset Quotes eld \end_inset right \begin_inset Quotes erd \end_inset threshold temperature, but this might take many days. If we instead can control the temperature and make observations, it should be possible to narrow in on the threshold temperature very quickly - with exponentially fewer experiments than days of observation. \layout Standard Despite the power of active learning, we will choose to work with an inactive learning model, because opportunities for passive learning are typically more common than opportunities for active learning since passive learning only requires observation while active learning requires experimentation. Analyzing the active learning setting in a generic manner also appears very difficult. \layout Standard Our plan is to focus on an oblivious model with examples chosen by the world. The remaining question is: Do we know which relation we want to learn? The two possibilities are: \layout Enumerate Supervised learning: We want to learn to model an output in terms of an input. \layout Enumerate Unsupervised learning: We want to learn to model an arbitrary subset of observations in terms of other observations. \layout Standard We will focus on supervised learning and our exact setting will be defined next. \layout Standard The question we want to answer is, \begin_inset Quotes eld \end_inset When is supervised learning in an oblivious model with examples chosen by the world feasible? \begin_inset Quotes erd \end_inset . \layout Section The oblivious passive supervised learning model \layout Standard Oblivious will be modeled by an unknown distribution \begin_inset Formula \( D \) \end_inset over examples. Here, an \begin_inset Quotes eld \end_inset example \begin_inset Quotes erd \end_inset is just a vector of observations. Since this is a supervised learning model, all of our examples will split into two parts, \begin_inset Formula \( (x,y) \) \end_inset where \begin_inset Formula \( x \) \end_inset is the \begin_inset Quotes eld \end_inset input \begin_inset Quotes erd \end_inset and \begin_inset Formula \( y \) \end_inset is the \begin_inset Quotes eld \end_inset output \begin_inset Quotes erd \end_inset (the thing we wish to predict). A quick example is predicting whether precipitation will be in the form of rain or snow ( \begin_inset Quotes eld \end_inset \begin_inset Formula \( y \) \end_inset \begin_inset Quotes erd \end_inset value) given the temperature ( \begin_inset Quotes eld \end_inset \begin_inset Formula \( x \) \end_inset \begin_inset Quotes erd \end_inset value). \layout Standard For simplicity, we will typically work with theorems for binary valued \begin_inset Formula \( y \) \end_inset . We can remove this choice by generalizing sample complexity bounds---but we do not do so for simplicity of presentation. \layout Standard The fundamental assumption we will make in all of our sample complexity bounds is that all examples are drawn independently from the unknown distributi on \begin_inset Formula \( D \) \end_inset . This assumption must be stated explicitly and always kept in mind when considering the relevance of sample complexity bounds. \layout Axiom All examples are drawn independently from an unknown distribution \begin_inset Formula \( D \) \end_inset . \layout Standard With the exception of this assumption, all of the other parameters in our bounds will be verifiable at the time the bound is applied. \layout Standard Note that we use a distribution over \emph on labeled \emph default examples and not a combination of a distribution over the input space along with a function from the input space to the output space as in many other formulations. This choice is made because it is both more general and mathematically simpler. \layout Standard The number of samples, \begin_inset Formula \( m \) \end_inset , required for learning is the fundamental quantity we will be concerned with. In particular, we will not be concerned with the time complexity or the space complexity of learning algorithms. This choice is made for the purposes of simplicity and implies that the relationship between sample complexity bounds and learning algorithms will be similar to the difference between information theory and coding information for transmission across a noisy channel. \layout Standard Any learning algorithm must output some hypothesis, \begin_inset Formula \( h \) \end_inset , for predicting the output given the input. This hypothesis is essentially a program which, given the input, predicts the output. The hypothesis may or may not be randomized---it might choose an output deterministically or according to some randomization. \layout Standard The next item to quantify is learning. When has learning occurred? We will say that learning has occurred when the \emph on true error \emph default is significantly less than a uniform random prediction. The true error \begin_inset Formula \( e_{D}(h) \) \end_inset is defined in the following way: \begin_inset Formula \[ e_{D}(h)=\Pr _{D}(h(x)\neq y)\] \end_inset \layout Standard Unfortunately, the true error is not an observable quantity in our model because the distribution, \begin_inset Formula \( D \) \end_inset , is unknown. However, there is a related quantity which is observable. Given a sample set \begin_inset Formula \( S \) \end_inset of \begin_inset Formula \( (x,y) \) \end_inset pairs \begin_inset Formula \( \{(x_{1},y_{1}),...,(x_{m},y_{m})\} \) \end_inset , the \emph on empirical error \emph default , \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset is defined similarly as: \begin_inset Formula \[ \hat{e}_{S}(h)=\Pr _{S}(h(x)\neq y)=\frac{1}{m}\sum _{i=1}^{m}I(h(x_{i})\neq y_{i})\] \end_inset where \begin_inset Formula \( I() \) \end_inset is a function which maps \begin_inset Quotes eld \end_inset true \begin_inset Quotes erd \end_inset to \begin_inset Formula \( 1 \) \end_inset and \begin_inset Quotes eld \end_inset false \begin_inset Quotes erd \end_inset to \begin_inset Formula \( 0 \) \end_inset . Here \begin_inset Formula \( \Pr _{S}(...) \) \end_inset is a probability taken with respect to the uniform distribution over the set of examples, \begin_inset Formula \( S \) \end_inset . \layout Section Questions we can answer \layout Standard Our real goal in learning theory is to answer the question \begin_inset Quotes eld \end_inset When can we learn? \begin_inset Quotes erd \end_inset Unfortunately, there is no good answer to this question given only the assumption of independence. In particular, it may be impossible to learn. The simplest example of such a learning problem is the case of a distribution \begin_inset Formula \( D \) \end_inset which always flips a coin in deciding the value of the output \begin_inset Formula \( y \) \end_inset . The bias of the coin is the same irrespective of the input \begin_inset Formula \( x \) \end_inset . Since the value of the input is explicitly independent of the output we surely can not hope to learn a useful relation between the input and output. \layout Standard Considerable work has been done elsewhere to answer \begin_inset Quotes eld \end_inset When can we learn? \begin_inset Quotes erd \end_inset Typically this is done in models with stronger assumptions---for example, under the additional assumption that the output is related to the input by an \begin_inset Quotes eld \end_inset OR \begin_inset Quotes erd \end_inset of a subset of the input bits. \layout Standard We will instead focus on a different question: \begin_inset Quotes eld \end_inset Have we learned? \begin_inset Quotes erd \end_inset This question \emph on is \emph default answerable in a probabilistic manner. In particular we can make a statement such as \begin_inset Quotes eld \end_inset With high probability over samples drawn from \begin_inset Formula \( D \) \end_inset we have learned if the empirical error is less than some value. \begin_inset Quotes erd \end_inset In practice, we will want to know how \emph on much \emph default we have learned which we can do by providing a high confidence bound on the true error rate of the learned hypothesis. \layout Chapter Formal Model and Context \layout Section Formal Model \layout Standard Let \begin_inset Formula \( X \) \end_inset be the space of the input to a predictor and \begin_inset Formula \( Y \) \end_inset be the space of the output. A labeled example \begin_inset Formula \( (x,y) \) \end_inset consists of an input, \begin_inset Formula \( x \) \end_inset and the desired output, \begin_inset Formula \( y \) \end_inset . Our formal model starts with the assumption that all labeled examples are drawn independently from a distribution \begin_inset Formula \( D \) \end_inset over the space \begin_inset Formula \( X\times Y \) \end_inset . This is strictly more general than the 'target concept' model which assumes that there exists some function \begin_inset Formula \( f:X\rightarrow Y \) \end_inset used to generate the label \begin_inset LatexCommand \cite{Valiant} \end_inset . In particular we can model probabilistic learning problems which do not have a particular \begin_inset Formula \( Y \) \end_inset value for each \begin_inset Formula \( X \) \end_inset value. This generalization is essentially \begin_inset Quotes eld \end_inset free \begin_inset Quotes erd \end_inset in the sense that it does not add to the complexity of presenting the results. \layout Standard The set of \begin_inset Formula \( m \) \end_inset independently drawn samples presented to a learning algorithm will be denoted as \begin_inset Formula \( S \) \end_inset . The learning algorithm will output a hypothesis \begin_inset Formula \( h:X\rightarrow Y \) \end_inset which has some unobservable true error rate \begin_inset Formula \( e_{D}(h) \) \end_inset and an observable empirical error rate, \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset . \layout Definition (True error) The true error \begin_inset Formula \( e_{D}(h) \) \end_inset of a hypothesis \begin_inset Formula \( h \) \end_inset is defined in the following way: \begin_inset Formula \[ e_{D}(h)=\Pr _{D}(h(x)\neq y)\] \end_inset \layout Standard Unfortunately, the true error is not an observable quantity in our model because the distribution, \begin_inset Formula \( D \) \end_inset , is unknown. However, there is a related quantity which is observable. \layout Definition (Empirical Error) Given a sample set \begin_inset Formula \( S \) \end_inset , the \emph on empirical error \emph default , \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset is defined as: \begin_inset Formula \[ \hat{e}_{S}(h)=\Pr _{S}(h(x)\neq y)=\frac{1}{m}\sum _{i=1}^{m}I(h(x_{i})\neq y_{i})\] \end_inset where \begin_inset Formula \( I() \) \end_inset is a function which maps \begin_inset Quotes eld \end_inset true \begin_inset Quotes erd \end_inset to \begin_inset Formula \( 1 \) \end_inset and \begin_inset Quotes eld \end_inset false \begin_inset Quotes erd \end_inset to \begin_inset Formula \( 0 \) \end_inset . Here \begin_inset Formula \( \Pr _{S}(...) \) \end_inset is a probability taken with respect to the uniform distribution over the set of examples, \begin_inset Formula \( S \) \end_inset . \layout Section Relationship to Prior Work \layout Subsection Distribution free learning \layout Standard The question answered here differs significantly from much prior learning theory, including the results of Vapnik \begin_inset LatexCommand \cite{Vapnik} \end_inset , Valiant \begin_inset LatexCommand \cite{Valiant} \end_inset , Devroye \begin_inset LatexCommand \cite{Devroye} \end_inset , and many others. See \begin_inset LatexCommand \cite{Haussler} \end_inset for a good summary. The principle difference is the question we address: \begin_inset Quotes eld \end_inset Have we learned? \begin_inset Quotes erd \end_inset \layout Standard Much of the prior work in learning theory addresses the question: \begin_inset Quotes eld \end_inset How many examples are needed in order to guarantee that I will choose (nearly) the best hypothesis from some fixed hypothesis set? \begin_inset Quotes erd \end_inset \layout Standard In particular, suppose that we have a hypothesis set \begin_inset Formula \( H \) \end_inset . If we guarantee that: \layout Standard \begin_inset Formula \[ \Pr _{S\sim D^{m}}(\exists h\in H:\, \, |e_{D}(h)-\hat{e}_{S}(h)|>\epsilon )<\delta \] \end_inset then if our learning algorithm choses the hypothesis with minimum empirical error (known as \begin_inset Quotes eld \end_inset empirical risk minimization \begin_inset Quotes erd \end_inset in the language of \begin_inset LatexCommand \cite{V2} \end_inset ), we will be guaranteed that the chosen hypothesis satisfies: \begin_inset Formula \[ e_{D}(h)-e^{*}_{D}\leq 2\epsilon \] \end_inset where \begin_inset Formula \( e^{*}_{D}=\inf _{h\in H}e_{D}(h) \) \end_inset . \layout Standard There are several difficulties inherent in this approach which we will avoid. \layout Enumerate Results in this model apply only for the empirical risk minimization (ERM) algorithm. The ERM algorithm is known to be NP-complete \begin_inset LatexCommand \cite{AR} \end_inset for some hypothesis spaces and, in general, is essentially dependent upon the axiom of choice. These results will approximately apply to approximate ERM algorithms, but it is unclear how \begin_inset Quotes eld \end_inset approximate \begin_inset Quotes erd \end_inset typical learning algorithms are. By answering \begin_inset Quotes eld \end_inset Have we learned? \begin_inset Quotes erd \end_inset this complexity is avoided. \layout Enumerate There is no natural notion of preference (or \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset ) amongst the hypotheses, \begin_inset Formula \( h \) \end_inset , in the hypothesis space, \begin_inset Formula \( H \) \end_inset . This is very important for practical application as is shown in Figure \begin_inset LatexCommand \ref{fig-simple-holdout} \end_inset . \layout Enumerate Answers to this question are generally insensitive to the final result, \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset . This is again important in practice (shown here \begin_inset LatexCommand \ref{fig-bounds} \end_inset ) because the variance of the distribution of the empirical error, \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset , changes significantly with different true error rates, \begin_inset Formula \( e_{D}(h) \) \end_inset . \layout Standard These drawbacks can be alleviated (but not removed) to some extent. For example, many people apply results in this model to arbitrary learning algorithms by simply noticing that the deviation of the empirical and true error rates is small. This, in turn, implies that whatever hypothesis your algorithm learns (empirica l minimum or not), it's true error is within \begin_inset Formula \( \epsilon \) \end_inset of the empirical error. This is one approach to addressing the question \begin_inset Quotes eld \end_inset have we learned? \begin_inset Quotes erd \end_inset - others will be presented here. \layout Standard The second drawback can be alleviated using \begin_inset Quotes eld \end_inset structural risk minimization \begin_inset Quotes erd \end_inset (as in \begin_inset LatexCommand \cite{V2} \end_inset ). Structural risk minimization removes most of drawback (2), although it is awkward for specifying very fine-grained preferences. We will use an arbitrary measure \begin_inset Formula \( P \) \end_inset over the hypothesis space \begin_inset Formula \( h \) \end_inset . This prior \begin_inset Formula \( P \) \end_inset need not (necessarily) be a Bayesian prior - all that is formally required is that this \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset be specified without using information from the examples. The notion of measure \begin_inset Formula \( P \) \end_inset is more general than structural risk minimization because for \emph on every \emph default \begin_inset Quotes eld \end_inset structure \begin_inset Quotes erd \end_inset \begin_inset Formula \( T \) \end_inset on which structural risk minimization is done, we can produce a \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset \begin_inset Formula \( P_{T} \) \end_inset dependent upon the structure \begin_inset Formula \( T \) \end_inset and derive the same (or tighter) results. \layout Standard The third drawback can be alleviated using \begin_inset Quotes eld \end_inset relative risk \begin_inset Quotes erd \end_inset as in \begin_inset LatexCommand \cite{V2} \end_inset , but this still leaves some slack in the bounds. \layout Standard The work in this thesis can be thought of as directly addressing the altered question which alleviates problem 1. Since our goal is addressing this altered question rather than deriving the answer from other results, we will be able to state and prove tighter results. Furthermore, because we address the question people encounter in practice, the bounds presented here will be more directly applicable. In particular, they will apply to arbitrary learning algorithms (although not necessarily \emph on tightly \emph default to arbitrary learning algorithms) rather than just the empirical risk minimizati on algorithm. \layout Subsection Bayesian Analysis \layout Standard The basic result of Bayesian analysis is that Bayes rule: \begin_inset Formula \[ \Pr (f|S)=\frac{\Pr (S|f)\Pr (f)}{\Pr (S)}\] \end_inset is the \emph on optimal \emph default learning algorithm when the learning problem is drawn from the distribution \begin_inset Formula \( \Pr (f) \) \end_inset . Note that the \begin_inset Quotes eld \end_inset hypothesis \begin_inset Quotes erd \end_inset learned by the Bayesian learning algorithm is the weighted average predictor, \begin_inset Formula \[ h(x)=\textrm{sign}(\int \Pr (f|S)f(x)df)\] \end_inset This rather strong statement is difficult to utilize in practice because specifying and using arbitrary distributions \begin_inset Formula \( \Pr (f) \) \end_inset is unwieldy. In practice, many people use approximations and there is considerable question about whether or not the specified \begin_inset Formula \( \Pr (f) \) \end_inset is \begin_inset Quotes eld \end_inset right enough \begin_inset Quotes erd \end_inset after approximations. A chapter is later devoted to analyzing hypotheses of this weighted-average form and a theorem \begin_inset LatexCommand \ref{th-averaging} \end_inset about their accuracy is proved. \layout Standard Some work has been done to analyze the robustness of Bayesian algorithms under approximation errors. There are two common traits of Bayesian-related analysis: \layout Enumerate All statements are parameterized by a prior, \begin_inset Formula \( P \) \end_inset . \layout Enumerate The analysis is typically an \begin_inset Quotes eld \end_inset average case \begin_inset Quotes erd \end_inset (w.r.t. the prior \begin_inset Formula \( P \) \end_inset and a fixed Bayesian or approximate Bayesian algorithm, \begin_inset Formula \( A \) \end_inset ). \layout Standard An example of this sort of analysis can be found in \begin_inset LatexCommand \cite{HKS} \end_inset . The work in this thesis adopts parameterization by a measure \begin_inset Formula \( P \) \end_inset , but is \emph on not \emph default an average case analysis. Our analysis is \begin_inset Quotes eld \end_inset worst-case \begin_inset Quotes erd \end_inset in the sense that it applies to \emph on all \emph default learning problems whether or not the measure \begin_inset Formula \( P \) \end_inset is a \begin_inset Quotes eld \end_inset correct \begin_inset Quotes erd \end_inset prior or not. This approach is strongly similar to the work of McAllester \begin_inset LatexCommand \cite{PB} \end_inset , and a later chapter is devoted to a refinement of this result. Despite this, it is interesting to note that our bounds are minimized (in some sense) when the measure \begin_inset Formula \( P \) \end_inset is in fact a \begin_inset Quotes eld \end_inset correct \begin_inset Quotes erd \end_inset Bayesian prior. \layout Standard There have been other attempts to connect Bayesian average case settings with worst case settings. One interesting example is \begin_inset LatexCommand \cite{Ypredicting} \end_inset which discusses the connection between Bayesian setting and the mistake bound model. This is especially interesting because the mistake bound model is, in some sense, more \begin_inset Quotes eld \end_inset worst-case \begin_inset Quotes erd \end_inset than the model we consider here as no assumption of example independence is made. Further work \begin_inset LatexCommand \cite{Grun} \end_inset has occurred in the \begin_inset Quotes eld \end_inset Minimum Description Length \begin_inset Quotes erd \end_inset (MDL) community. \layout Section Overview of the document \layout Standard This document is primarily about the theory of sample complexity for answering the question \begin_inset Quotes eld \end_inset Have we learned? \begin_inset Quotes erd \end_inset . However, we do not neglect the experimental side. In particular, following the theory we will present results for application of sample complexity bounds to machine learning problems. These results are the 'best known results' in terms of bound tightness and should be considered as a guide and challenge to others working on sample complexity bounds. \layout Standard All of the sample complexity bounds presented here will fall within the paradigm of classical (non-Bayesian) statistics. Despite this, Bayesians may be interested in the results. In particular, it is worth noting that we will consider the use of a 'prior' and a 'posterior' (in a \emph on classical \emph default manner) within these bounds. \layout Standard In order to make this thesis more coherent, previous work (of which there is quite a bit) will be integrated into the presentation rather than separated into a section of its own. Credit will be given at the time the work is introduced. \layout Standard The document is organized into 3 parts: \layout Enumerate Introductory material. \layout Enumerate New results on Sample Complexity. \layout Enumerate Experimental results of applying Sample Complexity. \layout Standard What follows is a brief chapter-by-chapter summary of the theoretical results in this thesis. Much of the work can be summarized as approaches which use extra information (in the algorithm, in error rates of hypotheses which are \emph on not \emph default chosen, in test sets, etc...) in order to construct tighter bounds on the future error rate of a hypothesis. \layout Standard The principal \emph on practical \emph default result of this thesis is the construction of a program 'bound' which automatica lly uses any of several theorems in calculating a true error bound on the future error rate of a particular hypothesis. The use of this program will be demonstrated as the bounds are presented. \layout Subsection Microchoice Bounds \layout Standard The Microchoice technique allows for online construction of a \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset which can be used in the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset ). Empirical testing (presented in chapter 11) shows that this approach is practical and yields useful results on decision trees for real-world problems drawn from the UCI database of problems. The Adaptive Microchoice bounds further extend this approach and can result in functional improvements over the Occam's Razor bound. \layout Subsection Pac-Bayes Bounds \layout Standard Pac-Bayes bounds are a new approach (first presented by David McAllester \begin_inset LatexCommand \cite{PB} \end_inset ) for for dealing with continuously parameterized classifiers such as (stochasti c) neural networks. This chapter refines and improves the PAC-Bayes bound, giving it an information theoretic interpretation. Empirical results (presented in chapter 12) show that refined PAC-Bayes bound works to produce \emph on nonvacuous \emph default bounds on realistic learning problems. Nonvacuous bounds for continuous-valued classifiers are currently rare. \layout Subsection Averaging Bounds \layout Standard Averaging bounds deal with classifiers formed by picking according to the weighted majority on some other set of classifiers. Averaging is a very common technique in practice (see \begin_inset LatexCommand \cite{SFBL} \end_inset for examples), so results specialized for averaging classifiers are useful. This chapters states and proves a bound on averaging classifiers which shows that \begin_inset Quotes eld \end_inset hypothesis space complexity \begin_inset Quotes erd \end_inset \emph on decreases \emph default as averaging becomes more uniform. Prior theoretical work \begin_inset LatexCommand \cite{SFBL} \end_inset was of the form \begin_inset Quotes eld \end_inset averaging does not increase the hypothesis space complexity much \begin_inset Quotes erd \end_inset . \layout Subsection Shell Bounds \layout Standard Shell bounds are a new approach which trades extra information for tighter bounds. Empirical results in (presented in chapter 11) show this can be a useful approach. In order to ameliorate the increased information requirements, a sampled version of the bound is stated which allows for smooth interpolation between simpler bounds and the full shell bound. In addition, the shell bound has been extended to continuous spaces with an approach similar to PAC-Bayes bounds. \layout Subsection Bracketing Covering Number Bounds \layout Standard The results of this chapter are entirely theoretical. They point out an approach which may lead to useful technique for bounds on continuous valued hypothesis spaces. Using a stronger notion of cover (the bracketing cover), simplified bounds on the future error rate of continuous classifiers can be constructed. This approach can be significantly tighter than the standard covering number approach. More work in calculation of bracketing covers is needed before these results can be applied. \layout Subsection Progressive Validation \layout Standard Progressive Validation is a variant of the holdout bound ( \begin_inset LatexCommand \ref{th-hb} \end_inset ) which is roughly twice as efficient about how it uses examples. Progressive Validation is not used in the experimental results chapter(s) because more work is required in order to prove the bound without a Hoeffding-l ike approximation. \layout Subsection Combining training and test sets \layout Standard The idea behind this chapter is that it should be possible to combine \emph on any \emph default test set bound with \emph on any \emph default training set based bound in order to derive a bound with more robust behavior. A general technique is stated and proved to work. Empirical results (in chapter 11) show that this approach can work well in practice. \layout Chapter Basic Observations \layout Standard The principle observable quantity is the empirical error rate ( \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset ) of a hypothesis. What is the distribution of the empirical error rate for a fixed hypothesis? For each example, we know that the probability that the hypothesis will err is given by true error rate, \begin_inset Formula \( e_{D}(h) \) \end_inset . This can be modeled by a biased coin flip: heads if you are wrong and tails if you are right. \layout Standard Let us call the bias of the coin \begin_inset Formula \( p=e_{D}(h) \) \end_inset . What then is the probability of observing \begin_inset Formula \( k \) \end_inset heads out of \begin_inset Formula \( m \) \end_inset coin flips? This is a very familiar distribution in statistics called the Binomial distribution. Let \begin_inset Formula \( \hat{p} \) \end_inset be the observed rate of heads. \layout Definition (Binomial Distribution) The Binomial distribution is given by: \begin_inset Formula \[ \Pr _{\{0,1\}^{m}}\left( \hat{p}=\frac{k}{m}|p\right) =\binom{m}{k}p^{k}(1-p)^{m-k}\] \end_inset Here we use 'choose' notation defined by \begin_inset Formula \( \binom{m}{k}=\frac{m!}{(m-k)!k!} \) \end_inset . \layout Section The Basic Building Block \layout Standard Our real interest will be captured by Binomial tails because we wish to bound the probability of observing a misleadingly small event. The probability of a Binomial tail is just the cumulative distribution function: \layout Definition (Binomial Tail) \begin_inset Formula \[ \textrm{Bin}(m,k,p)\equiv \Pr _{\{0,1\}^{m}}\left( \hat{p}\leq \frac{k}{m}|p\right) =\sum _{j=1}^{k}\binom{m}{j}p^{j}(1-p)^{m-j}\] \end_inset = the probability that \begin_inset Formula \( m \) \end_inset coins with bias \begin_inset Formula \( p \) \end_inset produce \begin_inset Formula \( k \) \end_inset or fewer heads. \layout Standard For the learning problem, we will always choose \begin_inset Formula \( p=e_{D}(h) \) \end_inset and \begin_inset Formula \( \hat{p}=\hat{e}_{S}(h) \) \end_inset . With these definitions, we can interpret the Binomial tail as the probability of an empirical error less than or equal to \begin_inset Formula \( \frac{k}{m} \) \end_inset . \layout Section Approximation techniques \layout Standard \begin_inset LatexCommand \label{sec-approximation} \end_inset \layout Standard Exact calculation of \begin_inset Formula \( \textrm{Bin}(m,k,p) \) \end_inset (covered in the next subsection) can require computation at least proportional to \begin_inset Formula \( m \) \end_inset , which is often too expensive. For the bounds in this thesis, we will only need to calculate an upper bound on the quantity \begin_inset Formula \( \textrm{Bin}(m,k,p) \) \end_inset . There are several inequalities which are often used. The first of these is the Hoeffding inequality \begin_inset LatexCommand \cite{Hoeffding} \end_inset . Assume that \begin_inset Formula \( \frac{k}{m}

0 \) \end_inset , \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, e(h)\geq \hat{e}(h)+\sqrt{\frac{\ln |H|+\ln \frac{1}{\delta }}{2m}}\right) \leq \delta \] \end_inset \layout Proof Loosen corollary ( \begin_inset LatexCommand \ref{th-REDHSCP} \end_inset ) with the bound \begin_inset Formula \( \textrm{KL}(q||p)\geq 2(q-p)^{2} \) \end_inset . \layout Standard The agnostic form of this bound has the advantage that it explicitly shows that we are forcing the convergence (in high probability) of the empirical error rate to the true error rate. Graphically, we are forcing every empirical error to be near to its true error. This is a picture which represents the connection \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 89 12 file convergence.eps width 4 30 flags 9 \end_inset \layout Standard Later bounds will have more complicated convergence conditions with more complicated graphs. These more complicated bounds are necessary in order to avoid the limitations of the holdout bound. See figure \begin_inset LatexCommand \ref{fig-simple-holdout} \end_inset for a comparison with the holdout set. \layout Section Lower Bounds \layout Standard It is important to study lower bounds in order to discover how much room for improvement exists in our upper bounds. \layout Standard There are two forms of lower bounds: a lower bounds on the true error rate and lower \emph on upper \emph default bounds which lower bound the value our upper bound could return (given available information). For lower bounds essentially the upper bound theorem applies with the bound reversed. This form of lower bound allow us to construct a two-sided confidence interval on the location of the true error rate. Typically, such lower bounds can be formed by simply changing the sign in upper bound arguments. \layout Theorem \begin_inset LatexCommand \label{th-dhlb} \end_inset (Discrete Hypothesis Lower Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\leq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \min _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset \layout Proof For every individual hypothesis, we know that: \begin_inset Formula \[ \Pr _{D^{m}}(e(h)\leq \bar{e}(m,\hat{e}(h),\delta ))\leq \delta \] \end_inset Applying the union bound for every hypothesis gives us the result. \layout Section Lower Upper Bounds \layout Standard \begin_inset LatexCommand \label{sec-lower_upper} \end_inset \layout Standard The second form of lower bound is a lower bound on the upper bound (similar to the results of \begin_inset LatexCommand \cite{AHKV} \end_inset ). If we can lower bound the upper bound (as a function of its observables), then we can be confident that the upper bound is no looser than the gap between the lower upper bound and the upper bound. \layout Standard How tight is the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset )? The answer is \emph on sometimes \emph default tight. In particular, we can exhibit a set of learning problem where the discrete hypotheses bound can not be made significantly lower as a function of the observables, \begin_inset Formula \( m \) \end_inset , \begin_inset Formula \( \delta \) \end_inset , \begin_inset Formula \( |H| \) \end_inset , and \begin_inset Formula \( \hat{e}(h) \) \end_inset . Fix the value of these quantities and then we will construct a learning problem for which a lower upper bound can not be stated. \layout Standard Our learning problem will be defined over the input space \begin_inset Formula \( X=\{0,1\}^{|H|} \) \end_inset . The hypotheses will be \begin_inset Formula \( h_{i}(x)=x_{i} \) \end_inset where \begin_inset Formula \( x_{i} \) \end_inset is the \begin_inset Formula \( i \) \end_inset th component of the vector \begin_inset Formula \( x \) \end_inset . This construction allows us to vary the true error rate of each hypothesis independent of the other hypotheses. In fact, we can pick any true error rate for any hypothesis by simply adjusting the probability that \begin_inset Formula \( x_{i}=y \) \end_inset . Our learning problem can therefore generate problems according to the following algorithm: \layout Algorithm \begin_inset LatexCommand \label{alg-lp} \end_inset Draw_Sample(float \begin_inset Formula \( e \) \end_inset ) \layout Enumerate Pick \begin_inset Formula \( y\in \{0,1\} \) \end_inset uniformly \layout Enumerate For \begin_inset Formula \( i \) \end_inset , pick \begin_inset Formula \( x_{i}\neq y \) \end_inset with probability \begin_inset Formula \( e \) \end_inset . \layout Standard By construction, the true error rate of each hypothesis will be \begin_inset Formula \( e(h_{i}) \) \end_inset . Now, we can prove the following theorem: \layout Theorem \begin_inset LatexCommand \label{th-dhlub} \end_inset (Discrete Hypothesis lower upper bound) For all true error rates \begin_inset Formula \( e(h)>\frac{\ln |H|+\ln \frac{1}{\delta }}{m} \) \end_inset there exists a learning problem and algorithm such that: \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-e^{-\delta }\] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset \layout Standard Intuitively, this theorem implies that we can not improve significantly on the results of theorem \begin_inset LatexCommand \ref{th-DHSCP} \end_inset without using extra information about our learning problem. Some of our later results do exactly this - they use extra information. \layout Proof Using the family of learning problems implicitly defined by algorithm \begin_inset LatexCommand \ref{alg-lp} \end_inset , we know that \begin_inset Formula \[ \forall h,\forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq \frac{\delta }{|H|}\] \end_inset (negation) \begin_inset Formula \[ \Rightarrow \forall h,\forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( e(h)<\bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) <1-\frac{\delta }{|H|}\] \end_inset (independence) \begin_inset Formula \[ \Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \forall h\, \, e(h)<\bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) <\left( 1-\frac{\delta }{|H|}\right) ^{|H|}\] \end_inset (negation) \begin_inset Formula \[ \Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \exists h\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-\left( 1-\frac{\delta }{|H|}\right) ^{|H|}\] \end_inset (approximation) \begin_inset Formula \[ \Rightarrow \forall \delta \in [0,1]:\, \, \Pr _{D^{m}}\left( \exists h\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\frac{\delta }{|H|}\right) \right) \geq 1-e^{-\delta }\] \end_inset \layout Standard For small \begin_inset Formula \( \delta \) \end_inset , we get that \begin_inset Formula \( 1-e^{-\delta }\simeq \delta \) \end_inset which implies that, no significantly better true error bound can be stated for all learning algorithms. In particular, the empirical risk minimization algorithm (which chooses the hypothesis with minimal empirical error) will have a significant probabilit y of a large deviation between the empirical and true error. \layout Section Structural Risk Minimization \layout Standard Structural Risk Minimization \begin_inset LatexCommand \cite{Vapnik} \end_inset (SRM) is a technique used in the learning theory community to avoid the difficulties associated with convergence on hypothesis sets that are too \begin_inset Quotes eld \end_inset large \begin_inset Quotes erd \end_inset . SRM works with a sequence of nested hypothesis sets, \begin_inset Formula \( H_{1}\subset H_{2}\subset ....\subset H_{l} \) \end_inset . For each hypothesis set, a discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ) on the difference between empirical and true error exists. For \begin_inset Quotes eld \end_inset small \begin_inset Quotes erd \end_inset hypothesis sets, this bound may be tight while for large hypothesis sets it may be inherently loose. However, we also expect that the best hypothesis in the hypothesis set improves as the hypothesis set becomes larger. This naturally induces a trade-off: there will be some hypothesis set \begin_inset Formula \( H_{i} \) \end_inset for which the true error bound is minimized. \layout Standard We can't simply apply the discrete hypothesis bound to the meta-algorithm which picks the algorithm (and associated hypothesis space) with the smallest true error bound since this meta-algorithm could, potentially, output any hypothesis in \begin_inset Formula \( H_{l} \) \end_inset . The simplest way to retrofit the bound to include all hypothesis sets is with a simple theorem which essentially states that we can guarantee \emph on nearly \emph default the same bound as would apply on the smallest hypothesis space \begin_inset Formula \( H_{i} \) \end_inset containing the output hypothesis, \begin_inset Formula \( h \) \end_inset . \layout Theorem \begin_inset LatexCommand \label{th-SRM} \end_inset (Structural Risk Minimization) Let \begin_inset Formula \( p(i) \) \end_inset be some measure across the \begin_inset Formula \( l \) \end_inset hypothesis sets with \begin_inset Formula \( \sum _{i=1}^{l}p(i)=1 \) \end_inset . Then: \begin_inset Formula \[ \forall p(i):\, \, \, \Pr _{D^{m}}\left( \exists h\in H_{i}\in \{H_{1},...,H_{l}\}:\, \, e(h)\leq \bar{e}\left( m,\hat{e}(h),\frac{p(i)\delta }{|H_{i}|}\right) \right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \min _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset . \layout Proof Apply the union bound to the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ). \layout Standard The SRM bound is slightly inefficient in the sense that the bound for all hypotheses in \begin_inset Formula \( H_{2} \) \end_inset includes a bound for every hypothesis in \begin_inset Formula \( H_{1} \) \end_inset . This effect is typically small because the size of the hypothesis sets usually grows exponentially, implying that the extra confidence given to a hypothesis \begin_inset Formula \( h \) \end_inset in \begin_inset Formula \( H_{1} \) \end_inset by the bounds used on hypothesis set \begin_inset Formula \( H_{2},H_{3},... \) \end_inset is small relative to the confidence given by the bound for \begin_inset Formula \( H_{1} \) \end_inset . One can remove this slack in Structural Risk Minimization bound by \begin_inset Quotes eld \end_inset cutting out \begin_inset Quotes erd \end_inset the nested portion of each hypothesis set in the formulation of \begin_inset Formula \( H_{1},...,H_{l} \) \end_inset . We will call this Disjoint Structural Risk Minimization (also mentioned in \begin_inset LatexCommand \cite{MC_Journal} \end_inset ). \layout Section Incorporating a \begin_inset Quotes eld \end_inset Prior \begin_inset Quotes erd \end_inset \layout Standard In constructing the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ), it is notable that an arbitrary choice was made. We decided to give the same error allowance to every hypothesis. This is an arbitrary choice which, in practice, we will wish to make differentl y. The next theorem is essentially a restatement of the discrete hypothesis bound with this arbitrary choice made explicit. The same bound using the Hoeffding approximation has appeared elsewhere \begin_inset LatexCommand \cite{BEHW} \end_inset \begin_inset LatexCommand \cite{PB} \end_inset . \layout Theorem \begin_inset LatexCommand \label{th-ORB} \end_inset (Occam's Razor Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h) \) \end_inset over the hypothesis space, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \forall p(h)\, \, \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)\geq \bar{e}\left( m,\hat{e}(h),\delta p(h)\right) \right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset \layout Standard It is very important to notice that the \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h) \) \end_inset must be selected \emph on before \emph default seeing the examples. \layout Proof The proof again starts with the basic observation that: \begin_inset Formula \[ \Pr _{D^{m}}\left( e(h)\geq \bar{e}(m,\hat{e}(h),\delta )\right) \leq \delta \] \end_inset then, we apply the union bound in a \emph on nonuniform \emph default manner. In particular, we allocate confidence \begin_inset Formula \( \delta p(h) \) \end_inset to hypothesis \begin_inset Formula \( h \) \end_inset . Since \begin_inset Formula \( p \) \end_inset is normalized, we know that \begin_inset Formula \[ \sum _{h}\delta p(h)=\delta \] \end_inset which implies that the union bound completes the proof. \layout Standard Once again, we can relax the Occam's Razor bound (theorem \begin_inset LatexCommand \ref{th-ORB} \end_inset ) with the relative entropy Chernoff bound ( \begin_inset LatexCommand \ref{eq-recb} \end_inset ) to get a somewhat more tractable expression. \layout Corollary \begin_inset LatexCommand \label{th-reorb} \end_inset (Relative Entropy Occam's Razor Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Quotes eld \end_inset priors \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h) \) \end_inset over the hypothesis space, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))\geq \frac{\ln \frac{1}{p(h)}+\ln \frac{1}{\delta }}{m}\right) \leq \delta \] \end_inset \layout Proof approximate the Binomial tail with ( \begin_inset LatexCommand \ref{eq-recb} \end_inset ) and solve for the minimum. \layout Standard The Occam's razor bound is often nonvacuous for discrete learning algorithms such as decision lists and decision trees. The next chapter will discuss a particular motivated choice of the measure \begin_inset Formula \( p(h) \) \end_inset which can lead to much tighter bounds in practice. \layout Standard The application of the Occam's Razor bound is somewhat more complicated then the application of the test set bound. Pictorially, the protocol for bound application is given in figure \begin_inset LatexCommand \ref{fig-training-set} \end_inset . \begin_float fig \layout Standard \align center \begin_inset Figure size 267 159 file thesis-presentation/training_set.eps width 4 90 flags 9 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-training-set} \end_inset In order to apply this bound it is necessary that the choice of \begin_inset Quotes eld \end_inset Prior \begin_inset Quotes erd \end_inset be made before seeing any training examples. Then, the bound is calculated based upon the chosen hypothesis. Note that it \emph on is \emph default \begin_inset Quotes eld \end_inset legal \begin_inset Quotes erd \end_inset to chose the hypothesis based upon the prior \begin_inset Formula \( p(h) \) \end_inset as well as the empirical error \begin_inset Formula \( \hat{e}_{S}(h) \) \end_inset . \end_float \layout Standard Examples of calculation of these bounds is detailed in appendix section \begin_inset LatexCommand \ref{sec-train-bound-calc} \end_inset . \layout Standard The \begin_inset Quotes eld \end_inset Occam's Razor bound \begin_inset Quotes erd \end_inset is strongly related to compression. In particular, for any self-terminating description language, \begin_inset Formula \( d(h) \) \end_inset , we can associate a \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset \begin_inset Formula \( p(h)=2^{-d(h)} \) \end_inset with the property that \begin_inset Formula \( \sum _{h}p(h)\leq 1 \) \end_inset . Consequently, short description length hypotheses will tend to have a tighter convergence and the penalty term, \begin_inset Formula \( \ln \frac{1}{p(h)} \) \end_inset is the number of \begin_inset Quotes eld \end_inset nats \begin_inset Quotes erd \end_inset (bits base e). \layout Part New Techniques \layout Chapter Microchoice Bounds (the algebra of choices) \layout Standard \begin_inset LatexCommand \label{sec-MC} \end_inset \layout Standard The work in this chapter is joint with Avrim Blum and was first presented at ICML \begin_inset LatexCommand \cite{MC} \end_inset and then in the Machine Learning Journal \begin_inset LatexCommand \cite{MC_journal} \end_inset . The presentation here generalizes, unifies, and improves the earlier work. Microchoice bounds can be thought of as unifying \begin_inset Quotes eld \end_inset Self-bounding Learning Algorithms \begin_inset Quotes erd \end_inset \begin_inset LatexCommand \cite{SB} \end_inset and the Occam's razor bound, \begin_inset LatexCommand \cite{BEHW} \end_inset . \layout Standard The bounds of the previous chapter do not use much structure on the hypothesis space. Yet learning algorithms induce a natural structure. In particular, many learning algorithms work by an iterative process in which they take a sequence of steps each from a small set of choices (small in comparison to the overall hypothesis set size). Local optimization algorithms such as hill-climbing or simulated annealing, for example, work in this manner. Each step in a local optimization algorithm can be viewed as making a choice from a small set of possible steps to take. If we take into account the number of choices made and the size (and other properties) of each choice set, can we produce a tighter bound? This chapter introduce microchoice bounds which use this type of information to construct high confidence bounds on the future error rate. \layout Standard The microchoice bounds can be thought of in several ways. The simple microchoice bound (given in section \begin_inset LatexCommand \ref{sec:mc} \end_inset ) can be thought of as a well motivated application of the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset ). The idea is to use the learning algorithm itself to define a description language for hypotheses, so that the description length of the hypothesis actually produced gives a bound on the estimation error. The adaptive microchoice theorem (in section \begin_inset LatexCommand \ref{sec:query} \end_inset ) can be thought of as a computationally tractable adaptation of Self-bounding Learning algorithms \begin_inset LatexCommand \cite{SB} \end_inset . Microchoice bounds tie together and show the relationship between these different sample complexity bounds. \layout Standard Microchoice bounds also provide insight into the nature of choices. In general, we know that choice is \begin_inset Quotes eld \end_inset bad \begin_inset Quotes erd \end_inset for the purposes of creating a uniform bound on the true error rate. The microchoice bounds give a quantitative understanding of how much choice is \begin_inset Quotes eld \end_inset bad \begin_inset Quotes erd \end_inset . In particular, the log of the choice space size is the natural measure of \begin_inset Quotes eld \end_inset badness \begin_inset Quotes erd \end_inset . This is directly related to the log of the hypothesis space size in the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ). There is also an indirect relationship with information theory where the log of the alphabet size is an important parameter for specifying the number of bits required to send a message. \layout Standard Viewed as an interactive proof of learning, microchoice bounds can be described pictorially as in figure \begin_inset LatexCommand \ref{fig-microchoice-protocol} \end_inset . \layout Standard \begin_float fig \layout Standard \align center \begin_inset Figure size 267 176 file thesis-presentation/microchoice.eps width 4 90 flags 9 \end_inset \layout Caption \begin_inset LatexCommand \label{fig-microchoice-protocol} \end_inset Microchoice bounds collapse the two-round Occam's Razor style protocol into a one round protocol. This is (essentially) done by providing a compiler for the verifier which takes a learning algorithm as input, and extracts a choice of \begin_inset Quotes eld \end_inset prior \begin_inset Quotes erd \end_inset . \end_float \layout Standard Important early work developing approximately self-bounding learning algorithms was also done by Domingos \begin_inset LatexCommand \cite{Domingos} \end_inset . \layout Section A Motivating Observation \layout Standard \begin_inset LatexCommand \label{sec:motivating} \end_inset \layout Standard Imagine, for the moment, that we know the (unknown) problem distribution, \begin_inset Formula \( D \) \end_inset . For a given learning algorithm \begin_inset Formula \( A \) \end_inset , a distribution \begin_inset Formula \( D \) \end_inset on labeled examples induces a distribution \begin_inset Formula \( q(h) \) \end_inset over the possible hypotheses \begin_inset Formula \( h\in H \) \end_inset produced by algorithm \begin_inset Formula \( A \) \end_inset after \begin_inset Formula \( m \) \end_inset examples \begin_float footnote \layout Standard \begin_inset Formula \( q(h) \) \end_inset is the probability over draws of examples and any internal randomization of the algorithm. \end_float . A natural choice for the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset ) is the measure \begin_inset Formula \( p(h)=q(h) \) \end_inset . Is this choice optimal? The answer is \begin_inset Quotes eld \end_inset yes \begin_inset Quotes erd \end_inset , given the right notion of optimal. In particular, if we start with the relative entropy Occam's razor bound ( \begin_inset LatexCommand \ref{th-reorb} \end_inset ), we can show that \begin_inset Formula \( p(h)=q(h) \) \end_inset minimizes the expected value of the bound on the Kullback-Leibler divergence between the empirical error and true error. \layout Theorem (KL divergence minimization) \begin_inset Formula \( p(h)=q(h) \) \end_inset minimizes the expected value of the KL divergence in the relative entropy Occam's Razor bound ( \begin_inset LatexCommand \ref{th-reorb} \end_inset ). \layout Proof We need to show that \begin_inset Formula \[ q(h)=\textrm{argmin}_{p(h)}\sum _{h}q(h)\frac{\ln \frac{1}{p(h)}+\ln \frac{1}{\delta }}{m}\] \end_inset removing terms which the minimum does not depend on, we get: \begin_inset Formula \[ \textrm{argmin}_{p(h)}\sum _{h}q(h)\ln \frac{1}{p(h)}\] \end_inset adding a constant, we get: \begin_inset Formula \[ \textrm{argmin}_{p(h)}\sum _{h}q(h)\ln \frac{q(h)}{p(h)}\] \end_inset This is equivalent to minimizing the Kullback-Leibler divergence between the distribution of \begin_inset Formula \( q(h) \) \end_inset and \begin_inset Formula \( p(h) \) \end_inset which is minimized for \begin_inset Formula \( q(h)=p(h) \) \end_inset . \layout Standard Using the KL divergence as our notion of loss is somewhat non-intuitive. However, it is mathematically simple and not irrational. For all true errors, the KL divergence will be upper and lower bounded between an \begin_inset Formula \( l_{1} \) \end_inset and an \begin_inset Formula \( l_{2} \) \end_inset metric. Since these are two of the most common metrics, the choice of KL divergence based metric should behave similarly well. \layout Standard The point of these observations is to notice that if the structure of the learning algorithm produces a choice \begin_inset Formula \( p(h) \) \end_inset that approximates \begin_inset Formula \( q(h) \) \end_inset , the result should be better estimation bounds. \layout Section The Simple Microchoice Bound \layout Standard \begin_inset LatexCommand \label{sec:mc} \end_inset \layout Standard The simple microchoice bound is essentially a compelling and easy way to select a measure \begin_inset Formula \( p(h) \) \end_inset for learning algorithms that operate by making a series of small choices. In particular, consider a learning algorithm that works by making a sequence of choices, \begin_inset Formula \( c_{1},...,c_{d} \) \end_inset , from a sequence of choice sets, \begin_inset Formula \( C_{1},...,C_{d} \) \end_inset , finally producing a hypothesis, \begin_inset Formula \( h\in H \) \end_inset . Specifically, the algorithm first looks at the choice set \begin_inset Formula \( C_{1} \) \end_inset and the data \begin_inset Formula \( z^{N} \) \end_inset to produce choice \begin_inset Formula \( c_{1}\in C_{1} \) \end_inset . The choice \begin_inset Formula \( c_{1} \) \end_inset then determines the next choice set \begin_inset Formula \( C_{2} \) \end_inset (different initial choices produce different choice sets for the second level). The algorithm again looks at the data to make some choice \begin_inset Formula \( c_{2}\in C_{2} \) \end_inset . This choice then determines the next choice set \begin_inset Formula \( C_{3} \) \end_inset , and so on. These choice sets can be thought of as nodes in a \emph on choice tree \emph default , where each node in the tree corresponds to some internal state of the learning algorithm, and a node containing some choice set \begin_inset Formula \( C \) \end_inset has branching factor \begin_inset Formula \( |C| \) \end_inset . Pictorially, we can draw the tree as follows: \layout Standard \begin_inset Figure size 270 112 file thesis-presentation/microchoice_tree.eps width 4 91 flags 9 \end_inset \layout Standard Depending on the learning algorithm, sub-trees of the overall tree may be identical. We address optimization of the bound for this case later. Eventually there is a final choice leading to a leaf, and a single hypothesis is output. \layout Standard For example, the decision list algorithm of Rivest \begin_inset LatexCommand \cite{Rivest} \end_inset , applied to a set of \begin_inset Formula \( n \) \end_inset features, uses the data to choose one of \begin_inset Formula \( 4n+2 \) \end_inset rules (e.g., \begin_inset Quotes eld \end_inset if \begin_inset Formula \( \bar{x}_{3} \) \end_inset then \begin_inset Formula \( - \) \end_inset \begin_inset Quotes erd \end_inset ) to put at the top. Based on the choice made, it moves to a choice set of \begin_inset Formula \( 4n-2 \) \end_inset possible rules to put at the next level, then a choice set of size \begin_inset Formula \( 4n-6 \) \end_inset , and so on, until eventually it chooses a rule such as \begin_inset Quotes eld \end_inset else \begin_inset Formula \( + \) \end_inset \begin_inset Quotes erd \end_inset leading to a leaf. \layout Standard The microchoice bound calculation program is as follows: \layout Algorithm Calculate_Microchoice \layout Enumerate set \begin_inset Formula \( p\leftarrow 1 \) \end_inset \layout Enumerate while learning algorithm has not halted. \begin_deeper \layout Enumerate \begin_inset Formula \( |C|\leftarrow \) \end_inset number of possible data-dependent choices \layout Enumerate \begin_inset Formula \( p\leftarrow \frac{p}{|C|} \) \end_inset \end_deeper \layout Enumerate return \begin_inset Formula \( p \) \end_inset \layout Standard Pictorially, this algorithm can be thought of as taking a \begin_inset Quotes eld \end_inset supply \begin_inset Quotes erd \end_inset of probability at the root of the choice tree. \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 267 139 file thesis-presentation/microchoice_alg.eps width 4 90 flags 9 \end_inset \layout Standard The root takes its supply and splits it equally among all its children. Recursively, each child then does the same: it takes the supply it is given and splits it evenly among its children, until all of the supplied probability is allocated among the leaves. If we examine some leaf containing a hypothesis \begin_inset Formula \( h \) \end_inset , we see that this method gives at least probability \begin_inset Formula \( p(h)=\prod _{i=1}^{d(h)}\frac{1}{|C_{i}(h)|} \) \end_inset to each \begin_inset Formula \( h \) \end_inset for any path of depth \begin_inset Formula \( d(h) \) \end_inset reaching the hypothesis \begin_inset Formula \( h \) \end_inset . \layout Standard Note it is possible that several leaves will contain the same hypothesis \begin_inset Formula \( h \) \end_inset , and in that case one should really add the allocated measures together. However, the microchoice bound neglects this issue, implying that it will be unnecessarily loose for learning algorithms which can arrive at the same hypothesis in multiple ways. The reason for neglecting this is that now, \begin_inset Formula \( p(h) \) \end_inset is something the learning algorithm itself can calculate by simply keeping track of the sizes of the choice sets it has encountered so far. It is important to notice that this construction is defined before observing any data. Consequently, every hypothesis has some bound associated with it before the data is used to pick a particular hypothesis and its corresponding bound. \layout Standard Another way to view this process is that we cannot know in advance which choice sequence the algorithm will make. However, a distribution \begin_inset Formula \( D \) \end_inset on labeled examples induces a probability distribution over choice sequences, inducing a probability distribution \begin_inset Formula \( q(h) \) \end_inset over hypotheses. Ideally we would like to use \begin_inset Formula \( p(h)=q(h) \) \end_inset in our bounds as noted above. However, we cannot calculate \begin_inset Formula \( q(h) \) \end_inset (since the distribution \begin_inset Formula \( D \) \end_inset is unknown), so instead, our choice of \begin_inset Formula \( p(h) \) \end_inset will be just an estimate. We hope that the algorithm designer has chosen a \begin_inset Quotes eld \end_inset good \begin_inset Quotes erd \end_inset leaning algorithm which induces a distribution \begin_inset Formula \( p(h) \) \end_inset over the final hypotheses which is near to \begin_inset Formula \( q(h) \) \end_inset . Our estimate \begin_inset Formula \( p(h) \) \end_inset is the probability distribution resulting from picking each choice uniformly at random from the current choice set at each level (note: this is different from picking a final hypothesis uniformly at random). I.e., it can be viewed as the measure associated with the assumption that at each step, all choices are equally likely. \layout Standard We immediately find the following theorem: \layout Theorem \begin_inset LatexCommand \label{th-smb} \end_inset (Microchoice Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, e(h)>\bar{e}(m,\hat{e}(h),\frac{\delta }{\prod _{i=1}^{d(h)}|C_{i}(h)|})\right) \leq \delta \] \end_inset where \begin_inset Formula \( \bar{e}\left( m,\frac{k}{m},\delta \right) \equiv \max _{p}\{p:\, \, \textrm{Bin}(m,k,p)=\delta \} \) \end_inset \layout Standard \noindent \series bold Proof. \series default Specialization of the Occam's Razor bound ( \begin_inset LatexCommand \ref{th-ORB} \end_inset ). \layout Standard Once again, it will be worthwhile to slightly loosen this bound with the following corollary: \layout Corollary \begin_inset LatexCommand \label{th-remb} \end_inset (Relative Entropy Microchoice Bound) For all hypothesis spaces, \begin_inset Formula \( H \) \end_inset , for all \begin_inset Formula \( \delta \in (0,1] \) \end_inset : \begin_inset Formula \[ \Pr _{D^{m}}\left( \exists h\in H:\, \, \textrm{KL}(\hat{e}(h)||e(h))>\frac{\sum _{i=1}^{d(h)}\ln |C_{i}(h)|+\ln \frac{1}{\delta }}{m}\right) \leq \delta \] \end_inset \layout Standard The point of the microchoice bound is that the quantity \begin_inset Formula \( \bar{e}(...) \) \end_inset is something the algorithm can calculate as it goes along, based on the sizes of the choice sets encountered. To see this, note that the hypothesis dependent term is \begin_inset Formula \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \) \end_inset . We can calculate \begin_inset Formula \( d(h) \) \end_inset by noting the number of choices made. The choice sets, \begin_inset Formula \( C_{i}(h) \) \end_inset , can often be easily deduced by reasoning about the possible microchoices the algorithm could have made given different datasets. \layout Standard In many natural cases, a \begin_inset Quotes eld \end_inset fortuitous distribution and target concept \begin_inset Quotes erd \end_inset corresponds to a shallow leaf or a part of the tree with low branching, resulting in a better bound. For instance, in the decision list case, \begin_inset Formula \( \sum _{i=1}^{d(h)}\ln |C_{i}(h)| \) \end_inset is roughly \begin_inset Formula \( d\ln n \) \end_inset where \begin_inset Formula \( d \) \end_inset is the length of the list produced and \begin_inset Formula \( n \) \end_inset is the number of features. Notice that \begin_inset Formula \( d\ln n \) \end_inset is also the description length of the final hypothesis produced in the natural encoding, thus in this case these theorems yield similar bounds to a simple application of Occam's razor or SRM. \layout Standard More generally, the microchoice bound is similar to Occam's razor or SRM bounds when each \begin_inset Formula \( k \) \end_inset -ary choice in the tree corresponds to \begin_inset Formula \( \log k \) \end_inset bits in the natural encoding of the final hypothesis \begin_inset Formula \( h \) \end_inset . However, sometimes this may not be the case. Consider, for instance, a local optimization algorithm in which there are \begin_inset Formula \( n \) \end_inset parameters and each step adds or subtracts 1 from one of the parameters. Suppose in addition the algorithm knows certain constraints that these parameters must satisfy (perhaps a set of linear inequalities) and the algorithm restricts itself to choices in the legal region. In this case, the branching factor, at most \begin_inset Formula \( 2n \) \end_inset , might become much smaller if we are \begin_inset Quotes eld \end_inset lucky \begin_inset Quotes erd \end_inset and head toward a highly constrained portion of the solution space. One could always reverse-engineer an encoding of hypotheses based on the choice tree, but the microchoice approach is much more natural. \layout Standard There is also an opportunity to use \emph on a \protected_separator priori \emph default knowledge in the choice of \begin_inset Formula \( p(h) \) \end_inset . In particular, instead of splitting our confidence equally at each node of the tree, we could split it unevenly, according to some heuristic function \begin_inset Formula \( g \) \end_inset . If \begin_inset Formula \( g \) \end_inset is \begin_inset Quotes eld \end_inset good \begin_inset Quotes erd \end_inset it may produce error bounds similar to the bounds when \begin_inset Formula \( p(h)=q(h) \) \end_inset . In fact, the method of section ( \begin_inset LatexCommand \ref{sec:query} \end_inset ) where we combine these results with Freund's query-tree \begin_inset LatexCommand \cite{SB} \end_inset approach can be thought of as an attempt to do exactly this. \layout Subsection Examples \layout Standard It is difficult to create a bound which is universally better than previous bounds. The microchoice bound can be much better than the discrete hypothesis bound ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ) and can be slightly worse. To develop some understanding of how they compare we consider several cases. \layout Subsubsection Greedy Set Cover \layout Standard Consider a greedy set cover algorithm for learning an OR function over \begin_inset Formula \( F \) \end_inset Boolean features. The algorithm begins with a choice space of size \begin_inset Formula \( F+1 \) \end_inset (one per feature or halt) and chooses the feature that covers the most positive examples while covering no negative ones. It then moves to a choice space of size \begin_inset Formula \( F \) \end_inset (one per feature remaining or halt) and chooses the best remaining feature and so on until it halts. If the number of features chosen is \begin_inset Formula \( k \) \end_inset then the microchoice bound is: \layout Standard \begin_inset Formula \[ \epsilon (h)=\frac{1}{m}\left( \ln \frac{1}{\delta }+\sum _{i=1}^{k}\ln (F-i+2)\right) \leq \frac{1}{m}\left( \ln \frac{1}{\delta }+k\ln (F+1)\right) \] \end_inset \layout Standard The bound of ( \begin_inset LatexCommand \ref{th-DHSCP} \end_inset ) is: \layout Standard \begin_inset Formula \[ \epsilon =\frac{1}{m}\left( \ln \frac{1}{\delta }+F\ln 2\right) .\] \end_inset \layout Standard If \begin_inset Formula \( k \) \end_inset is small, then the microchoice bound is a lot better, but if \begin_inset Formula \( k=O(F) \) \end_inset then the microchoice bound is slightly worse than the discrete hypothesis bound. Notice that in this case the microchoice bound is essentially the same as the standard Occam's razor analysis when one uses \begin_inset Formula \( O(\ln F) \) \end_inset bits per feature to describe the hypothesis. \layout Subsubsection Decision Trees \layout Standard Decision trees over discrete sets (say, \begin_inset Formula \( \{0,1\}^{F} \) \end_inset ) are another natural setting for application of the microchoice bound. \layout Standard A decision tree differs from a decision list in that the size of the available choice set is larger due to the fact that there are multiple nodes where a new test may be applied. In particular, for a decision tree with \begin_inset Formula \( K \) \end_inset leaves at an average depth of \begin_inset Formula \( d \) \end_inset , the choice set size is \begin_inset Formula \( K(F-d) \) \end_inset , giving a bound noticeably worse than the bound for the decision list. This motivates a slightly different decision algorithm which considers only one leaf node at a time. The algorithm adds a new test or decides to never add a new test at this node. In this case, there are \begin_inset Formula \( (F-d(v)+1) \) \end_inset choices for a node \begin_inset Formula \( v \) \end_inset at depth \begin_inset Formula \( d(v) \) \end_inset , implying the bound: \begin_inset Formula \begin{equation} \label{eqn:dectreemc} \textrm{KL}(\hat{e}(h)||e(h))\leq \frac{1}{m}\left( \ln \frac{1}{\delta }+\sum _{v}\ln (F-d(v)+1)\right) \end{equation} \end_inset where \begin_inset Formula \( v \) \end_inset ranges over the nodes of the decision tree. Once again, this is very similar to what might be produced by an Occam's Razor Bound with an appropriate choice of prior. This result is again sometimes much better than the Discrete Hypothesis bound and sometimes slightly worse. \layout Subsection Pruning \layout Standard Decision tree algorithms for real-world learning problems often have some form of \begin_inset Quotes eld \end_inset pruning \begin_inset Quotes erd \end_inset as in \begin_inset LatexCommand \cite{Quinlan} \end_inset and \begin_inset LatexCommand \cite{Mingers} \end_inset . The tree is first grown to full size producing a hypothesis with minimum empirical error. Then the tree is \begin_inset Quotes eld \end_inset pruned \begin_inset Quotes erd \end_inset starting at the leaves and progressing up through the tree toward the root node using some test for the significance of an internal node. An internal node is not significant if the reduction in total error is small in comparison to the complexity of its children. Insignificant internal nodes are replaced with a leaf resulting in a smaller tree. \layout Standard Microchoice bounds have the property that they incidentally prove a bound for every decision tree which can be found by pruning internal nodes. In particular, one of the choices available when constructing a node is to make the node a leaf. Therefore, if we begin with the tree \begin_inset Formula \( T \) \end_inset and then prune to the smaller tree \begin_inset Formula \( T' \) \end_inset , we can apply the bound ( \begin_inset LatexCommand \ref{eqn:dectreemc} \end_inset ) to \begin_inset Formula \( T' \) \end_inset \emph on as if \emph default the algorithm had constructed \begin_inset Formula \( T' \) \end_inset directly rather than having gone first through the tree \begin_inset Formula \( T \) \end_inset . This suggests another possible pruning criterion: prune a node if the pruning would result in an improved microchoice bound. That is, prune if the increase in empirical error is less than the decrease in \begin_inset Formula \( \epsilon (h) \) \end_inset . This pruning criteria is a \begin_inset Quotes eld \end_inset pessimistic criteria \begin_inset Quotes erd \end_inset \begin_inset LatexCommand \cite{Yishay} \end_inset . \layout Standard The similarities to SRM are discussed next. \layout Subsection Microchoice and Structural Risk Minimization \layout Standard The microchoice bound is essentially a compelling application of the Disjoint SRM bound \begin_inset LatexCommand \ref{th-SRM} \end_inset where the description language for a hypothesis is the sequence of data-depende nt choices which the algorithm makes in the process of deciding upon the hypothesis. The hypothesis set \begin_inset Formula \( H_{i} \) \end_inset is all hypotheses with the same description length in this language. \layout Standard As an example, consider a binary decision tree with \begin_inset Formula \( F \) \end_inset Boolean features and a Boolean label. The first hypothesis set, \begin_inset Formula \( H_{1} \) \end_inset will consist of \begin_inset Formula \( 2 \) \end_inset hypotheses; always false and always true. In general, we will have one hypothesis set for every legal configuration of internal nodes. The size of a hypothesis set where every tree contains \begin_inset Formula \( k \) \end_inset internal nodes will be \begin_inset Formula \( 2^{k+1} \) \end_inset because there are \begin_inset Formula \( k+1 \) \end_inset leaves each of which can take \begin_inset Formula \( 2 \) \end_inset values. The weighting \begin_inset Formula \( p(i) \) \end_inset across the different hypothesis sets is defined by the microchoice allocation of confidence. \layout Standard The principle disadvantage of the microchoice bound is that the sequence of data-dependent choices may contain redundancy. A different SRM bound with a different set of disjoint hypothesis sets might be able to better avoid redundancy. As an example, assume that we are working with a decision tree on \begin_inset Formula \( F \) \end_inset binary features. There are \begin_inset Formula \( F+2 \) \end_inset choices (any of \begin_inset Formula \( F \) \end_inset features or \begin_inset Formula \( 2 \) \end_inset labels) at the top node. At the next node down there will be \begin_inset Formula \( F+1 \) \end_inset choices in both the left and right children. Repeat until a maximal decision tree is constructed. There will be \begin_inset Formula \( \prod _{i=0}^{F}(F-i+2)^{2^{i}} \) \end_inset possible trees. This number is somewhat larger than the number of Boolean functions on \begin_inset Formula \( F \) \end_inset features: \begin_inset Formula \( 2^{2^{F}} \) \end_inset . \layout Section Combining Microchoice with Freund's Query Tree approach \layout Standard \begin_inset LatexCommand \label{sec:query} \end_inset \layout Standard The next section is devoted to an improvement of the microchoice bound called adaptive microchoice, which arises from synthesizing Freund's query trees \begin_inset LatexCommand \cite{SB} \end_inset with the microchoice bound. This improvement is not easily expressed as a simplification of Structural Risk Minimization. In essence, the adaptive microchoice bound can gain from dependence on the learning problem distribution \begin_inset Formula \( D \) \end_inset and can take advantage of an \begin_inset Quotes eld \end_inset easy \begin_inset Quotes erd \end_inset distribution. \layout Standard First we require some background material in order to state and understand Freund's bound. \layout Subsection Preliminaries and Definitions \layout Standard The statistical query framework introduced by Kearns \begin_inset LatexCommand \cite{SQ} \end_inset restricts learning algorithm to only access the data using statistical queries. A statistical query takes as input a binary predicate, \begin_inset Formula \( \chi \) \end_inset , mapping examples to a binary output: \begin_inset Formula \( (X,Y)\rightarrow \{0,1\} \) \end_inset . The output of the statistical query is the average of \begin_inset Formula \( \chi \) \end_inset over the examples seen. Let \begin_inset Formula \( z_{i} \) \end_inset be the \begin_inset Formula \( i \) \end_inset th labeled example, then: \begin_inset Formula \[ \hat{D}_{\chi }=\frac{1}{m}\sum _{i=1}^{m}\chi (z_{i})\] \end_inset \layout Standard The output is an empirical estimate of the true value \begin_inset Formula \( D_{\chi }={\textbf {E}}_{D}[\chi (z)]=\Pr _{z\sim D}(\chi (z)=1) \) \end_inset of the query under the distribution \begin_inset Formula \( D \) \end_inset \begin_float footnote \layout Standard In the real SQ model there is no set of examples. The algorithm asks a query \begin_inset Formula \( \chi \) \end_inset and is given a response \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset that is guaranteed to be near to the true value \begin_inset Formula \( D_{\chi } \) \end_inset . That is, the true SQ model is an abstraction of the scenario described here where \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset is computed from an observed sample. \end_float . One simple example of a predicate \begin_inset Formula \( \chi \) \end_inset is \begin_inset Quotes eld \end_inset the first bit is \begin_inset Formula \( 1 \) \end_inset \begin_inset Quotes erd \end_inset . A more complicated predicate might be \begin_inset Quotes eld \end_inset the third bit xor the 4th bit is \begin_inset Formula \( 0 \) \end_inset \begin_inset Quotes erd \end_inset . Naturally, the distribution of \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset will be the familiar Binomial distribution. \layout Standard It is convenient to define \begin_inset Formula \[ \bar{I}_{D_{\chi }}(\delta )=\frac{1}{m}\max _{k}\left\{ k:\, \, 1-\textrm{Bin}\left( m,k,D_{\chi }\right) \geq \frac{\delta }{2}\right\} \] \end_inset and \begin_inset Formula \[ \underline{I}_{D_{\chi }}(\delta )=\frac{1}{m}\min _{k}\left\{ k:\, \, \textrm{Bin}\left( m,k,D_{\chi }\right) \geq \frac{\delta }{2}\right\} \] \end_inset and let \begin_inset Formula \[ I_{\chi }(\delta )=\left[ \underline{I}_{D_{\chi }}(\delta ),\bar{I}_{D_{\chi }}(\delta )\right] \] \end_inset Intuitively, \begin_inset Formula \( I_{\chi }(\delta ) \) \end_inset is a (fixed) interval in which the random variable \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset will fall with high probability. In other words, we know that: \begin_inset Formula \[ \Pr \left[ \hat{D}_{\chi }\not \in I_{\chi }(\delta )\right] \leq \delta \] \end_inset Now, we want to construct a confidence interval based upon the high confidence interval \begin_inset Formula \( I_{\chi }(\delta ) \) \end_inset . We can do this using the inversion lemma ( \begin_inset LatexCommand \ref{lem-inversion} \end_inset ) to get: \begin_inset Formula \[ \bar{D}_{\chi }(\delta )=\max _{p}\left\{ p:\underline{I}_{p}(\delta )=\hat{D}_{\chi }\right\} \] \end_inset and \begin_inset Formula \[ \underline{D}_{\chi }(\delta )=\min _{p}\left\{ p:\bar{I}_{p}(\delta )=\hat{D}_{\chi }\right\} \] \end_inset \layout Standard The random interval defined here contains the \begin_inset Quotes eld \end_inset real \begin_inset Quotes erd \end_inset answer \begin_inset Formula \( D_{\chi } \) \end_inset with high probability. In other words, we have: \begin_inset Formula \[ \Pr \left[ D_{\chi }\not \in [\underline{D}_{\chi }(\delta ),\bar{D}_{\chi }(\delta )]\right] \leq \delta \] \end_inset \layout Subsection Background and Summary \layout Standard Freund \begin_inset LatexCommand \cite{SB} \end_inset considers choice algorithms that at each step perform a Statistical Query on the sample, using the result to determine which choice to take. For an algorithm \begin_inset Formula \( A \) \end_inset , tolerance \begin_inset Formula \( \alpha \) \end_inset (defined next), and distribution \begin_inset Formula \( D \) \end_inset , Freund defines the query tree \begin_inset Formula \( T_{A}(D,\alpha ) \) \end_inset as the choice tree created by considering only those choices resulting from answers \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset to queries \begin_inset Formula \( \chi \) \end_inset such that \begin_inset Formula \( \left| \hat{D}_{\chi }-D_{\chi }\right| \leq \alpha \) \end_inset . The idea is that if a particular predicate, \begin_inset Formula \( \chi \) \end_inset , is true with probability \begin_inset Formula \( .9 \) \end_inset (for example) on a random sample it is very unlikely that the empirical result of the query will be \begin_inset Formula \( .1 \) \end_inset . More generally, the chance the answer to a given query is off by more than \begin_inset Formula \( \alpha \) \end_inset is at most \begin_inset Formula \( 2e^{-2m\alpha ^{2}} \) \end_inset by Hoeffding's inequality. So, if the entire tree contains a total of \begin_inset Formula \( |Q(T_{A}(D,\alpha ))| \) \end_inset queries in it, the probability \emph on any \emph default of these queries is off by more than \begin_inset Formula \( \alpha \) \end_inset is at most \begin_inset Formula \( 2\cdot |Q(T_{A}(D,\alpha ))|\cdot e^{-2m\alpha ^{2}} \) \end_inset . In other words, this is an upper bound on the probability the algorithm ever \begin_inset Quotes eld \end_inset falls off the tree \begin_inset Quotes erd \end_inset and makes a low probability choice. The point of this is that we can allocate half (say) of the confidence parameter \begin_inset Formula \( \delta \) \end_inset to the event that the algorithm ever falls off the tree, and then spread the remaining half evenly on the hypotheses in the tree (which hopefully is a much smaller set than the entire hypothesis set). \layout Standard Unfortunately, the query tree suffers from the same problem as the \begin_inset Formula \( q(h) \) \end_inset distribution considered in section ( \begin_inset LatexCommand \ref{sec:motivating} \end_inset ), namely that to compute it, one needs to know \begin_inset Formula \( D \) \end_inset . So, Freund proposes an algorithmic method to find a super-set approximation of the tree. The idea is that by analyzing the results of queries, it is possible to determine which outcomes were unlikely given that the query is close to the desired outcome. In particular, each time a query \begin_inset Formula \( \chi \) \end_inset is asked and a response \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset is received, if it is true that \begin_inset Formula \( |\hat{D}_{\chi }-D_{\chi }|\leq \alpha \) \end_inset , then the range \begin_inset Formula \( \left[ \hat{D}_{\chi }-2\alpha ,\hat{D}_{\chi }+2\alpha \right] \) \end_inset contains the range \begin_inset Formula \( \left[ D_{\chi }-\alpha ,D_{\chi }+\alpha \right] \) \end_inset . Thus, under the assumption that no query in the \emph on correct \emph default tree is answered badly, a super-set of the correct tree can be produced by exploring all choices resulting from responses within \begin_inset Formula \( 2\alpha \) \end_inset of the response actually received. By applying this method to every node in the query tree we can generate an empirically observable super-set of the query tree: that is, the original query tree is a pruning of the empirically constructed tree. \layout Standard A drawback of this method is that it can easily take exponential time to produce the approximate tree, because even the smaller correct tree can have a size exponential in the running time of the learning algorithm. Instead, we would much rather simply keep track of the choices actually made and the sizes of the nodes actually followed, which is what the microchoic e approach allows us to do. As a secondary point, given \begin_inset Formula \( \delta \) \end_inset , computing a good value of \begin_inset Formula \( \alpha \) \end_inset for Freund's approach is not trivial, see \begin_inset LatexCommand \cite{SB} \end_inset ; we will be able to finesse that issue and use the tighter bound \begin_inset Formula \( \hat{D}_{\chi }\in I_{\chi }(\delta ) \) \end_inset . \layout Standard In order to apply the microchoice approach, we modify Freund's query tree so that different nodes in the tree receive different confidence, \begin_inset Formula \( \delta \) \end_inset , much in the same way that different hypotheses \begin_inset Formula \( h \) \end_inset in our choice tree receive different values of \begin_inset Formula \( \delta (h) \) \end_inset . \layout Subsection Microchoice Bounds for Query Trees \layout Standard The manipulations of the choice tree are now reasonably straightforward. We begin by describing the \emph on true \emph default microchoice query tree and then give the algorithmic approximation. As with the choice tree in section ( \begin_inset LatexCommand \ref{sec:mc} \end_inset ), one should think of each node in the tree as representing the current internal state of the algorithm. \layout Standard We incorporate Freund's approach into the choice tree construction by having each internal node allocate a portion, \begin_inset Formula \( p' \) \end_inset of its \begin_inset Quotes eld \end_inset supply \begin_inset Quotes erd \end_inset of failure probability to the event that \begin_inset Formula \( \hat{D}_{\chi }\not \in I_{\chi }(\delta *p') \) \end_inset . The node then splits the remainder of its supply evenly among the children corresponding to choices that result from answers \begin_inset Formula \( \hat{D}_{\chi } \) \end_inset with \begin_inset Formula \( \hat{D}_{\chi }\in I_{\chi }(\delta *p') \) \end_inset . Choices that would result from \begin_inset Quotes eld \end_inset bad \begin_inset Quotes erd \end_inset answers to the query are \emph on pruned away \emph default from the tree and get nothing. This continues down the tree to the leaves. Pictorially, this looks like: \layout Standard \added_space_top 0.3cm \added_space_bottom 0.3cm \align center \begin_inset Figure size 267 116 file thesis-presentation/amicro_tree.eps width 4 90 flags 9 \end_inset \layout Standard How should \begin_inset Formula \( p' \) \end_inset be chosen? Smaller values of \begin_inset Formula \( p' \) \end_inset result in larger intervals \begin_inset Formula \( I_{\chi }(\delta *p') \) \end_inset leading to more children in the pruned tree and less confidence given to each. Larger values of \begin_inset Formula \( p' \) \end_inset result in less left over to divide among the children. Unfo