Quantitatively Tight Sample Complexity Bounds

John Langford

I present many new results on sample complexity bounds (bounds on the future error rate of arbitrary learning algorithms). Of theoretical interest are qualitative and quantitative improvements in sample complexity bounds as well as some techniques and criteria for judging the tightness of sample complexity bounds.
On the practical side, I show quantitative results (with true error rate bounds sometimes less than 0.01) for decision trees and neural networks with these sample complexity bounds applied to real world problems. I also present a technique for using both sample complexity bounds and (more traditional) holdout techniques.
Together, the theoretical and practical results of this thesis provide a well-founded practical method for evaluating learning algorithm performance based upon both training and testing set performance.
Code for calculating these bounds is provided.

Contents


Part 1
Introductory Learning Theory


The introduction is broken into 4 parts. The first part is an informal introduction to the learning model used here. The goal of the first part is to make explicit the design choices made in selecting our model. This is particularly important in machine learning because no model seems perfect.

Chapter 1
Informal Introduction

What is a sample complexity bound? Informally, it is a bound on the number of examples required to learn a function. Therefore, in order to motivate the use of a sample complexity bound, we must first motivate the learning problem.

1.1  The learning problem

What is learning? Learning is the process of discovering relationships between events. Knowledge of the relationships between events is of great importance in coping with the environment. Naturally, humans tend to be quite good at the process of learning - so good that we sometimes do not even realize when it is hard.
The learning problem which we focus on is learning for computers. In particular, ``How can a computer learn?'' A few examples are illustrative:
  1. How can we make a computer with a microphone output a text version of what is being spoken?
  2. How can we make a computer with a camera recognize John?
  3. How can we make a computer controlling a robot arrive at some location?
We will work on learning a function from some input space to some output space - the supervised learning model.

1.2  The problem with the learning problem

The learning problem, as stated, is somewhat ill-posed. There are some very obvious ways for a computer to learn - for example, by memorization. The difficulty arises when memorization is too expensive. Expense here typically has to do with acquiring enough experience so that future prediction problems have already been encountered. The real learning problem becomes, ``Given incomplete information, how can a computer learn?'' This formulation of the learning problem gives rise to a new problem - quantifying the amount of information required to learn. Sample complexity bounds address this second question: ``When can a computer learn?''
In some cases learning is essentially hopeless. There are two notions of ``hopeless'': information theoretic and computational. Information theoretic difficulties arise when it is simply not possible to predict the output given the input. For example, predicting when a radioactive nucleus will decay is always difficult, no matter what observations are made, according to current physics []. Even when simple relations between inputs and outputs exist, the computation required to discover the simple relation can be formidable. A fine example of this is provided by cryptography [], where a common task is to work out functions for which it is not feasible to predict the input given the output.

1.3  A plethora of learning models

There are several possible learning models which can be divided along several axes. Our first axis is the type of information given to the learning algorithm. There are several possibilities:
  1. Labeled Examples: Vectors of observations.
  2. Partial relations: partial relations between events such as might be provided by experts. This could include constraints or partial functions.
  3. Other forms of input.
We will assume that just examples (vectors of observations) are available as a lowest common denominator amongst learning problems. It is worth noting that this does not preclude the use of other forms of information which could be much more powerful than mere examples.
Another important axis is the difficulty of the learning problem. Do we have an opponent trying to minimize learning? Is someone helping us learn? Or is the world oblivious?
  1. Teacher: The teacher model is a ``best case'' model. Here, we assume that someone is providing the best examples possible in order to learn a relationship.
  2. Oblivious: The oblivious model is an ``in between'' model where we assume that the world doesn't oppose or help us learn. Examples are picked in some neutral manner.
  3. Opponent: The opponent model is a ``worst case'' model. Here, we assume that the world is choosing examples in a way which minimizes our chance of learning.
Clearly, the strongest form of learning is learning in the opponent model, because if something is learnable in the opponent model, then it is learnable in the oblivious model. The same relationship also holds for oblivious and teacher models. We will work in an oblivious model where examples are chosen in a neutral manner. Why the oblivious model? Aside from the intractability of analysis in an opponent model, we expect that most learning problems actually are oblivious: we have neither an active teacher nor an active opponent. Thus an analysis in the oblivious model will be directly applicable to many learning problems.
We have committed to an oblivious model with examples as our source of information. With these two questions decided, all the remaining questions will essentially be decided in favor of simplicity. There are two more very important questions to decide. The first is: does our algorithm get to pick the examples or are the examples picked for us?
  1. Active learning: The learning algorithm chooses a partial example and the remainder is filled in by nature.
  2. Passive learning: The learning algorithm is simply given examples.
Active learning (aka experimental science) is inherently more powerful than passive learning. As an example, consider the problem of predicting whether precipitation on any given day will fall as rain or snow. By observation, we can eventually discover the ``right'' threshold temperature, but this might take many days. If we instead can control the temperature and make observations, it should be possible to zero in on the threshold temperature very quickly - with exponentially fewer experiments than days of observation.
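The exponential gap above can be sketched in a few lines of code. This is a hypothetical simulation, not from the thesis: `passive_queries` observes random days until the observations bracket the threshold, while `active_queries` chooses each temperature by binary search, halving the uncertainty interval every query. The interval bounds, tolerance, and function names are all illustrative assumptions.

```python
import random

def passive_queries(threshold, low=-10.0, high=10.0, tol=0.5):
    """Passive learning: observe random days until the observed
    rain/snow labels bracket the threshold to within tol."""
    best_low, best_high = low, high
    count = 0
    while best_high - best_low > tol:
        count += 1
        t = random.uniform(low, high)
        if t < threshold:            # observed snow: threshold is above t
            best_low = max(best_low, t)
        else:                        # observed rain: threshold is at most t
            best_high = min(best_high, t)
    return count

def active_queries(threshold, low=-10.0, high=10.0, tol=0.5):
    """Active learning: set the temperature ourselves, halving the
    uncertainty interval with every experiment (binary search)."""
    count = 0
    while high - low > tol:
        count += 1
        mid = (low + high) / 2
        if mid < threshold:          # it snows at mid: threshold is above mid
            low = mid
        else:                        # it rains at mid: threshold is at most mid
            high = mid
    return count
```

Since the active learner halves a width-20 interval each query, it always needs exactly 6 experiments to reach a tolerance of 0.5, whereas the passive learner must wait for random observations to happen to fall near the threshold.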
Despite the power of active learning, we will choose to work with a passive learning model, because opportunities for passive learning are typically more common than opportunities for active learning: passive learning only requires observation, while active learning requires experimentation. Analyzing the active learning setting in a generic manner also appears very difficult.
Our plan is to focus on an oblivious model with examples chosen by the world. The remaining question is: Do we know which relation we want to learn? The two possibilities are:
  1. Supervised learning: We want to learn to model an output in terms of an input.
  2. Unsupervised learning: We want to learn to model an arbitrary subset of observations in terms of other observations.
We will focus on supervised learning and our exact setting will be defined next.
The question we want to answer is, ``When is supervised learning in an oblivious model with examples chosen by the world feasible?''.

1.4  The oblivious passive supervised learning model

Oblivious will be modeled by an unknown distribution D over examples. Here, an ``example'' is just a vector of observations. Since this is a supervised learning model, all of our examples will split into two parts, (x,y) where x is the ``input'' and y is the ``output'' (the thing we wish to predict). A quick example is predicting whether precipitation will be in the form of rain or snow (``y'' value) given the temperature (``x'' value).
For simplicity, we will typically work with theorems for binary-valued y. We can remove this choice by generalizing the sample complexity bounds - but we do not do so, to keep the presentation simple.
The fundamental assumption we will make in all of our sample complexity bounds is that all examples are drawn independently from the unknown distribution D. This assumption must be stated explicitly and always kept in mind when considering the relevance of sample complexity bounds.
Axiom 1 All examples are drawn independently from an unknown distribution D.
With the exception of this assumption, all of the other parameters in our bounds will be verifiable at the time the bound is applied.
Note that we use a distribution over labeled examples and not a combination of a distribution over the input space along with a function from the input space to the output space as in many other formulations.