Quantitatively Tight Sample Complexity Bounds
John Langford
I present many new results on sample complexity bounds (bounds on the future
error rate of arbitrary learning algorithms). Of theoretical interest are qualitative
and quantitative improvements in sample complexity bounds as well as some techniques
and criteria for judging the tightness of sample complexity bounds.
On the practical side, I show quantitative results (with nontrivially small true
error rate bounds) for decision trees and neural networks with
these sample complexity bounds applied to real world problems. I also present
a technique for using both sample complexity bounds and (more traditional) holdout
techniques.
Together, the theoretical and practical results of this thesis provide a well-founded
practical method for evaluating learning algorithm performance based upon both
training and testing set performance.
Code for calculating these bounds is provided.
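To give a flavor of the kind of calculation such bound code performs (this is an illustrative sketch, not the thesis's code), a Hoeffding-style test set bound converts an observed error rate on m independent examples into a high-probability upper bound on the true error rate:

```python
import math

def hoeffding_bound(observed_error, m, delta):
    """Upper bound on the true error rate that holds with probability
    at least 1 - delta, for a single fixed classifier evaluated on m
    independent examples (a direct application of Hoeffding's inequality)."""
    return observed_error + math.sqrt(math.log(1.0 / delta) / (2.0 * m))

# Example: 5% observed error on 10,000 held-out examples, delta = 0.01.
bound = hoeffding_bound(0.05, 10_000, 0.01)
print(f"true error <= {bound:.4f} with probability >= 0.99")
```

Tighter bounds of the kind developed in this thesis replace the square-root slack with an exact inversion of the binomial tail, but the shape of the calculation is the same: training or test performance in, a verifiable true-error guarantee out.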
Part 1
Introductory Learning Theory
The introduction is broken into four parts. The first part is an informal introduction
to the learning model used here. The goal of the first part is to make explicit
the design choices made in selecting our model. This is particularly important
in machine learning because no model seems perfect.
- The second chapter formally introduces the model, the goals of the analysis,
and provides context to related work.
- The third chapter states the fundamental statistical results upon which all
of our techniques rest.
- The fourth chapter discusses basic learning theory results.
What is a sample complexity bound? Informally, it is a bound on the number of
examples required to learn a function. Therefore, in order to motivate the use
of a sample complexity bound, we must first motivate the learning problem.
What is learning? Learning is the process of discovering relationships between
events. Knowledge of the relationships between events is of great importance
in coping with the environment. Naturally, humans tend to be quite good at the
process of learning - so good that we sometimes do not even realize when it
is hard.
The learning problem which we focus on is learning for computers. In particular,
``How can a computer learn?'' A few examples are illustrative:
- How can we make a computer with a microphone output a text version of what is
being spoken?
- How can we make a computer with a camera recognize John?
- How can we make a computer controlling a robot arrive at some location?
We will work on learning a function from some input space to some output space
- the supervised learning model.
1.2 The problem with the learning problem
The learning problem, as stated, is somewhat ill-posed. There are some very
obvious ways for a computer to learn - for example, by memorization. The difficulty
arises when memorization is too expensive. Expense here typically has to do
with acquiring enough experience that future prediction problems have already
been encountered. The real learning problem becomes, ``Given incomplete information,
how can a computer learn?'' This formulation of the learning problem gives
rise to a new problem - quantifying the amount of information required to learn.
Sample complexity bounds address this second question: ``When can a computer
learn?''
In some cases learning is essentially hopeless. There are two notions of ``hopeless'':
information theoretic and computational. Information theoretic difficulties
arise when it is simply not possible to predict the output given the input.
For example, predicting when a radioactive nucleus will decay is always
difficult no matter what observations are made according to current physics[].
Even when simple relations between inputs and outputs exist, the computation
required to discover the simple relation can be formidable. A fine example of
this is provided by cryptography [] where a common task is to work
out functions for which it is not feasible to predict the input given the output.
1.3 A plethora of learning models
There are several possible learning models, which can be divided along a few
axes. Our first axis is the type of information given to the learning algorithm.
There are several possibilities:
- Labeled Examples: Vectors of observations.
- Partial relations: partial relations between events such as might be provided
by experts. This could include constraints or partial functions.
- Other forms of input.
We will assume that just examples (vectors of observations) are available as
a lowest common denominator amongst learning problems. It is worth noting that
this does not preclude the use of other forms of information which could be
much more powerful than mere examples.
Another important axis is the difficulty of the learning problem. Do we have
an opponent trying to minimize learning? Is someone helping us learn? Or is
the world oblivious?
- Teacher: The teacher model is a ``best case'' model. Here, we assume that
someone is providing the best examples possible in order to learn a relationship.
- Oblivious: The oblivious model is an ``in between'' model where we assume
that the world doesn't oppose or help us learn. Examples are picked in some
neutral manner.
- Opponent: The opponent model is a ``worst case'' model. Here, we assume that
the world is choosing examples in a way which minimizes our chance of learning.
Clearly, the strongest form of learning is learning in the opponent model, because
if something is learnable in the opponent model, then it is learnable in the
oblivious model. The same relationship also holds for oblivious and teacher
models. We will work in an oblivious model where examples are chosen in a neutral
manner. Why the oblivious model? Aside from the intractability of analysis in
an opponent model, we expect that most learning problems actually are oblivious:
we have neither an active teacher nor an active opponent. Thus an analysis in
the oblivious model will be directly applicable to many learning problems.
We have committed to an oblivious model with examples as our source of information.
With these two questions decided, the remaining questions will essentially
be decided in favor of simplicity. There are two more very important questions
to decide. The first is: does our algorithm get to pick the examples, or are
the examples picked for us?
- Active learning: The learning algorithm chooses a partial example and the remainder
is filled in by nature.
- Passive learning: The learning algorithm is simply given examples.
Active learning (aka experimental science) is inherently more powerful than
passive learning. As an example, consider the problem of predicting whether
it will rain or snow on any given day. By observation, we can eventually
discover the ``right'' threshold temperature, but this might take many days.
If we instead can control the temperature and make observations, it should be
possible to narrow in on the threshold temperature very quickly - with exponentially
fewer experiments than days of observation.
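The temperature example can be sketched in code. The threshold value and the rescaling of temperatures to [0, 1] are invented for illustration; the point is that chosen queries (binary search) localize the threshold exponentially faster than random observations:

```python
import random

random.seed(0)
TRUE_THRESHOLD = 0.31  # hypothetical rain/snow boundary, temperatures rescaled to [0, 1]

def snows(temperature):
    """Nature labels a temperature: 1 if it snows, 0 if it rains."""
    return 1 if temperature < TRUE_THRESHOLD else 0

def active_learn(queries):
    """Active learning: the learner picks each temperature to query (binary search)."""
    lo, hi = 0.0, 1.0
    for _ in range(queries):
        mid = (lo + hi) / 2
        if snows(mid):
            lo = mid   # threshold is above mid
        else:
            hi = mid   # threshold is at or below mid
    return (lo + hi) / 2

def passive_learn(samples):
    """Passive learning: temperatures arrive uniformly at random."""
    temps = sorted(random.random() for _ in range(samples))
    labels = [snows(t) for t in temps]
    snow_temps = [t for t, y in zip(temps, labels) if y == 1]
    rain_temps = [t for t, y in zip(temps, labels) if y == 0]
    if not snow_temps or not rain_temps:
        return 0.5  # never saw both outcomes; no information about the boundary
    # Best guess: midpoint between the warmest snow day and the coldest rain day.
    return (max(snow_temps) + min(rain_temps)) / 2

# 20 chosen queries localize the threshold to within 2^-20;
# 20 random observations leave far more residual uncertainty.
print(abs(active_learn(20) - TRUE_THRESHOLD))
print(abs(passive_learn(20) - TRUE_THRESHOLD))
```

With n chosen queries the uncertainty shrinks like 2^-n, while with n random observations it shrinks only like 1/n, matching the ``exponentially fewer experiments'' claim above.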
Despite the power of active learning, we will choose to work with a passive
learning model, because opportunities for passive learning are typically more
common than opportunities for active learning: passive learning requires only
observation, while active learning requires experimentation. Analyzing the active
learning setting in a generic manner also appears very difficult.
Our plan is to focus on an oblivious model with examples chosen by the world.
The remaining question is: Do we know which relation we want to learn? The two
possibilities are:
- Supervised learning: We want to learn to model an output in terms of an input.
- Unsupervised learning: We want to learn to model an arbitrary subset of observations
in terms of other observations.
We will focus on supervised learning and our exact setting will be defined next.
The question we want to answer is, ``When is supervised learning in an oblivious
model with examples chosen by the world feasible?''.
1.4 The oblivious passive supervised learning model
Oblivious will be modeled by an unknown distribution D over examples.
Here, an ``example'' is just a vector of observations. Since this is a supervised
learning model, all of our examples will split into two parts, (x, y), where x
is the ``input'' and y is the ``output'' (the thing we wish to predict). A quick
example is predicting whether precipitation will be in the form of rain or snow
(the ``y'' value) given the temperature (the ``x'' value).
For simplicity, we will typically work with theorems for binary valued y.
This restriction can be removed by generalizing the sample complexity bounds,
but we do not do so in order to keep the presentation simple.
The fundamental assumption we will make in all of our sample complexity bounds
is that all examples are drawn independently from the unknown distribution D.
This assumption must be stated explicitly and always kept in mind when considering
the relevance of sample complexity bounds.
Axiom 1
All examples are drawn independently from an unknown distribution D.
With the exception of this assumption, all of the other parameters in our bounds
will be verifiable at the time the bound is applied.
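A minimal simulation of this setting, with an invented distribution standing in for the unknown D: examples (x, y) are drawn independently, and the empirical error of a fixed predictor concentrates near its true error as the sample grows, which is exactly the phenomenon sample complexity bounds quantify.

```python
import random

random.seed(1)

def draw_example():
    """One independent draw from a simulated distribution D over labeled
    examples (x, y): temperature x is uniform on [0, 1], the label y says
    snow (1) or rain (0), and 10% label noise makes prediction imperfect."""
    x = random.random()
    y = 1 if x < 0.3 else 0
    if random.random() < 0.1:  # label noise
        y = 1 - y
    return x, y

def predictor(x):
    """A fixed classifier whose true error rate we want to estimate."""
    return 1 if x < 0.3 else 0

# The empirical error on m independent draws concentrates near the true
# error rate (0.1 by construction) as m grows.
for m in (100, 10_000):
    errors = sum(predictor(x) != y for x, y in (draw_example() for _ in range(m)))
    print(m, errors / m)
```

If the draws were not independent (for example, if consecutive days' weather were correlated), the empirical error could concentrate far from the true error, which is why the independence axiom must always be kept in mind.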
Note that we use a distribution over labeled examples and not a combination
of a distribution over the input space along with a function from the input
space to the output space as in many other fo