12.3. Results & Discussion
12.3.1. Holdout bound
Our goal is to bound the true error of the hypothesis output by our learning algorithm.
To do this, we apply sample complexity bounds to the results of the decision tree on UCI
database problems. The problems chosen from the UCI database are those for which a
discrete decision tree is applicable. All bounds are calculated with the same fixed probability of failure.
As mentioned in the introduction, there are two approaches. The commonly used
approach is to first divide the example set into two sets: a training set and a holdout
set. The learning algorithm is then trained on the training examples and its hypothesis
is tested on the holdout examples. We chose a fixed split of the data into training and
holdout sets. We will compare each bound with this simple holdout approach because
it is the commonly used baseline.
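As a sketch, the train/holdout split can be implemented as below; the 0.8 training fraction and the function name are illustrative assumptions, not choices taken from the text.

```python
import random

def split_holdout(examples, frac=0.8, seed=0):
    """Shuffle and split `examples` into (train, holdout).

    The 0.8 training fraction is an illustrative default, not the
    split used in the text.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, holdout = split_holdout(data)
print(len(train), len(holdout))  # 80 20
```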
12.3.2. Comparison with a standard confidence interval approach
When attempting to calculate a confidence interval on the true error rate given the
holdout set, many people follow a standard statistical prescription:
- Calculate the empirical mean error rate on the holdout set.
- Calculate the empirical variance.
- Pretend that the distribution of the empirical error is a normal with the above
parameters and construct a confidence interval by cutting the tails of the Gaussian
cumulative distribution.
This approach is motivated by the fact that, for any fixed true error rate, the
distribution of empirical errors behaves like a Gaussian asymptotically. Here,
“asymptotically” means “in the limit as the number of test examples goes to infinity”.
The problem with this approach is that it leads to fundamentally misleading results.
In particular, figure 12.3.2 shows that the confidence interval is not confined to the
interval [0, 1]. It is difficult to give an interpretation to intervals with boundaries less
than 0 or greater than 1.
In addition, this approach is sometimes highly overoptimistic. When the test error is 0,
the empirical variance is also 0, so the interval collapses to a single point; our
confidence interval should not have size 0 for any finite number of test examples.
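Both pathologies of the Gaussian prescription are easy to reproduce. The sketch below (the function name is ours; 1.96 is the standard two-sided 95% z-value) computes the normal-approximation interval with the plug-in variance:

```python
import math

def gaussian_interval(errors, n, z=1.96):
    """Normal-approximation interval for the true error rate.

    Pretends the empirical error rate is Gaussian with the plug-in
    variance p(1-p)/n, as in the standard prescription above.
    """
    p = errors / n                  # empirical mean error rate
    var = p * (1 - p) / n           # plug-in (empirical) variance
    half = z * math.sqrt(var)
    return p - half, p + half

# Pathology 1: the interval can escape [0, 1].
lo, hi = gaussian_interval(errors=1, n=20)
print(lo, hi)                       # the lower endpoint is negative

# Pathology 2: zero empirical error gives a zero-width interval,
# no matter how few test examples we have.
lo0, hi0 = gaussian_interval(errors=0, n=5)
print(lo0, hi0)                     # (0.0, 0.0) -- wildly overoptimistic
```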
In contrast, the holdout bound approach uses the underlying Binomial distribution
directly. This implies:
- The holdout bound approach is never optimistic.
- The holdout bound based confidence interval always returns an upper and
lower bound in [0, 1].
- The holdout bound approach is more accurate.
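A minimal sketch of inverting the Binomial tail to get a holdout upper bound, assuming the standard setup of `errors` mistakes observed on `n` independent holdout examples (the bisection routine and the names are ours):

```python
import math

def binom_cdf(k, n, p):
    """P[Binomial(n, p) <= k], computed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k + 1))

def holdout_upper_bound(errors, n, delta=0.05):
    """Largest p with P[Bin(n, p) <= errors] >= delta, via bisection.

    The true error exceeds this value with probability at most delta;
    the result always lies in [0, 1].
    """
    lo, hi = errors / n, 1.0
    for _ in range(60):             # bisection to high precision
        mid = (lo + hi) / 2
        if binom_cdf(errors, n, mid) >= delta:
            lo = mid                # tail still heavy enough: push p up
        else:
            hi = mid
    return hi

# e.g. 0 errors observed on 20 holdout examples:
print(holdout_upper_bound(0, 20))   # about 0.14, never outside [0, 1]
```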
The bootstrap [15] is sometimes used to construct a confidence interval. The assumption
under which this works is essentially equivalent to an assumption of “enough” data. For
finite amounts of data, the bootstrap “confidence intervals” will necessarily be violated
on datasets with phase transitions such as 10.4.1. This is discussed further in the next
section.
12.3.3. Comparison with point estimators
Point estimators are techniques for directly estimating the value of the true error. In
theory, there should be no need to compare point estimators with confidence interval
bounds such as those discussed here because the goals are simply different: point
estimators attempt to estimate the value of the true error while confidence intervals
confine the value of the true error to an interval with high probability. However, point
estimators are often used for more than estimating true error. It is a common practice to
use point estimators in deciding which of two learning algorithms (or learning algorithm
parameters) is better.
There are several point estimators in use, including:
- Holdout test set error rate.
- The bootstrap.
One commonly used point estimator is the bootstrap. In typical use, the bootstrap
functions like this:
Repeat many times:
- Pick m examples uniformly at random (with replacement) from the original set of
m examples.
- Train on the resampled examples.
- Test on the examples not included in the training sample.
After the above computation, the training and test errors are combined according to
some formula (which often varies) to get an estimate of the true error rate of a
hypothesis learned on all m of the original examples.
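The recipe above can be sketched as follows; `train_and_test` is a hypothetical caller-supplied routine standing in for the learning algorithm, and the plain average used to combine the out-of-sample errors is just one of the combining formulas in use:

```python
import random

def bootstrap_error(examples, train_and_test, rounds=200, seed=0):
    """Bootstrap estimate of the true error rate, per the recipe above.

    `examples` is a list of (x, y) pairs.  `train_and_test(train, test)`
    is a hypothetical caller-supplied routine that trains on `train` and
    returns the error rate measured on `test`.
    """
    rng = random.Random(seed)
    m = len(examples)
    out_of_sample_errors = []
    for _ in range(rounds):
        # Pick m indices uniformly at random, *with replacement*.
        chosen = [rng.randrange(m) for _ in range(m)]
        train = [examples[i] for i in chosen]
        # Test on the examples not included in the training sample.
        chosen_set = set(chosen)
        held_out = [examples[i] for i in range(m) if i not in chosen_set]
        if held_out:
            out_of_sample_errors.append(train_and_test(train, held_out))
    # One common combining rule: average the out-of-sample error rates.
    return sum(out_of_sample_errors) / len(out_of_sample_errors)

# A trivial stand-in learner: always predict the majority label.
def majority_learner(train, test):
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(y != guess for _, y in test) / len(test)

data = [(i, i % 3 == 0) for i in range(30)]   # about 1/3 positive labels
print(bootstrap_error(data, majority_learner))
```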
There is one immediate observation: the resampling process typically results in only
about a 1 - 1/e ≈ 0.632 fraction of the examples appearing (at least once) in the
resampled subset. This has very strong implications, because there exist learning
problems with “phase transitions” where the accuracy of the learned hypothesis (even
for the best possible learning algorithm), viewed as a function of the number of training
examples, decreases suddenly when that number falls below some critical threshold.
This implies that point estimators cannot always be accurate on datasets with a phase
transition like 10.4.1. When learning a hypothesis on all m examples results in a small
true error rate, it could be the case that learning on only about 0.632m examples
results in the same true error rate, or it could be the case that the true error rate will
be much larger.
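The “about 0.632” figure follows from the fact that a given example is missed by a size-m resample drawn with replacement with probability (1 - 1/m)^m, which tends to 1/e:

```python
import math

# Probability that a given example is missed by a size-m resample drawn
# with replacement is (1 - 1/m)^m, which tends to 1/e as m grows.
m = 10_000
unique_fraction = 1 - (1 - 1 / m) ** m
print(unique_fraction)        # close to 0.632
print(1 - 1 / math.e)         # the limiting value
```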
Given that the bootstrap can sometimes fail to predict the true error rate, reasoning
about which algorithm is preferable based upon the bootstrap output is questionable.
One alternative is to reason with the following criterion:
Pick the learning algorithm with the lower upper bound.
Assuming that examples are drawn independently, this approach can never fail
arbitrarily badly (with high probability).
There are still open issues with this approach, such as: “What if the upper
bound is not tight?” It could be the case that a better learning algorithm has a
worse upper bound, implying that the worse algorithm will be picked according
to this criterion. One solution to this dilemma is to always involve some small
number of holdout examples in the bound calculation. Used judiciously, these
holdout examples can guarantee that the bound-based criterion never becomes too
loose.
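The selection rule can be sketched with any valid upper bound; here we use a simple Hoeffding-style bound (empirical error plus sqrt(ln(1/delta) / 2n)), which is looser than the exact Binomial tail but easy to state. The dictionary shape and the numbers are illustrative:

```python
import math

def hoeffding_upper_bound(errors, n, delta=0.05):
    """A valid (if loose) upper bound on the true error rate via
    Hoeffding's inequality: empirical error + sqrt(ln(1/delta) / 2n)."""
    return errors / n + math.sqrt(math.log(1 / delta) / (2 * n))

def pick_algorithm(results, delta=0.05):
    """results: {name: (holdout_errors, holdout_size)} (hypothetical shape).
    Returns the name with the lowest upper bound on true error."""
    return min(results, key=lambda a: hoeffding_upper_bound(*results[a], delta))

# Algorithm B has a slightly higher empirical error rate but a much
# larger holdout set, so its upper bound is lower and it is picked.
results = {"A": (1, 20), "B": (40, 500)}
print(pick_algorithm(results))  # B
```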
12.3.4. Simplistic bounds vs. the Holdout bound
Figure 12.3.3 compares a bound based upon theorem 4.2.3 and theorem 4.1.1. It is
remarkably pessimistic about the prospect of training set based bounds because the
confidence intervals are essentially vacuous. This bound can always be improved by
using exact (rather than approximate) calculations of the Binomial tail. It can
also be improved in practice by using a nonuniform “prior” over the hypothesis
space.
12.3.5. Occam vs. the Holdout bound
A “prior” (in the sense of theorem 4.6.1) does not help much with the visible
confidence intervals, although an examination of the calculations suggests that
improvements do exist; they just aren’t enough to make the confidence intervals
nonvacuous in figure 12.3.4. Note that the “prior” used here is the Microchoice prior.
Next, we will get rid of the approximation.
12.3.6. Microchoice
For the first time, we observe confidence intervals which are nonvacuous on a
training set in figure 12.3.5. This is encouraging, and a comparison with the holdout
approach indicates that the training set based confidence intervals are actually superior
on datasets with a small number of examples (and thus with a very small holdout
set).
12.3.7. Shell Bound
The Shell bound performs better than the Microchoice bound in figure
12.3.6. The information and computation requirements needed to calculate
the shell bound are quite large, but the resulting bound is noticeably tighter,
especially on problems with more examples. This bound is strong evidence that
training set based bounds can be made competitive with test set based bounds.
However, it is unnecessary to choose between these approaches since we can
construct a bound which uses information from both training and test set based
bounds.
12.3.8. Combined Microchoice and holdout bound
The combined Microchoice and Holdout bound performs only slightly worse than
the best of either bound, and is sometimes better than either bound individually for
the problems reported in figure 12.3.7. This particular combined bound is
(perhaps) the most practical result of this thesis since it is easy to calculate
the necessary information and reasonably easy to calculate the value of the
bound.
12.3.9. Combined Shell and Holdout Bound
The combined shell and holdout bound gives the best results of all in figure 12.3.8. The
downside of using the shell bound is that significantly more computation and information
is required in order to calculate the bound. The computational cost of the bound grows
quickly with the number of examples, which makes it impractical to apply beyond
moderately sized datasets with current computers.