One of the central concerns of learning is to understand and to
prevent overfitting. Various notion of “function complexity” often
arise: VC dimension, Rademacher complexity, comparison classes of
experts, and program length are just a few.
The term “complexity” to me seems somehow misleading; the terms never
capture something that meets my intuitive notion of complexity. The
Bayesian notion clearly captures what’s going on. Functions aren’t
“complex”– they’re just “surprising”: we assign to them low
probability. Most (all?) complexity notions I know boil down
to some (generally loose) bound on the prior probability of the function.
In a sense, “complexity” fundementally arises because probability
distributions must sum to one. You can’t believe in all possibilities
at the same time, or at least not equally. Rather you have to
carefully spread the probability mass over the options you’d like to
consider. Large complexity classes means that beliefs are spread
thinly. In it’s simplest form, this phenomenom give the log (1\n) for
n hypotheses in classic PAC bounds.
In fact, one way to think about good learning algorithms is that they
are those which take full advantage of their probability mass.
In the language of Minimum Description Length, they correspond to
So this raises a question: are there notions of complexity (preferably finite,
computable ones) that differ fundementally from the notions of “prior”
or “surprisingness”? Game-theoretic setups would seem to be promising,
although much of the work I’m familiar with ties it closely to the notion
of prior as well.
Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs.
- Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data.
- “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well.
- Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another.
This only scratches the surface—there are yet more subtleties. For example the (as mentioned in meaning of probability) shifts from one viewpoint to another.
Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints.
- Functionally. Regularization is optimizing some representation to fit the data and minimize some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data.
- Statistical Learning Viewpoint Regularization is about the failiure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate with respect to D of classifier c and e(c,S) the observed error rate on a sample S. There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S, e(c,D) less than e(c,S) + f(complexity) where complexity is some measure of the size of a set of functions. Unfortunately, we have never convincingly nailed the exact value of f(). We can note that f() is always monotonically increasing with the complexity measure and so there exists a unique constant C such that f(complexity)=C*complexity at the value of complexity which minimizes the bound. Empirical parameter tuning such as for the C constant in a support vector machine can be regarded as searching for this “right” tradeoff.
- Computationally Regularization can be thought of as a computational shortcut to computing the f() above. Hence, smoothness, convexity, and other computational constraints are important issues.
One thing which should be clear is that there is no one best method of regularization for all problems. “What is a good regularizer for my problem?” is another “learning complete” question since solving it perfectly implies solving the learning problem (For example consider the “regularizer” which assigns complexity 0 to the best prediction function and infinity to all others). Similarly, “What is an empirically useful regularizer?” is like “What is a good learning algorithm?” The choice of regularizer used when solving empirical problems is a degree of freedom with which prior information and biases can be incorporated in order to improve performance.
A loss function is some function which, for any example, takes a prediction and the correct prediction, and determines how much loss is incurred. (People sometimes attempt to optimize functions of more than one example such as “area under the ROC curve” or “harmonic mean of precision and recall”.) Typically we try to find predictors that minimize loss.
There seems to be a strong dichotomy between two views of what “loss” means in learning.
- Loss is determined by the problem. Loss is a part of the specification of the learning problem. Examples of problems specified by the loss function include “binary classification”, “multiclass classification”, “importance weighted classification”, “l2 regression”, etc… This is the decision theory view of what loss means, and the view that I prefer.
- Loss is determined by the solution. To solve a problem, you optimize some particular loss function not given by the problem. Examples of these loss functions are “hinge loss” (for SVMs), “log loss” (common in Bayesian Learning), and “exponential loss” (one incomplete explanation of boosting). One advantage of this viewpoint is that an appropriate choice of loss function (such as any of the above) results in a (relatively tractable) convex optimization problem.
I don’t fully understand the second viewpoint. It seems (to some extent) like looking where the light is rather than where your keys fell on the ground. Many of these losses-of-convenience also seem to have behavior unlike real world problems. For example in this contest somebody would have been the winner except they happened to predict one example incorrectly with very low probability. Under log loss, their loss became very high. This does not seem to correspond to the intuitive notion of what the loss should be on the problem.
“Assumption” is another word to be careful with in machine learning because it is used in several ways.
- Assumption = Bias There are several ways to see that some form of ‘bias’ (= preferring of one solution over another) is necessary. This is obvious in an adversarial setting. A good bit of work has been expended explaining this in other settings with “no free lunch” theorems. This is a usage specialized to learning which is particularly common when talking about priors for Bayesian Learning.
- Assumption = “if” of a theorem The assumptions are the ‘if’ part of the ‘if-then’ in a theorem. This is a fairly common usage.
- Assumption = Axiom The assumptions are the things that we assume are true, but which we cannot verify. Examples are “the IID assumption” or “my problem is a DNF on a small number of bits”. This is the usage which I prefer.
One difficulty with any use of the word “assumption” is that you often encounter “if assumption then conclusion so if not assumption then not conclusion“. This is incorrect logic. For example, with variant (1), “the assumption of my prior is not met so the algorithm will not learn”. Or, with variant (3), “the data is not IID, so my learning algorithm designed for IID data will not work”. In each of these cases “will” must be replaced with “may” for correctness.