One of the central concerns of learning is to understand and to
prevent overfitting. Various notions of “function complexity” often
arise: VC dimension, Rademacher complexity, comparison classes of
experts, and program length are just a few.
The term “complexity” seems misleading to me; these notions never
capture something that matches my intuitive sense of complexity. The
Bayesian notion clearly captures what’s going on. Functions aren’t
“complex” – they’re just “surprising”: we assign them low
probability. Most (all?) complexity notions I know boil down
to some (generally loose) bound on the prior probability of the function.
In a sense, “complexity” fundamentally arises because probability
distributions must sum to one. You can’t believe in all possibilities
at the same time, or at least not equally. Rather, you have to
carefully spread the probability mass over the options you’d like to
consider. Large complexity classes mean that beliefs are spread
thinly. In its simplest form, this phenomenon gives the log(1/n) for
n hypotheses in classic PAC bounds.
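A minimal statement of the kind of bound meant here (a standard Occam’s-razor form; the notation and constants are mine, not necessarily the ones intended above): for a finite class H of n hypotheses and m IID labeled examples, with probability at least 1 − δ,

$$\forall h \in H:\quad \mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h) + \sqrt{\frac{\ln n + \ln(1/\delta)}{2m}},$$

where the ln n term is just −ln(1/n), the log of the uniform prior 1/n placed on each hypothesis; an arbitrary prior p(h) replaces it with ln(1/p(h)).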
In fact, one way to think about good learning algorithms is that they
are those which take full advantage of their probability mass.
In the language of Minimum Description Length, they correspond to
“non-defective distributions”.
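To spell out “non-defective” (standard MDL/coding terminology, stated in my own words): a prefix-free description language with code lengths L(h) satisfies Kraft’s inequality,

$$\sum_h 2^{-L(h)} \;\le\; 1,$$

so p(h) = 2^{-L(h)} is a sub-probability distribution. The distribution is non-defective when equality holds, i.e., when no probability mass (equivalently, no code space) is left unused.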
So this raises a question: are there notions of complexity (preferably finite,
computable ones) that differ fundamentally from the notions of “prior”
or “surprisingness”? Game-theoretic setups would seem to be promising,
although much of the work I’m familiar with ties it closely to the notion
of prior as well.
So this “surprisingness” notion seems to apply whenever our complexity measure is somehow tied to the volume of our hypothesis space under a suitably defined measure. If we assign a uniform prior over this measure, larger volumes imply larger “surprise”. One measure of complexity that doesn’t directly translate into volumes is dimension: you can have a 20-dimensional space of feasible parameters in model A that has smaller volume than a 2-dimensional space of feasible parameters in model B, yet higher variance. So you’d have higher “surprise”, but lower complexity (assuming uniform priors over the parameter spaces).
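A hypothetical numerical instance of this point (the specific sets are mine): take feasible parameter sets A = [−0.1, 0.1]^20 for model A and B = [−1, 1]^2 for model B, so that

$$\mathrm{vol}(A) = 0.2^{20} \approx 10^{-14} \;\ll\; \mathrm{vol}(B) = 2^{2} = 4 .$$

Under a uniform prior over each feasible set, any particular parameter value in B gets lower density and hence higher “surprise” than one in A, even though B has far fewer dimensions.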
Also, it’s hard to define a consistent measure on an uncountably infinite-dimensional space (i.e., the space of continuous-valued random variables). So instead of measuring volumes, one could measure “diameters”. For instance, you could have an estimator that minimizes the worst possible loss, and that would give a bound that doesn’t rely on the notion of volume or prior.
My impression is that for a substantial number of people complexity=prior (not necessarily a Bayesian prior) is a reasonable viewpoint which rarely leads to a confused understanding.
One caveat is that it is not enough. Consider learning a threshold function f(x) = I(x > t) for some t. We know (from either VC analysis or a more direct quantile analysis) that about the same number of labeled, IID-drawn examples is required for choosing a good threshold function as for choosing between any two functions. Formally speaking, the description length of a threshold function is infinite (since t is a real number).
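One standard form of the quantile argument (the constants here are mine, and I assume the marginal over x has no atoms): in the realizable case with target threshold t*, any threshold consistent with m IID examples has error more than ε only if no example landed in one of the two intervals of probability mass ε adjacent to t*, so

$$\Pr[\mathrm{err} > \varepsilon] \;\le\; 2(1-\varepsilon)^m \;\le\; 2e^{-\varepsilon m},$$

and m ≥ (1/ε) ln(2/δ) examples suffice, with no dependence on any description length for t.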
The basic message is that we know how to learn in several ways which seem difficult to describe as ‘low description length’ or ‘large prior’. I believe the PAC-Bayes bound (and follow-on work) was a substantial step towards removing this caveat, but it is not a complete step.
For instance, I do not know how to describe shell bounds in terms of a convincing description language + bound for general description languages.
Another caveat has to do with the meaning of ‘know’. When all that is needed is knowing “which way is more complex” (such as for setting regularization parameters), a loose understanding of overfitting is adequate. When we are designing new learning algorithms, a “how much more complex” understanding may be necessary.
Interesting. I don’t see what shell bounds measure the “complexity” of: it’s definitely not just a function or a hypothesis space of functions.
Could you meaningfully use the word complexity here?
You can talk about the number of samples required in order to choose a good classifier. This is a ‘complexity’ of sorts.
I’ll buy that it’s the “complexity of learning” in some sense, but not the complexity of a function or a function space. I was hoping more for the latter….
Interesting post: the topic is nice but I have trouble understanding the details.
First of all, the notions of “function complexity” you mention (VC and Rademacher) are really notions of complexity for function CLASSES and not individual functions. Also, they measure something very well-defined: the largest difference between true and empirical error over functions in the class. So they do not have to match one’s understanding of complexity.
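For concreteness, the standard definition being referred to (in my notation): the empirical Rademacher complexity of a class F on a sample x_1, …, x_m, with independent uniform ±1 signs σ_i, is

$$\widehat{\mathfrak{R}}_S(F) \;=\; \mathbb{E}_{\sigma}\left[\,\sup_{f \in F}\ \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(x_i)\right],$$

and standard results bound the largest gap between true and empirical error over F by roughly twice this quantity plus a $\sqrt{\ln(1/\delta)/(2m)}$ confidence term.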
Program length is indeed a notion that applies to individual functions.
Now, I agree that “complexity” is probably not the best term. What we are after is, generally speaking, a way to express “preferences” between possible functions. Hence anything that provides an order on functions can be considered a prior. In practice, this order is most often provided via a functional (a map from functions to numbers). For example, in SVMs this is the norm of the function in a Hilbert space; in Bayesian learning it is the prior probability of that function (of course, if you use a GP, the latter is a monotonic function of the former).
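Spelling out that parenthetical in my own notation (treating the GP prior informally as a density over functions and ignoring the usual measure-theoretic caveats): for a zero-mean Gaussian process whose covariance induces the RKHS norm ‖·‖_H,

$$p(f) \;\propto\; \exp\!\left(-\tfrac{1}{2}\,\|f\|_{\mathcal{H}}^2\right), \qquad -\log p(f) \;=\; \tfrac{1}{2}\,\|f\|_{\mathcal{H}}^2 + \mathrm{const},$$

so ranking functions by prior probability and ranking them by the SVM-style norm produce the same order.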
The fact that this prior has to sum to one is not mandatory (again, take the example of the SVM: the prior is just the norm). However, what is mandatory is that not all possible functions have the same prior value (otherwise no generalization is possible; you would just say any possible value is fine as a prediction).
Thanks for your thoughts, Olivier.
Right. Hence, the clarification in my comment:
Great: I’d be interested in things that do correspond more closely to my notion of complexity.
p(w) \propto e^{- ||w||^2} seems like a perfectly good prior to me. The integration to 1 (or rather finite integrability: 1 isn’t a special number) is the crucial thing that makes these make sense. From an MDL perspective, unnormalizable measures would break Kraft’s inequality: program-length complexity wouldn’t make any sense.
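One way to make that MDL point concrete (a sketch in my notation, ignoring the quantization needed to code continuous w): a normalizable prior yields code lengths

$$L(w) \;=\; -\log_2 p(w) \;=\; \frac{\|w\|^2}{\ln 2} + \log_2 Z, \qquad Z = \int e^{-\|w\|^2}\,dw < \infty,$$

so the squared-norm regularizer is, up to constants, a description length. If Z were infinite there would be no such code, which is the sense in which unnormalizable measures break Kraft’s inequality.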