Machine Learning (Theory)

Online learning is in vogue, which means we should expect to see in the near future:

Online boosting.
Online decision trees.
Online SVMs. (actually, we’ve already seen)
Online deep learning.
Online parallel learning.
etc…

There are three fundamental drivers of this trend.

Increasing size of datasets makes online algorithms attractive.
Online learning can simply be more efficient than batch learning. Here is a picture from a class on online learning:

The point of this picture is that even in 3 dimensions and even with linear constraints, finding the minima of a set in an online fashion can be typically faster than finding the minima in a batch fashion. To see this, note that there is a minimal number of gradient updates (i.e. 2) required in order to reach the minima in the typical case. Given this, it’s best to do these updates as quickly as possible, which implies doing the first update online (i.e. before seeing all the examples) is preferred. Note that this is the simplest possible setting—making the constraints nonlinear, increasing the dimensionality, introducing noise, etc… all make online learning preferred.
Online learning is conceptually harder, and hence more researchable. The essential difficulty is that there is no well defined static global optima, as exists for many other optimization-based learning algorithms. Grappling with this lack is of critical importance.

I have recently completed a 500+ page-book on MDL, the first comprehensive overview of the field (yes, this is a sneak advertisement 🙂 ).
Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are:

In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron, MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes!
In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by something similar to a Bayesian prior; the interpretation is different though; it is more similar to what has been called a “luckiness function” in the computational learning theory literature).

Both points are explained at length in the book (see esp page 544). Here I’ll merely say a bit more about the first.

MDL is always based on designing a universal code L relative to some given model M. Informally this is a code such that whenever some distribution P in M can be used to compress some data set well, then L will compress this data set well as well (I’ll skip the formal definition here). One method (but by no means the only method) for designing a universal code relative to model M is by taking some prior W on M and using the corresponding Shannon-Fano code, i.e. the code that encodes data z with length

L(z) = – log P_bayes(z),

where P_bayes(.) = \int P(.) d W(P) is the Bayesian marginal distribution for M relative to prior W. If M is parametric, then with just about any ‘smooth’ prior, the Bayesian code with lengths L(z) = – log P_bayes(z) leads to a reasonable universal code. But if M is nonparametric (infinite dimensional, such as in Gaussian process regression, or histogram density estimation with an arbitrary nr of components) then many priors which are perfectly fine according to Bayesian theory are ruled out by MDL theory. The reason is that for some P in M, the Bayesian codes based on such priors do not compress data sampled from P at all, even if the amount of data tends to infinity. One can formally prove that such Bayesian codes are not “universal” according to the standard definition of universality.

Now there exist two theorems by Andrew Barron (from 1991 and 1998, respectively) that directly connect data compression with frequentist statistical consistency. In essence, they imply that estimation based on universal codes must always be statistically consistent (the theorems also directly connect the convergence rates to the amount of compression obtained). For Bayesian inference, there exist various inconsistency results such as those by Diaconis and Freedman (1986) and Barron (1998). These say that, for some nonparametric models M, and with some priors on M, Bayesian inference can be inconsistent, in the sense that for some P in M, if data are i.i.d. sampled from P then even with an infinite amount of data, the posterior puts all its mass on distributions P’ in M that are substantially different from the “true” P. By Barron’s theorems, something like this can never happen for MDL; Diaconis and Freedman use priors which are not allowed according to MDL theory. In fact, MDL-based reasoning can also motivate certain prior choices in nonparametric contexts. For example, if one has little prior knowledge, why would one adopt an RBF kernel in Gaussian process regression? Answer: because the corresponding code has excellent universal coding properties, as shown by Kakade, Seeger and Foster (NIPS 2005): it has only logarithmic coding overhead if the underlying data generating process satisfies some smoothness properties; many other kernels have polynomial overhead. Thus, Gaussian processes combined with RBF kernels lead to substantial compression of the data, and therefore, by Barron’s theorem, predictions based on such Gaussian processes converge fast to the optimal predictions that one could only make make if one had access to the unknown imagined “true” distribution.

In general, it is often thought that different priors on M lead to codes that better compress data for some P in M, and that worse compress data for other P in M. But with nonparametric contexts, it is not like that: then there exist priors with “universally good” and “universally bad” coding properties.

This is not to say that all’s well for MDL in terms of consistency: as John and I showed in a paper that appeared earlier this year (but is really much older), if the true distribution P is not contained in the model class M under consideration but contains a good approximation P’ in M then both MDL and Bayes may become statistically inconsistent in the sense that they don’t necessarily converge to P’ or any other good approximation of P.

Thus: if model M parametric and P in M , then MDL and Bayes consistent. If model M nonparametric and P in M, then MDL consistent, Bayes not necessarily so. If P not in M, then both Bayes and MDL may be inconsistent.

This leaves one more very important case: what if P is in the closure of M, but not in M itself? For example, M is the set of all Gaussian mixtures with arbitrarily many components, and P is not a Gaussian mixture, but can be arbitrarily well-approximated (in the sense of KL divergence) by a sequence of Gaussian mixtures with ever more components? In this case, Bayes will be consistent but it can be too slow, i.e. it needs more data before the posterior converges than some other methods (like leave-one-out-cross-validation combined with ML estimation). In our forthcoming NIPS 2007 paper, Steven de Rooij, Tim van Erven and I provide a universal-coding based procedure which converges faster than Bayes in those cases, but does not suffer from the disadvantages of leave-one-out-cross validation. Since the method is directly based on universal coding, I’m tempted to call it “MDL”, but the fact that nobody in the MDL community has thought about our idea before, makes me hesitate. When I talked about it to the famous Bayesian Jim Berger, I said “it’s MDL Jim, but not as we know it”.

Posts

Online as the new adjective

NIPS workshops extended to 3 days

NIPS workshp: Learning Problem Design

NIPS workshops are out.

It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)