It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)

I have recently completed a 500+ page book on MDL, the first comprehensive overview of the field (yes, this is a sneak advertisement 🙂).
Chapter 17 compares MDL to a menagerie of other methods and paradigms for learning and statistics. By far the most time (20 pages) is spent on the relation between MDL and Bayes. My two main points here are:

In sharp contrast to Bayes, MDL is by definition based on designing universal codes for the data relative to some given (parametric or nonparametric) probabilistic model M. By some theorems due to Andrew Barron, MDL inference must therefore be statistically consistent, and it is immune to Bayesian inconsistency results such as those by Diaconis, Freedman and Barron (I explain what I mean by “inconsistency” further below). Hence, MDL must be different from Bayes!
In contrast to what has sometimes been claimed, practical MDL algorithms do have a subjective component (which in many, but not all cases, may be implemented by something similar to a Bayesian prior; the interpretation is different though; it is more similar to what has been called a “luckiness function” in the computational learning theory literature).

Both points are explained at length in the book (see esp. page 544). Here I’ll merely say a bit more about the first.

MDL is always based on designing a universal code L relative to some given model M. Informally, this is a code such that whenever some distribution P in M can be used to compress some data set well, L will compress this data set well too (I’ll skip the formal definition here). One method (but by no means the only method) for designing a universal code relative to model M is to take some prior W on M and use the corresponding Shannon-Fano code, i.e. the code that encodes data z with length

L(z) = – log P_bayes(z),

where P_bayes(.) = \int P(.) d W(P) is the Bayesian marginal distribution for M relative to prior W. If M is parametric, then with just about any ‘smooth’ prior, the Bayesian code with lengths L(z) = – log P_bayes(z) leads to a reasonable universal code. But if M is nonparametric (infinite dimensional, such as in Gaussian process regression, or histogram density estimation with an arbitrary number of components) then many priors which are perfectly fine according to Bayesian theory are ruled out by MDL theory. The reason is that for some P in M, the Bayesian codes based on such priors do not compress data sampled from P at all, even if the amount of data tends to infinity. One can formally prove that such Bayesian codes are not “universal” according to the standard definition of universality.
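To make the notion of universality concrete, here is a minimal sketch for the simplest parametric case: the Bernoulli model with a Beta prior. The choice of model, the function names and the use of bits are illustrative assumptions, not something taken from the book. The sketch computes the Bayesian code length L(z) = – log P_bayes(z) for a binary sequence z with k ones out of n outcomes, and compares it with the code length obtained from the best-fitting P in M chosen with hindsight. The difference is the coding overhead.

```python
import math


def bayes_code_length(k: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """Bits needed to encode one binary sequence with k ones out of n,
    using the Bayesian marginal of the Bernoulli model under a Beta(a, b) prior."""
    # log P_bayes(z) = log B(a + k, b + n - k) - log B(a, b)
    log_p = (
        math.lgamma(a + k) + math.lgamma(b + n - k) - math.lgamma(a + b + n)
        + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    )
    return -log_p / math.log(2)


def best_in_model_code_length(k: int, n: int) -> float:
    """Bits used by the maximum-likelihood Bernoulli distribution (hindsight choice)."""
    bits = 0.0
    for count in (k, n - k):
        if count > 0:  # treat 0 * log 0 as 0
            bits -= count * math.log2(count / n)
    return bits


def regret(k: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """Coding overhead of the Bayesian code relative to the best P in M."""
    return bayes_code_length(k, n, a, b) - best_in_model_code_length(k, n)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, round(regret(n // 2, n), 3))
```

With the uniform prior (a = b = 1) the overhead never exceeds log2(n+1) bits on any sequence, so the overhead per outcome vanishes as n grows. That is the behaviour a universal code must have, and it is exactly what fails for the problematic nonparametric priors.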

Now there exist two theorems by Andrew Barron (from 1991 and 1998, respectively) that directly connect data compression with frequentist statistical consistency. In essence, they imply that estimation based on universal codes must always be statistically consistent (the theorems also directly connect the convergence rates to the amount of compression obtained). For Bayesian inference, there exist various inconsistency results such as those by Diaconis and Freedman (1986) and Barron (1998). These say that, for some nonparametric models M, and with some priors on M, Bayesian inference can be inconsistent, in the sense that for some P in M, if data are i.i.d. sampled from P then even with an infinite amount of data, the posterior puts all its mass on distributions P’ in M that are substantially different from the “true” P. By Barron’s theorems, something like this can never happen for MDL; Diaconis and Freedman use priors which are not allowed according to MDL theory. In fact, MDL-based reasoning can also motivate certain prior choices in nonparametric contexts. For example, if one has little prior knowledge, why would one adopt an RBF kernel in Gaussian process regression? Answer: because the corresponding code has excellent universal coding properties, as shown by Kakade, Seeger and Foster (NIPS 2005): it has only logarithmic coding overhead if the underlying data generating process satisfies some smoothness properties; many other kernels have polynomial overhead. Thus, Gaussian processes combined with RBF kernels lead to substantial compression of the data, and therefore, by Barron’s theorem, predictions based on such Gaussian processes converge fast to the optimal predictions that one could only make if one had access to the unknown imagined “true” distribution.

In general, it is often thought that different priors on M lead to codes that compress data better for some P in M and worse for other P in M. But in nonparametric contexts it is not like that: there exist priors with “universally good” and “universally bad” coding properties.

This is not to say that all’s well for MDL in terms of consistency: as John and I showed in a paper that appeared earlier this year (but is really much older), if the true distribution P is not contained in the model class M under consideration, but M does contain a good approximation P’, then both MDL and Bayes may become statistically inconsistent in the sense that they don’t necessarily converge to P’ or any other good approximation of P.

Thus: if model M parametric and P in M, then MDL and Bayes consistent. If model M nonparametric and P in M, then MDL consistent, Bayes not necessarily so. If P not in M, then both Bayes and MDL may be inconsistent.

This leaves one more very important case: what if P is in the closure of M, but not in M itself? For example, M is the set of all Gaussian mixtures with arbitrarily many components, and P is not a Gaussian mixture, but can be arbitrarily well approximated (in the sense of KL divergence) by a sequence of Gaussian mixtures with ever more components? In this case, Bayes will be consistent but it can be too slow, i.e. it needs more data before the posterior converges than some other methods (like leave-one-out cross-validation combined with ML estimation). In our forthcoming NIPS 2007 paper, Steven de Rooij, Tim van Erven and I provide a universal-coding based procedure which converges faster than Bayes in those cases, but does not suffer from the disadvantages of leave-one-out cross-validation. Since the method is directly based on universal coding, I’m tempted to call it “MDL”, but the fact that nobody in the MDL community has thought about our idea before makes me hesitate. When I talked about it to the famous Bayesian Jim Berger, I said “it’s MDL Jim, but not as we know it”.

24 Replies to “It’s MDL Jim, but not as we know it…(on Bayes, MDL and consistency)”

Abram Demski says:

9/18/2007 at 3:25 pm

I’m an undergraduate interested in machine learning, and this is the first thing I’ve come upon that has had any argument against the Bayesian approach; many places mention the controversy, but mostly what I’ve seen is pro-Bayesian, and so I don’t really have any idea why someone would disagree with Bayesianism. Where can I get a good overview of these issues?
Ricardo Silva says:

9/19/2007 at 8:33 am

Thanks for providing some sample chapters, Peter. I am sure they have a lot of interesting ideas that I would benefit from knowing more about.

However (I’m probably missing something), I don’t quite see why the argument that MDL is different from Bayes is that relevant. As stated, it just shows that MDL can’t do some (undesirable) things that Bayes can when misused. This is a bit like saying a car is different from a vehicle because some vehicles can crash from flying too high.
grunwald says:

9/20/2007 at 8:03 am

There are various criticisms, and these can be found at different places; I have never seen a single article listing all criticisms and discussing whether they make sense or not (but maybe somebody else knows…)

One central criticism which makes sense to me (and is even acknowledged by many Bayesians) is the following: Bayesian decision theory is based on the idea that you can always express your belief about the truth of a proposition by a *single number*. But assuming this can lead to paradoxes, such as for example the Ellsberg paradox. This and some other criticisms are briefly but nicely discussed in Joe Halpern’s book ‘Reasoning about Uncertainty’. Joe is remarkably unpartisan and also lists a number of alternatives to Bayesian reasoning.

Bayesians often say that Bayesian decision making is the only ‘coherent’ way of making decisions, and they cite theorems by Cox, De Finetti and Savage that purportedly prove this. In reality it’s much more complicated than that. For example, De Finetti’s theorem really only proves that you should work with sets of probabilities (possibly without a prior specified on this set), not with a single probability distribution (and hence, not with a single number). There is in fact a whole research field called “imprecise probabilities” which extends Bayesian reasoning to sets of probabilities without priors on them; see Peter Walley’s book “Statistical Reasoning With Imprecise Probabilities.” So-called ‘robust Bayesian statistics’ tries to work with the imprecise probability framework while retaining as much of Bayes as possible.

A related criticism is that, within Bayesian theory, you cannot express the statement ‘I don’t know whether this hypothesis is true or not’ (logical Bayesians like Jaynes say you can, but there is huge controversy within the Bayesian community on whether Jaynes’ view makes sense or not). This idea, that as a Bayesian you may be too sure of yourself, has been formalized in a once-famous paper by Phil Dawid from 1982 called ‘the well-calibrated Bayesian’.
grunwald says:

9/20/2007 at 8:21 am

Hi Ricardo,

In general, Bayesian inference (as commonly understood) and MDL inference (as commonly understood) overlap, but there do exist Bayesian inference algorithms that are not MDL (as stated in my post) and also MDL inference algorithms that are not Bayesian (such as the normalized maximum likelihood distribution – see my book). So the car/vehicle metaphor is not entirely appropriate. But even so, your question makes sense of course.

The reason I’m making the argument has to do with some personal frustration – over the years I have often heard Bayesians claim that MDL has nothing new to offer. A good example is David MacKay’s book, which in fact has many great things in it; I just disagree with the claim ‘MDL has no apparent advantages over the direct probabilistic approach (i.e., Bayes) except as a pedagogical tool’ (page 353).

Another well-known (this time non-Bayesian) example is the book ‘Model Selection and Multimodel Inference’ by Burnham and Anderson, where they write that ‘Rissanen’s approach is equivalent to BIC’ (which is entirely wrong).
Aleks says:

9/20/2007 at 9:27 am

The universal modeling focus and the skill in dealing with discrete data that originate in information theory are definitely major contributions of MDL. But why is it so important to focus on the differences instead of focusing on the synergies? Yes, some Bayesians have more power and this manifests itself as arrogance. But these character flaws shouldn’t discourage others from pursuing results rather than encouraging them to lock themselves into private gardens.
grunwald says:

9/21/2007 at 3:11 am

In fact I agree that one shouldn’t lock oneself in one’s private garden (the last section of my comparison is called ‘A common future?’ – I do believe that one day these two approaches may merge).

Still I think it’s important to point out what the contributions of MDL are, and how they are different from Bayesian theory (as it currently stands). Perhaps it would have been better to phrase this more positively, as ‘how MDL can extend the Bayesian view’. But in the end I don’t think it matters that much, at least for me – although I like to think of the two approaches as distinct, I’ve always happily collaborated with Bayesian statisticians (and had a great time at Bayesian-oriented conferences).
Radford Neal says:

9/22/2007 at 5:28 pm

Maybe I’m just too Bayesian, but I’m rather puzzled by the argument in this post. It seems to be saying that, by definition, MDL uses a universal prior, and consequently, unlike Bayesian modeling, it is guaranteed to be consistent. But all a Bayesian would have to do to handle this is say they use their “super-Bayesian” method, in which, by definition, only priors leading to consistent inference are allowed. Presto! The super-Bayesian method is just as good as MDL. It all seems rather pointless.

I’m similarly puzzled as to why anyone would ever have thought that MDL doesn’t have a subjective component. There are, of course, many universal codes. Which one you choose can only be a subjective choice. The only difference is that the MDL people have thrown away the Bayesian way of making such a choice, on the basis of prior beliefs. And I don’t see what other basis there could be.
Teemu Roos says:

9/24/2007 at 12:02 am

Radford: If this leads to a “super-Bayesian” method which avoids some problems, it can’t be completely pointless, right?

The basis for choosing a universal model out of many (including many Bayesian mixtures with suitable priors) can be, for instance, the minimax regret principle: take the model that minimizes the maximum regret, i.e. the excess (log-)loss over the best hypothesis.

Perhaps it helps to make an appeal to a Bayesian authority. The minimax regret principle was discussed by Savage in his 1954 book as follows (pp. 168–169): (Note that he uses the term ‘loss’ to mean the regret!)

A categorical defense of the minimax rule seems definitely out of the question. […] On the other hand, there are practical circumstances in which one might well be willing to accept the rule–even one who, like myself, holds a personalistic view of probability. It is hard to state the circumstances precisely, indeed they seem vague almost of necessity. But, roughly, the rule tends to seem acceptable when L* is quite small compared with the values of L(f;i) for some acts f that merit serious consideration and some values of i that do not in common sense seem nearly incredible. Suppose, for example, that I were faced with such a decision problem, in which it may be assumed for simplicity that there is only one minimax act f, and consider how I might defend the choice of that act to someone who proposed another one to me. He might, for example, tell me that he knows from long experience, or by a tip from his broker, that some act g is preferable to f. “Well,” I might say, “I have all the respect in the world for you and your sources of information, but you can see for yourself–for it is objectively so–that the most I can lose if I adopt f is L*.” He will not be able to say the same for g, and in many actual situations the greatest possible loss under g may be many times as great as L* and such of a magnitude as to make a serious difference to me should it occur, which may well end the argument so far as I am concerned.

It is of interest, however, to imagine that my challenger presses me more closely, reminding me that I am a believer in personal probability, and that in fact I myself attach an expected loss L to g that is several times smaller than L*. Even then, depending on the circumstances, I might answer frankly that in practice the theory of personal probability is supposed to be an idealization of one’s own standards of behavior; that the idealization is often imperfect in such a way that an aura of vagueness is attached to many judgements of personal probability; that indeed in the present situation I do not feel I know my own mind well enough to act definitely on the idea that the expected loss for g really is L; but that I do, of course, feel perfectly confident that f cannot result in a loss greater than L*, a prospect that in the case at hand does not distress me much.

Under log-loss the minimax regret model is the normalized maximum likelihood (NML) code of Shtarkov, which is often considered the ‘optimal’ universal model in MDL. To summarize, the point of the result that Peter is talking about in Bayesian theory is similar to the point of the theory of objective Bayesian methods (Berger, Bernardo, …). Of course, there are other points too that are not related to Bayes.
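As a rough illustration of the minimax regret idea (a sketch only: the Bernoulli model and the function names are assumptions chosen for simplicity), the snippet below computes the regret of the NML code for binary sequences of length n. The NML regret is the log of the Shtarkov sum and is the same for every sequence, whereas a Bayesian code with a uniform prior has a worst-case regret of log2(n+1) bits, reached on the all-zeros or all-ones sequence.

```python
import math


def log_max_likelihood(k: int, n: int) -> float:
    """Natural log of max over theta of theta^k (1 - theta)^(n - k)."""
    total = 0.0
    for count in (k, n - k):
        if count > 0:  # treat 0 * log 0 as 0
            total += count * math.log(count / n)
    return total


def nml_regret_bits(n: int) -> float:
    """Regret of the Bernoulli NML code: log2 of the Shtarkov sum over all
    binary sequences of length n (grouped by their number of ones k)."""
    shtarkov_sum = 0.0
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        shtarkov_sum += math.exp(log_binom + log_max_likelihood(k, n))
    return math.log2(shtarkov_sum)


def uniform_bayes_max_regret_bits(n: int) -> float:
    """Worst-case regret of the Bayesian code with a uniform prior on theta."""
    return math.log2(n + 1)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(nml_regret_bits(n), 3), round(uniform_bayes_max_regret_bits(n), 3))
```

For every n the NML value is no larger than the Bayesian worst case, which is what minimax optimality means here.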
David Rohde says:

9/24/2007 at 9:44 pm

I am very partisan in supporting Bayes; however, I like to collect anti-Bayesian arguments. In my view there is surprisingly little articulate criticism of Bayes out there. If anybody knows any more references please pass them on. If you want a good overview of the current state of play there is a good article (Lindley 2000). Lindley is a very partisan Bayesian, but there is a lively discussion by top statisticians from very varied backgrounds.

http://www.physics.uq.edu.au/people/djr/

Note that a lot of historical criticism of Bayes is actually aimed at Bayes’ postulate, i.e. the use of flat priors. Nearly all modern Bayesians reject Bayes’ postulate (the most prominent exception is Edwin Jaynes).
David Rohde says:

9/24/2007 at 10:11 pm

There is an argument that perhaps can be regarded as both glib and sophisticated at the same time…

Suppose I have a personal probability distribution P(X_1, …, X_N, …, X_{N+K}) which can be generated by a likelihood and a prior, and another person has a personal probability distribution P(X_1, …, X_N, …, X_{N+K}) that can be generated by the same likelihood function with a different prior.

If I am right in picking up the gist of the Dawid calibration paper and the Diaconis and Freedman non-parametric Bayes paper, the result has the following meaning.

My friend and I may not be brought into approximate consensus when we compare our conditional probabilities:

P(X_{N+1}, …, X_{N+K}|X_1=x_1, …, X_N=x_N).

We may even fail to come into agreement as N->\infty.

To me this is an important and interesting result in the calculus of uncertain reasoning. However, I cannot see how this is an anti-Bayesian argument. Rejecting Bayesian theory because it does not always result in consensus seems to follow the same logic as rejecting the conservation of energy because it forbids perpetual motion machines.
David Rohde says:

9/24/2007 at 10:15 pm

I forgot to say that a list of anti-Bayes papers is on my home page http://www.physics.uq.edu.au/people/djr/
grunwald says:

9/25/2007 at 8:12 am

Well… the point is that there is nothing in Bayesian theory which tells one under what conditions on the prior, one can achieve consistency (or equivalently, in David’s words, under what conditions on the prior two Bayesians will eventually agree).

If one predicts outcomes sequentially based on a distribution P corresponding to a universal code, one does have a guarantee of consistency (and, with some caveats, the reverse holds as well: if the code corresponding to P is not universal, then one can get inconsistency).

Since I think consistency is important, for me this shows that the concept of universal coding is of fundamental importance in statistics. That’s the point I wanted to make.

(note that universal codes are sometimes, but not nearly always, in the form of a Bayesian marginal distribution. Whether Bayesian universal codes have a special status as being ‘best’ in one sense or another, is a different question to which I don’t really know the answer.)
Radford Neal says:

9/25/2007 at 8:54 am

You’re right. As far as I’m aware, there is no necessary and sufficient condition for consistency of a Bayesian prior that can in all cases be easily checked. Are you claiming that there IS a simple, easily checked condition that can tell you in all cases whether or not a code is universal? If you’re not claiming that, I don’t see what the advantage of the MDL framework is supposed to be in this regard.

In any case, there are practical reasons why one might sometimes use a prior that isn’t consistent, even though such a prior would seldom be a perfect expression of one’s prior beliefs. One almost never can perfectly express prior beliefs; there are always compromises with the amount of intellectual and computational effort that one’s willing to spend. I doubt that MDL will magically avoid such compromises.
Tariq Rashid says:

9/25/2007 at 6:46 pm

Consider a set of theorems. The representation language of these theorems have “lengths” and the MDL states that those with a shorter length are better. Now, if we use a different representation language those lengths will change, resulting in a different ranking of “goodness”. So the MDL is invalid.

You can even think of this using human languages such as French, Spanish – each gives different lengths to the same theorems.
Balaji Krishnapuram says:

9/25/2007 at 7:17 pm

This is a really interesting question: are there easily checked conditions that can tell us whether or not a code is universal (or posed alternatively: how could one check if a prior leads to consistency)? If not, Radford has a good point about the theory really not offering (meaningful) practical benefits.

I’ve not followed the recent developments in MDL theory much, so I would appreciate pointers to such papers if they exist.
Teemu Roos says:

9/26/2007 at 11:14 pm

I don’t think Peter or anyone else is suggesting to “reject Bayesian theory”. We love it. Seriously.

As a not-completely-serious note, I have to say I’m having trouble with the analogies used in this discussion. In the end, I did understand the MDL=car – Bayes=vehicle analogy, but the ‘perpetual motion machine’=consensus – ‘conservation of energy’=Bayes analogy goes simply over my head. Perhaps the idea is to say that consensus (=consistency) is impossible to achieve, and that therefore it is unfair to criticize Bayes on that basis? (Hmm, now this sounds like a plausible interpretation. Did I get it right?)

But according to Peter

if model M parametric and P in M, then MDL and Bayes consistent. If model M nonparametric and P in M, then MDL consistent, Bayes not necessarily so. If P not in M, then both Bayes and MDL may be inconsistent.

I’m not saying this is an “anti-Bayesian argument” (causes bad feelings), but isn’t this at least a pro-MDL argument?

(Radford’s and Balaji’s question whether universality can be somehow easily ascertained in MDL is good, and I too would be interested to hear what the answer is.)
David Rohde says:

9/28/2007 at 5:55 pm

I should say that of the very little of Peter’s book that I have read, there is a tiny fraction that I feel knowledgeable about, i.e. the Bayes/subjectivism part. This is definitely well written, and it is not the primary topic of Peter’s book. So it looks like a very impressive piece of scholarship. I am interested to read more of it at some future point in time …

And Teemu, thanks for pointing out my analogy was unclear – I think you did get the gist of it, I will try to expand…

I am trying to think in terms of the operational subjective position developed by de Finetti, which has a textbook presentation by Frank Lad. In this theory M does not exist. It can be thought of as just a convenient mathematical device for putting an exchangeable distribution over X_1, …, X_{N+K}.

According to this theory what is important is to faithfully articulate your beliefs about X_1, …, X_{N+K} as a probability distribution. Discussions involving X_{N+K+1} are really not relevant, interesting or possibly even meaningful (you may not be able to specify what X_{N+K+1} means in terms of what observation it refers to).

Of course the sheer scale of specifying P(X_1, …, X_{N+K}) and the probabilistic inarticulateness that we all share make the practical specification of this nasty. We often (always) take shortcuts, i.e. pretend M exists and argue that for any normal prior, given enough data, we will converge to this. Such an approach is open to criticism, but this is not really Bayesian theory though…

In my view in the Bayesian approach the discussion should be restricted to concerning a finite number of observations, so I don’t recognise the concept of a consistent prior.

The operational subjective issue is raised in both Peter’s book and the Diaconis and Freedman paper. In particular D&F talk about the issue being important to subjectivists in terms of inter-subjectivity (i.e. consensus). I personally think lack of consensus in complex situations is almost inevitable. A method that doesn’t recognise this I would regard as suspect.

In Jaynes’ book, Chapter 5: http://omega.albany.edu:8008/JaynesBook.html

He talks about far simpler situations where two different people will be increasingly polarised into opposite views as more data is collected.

I should add that some distinguished Bayesians (Radford Neal for one) do not seem to adopt the above approach. I can’t quote it directly, but in his Bayesian Neural Networks book Neal makes some statement about hierarchical models saying you can put probabilities on the observables, you can use priors and likelihoods or you can use hierarchical models, but hierarchical models are the easiest to understand (not a direct quote, but I think a fair representation of the statement). The de Finetti approach tends to the other extreme: probabilities over observables are primary and other approaches are a possible convenience. I personally do not know how to elicit a prior over the parameters of a neural network so I find it hard to fully agree here. The fact that Radford Neal cross-validates is recognising that the priors he uses are to some extent convenience priors – otherwise you can at least in principle calculate expected utility directly.
Radford Neal says:

9/28/2007 at 8:44 pm

A couple of clarifications regarding my views… I agree in theory with de Finetti that the distribution for observables is primary, but I think that the device of introducing models, parameters, hyper-parameters, etc. is very beneficial from a practical standpoint in most situations. I don’t agree with de Finetti’s fondness for “probabilities” that are not countably additive, nor with the idea that looking at the infinite data limit is pointless. I’m not sure where the comment about my using cross-validation comes from. I’m sure I’ve used it sometime or other (I’m not a fanatically pure Bayesian), but I don’t use it as part of my standard operating procedure (as a way, for instance, of setting a regularization parameter) in the way that some people do. I do often reserve some data for a final “sanity check”, which is partly a matter of not trusting the prior formulation completely, but also a check that the program doesn’t have bugs, etc. However, I do agree that in practice almost all priors are “convenience” priors in the sense that we don’t have the intellectual and computational resources to use the absolutely perfect prior. The degree to which this is of any real importance varies a lot from problem to problem.

One terminological point: I don’t use “priors and likelihoods”. I use models, and priors for model parameters. The term “likelihood” refers to a function of the model parameters that is proportional to the probability of the observed data given these parameters. It is not a synonym for “model”.
David Rohde says:

9/29/2007 at 1:15 am

Thanks a lot for the clarification Radford!

Sorry: not only should I have said model rather than likelihood, I also meant hold-out, not cross-validation…

The difficulty I see with taking an infinite data limit is that if I specify my probability over exchangeably extendable observables P(X_1, …, X_N) then I haven’t specified how the discussion should be exchangeably extended to include X_{N+1}. It is certainly possible to remain coherent and use a different model/prior combination when I include X_{N+1} in the discussion. The model/prior was used to generate X_1, …, X_N but the infinite data conversation extends the model well beyond that. It is also not always possible to find an observation that X_{N+1} should correspond to.

The result is of course not pointless. Because we are probabilistically inarticulate we normally end up using a specification of the model/prior rather than over observables, and it warns us of the danger in doing that. Prior specification need not get easier if more data is available and it can get worse. It doesn’t damage the Bayesian argument, but it warns us not to be cavalier.

I dream one day of being knowledgeable enough to have an opinion on finite/countable additivity. It is a long way off…
Justin says:

10/3/2007 at 2:11 pm

I am a computer science undergrad, interested in ML. I am wondering if this book is a good choice if I am, after all, just an undergrad with basic knowledge of probability theory and theoretical computer science. How good is this for beginners?
Balaji Krishnapuram says:

10/4/2007 at 6:19 am

I have not read this book yet, so I can’t comment on it.

For an undergrad interested in a first book on ML I would recommend one or both of the following two books. Both are excellent introductions to the topic, suitable for someone new to the field, yet very relevant to the current state-of-the-art

Chris Bishop, Pattern Recognition and Machine Learning, Springer
David MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press [Full TB is freely available online at David’s web page, if you want to first look through it before you decide to buy the book]

Several other good textbooks are available, and I used them when I was a grad student. However, in my opinion they have either become somewhat out-of-date, since they were written a while back, or else some of these books don’t quite cover much of the ground necessary to acquaint a new reader with much of ML [e.g. Bernardo & Smith doesn’t cover many ML topics necessary to be able to read new papers from NIPS/ICML]. Some of them cover slightly different ground (e.g. Russell & Norvig), or take a slightly different philosophical/pedagogical position (e.g. Hastie, Tibshirani & Friedman). However, they are all still excellent, if slightly dated, introductions to the field, suitable for someone who has first read the above books, and wants to go deeper into other topics that are not covered in detail. In particular, I liked these TB:

[my personal favorite] Jose Bernardo and Adrian Smith, Bayesian Theory, Wiley Series in Prob & Stat
Tom Mitchell, Machine Learning, McGraw Hill
Stuart Russell and Peter Norvig, Artificial Intelligence, A Modern Approach, Prentice Hall
Larry Wasserman, All of Statistics: A Concise Course in Statistical Inference, Springer Texts in Stats
Trevor Hastie, Rob Tibshirani, Jerry Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Stats
grunwald says:

10/5/2007 at 9:03 am

I was too busy to get back for a while, but here I am again answering both Radford and Balaji.

In general, I have to admit that (at least as far as I know) there is no general easily checkable condition which tells me whether a code is universal or not.

Still, in many cases, universality of a code is much easier to check than checking whether the statistical procedure corresponding to the code is consistent by more traditional means (i.e. by more traditional methods for proving consistency). This holds especially if we want to prove more than mere consistency: we want to determine rates of convergence, i.e. how close you get to the truth as a function of sample size. A prime example is our own NIPS 2007 paper, called Catching Up Faster in Bayesian Model Selection and Model Averaging. Here we define a predictive distribution based on a universal code that in general provably gives you faster rates of convergence than Bayesian codes based on standard hierarchical priors that have traditionally been used for such problems. The universal code is Bayesian as well, in the sense that it is based on a prior distribution – but this prior is ridiculous if you try to interpret it as representing your beliefs about the situation – it tells you that at fixed time points, the model changes, even though you know that in reality it does not. Yet, the prior has good universal coding properties (which is very easy to show), and therefore good consistency and rate-of-convergence properties.

Another paper where rate-of-convergence results are shown based on universal coding is

M. Seeger, S. Kakade, D. Foster
Information Consistency of Nonparametric Gaussian Process Methods

available on Matthias Seeger’s homepage.

What they’re doing is completely different from, and I think a lot easier than, what has been tried before to prove Gaussian process inference consistent.
Peter Grunwald says:

10/8/2007 at 3:14 pm

Dear David:

Just to clarify: I do like many aspects of De Finetti’s approach a lot, and I love the idea that in the end, only observables matter. But to me (as, I guess, to the researchers in the area calling itself ‘imprecise probability’, i.e. Peter Walley and others), his Dutch book/coherence arguments only show that one should always base one’s decisions on a *set* of distributions over observables (e.g. upper and lower probabilities) rather than a single distribution (which may or may not be arrived at by postulating ‘parameters’ and integrating them out). Working with sets of distributions, one remains coherent, and several problems of the ‘one distribution only’ approach disappear. One such problem is the Ellsberg paradox. Another, which to me personally is really a problem with a purely Bayesian approach, is that no matter how badly you predicted the past, your posterior probability that you will predict the future well is always 1 (see Dawid’s 1982 calibration paper). Thus “you cannot learn from the data that, with your current model assumptions, you don’t know anything about the data”. (I realize that some great minds don’t view this as a problem at all.)

Apart from all this, there’s something in your argument which I don’t quite follow: De Finetti’s exchangeability theorem states that the only distributions on *infinite* sequences of random variables with finite range, under which outcomes are exchangeable on each initial finite segment, are Bayesian mixtures of multinomial distributions. Here the word *infinite* is essential: on finite sequences, there are also other distributions that make outcomes exchangeable. Therefore, it would seem to me that even De Finetti was not worried too much by large sample limit arguments (as long as they don’t refer to unobservable ‘parameters’ or ‘true distributions’)? (But I don’t know about Frank Lad; maybe he extends or modifies De Finetti’s ideas?)
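For reference, the representation in question can be written out as follows. This is only a sketch in notation of my own choosing: m is the number of possible outcomes, \Delta_m is the probability simplex over those outcomes, n_j counts how often outcome j occurs among x_1, …, x_n, and W is the mixing (prior) measure.

```latex
P(X_1 = x_1, \ldots, X_n = x_n)
  = \int_{\Delta_m} \prod_{j=1}^{m} \theta_j^{\,n_j} \, dW(\theta),
\qquad
n_j = \#\{\, i \le n : x_i = j \,\}
```

The theorem says that such a W exists whenever the sequence is infinite and exchangeable on every initial segment; for a sequence of fixed finite length, no such W need exist.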
David Rohde says:

11/20/2007 at 6:20 pm

I really apologise for taking so long to get back to this. It was not through lack of interest: I was very engaged in the discussion and realised I would think of nothing else if I continued, but I had to start a new job and work on finishing my thesis (which is still a work in progress), and so I put it off.

I also like the idea of imprecise probabilities, but they do often seem to make already complicated arguments more complicated still. I hope there is a way to proceed that is both precise and less complicated than the way it is done now. I am not sure if imprecise probabilities are useful here or not, but they may well be… I do think the use of partially specified probabilities is critical to statistical practice, but whether it should be through imprecise probabilities or through methods based on specifying some but not all expectations (such as Michael Goldstein’s Bayes linear), I am not sure…

Whether the Dawid calibration result raises any broad concerns for statistics has been debated quite a lot. There is a long argument in Frank Lad’s books, with many citations, arguing that it does not. Dawid himself seems quite measured in acknowledging the difficulty of defining ‘calibration’ in a meaningful way: http://www.jstor.org/view/00905364/di983928/98p00447/0 . Direct quotation: “an entirely subjective approach to the relationship between probabilities and empirical frequencies, based on de Finetti’s exchangeability concept is the only logically satisfying one.”

I am not sure of de Finetti’s position on large sample arguments. He did prove the law of large numbers in the context of the representation theorem. However, the “finite subset of an infinite exchangeable sequence” criterion doesn’t need to be thought of in the large sample context. Frank Lad uses the term exchangeably extendable, which is I think clearer as it doesn’t refer to an infinity. The key issue is that the distribution P(X_1, …, X_N) can be exchangeably extended to a distribution P(X_1, …, X_{N+K}). In other words, there exists an exchangeable distribution P(X_1, …, X_{N+K}) that can be marginalised to P(X_1, …, X_N). This constraint has other implications, such as this very appealing one:

P(X_2=x_1|X_1=x_1) >= P(X_2=x_1)

i.e. observing X_1=x_1 can only increase your probability that X_2=x_1, or leave it unchanged.
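The inequality can be checked numerically for the binary case. The sketch below assumes the sequence is a mixture of i.i.d. Bernoulli(theta) distributions, which is the form the de Finetti representation gives; the three theta values and their weights are made up purely for illustration, and exact fractions avoid rounding issues.

```python
from fractions import Fraction

# Hypothetical mixing distribution over the Bernoulli parameter theta:
# (theta, weight) pairs chosen only for illustration.
MIXTURE = [
    (Fraction(1, 10), Fraction(1, 4)),
    (Fraction(1, 2), Fraction(1, 2)),
    (Fraction(9, 10), Fraction(1, 4)),
]


def prob_first_is_one(mixture):
    """P(X_1 = 1) = E[theta] under the mixing distribution."""
    return sum(weight * theta for theta, weight in mixture)


def prob_both_one(mixture):
    """P(X_1 = 1, X_2 = 1) = E[theta^2] under the mixing distribution."""
    return sum(weight * theta * theta for theta, weight in mixture)


def prob_second_given_first(mixture):
    """P(X_2 = 1 | X_1 = 1) = E[theta^2] / E[theta]."""
    return prob_both_one(mixture) / prob_first_is_one(mixture)


# By exchangeability, P(X_2 = 1) equals P(X_1 = 1).
marginal = prob_first_is_one(MIXTURE)
conditional = prob_second_given_first(MIXTURE)
print(marginal, conditional)
```

The gap between the two numbers is Var(theta)/E[theta], which is never negative, so equality holds only when the mixing distribution puts all its weight on a single theta.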

For many problems you explicitly want to use exchangeably extendable probabilities. One use of an exchangeable but not extendable probability occurs when you are card counting in blackjack. Here both of the above criteria are violated, i.e. you cannot exchangeably extend the sequence beyond the 52 cards in the pack, and when you see a high card it makes you believe it is less likely that the next card will be high. [i.e. the hypergeometric distribution is the classic example of an exchangeable but not extendable distribution]
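The card example is easy to work through with exact arithmetic. The sketch below assumes a single 52-card pack in which 20 cards (tens, jacks, queens, kings and aces) count as "high"; that count and the helper name prob_sequence are illustrative choices, not part of any particular counting system.

```python
from fractions import Fraction

DECK_SIZE = 52
HIGH_CARDS = 20  # assumption: tens, jacks, queens, kings and aces are "high"


def prob_sequence(pattern, high=HIGH_CARDS, total=DECK_SIZE):
    """Probability of drawing the given high/low pattern without replacement.

    pattern is a string such as "HL": a high card first, then a low card.
    """
    low = total - high
    prob = Fraction(1)
    for symbol in pattern:
        if symbol == "H":
            prob *= Fraction(high, total)
            high -= 1
        else:
            prob *= Fraction(low, total)
            low -= 1
        total -= 1
    return prob


# Exchangeable: the order of the draws does not change the probability.
assert prob_sequence("HL") == prob_sequence("LH")

# Marginal probability that the second card is high.
p_second_high = prob_sequence("HH") + prob_sequence("LH")

# Probability that the second card is high given that the first was high.
p_second_high_given_first_high = prob_sequence("HH") / prob_sequence("H")

print(p_second_high, p_second_high_given_first_high)
```

Here the conditional probability (19/51) is smaller than the marginal one (20/52), so seeing a high card lowers the probability of another, which is the opposite of what an exchangeably extendable distribution allows.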

I really don’t think it is necessary or a good idea to think of infinite sample limits in order to understand this constraint on the de Finetti representation.

The practice of Bayesian statistics is hard at the moment. In applications, prior beliefs are specified in quite an awkward fashion. It is plain hard to articulate what you believe. In some parts of a problem an imprecise sweeping statement may have strong consequences; in other parts you might really need to agonise about what you really believe. We really don’t know how to do this very well at the moment, although there is a lot of progress… Current Bayesian practice is therefore a compromise, with the hope that mis-specification/overspecification won’t matter too much. Large sample limit arguments are a useful reality check showing that this is somewhat wishful thinking. So these arguments do have important implications for current Bayesian practice.

I personally think that the way out of this, however, is to develop ways to articulate probability models without over/mis-specifying. If you are committed to the specification of the probability model, then the limit argument is irrelevant.

Anyway, it sounds like MDL offers a different way to specify prior beliefs, so this is interesting in itself. When I get the chance I will find your book and work out what a ‘luckiness function’ is.

You also said that MDL is broader than probability theory. This implies that it is possible to be both incoherent and produce inadmissible decisions. This is not necessarily an enormous problem as I might believe that an inadmissible decision might have high expected utility – and if I have difficulty finding the optimal decision it might therefore be ok. On the other hand if MDL differs from probability theory then explaining why does need some attention… I guess it is in your book! Of course objective Bayes isn’t fully compatible with probability theory.

Anyway, thanks again for the discussion; I have really enjoyed it. And again, I am sorry for taking 2 months to get back to it!
