MaxEnt contradicts Bayes Rule? – Machine Learning (Theory)

A few weeks ago I read this. David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it):

In short I was thinking that “Bayesian belief updating” and “maximum entropy” were two orthogonal principles. But it appears that they are not, and that they can even be in conflict!
Example (from Kass 1996): consider a six-sided die, with prior knowledge E[X]=3.5.
Maximum entropy leads to P(X)= (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
Now consider a new piece of evidence A = “X is an odd number”.
Bayesian posterior P(X|A) ∝ P(A|X) P(X), which normalizes to (1/3, 0, 1/3, 0, 1/3, 0).
But MaxEnt with the constraints E[X]=3.5 and E[Indicator function of A]=1 leads to (.22, 0, .32, 0, .47, 0) !! (note that E[Indicator function of A]=P(A))
Indeed, for MaxEnt, because there is no more “6”, big numbers must be more probable to ensure an average of 3.5. For Bayesian updating, P(X|A) doesn’t have to have a 3.5 expectation. P(X) and P(X|A) are different distributions.
Conclusion? MaxEnt and Bayesian updating are two different principles leading to different belief distributions. Am I right?
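
The three distributions in the example can be reproduced numerically. The following is a minimal sketch, not anything from the original discussion: the helper name `maxent_with_mean` and the bisection on the Lagrange multiplier are my own choices, relying only on the standard fact that the MaxEnt distribution under a mean constraint has the form p(x) ∝ exp(λx).

```python
import math

def maxent_with_mean(support, target_mean, lo=-50.0, hi=50.0, iters=200):
    """MaxEnt distribution on `support` subject to a fixed mean.

    The solution has the form p(x) proportional to exp(lam * x). The mean is
    increasing in lam, so lam is found by bisection (an illustrative choice).
    """
    def dist(lam):
        exps = [lam * x for x in support]
        m = max(exps)                      # shift exponents for numerical stability
        w = [math.exp(e - m) for e in exps]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(x * p for x, p in zip(support, dist(lam)))

    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

faces = [1, 2, 3, 4, 5, 6]
odd_faces = [1, 3, 5]

# Step 1: MaxEnt prior with E[X] = 3.5 on all six faces (comes out uniform).
prior = maxent_with_mean(faces, 3.5)

# Step 2: Bayes rule with the evidence A = "X is odd".
likelihood = [1.0 if x % 2 == 1 else 0.0 for x in faces]   # P(A | X = x)
unnormalized = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(unnormalized)                               # P(A) = 1/2
posterior = [u / evidence for u in unnormalized]

# Step 3: MaxEnt with both constraints, E[X] = 3.5 and P(A) = 1,
# i.e. MaxEnt restricted to the odd faces with the same mean.
maxent_odd = maxent_with_mean(odd_faces, 3.5)

print([round(p, 2) for p in posterior])    # mass 1/3 on each odd face
print([round(p, 2) for p in maxent_odd])   # roughly 0.22, 0.32, 0.47 on faces 1, 3, 5
```

The Bayesian posterior has mean 3, while the doubly constrained MaxEnt answer is forced to keep mean 3.5, which is where the two diverge.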

I don’t believe there is any paradox at all between MaxEnt (perhaps more generally, MinRelEnt) and Bayesian updates. Here, straight MaxEnt makes no sense. The implication of the problem is that the ensemble average 3.5 is no longer an active constraint. That is, we no longer believe the constraint E[X]=3.5 once we have the additional data that X is an odd number. The sequential update using minimum relative entropy is identical to Bayes rule and produces the correct answer. These two answers are simply (correct) answers to different questions.
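
As a rough numerical spot check of the minimum relative entropy claim (a sketch, not a proof): among distributions supported on the odd faces, the Bayes posterior should have the smallest KL divergence from the uniform prior, namely -log P(A) = log 2. The helper names `kl` and `random_odd_dist` below are mine, and the random candidates are only an illustration.

```python
import math
import random

def kl(q, p):
    """Relative entropy KL(q || p) in nats, with the convention 0 * log 0 = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

prior = [1 / 6] * 6                       # MaxEnt prior with E[X] = 3.5
bayes = [1 / 3, 0, 1 / 3, 0, 1 / 3, 0]    # Bayes posterior given "X is odd"

kl_bayes = kl(bayes, prior)               # should equal log 2

random.seed(0)                            # fixed seed so the check is repeatable

def random_odd_dist():
    """A random distribution that puts all its mass on faces 1, 3, 5."""
    w = [random.random() for _ in range(3)]
    z = sum(w)
    return [w[0] / z, 0, w[1] / z, 0, w[2] / z, 0]

candidates = [random_odd_dist() for _ in range(1000)]
smallest_candidate_kl = min(kl(q, prior) for q in candidates)

print(kl_bayes, math.log(2))
print(smallest_candidate_kl >= kl_bayes)
```

Every candidate satisfying the constraint P(A)=1 sits at least as far from the prior as the Bayes posterior does, which is consistent with the update being the minimum relative entropy solution once E[X]=3.5 is dropped as a constraint.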

2 Replies to “MaxEnt contradicts Bayes Rule?”

As with all interpretations and unifications, one just has to pick something that makes most sense.

Anyway, a constraint for me is just a parameter in a suitably defined probability distribution. Essentially, E[X] is the parameter, and the distribution is defined to be the MaxEnt distribution given the value of that parameter. As a Bayesian, however, one would use the data to model the posterior distribution over the parameter. In that sense, I agree with Drew.

That’s a coincidence. I just discussed this example on my blog. The key is to see Bayesian updating as just one case of I-projection in the sense of Csiszar. I give a link there to a paper by Harremoes which explains it all.
