MaxEnt contradicts Bayes Rule?
A few weeks ago I read this. David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it):
In short, I was thinking that "Bayesian belief updating" and "maximum entropy" were two orthogonal principles. But it appears that they are not, and that they can even be in conflict!
Example (from Kass 1996): consider a die (6 sides) with the prior knowledge E[X]=3.5.
Maximum entropy leads to P(X)= (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
Now consider a new piece of evidence A = "X is an odd number".
Bayesian posterior P(X|A) ∝ P(A|X) P(X), which normalizes to (1/3, 0, 1/3, 0, 1/3, 0).
But MaxEnt with the constraints E[X]=3.5 and E[Indicator function of A]=1 leads to (.22, 0, .32, 0, .47, 0) !! (note that E[Indicator function of A]=P(A))
Indeed, for MaxEnt, because '6' is no longer possible, the large numbers must become more probable to keep the average at 3.5. For Bayesian updating, P(X|A) doesn't have to have expectation 3.5; P(X) and P(X|A) are different distributions.
Conclusion? MaxEnt and Bayesian updating are two different principles leading to different belief distributions. Am I right?
I don’t believe there is any paradox at all between MaxEnt (perhaps more generally, MinRelEnt) and Bayesian updates. Here, straight MaxEnt makes no sense. The implication of the problem is that the ensemble average 3.5 is no longer an active constraint. That is, we no longer believe the constraint E[X]=3.5 once we have the additional data that X is an odd number. The sequential update using minimum relative entropy is identical to Bayes rule and produces the correct answer. These two answers are simply (correct) answers to different questions.
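(For the curious, here is a small numerical check of the two answers above; the code and names are my own illustration, not from the original exchange.)

# Verifying the die example: Bayes update vs. MaxEnt with both constraints.
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)

# Bayes: uniform prior, condition on the evidence A = "X is odd".
prior = np.full(6, 1.0 / 6.0)
likelihood = (faces % 2 == 1).astype(float)  # P(A|X): 1 on odd faces, 0 on even faces
posterior = prior * likelihood
posterior /= posterior.sum()
print(posterior)  # [1/3, 0, 1/3, 0, 1/3, 0]

# MaxEnt over {1, 3, 5} with E[X] = 3.5: the solution has the form p(x) ~ exp(lam * x).
odd = np.array([1.0, 3.0, 5.0])

def mean_gap(lam):
    w = np.exp(lam * odd)
    return (w * odd).sum() / w.sum() - 3.5

lam = brentq(mean_gap, -5.0, 5.0)  # solve for the lambda that makes the mean exactly 3.5
p = np.exp(lam * odd)
p /= p.sum()
print(p)  # roughly [0.22, 0.32, 0.47]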
Some recent papers
It was a fine time for learning in Pittsburgh. John and Sam mentioned some of my favorites. Here are a few more worth checking out:
Online Multitask Learning
Ofer Dekel, Phil Long, Yoram Singer
This is on my reading list. Definitely an area I’m interested in.
Maximum Entropy Distribution Estimation with Generalized Regularization
Miroslav Dudík, Robert E. Schapire
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
András Antos, Csaba Szepesvári, Rémi Munos
Again, on the list to read. I saw Csaba and Remi talk about this and related work at an ICML Workshop on Kernel Reinforcement Learning. The big question in my head is how this compares/contrasts with existing work in reductions to reinforcement learning. Are there advantages/disadvantages?
Higher Order Learning On Graphs by Sameer Agarwal, Kristin Branson, and Serge Belongie looks to be interesting. They seem to pooh-pooh "tensorization" of existing graph algorithms.
Cover Trees for Nearest Neighbor (Alina Beygelzimer, Sham Kakade, John Langford) finally seems to have gotten published. It’s an embarrassment to the community that it took this long, and a reminder of how diligent one has to be in ensuring good work gets published. This seems to happen on a regular basis. (See A New View of EM.)
Finally, I thought this one was very cool:
Constructing Informative Priors by Rajat Raina, Andrew Y. Ng, Daphne Koller.
Same interest as the first paper on the list.
Check them out!
Regularization = Robustness
The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math,
argmax_p H(p)
s.t. E_p[f_i] = c_i for all i

is given by p(x) = e^{\sum_i \lambda_i f_i(x)}/Z. (Z here is the necessary normalization constant, and the lambdas are free parameters we set to meet the expectation constraints.)
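(For completeness, a quick sketch of why, in the discrete case and in my notation: form the Lagrangian

L(p, \lambda, \mu) = -\sum_x p(x) \log p(x) + \sum_i \lambda_i (\sum_x p(x) f_i(x) - c_i) + \mu (\sum_x p(x) - 1).

Setting the derivative with respect to each p(x) to zero gives -\log p(x) - 1 + \sum_i \lambda_i f_i(x) + \mu = 0, so p(x) = e^{\sum_i \lambda_i f_i(x)}/Z, with Z absorbing the constants.)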
A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well (motivating work on models in text learning and Conditional Random Fields, for instance). The result has been demonstrated a number of ways. One of the most elegant is the "geometric" version here.
In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family. It’s a surprising connection and the duality it flows from appears in a wide variety of work. (For instance, Martin Wainwright’s approximate inference techniques rely (in essence) on this result.)
In practice, we know that Maximum Likelihood with a lot of features is bound to overfit. The traditional trick is to pull a sleight of hand in the derivation. We start with the primal entropy problem, move to the dual, and in the dual add a "prior" that penalizes the lambdas. (Typically an l_1 or l_2 penalty or constraint.) This game is played in a variety of papers, and it’s a sleight of hand because the penalties don’t come from the motivating problem (the primal) but rather get tacked on at the end. In short: it’s a hack.
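To be concrete about what gets tacked on (my notation; the l_2 version): if c_i is the empirical average of f_i, the dual of the entropy problem is the negative log-likelihood \log Z(\lambda) - \sum_i \lambda_i c_i, and the regularized version instead solves

\min_\lambda \log Z(\lambda) - \sum_i \lambda_i c_i + (\alpha/2) \sum_i \lambda_i^2,

which is just MAP estimation with a Gaussian prior on the lambdas.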
So I realized a few months back that the primal (entropy) problem that regularization relates to is remarkably natural. Basically, it tells us that regularization in the dual corresponds directly to uncertainty (mini-max) about the constraints in the primal. What we end up with is a distribution p that is robust in the sense that it maximizes the entropy subject to a large set of potential constraints. More recently, I realized that I’m not even close to having been the first to figure that out. Miroslav Dudík, Steven J. Phillips and Robert E. Schapire have a paper that derives this relation and then goes a step further to show what performance guarantees the method provides. It’s a great paper and I hope you get a chance to check it out:
Performance guarantees for regularized maximum entropy density estimation.
(Even better: if you’re attending ICML this year, I believe you will see Rob Schapire talk about some of this and related material as an invited speaker.)
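As I understand the result (stated here in my own notation, so treat it as a sketch rather than the paper's exact theorem), the penalized dual above is exactly the dual of a relaxed primal in which the moment constraints only need to hold approximately:

max_p H(p) s.t. |E_p[f_i] - c_i| <= \beta_i for all i

has as its dual

\min_\lambda \log Z(\lambda) - \sum_i \lambda_i c_i + \sum_i \beta_i |\lambda_i|,

i.e. l_1-regularized maximum likelihood, with the width of the uncertainty boxes playing the role of the regularization weights; replacing the per-coordinate boxes with an l_2 ball on the vector of violations yields an l_2-norm penalty in the same way.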
It turns out the idea generalizes quite a bit. In Robust design of biological experiments, P. Flaherty, M. I. Jordan and A. P. Arkin show a related result where regularization directly follows from a robustness or uncertainty guarantee. And if you want the whole, beautiful framework, you’re in luck. Yasemin Altun and Alex Smola have a paper (that I haven’t yet finished, but at least begins very well) that generalizes the regularized maximum entropy duality to a whole class of statistical inference procedures. If you’re at COLT, you can check this out as well:
Unifying Divergence Minimization and Statistical Inference via Convex Duality
The deep, unifying result seems to be what the title of the post says: robustness = regularization. This viewpoint makes regularization seem like much less of a hack, and goes further in suggesting just what range of constants might be reasonable. The work is very relevant to learning, but the general idea goes beyond to various problems where we only approximately know constraints.
Yet more NIPS thoughts
I only managed to make it out to the NIPS workshops this year so
I’ll give my comments on what I saw there.
The Learning and Robotics workshop lives again. I hope it
continues and gets more high quality papers in the future. The
most interesting talk for me was Larry Jackel’s on the LAGR
program (see John’s previous post on said program). I got some
ideas as to what progress has been made. Larry really explained
the types of benchmarks and the tradeoffs that had to be made to
make the goals achievable but challenging.
Hal Daume gave a very interesting talk about structured
prediction using RL techniques, something near and dear to my own
heart. He achieved rather impressive results using only a very
greedy search.
The non-parametric Bayes workshop was great. I enjoyed the entire
morning session I spent there, and particularly (the usually
desultory) discussion periods. One interesting topic was the
Gibbs/Variational inference divide. I won’t try to summarize
especially as no conclusion was reached. It was interesting to
note that samplers are competitive with the variational
approaches for many Dirichlet process problems. One open question
I left with was whether the fast variants of Gibbs sampling could
be made multi-processor as the naive variants can.
I also have a reading list of sorts from the main conference. Most of the papers mentioned in previous posts on NIPS are on that list, as well as these (in no particular order):
The Information-Form Data Association Filter
Sebastian Thrun, Brad Schumitsch, Gary Bradski, Kunle Olukotun

Divergences, surrogate loss functions and experimental design
XuanLong Nguyen, Martin Wainwright, Michael Jordan

Generalization to Unseen Cases
Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri

Gaussian Process Dynamical Models
David Fleet, Jack Wang, Aaron Hertzmann

Convex Neural Networks
Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Describing Visual Scenes using Transformed Dirichlet Processes
Erik Sudderth, Antonio Torralba, William Freeman, Alan Willsky

Learning vehicular dynamics, with application to modeling helicopters
Pieter Abbeel, Varun Ganapathi, Andrew Ng

Tensor Subspace Analysis
Xiaofei He, Deng Cai, Partha Niyogi