*strict saddle.* I first read about the approach here: http://arxiv.org/abs/1503.02101, but at ICML he and other authors demonstrated a remarkable number of problems that have this property, which enables efficient optimization via stochastic gradient descent (and other) procedures.

## MaxEnt contradicts Bayes Rule?

A few weeks ago I read this. David Blei and I spent some time thinking hard about this a few years back (thanks to Kary Myers for pointing us to it):

*In short, I was thinking that "Bayesian belief updating" and "maximum entropy" were two orthogonal principles. But it appears that they are not, and that they can even be in conflict!
Example (from Kass 1996): consider a die (6 sides), with prior knowledge E[X] = 3.5.
Maximum entropy leads to P(X) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).
Now consider a new piece of evidence A = "X is an odd number".
The Bayesian posterior P(X|A) ∝ P(A|X) P(X) = (1/3, 0, 1/3, 0, 1/3, 0).
But MaxEnt with the constraints E[X] = 3.5 and E[indicator of A] = 1 leads to (.22, 0, .32, 0, .47, 0)!! (Note that E[indicator of A] = P(A).)
Indeed, for MaxEnt, because there is no more '6', big numbers must be more probable to ensure an average of 3.5. For Bayesian updating, P(X|A) doesn't have to have a 3.5 expectation: P(X) and P(X|A) are different distributions.
Conclusion? MaxEnt and Bayesian updating are two different principles leading to different belief distributions. Am I right?*

I don’t believe there is any paradox at all between MaxEnt (perhaps more generally, MinRelEnt) and Bayesian updates. Here, straight MaxEnt makes no sense. The implication of the problem is that the ensemble average 3.5 is *no longer an active constraint*. That is, we no longer believe the constraint E[X] = 3.5 once we have the additional data that X is an odd number. The sequential update using minimum relative entropy is identical to Bayes rule and produces the correct answer. These two answers are simply (correct) answers to different questions.
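To make the two answers concrete, here is a minimal numeric sketch in Python. On the odd faces {1, 3, 5}, the MaxEnt solution has the form p(x) ∝ e^{λx}, so with t = e^{2λ} the weights are (1, t, t²) and the mean constraint E[X] = 3.5 reduces to the quadratic 3t² − t − 5 = 0:

```python
import math

# Bayes rule: uniform prior on the die, conditioned on "X is odd".
prior = [1 / 6] * 6
posterior = [p if x % 2 == 1 else 0.0 for x, p in zip(range(1, 7), prior)]
z = sum(posterior)
posterior = [p / z for p in posterior]
# posterior == (1/3, 0, 1/3, 0, 1/3, 0); its mean is 3, not 3.5.

# MaxEnt on the odd faces {1, 3, 5} keeping E[X] = 3.5 as a constraint:
# weights (1, t, t^2) with t = e^{2*lambda}; the mean constraint gives
# 3*t^2 - t - 5 = 0, whose positive root is below.
t = (1 + math.sqrt(61)) / 6
weights = [1.0, t, t * t]
z = sum(weights)
maxent = [w / z for w in weights]
# maxent ~ (0.216, 0.317, 0.466) on faces 1, 3, 5 -- the (.22, .32, .47)
# of the quoted example -- and its mean is exactly 3.5 by construction.
```

The Bayesian posterior has mean 3, which is exactly the point above: once we condition on A, E[X] = 3.5 is no longer an active constraint.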

## Some recent papers

It was a fine time for learning in Pittsburgh. John and Sam mentioned some of my favorites. Here’s a few more worth checking out:

Online Multitask Learning

Ofer Dekel, Phil Long, Yoram Singer

This is on my reading list. Definitely an area I’m interested in.

Maximum Entropy Distribution Estimation with Generalized Regularization

Miroslav Dudík, Robert E. Schapire

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path

András Antos, Csaba Szepesvári, Rémi Munos

Again, on the list to read. I saw Csaba and Remi talk about this and related work at an ICML Workshop on Kernel Reinforcement Learning. The big question in my head is how this compares/contrasts with existing work in reductions to reinforcement learning. Are there advantages/disadvantages?

Higher Order Learning On Graphs by Sameer Agarwal, Kristin Branson, and Serge Belongie looks to be interesting. They seem to pooh-pooh “tensorization” of existing graph algorithms.

Cover Trees for Nearest Neighbor (Alina Beygelzimer, Sham Kakade, John Langford) finally seems to have gotten published. It’s an embarrassment to the community that it took this long– and a reminder of how diligent one has to be in ensuring good work gets published. This seems to happen on a regular basis. (See A New View of EM.)

Finally, I thought this one was very cool:

Constructing Informative Priors by Rajat Raina, Andrew Y. Ng, Daphne Koller.

Same interest as the first paper on the list.

Check them out!

## Regularization = Robustness

The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math,

argmax_p H(p)

s.t. E_p[f_i] = c_i

is given by e^{\sum_i \lambda_i f_i}/Z. (Z here is the necessary normalization constant, and the lambdas are free parameters we set to meet the expectation constraints.)
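As a sketch of how the lambdas get set in practice (with a hypothetical constraint: a die forced to have mean 4.5), one can fit the single λ by gradient descent on the dual objective log Z(λ) − λc, whose gradient is E_p[f] − c:

```python
import math

faces = range(1, 7)
c = 4.5        # hypothetical target expectation E[X]
lam = 0.0
step = 0.05

# Minimize the dual log Z(lam) - lam * c; its gradient is E_p[X] - c.
for _ in range(5000):
    weights = [math.exp(lam * x) for x in faces]
    Z = sum(weights)
    mean = sum(x * w for x, w in zip(faces, weights)) / Z
    lam -= step * (mean - c)

weights = [math.exp(lam * x) for x in faces]
Z = sum(weights)
p = [w / Z for w in weights]
# p is the Gibbs-Jaynes solution e^{lam * x} / Z, and its mean matches c.
```

Note the shape of the answer: a mean above 3.5 forces λ > 0, so the fitted distribution puts exponentially more mass on the larger faces, exactly as the theorem says it must.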

A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (Motivating work in text learning models and Conditional Random Fields, for instance.) The result has been demonstrated a number of ways. One of the most elegant is the “geometric” version here.

In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family. It’s a surprising connection, and the duality it flows from appears in a wide variety of work. (For instance, Martin Wainwright’s approximate inference techniques rely (in essence) on this result.)

In practice, we know that Maximum Likelihood with a lot of features is bound to overfit. The traditional trick is to pull a sleight of hand in the derivation. We start with the primal entropy problem, move to the dual, and in the dual add a “prior” that penalizes the lambdas. (Typically an l_1 or l_2 penalty or constraint.) This game is played in a variety of papers, and it’s a sleight of hand because the penalties don’t come from the motivating problem (the primal) but rather get tacked on at the end. In short: it’s a hack.

So I realized a few months back that the primal (entropy) problem that regularization relates to is remarkably natural. Basically, it tells us that regularization in the dual corresponds directly to uncertainty (mini-max) about the constraints in the primal. What we end up with is a distribution p that is robust in the sense that it maximizes the entropy subject to a large set of potential constraints. More recently, I realized that I’m not even close to having been the first to figure that out. Miroslav Dudík, Steven J. Phillips and Robert E. Schapire have a paper that derives this relation and then goes a step further to show what performance guarantees the method provides. It’s a great paper and I hope you get a chance to check it out:

Performance guarantees for regularized maximum entropy density estimation.

(Even better: if you’re attending ICML this year, I believe you will see Rob Schapire talk about some of this and related material as an invited speaker.)

It turns out the idea generalizes quite a bit. In Robust design of biological experiments, P. Flaherty, M. I. Jordan and A. P. Arkin show a related result where regularization directly follows from a robustness or uncertainty guarantee. And if you want the whole, beautiful framework, you’re in luck. Yasemin Altun and Alex Smola have a paper (that I haven’t yet finished, but at least begins very well) that generalizes the regularized maximum entropy duality to a whole class of statistical inference procedures. If you’re at COLT, you can check this out as well.

Unifying Divergence Minimization and Statistical Inference via Convex Duality

The deep, unifying result seems to be what the title of the post says: robustness = regularization. This viewpoint makes regularization seem like much less of a hack, and goes further in suggesting just what range of constants might be reasonable. The work is very relevant to learning, but the general idea goes beyond to various problems where we only approximately know constraints.
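The correspondence is easy to see numerically in one dimension. Below is a sketch with toy numbers (not taken from any of the papers above): adding an l_1 penalty β|λ| to the dual of the die problem acts like relaxing the primal constraint to |E_p[X] − c| ≤ β, and the fitted mean lands on the boundary of that slack region:

```python
import math

faces = range(1, 7)
c, beta = 4.5, 0.2   # toy target mean and l1 penalty weight

def penalized_dual(lam):
    # dual objective log Z(lam) - lam*c, plus the regularizer beta*|lam|
    Z = sum(math.exp(lam * x) for x in faces)
    return math.log(Z) - lam * c + beta * abs(lam)

# A crude 1-d grid search over lambda is enough to illustrate the point.
lam = min((l / 10000 for l in range(-20000, 20001)), key=penalized_dual)

weights = [math.exp(lam * x) for x in faces]
Z = sum(weights)
mean = sum(x * w for x, w in zip(faces, weights)) / Z
# Stationarity of the penalized dual (for lam > 0) gives E_p[X] = c - beta:
# the dual penalty reappears in the primal as slack in the constraint.
```

With β = 0 the constraint binds exactly (mean 4.5); increasing β buys robustness to a misspecified constraint at the price of bias, which is the regularization-as-robustness story in miniature.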

## Yet more nips thoughts

I only managed to make it out to the NIPS workshops this year, so I’ll give my comments on what I saw there.

The Learning and Robotics workshop lives again. I hope it continues and gets more high quality papers in the future. The most interesting talk for me was Larry Jackel’s on the LAGR program (see John’s previous post on said program). I got some ideas as to what progress has been made. Larry really explained the types of benchmarks and the tradeoffs that had to be made to keep the goals achievable but challenging.

Hal Daume gave a very interesting talk about structured prediction using RL techniques, something near and dear to my own heart. He achieved rather impressive results using only a very greedy search.

The non-parametric Bayes workshop was great. I enjoyed the entire morning session I spent there, and particularly (the usually desultory) discussion periods. One interesting topic was the Gibbs/Variational inference divide. I won’t try to summarize, especially as no conclusion was reached. It was interesting to note that samplers are competitive with the variational approaches for many Dirichlet process problems. One open question I left with was whether the fast variants of Gibbs sampling could be made multi-processor, as the naive variants can.

I also have a reading list of sorts from the main conference. Most of the papers mentioned in previous posts on NIPS are on that list, as well as these (in no particular order):

The Information-Form Data Association Filter
Sebastian Thrun, Brad Schumitsch, Gary Bradski, Kunle Olukotun

Divergences, surrogate loss functions and experimental design
XuanLong Nguyen, Martin Wainwright, Michael Jordan

Generalization to Unseen Cases
Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri

Gaussian Process Dynamical Models
David Fleet, Jack Wang, Aaron Hertzmann

Convex Neural Networks
Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Describing Visual Scenes using Transformed Dirichlet Processes
Erik Sudderth, Antonio Torralba, William Freeman, Alan Willsky

Learning vehicular dynamics, with application to modeling helicopters
Pieter Abbeel, Varun Ganapathi, Andrew Ng

Tensor Subspace Analysis
Xiaofei He, Deng Cai, Partha Niyogi