The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math,

argmax_p H(p)

s.t. E_p[f_i] = c_i

is given by e^{\sum \lambda_i f_i}/Z. (Z here is the necessary normalization constraint, and the lambdas are free parameters we set to meet the expectation constraints).

A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (Motivating work in models in text learning and Conditional Random Fields, for instance. ) The result has been demonstrated a number of ways. One of the most elegant is the Ã¢â‚¬Å“geometricÃ¢â‚¬Â version here.

In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family. ItÃ¢â‚¬â„¢s a surprising connection and the duality it flows from appears in a wide variety of work. (For instance, Martin WainwrightÃ¢â‚¬â„¢s approximate inference techniques rely (in essence) on this result.)

In practice, we know that Maximum Likelihood with a lot of features is bound to overfit. The traditional trick is to pull a sleight of hand in the derivation. We start with the primal entropy problem, move to the dual, and in the dual add a Ã¢â‚¬Å“priorÃ¢â‚¬Â that penalizes the lambdas. (Typically an l_1 or l_2 penalty or constraint.) This game is played in a variety of papers, and itÃ¢â‚¬â„¢s a sleight of hand because the penalties donÃ¢â‚¬â„¢t come from the motivating problem (the primal) but rather get tacked on at the end. In short: itÃ¢â‚¬â„¢s a hack.

So I realized a few months back, that the primal (entropy) problem that regularization relates to is remarkably natural. Basically, it tells us that regularization in the dual corresponds directly to uncertainty (mini-max) about the constraints in the primal. What we end up with is a distribution p that is robust in the sense that it maximizes the entropy subject to a large set of potential constraints. More recently, I realized that IÃ¢â‚¬â„¢m not even close to having been the first to figure that out. Miroslav DudÃƒÂk, Steven J. Phillips and Robert E. Schapire, have a paper that derives this relation and then goes a step further to show what performance guarantees the method provides. ItÃ¢â‚¬â„¢s a great paper and I hope you get a chance to check it out:

Performance guarantees for regularized maximum entropy density estimation.

(Even better: if youÃ¢â‚¬â„¢re attending ICML this year, I believe you will see Rob Schapire talk about some of this and related material as an invited speaker.)

It turns out the idea generalizes quite a bit. In Robust design of biological experiments. P. Flaherty, M. I. Jordan and A. P. Arkin show a related result where regularization directly follows from a robustness or uncertainty guarantee. And if you want the whole, beautiful framework youÃ¢â‚¬â„¢re in luck. Yasemin Altun and Alex Smola have a paper (that I havenÃ¢â‚¬â„¢t yet finished, but at least begins very well) that generalizes the regularized maximum entropy duality to a whole class of statistical inference procedures. If youÃ¢â‚¬â„¢re at COLT, you can check this out as well.

Unifying Divergence Minimization and Statistical Inference via Convex Duality

The deep, unifying result seems to be what the title of the post says: robustness = regularization. This viewpoint makes regularization seem like much less of a hack, and goes further in suggesting just what range of constants might be reasonable. The work is very relevant to learning, but the general idea goes beyond to various problems where we only approximately know constraints.

John, I believe that robustness was the original motivation for introducing the concept of regularization. For example, if I remember correctly, Tikhonov used regularizations to make solutions of some differential equations stable with respect to the initial conditions.

misha b

misha b

Misha,

Sure… it seems like we could make a longer analogy that said

regularized–>stable–>robust

Interesting to think about just which way implications should go….

Drew

Hi Drew,

did not realize it was you and not John who posted 🙂

One interesting thing is that we cannot learn from noisy data without some form of regularization — we end up just fitting the noise. For example the story of slack variables in SVM, which says that slack variables are needed when the data is not linearly separable, does not hold water. Even when the data is separable (e.g. when using any universal kernel) slack variables (i.e., the regularization term) are still needed.

misha

There was an earlier discussion about this which is partially relevant.