Machine Learning – Page 59 – Machine Learning (Theory)

7/5/2006

more icml papers

Here are a few other papers I enjoyed from ICML06.

Topic Models:

Dynamic Topic Models
David Blei, John Lafferty
A nice model for how topics in LDA type models can evolve over time,
using a linear dynamical system on the natural parameters and a very
clever structured variational approximation (in which the mean field
parameters are pseudo-observations of a virtual LDS). Like all Blei
papers, he makes it look easy, but it is extremely impressive.
Pachinko Allocation
Wei Li, Andrew McCallum
A very elegant (but computationally challenging) model which induces
correlation amongst topics using a multi-level DAG whose interior nodes
are “super-topics” and “sub-topics” and whose leaves are the
vocabulary words. Makes the slumbering monster of structure learning stir.

Sequence Analysis (I missed these talks since I was chairing another session)

Online Decoding of Markov Models with Latency Constraints
Mukund Narasimhan, Paul Viola, Michael Shilman
An “ah-ha!” paper showing how to trade off latency and decoding
accuracy when doing MAP labelling (Viterbi decoding) in sequential
Markovian models. You’ll wish you had thought of this yourself.
Efficient inference on sequence segmentation models
Sunita Sarawagi
A smart way to re-represent potentials in segmentation models
to reduce the complexity of inference from cubic in the input sequence
to linear. Also check out her NIPS2004 paper with William Cohen
on “segmentation CRFs”. Moral of the story: segmentation is NOT just
sequence labelling.

Optimal Partitionings/Labellings

The uniqueness of a good optimum for K-means
Marina Meila
Marina shows a stability result for K-means clustering, namely
that if you find a “good” clustering it is not too “different” from the
(unknowable) optimal clustering and that all other good clusterings
are “near” it. So, don’t worry about local minima in K-means as long
as you get a low objective.
Quadratic Programming Relaxations for Metric Labeling and Markov Random Field MAP Estimation
Pradeep Ravikumar, John Lafferty
Pradeep and John introduce QP relaxations for the problem of finding
the best joint labelling of a set of points (connected by a weighted
graph and with a known metric cost between labels, and extended to
the non-metric case). Surprisingly, they show that the QP relaxation
is both computationally more attractive and more accurate than
the “natural” LP relaxation or than loopy BP approximations.

Distinguished Paper Award Winners

How Boosting the Margin Can Also Boost Classifier Complexity
Lev Reyzin, Robert Schapire
Trading Convexity for Scalability
Ronan Collobert, Fabian Sinz, Jason Weston, Leon Bottou
Looping Suffix Tree-Based Inference of Partially Observable Hidden State
Michael Holmes, Charles Isbell

6/30/2006

ICML papers

Here are some ICML papers which interested me.

Arindam Banerjee had a paper which notes that PAC-Bayes bounds, a core theorem in online learning, and the optimality of Bayesian learning statements share a core inequality in their proof.
Pieter Abbeel, Morgan Quigley and Andrew Y. Ng have a paper discussing RL techniques for learning given a bad (but not too bad) model of the world.
Nina Balcan and Avrim Blum have a paper which discusses how to learn given a similarity function rather than a kernel. A similarity function requires less structure than a kernel, implying that a learning algorithm using a similarity function might be applied in situations where no effective kernel is evident.
Nathan Ratliff, Drew Bagnell, and Marty Zinkevich have a paper describing an algorithm which attempts to fuse A* path planning with learning of transition costs based on human demonstration.

The second, third, and fourth papers above all seem like an initial pass at solving interesting problems which push the domain in which learning is applicable.

I’d like to encourage discussion of what papers interested you and why. Maybe we’ll all learn a little bit, and it’s very likely that we all missed interesting papers in a multitrack conference.

6/24/2006

Online convex optimization at COLT

At ICML 2003, Marty Zinkevich proposed the online convex optimization setting and showed that a particular gradient descent algorithm has regret O(T^0.5) with respect to the best predictor where T is the number of rounds. This seems to be a nice model for online learning, and there has been some significant follow-up work.
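As a rough illustration of the setting, here is a minimal Python sketch of projected online gradient descent with step size 1/sqrt(t). The quadratic losses, the unit-ball feasible set, and every function name below are assumptions chosen for illustration; they are not taken from the paper.

```python
import math
import random


def project_to_ball(x, radius):
    """Euclidean projection of x onto the ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def online_gradient_descent(targets, radius=1.0):
    """Total loss of projected online gradient descent.

    Round t: play x_t, then observe the (assumed) convex loss
    f_t(x) = ||x - z_t||^2 and take a gradient step of size 1/sqrt(t).
    """
    dim = len(targets[0])
    x = [0.0] * dim
    total_loss = 0.0
    for t, z in enumerate(targets, start=1):
        # Pay the loss at the point chosen before z_t was revealed.
        total_loss += sum((a - b) ** 2 for a, b in zip(x, z))
        grad = [2.0 * (a - b) for a, b in zip(x, z)]
        eta = 1.0 / math.sqrt(t)
        x = project_to_ball([a - eta * g for a, g in zip(x, grad)], radius)
    return total_loss


def best_fixed_loss(targets, radius=1.0):
    """Loss of the best single point in hindsight.

    For these quadratic losses, that point is the mean of the targets
    projected onto the feasible ball.
    """
    dim = len(targets[0])
    mean = [sum(z[i] for z in targets) / len(targets) for i in range(dim)]
    best = project_to_ball(mean, radius)
    return sum(sum((a - b) ** 2 for a, b in zip(best, z)) for z in targets)
```

Regret is `online_gradient_descent(targets) - best_fixed_loss(targets)`. With targets inside the unit ball, the feasible set has diameter 2 and gradients have norm at most 4, so the O(T^0.5) bound applies to this toy problem.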

At COLT 2006 Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal presented a modification which takes a Newton step guaranteeing O(log T) regret when the first and second derivatives are bounded. Then they applied these algorithms to portfolio management at ICML 2006 (with Robert Schapire) yielding some very fun graphs.

6/16/2006 (modified 6/17/2006)

Regularization = Robustness

The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math,

argmax_p H(p)
s.t. E_p[f_i] = c_i

is given by e^{\sum_i \lambda_i f_i}/Z. (Z here is the necessary normalization constant, and the lambdas are free parameters we set to meet the expectation constraints.)

A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (It motivates work on models in text learning and on Conditional Random Fields, for instance.) The result has been demonstrated a number of ways. One of the most elegant is the “geometric” version here.

In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family. It’s a surprising connection and the duality it flows from appears in a wide variety of work. (For instance, Martin Wainwright’s approximate inference techniques rely (in essence) on this result.)
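A small numerical check can make this connection concrete. The sketch below fits an exponential family by gradient ascent on the average log-likelihood and then verifies that the fitted distribution matches the empirical feature expectations. The four-point domain, the two features, the step size, and all names are assumptions made up for illustration.

```python
import math

# Hypothetical toy setup: a four-point domain with two features scaled to [0, 1].
DOMAIN = [0, 1, 2, 3]
FEATURES = [lambda x: x / 3.0, lambda x: (x / 3.0) ** 2]


def gibbs(lams):
    """Distribution p(x) proportional to exp(sum_i lam_i * f_i(x)) on DOMAIN."""
    weights = [math.exp(sum(l * f(x) for l, f in zip(lams, FEATURES))) for x in DOMAIN]
    z = sum(weights)  # the normalization constant Z
    return [w / z for w in weights]


def expected_features(p):
    """E_p[f_i] for each feature."""
    return [sum(px * f(x) for px, x in zip(p, DOMAIN)) for f in FEATURES]


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(px * math.log(px) for px in p if px > 0)


def fit_max_likelihood(data, steps=20000, lr=1.0):
    """Gradient ascent on the average log-likelihood of the exponential family.

    The gradient with respect to lam_i is c_i - E_p[f_i], so a stationary
    point is exactly a point where the expectation constraints hold.
    """
    c = [sum(f(x) for x in data) / len(data) for f in FEATURES]
    lams = [0.0] * len(FEATURES)
    for _ in range(steps):
        model = expected_features(gibbs(lams))
        lams = [l + lr * (ci - mi) for l, ci, mi in zip(lams, c, model)]
    return lams, c
```

For data such as `[0, 1, 1, 2, 2, 2, 3]`, the fitted `gibbs(lams)` reproduces the empirical moments `c`, and its entropy is at least that of the empirical distribution, which satisfies the same constraints.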

In practice, we know that Maximum Likelihood with a lot of features is bound to overfit. The traditional trick is to pull a sleight of hand in the derivation. We start with the primal entropy problem, move to the dual, and in the dual add a “prior” that penalizes the lambdas. (Typically an l_1 or l_2 penalty or constraint.) This game is played in a variety of papers, and it’s a sleight of hand because the penalties don’t come from the motivating problem (the primal) but rather get tacked on at the end. In short: it’s a hack.

So I realized a few months back that the primal (entropy) problem that regularization relates to is remarkably natural. Basically, it tells us that regularization in the dual corresponds directly to uncertainty (mini-max) about the constraints in the primal. What we end up with is a distribution p that is robust in the sense that it maximizes the entropy subject to a large set of potential constraints. More recently, I realized that I’m not even close to having been the first to figure that out. Miroslav Dudík, Steven J. Phillips and Robert E. Schapire have a paper that derives this relation and then goes a step further to show what performance guarantees the method provides. It’s a great paper and I hope you get a chance to check it out:

Performance guarantees for regularized maximum entropy density estimation.

(Even better: if you’re attending ICML this year, I believe you will see Rob Schapire talk about some of this and related material as an invited speaker.)
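For the l_1 case, the correspondence can be written out roughly as follows. This is a sketch of the standard form of the relation, not a quotation from the paper; the slack widths \beta_i are notation assumed here, and the other symbols follow the maximum entropy problem stated earlier in this post.

```latex
\begin{aligned}
\text{primal:}\quad & \max_{p}\; H(p)
  \quad \text{s.t.}\quad \bigl|\,E_p[f_i] - c_i\,\bigr| \le \beta_i \ \text{ for all } i, \\
\text{dual:}\quad & \min_{\lambda}\; \log Z(\lambda) - \sum_i \lambda_i c_i + \sum_i \beta_i\,|\lambda_i| ,
  \qquad Z(\lambda) = \sum_x e^{\sum_i \lambda_i f_i(x)} .
\end{aligned}
```

Setting every \beta_i to zero recovers the exact constraints and the unregularized maximum likelihood problem, while wider boxes around the c_i translate into a heavier l_1 penalty on the corresponding \lambda_i.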

It turns out the idea generalizes quite a bit. In “Robust design of biological experiments”, P. Flaherty, M. I. Jordan and A. P. Arkin show a related result where regularization directly follows from a robustness or uncertainty guarantee. And if you want the whole, beautiful framework, you’re in luck. Yasemin Altun and Alex Smola have a paper (that I haven’t yet finished, but at least begins very well) that generalizes the regularized maximum entropy duality to a whole class of statistical inference procedures. If you’re at COLT, you can check this out as well.

Unifying Divergence Minimization and Statistical Inference via Convex Duality

The deep, unifying result seems to be what the title of the post says: robustness = regularization. This viewpoint makes regularization seem like much less of a hack, and goes further in suggesting just what range of constants might be reasonable. The work is very relevant to learning, but the general idea goes beyond to various problems where we only approximately know constraints.

6/15/2006

IJCAI is out of season

IJCAI is running January 6-12 in Hyderabad, India, rather than on a more traditional summer date. (Presumably, this is to avoid melting people in the Indian summer.)

The paper deadlines (June 23 abstract / June 30 submission) are particularly inconvenient if you attend COLT or ICML. But on the other hand, it’s a good excuse to visit India.