Some recent papers

It was a fine time for learning in Pittsburgh. John and Sam mentioned some of my favorites. Here are a few more worth checking out:

Online Multitask Learning
Ofer Dekel, Phil Long, Yoram Singer
This is on my reading list. Definitely an area I’m interested in.

Maximum Entropy Distribution Estimation with Generalized Regularization
Miroslav Dudík, Robert E. Schapire

Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path
András Antos, Csaba Szepesvári, Rémi Munos
Again, on the list to read. I saw Csaba and Rémi talk about this and related work at an ICML Workshop on Kernel Reinforcement Learning. The big question in my head is how this compares/contrasts with existing work in reductions to reinforcement learning. Are there advantages/disadvantages?

Higher Order Learning On Graphs by Sameer Agarwal, Kristin Branson, and Serge Belongie looks to be interesting. They seem to pooh-pooh “tensorization” of existing graph algorithms.

Cover Trees for Nearest Neighbor (Alina Beygelzimer, Sham Kakade, John Langford) finally seems to have gotten published. It’s an embarrassment to the community that it took this long, and a reminder of how diligent one has to be in ensuring good work gets published. This seems to happen on a regular basis. (See A New View of EM.)

Finally, I thought this one was very cool:
Constructing Informative Priors by Rajat Raina, Andrew Y. Ng, Daphne Koller.
Same area of interest as the first paper on the list.
Check them out!

Branch Prediction Competition

Alan Fern points out the second branch prediction challenge (due September 29), a follow-up to the first branch prediction competition. Branch prediction is one of the fundamental learning problems of the computer age: without it our computers might run an order of magnitude slower. This is a tough problem since there are sharp constraints on time and space complexity in an online environment. For machine learning, the “idealistic track” may fit well. Essentially, they remove these constraints to gain a weak upper bound on what might be done.
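
For concreteness, here is a minimal sketch of the classic two-bit saturating-counter predictor, which illustrates just how little state and time a real predictor gets to work with. The table size and trace below are made up, and this is nothing like a competitive entry or the competition's actual framework.

    # Minimal sketch of a two-bit saturating-counter branch predictor.
    # Purely illustrative: table size and trace are invented, and this is
    # nowhere near a competitive predictor.

    class TwoBitPredictor:
        def __init__(self, table_bits=12):
            self.mask = (1 << table_bits) - 1
            # One 2-bit counter per entry, started at "weakly not taken".
            self.table = [1] * (1 << table_bits)

        def predict(self, pc):
            # Predict "taken" when the counter is in its upper half.
            return self.table[pc & self.mask] >= 2

        def update(self, pc, taken):
            # Saturating increment/decrement toward the observed outcome.
            i = pc & self.mask
            self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

    # Online evaluation over (branch address, outcome) pairs.
    predictor = TwoBitPredictor()
    trace = [(0x400123, True), (0x400123, True), (0x400200, False), (0x400123, False)]
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    print(misses, "mispredictions out of", len(trace))

The point is that per-branch state is a couple of bits and updates are constant time; the idealistic track is interesting precisely because it asks what learning could do once those constraints are lifted.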

more icml papers

Here are a few other papers I enjoyed from ICML06.

Topic Models:


  • Dynamic Topic Models

    David Blei, John Lafferty
    A nice model for how topics in LDA type models can evolve over time,
    using a linear dynamical system on the natural parameters and a very
    clever structured variational approximation (in which the mean field
    parameters are pseudo-observations of a virtual LDS). Like all Blei
    papers, this one makes it look easy, but it is extremely impressive.
    (A toy sketch of the parameter-evolution idea appears after this list.)

  • Pachinko Allocation

    Wei Li, Andrew McCallum
    A very elegant (but computationally challenging) model which induces
    correlation amongst topics using a multi-level DAG whose interior nodes
    are “super-topics” and “sub-topics” and whose leaves are the
    vocabulary words. Makes the slumbering monster of structure learning stir.
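
To make the “linear dynamical system on the natural parameters” idea concrete, here is a toy sketch of just the state evolution in a dynamic topic model: one topic's natural parameters drift as a Gaussian random walk over time slices, and the word distribution at each slice is the softmax of those parameters. The dimensions and drift scale are invented, and the paper's variational inference is omitted entirely.

    import numpy as np

    # Toy sketch of the generative state evolution in a dynamic topic model.
    # Vocabulary size, number of slices, and drift scale are invented for
    # illustration; the (hard) inference machinery is omitted.

    rng = np.random.default_rng(0)
    V, T, sigma = 1000, 20, 0.05      # vocabulary size, time slices, drift scale

    beta = np.zeros((T, V))           # one topic's natural parameters over time
    for t in range(1, T):
        beta[t] = beta[t - 1] + sigma * rng.standard_normal(V)

    def word_distribution(b):
        # Softmax maps natural parameters to a distribution over the vocabulary.
        e = np.exp(b - b.max())
        return e / e.sum()

    topic_over_time = np.array([word_distribution(beta[t]) for t in range(T)])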

Sequence Analysis (I missed these talks since I was chairing another session)


  • Online Decoding of Markov Models with Latency Constraints

    Mukund Narasimhan, Paul Viola, Michael Shilman
    An “ah-ha!” paper showing how to trade off latency and decoding
    accuracy when doing MAP labelling (Viterbi decoding) in sequential
    Markovian models. You’ll wish you thought of this yourself. (A toy
    fixed-lag baseline is sketched after this list.)

  • Efficient inference on sequence segmentation models

    Sunita Sarawagi
    A smart way to re-represent potentials in segmentation models
    to reduce the complexity of inference from cubic in the input sequence
    length to linear. Also check out her NIPS 2004 paper with William Cohen
    on “segmentation CRFs”. Moral of the story: segmentation is NOT just
    sequence labelling.
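
For the latency paper, here is a toy sketch of the obvious baseline it improves on: plain Viterbi decoding where we commit to the label at position t after looking only L observations ahead, by backtracking from the best partial path so far. Everything here (the chain, the scores, the lag) is invented, and this is not the paper's algorithm; committed labels can disagree with the full Viterbi path, which is exactly the accuracy you trade for latency.

    import numpy as np

    # Toy fixed-lag decoding for a small Markov chain. All numbers are invented;
    # this is the naive baseline, not the method of Narasimhan, Viola, and Shilman.

    rng = np.random.default_rng(1)
    K, T, L = 3, 12, 2                      # states, sequence length, allowed lag
    emit = rng.normal(size=(T, K))          # per-position state scores (log-potentials)
    trans = rng.normal(size=(K, K))         # transition scores

    delta = emit[0].copy()                  # best score of a path ending in each state
    backptr = np.zeros((T, K), dtype=int)
    committed = []

    for t in range(1, T):
        scores = delta[:, None] + trans + emit[t][None, :]   # (prev state, next state)
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
        if t >= L:
            # Backtrack L steps from the current best state; commit position t - L.
            s = int(delta.argmax())
            for u in range(t, t - L, -1):
                s = int(backptr[u, s])
            committed.append(s)

    # Flush the last L positions from the final best path.
    s = int(delta.argmax())
    tail = [s]
    for u in range(T - 1, T - L, -1):
        s = int(backptr[u, s])
        tail.append(s)
    committed.extend(reversed(tail))
    print(committed)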

Optimal Partitionings/Labellings


  • The uniqueness of a good optimum for K-means

    Marina Meila
    Marina shows a stability result for K-means clustering, namely
    that if you find a “good” clustering it is not too “different” from the
    (unknowable) optimal clustering, and that all other good clusterings
    are “near” it. So, don’t worry about local minima in K-means as long
    as you get a low objective. (A toy illustration of this is sketched
    after this list.)

  • Quadratic Programming Relaxations for Metric Labeling and Markov Random Field MAP Estimation

    Pradeep Ravikumar, John Lafferty
    Pradeep and John introduce QP relaxations for the problem of finding
    the best joint labelling of a set of points (connected by a weighted
    graph and with a known metric cost between labels), and extend it to
    the non-metric case. Surprisingly, they show that the QP relaxation
    is both computationally more attractive and more accurate than
    the “natural” LP relaxation or loopy BP approximations.
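
To see what Marina's result means in practice, here is a purely illustrative experiment: run K-means from several random initializations on easy synthetic data, and whenever two runs both reach a low objective, their clusterings agree almost perfectly. The data, K, and agreement measure are all invented for illustration; this is not her analysis, only the phenomenon it explains.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    # Illustration only: restarts of K-means that reach a low objective tend to
    # produce nearly identical clusterings. Data and parameters are invented.

    rng = np.random.default_rng(2)
    centers = rng.normal(scale=10.0, size=(4, 5))
    X = np.vstack([c + rng.normal(size=(100, 5)) for c in centers])

    runs = [KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X) for seed in range(5)]
    best = min(runs, key=lambda km: km.inertia_)
    for km in runs:
        # A low objective (inertia) should go with high agreement with the best run.
        print(round(km.inertia_, 1), round(adjusted_rand_score(best.labels_, km.labels_), 3))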


ICML papers

Here are some ICML papers which interested me.

  1. Arindam Banerjee had a paper which notes that PAC-Bayes bounds, a core theorem in online learning, and statements about the optimality of Bayesian learning all share a core inequality in their proofs.
  2. Pieter Abbeel, Morgan Quigley and Andrew Y. Ng have a paper discussing RL techniques for learning given a bad (but not too bad) model of the world.
  3. Nina Balcan and Avrim Blum have a paper which discusses how to learn given a similarity function rather than a kernel. A similarity function requires less structure than a kernel, implying that a learning algorithm using a similarity function might be applied in situations where no effective kernel is evident. (A toy sketch appears below.)
  4. Nathan Ratliff, Drew Bagnell, and Marty Zinkevich have a paper describing an algorithm which attempts to fuse A* path planning with learning of transition costs based on human demonstration.

Papers (2), (3), and (4) all seem like an initial pass at solving interesting problems which push the domain in which learning is applicable.
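
For paper (3), here is a toy sketch of one way to use a similarity function directly, loosely in the spirit of the Balcan-Blum construction: represent each example by its similarities to a few landmark examples and learn a plain linear classifier on top. The similarity function, number of landmarks, and data below are all invented for illustration, and this is much cruder than what the paper actually analyzes.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy sketch: similarities to landmark examples as features, then a linear
    # classifier. Everything here is invented for illustration.

    def similarity(a, b):
        # Any bounded similarity works; it need not be a positive semidefinite kernel.
        return 1.0 / (1.0 + np.abs(a - b).sum())

    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 10))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    landmarks = X[rng.choice(len(X), size=20, replace=False)]
    features = np.array([[similarity(x, l) for l in landmarks] for x in X])

    clf = LogisticRegression(max_iter=1000).fit(features, y)
    print("training accuracy:", clf.score(features, y))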

I’d like to encourage discussion of what papers interested you and why. Maybe we’ll all learn a little bit, and it’s very likely that we all missed interesting papers in a multitrack conference.

Presentation of Proofs is Hard.

When presenting part of the Reinforcement Learning theory tutorial at ICML 2006, I was forcibly reminded of this.

There are several difficulties.

  1. When creating the presentation, the correct level of detail is tricky. With too much detail, the proof takes too much time and people may be lost to boredom. With too little detail, the steps of the proof involve too great a jump. This is very difficult to judge.
    1. What may be an easy step in the careful thought of a quiet room is not so easy when you are occupied by the process of presentation.
    2. What may be easy after having gone over this (and other) proofs is not so easy to follow in the first pass by a viewer.

    These problems seem correctable only by a process of repeated test-and-revise.

  2. When presenting the proof, simply speaking with sufficient precision is substantially harder than in normal conversation (where precision is not so critical). Practice can help here.
  3. When presenting the proof, going at the right pace for understanding is difficult. When we use a blackboard/whiteboard, a natural reasonable pace is imposed by the process of writing. Unfortunately, writing doesn’t scale well to large audiences for vision reasons, losing this natural pacing mechanism.
  4. It is difficult to entertain with a proof—there is nothing particularly funny about it. This particularly matters for a large audience which tends to naturally develop an expectation of being entertained.

Given all these difficulties, it is very tempting to avoid presenting proofs. Avoiding the proof in any serious detail is fairly reasonable in a conference presentation—the time is too short and the people viewing are too heavily overloaded to follow the logic well. The “right” level of detail is often the theorem statement.

Nevertheless, avoidance is not always possible because the proof is one of the more powerful mechanisms we have for doing research.