One nice use for this blog is to consider and discuss papers that have appeared at recent conferences. I really enjoyed Andrew Ng and Sham Kakade’s paper Online Bounds for Bayesian Algorithms. From the paper:

> The philosophy taken in the Bayesian methodology is often at odds with that in the online learning community…. the online learning setting makes rather minimal assumptions on the conditions under which the data are being presented to the learner – usually, Nature could provide examples in an adversarial manner. We study the performance of Bayesian algorithms in a more adversarial setting… We provide competitive bounds when the cost function is the log loss, and we compare our performance to the best model in our model class (as in the experts setting).

It’s a very nice analysis of some of my favorite algorithms that all hinges around a beautiful theorem:

Let Q be any distribution over parameters theta. Then for all sequences S:

L_{Bayes}(S) ≤ L_Q(S) + KL(Q||P)

where P is our prior and the losses L are: first, log-loss for the Bayes algorithm (run online) and second, expected log-loss with respect to an arbitrary distribution Q.
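To make the bound concrete, here is a small numerical sketch (the model class, prior, and data sequence are all invented for illustration): run Bayes online over a finite set of Bernoulli models, then check the accumulated log-loss against the bound with Q taken as a point mass on the best single model, in which case KL(Q||P) = -log P(theta*).

```python
import math

# Hypothetical finite model class: three Bernoulli experts (illustrative values).
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]

S = [1, 1, 0, 1, 1, 1, 0, 1]  # an arbitrary binary sequence

def bernoulli(theta, x):
    return theta if x == 1 else 1 - theta

# Online Bayes: predict each x with the posterior-mixture probability,
# pay log-loss, then update the posterior.
post = prior[:]
L_bayes = 0.0
for x in S:
    p = sum(w * bernoulli(t, x) for w, t in zip(post, thetas))
    L_bayes += -math.log(p)
    post = [w * bernoulli(t, x) for w, t in zip(post, thetas)]
    Z = sum(post)
    post = [w / Z for w in post]

# Q = point mass on the best single expert, so KL(Q||P) = -log P(theta*).
losses = [sum(-math.log(bernoulli(t, x)) for x in S) for t in thetas]
best = min(range(len(thetas)), key=lambda i: losses[i])
bound = losses[best] + (-math.log(prior[best]))

assert L_bayes <= bound + 1e-9  # the theorem's guarantee, checked numerically
```

The proof of this special case is one line: the Bayes mixture probability of the whole sequence is at least P(theta*) times the probability theta* assigns to it, and taking -log of both sides gives the bound.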

Any thoughts? Any other papers you thought we **have** to read?

I’m curious to hear what you think about the trend towards information theory in learning (the use of KL-divergences etc. for clustering, the information bottleneck method, …)

Hi Suresh,

In some cases, like the online learning paper above, I think it’s quite elegant and exactly the right way to think about things. In other cases, like the information bottleneck, I have mixed feelings. On one hand, it has intuitive appeal and is apparently useful for concocting reasonable algorithms. (I’ve used the maximization of empirical mutual information once myself, so I do appreciate what I’m about to criticize.) On the other hand, I feel like methods like IB are nearly always maximum likelihood for some model. One might argue: who cares – if information-theoretic notions give intuitive ideas of what to optimize, that ought to be enough. But to me the ML-in-a-model view is a more powerful understanding. For instance, if I can write down the model, I can apply a Bayesian integration technique instead of an ML approach for the same model, I can use some approximate method, and I can compare its accuracy against the ML approach. In some ways, the information-theoretic criteria as used in IB (and elsewhere, especially in the EE pattern recognition community) go too fast – they bind up model and algorithm. It’s more useful to keep these separate in my mind.

Again though, this says nothing about perfect uses of information-theoretic concepts like in the online learning paper. I’m all for more of those. (PAC-Bayes, for instance, or the KL form of the Chernoff bound…)

My impression is that the basic theorem was scooped by the earlier STOC 97 paper by Freund, Schapire, Singer, and Warmuth.

I do like working out the various implications for common learning algorithms.

I also tend to have mixed feelings about the information theory. There are several compelling cases where quantities from information theory arise in learning, and there are obvious connections (good predictor = good compressor). I start to wonder when algorithms attempt to directly optimize mutual information or relative entropy, because estimating these quantities requires a number of samples related to the size of the domain (see this paper for discussion of the difficulty of testing independence).
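A quick sketch of that caution (the alphabet size and sample count below are arbitrary illustrative choices): for two genuinely independent uniform variables, the plug-in (maximum-likelihood) estimate of mutual information comes out well above its true value of zero once the domain is large relative to the number of samples.

```python
import math
import random
from collections import Counter

random.seed(0)

# X and Y independent, each uniform on an alphabet of size k: true MI is 0.
k, n = 50, 200  # alphabet deliberately large relative to the sample size
pairs = [(random.randrange(k), random.randrange(k)) for _ in range(n)]

pxy = Counter(pairs)
px = Counter(x for x, _ in pairs)
py = Counter(y for _, y in pairs)

# Plug-in estimate of mutual information (in nats) from empirical frequencies.
mi_hat = sum(
    (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
    for (x, y), c in pxy.items()
)

# Despite independence, the plug-in estimate is biased substantially upward.
print(f"plug-in MI estimate: {mi_hat:.3f} nats (true value: 0)")
```

With these sizes most of the k² = 2500 cells are empty or hold a single sample, so the empirical joint looks far from a product distribution even though the true one is exactly a product.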

Thanks, this is very interesting. I have become interested in the algorithmics of relative entropy (how to deal with it efficiently), and I think these “cautions” are useful for getting a better perspective on the value of information-theoretic notions (since I am not a machine-learning person myself).

I believe that the STOC paper in question only discusses the result for zero/one outputs from the adversary. The generalization is nice because it means we can look at things like linear regression in a Bayesian way as being a good online algorithm. The proof in Ng & Kakade is also particularly elegant in its simplicity as well as its generality. Correct me if I misread the FSSW paper – it was a while ago.

RE: direct optimization of MI/KL – all the applications I’ve seen assume a parametric form that they optimize. This shouldn’t suffer a worse sample-complexity problem than any other maximum-likelihood procedure, no?

The information-theoretic notions turn out to be useful at least for unsupervised learning problems. Information-theoretic formulations are possibly the strongest candidates for lending credibility to unsupervised learning.

At some level, it is correct to think of formulations such as IB as maximum likelihood models. The catch is to figure out what the (sufficient statistic) representation is, and what the modeling assumptions are. An important observation made by Forster and Warmuth connecting log-likelihoods and (Bregman) divergences can be used to show that IB basically runs a mixture of multinomial models. We had some comments on this in our ICML’04 paper.
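For readers who haven’t seen that connection, the standard exponential-family identity behind it can be sketched as follows (the notation here is mine, not taken from the Forster–Warmuth paper):

```latex
% Exponential family with log-partition function \psi:
%   p_\psi(x \mid \theta) = \exp(\langle x, \theta \rangle - \psi(\theta))\, p_0(x).
% With \phi the Legendre conjugate of \psi and mean parameter \mu = \nabla\psi(\theta),
% the negative log-likelihood decomposes through a Bregman divergence:
\[
  -\log p_\psi(x \mid \theta) \;=\; d_\phi(x, \mu) \;-\; \phi(x) \;-\; \log p_0(x),
  \qquad
  d_\phi(x, \mu) \;=\; \phi(x) - \phi(\mu) - \langle x - \mu,\, \nabla\phi(\mu) \rangle .
\]
% Since \phi(x) and \log p_0(x) do not depend on \theta, maximizing likelihood
% is exactly minimizing d_\phi(x, \mu) over the mean parameter; for the
% multinomial family d_\phi is the (unnormalized) KL divergence, which is why
% IB-style KL objectives behave like mixture-of-multinomial maximum likelihood.
```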

Also, I think the good compression = good prediction is a valuable intuition.

The situation is of course more interesting in the lossy-compression or unrealizable case. The reason we do clustering or vector quantization the way we do is that the core lossy-compression problem in information theory does not (yet) have a provable polynomial-time solution. The lossless case gave rise to the Lempel-Ziv class of algorithms; we still don’t know how to extend those results to the non-zero-loss case.

Right. (And these problems can vary between arbitrarily bad and nonexistent, depending on your model.)

Regarding “why information theory is useful” –

Marcus Hutter was showing me a proof that KL-divergence bounds essentially *any* loss function. The proof appears in his Universal Artificial Intelligence book.

Hi Aleks,

Sounds like an interesting formalization of some intuitions in the Bayes/MaxEnt community. Can you give a more specific reference?