ICML papers

Here are some ICML papers which interested me.

  1. Arindam Banerjee had a paper which notes that PAC-Bayes bounds, a core theorem in online learning, and statements about the optimality of Bayesian learning all share a core inequality in their proofs (sketched just after this list).
  2. Pieter Abbeel, Morgan Quigley and Andrew Y. Ng have a paper discussing RL techniques for learning given a bad (but not too bad) model of the world.
  3. Nina Balcan and Avrim Blum have a paper which discusses how to learn given a similarity function rather than a kernel. A similarity function requires less structure than a kernel, implying that a learning algorithm using a similarity function might be applied in situations where no effective kernel is evident.
  4. Nathan Ratliff, Drew Bagnell, and Marty Zinkevich have a paper describing an algorithm which attempts to fuse A* path planning with learning of transition costs based on human demonstration.
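
For reference, the core inequality at issue is (if I recall Banerjee’s framing correctly) the change-of-measure or “compression” inequality, which is the standard opening move of PAC-Bayes proofs: for any “prior” P and “posterior” Q over classifiers and any measurable function φ,

```latex
% Change-of-measure (Donsker-Varadhan / compression) inequality:
\mathbb{E}_{h \sim Q}\,[\phi(h)]
  \;\le\;
  \mathrm{KL}(Q \,\|\, P) + \ln \mathbb{E}_{h \sim P}\!\left[ e^{\phi(h)} \right]
```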

Papers (2), (3), and (4) all seem like initial passes at solving interesting problems which push the domain in which learning is applicable.

I’d like to encourage discussion of what papers interested you and why. Maybe we’ll all learn a little bit, and it’s very likely that we all missed interesting papers in a multitrack conference.

Online convex optimization at COLT

At ICML 2003, Marty Zinkevich proposed the online convex optimization setting and showed that a particular gradient descent algorithm has regret O(T^0.5) with respect to the best predictor, where T is the number of rounds. This seems to be a nice model for online learning, and there has been some significant follow-up work.
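
To make the setting concrete, here is a minimal sketch of the projected gradient update Zinkevich analyzed; the interface, function names, and ball-shaped feasible set are my own illustrative choices, not the paper’s:

```python
import numpy as np

def project_to_ball(x, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def online_gradient_descent(loss_grads, dim, radius=1.0):
    """Sketch of projected online gradient descent.

    loss_grads: an iterable of functions, one per round, each mapping
    the current point to the gradient of that round's convex loss.
    With step size eta_t = 1/sqrt(t), regret against the best fixed
    point in the ball grows as O(sqrt(T)), with constants depending on
    the set diameter and a bound on the gradient norms.
    """
    x = np.zeros(dim)
    for t, grad in enumerate(loss_grads, start=1):
        g = grad(x)  # the adversary reveals this round's loss; we use its gradient
        x = project_to_ball(x - g / np.sqrt(t), radius)
        yield x
```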

At COLT 2006, Elad Hazan, Adam Kalai, Satyen Kale, and Amit Agarwal presented a modification which takes a Newton step, guaranteeing O(log T) regret when the first and second derivatives are bounded. They then applied these algorithms to portfolio management at ICML 2006 (with Robert Schapire), yielding some very fun graphs.
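
For intuition, a rough sketch of the Online Newton Step update follows. The projection back onto the feasible set (done in the norm induced by A_t in the actual algorithm) is omitted here, so treat this as an illustration rather than a faithful implementation:

```python
import numpy as np

def online_newton_step(loss_grads, dim, gamma=0.5, eps=1.0):
    """Rough sketch of the Online Newton Step update.

    Accumulates A_t = eps*I + sum_s g_s g_s^T from observed gradients
    and steps along A_t^{-1} g_t: a Newton-like step computed from first
    derivatives only. Omitted: the generalized projection onto the
    feasible set, which the real algorithm performs in the A_t-norm.
    """
    x = np.zeros(dim)
    A = eps * np.eye(dim)
    for grad in loss_grads:
        g = grad(x)
        A += np.outer(g, g)  # rank-one update of the curvature estimate
        x = x - np.linalg.solve(A, g) / gamma
        yield x
```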

Bounds greater than 1

Nati Srebro and Shai Ben-David have a paper at COLT which, in the appendix, proves something very striking: several previous error bounds are always greater than 1.

Background: One branch of learning theory focuses on theorems which

  1. Assume samples are drawn IID from an unknown distribution D.
  2. Fix a set of classifiers.
  3. Find a high probability bound on the maximum true error rate (with respect to D) as a function of the empirical error rate on the training set (a simple example follows the list).
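
For concreteness, the simplest instance of such a theorem (a standard Occam/Hoeffding bound, not one from the paper under discussion) applies when the set H of classifiers is finite: with probability at least 1 − δ over the draw of m IID samples,

```latex
\forall h \in H:\quad
e_D(h) \;\le\; \hat{e}(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}
```

where e_D(h) is the true error rate under D and ê(h) the empirical error rate on the training set. The bounds at issue here replace ln|H| with subtler capacity measures, which is where the complexity enters.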

Many of these bounds become extremely complex and hairy.

Current: Everyone working on this subject wants “tighter bounds”; however, there are different definitions of “tighter”. Some groups focus on “functional tightness” (getting the right functional dependency between the size of the training set and a parameterization of the hypothesis space) while others focus on “practical tightness” (finding bounds which work well on practical problems). (I am definitely in the second camp.)

One of the dangers of striving for “functional tightness” is that the bound can depend on strangely interrelated parameters. Apparently these interrelations can become so complex that the resulting bounds end up always larger than 1 (some bounds here and here).

It seems we should ask the question: “Why are we doing the math?” If it is done just to get a paper past review, perhaps this is unsatisfying. The real value of the math comes when it guides us in designing learning algorithms, and a bound which is always greater than 1 is a dangerously weak guide for algorithm design.

There is a significant danger in taking this “oops” too strongly.

  1. There exist some reasonable arguments (not made here) for aiming at functional tightness.
  2. The value of the research a person does is more related to the best they have done than the worst.

Mad (Neuro)science

One of the questions facing machine learning as a field is “Can we produce a generalized learning system that can solve a wide array of standard learning problems?” The answer is trivial: “yes, just have children”.

Of course, that wasn’t really the question. The refined question is “Are there simple-to-implement generalized learning systems that can solve a wide array of standard learning problems?” The answer to this is less clear. The ability of animals (and people) to learn might be due to megabytes encoded in the DNA. If this algorithmic complexity is necessary to solve machine learning, the field faces a daunting task in replicating it on a computer.

This observation suggests a possibility: if you can show that only a few bits of DNA are needed for learning in animals, then this provides evidence that machine learning (as a field) has a hope of big success with relatively little effort.

It is well known that specific portions of the brain have specific functionality across individuals. There are two ways this observation can be explained:

  1. Maybe the specific functionality areas are encoded in the DNA.
  2. Maybe the specific functionality areas arise from the learning process of the brain. This is the answer that machine learning would like to hear because it agrees with the hypothesis that a simple general learning system exists.

It’s important to realize that these choices actually specify a spectrum rather than a dichotomy. There are surely some problem-specific learning hacks in the brain and there is surely some generalized learning ability. The question is: to what degree is learning encoded by genetic heritage vs personal experience?

It is anecdotally well known that people (especially children) can recover from fairly severe brain damage, but of course we would prefer to avoid anecdotal evidence.

There are also neuroscience experiments addressing this question. This paper by Jitendra Sharma, Alessandra Angelucci, and Mriganka Sur provides some evidence. In a nutshell, they rewire the optic nerve of ferrets into the auditory region of the brain and observe that structures similar to the vision-specific regions of the brain arise in the auditory region after rewiring (although the new regions may be less capable).

There are doubtless many other experiments addressing this question, but my knowledge of neuroscience is lacking. (Thanks to Maneesh for pointing this one out.)

“Structural” Learning

Fernando Pereira pointed out Ando and Zhang’s paper on “structural” learning. Structural learning is multitask learning on subproblems created from unlabeled data.

The basic idea is to look at the unlabeled data and create many supervised problems from it. On text data, which they test on, these subproblems might be of the form “given the surrounding words, predict the middle word”. The hope is that successful prediction on these subproblems is relevant to prediction on the core problem.
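
A minimal sketch of how such subproblems can be manufactured from raw text follows; the function name and windowing choices are my own illustration, not Ando and Zhang’s construction:

```python
def make_auxiliary_problems(sentences, window=2):
    """Manufacture "predict the middle word" tasks from unlabeled text.

    Each example pairs the surrounding context words (features) with the
    held-out middle word (label). No human labeling is required, which is
    the point: many such free subproblems can then be trained jointly
    with the real task through a shared representation.
    """
    examples = []
    for words in sentences:
        for i in range(window, len(words) - window):
            context = words[i - window:i] + words[i + 1:i + window + 1]
            examples.append((context, words[i]))
    return examples

# make_auxiliary_problems([["the", "cat", "sat", "on", "the", "mat"]])
# -> [(['the', 'cat', 'on', 'the'], 'sat'), (['cat', 'sat', 'the', 'mat'], 'on')]
```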

In the long run, the precise mechanism used (essentially, linear predictors with parameters tied by a common matrix) and the precise problems formed may not be critical. What seems critical is that the hope is realized: the technique provides a significant edge in practice.

Some basic questions about this approach are:

  1. Are there effective automated mechanisms for creating the subproblems?
  2. Is it necessary to use a shared representation?