Interesting Papers at ICML 2007

Here are a few of the papers I enjoyed at ICML.

  1. Steffen Bickel, Michael Brüeckner, Tobias Scheffer, Discriminative Learning for Differing Training and Test Distributions There is a nice trick in this paper: they predict the probability that an unlabeled sample is in the training set vs. the test set, and then use this prediction to importance weight labeled samples in the training set. This paper uses a specific parametric model, but the approach is easily generalized.
  2. Steve Hanneke A Bound on the Label Complexity of Agnostic Active Learning This paper bounds the number of labels required by the A2 algorithm for active learning in the agnostic case. Last year we figured out agnostic active learning was possible. This year, it’s quantified. Hopefull soon, it will be practical.
  3. Sylvian Gelly, David Silver Combining Online and Offline Knowledge in UCT. This paper is about techniques for improving MoGo with various sorts of learning. MoGo has a fair claim at being the world’s best Go algorithm.

There were also a large number of online learning papers this year, especially if you count papers which use online learning techniques for optimization on batch datasets (as I do). This is expected, because larger datasets are becoming more common, and online learning makes more sense the larger the dataset. Many of these papers are of interest if your goal is learning fast while others are about extending online learning into new domains.

(Feel free to add any other papers of interest in the comments.)

Machine Learning Jobs are Growing on Trees

The consensus of several discussions at ICML is that the number of jobs for people knowing machine learning well substantially exceeds supply. This is my experience as well. Demand comes from many places, but I’ve seen particularly strong demand from trading companies and internet startups.

Like all interest bursts, this one will probably pass because of economic recession or other distractions. Nevertheless, the general outlook for machine learning in business seems to be good. Machine learning is all about optimization when there is uncertainty and lots of data. The quantity of data available is growing quickly as computer-run processes and sensors become more common, and the quality of the data is dropping since there is little editorial control in it’s collection. Machine Learning is a difficult subject to master (*), so those who do should remain in demand over the long term.

(*) In fact, it would be reasonable to claim that no one has mastered it—there are just some people who know a bit more than others.

Presentation Preparation

A big part of doing research is presenting it at a conference. Since many people start out shy of public presentations, this can be a substantial challenge. Here are a few notes which might be helpful when thinking about preparing a presentation on research.

  1. Motivate. Talks which don’t start by describing the problem to solve cause many people to zone out.
  2. Prioritize. It is typical that you have more things to say than time to say them, and many presenters fall into the failure mode of trying to say too much. This is an easy-to-understand failure mode as it’s very natural to want to include everything. A basic fact is: you can’t. Example of this are:
    1. Your slides are so densely full of equations and words that you can’t cover them.
    2. Your talk runs over and a moderator prioritizes for you by cutting you off.
    3. You motor-mouth through the presentation, and the information absorption rate of the audience prioritizes in some uncontrolled fashion.
    4. The rate of flow of concepts simply exceeds the information capacity of the audience. Even with nondense slides and an easy succinct delivery, this can often happen.

    One way to prioritize is figure out: “What would I present in 1 minute?” or “What would I present in 5 minutes?”, and then let this guide your overall presentation.

  3. Unassume. When you are working in an area, it’s typical to buildup an internal shorthand for concepts. This needs to be peeled away when preparing a presentation. Decide what the minimal set of concepts are, and then be sure to define them as they are introduced. For people familiar with the basic concepts, this gives them a way to reconcile choices of language, and others at least have a prayer of following.
  4. Practice Well. Some people try to get a talk right by practicing it relentlessly until it is memorized, and then deliver it as a memorized monologue. This is terrible, because people in the audience know it is a memorized monologue and zone out. A good talk is delivered like a conversation, where it happens to be your turn to speak for awhile, and practicing that is more difficult. Some practice by yourself can be helpful—but not too much. A much better method is to practice on your friends by delivering to them before delivering it to the wider world.

The points above avoid the common failure modes which seem to come up with first-time presenters. There is much more advice to give (and for me to learn) about giving better presentations.

How is Compressed Sensing going to change Machine Learning ?

Compressed Sensing (CS) is a new framework developed by Emmanuel Candes, Terry Tao and David Donoho. To summarize, if you acquire a signal in some basis that is incoherent with the basis in which you know the signal to be sparse in, it is very likely you will be able to reconstruct the signal from these incoherent projections.

Terry Tao, the recent Fields medalist, does a very nice job at explaining the framework here. He goes further in the theory description in this post where he mentions the central issue of the Uniform Uncertainty Principle. It so happens that random projections are on average incoherent, within the UUP meaning, with most known basis (sines, polynomials, splines, wavelets, curvelets …) and are therefore an ideal basis for Compressed Sensing. [ For more in-depth information on the subject, the Rice group has done a very good job at providing a central library of papers relevant to the growing subject: http://www.dsp.ece.rice.edu/cs/ ]

The Machine Learning community has looked at Random Projections before, for instance:

  • Experiments with Random Projections for Machine Learning by Fradkin and Madigan (KDD-03.)
  • Face Recognition Experiments with Random Projection by Goel, Bebis and Nefian
  • Dimensionality reduction by random mapping: Fast similarity computation for clustering by S. Kaski (Proceedings of IEEE International Joint Conference on Neural Networks, 1998.)
  • but while they seem to give somewhat comparable results with regards to PCA, the number of contributions on the subject does not seem overwhelming. Maybe one of the reason is that in most papers cited above, the main theoretical reason for using Random projections lies with the Johnson-Lindenstrauss (JL) lemma. As a consequence, most random matrices used in these publications come from the Database world and not from the newer framework of Compressed Sensing (a list of these matrices and their properties can be found in the middle of this page). The uncanny reliance on Random projections within the JL lemma and in the Compressed Sensing setting was explained by Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin in this paper entitled: A simple proof of the restricted isometry property for random matrices. However, the most interesting fallout from this comparison between JL and CS comes in the form of the contribution by Richard Baraniuk and Michael Wakin in Random projections of smooth manifolds. I’ll let the abstract speak for itself:

    We propose a new approach for nonadaptive dimensionality reduction of manifold-modeled data, demonstrating that a small number of random linear projections can preserve key information about a manifold-modeled signal……As our main theoretical contribution, we establish a sufficient number M of random projections to guarantee that, with high probability, all pairwise Euclidean and geodesic distances between points on M are well-preserved under the mapping \phi. Our results bear strong resemblance to the emerging theory of Compressed Sensing (CS), in which sparse signals can be recovered from small numbers of random linear measurements. As in CS, the random measurements we propose can be used to recover the original data in RN. Moreover, like the fundamental bound in CS, our requisite M is linear in the “information level” K and logarithmic in the ambient dimension N; we also identify a logarithmic dependence on the volume and curvature of the manifold. In addition to recovering faithful approximations to manifold-modeled signals, however, the random projections we propose can also be used to discern key properties about the manifold. We discuss connections and contrasts with existing techniques in manifold learning, a setting where dimensionality reducing mappings are typically nonlinear and constructed adaptively from a set of sampled training data.

    It looks as though, as a result, Universal Dimensionality Reduction is achieved by some properly chosen random projections. In the case of data living in a low dimensional manifold, the JL lemma states that the number of random projections is proportional to the number of points or samples from that manifold, on the other hand, CS seems to show that the number of random projections is proportional to the characteristic of the manifold only.

    The results highlighted by Wakin and Baraniuk are very compelling but there is another appealing reason to Random Projections: Robustness. While trying to mimick nature in the learning process, one cannot but be amazed at the reliability of the biological system. On the other hand, even the researchers that model these processes do not realize or point out that this robustness is in part due to random projections. Case in point, the excellent work of Thomas Serre, Aude Oliva and Tomaso Poggio culminating in a paper describing a biology inspired model of brain that shows its ability to process information in a feedforward fashion. The modeling is new in that, in this area of science, there is a central issue as to whether the brain works in a one-way process or with many loops. In the paper, the feature dimension reduction model (which is what this process is) uses random projections as I pointed out recently.

    Because of the intrinsic dimension reduction capability, Mike Wakin has also shown efficient nearest neighbor searches using few random projections (see figure 3 of this paper). I could go on but the point is that since CS is a revolution in the world of signal/feature acquisition and processing (see the analog-to-information A2I site ) one cannot but wonder aloud how this will affect Machine Learning in general.

    Interesting Papers at COLT 2007

    Here are two papers that seem particularly interesting at this year’s COLT.

    1. Gilles Blanchard and François Fleuret, Occam’s Hammer. When we are interested in very tight bounds on the true error rate of a classifier, it is tempting to use a PAC-Bayes bound which can (empirically) be quite tight. A disadvantage of the PAC-Bayes bound is that it applies to a classifier which is randomized over a set of base classifiers rather than a single classifier. This paper shows that a similar bound can be proved which holds for a single classifier drawn from the set. The ability to safely use a single classifier is very nice. This technique applies generically to any base bound, so it has other applications covered in the paper.
    2. Adam Tauman Kalai. Learning Nested Halfspaces and Uphill Decision Trees. Classification PAC-learning, where you prove that any problem amongst some set is polytime learnable with respect to any distribution over the input X is extraordinarily challenging as judged by lack of progress over a long period of time. This paper is about regression PAC-learning, and the results appear much more encouraging than exist in classification PAC-learning. Under the assumption that:
      1. The level sets of the correct regressed value are halfspaces.
      2. The level sets obey a Lipschitz condition.

      this paper proves that a good regressor can be PAC-learned using a boosting algorithm. (The “uphill decision trees” part of the paper is about one special case where you don’t need the Lipschitz condition.)