Interesting papers, ICML 2008

Here are some papers from ICML 2008 that I found interesting.

  1. Risi Kondor and Karsten Borgwardt, The Skew Spectrum of Graphs. This paper is about a new family of functions on graphs which is invariant under node label permutation. They show that these quantities appear to yield good features for learning.
  2. Sanjoy Dasgupta and Daniel Hsu. Hierarchical sampling for active learning. This is the first published practical consistent active learning algorithm. The abstract is also pretty impressive.
  3. Lihong Li, Michael Littman, and Thomas Walsh Knows What It Knows: A Framework For Self-Aware Learning. This is an attempt to create learning algorithms that know when they err, (other work includes Vovk). It’s not yet clear to me what the right model for feature-dependent confidence intervals is.
  4. Novi Quadrianto, Alex Smola, TIberio Caetano, and Quoc Viet Le Estimating Labels from Label Proportions. This is an example of learning in a specialization of the offline contextual bandit setting.
  5. Filip Radlinski, Robert Kleinberg and Thorsten Joachims
    Learning Diverse Rankings with Multi-Armed Bandits. Learning should be used to solve the diversity problem, and doing it in an online bandit-like setting is quite natural. I believe the setting can be generalized to a setting with features without too much work.
  6. Rich Caruana, Nikos Karampatziakis, Ainur Yessenalina An Empirical Evaluation of Supervised Learning in High Dimensions. This paper doesn’t need an abstract given the title :). I hadn’t previously appreciated how well a random forest works in high dimensions.
  7. Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Efficient Bandit Algorithms for Online Multiclass Prediction. A paper about an online contextual bandit setting specialized to a multiclass realizable yet otherwise adversarial setting that yields a practical algorithm.

I’d like to add that I thought the conference organization (and the colocation with COLT and UAI) are particularly well done, the best I’ve seen. The key seems to be tight integration of the colocating conference programs and hordes of local volunteers making sure everything is working. I was also happy to see a 10 years award for best paper 10 years ago.

To Dual or Not

Yoram and Shai‘s online learning tutorial at ICML brings up a question for me, “Why use the dual?”

The basic setting is learning a weight vector wi so that the function f(x)= sumi wi xi optimizes some convex loss function.

The functional view of the dual is that instead of (or in addition to) keeping track of wi over the feature space, you keep track of a vector aj over the examples and define wi = sumj aj xji.

The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful.

  1. Noise When doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful in dealing with noisy labels.
  2. Rates One annoyance of working in the primal space with a gradient descent style algorithm is that finding the right learning rate can be a bit tricky.
  3. Kernelization Instead of explicitly computing the weight vector, the representation of the solution can be left in terms of ai, which is handy when the weight vector is only implicitly defined.

A substantial drawback of dual analysis is that it doesn’t seem to apply well in nonconvex situations. There is a reasonable (but not bullet-proof) argument that learning over nonconvex functions is unavoidable in solving some problems.

More Presentation Preparation

We’ve discussed presentation preparation before, but I have one more thing to add: transitioning. For a research presentation, it is substantially helpful for the audience if transitions are clear. A common outline for a research presentation in machine leanring is:

  1. The problem. Presentations which don’t describe the problem almost immediately lose people, because the context is missing to understand the detail.
  2. Prior relevant work. In many cases, a paper builds on some previous bit of work which must be understood in order to understand what the paper does. A common failure mode seems to be spending too much time on prior work. Discuss just the relevant aspects of prior work in the language of your work. Sometimes this is missing when unneeded.
  3. What we did. For theory papers in particular, it is often not possible to really cover the details. Prioritizing what you present can be very important.
  4. How it worked. Many papers in Machine Learning have some sort of experimental test of the algorithm. Sometimes this is missing when the work is theoretical.

What seems to often happen, is that there is no transitioning in the presentation. This can happen in one of two ways:

  1. Content Confusion. Sometimes the problem description is merged into (2), and (3). Sometimes (2) and (3) are merged. When this happens, it can be very difficult to follow. The solution is to rewrite to isolate the presentation components.
  2. Untransition. Sometimes the presentation does have a reasonable structure as above, but there are just no transitions in the delivery, creating apparent content confusion. This is easy to fix. An approach I often use is to just have an outline slide with the next subject highlighted between pieces of the transition. The delivery of the presentation can also handle this well. For example, have an extra long pause after stating the problem and check to see if the audience has questions.

Proprietary Data in Academic Research?

Should results of experiments on proprietary datasets be in the academic research literature?

The arguments I can imagine in the “against” column are:

  1. Experiments are not repeatable. Repeatability in experiments is essential to science because it allows others to compare new methods with old and discover which is better.
  2. It’s unfair. Academics who don’t have insider access to proprietary data are at a substantial disadvantage when competing with others who do.

I’m unsympathetic to argument (2). To me, it looks like their are simply some resource constraints, and these should not prevent research progress. For example, we wouldn’t prevent publishing about particle accelerator experiments by physicists at CERN because physicists at CMU couldn’t run their own experiments.

Argument (1) seems like a real issue.

The argument for is:

  1. Yes, they are another form of evidence that an algorithm is good. The degree to which they are evidence is less than for publicly repeatable experiments, but greater than nothing.
  2. What if research can only be done in a proprietary setting? It has to be good for society at large to know what works.
  3. Consider the game theory perspective. For example, suppose ICML decides to reject all papers with experiments on proprietary datasets. And suppose KDD decides to consider them as weak evidence. The long term result may be that beginning research on new topics which is only really doable in companies starts and then grows at KDD.

I consider the arguments for to be stronger than the arguments against, but I’m aware others have other beliefs. I think it would be good to have a policy statement from machine learning conferences in their call for papers, as trends suggest this becoming a more serious problem in the mid-term future.

ICML has a comment system

Mark Reid has stepped up and created a comment system for ICML papers which Greger Linden has tightly integrated.

My understanding is that Mark spent quite a bit of time on the details, and there are some cool features like working latex math mode. This is an excellent chance for the ICML community to experiment with making ICML year-round, so I hope it works out. Please do consider experimenting with it.