To Dual or Not

Yoram and Shai’s online learning tutorial at ICML brings up a question for me: “Why use the dual?”

The basic setting is learning a weight vector w_i so that the function f(x) = sum_i w_i x_i optimizes some convex loss function.

The functional view of the dual is that instead of (or in addition to) keeping track of w_i over the feature space, you keep track of a vector a_j over the examples and define w_i = sum_j a_j x_{j,i}.
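As a toy illustration of this equivalence, here is a minimal Python/numpy sketch (random made-up data and perceptron-style updates, not anything specific from the tutorial): tracking a_j over the examples reconstructs exactly the same weight vector as tracking w_i over the features.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))            # 50 examples x_j, 5 features
y = np.sign(X @ rng.normal(size=5))     # labels from a random linear rule

w = np.zeros(X.shape[1])                # primal bookkeeping: one w_i per feature
a = np.zeros(X.shape[0])                # dual bookkeeping: one a_j per example

for j in range(len(X)):
    if y[j] * (X[j] @ w) <= 0:          # perceptron-style mistake check
        w += y[j] * X[j]                # primal update
        a[j] += y[j]                    # dual update: remember the example instead

# The dual coefficients reconstruct the same weight vector: w_i = sum_j a_j x_{j,i}.
assert np.allclose(w, a @ X)
```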

The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful.

  1. Noise When doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful in dealing with noisy labels, for example by bounding the coefficient a_j that any single example can contribute.
  2. Rates One annoyance of working in the primal space with a gradient descent style algorithm is that finding the right learning rate can be a bit tricky; dual-based derivations can make that choice less ad hoc.
  3. Kernelization Instead of explicitly computing the weight vector, the representation of the solution can be left in terms of a_j, which is handy when the weight vector is only implicitly defined (see the sketch after this list).
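To make point (3) concrete, here is a small hypothetical sketch of a kernel perceptron (the data, RBF kernel, and update rule are just illustrative choices): the weight vector lives in the kernel’s implicit feature space and is never formed, while the solution is kept entirely in terms of the dual coefficients a_j.

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2); the corresponding weight vector lives
    # in an infinite-dimensional feature space and is never written down.
    return np.exp(-gamma * np.sum((u - v) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.sign(np.sum(X ** 2, axis=1) - 2.0)   # a circular concept: no explicit linear w fits

a = np.zeros(len(X))                        # dual coefficients, one per example

def f(x):
    # f(x) = sum_j a_j K(x_j, x): the solution is represented entirely by the a_j.
    return sum(a[j] * rbf_kernel(X[j], x) for j in range(len(X)))

for _ in range(5):                          # a few passes of kernel perceptron updates
    for j in range(len(X)):
        if y[j] * f(X[j]) <= 0:
            a[j] += y[j]
```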

A substantial drawback of dual analysis is that it doesn’t seem to apply well in nonconvex situations. There is a reasonable (but not bullet-proof) argument that learning over nonconvex functions is unavoidable in solving some problems.

More Presentation Preparation

We’ve discussed presentation preparation before, but I have one more thing to add: transitioning. For a research presentation, it is substantially helpful for the audience if transitions are clear. A common outline for a research presentation in machine learning is:

  1. The problem. Presentations which don’t describe the problem almost immediately lose people, because the context is missing to understand the detail.
  2. Prior relevant work. In many cases, a paper builds on some previous bit of work which must be understood in order to understand what the paper does. A common failure mode seems to be spending too much time on prior work; discuss just the relevant aspects of prior work in the language of your work. Sometimes this step can be skipped when it isn’t needed.
  3. What we did. For theory papers in particular, it is often not possible to really cover the details. Prioritizing what you present can be very important.
  4. How it worked. Many papers in Machine Learning have some sort of experimental test of the algorithm. Sometimes this is missing when the work is theoretical.

What often seems to happen is that there is no transitioning in the presentation. This can happen in one of two ways:

  1. Content Confusion. Sometimes the problem description is merged into (2) and (3). Sometimes (2) and (3) are merged. When this happens, it can be very difficult to follow. The solution is to rewrite so as to isolate the presentation components.
  2. Untransition. Sometimes the presentation does have a reasonable structure as above, but there are just no transitions in the delivery, creating apparent content confusion. This is easy to fix. An approach I often use is to have an outline slide with the next subject highlighted between pieces of the presentation. The delivery can also handle this well; for example, have an extra long pause after stating the problem and check to see if the audience has questions.

Proprietary Data in Academic Research?

Should results of experiments on proprietary datasets be in the academic research literature?

The arguments I can imagine in the “against” column are:

  1. Experiments are not repeatable. Repeatability in experiments is essential to science because it allows others to compare new methods with old and discover which is better.
  2. It’s unfair. Academics who don’t have insider access to proprietary data are at a substantial disadvantage when competing with others who do.

I’m unsympathetic to argument (2). To me, it looks like there are simply some resource constraints, and these should not prevent research progress. For example, we wouldn’t prevent publishing about particle accelerator experiments by physicists at CERN because physicists at CMU couldn’t run their own experiments.

Argument (1) seems like a real issue.

The arguments for are:

  1. Yes, they are another form of evidence that an algorithm is good. The degree to which they are evidence is less than for publicly repeatable experiments, but greater than nothing.
  2. What if research can only be done in a proprietary setting? It has to be good for society at large to know what works.
  3. Consider the game theory perspective. For example, suppose ICML decides to reject all papers with experiments on proprietary datasets, while KDD decides to consider them as weak evidence. The long-term result may be that early research on new topics which is only really doable inside companies starts at KDD and then grows there.

I consider the arguments for to be stronger than the arguments against, but I’m aware others have other beliefs. I think it would be good to have a policy statement from machine learning conferences in their calls for papers, as trends suggest this will become a more serious issue in the mid-term future.

ICML has a comment system

Mark Reid has stepped up and created a comment system for ICML papers which Greger Linden has tightly integrated.

My understanding is that Mark spent quite a bit of time on the details, and there are some cool features like a working LaTeX math mode. This is an excellent chance for the ICML community to experiment with making ICML year-round, so I hope it works out. Please do consider experimenting with it.

The Minimum Sample Complexity of Importance Weighting

This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time.

The basic trick has to do with importance weighting for Monte Carlo integration. Consider the problem of finding:
N = E_{x ~ D} f(x)
given samples from D and knowledge of f.

Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q. In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula:
E_{x ~ Q} f(x) D(x)/Q(x)
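Here is a minimal sanity check of this formula on a made-up discrete example (the particular D, Q, and f below are arbitrary): the importance weighted average over samples from Q matches E_{x ~ D} f(x) up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)

xs = np.arange(5)                         # a tiny discrete domain
D = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # the distribution we care about
Q = np.array([0.3, 0.2, 0.2, 0.2, 0.1])   # the distribution we can sample from
f = np.array([1.0, 2.0, 0.5, 3.0, 1.5])   # the function being integrated

N_true = np.sum(D * f)                    # E_{x ~ D} f(x), computable exactly here

samples = rng.choice(xs, size=100_000, p=Q)
N_hat = np.mean(f[samples] * D[samples] / Q[samples])

print(N_true, N_hat)                      # agree up to Monte Carlo error
```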

A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions.
Nevertheless, there is one special value Q(x) = f(x) D(x) / N (a valid distribution when f is nonnegative) for which the sample complexity turns out to be 1: every sample then has weighted value f(x) D(x)/Q(x) = N exactly, so a single draw gives the answer. This is typically substantially better than the sample complexity of the original problem.

This observation underlies the motivation for various importance weighting algorithms. Even under pretty terrible approximations, the logic of “Q(x) is something like f(x) D(x)” often yields substantial improvements over sampling directly from D(x).
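To put numbers on both claims, here is a toy sketch reusing the made-up D and f from the check above (f is nonnegative, so f(x) D(x) / N is a valid distribution): the special choice of Q makes the estimator’s spread essentially zero, and even a deliberately crude approximation of it beats sampling from D directly.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.arange(5)
D = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
f = np.array([1.0, 2.0, 0.5, 3.0, 1.5])   # nonnegative, so f * D / N sums to 1
N = np.sum(D * f)

def estimator_spread(Q, n=1000, trials=200):
    # Standard deviation of the importance weighted estimate across repeated runs.
    estimates = []
    for _ in range(trials):
        s = rng.choice(xs, size=n, p=Q)
        estimates.append(np.mean(f[s] * D[s] / Q[s]))
    return np.std(estimates)

Q_naive = D                    # ordinary Monte Carlo: sample from D itself
Q_best = f * D / N             # the special choice: every weighted sample equals N exactly
Q_rough = Q_best + 0.1         # a deliberately crude approximation to the special choice
Q_rough = Q_rough / Q_rough.sum()

print(estimator_spread(Q_naive))   # baseline spread
print(estimator_spread(Q_best))    # ~0: one sample already recovers N
print(estimator_spread(Q_rough))   # well below the baseline despite the crude Q
```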