To Dual or Not

Yoram and Shai’s online learning tutorial at ICML brings up a question for me: “Why use the dual?”

The basic setting is learning a weight vector w_i so that the function f(x) = sum_i w_i x_i optimizes some convex loss function.

The functional view of the dual is that instead of (or in addition to) keeping track of w_i over the feature space, you keep track of a vector a_j over the examples and define w_i = sum_j a_j x_{j,i}.
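As a toy illustration of this equivalence, here is a minimal Python/numpy sketch (random made-up data and perceptron-style updates, not anything specific from the tutorial): tracking a_j over the examples reconstructs exactly the same weight vector as tracking w_i over the features.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))            # 50 examples x_j, 5 features
y = np.sign(X @ rng.normal(size=5))     # labels from a random linear rule

w = np.zeros(X.shape[1])                # primal bookkeeping: one w_i per feature
a = np.zeros(X.shape[0])                # dual bookkeeping: one a_j per example

for j in range(len(X)):
    if y[j] * (X[j] @ w) <= 0:          # perceptron-style mistake check
        w += y[j] * X[j]                # primal update
        a[j] += y[j]                    # dual update: remember the example instead

# The dual coefficients reconstruct the same weight vector: w_i = sum_j a_j x_{j,i}.
assert np.allclose(w, a @ X)
```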

The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful.

  1. Noise When doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful in dealing with noisy labels, for example by bounding the coefficient a_j that any single example can contribute.
  2. Rates One annoyance of working in the primal space with a gradient descent style algorithm is that finding the right learning rate can be a bit tricky; dual-based derivations can make that choice less ad hoc.
  3. Kernelization Instead of explicitly computing the weight vector, the representation of the solution can be left in terms of a_j, which is handy when the weight vector is only implicitly defined (see the sketch after this list).
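To make point (3) concrete, here is a small hypothetical sketch of a kernel perceptron (the data, RBF kernel, and update rule are just illustrative choices): the weight vector lives in the kernel’s implicit feature space and is never formed, while the solution is kept entirely in terms of the dual coefficients a_j.

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2); the corresponding weight vector lives
    # in an infinite-dimensional feature space and is never written down.
    return np.exp(-gamma * np.sum((u - v) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.sign(np.sum(X ** 2, axis=1) - 2.0)   # a circular concept: no explicit linear w fits

a = np.zeros(len(X))                        # dual coefficients, one per example

def f(x):
    # f(x) = sum_j a_j K(x_j, x): the solution is represented entirely by the a_j.
    return sum(a[j] * rbf_kernel(X[j], x) for j in range(len(X)))

for _ in range(5):                          # a few passes of kernel perceptron updates
    for j in range(len(X)):
        if y[j] * f(X[j]) <= 0:
            a[j] += y[j]
```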

A substantial drawback of dual analysis is that it doesn’t seem to apply well in nonconvex situations. There is a reasonable (but not bullet-proof) argument that learning over nonconvex functions is unavoidable in solving some problems.

More Presentation Preparation

We’ve discussed presentation preparation before, but I have one more thing to add: transitioning. For a research presentation, it is substantially helpful for the audience if transitions are clear. A common outline for a research presentation in machine learning is:

  1. The problem. Presentations which don’t describe the problem almost immediately lose people, because the context is missing to understand the detail.
  2. Prior relevant work. In many cases, a paper builds on some previous bit of work which must be understood in order to understand what the paper does. A common failure mode seems to be spending too much time on prior work; discuss just the relevant aspects of prior work in the language of your work. Sometimes this step can be skipped when it isn’t needed.
  3. What we did. For theory papers in particular, it is often not possible to really cover the details. Prioritizing what you present can be very important.
  4. How it worked. Many papers in Machine Learning have some sort of experimental test of the algorithm. Sometimes this is missing when the work is theoretical.

What often seems to happen is that there is no transitioning in the presentation. This can happen in one of two ways:

  1. Content Confusion. Sometimes the problem description is merged into (2) and (3). Sometimes (2) and (3) are merged. When this happens, it can be very difficult to follow. The solution is to rewrite so as to isolate the presentation components.
  2. Untransition. Sometimes the presentation does have a reasonable structure as above, but there are just no transitions in the delivery, creating apparent content confusion. This is easy to fix. An approach I often use is to have an outline slide with the next subject highlighted between pieces of the presentation. The delivery can also handle this well; for example, have an extra long pause after stating the problem and check to see if the audience has questions.

Proprietary Data in Academic Research?

Should results of experiments on proprietary datasets be in the academic research literature?

The arguments I can imagine in the “against” column are:

  1. Experiments are not repeatable. Repeatability in experiments is essential to science because it allows others to compare new methods with old and discover which is better.
  2. It’s unfair. Academics who don’t have insider access to proprietary data are at a substantial disadvantage when competing with others who do.

I’m unsympathetic to argument (2). To me, it looks like there are simply some resource constraints, and these should not prevent research progress. For example, we wouldn’t prevent publishing about particle accelerator experiments by physicists at CERN because physicists at CMU couldn’t run their own experiments.

Argument (1) seems like a real issue.

The arguments for are:

  1. Yes, they are another form of evidence that an algorithm is good. The degree to which they are evidence is less than for publicly repeatable experiments, but greater than nothing.
  2. What if research can only be done in a proprietary setting? It has to be good for society at large to know what works.
  3. Consider the game theory perspective. For example, suppose ICML decides to reject all papers with experiments on proprietary datasets, while KDD decides to consider them as weak evidence. The long-term result may be that early research on new topics which is only really doable inside companies starts at KDD and then grows there.

I consider the arguments for to be stronger than the arguments against, but I’m aware others have other beliefs. I think it would be good to have a policy statement from machine learning conferences in their calls for papers, as trends suggest this will become a more serious issue in the mid-term future.

ICML has a comment system

Mark Reid has stepped up and created a comment system for ICML papers which Greger Linden has tightly integrated.

My understanding is that Mark spent quite a bit of time on the details, and there are some cool features like a working LaTeX math mode. This is an excellent chance for the ICML community to experiment with making ICML year-round, so I hope it works out. Please do consider experimenting with it.

The Minimum Sample Complexity of Importance Weighting

This post is about a trick that I learned from Dale Schuurmans which has been repeatedly useful for me over time.

The basic trick has to do with importance weighting for Monte Carlo integration. Consider the problem of finding:
N = E_{x ~ D} f(x)
given samples from D and knowledge of f.

Often, we don’t have samples from D available. Instead, we must make do with samples from some other distribution Q. In that case, we can still often solve the problem, as long as Q(x) isn’t 0 when D(x) is nonzero, using the importance weighting formula:
E_{x ~ Q} f(x) D(x)/Q(x)
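Here is a minimal sanity check of this formula on a made-up discrete example (the particular D, Q, and f below are arbitrary): the importance weighted average over samples from Q matches E_{x ~ D} f(x) up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)

xs = np.arange(5)                         # a tiny discrete domain
D = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # the distribution we care about
Q = np.array([0.3, 0.2, 0.2, 0.2, 0.1])   # the distribution we can sample from
f = np.array([1.0, 2.0, 0.5, 3.0, 1.5])   # the function being integrated

N_true = np.sum(D * f)                    # E_{x ~ D} f(x), computable exactly here

samples = rng.choice(xs, size=100_000, p=Q)
N_hat = np.mean(f[samples] * D[samples] / Q[samples])

print(N_true, N_hat)                      # agree up to Monte Carlo error
```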

A basic question is: How many samples from Q are required in order to estimate N to some precision? In general the convergence rate is not bounded, because f(x) D(x)/Q(x) is not bounded given the assumptions.
Nevertheless, there is one special value Q(x) = f(x) D(x) / N (a valid distribution when f is nonnegative) for which the sample complexity turns out to be 1: every sample then has weighted value f(x) D(x)/Q(x) = N exactly, so a single draw gives the answer. This is typically substantially better than the sample complexity of the original problem.

This observation underlies the motivation for various importance weighting algorithms. Even under pretty terrible approximations, the logic of “Q(x) is something like f(x) D(x)” often yields substantial improvements over sampling directly from D(x).
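To put numbers on both claims, here is a toy sketch reusing the made-up D and f from the check above (f is nonnegative, so f(x) D(x) / N is a valid distribution): the special choice of Q makes the estimator’s spread essentially zero, and even a deliberately crude approximation of it beats sampling from D directly.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.arange(5)
D = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
f = np.array([1.0, 2.0, 0.5, 3.0, 1.5])   # nonnegative, so f * D / N sums to 1
N = np.sum(D * f)

def estimator_spread(Q, n=1000, trials=200):
    # Standard deviation of the importance weighted estimate across repeated runs.
    estimates = []
    for _ in range(trials):
        s = rng.choice(xs, size=n, p=Q)
        estimates.append(np.mean(f[s] * D[s] / Q[s]))
    return np.std(estimates)

Q_naive = D                    # ordinary Monte Carlo: sample from D itself
Q_best = f * D / N             # the special choice: every weighted sample equals N exactly
Q_rough = Q_best + 0.1         # a deliberately crude approximation to the special choice
Q_rough = Q_rough / Q_rough.sum()

print(estimator_spread(Q_naive))   # baseline spread
print(estimator_spread(Q_best))    # ~0: one sample already recovers N
print(estimator_spread(Q_rough))   # well below the baseline despite the crude Q
```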