To Dual or Not – Machine Learning (Theory)

Yoram and Shai‘s online learning tutorial at ICML brings up a question for me, “Why use the dual?”

The basic setting is learning a weight vector w_i so that the function f(x)= sum_i w_i x_i optimizes some convex loss function.

The functional view of the dual is that instead of (or in addition to) keeping track of w_i over the feature space, you keep track of a vector a_j over the examples and define w_i = sum_j a_j x_ji.

The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful.

Noise When doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful in dealing with noisy labels.
Rates One annoyance of working in the primal space with a gradient descent style algorithm is that finding the right learning rate can be a bit tricky.
Kernelization Instead of explicitly computing the weight vector, the representation of the solution can be left in terms of a_i, which is handy when the weight vector is only implicitly defined.

A substantial drawback of dual analysis is that it doesn’t seem to apply well in nonconvex situations. There is a reasonable (but not bullet-proof) argument that learning over nonconvex functions is unavoidable in solving some problems.

3 Replies to “To Dual or Not”

Hi John,

Interesting post. I skimmed the slides, and wonder if you can fill in some blanks.

On #1, why so?
#2 seems interesting– the slides hint a bit at this, but it isn’t obvious to me that you can’t tune step sizes in the primal with any less
precision/ease.
Kernelization seems just as natural in the primal. (For instance Online Learning with Kernels).

I’m pretty unclear what dual updates really buy a) for online learning and b) especially for SVMs.

A number of people, (including Tong, Nati, and I) have argued that the primal is both practically and theoretically the right place to optimize.

What do we really gain out of the dual?

–d

I was also at the tutorial and from my point of view the dual provides the following benefits:
– It provides a principled way to do early stopping by looking at the duality gap.
– It easily accommodates learning with infinite attributes. This is useful for deriving AdaBoost and other boosting algorithms in the same framework. For more information, see Shai’s very well written PhD thesis (available at his webpage).
– It provides a recipe for new online algorithms (see slide 70). For example, the recipe easily leads to online algorithms for structured prediction.

I was wondering, is a video of the tutorial available?

Comments are closed.