Yoram and Shai's online learning tutorial at ICML brings up a question for me, "Why use the dual?"
The basic setting is learning a weight vector w_i so that the function f(x) = sum_i w_i x_i optimizes some convex loss function.
The functional view of the dual is that instead of (or in addition to) keeping track of w_i over the feature space, you keep track of a vector a_j over the examples and define w_i = sum_j a_j x_{j,i}, where x_{j,i} is feature i of example j.
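To make the two kinds of bookkeeping concrete, here is a minimal numpy sketch (made-up data, not from the tutorial) checking that the primal weights w_i and the dual coefficients a_j define the same predictor:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 examples (rows j), 3 features (columns i)
a = rng.normal(size=5)        # dual variables: one a_j per example

# dual -> primal: w_i = sum_j a_j x_{j,i}
w = X.T @ a

x_new = rng.normal(size=3)
f_primal = w @ x_new          # f(x) = sum_i w_i x_i
f_dual = a @ (X @ x_new)      # f(x) = sum_j a_j <x_j, x_new>
assert np.isclose(f_primal, f_dual)
```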
The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking about the dual gives a unified algorithmic font for deriving online learning algorithms. I haven’t worked with the dual representation much myself, but I have seen a few examples where it appears helpful.
- Noise: When doing online optimization (i.e., online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful in dealing with noisy labels.
- Rates: One annoyance of working in the primal space with a gradient descent style algorithm is that finding the right learning rate can be a bit tricky.
- Kernelization: Instead of explicitly computing the weight vector, the representation of the solution can be left in terms of the a_j, which is handy when the weight vector is only implicitly defined (see the sketch after this list).
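To illustrate the kernelization point, here is a toy kernel perceptron sketch (not from the tutorial; the RBF kernel and the XOR-style data are illustrative choices). The predictor is stored entirely as dual coefficients a_j plus the training examples; the weight vector w = sum_j a_j y_j phi(x_j) lives in the kernel's feature space and is never formed explicitly:

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, kernel=rbf, epochs=5):
    """Train dual coefficients; a[j] counts the mistakes made on example j."""
    n = len(y)
    a = np.zeros(n)
    for _ in range(epochs):
        for j in range(n):
            # f(x_j) = sum_k a_k y_k K(x_k, x_j), computed with no explicit w
            f = sum(a[k] * y[k] * kernel(X[k], X[j]) for k in range(n))
            if y[j] * f <= 0:          # mistake: bump this example's coefficient
                a[j] += 1.0
    return a

def predict(a, X, y, x_new, kernel=rbf):
    return np.sign(sum(a[k] * y[k] * kernel(X[k], x_new) for k in range(len(y))))

# toy usage: XOR-style labels, not linearly separable in the original features
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
a = kernel_perceptron(X, y)
print(predict(a, X, y, np.array([0.9, 0.1])))
```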
A substantial drawback of dual analysis is that it doesn’t seem to apply well in nonconvex situations. There is a reasonable (but not bullet-proof) argument that learning over nonconvex functions is unavoidable in solving some problems.
Hi John,
Interesting post. I skimmed the slides, and wonder if you can fill in some blanks.
On #1, why so?
#2 seems interesting; the slides hint a bit at this, but it isn't obvious to me that step sizes can't be tuned in the primal with just as much precision/ease.
Kernelization seems just as natural in the primal. (For instance Online Learning with Kernels).
I’m pretty unclear what dual updates really buy a) for online learning and b) especially for SVMs.
A number of people (including Tong, Nati, and me) have argued that the primal is both practically and theoretically the right place to optimize.
What do we really gain out of the dual?
–d
I was also at the tutorial, and from my point of view the dual provides the following benefits:
– It provides a principled way to do early stopping by monitoring the duality gap (a sketch follows this list).
– It easily accommodates learning with infinitely many attributes, which is useful for deriving AdaBoost and other boosting algorithms in the same framework. For more information, see Shai's very well-written PhD thesis (available on his webpage).
– It provides a recipe for new online algorithms (see slide 70). For example, the recipe easily leads to online algorithms for structured prediction.
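To make the duality-gap point concrete, here is a minimal sketch (toy code, not from the tutorial or the thesis) of a lambda-regularized hinge-loss linear SVM trained by dual coordinate ascent; training stops once the gap P(w) - D(alpha) certifies that the current solution is within a tolerance of optimal. The particular formulation, lambda, and tolerance below are illustrative assumptions:

```python
import numpy as np

def primal(w, X, y, lam):
    """P(w) = (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i w.x_i)."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * lam * (w @ w) + hinge.mean()

def dual(alpha, w, lam):
    """D(alpha) = (1/n) sum_i alpha_i - (lam/2)||w(alpha)||^2, alpha_i in [0,1],
    where w(alpha) = (1/(lam*n)) sum_i alpha_i y_i x_i."""
    return alpha.mean() - 0.5 * lam * (w @ w)

def train(X, y, lam=0.1, tol=1e-3, max_epochs=1000):
    n, d = X.shape                      # assumes no all-zero example rows
    alpha, w = np.zeros(n), np.zeros(d)
    for _ in range(max_epochs):
        for i in np.random.permutation(n):
            # exact maximization of D along coordinate i, clipped to [0,1]
            grad = 1.0 - y[i] * (X[i] @ w)
            new_ai = np.clip(alpha[i] + lam * n * grad / (X[i] @ X[i]), 0.0, 1.0)
            w += (new_ai - alpha[i]) * y[i] * X[i] / (lam * n)
            alpha[i] = new_ai
        gap = primal(w, X, y, lam) - dual(alpha, w, lam)
        if gap <= tol:                  # certified within tol of the optimum
            break
    return w, alpha, gap
```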
I was wondering, is a video of the tutorial available?