Yoram and Shai's online learning tutorial at ICML brings up a question for me: "Why use the dual?"
The basic setting is learning a weight vector w so that the predictor f(x) = sum_i w_i x_i optimizes some convex loss function.
The functional view of the dual is that instead of (or in addition to) keeping track of w_i over the feature space, you keep track of a vector a_j over the examples and define w_i = sum_j a_j x_{ji}.
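As a quick illustration (the random data and coefficients below are made up purely for the sketch), the two representations define the same predictor; the dual simply stores one number per example instead of one per feature:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # rows are examples x_j, columns are features i
a = rng.normal(size=100)        # dual coefficients a_j, one per example

# Dual -> primal: w_i = sum_j a_j * x_{ji}
w = X.T @ a

x_new = rng.normal(size=5)

# Primal prediction: f(x) = sum_i w_i * x_i
f_primal = w @ x_new

# Dual prediction: f(x) = sum_j a_j * <x_j, x>, never forming w explicitly
f_dual = a @ (X @ x_new)

assert np.allclose(f_primal, f_dual)
```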
The above view of duality makes operating in the dual appear unnecessary, because in the end a weight vector is always used. The tutorial suggests that thinking in the dual gives a unified way to derive online learning algorithms. I haven't worked with the dual representation much myself, but I have seen a few examples where it appears helpful.
- Noise: When doing online optimization (i.e. online learning where you are allowed to look at individual examples multiple times), the dual representation may be helpful in dealing with noisy labels; for example, capping each dual coefficient (as in soft-margin SVMs) bounds how much any single mislabeled example can influence the solution.
- Rates: One annoyance of working in the primal space with a gradient-descent-style algorithm is that finding the right learning rate can be a bit tricky. Dual coordinate ascent updates can often be computed in closed form, which sidesteps that choice.
- Kernelization: Instead of explicitly computing the weight vector, the solution can be kept in terms of the a_j, which is handy when the weight vector is only implicitly defined, as with a kernel whose feature space is infinite-dimensional (see the sketch after this list).
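Here is a minimal sketch of that last point, using a kernel perceptron with an RBF kernel; the algorithm choice and the toy XOR data are my own illustration, not anything from the tutorial. The learned state is one coefficient a_j per example, and the weight vector in the RBF feature space is never formed.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    # RBF kernel: the corresponding weight vector lives in an
    # infinite-dimensional feature space, so it is never built explicitly.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_perceptron(X, y, epochs=10, gamma=1.0):
    """Dual (kernel) perceptron: the solution is stored as one
    coefficient a_j per example instead of a weight vector."""
    n = len(X)
    a = np.zeros(n)
    for _ in range(epochs):
        for j in range(n):
            # f(x_j) = sum_k a_k y_k K(x_k, x_j)
            f = sum(a[k] * y[k] * rbf(X[k], X[j], gamma) for k in range(n))
            if y[j] * f <= 0:      # mistake: bump this example's dual weight
                a[j] += 1.0
    return a

def predict(a, X, y, x_new, gamma=1.0):
    return np.sign(sum(a[k] * y[k] * rbf(X[k], x_new, gamma) for k in range(len(X))))

# Tiny usage example on a problem no linear weight vector over the raw
# features can solve (XOR labels).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
a = kernel_perceptron(X, y, epochs=20, gamma=2.0)
print([predict(a, X, y, x) for x in X])   # recovers [-1, 1, 1, -1]
```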
A substantial drawback of dual analysis is that it doesn’t seem to apply well in nonconvex situations. There is a reasonable (but not bullet-proof) argument that learning over nonconvex functions is unavoidable in solving some problems.