Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain it from a couple of viewpoints.
- Functionally. Regularization means optimizing some representation to fit the data while also minimizing some notion of predictor complexity. This notion of complexity is often the L1 or L2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data (see the first sketch after this list).
- Statistical Learning Viewpoint. Regularization is about the failure of statistical learning to adequately predict generalization error. Let e(c,D) be the expected error rate of classifier c with respect to D, and e(c,S) the observed error rate on a sample S. There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples S, e(c,D) < e(c,S) + f(complexity), where complexity is some measure of the size of a set of functions. Unfortunately, we have never convincingly nailed the exact value of f(). We can note that f() is always monotonically increasing with the complexity measure, so there exists a unique constant C such that f(complexity) = C*complexity at the value of complexity which minimizes the bound. Empirical parameter tuning, such as for the C constant in a support vector machine, can be regarded as searching for this “right” tradeoff (the second sketch after this list shows one such search).
- Computationally. Regularization can be thought of as a computational shortcut to computing the f() above. Hence, smoothness, convexity, and other computational constraints are important issues (see the third sketch after this list).
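
To make the functional viewpoint concrete, here is a minimal sketch of L2-regularized least squares (ridge regression) in numpy. The synthetic data and the penalty weight lam are my own illustrative choices.

```python
# A minimal sketch of the functional viewpoint: L2-regularized least squares
# ("ridge regression"). The synthetic data and the penalty weight lam are
# illustrative assumptions, not taken from the discussion above.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                 # 100 samples, 10 features
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]                  # only 3 features matter
y = X @ w_true + 0.1 * rng.normal(size=100)

lam = 1.0  # weight on the complexity penalty
# Minimize ||X w - y||^2 + lam * ||w||^2, whose closed form is
# w = (X^T X + lam * I)^{-1} X^T y.
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
print(np.round(w_hat, 2))
```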
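For the statistical learning viewpoint, here is a sketch of picking the C constant of a support vector machine by held-out error, which is the empirical search for the “right” tradeoff described above. It assumes scikit-learn is available; the synthetic data and the candidate grid for C are illustrative.

```python
# A sketch of tuning the SVM's C constant on held-out data, i.e. empirically
# searching for the "right" tradeoff between observed error and complexity.
# Assumes scikit-learn; the synthetic data and candidate grid are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

best_C, best_err = None, float("inf")
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = SVC(C=C, kernel="linear").fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_val, y_val)  # observed error e(c,S) on held-out S
    if err < best_err:
        best_C, best_err = C, err
print(f"chose C={best_C} with held-out error {best_err:.3f}")
```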
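And for the computational viewpoint, a sketch of why smoothness matters: the L2 penalty is differentiable, so plain gradient descent applies, while the L1 penalty is not, but it admits a cheap proximal (soft-thresholding) step. The step size and data below are illustrative assumptions.

```python
# A sketch of the computational viewpoint: the L2 penalty is smooth, so plain
# gradient descent applies, while the L1 penalty is non-smooth but admits a
# cheap proximal step (soft-thresholding). Step size and data are illustrative.
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of t * ||w||_1: shrink each coordinate toward zero.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def prox_gradient_lasso(X, y, lam, step=1e-3, iters=2000):
    # Minimize ||X w - y||^2 + lam * ||w||_1 by proximal gradient descent.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2.0 * X.T @ (X @ w - y)          # gradient of the smooth part
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(np.round(prox_gradient_lasso(X, y, lam=1.0), 2))
```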
One thing which should be clear is that there is no single best method of regularization for all problems. “What is a good regularizer for my problem?” is another “learning complete” question, since solving it perfectly implies solving the learning problem. (For example, consider the “regularizer” which assigns complexity 0 to the best prediction function and infinity to all others.) Similarly, “What is an empirically useful regularizer?” is like “What is a good learning algorithm?” The choice of regularizer used when solving empirical problems is a degree of freedom with which prior information and biases can be incorporated in order to improve performance.