Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain from a couple viewpoints.

**Functionally.** Regularization is optimizing some representation to fit the data *and* minimize some notion of predictor complexity. This notion of complexity is often the l_1 or l_2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data.

**Statistical Learning Viewpoint.** Regularization is about the failure of statistical learning theory to adequately predict generalization error. Let *e(c,D)* be the expected error rate with respect to *D* of classifier *c* and *e(c,S)* the observed error rate on a sample *S*. There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn samples *S*, *e(c,D) ≤ e(c,S) + f(complexity)*, where *complexity* is some measure of the size of a set of functions. Unfortunately, we have never convincingly nailed down the exact form of *f()*. We can note that *f()* is always monotonically increasing in the complexity measure, so there exists a unique constant *C* such that *f(complexity) = C · complexity* at the value of complexity which minimizes the bound. Empirical parameter tuning, such as for the *C* constant in a support vector machine, can be regarded as searching for this "right" tradeoff.

**Computationally.** Regularization can be thought of as a computational shortcut to computing the *f()* above. Hence, smoothness, convexity, and other computational constraints are important issues.
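A minimal sketch of both ideas, on made-up data: ridge regression minimizes the penalized objective ||Xw − y||² + C·||w||² (fit plus an l_2 complexity term), and we imitate empirical tuning of the *C* constant by picking the value with the lowest error on a held-out sample. All names and constants here are illustrative choices, not part of any standard recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 30                          # few samples relative to parameters
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

# A held-out sample standing in for fresh draws from D.
X_val = rng.normal(size=(200, d))
y_val = X_val @ w_true + 0.5 * rng.normal(size=200)

def ridge(X, y, C):
    """Closed-form minimizer of ||Xw - y||^2 + C * ||w||^2."""
    return np.linalg.solve(X.T @ X + C * np.eye(X.shape[1]), X.T @ y)

# Sweep the regularization constant and score each fit on held-out data.
errors = {C: np.mean((X_val @ ridge(X, y, C) - y_val) ** 2)
          for C in [0.0, 0.1, 1.0, 10.0, 100.0]}
best_C = min(errors, key=errors.get)
```

Picking `best_C` this way is exactly the kind of search for the "right" fit/complexity tradeoff described above: too small a *C* tracks the noise in the sample, too large a *C* ignores the data.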

One thing which should be clear is that there is no one best method of regularization for all problems. "What is a good regularizer for my problem?" is another "learning complete" question, since solving it *perfectly* implies solving the learning problem (for example, consider the "regularizer" which assigns complexity 0 to the best prediction function and infinity to all others). Similarly, "What is an empirically useful regularizer?" is like "What is a good learning algorithm?" The choice of regularizer used when solving empirical problems is a degree of freedom with which prior information and biases can be incorporated in order to improve performance.
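To make the "prior information" point concrete, here is a hypothetical comparison: when we believe the true predictor is sparse, an l_1 regularizer encodes that belief better than l_2. The lasso fit below uses a basic proximal gradient loop (ISTA); the data, constants, and helper names are all assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]          # only 3 of 20 features matter
y = X @ w_true + 0.1 * rng.normal(size=n)

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (shrink toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso(X, y, C, steps=1000):
    """Minimize ||Xw - y||^2 + C * ||w||_1 by proximal gradient (ISTA)."""
    w = np.zeros(X.shape[1])
    L = 2 * np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the gradient
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, C / L)
    return w

def ridge(X, y, C):
    """Closed-form minimizer of ||Xw - y||^2 + C * ||w||^2."""
    return np.linalg.solve(X.T @ X + C * np.eye(X.shape[1]), X.T @ y)

w_l1 = lasso(X, y, C=5.0)
w_l2 = ridge(X, y, C=5.0)
# The l1 fit zeroes out irrelevant coordinates; the l2 fit only shrinks them.
```

Neither penalty is "right" in general: the l_1 choice only wins here because the data-generating process happens to match the bias it expresses, which is the degree-of-freedom point above.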