Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain it from a couple of viewpoints.
- Functionally. Regularization is optimizing some representation to fit the data while minimizing some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data (see the sketch just after this list).
- Statistical Learning Viewpoint. Regularization is about the failure of statistical learning theory to adequately predict generalization error. Let e(c,D) be the expected error rate of classifier c with respect to D, and e(c,S) the observed error rate on a sample S. There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn sample S, e(c,D) ≤ e(c,S) + f(complexity), where complexity is some measure of the size of a set of functions. Unfortunately, we have never convincingly nailed the exact value of f(). We can note that f() is always monotonically increasing in the complexity measure, so there exists a unique constant C such that f(complexity) = C*complexity at the value of complexity which minimizes the bound. Empirical parameter tuning, such as for the C constant in a support vector machine, can be regarded as searching for this “right” tradeoff.
- Computationally. Regularization can be thought of as a computational shortcut to computing the f() above. Hence, smoothness, convexity, and other computational constraints are important issues.
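As a concrete illustration of the functional view above, here is a minimal sketch, assuming made-up data and a hypothetical penalty strength `lam`, of fitting a linear predictor with and without an l2 complexity penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))               # 50 samples, 10 features (made up)
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=50)  # noisy observations

lam = 1.0  # hypothetical regularization strength; in practice tuned empirically
# Fit the data while penalizing l2 complexity: min_w ||Xw - y||^2 + lam * ||w||^2.
w_reg = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
w_plain = np.linalg.lstsq(X, y, rcond=None)[0]  # plain data fitting, no penalty

print("l2 norm of regularized fit:", np.linalg.norm(w_reg))
print("l2 norm of plain fit:      ", np.linalg.norm(w_plain))
```

The regularized weights come out smaller in norm; whether that actually helps prediction is problem-dependent, which is exactly the point of the next paragraph.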
One thing which should be clear is that there is no one best method of regularization for all problems. “What is a good regularizer for my problem?” is another “learning complete” question, since solving it perfectly implies solving the learning problem (for example, consider the “regularizer” which assigns complexity 0 to the best prediction function and infinity to all others). Similarly, “What is an empirically useful regularizer?” is like “What is a good learning algorithm?” The choice of regularizer used when solving empirical problems is a degree of freedom with which prior information and biases can be incorporated in order to improve performance.
I guess I’d view it through another set of facets: the Bayesian (unsurprisingly) and the online view. For the Bayesian, regularization is simply a MAP approximation, and it makes sense because there are more hypotheses than we can put our prior beliefs uniformly over. We regularize to enforce prior beliefs. The online analysis of that procedure tells us that there are too many experts in the space of all parameter settings and that we should try to compete with some limited set. The more the regularizer penalizes an expert, the longer it takes to compete with it. We can either compete with many hypotheses and learn slowly (maybe even never learn), or learn quickly and compete with a smaller set of hypotheses. In the online view, then, we should, much as a Bayesian, try to compete with the experts we think are most likely to be good. Ideally, it would be as John said: a prior of probability 1 at the optimal point and 0 elsewhere. Otherwise we simply try to encode our prior beliefs.
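To make the MAP point concrete, here is the standard derivation, with a zero-mean Gaussian prior as one illustrative choice:

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\, p(\theta \mid D)
  = \arg\min_{\theta}\, \bigl[\, -\log p(D \mid \theta) \;-\; \log p(\theta) \,\bigr]
```

With $p(\theta) = \mathcal{N}(0, \sigma^2 I)$, the second term equals $\|\theta\|^2 / (2\sigma^2)$ plus a constant, i.e. an l2 regularizer whose strength $1/(2\sigma^2)$ shrinks as the prior gets broader.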
One can also view regularization from the point of view of numerical stability, and this view has surprising connections to other views.
For instance, consider a finite, discrete, linear, pure inverse problem:
y = Xp, where X is a known matrix, the y’s are observed, and the p’s are unknown.
Numerical methods for solving such a system depend strongly on the condition number of X, i.e. the ratio of its largest to smallest singular value, the base-2 log of which gives the number of bits of precision lost in solving the system in the worst case. To remedy this numerical problem, one can regularize, i.e. solve a different but related problem which is numerically better behaved. You can do this by adding a multiple of the identity matrix to X^T X, which is essentially what ridge regression does.
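A minimal numerical sketch of this, with a made-up, nearly rank-deficient X and a hypothetical lam:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 3] + 1e-6 * rng.normal(size=100)  # near-duplicate column

A = X.T @ X                   # the matrix actually inverted in least squares
A_reg = A + 1e-2 * np.eye(5)  # ridge: add a small multiple of the identity

for name, M in [("A", A), ("A + lam*I", A_reg)]:
    c = np.linalg.cond(M)
    print("%-10s cond = %.3g  (~%.0f bits of precision lost)" % (name, c, np.log2(c)))
```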
However, in addition to solving problems of numerical stability, this also makes our solution better from a statistical point of view. As Sam Roweis wrote in his 2003 JMLR paper: “The regularization term acts to penalize large weights that exploit correlations beyond some level of precision in the data sampling process. It may also introduce some robustness to noise and outliers.”
Another example:
Suppose you are doing Maximum Likelihood estimation. Often this converts to a convex optimization problem which you can solve using an off-the-shelf optimizer like conjugate gradient. Here the rate of convergence will depend greatly on the condition number of the Hessian matrix. Very ill-conditioned Hessians cause slow convergence, and you may even fail to converge.
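A small sketch of this effect, assuming synthetic symmetric positive definite systems with prescribed spectra and scipy’s conjugate gradient solver:

```python
import numpy as np
from scipy.sparse.linalg import cg

def cg_iterations(eigvals):
    """Iterations CG needs on an SPD system with the given spectrum."""
    A = np.diag(eigvals)          # SPD matrix with eigenvalues eigvals
    b = np.ones(len(eigvals))
    count = [0]
    def tick(xk):                 # called once per CG iteration
        count[0] += 1
    cg(A, b, maxiter=100000, callback=tick)
    return count[0]

print("condition number 1e1:", cg_iterations(np.linspace(1.0, 10.0, 200)), "iterations")
print("condition number 1e6:", cg_iterations(np.linspace(1e-6, 1.0, 200)), "iterations")
```

The classical analysis says CG needs on the order of sqrt(condition number) iterations to reach a fixed tolerance, so the ill-conditioned system takes far longer.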
In this setting, the Hessian is also the Fisher Information Matrix. The Cramer-Rao bound says that the covariance matrix of any unbiased estimator is lower-bounded by the inverse of the Fisher Information Matrix, so an ill-conditioned Hessian translates into high variance for the best unbiased estimator. For a singular Hessian, the variance becomes infinite along some components.
We regularize by solving a different but related problem which is better behaved: instead of doing ML estimation, we can do MAP estimation by assuming the parameters come from a Gaussian with a diagonal covariance matrix. The new estimator is biased, but ill-conditioning of the Fisher Information Matrix is no longer such a great problem. For instance, the resulting estimator may have finite variance even though the Information Matrix is singular. The same remedy also fixes the numerical precision problems: it is equivalent to adding a diagonal matrix to the Hessian, which is like the method for dealing with numerical instability in ridge regression in my previous example.
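A tiny numerical illustration of that last point, with a made-up rank-1 information matrix and a hypothetical prior variance sigma2:

```python
import numpy as np

# Singular Fisher Information Matrix (rank 1): Cramer-Rao gives infinite
# variance along the null direction for any unbiased estimator.
F = np.array([[2.0, 2.0],
              [2.0, 2.0]])
print("rank(F):", np.linalg.matrix_rank(F))  # 1, so F is not invertible

# MAP with a N(0, sigma2 * I) prior adds (1/sigma2) * I to the Hessian,
# making the regularized information matrix invertible.
sigma2 = 10.0  # hypothetical prior variance
F_map = F + (1.0 / sigma2) * np.eye(2)
print("inverse of regularized information:\n", np.linalg.inv(F_map))
```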
I regard the Bayesian, online learning, and sample complexity bound viewpoints as roughly equivalent in motivating regularization. Each of them motivates sticking in some sort of “prior” over the predictors which is translated into a regularizer.
I had not appreciated the numerical stability issues; thanks. That a fix for prediction performance also fixes numerical issues is nicely reassuring.
Numerical stability is closely related to generalization.
There are various stability-based bounds for the generalization error. Check out, e.g., Bousquet and Elisseeff’s paper. Roughly, stable algorithms have small generalization gap.
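For reference, the uniform-stability bound in that paper is, from memory and with the constants to be treated as approximate: for an algorithm A with uniform stability β with respect to a loss bounded by M, with probability at least 1−δ over an i.i.d. sample S of size m,

```latex
R(A_S) \;\le\; \hat{R}_{\mathrm{emp}}(A_S) \;+\; 2\beta \;+\; (4m\beta + M)\sqrt{\frac{\ln(1/\delta)}{2m}}
```

so the bound is nontrivial exactly when β shrinks like O(1/m), which is the regime suitably regularized algorithms achieve.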
mb
I think stability bridges regularization to generalization: stability is the reason that regularization is useful for improving generalization ability. Stability here doesn’t merely refer to numerical stability; it more often means statistical stability, which is closely connected with the random processes we mostly encounter in study.