(I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.)
Since Nemirovski’s Saint Flour lecture notes, numerous researchers have studied the following problem in least squares regression: predict as well as
(MS) the best of d given functions (like in prediction with expert advice; model = finite set of d functions)
(C) the best convex combination of these functions (i.e., model = convex hull of the d functions)
(L) the best linear combination of these functions (i.e., model = linear span of the d functions)
It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved since there exist estimators having an excess risk of order (log d)/n for (MS), min( sqrt((log d)/n), d/n ) for (C) and d/n for (L), where n is the training set size. Here, “risk” is amount of extra loss per example which may be suffered due to the choice of random sample.
The practical use of these results seems rather limited to trivial statements like: do not use the OLS estimator when the dimension d of the input vector is larger than n (here the d functions are the projections on each of the d components). Nevertheless, it provides a rather easy way to prove that there exists a learning algorithm having an excess risk of order s (log d)/n, with respect to the best linear combination of s of the d functions (s-sparse linear model). Indeed, it suffices to consider the algorithm which
- cuts the training set into two parts, say of equal size for simplicity,
- uses the first part to train linear estimators corresponding to every possible subset of s features. Here you can use your favorite linear estimator (the empirical risk minimizer on a compact set or robust but more involved ones are possible rather than the OLS), as long as it solves (L) with minimal excess risk.
- uses the second part to predict as well as the “d choose s” linear estimators built on the first part. Here you choose your favorite aggregate solving (MS). The one I prefer is described in p.5 of my NIPS’07 paper, but you might prefer the progressive mixture rule or the algorithm of Guillaume Lecué and Shahar Mendelson. Note that empirical risk minimization and cross-validation completely fail for this task with excess risk of order sqrt((log d)/n) instead of (log d)/n.
It is an easy exercise to combine the different excess risk bounds and obtain that the above procedure achieves an excess risk of s (log d)/n. The nice thing compared to works on Lasso, Dantzig selectors and their variants is that you do not need all these assumptions saying that your features should be “not too much” correlated. Naturally, the important limitation of the above procedure, which is often encountered when using classical model selection approach, is its computational intractability. So this leaves open the following fundamental problem:
is it possible to design a computationally efficient algorithm with the s (log d)/n guarantee without assuming low correlation between the explanatory variables?
If we want to consider computationally intractable methods, why not just compute the solution of the ell_0 penalized problem? Isn’t that minimax optimal?
Penalization proportional to the number of nonzero coefficients is indeed the basic algorithm. Typical criterion to tune this penalty are the well-known
Mallows Cp, AIC and BIC. The relevant references I know for the theoretical guarantees of this type of methods are (and are contained in) the work of Bunea, Tsybakov and Wegkamp (Aggregation for Gaussian regression, 2007, Annals of Statistics) and the work of Birgé and Massart (Minimal penalties for Gaussian model selection, 2007, Probability theory and related fields). To my knowledge, the results concerning the ell_0 regularization are not that good compared to the ones of the procedure in the post since the Cst (s log d) / n excess risk bound holds only when the regression function, that is the conditional expectation of the output knowing the input, is inside the model (i.e., is a linear combination of at most s of the d basis functions).
Besides they concern regression with fixed design (with Gaussian noise) and it is unclear how they generalize to the random design setting considered in the post.
Jean-Yves
There is some work on ell_0 penalized estimator for random design too. I don’t know of older literature, but the most recent one is due to Raskutti, Wainwright and Yu http://arxiv.org/abs/0910.2042. The comment about model being true is right, I don’t know of any analysis for that case, but it would be a little surprising if you cannot prove an excess risk bound there. The assumption about model being true is more in order to get convergence in parameter instead of convergence in risk, rather than being a fundamental limitation of the estimator in my understanding.
Alekh
The assumption about model being true is strong and does influence the rate. For instance, consider the MS problem described in the post. If we assume that one of the d functions is the true regression function, then it is easy to show that the empirical risk minimizer on the set of d functions has an excess risk of order (log d)/n, whereas, without this assumption, it only has a sqrt((log d)/n) excess risk rate (in bad cases; see, e.g., Theorem 2 of my NIPS’07 paper). Here it is likely that the same thing occurs: sqrt((s log d)/n) excess risk rate (instead of (s log d)/n) for ell_0 regularization if we do not assume that the conditional expectation of the output knowing the input is a linear combination of at most s of the d basis functions.
Jean-Yves
I would argue that the issue isn’t whether the model is true or not, but the continuity of the space. So the MS problem isn’t going to work well for general LS due to the digital nature of the space. But for the problems C and L, i would be surprized if you could get much of a gap between the “model is true” bounds and the worst case bounds.
I’m curious if there are any approximation bounds on model selection that aren’t either exponential in s or assume some sort of orthoganlity. I’ve always taken that problem as simply NP complete and assumed we had to go with much weaker statements. Have I missed any cool new results? I.e. no assumption on colinearlity. But still Poly(s). And some sort of risk efficiency?
later,
dean
A small precision: the MS problem is mainly due to non-convexity. The problem of suboptimality of empirical risk minimization (and its variants) typically occurs if the model is non-convex.
Jean-Yves
I believe there is an NP-Hardness result for this problem. See:
Natara jan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM J. Comput.
24 227-234.
I vaguely recall an earlier hardness result as well, but not sure where I saw it.