# Machine Learning (Theory)

## 5/10/2010

### Aggregation of estimators, sparsity in high dimension and computational feasibility

Tags: Machine Learning,Statistics jl@ 2:04 pm

(I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.)

Since Nemirovski’s Saint Flour lecture notes, numerous researchers have studied the following problem in least squares regression: predict as well as

- (MS) the best of d given functions (as in prediction with expert advice; model = finite set of d functions),
- (C) the best convex combination of these functions (i.e., model = convex hull of the d functions),
- (L) the best linear combination of these functions (i.e., model = linear span of the d functions).

It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved, since there exist estimators having an excess risk of order (log d)/n for (MS), min( sqrt((log d)/n), d/n ) for (C) and d/n for (L), where n is the training set size. Here, “excess risk” is the amount of extra loss per example which may be suffered due to the choice of random sample.

The practical use of these results seems rather limited to trivial statements like: do not use the OLS estimator when the dimension d of the input vector is larger than n (here the d functions are the projections on each of the d components). Nevertheless, they provide a rather easy way to prove that there exists a learning algorithm having an excess risk of order s (log d)/n, with respect to the best linear combination of s of the d functions (s-sparse linear model). Indeed, it suffices to consider the algorithm which

1. cuts the training set into two parts, say of equal size for simplicity,
2. uses the first part to train linear estimators corresponding to every possible subset of s features. Here you can use your favorite linear estimator (the empirical risk minimizer on a compact set or robust but more involved ones are possible rather than the OLS), as long as it solves (L) with minimal excess risk.
3. uses the second part to predict as well as the “d choose s” linear estimators built on the first part. Here you choose your favorite aggregate solving (MS). The one I prefer is described in p.5 of my NIPS’07 paper, but you might prefer the progressive mixture rule or the algorithm of Guillaume Lecué and Shahar Mendelson. Note that empirical risk minimization and cross-validation completely fail for this task with excess risk of order sqrt((log d)/n) instead of (log d)/n.
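The three steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `sparse_aggregate` and the `temperature` parameter are hypothetical, plain least squares (`np.linalg.lstsq`) stands in for "your favorite linear estimator", and the aggregation step is my reading of the progressive mixture rule (the average of exponential-weights posteriors over prefixes of the second half), not the NIPS’07 procedure.

```python
import itertools

import numpy as np


def sparse_aggregate(X, y, s, temperature=1.0):
    """Sketch: split the data, fit every s-subset, aggregate the fits.

    `temperature` is an assumed tuning parameter of the mixture step.
    Returns a length-d coefficient vector (a mixture of s-sparse fits).
    """
    n, d = X.shape
    half = n // 2
    X1, y1 = X[:half], y[:half]      # step 1: first part, used for fitting
    X2, y2 = X[half:], y[half:]      # step 1: second part, used for aggregating

    # Step 2: one least-squares fit per subset of s features.
    # There are "d choose s" subsets, which is what makes this intractable.
    subsets = list(itertools.combinations(range(d), s))
    coefs = np.zeros((len(subsets), d))
    for j, subset in enumerate(subsets):
        cols = list(subset)
        beta, *_ = np.linalg.lstsq(X1[:, cols], y1, rcond=None)
        coefs[j, cols] = beta

    # Step 3: progressive mixture over the candidate fits.
    # Row i of `cum` holds each candidate's squared loss on the first i points.
    losses = (X2 @ coefs.T - y2[:, None]) ** 2
    cum = np.vstack([np.zeros(len(subsets)), np.cumsum(losses, axis=0)])
    logw = -cum / temperature
    logw -= logw.max(axis=1, keepdims=True)   # stabilize the exponentials
    weights = np.exp(logw)
    weights /= weights.sum(axis=1, keepdims=True)
    mixture = weights.mean(axis=0)            # average the posteriors over i

    return mixture @ coefs
```

Calling `sparse_aggregate(X, y, s=2)` on a small design returns a coefficient vector that can be used as `X_new @ beta_hat`. The loop over `itertools.combinations` is the computational bottleneck discussed below.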

It is an easy exercise to combine the different excess risk bounds and obtain that the above procedure achieves an excess risk of s (log d)/n. The nice thing compared to works on Lasso, Dantzig selectors and their variants is that you do not need all these assumptions saying that your features should be “not too much” correlated. Naturally, the important limitation of the above procedure, which is often encountered when using a classical model selection approach, is its computational intractability. So this leaves open the following fundamental problem:
is it possible to design a computationally efficient algorithm with the s (log d)/n guarantee without assuming low correlation between the explanatory variables?
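For readers who want the “easy exercise” spelled out, here is one way the bounds might be combined. The notation is assumed for illustration and does not come from the cited papers: R is the risk, \hat f_S is the linear estimator trained on feature subset S, \hat f is the aggregate, f^*_s is the best s-sparse linear combination, and C_1, C_2 are unspecified constants from the (MS) and (L) bounds, each applied to n/2 points.

```latex
\begin{align*}
\mathbb{E}\, R(\hat f) - R(f^*_s)
  &= \underbrace{\mathbb{E}\, R(\hat f) - \mathbb{E} \min_{|S| = s} R(\hat f_S)}_{\text{(MS) on } n/2 \text{ points}}
   + \underbrace{\mathbb{E} \min_{|S| = s} R(\hat f_S) - R(f^*_s)}_{\text{(L) on } n/2 \text{ points}} \\
  &\le C_1 \frac{\log \binom{d}{s}}{n/2} + C_2 \frac{s}{n/2} \\
  &\le \frac{2 C_1\, s \log(e d / s) + 2 C_2\, s}{n}
   \qquad \text{using } \binom{d}{s} \le \left(\frac{e d}{s}\right)^{s} \\
  &= O\!\left(\frac{s \log d}{n}\right) \qquad \text{for } d \ge 3 .
\end{align*}
```

The (L) term is bounded by looking only at the subset that supports f^*_s, since the minimum over all subsets is no larger than the risk of that one estimator.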

## 3/8/2009

### Prediction Science

One view of machine learning is that it’s about how to program computers to predict well. This suggests a broader research program centered around the more pervasive goal of simply predicting well.
There are many distinct strands of this broader research program which are only partially unified. Here are the ones that I know of:

1. Learning Theory. Learning theory focuses on several topics related to the dynamics and process of prediction. Convergence bounds like the VC bound give an intellectual foundation to many learning algorithms. Online learning algorithms like Weighted Majority provide an alternate purely game theoretic foundation for learning. Boosting algorithms yield algorithms for purifying prediction ability. Reduction algorithms provide means for changing esoteric problems into well known ones.
2. Machine Learning. A great deal of experience has accumulated in practical algorithm design from a mixture of paradigms, including bayesian, biological, optimization, and theoretical.
3. Mechanism Design. The core focus in game theory is on equilibria, most typically Nash equilibria, but also many other kinds of equilibria. The point of equilibria, to a large extent, is predicting how agents will behave. When this is employed well, principally in mechanism design for auctions, it can be a very powerful concept.
4. Prediction Markets. The basic idea in a prediction market is that commodities can be designed so that their buy/sell price reflects a form of wealth-weighted consensus estimate of the probability of some event. This is not simply mechanism design, because (a) the thin market problem must be dealt with and (b) the structure of plausible guarantees is limited.
5. Predictive Statistics. Part of statistics focuses on prediction, essentially becoming indistinguishable from machine learning. The canonical example of this is tree building algorithms such as CART, random forests, and some varieties of boosting. Similarly, the notions of probability, counting, and estimation are all handy.
6. Robust Search. I have yet to find an example of robust search which isn’t useful—and there are several varieties. This includes active learning, robust min finding, and (more generally) compressed sensing and error correcting codes.

The lack of unification is fertile territory for new research, so perhaps it’s worthwhile to think about how these different research programs might benefit from each other.

1. Learning Theory. The concept of mechanism design is mostly missing from learning theory, but it is sure to be essential when interactive agents are learning. We’ve found several applications for robust search as well as new settings for robust search such as active learning, and error correcting tournaments, but there are surely others.
2. Machine Learning and Predictive Statistics. Machine learning has been applied to auction design. There is a strong relationship between incentive compatibility and choice of loss functions, both for choosing proxy losses and approximating the real loss function imposed by the world. It’s easy to imagine designer loss functions from the study of incentive compatibility mechanisms giving learning algorithms an edge. I found this paper thought provoking that way. Since machine learning and information markets share a design goal, are there hybrid approaches which can outperform either?
3. Mechanism Design. There are some notable similarities between papers in ML and mechanism design. For example there are papers about learning on permutations and pricing in combinatorial markets. I haven’t yet taken the time to study these carefully, but I could imagine that one suggests advances for the other, and perhaps vice versa. In general, the idea of using mechanism design with context information (as is done in machine learning), could also be extremely powerful.
4. Prediction Markets. Prediction markets are partly an empirical field and partly a mechanism design field. There seems to be relatively little understanding about how well and how exactly information from multiple agents is supposed to interact to derive a good probability estimate. For example, the current global recession reminds us that excess leverage is a very bad idea. The same problem comes up in machine learning and is solved by the weighted majority algorithm (and even more thoroughly by the hedge algorithm). Can an information market be designed with the guarantee that an imperfect but best player decides the vote after not-too-many rounds? How would this scale as a function of the ratio of a participant’s initial wealth to the total wealth?
5. Robust Search. Investigations into robust search are extremely diverse, essentially only unified in a mathematically based analysis. For people interested in robust search, machine learning and information markets provide a fertile ground for empirical application and new settings. Can all mechanisms for robust search be done with context information, as is common in learning? Do these approaches work empirically in machine learning or information markets?
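To make the prediction-market analogy with the hedge algorithm concrete, here is a minimal sketch of Hedge in plain Python. The function name, the `initial_weights` argument (standing in for each participant's share of total wealth) and the learning rate `eta` are assumptions for illustration. Losses are assumed to lie in [0, 1].

```python
import math


def hedge(loss_rounds, eta, initial_weights=None):
    """Run Hedge on a list of per-expert loss vectors with entries in [0, 1].

    Returns the final normalized weights and the algorithm's cumulative
    expected loss (the weight-averaged loss, summed over rounds).
    """
    n_experts = len(loss_rounds[0])
    if initial_weights is None:
        weights = [1.0 / n_experts] * n_experts
    else:
        total = sum(initial_weights)
        weights = [w / total for w in initial_weights]

    total_loss = 0.0
    for losses in loss_rounds:
        # Suffer the weighted average of the experts' losses this round.
        total_loss += sum(w * l for w, l in zip(weights, losses))
        # Multiplicative update: shrink each weight according to its loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return weights, total_loss
```

Under the standard analysis, the regret against an expert with initial weight p after T rounds is at most ln(1/p)/eta + eta·T/8, so the dependence on initial wealth share is only logarithmic. Whether an actual market mechanism can inherit this guarantee is the open question raised above.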

There are almost surely many other interesting research topics and borrowable techniques here, and probably even other communities oriented around prediction. While the synthesis of these fields is almost sure to eventually happen, I’d like to encourage it sooner rather than later. For someone working on one of these branches, attending a conference on one of the other branches might be a good start. At a lesser time investment, Oddhead is a good start.

## 1/27/2009

### Key Scientific Challenges

Yahoo released the Key Scientific Challenges program. There is a Machine Learning list I worked on and a Statistics list which Deepak worked on.

I’m hoping this is taken quite seriously by graduate students. The primary value is that it gave us a chance to sit down and publicly specify directions of research which would be valuable to make progress on. A good strategy for a beginning graduate student is to pick one of these directions, pursue it, and make substantial advances for a PhD. The directions are sufficiently general that I’m sure any serious advance has applications well beyond Yahoo.

A secondary point (which I’m sure is primary for many) is that there is money for graduate students here. It’s unrestricted, so you can use it for any reasonable travel, supplies, etc.

## 2/27/2008

### The Stats Handicap

Graduating students in Statistics appear to be at a substantial handicap compared to graduating students in Machine Learning, despite being in substantially overlapping subjects.

The problem seems to be cultural. Statistics comes from a mathematics background which emphasizes large publications slowly published under review at journals. Machine Learning comes from a Computer Science background which emphasizes quick publishing at reviewed conferences. This has a number of implications:

1. Graduating statistics PhDs often have 0-2 publications while graduating machine learning PhDs might have 5-15.
2. Graduating ML students have had a chance for others to build on their work. Stats students have had no such chance.
3. Graduating ML students have attended a number of conferences and presented their work, giving them a chance to meet people. Stats students have had fewer chances of this sort.

In short, Stats students have had relatively few chances to distinguish themselves and are heavily reliant on their advisors for jobs afterwards. This is a poor situation, because advisors have a strong incentive to place students well, implying that recommendation letters must always be considered with a grain of salt.

This problem is more or less prevalent depending on which Stats department students go to. In some places the difference is substantial, and in other places not.

One practical implication of this is that when considering graduating stats PhDs for hire, some amount of affirmative action is in order. At a minimum, this implies spending extra time getting to know the candidate and what the candidate can do.

## 1/15/2007

### The Machine Learning Department

Tags: Machine Learning,Statistics jl@ 7:40 pm

Carnegie Mellon School of Computer Science has the first academic Machine Learning department. This department already existed as the Center for Automated Learning and Discovery, but recently changed its name.

The reason for changing the name is obvious: very few people think of themselves as “Automated Learners and Discoverers”, but there are a number of people who think of themselves as “Machine Learners”. Machine learning is both more succinct and more recognizable, which are good properties for a name.

A more interesting question is “Should there be a Machine Learning Department?”. Tom Mitchell has a relevant whitepaper claiming that machine learning is answering a different question than other fields or departments. The fundamental debate here is “Is machine learning different from statistics?”

At a cultural level, there is no real debate: they are different. Machine learning is characterized by several very active large peer reviewed conferences, operating in a computer science mode. Statistics tends to function with a greater emphasis on journals and a lesser emphasis on conferences which often implies a much longer publishing cycle.

In terms of the basic questions driving the field, the answer seems less clear. It is true that the core problems of statistics in the past have typically differed from the core problems of machine learning today. Yet, there has been some substantial overlap, and there are a number of statisticians nowadays who are actively doing machine learning. It’s reasonably plausible that in the long term statistics departments will adopt the core problems of machine learning, removing the reasons for a separate machine learning department.

The parallel question for computer science comes up less often perhaps because computer science is a notoriously broad field.

The practical implication of a new department is the ability to create a more specific curriculum, admit more specific students, and hire faculty based upon more specific interests. Compared to a computer science program, classes on programming languages, computer architecture, or graphics might be dropped in favor of classes on learning theory, statistics, etc… Compared to a statistics program, classes on advanced parameter estimation and measure theory might be dropped in favor of algorithms and programming experience.

An alternative solution like “learn everything from computer science and statistics” is personally appealing to me, and I have benefitted from and recommend a broad education. However this is not practical for everyone. In my experience, a machine learning skill set is an effective specialization with which people can do important things in the world. Given this, having a department with a machine learning centered curriculum seems like a good idea. At Carnegie Mellon, this is the Machine Learning department. In the future and elsewhere it may have a different name, but the value of the machine learning skill set should grow with research, improving computers, and improving data sources.

## 10/8/2006

### Incompatibilities between classical confidence intervals and learning.

Classical confidence intervals satisfy a theorem of the form: For some data sources D,

$$\Pr_{S \sim D}\big(f(D) > g(S)\big) > 1 - d$$

where f is some function of the distribution (such as the mean) and g is some function of the observed sample S. The constraints on D can range from “Independent and identically distributed (IID) samples from a Gaussian with an unknown mean” to “IID samples from an arbitrary distribution D”. There are even some confidence intervals which do not require IID samples.

Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bound holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world where d is the safety parameter.
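The “safe procedure” reading can be checked by simulation. The sketch below is an illustration, not something from the results cited here: it uses a one-sided Hoeffding bound as g(S), the true mean of a Bernoulli source as f(D), and a parameter `delta` playing the role of d above. The function names are hypothetical.

```python
import math
import random


def hoeffding_lower_bound(sample, delta):
    """g(S): a lower confidence bound on the mean of IID values in [0, 1]."""
    n = len(sample)
    return sum(sample) / n - math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def coverage(true_mean, n, delta, trials, seed=0):
    """Fraction of fresh samples S for which f(D) > g(S) holds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw a new IID Bernoulli(true_mean) sample of size n.
        sample = [1.0 if rng.random() < true_mean else 0.0 for _ in range(n)]
        if true_mean > hoeffding_lower_bound(sample, delta):
            hits += 1
    return hits / trials
```

Running `coverage(0.3, 50, 0.1, 2000)` estimates how often the bound holds across repeated draws of S. The guarantee is about that frequency over samples, not about any single observed sample.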

There are a number of results in theoretical machine learning which use confidence intervals. For example,

1. The E3 algorithm uses confidence intervals to learn a near optimal policy for any MDP with high probability.
2. Set Covering Machines minimize a confidence interval upper bound on the true error rate of a learned classifier.
3. The A2 algorithm uses confidence intervals to safely deal with arbitrary noise while taking advantage of active learning.

Suppose that we want to generalize these algorithms in a reductive style. The goal would be to train a regressor to predict the output of g(S) for new situations. For example, a good regression prediction of g(S) might allow E3 to be applied to much larger state spaces. Unfortunately, this approach seems to fail badly due to a mismatch between the semantics of learning and the semantics of a classical confidence interval.

1. It’s difficult to imagine a constructive sampling mechanism. In a large state space, we may never encounter the same state twice, so we cannot form meaningful examples of the form “for this state-action, the correct confidence interval is y”.
2. When we think of successful learning, we typically think of it in an l1 sense: the expected error rate over the data generating distribution is small. Confidence intervals have a much stronger meaning as we would like to apply them: with high probability, in all applications, the confidence interval holds. This mismatch appears unaddressable.

It is tempting to start plugging in other notions such as Bayesian confidence intervals or quantile regression systems. Making these approaches work at a theoretical level on even simple systems is an open problem, but there is plenty of motivation to do so.