Machine Learning (Theory)

3/26/2009

Machine Learning is too easy

Tags: Machine Learning jl@ 3:05 pm

One of the remarkable things about machine learning is how diverse it is. The viewpoints of Bayesian learning, reinforcement learning, graphical models, supervised learning, unsupervised learning, genetic programming, etc… share little enough overlap that many people can and do make their careers within one without touching, or even necessarily understanding, the others.

There are two fundamental reasons why this is possible.

  1. For many problems, many approaches work in the sense that they do something useful. This is true empirically, where for many problems we can observe that many different approaches yield better performance than any constant predictor. It’s also true in theory, where we know that for any set of predictors representable in a finite amount of RAM, minimizing training error over the set of predictors does something nontrivial when there are a sufficient number of examples. (A minimal version of this guarantee is sketched after this list.)
  2. There is nothing like a unifying problem defining the field. In many other areas there are unifying problems for which finer distinctions in approaches matter.
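
To make the theoretical claim in point (1) concrete, here is the standard finite-class (Occam's razor) bound in common textbook notation; it is background knowledge rather than something stated above. For any finite predictor set H, any confidence parameter delta in (0,1), and m i.i.d. examples, with probability at least 1 - delta, every h in H satisfies

    \mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h) + \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}

A predictor set representable in b bits of RAM has |H| <= 2^b, so the deviation term shrinks as the number of examples grows, and minimizing training error over the set is nontrivial.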

The implications of this observation agree with inspection of the field.

  1. Particular problems are often “solved” by the first technique applied to them. This is particularly exacerbated when some other field first uses machine learning, as people there may not be aware of other approaches. A popular example of this is Naive Bayes for spam. In other fields, the baseline method is a neural network, an SVM, a graphical model, etc…
  2. The analysis of new learning algorithms is often subject to assumptions designed for the learning algorithm. Examples include large margins for support vector machines, a ‘correct’ Bayesian prior, correct conditional independence assumptions for graphical models, etc… Given such assumptions, it’s unsurprising to learn that the algorithm is the right method, and justifying a new algorithm becomes an exercise in finding a natural-sounding assumption under which the algorithm performs well. This assumption set selection problem is the theoretician’s version of the data set selection problem.

A basic problem is: How do you make progress in a field with this (lack of) structure? And what does progress even mean? Some possibilities are:

  1. Pick an approach and push it. [Insert your favorite technique] everywhere.
  2. Find new real problems and apply ML. The fact that ML is easy means there is a real potential for doing great things this way.
  3. Find a hard problem and work on it. Although almost anyone can do something nontrivial on most problems, achieving best-possible performance on some problems is not at all easy.
  4. Make the problem harder. Create algorithms that work fast online, in real time, with very few or no labeled examples, but very many examples, very many features, and very many things to predict.

I am least fond of approach (1), although many people successfully follow it for their careers. What’s frustrating about approach (1) is that there does not seem to be any single simple philosophy capable of solving all the problems we might recognize as machine learning problems. Consequently, people following approach (1) are at risk of being outpersuaded by someone sometime in the future.

Approach (2) is perhaps the easiest way to accomplish great things, and in some sense much of the advance in the field comes from new applications.

Approach (3) seems solid, promoting a different kind of progress than approach (2).

Approach (4) seems particularly cool to me at the moment. It is not as specialized as (2) or (3), and it seems many constraints are complementary. For example, the large scale constraint pushes you toward online learning.

3/18/2009

Parallel ML primitives

Tags: Computation,Machine Learning jl@ 4:47 am

Previously, we discussed parallel machine learning a bit. As parallel ML is rather difficult, I’d like to describe my thinking at the moment, and ask for advice from the rest of the world. This is particularly relevant right now, as I’m attending a workshop tomorrow on parallel ML.

Parallelizing slow algorithms seems uncompelling. Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. This leaves the question: Which one fast algorithm is the best to parallelize? What is a substantially different second?

One compellingly fast simple algorithm is online gradient descent on a linear representation. This is the core of Leon’s sgd code and Vowpal Wabbit. Antoine Bordes showed a variant was competitive in the large scale learning challenge. It’s also a decades old primitive which has been reused in many algorithms, and continues to be reused. It also applies to online learning rather than just online optimization, implying the algorithm can be used in a host of situations where batch algorithms are awkward or unviable.
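
As a minimal sketch of this primitive, assuming squared loss, a fixed learning rate, and a sparse feature representation (all illustrative choices, not necessarily what sgd or Vowpal Wabbit do):

    def online_sgd(examples, num_features, learning_rate=0.1):
        # Online gradient descent on a linear representation.
        # examples yields (x, y) pairs, with x a sparse dict {index: value}.
        w = [0.0] * num_features
        for x, y in examples:
            prediction = sum(w[i] * v for i, v in x.items())
            error = prediction - y  # gradient scale for squared loss
            for i, v in x.items():
                w[i] -= learning_rate * error * v
        return w

Each example is touched once and then discarded, which is what makes the algorithm usable for online learning rather than merely online optimization.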

If we start with a fast learning algorithm as a baseline, there seem to be two directions we can go with parallel ML:

  1. (easier) Try to do more in the same amount of time. For example, Paul and Neil suggest using parallelism to support ensemble methods. (A toy sketch of this direction appears after this list.)
  2. (harder) Try to use parallelism to reduce the amount of time required to effectively learn on large amounts of data. For this approach, bandwidth and latency become uncomfortably critical implementation details. Due to these issues, it appears important to at least loosen the goal to competing with learning on large amounts of data. Alternatively, we could consider this as evidence some other learning primitive is desirable, although I’m not sure which one.
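
Here is a toy sketch of the easier direction, using process-level parallelism to train ensemble members on bootstrap samples. The base learner is a hypothetical placeholder (it just predicts the mean label), standing in for any single-machine algorithm:

    import random
    from multiprocessing import Pool

    def train_on_bootstrap(args):
        # Hypothetical base learner: predict the mean label of a bootstrap sample.
        data, seed = args  # data is a list of (features, label) pairs
        rng = random.Random(seed)
        sample = [rng.choice(data) for _ in range(len(data))]
        return sum(y for _, y in sample) / len(sample)

    def parallel_ensemble(data, num_models=8):
        # Each worker trains one member, so wall-clock time stays near that
        # of a single model when num_models does not exceed available cores.
        with Pool(processes=num_models) as pool:
            return pool.map(train_on_bootstrap,
                            [(data, s) for s in range(num_models)])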

3/8/2009

Prediction Science

One view of machine learning is that it’s about how to program computers to predict well. This suggests a broader research program centered around the more pervasive goal of simply predicting well.
There are many distinct strands of this broader research program which are only partially unified. Here are the ones that I know of:

  1. Learning Theory. Learning theory focuses on several topics related to the dynamics and process of prediction. Convergence bounds like the VC bound give an intellectual foundation to many learning algorithms. Online learning algorithms like Weighted Majority provide an alternate, purely game theoretic foundation for learning. Boosting algorithms yield algorithms for purifying prediction ability. Reduction algorithms provide means for changing esoteric problems into well known ones.
  2. Machine Learning. A great deal of experience has accumulated in practical algorithm design from a mixture of paradigms, including Bayesian, biological, optimization, and theoretical.
  3. Mechanism Design. The core focus in game theory is on equilibria, most typically Nash equilibria, but also many other kinds of equilibria. The point of equilibria, to a large extent, is predicting how agents will behave. When this is employed well, principally in mechanism design for auctions, it can be a very powerful concept.
  4. Prediction Markets. The basic idea in a prediction market is that commodities can be designed so that their buy/sell price reflects a form of wealth-weighted consensus estimate of the probability of some event. This is not simply mechanism design, because (a) the thin market problem must be dealt with and (b) the structure of plausible guarantees is limited. (One concrete mechanism of this kind is sketched after this list.)
  5. Predictive Statistics. Part of statistics focuses on prediction, essentially becoming indistinguishable from machine learning. The canonical examples of this are tree building algorithms such as CART, random forests, and some varieties of boosting. Similarly, the notions of probability, counting, and estimation are all handy.
  6. Robust Search. I have yet to find an example of robust search which isn’t useful—and there are several varieties. This includes active learning, robust min finding, and (more generally) compressed sensing and error correcting codes.
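
To make the prediction market idea in point (4) concrete, here is a sketch of one standard mechanism, Hanson's logarithmic market scoring rule (LMSR); the post itself does not name it, and the liquidity parameter b is an illustrative value:

    import math

    def lmsr_cost(quantities, b=100.0):
        # Cost function over outstanding shares per outcome.
        return b * math.log(sum(math.exp(q / b) for q in quantities))

    def lmsr_price(quantities, outcome, b=100.0):
        # The instantaneous price of an outcome is the market's implied probability.
        z = sum(math.exp(q / b) for q in quantities)
        return math.exp(quantities[outcome] / b) / z

    def trade_cost(quantities, outcome, amount, b=100.0):
        # A trader pays the change in the cost function to buy `amount` shares.
        new_q = list(quantities)
        new_q[outcome] += amount
        return lmsr_cost(new_q, b) - lmsr_cost(quantities, b)

Prices always sum to one across outcomes and move toward whatever traders buy, which is the sense in which the buy/sell price tracks a consensus probability estimate.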

The lack of unification is fertile territory for new research, so perhaps it’s worthwhile to think about how these different research programs might benefit from each other.

  1. Learning Theory. The concept of mechanism design is mostly missing from learning theory, but it is sure to be essential when interactive agents are learning. We’ve found several applications for robust search, as well as new settings for it such as active learning and error correcting tournaments, but there are surely others.
  2. Machine Learning and Predictive Statistics. Machine learning has been applied to auction design. There is a strong relationship between incentive compatibility and choice of loss functions, both for choosing proxy losses and approximating the real loss function imposed by the world. It’s easy to imagine designer loss functions from the study of incentive compatible mechanisms giving learning algorithms an edge. I found this paper thought provoking that way. Since machine learning and information markets share a design goal, are there hybrid approaches which can outperform either?
  3. Mechanism Design. There are some notable similarities between papers in ML and mechanism design. For example, there are papers about learning on permutations and pricing in combinatorial markets. I haven’t yet taken the time to study these carefully, but I could imagine that one suggests advances for the other, and perhaps vice versa. In general, the idea of using mechanism design with context information (as is done in machine learning) could also be extremely powerful.
  4. Prediction Markets. Prediction markets are partly an empirical field and partly a mechanism design field. There seems to be relatively little understanding about how well, and how exactly, information from multiple agents is supposed to interact to derive a good probability estimate. For example, the current global recession reminds us that excess leverage is a very bad idea. The same problem comes up in machine learning and is solved by the weighted majority algorithm (and even more thoroughly by the hedge algorithm); a minimal version of weighted majority is sketched after this list. Can an information market be designed with the guarantee that an imperfect but best player decides the vote after not-too-many rounds? How would this scale as a function of the ratio of a participant’s initial wealth to the total wealth?
  5. Robust Search. Investigations into robust search are extremely diverse, essentially only unified in a mathematically based analysis. For people interested in robust search, machine learning and information markets provide a fertile ground for empirical application and new settings. Can all mechanisms for robust search be done with context information, as is common in learning? Do these approaches work empirically in machine learning or information markets?
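
As referenced in point (4), here is a minimal sketch of the deterministic weighted majority idea for binary predictions; the multiplicative penalty epsilon is an illustrative choice:

    def weighted_majority(expert_predictions, outcomes, epsilon=0.5):
        # expert_predictions: list of rounds, each a list of 0/1 expert guesses.
        # outcomes: the true 0/1 label for each round.
        n = len(expert_predictions[0])
        weights = [1.0] * n
        mistakes = 0
        for preds, y in zip(expert_predictions, outcomes):
            vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
            vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
            guess = 1 if vote_one >= vote_zero else 0
            if guess != y:
                mistakes += 1
            # Multiplicatively penalize every expert that was wrong this round.
            weights = [w * (1 - epsilon) if p != y else w
                       for w, p in zip(weights, preds)]
        return mistakes, weights

The wealth analogy is direct: an expert's weight plays the role of a participant's wealth, and the standard analysis bounds the algorithm's mistakes by the best expert's mistakes plus a term logarithmic in the number of experts.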

There are almost surely many other interesting research topics and borrowable techniques here, and probably even other communities oriented around prediction. While the synthesis of these fields is almost sure to eventually happen, I’d like to encourage it sooner rather than later. For someone working on one of these branches, attending a conference on one of the other branches might be a good start. At a lesser time investment, reading Oddhead is another.
