Machine Learning (Theory)

3/8/2009

Prediction Science

One view of machine learning is that it’s about how to program computers to predict well. This suggests a broader research program centered around the more pervasive goal of simply predicting well.
There are many distinct strands of this broader research program which are only partially unified. Here are the ones that I know of:

  1. Learning Theory. Learning theory focuses on several topics related to the dynamics and process of prediction. Convergence bounds like the VC bound give an intellectual foundation to many learning algorithms. Online learning algorithms like Weighted Majority provide an alternate purely game theoretic foundation for learning. Boosting algorithms yield algorithms for purifying prediction abiliity. Reduction algorithms provide means for changing esoteric problems into well known ones.
  2. Machine Learning. A great deal of experience has accumulated in practical algorithm design from a mixture of paradigms, including bayesian, biological, optimization, and theoretical.
  3. Mechanism Design. The core focus in game theory is on equilibria, mostly typically Nash equilibria, but also many other kinds of equilibria. The point of equilibria, to a large extent, is predicting how agents will behave. When this is employed well, principally in mechanism design for auctions, it can be a very powerful concept.
  4. Prediction Markets. The basic idea in a prediction market is that commodities can be designed so that their buy/sell price reflects a form of wealth-weighted consensus estimate of the probability of some event. This is not simply mechanism design, because (a) the thin market problem must be dealt with and (b) the structure of plausible guarantees is limited.
  5. Predictive Statistics. Part of statistics focuses on prediction, essentially becoming indistinguishable from machine learning. The canonical example of this is tree building algorithms such as CART, random forests, and some varieties of boosting. Similarly the notion of probability, counting, and estimation are all handy.
  6. Robust Search. I have yet to find an example of robust search which isn’t useful—and there are several varieties. This includes active learning, robust min finding, and (more generally) compressed sensing and error correcting codes.

The lack of unification is fertile territory for new research, so perhaps it’s worthwhile to think about how these different research programs might benefit from each other.

  1. Learning Theory. The concept of mechanism design is mostly missing from learning theory, but it is sure to be essential when interactive agents are learning. We’ve found several applications for robust search as well as new settings for robust search such as active learning, and error correcting tournaments, but there are surely others.
  2. Machine Learning and Predictive Statistics. Machine learning has been applied to auction design. There is a strong relationship between incentive compatibility and choice of loss functions, both for choosing proxy losses and approximating the real loss function imposed by the world. It’s easy to imagine designer loss functions from the study of incentive compatibility mechanisms giving learning algorithm an edge. I found this paper thought provoking that way. Since machine learning and information markets share a design goal, are there hybrid approaches which can outperform either?
  3. Mechanism Design. There are some notable similarities between papers in ML and mechanism design. For example there are papers about learning on permutations and pricing in combinatorial markets. I haven’t yet taken the time to study these carefully, but I could imagine that one suggests advances for the other, and perhaps vice versa. In general, the idea of using mechanism design with context information (as is done in machine learning), could also be extremely powerful.
  4. Prediction Markets. Prediction markets are partly an empirical field and partly a mechanism design field. There seems to be relatively little understanding about how well and how exactly information from multiple agents is supposed to interact to derive a good probability estimate. For example, the current global recession reminds us that excess leverage is a very bad idea. The same problem comes up in machine learning and is solved by the weighted majority algorithm (and even more thoroughly by the hedge algorithm). Can an information market be designed with the guarantee that an imperfect but best player decides the vote after not-too-many rounds? How would this scale as a function of the ratio of a participants initial wealth to the total wealth?
  5. Robust Search. Investigations into robust search are extremely diverse, essentially only unified in a mathematically based analysis. For people interested in robust search, machine learning and information markets provide a fertile ground for empirical application and new settings. Can all mechanisms for robust search be done with context information, as is common in learning? Do these approaches work empirically in machine learning or information markets?

There are almost surely many other interesting research topics and borrowable techniques here, and probably even other communities oriented around prediction. While the synthesis of these fields is almost sure to eventually happen, I’d like to encourage it sooner rather than later. For someone working on one of these branches, attending a conference on one of the other branches might be a good start. At a lesser time investment, Oddhead is a good start.

7 Comments to “Prediction Science”
  1. [...] Machine Learning (Theory) » Prediction Science (tags: web blog ai) [...]

  2. Ravi Sastry says:

    What do you think about using ideas from Kernel estimators in Kernel Machines and vice-versa?

  3. Cyrl says:

    If by “kernel estimators” you mean things like Kernel Smoothing(*) (Parzen windows, non-parametric regression, etc.), then the two are very closely related.

    Kernel estimators / Gaussian Processes / Kernel machines all predict using a combination of labels weighted by a similarity function. They differ in how the combination is optimized. Kernel estimators have little to no parameter estimation but are costly at prediction time, while kernel machines optimize the kernel combination at training time, but predict faster (and often better)…

    I have no pointer to offer but I assume this is well known in the relevant literature as it was already discussed about 10 years ago when kernel machines got popular in Machine Learning.

    (*) I hope the link works. This blog lacks a preview function imho!

  4. Ravi Sastry says:

    Sure. However the idea of kernel in kernel estimators (smoothing) is completely different from the idea od kernel in kernel machine learning literature. For kernel machines one needs a positive definite kernel whereas there are almost no restrictions for the kernel function to be used in kernel smoothing. I was thinking if one could exploit ideas from one field to develop techniques in the other. A somewhat related paper is by Kristin Palckman
    A Risk Minimization Principle for a Class of Parzen Estimators

  5. Cyrl says:

    Well, I would disagree that the notion of kernel is “completely different” in the two contexts. Kernel machines tend to use positive definite kernels because they are convenient both practically and theoretically, but there has actually been some work on relaxing that constraints (X. Mary and S. Canu had some papers on that in 2003-2004).

    Both notions of kernel are so closely related that it’s hard to think of one without the other.

  6. Cyrl says:

    Sorry for the duplicate — can the last comment be removed?

  7. Anonymous says:

    are you sure

Leave a Reply


Powered by WordPress