April 2005 – Machine Learning (Theory)

4/28/2005

Science Fiction and Research

A big part of doing research is imagining how things could be different, and then trying to figure out how to get there.

A big part of science fiction is imagining how things could be different, and then working through the implications.

Because of the similarity here, reading science fiction can sometimes be helpful in understanding and doing research. (And, hey, it’s fun.) Here’s a list of science fiction books I enjoyed which seem particularly relevant to computer science and (sometimes) learning systems:

Vernor Vinge, “True Names”, “A Fire Upon the Deep”
Marc Stiegler, “David’s Sling”, “Earthweb”
Charles Stross, “Singularity Sky”
Greg Egan, “Diaspora”
Joe Haldeman, “Forever Peace”

(There are surely many others.)

Incidentally, the nature of science fiction itself has changed. Decades ago, science fiction projected great increases in the power humans control (for example, E.E. Smith’s Lensman series). That didn’t really happen in the last 50 years. Instead, we gradually refined the degree to which we can control various kinds of power. Science fiction has changed to reflect this. This can be understood as a shift from physics-based progress to engineering-based or computer-science-based progress.

4/27/2005

DARPA project: LAGR

Larry Jackel has set up the LAGR (“Learning Applied to Ground Robotics”) project (and competition), which seems to be quite well designed. Features include:

Many participants (8 going on 12?)
Standardized hardware. In the DARPA Grand Challenge, contestants entering with motorcycles are at a severe disadvantage to those entering with a Hummer. Similarly, contestants using more powerful sensors can gain huge advantages.
Monthly contests, with full feedback (but since the hardware is standardized, only code is shipped). One of the premises of the program is that robust systems are desired. Monthly evaluations at different locations can help measure this and provide data.
Attacks a known hard problem (cross-country driving).

4/26/2005

To calibrate or not?

A calibrated predictor is one which predicts the probability of a binary event with the property: for every prediction p, among the times that p is predicted, the proportion of the time that 1 is observed is p.

Since there are infinitely many p, this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p.
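One way to make the “softening” concrete is to group predictions into bins and compare, within each bin, the average prediction with the observed frequency of 1s. The sketch below assumes ten equal-width bins; the function names (calibration_table, calibration_error) and the weighted-average summary are illustrative choices, not a standard definition.

```python
from collections import defaultdict


def calibration_table(predictions, outcomes, n_bins=10):
    """Group predictions into n_bins equal-width bins on [0, 1].

    Returns {bin index: (mean prediction, observed frequency of 1s, count)}
    for every bin that received at least one prediction.
    """
    bins = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[index].append((p, y))
    table = {}
    for index, pairs in sorted(bins.items()):
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        freq = sum(y for _, y in pairs) / len(pairs)
        table[index] = (mean_p, freq, len(pairs))
    return table


def calibration_error(predictions, outcomes, n_bins=10):
    """Count-weighted average gap between mean prediction and observed frequency."""
    table = calibration_table(predictions, outcomes, n_bins)
    total = sum(count for _, _, count in table.values())
    return sum(count * abs(mean_p - freq)
               for mean_p, freq, count in table.values()) / total
```

A predictor with calibration_error near zero is (approximately) calibrated at this bin resolution. Narrower bins give a stricter test but need more samples per bin.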

A great deal of effort has been devoted to strategies for achieving calibrated prediction (such as here), with statements like: (under minimal conditions) you can always make calibrated predictions.

Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way:

We want good probabilistic predictions.
Good probabilistic predictions are calibrated.
Therefore, we want calibrated predictions.

The “Therefore” step misses the fact that calibration is a necessary but not a sufficient characterization of good probabilities. For example, on the sequence “010101010…”, always predicting p=0.5 is calibrated, yet it ignores the perfectly predictable alternation.

This leads to the question: What is a sufficient characterization of good probabilities? There are several candidates:

From Vohra: Calibrated on all simple subsequences.
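Small squared error: sum_x (x-p_x)², where p_x is the predicted probability of a 1.
Small log loss: sum_x log (1/p_x), where p_x here is the probability assigned to the outcome actually observed.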
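The last two candidates are easy to compute, and doing so on the alternating sequence shows how they separate predictors that calibration alone cannot. This is a minimal sketch: the function names are hypothetical, and predictions are taken to be the probability of a 1, so the log loss uses p when a 1 is observed and 1-p when a 0 is observed.

```python
import math


def squared_error(predictions, outcomes):
    """Sum over the sequence of (x - p_x)^2."""
    return sum((x - p) ** 2 for p, x in zip(predictions, outcomes))


def log_loss(predictions, outcomes):
    """Sum over the sequence of log(1 / probability given to the observed outcome).

    Assumes the observed outcome is never given probability zero.
    """
    total = 0.0
    for p, x in zip(predictions, outcomes):
        prob_of_observed = p if x == 1 else 1.0 - p
        total += math.log(1.0 / prob_of_observed)
    return total


outcomes = [0, 1] * 5                        # the sequence 0101010101
constant = [0.5] * 10                        # calibrated, but uninformative
alternating = [float(x) for x in outcomes]   # tracks the pattern exactly

print(squared_error(constant, outcomes), log_loss(constant, outcomes))
print(squared_error(alternating, outcomes), log_loss(alternating, outcomes))
```

On these ten samples the constant predictor pays a squared error of 2.5 and a log loss of 10·log 2 (about 6.93), while the alternating predictor pays zero under both, even though both predictors are calibrated.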

I don’t yet understand which of these candidates is preferable.

There is a sense in which none of them can be preferred. In any complete prediction system, the probabilities are used in some manner, and there is some loss (or utility) associated with their use. The “real” goal is minimizing that loss. Depending on the sanity of the method using the probabilities, this may even imply that lying about the probabilities is preferred. Nevertheless, we can hope for a sane use of probabilities, and a sufficient mechanism for predicting good probabilities might eventually result in good performance for any sane use.

4/25/2005

Embeddings: what are they good for?

I’ve been looking at some recent embeddings work, and am struck by how beautiful the theory and algorithms are. It also makes me wonder, what are embeddings good for?

A few things immediately come to mind:

(1) For visualization of high-dimensional data sets.

In this case, one would like good algorithms for embedding specifically into 2- and 3-dimensional Euclidean spaces.

(2) For nonparametric modeling.

The usual nonparametric models (histograms, nearest neighbor) often require resources which are exponential in the dimension. So if the data actually lie close to some low-dimensional surface, it might be a good idea to first identify this surface and embed the data before applying the model.

Incidentally, for applications like these, it’s important to have a functional mapping from high to low dimension (one that can be applied to new points, not just the training set), which some techniques do not provide.
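To illustrate what a “functional mapping” buys, here is a minimal sketch using the simplest possible stand-in: a one-dimensional linear (PCA) embedding fit by power iteration. This is not one of the recent embedding algorithms referred to above; the function name fit_linear_embedding and the choice of starting vector are assumptions made for illustration. The point is only that fitting returns a function which can be applied to points that were not in the training set.

```python
import math


def fit_linear_embedding(points, iterations=200):
    """Fit a 1-D PCA embedding and return a function from points to coordinates."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]

    # Power iteration on the (implicit) covariance matrix.
    # Assumption: the starting vector is not orthogonal to the top direction.
    direction = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        scores = [sum(c[j] * direction[j] for j in range(d)) for c in centered]
        new = [sum(s * c[j] for s, c in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm == 0.0:  # degenerate data or an unlucky start; keep last direction
            break
        direction = [v / norm for v in new]

    def embed(point):
        """Map any d-dimensional point, seen or unseen, to its 1-D coordinate."""
        return sum((point[j] - mean[j]) * direction[j] for j in range(d))

    return embed


# Training data lying on a line inside 3-dimensional space.
train = [(float(t), 2.0 * t, 0.0) for t in range(5)]
embed = fit_linear_embedding(train)

# The mapping applies equally to a point that was never in the training set.
new_coordinate = embed((10.0, 20.0, 0.0))
```

A nearest neighbor or histogram model could then be run on the outputs of embed rather than on the raw coordinates. Techniques that only output coordinates for the training points would need some extra out-of-sample step to play the same role.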

(3) As a prelude to classifier learning.

The hope here is presumably that learning will be easier in the low-dimensional space, because of (i) better generalization and (ii) a more “natural” layout of the data.

I’d be curious to know of other uses for embeddings.

4/23/2005

Advantages and Disadvantages of Bayesian Learning

I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of:

Specify a prior over world models.
Integrate using Bayes’ law with respect to all observed information to compute a posterior over world models.
Predict according to the posterior.
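The three steps can be sketched on a toy problem. The example assumes a “world model” is just the bias of a coin, restricted to a small finite set so that the integration becomes a sum; the names posterior and predict_one and the particular prior are hypothetical.

```python
def posterior(prior, observations):
    """Step 2: apply Bayes' law to a finite prior over coin biases.

    prior maps each bias theta (probability of observing 1) to its prior weight.
    Assumes at least one model gives the observations nonzero probability.
    """
    unnormalized = {}
    for theta, weight in prior.items():
        likelihood = 1.0
        for x in observations:
            likelihood *= theta if x == 1 else 1.0 - theta
        unnormalized[theta] = weight * likelihood
    total = sum(unnormalized.values())
    return {theta: w / total for theta, w in unnormalized.items()}


def predict_one(post):
    """Step 3: probability that the next observation is 1, averaged over the posterior."""
    return sum(theta * w for theta, w in post.items())


# Step 1: a prior over three candidate world models.
prior = {0.25: 1 / 3, 0.5: 1 / 3, 0.75: 1 / 3}

post = posterior(prior, [1, 1, 0, 1])
next_is_one = predict_one(post)
```

After seeing three 1s and one 0, the posterior puts the most weight on the 0.75 model, and the prediction lands between 0.5 and 0.75. With continuous or high-dimensional world models the sums become integrals, which is where the computational difficulty comes from.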

Bayesian learning has many advantages over other learning programs:

Interpolation: Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the Mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe of “think harder” (about specifying a prior over world models) and “compute harder” (to calculate a posterior) will eventually succeed. Many other machine learning approaches don’t have this guarantee.
Language: Bayesian and near-Bayesian methods have an associated language for specifying priors and posteriors. This is significantly helpful when working on the “think harder” part of a solution.
Intuitions: Bayesian learning involves specifying a prior and integration, two activities which seem to be universally useful. (see intuitions).

With all of these advantages, Bayesian learning is a strong program. However, there are also some very significant disadvantages.

Information theoretically infeasible: It turns out that specifying a prior is extremely difficult. Roughly speaking, we must specify a real number for every setting of the world model parameters. Many people well-versed in Bayesian learning don’t notice this difficulty for two reasons:
1. They know languages allowing more compact specification of priors. Acquiring this knowledge takes some significant effort.
2. They lie. They don’t specify their actual prior, but rather one which is convenient. (This shouldn’t be taken too badly, because it often works.)
Computationally infeasible: Let’s suppose I could accurately specify a prior over every air molecule in a room. Even then, computing a posterior may be extremely difficult. This difficulty implies that computational approximation is required.
Unautomatic: The “think harder” part of the Bayesian research program is (in some sense) a “Bayesian employment” act. It guarantees that as long as new learning problems exist, there will be a need for Bayesian engineers to solve them. (Zoubin likes to counter that a superprior over all priors can be employed for automation, but this seems to add to the other disadvantages.)

Overall, if a learning problem must be solved, a Bayesian should probably be working on it and has a good chance of solving it.
I wish I knew whether or not the drawbacks can be convincingly addressed. My impression so far is “not always”.