Don’t mix the solution into the problem

A common defect of many pieces of research is defining the problem in terms of the solution. Here are some examples in learning:

  1. “The learning problem is finding a good separating hyperplane.”
  2. “The goal of learning is to minimize $(y-p)^2 + C\|w\|^2$, where $y$ is the observation, $p$ is the prediction, and $w$ is a parameter vector.”
  3. Defining the loss function to be the one that your algorithm optimizes rather than the one imposed by the world.

The fundamental reason why this is a defect is that it creates artificial boundaries on the solution. Artificial boundaries create the possibility of being blind-sided. For example, someone committing (1) or (2) above might be surprised to find a decision tree working well on a problem. Example (3) might result in someone else solving a learning problem better for real-world purposes, even if the solution is worse with respect to the algorithm’s optimization criterion. This defect should be avoided so as to not artificially limit your learning kungfu.

The way to avoid this defect while still limiting the scope of investigation to something you can manage is to be explicit.

  1. Admit what the real, world-imposed learning problem is. For example, “The problem is to find a classifier minimizing error rate”.
  2. Be explicit about where the problem ends and the solution begins. For example, “We use a support vector machine with an l2 loss to train a classifier. We use the l2 loss because it is an upper bound on the error rate which is computationally tractable to optimize.” (A sketch of this separation follows the list.)
  3. Finish the solution. For example “The error rate on our test set was 0.34.”
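
To make the separation concrete, here’s a minimal numpy sketch (the data, parameters, and names are hypothetical illustrations, not a recommended method): the training routine minimizes a regularized l2 surrogate loss because that is tractable, while the number reported at the end is the world-imposed test error rate.

    import numpy as np

    def train_l2_classifier(X, y, C=0.1, lr=0.01, steps=1000):
        # Solution: minimize the surrogate loss sum_i (y_i - x_i.w)^2 + C*||w||^2,
        # chosen because it is tractable, not because the world imposes it.
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            grad = -2 * X.T @ (y - X @ w) + 2 * C * w
            w -= lr * grad / len(y)
        return w

    def error_rate(w, X, y):
        # Problem: the world-imposed loss is the misclassification rate.
        return np.mean(np.sign(X @ w) != y)

    # Toy data with labels in {-1, +1}.
    rng = np.random.default_rng(0)
    true_w = rng.normal(size=5)
    X_train, X_test = rng.normal(size=(100, 5)), rng.normal(size=(50, 5))
    y_train, y_test = np.sign(X_train @ true_w), np.sign(X_test @ true_w)

    w = train_l2_classifier(X_train, y_train)
    print("test error rate:", error_rate(w, X_test, y_test))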

It is important to note that this is not a critique about any particular method for solving learning problems, but rather about the process of thinking about learning problems. Eliminating this thinking-bug will leave people more capable of appreciating and using different approaches to solve the real learning problem.

Science Fiction and Research

A big part of doing research is imagining how things could be different, and then trying to figure out how to get there.

A big part of science fiction is imagining how things could be different, and then working through the implications.

Because of the similarity here, reading science fiction can sometimes be helpful in understanding and doing research. (And, hey, it’s fun.) Here’s a list of science fiction books I enjoyed which seem particularly relevant to computer science and (sometimes) learning systems:

  1. Vernor Vinge, “True Names”, “A Fire Upon the Deep”
  2. Marc Stiegler, “David’s Sling”, “Earthweb”
  3. Charles Stross, “Singularity Sky”
  4. Greg Egan, “Diaspora”
  5. Joe Haldeman, “Forever Peace”

(There are surely many others.)

Incidentally, the nature of science fiction itself has changed. Decades ago, science fiction projected great increases in the power humans control (for example, E.E. Smith’s Lensman series). That didn’t really happen in the last 50 years. Instead, we gradually refined the degree to which we can control various kinds of power. Science fiction has changed to reflect this. This can be understood as a shift from physics-based progress to engineering- or computer-science-based progress.

To calibrate or not?

A calibrated predictor is one which predicts the probability of a binary event with the property that, for every prediction value p, among the events where p is predicted, the proportion of the time that 1 is observed is p.

Since there are infinitely many p, this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p.
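
To make the softening concrete, here’s a hypothetical numpy sketch that bins predictions into small neighborhoods and compares each bin’s average prediction to the observed frequency of 1s (the number of bins is an arbitrary choice, not part of the definition):

    import numpy as np

    def calibration_table(p, y, n_bins=10):
        # p: predicted probabilities in [0,1]; y: observed outcomes in {0,1}.
        # Each bin is a small neighborhood of prediction values.
        bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
        rows = []
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                rows.append((p[mask].mean(), y[mask].mean(), int(mask.sum())))
        return rows  # (mean prediction, observed frequency of 1, count) per bin

    # The predictor is (approximately) calibrated when the first two numbers
    # in each row agree.
    p = np.array([0.1, 0.1, 0.5, 0.5, 0.8, 0.8, 0.8, 0.8])
    y = np.array([0,   0,   1,   0,   1,   1,   1,   0])
    for mean_p, freq, n in calibration_table(p, y, n_bins=5):
        print("predicted %.2f  observed %.2f  (n=%d)" % (mean_p, freq, n))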

A great deal of effort has been devoted to strategies for achieving calibrated prediction (such as here), with statements like: (under minimal conditions) you can always make calibrated predictions.

Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way:

  1. We want good probabilistic predictions.
  2. Good probabilistic predictions are calibrated.
  3. Therefore, we want calibrated predictions.

The “Therefore” step misses the fact that calibration is a necessary but not sufficient characterization of good probabilities. For example, on the sequence “010101010…”, always predicting p=0.5 is calibrated.

This leads to the question: What is a sufficient characterization of good probabilities? There are several candidates:

  1. From Vohra: Calibrated on all simple subsequences.
  2. Small squared error: $\sum_x (x - p_x)^2$.
  3. Small log probability: $\sum_x \log(1/p_x)$.
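
To see how candidates (2) and (3) treat the “010101010…” example above, here’s a small numpy sketch (it assumes the $p_x$ in candidate 3 means the probability assigned to the observed outcome): the constant p=0.5 predictor is calibrated, yet it scores much worse than a predictor which tracks the alternating pattern.

    import numpy as np

    def squared_error(p, x):
        # Candidate 2: sum over the sequence of (x - p_x)^2.
        return np.sum((x - p) ** 2)

    def log_loss(p, x):
        # Candidate 3: sum over the sequence of log(1/p_x), where p_x is taken
        # to be the probability the predictor assigned to the observed outcome.
        prob_of_observed = np.where(x == 1, p, 1 - p)
        return np.sum(np.log(1.0 / prob_of_observed))

    x = np.array([0, 1] * 50)                   # the sequence 0101...
    p_constant = np.full(100, 0.5)              # calibrated, but uninformative
    p_pattern = np.where(x == 1, 0.99, 0.01)    # tracks the alternating pattern

    for name, p in [("constant 0.5", p_constant), ("pattern", p_pattern)]:
        print(name, "squared error:", squared_error(p, x),
              "log loss:", round(log_loss(p, x), 2))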

I don’t yet understand which of these candidates is preferable.

There is a sense in which none of them can be preferred. In any complete prediction system, the probabilities are used in some manner, and there is some loss (or utility) associated with their use. The “real” goal is minimizing that loss. Depending on the sanity of the method using the probabilities, this may even imply that lying about the probabilities is preferred. Nevertheless, we can hope for a sane use of probabilities, and a sufficient mechanism for predicting good probabilities might eventually result in good performance for any sane use.

Embeddings: what are they good for?

I’ve been looking at some recent embeddings work, and am struck by how beautiful the theory and algorithms are. It also makes me wonder, what are embeddings good for?

A few things immediately come to mind:

(1) For visualization of high-dimensional data sets.

In this case, one would like good algorithms for embedding specifically into 2- and 3-dimensional Euclidean spaces.

(2) For nonparametric modeling.

The usual nonparametric models (histograms, nearest neighbor) often require resources which are exponential in the dimension. So if the data actually lie close to some low-dimensional surface, it might be a good idea to first identify this surface and embed the data before applying the model.

Incidentally, for applications like these, it’s important to have a functional mapping from high to low dimension, which some techniques do not provide.

(3) As a prelude to classifier learning.

The hope here is presumably that learning will be easier in the low-dimensional space, because of (i) better generalization and (ii) a more “natural” layout of the data.
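
As a rough illustration of uses (2) and (3), here’s a hypothetical numpy sketch: PCA stands in for a fancier manifold embedding (it does provide the functional mapping mentioned above, so test points can be embedded as well), and a 1-nearest-neighbor classifier is then applied in the low-dimensional space.

    import numpy as np

    def fit_pca_embedding(X, k):
        # Return a functional mapping from the original space to the top-k
        # principal subspace, so new points can be embedded too.
        mean = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
        return lambda Z: (Z - mean) @ Vt[:k].T

    def nearest_neighbor_predict(X_train, y_train, X_test):
        # 1-nearest neighbor: the kind of nonparametric model that suffers
        # badly when the dimension is high.
        dists = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
        return y_train[dists.argmin(axis=1)]

    # Toy data: 100-dimensional points that actually lie near a 3-d subspace.
    rng = np.random.default_rng(0)
    basis = rng.normal(size=(3, 100))
    coords = rng.normal(size=(300, 3))
    X = coords @ basis + 0.01 * rng.normal(size=(300, 100))
    y = (coords[:, 0] > 0).astype(int)

    embed = fit_pca_embedding(X[:200], k=3)   # identify the low-dimensional surface
    y_hat = nearest_neighbor_predict(embed(X[:200]), y[:200], embed(X[200:]))
    print("test error rate:", np.mean(y_hat != y[200:]))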

I’d be curious to know of other uses for embeddings.