What is the right form of modularity in structured prediction?

Suppose you are given a sequence of observations x_1,…,x_T from some space and wish to predict a sequence of labels y_1,…,y_T so as to minimize the Hamming loss sum_{i=1}^{T} I(y_i != c(x_1,…,x_T)_i), where c(x_1,…,x_T)_i is the ith predicted component. For simplicity, suppose each label y_i is in {0,1}.

We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component y_i independently, since the loss is a linear combination of the losses on each individual component i. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of Te. This breakup into T different prediction problems is not the standard form of modularity in structured prediction.
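As a concrete (and purely illustrative) sketch of this first form of modularity, the following trains one classifier per component and counts the summed component errors. The scikit-learn LogisticRegression base learner and the feature layout, where every classifier sees the full observation sequence, are my own arbitrary choices rather than anything prescribed above.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_independent_predictors(X, Y):
        """X: (n, T, d) observation sequences; Y: (n, T) binary label sequences.
        Learns one classifier per component i, each predicting y_i from the whole
        observation sequence x_1,...,x_T. (Sketch: assumes both labels occur at
        every position in the training data.)"""
        n, T, d = X.shape
        flat = X.reshape(n, T * d)  # every classifier may look at all of x
        return [LogisticRegression().fit(flat, Y[:, i]) for i in range(T)]

    def predict_sequence(classifiers, x):
        """Predict the label sequence for one observation sequence x of shape (T, d)."""
        flat = x.reshape(1, -1)
        return np.array([clf.predict(flat)[0] for clf in classifiers])

    def hamming_loss(y_true, y_pred):
        return int(np.sum(y_true != y_pred))  # number of mispredicted components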

A more typical form of modularity is to predict y_i given x_i, y_{i-1}, and y_{i+1}, where the circularity (predicting given other predictions) is handled in various ways. This is often represented with a graphical model.

This form of modularity seems to be preferred for several reasons:

  1. Graphical models of this sort are a natural language for expressing what we know (or believe we know) about a problem in advance.
  2. There may be computational advantages to learning to predict from fewer features. (But note that handling the circularity is sometimes computationally difficult.)
  3. There may be sample complexity advantages to learning to predict from fewer features. This is particularly true for many common learning algorithms.

The difficulty with this approach is that “errors accumulate”. In particular, an average error rate of e for each of the predictors can easily imply a Hamming loss of O(eT^2). Matti Kaariainen convinced me this is not improvable for predictors of this form.
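To see how a per-step error rate of e can turn into a Hamming loss on the order of eT^2, here is a toy simulation. This is my own construction for illustration, not the lower bound argument referenced above: the true sequence is all zeros, the ideal local rule is “copy the previous prediction”, each local prediction independently errs with probability e, and later locally-correct predictions faithfully propagate any mistake down the chain.

    import numpy as np

    def chained_hamming_loss(T, e, trials=2000, seed=0):
        rng = np.random.default_rng(seed)
        losses = []
        for _ in range(trials):
            prediction, prev = np.zeros(T, dtype=int), 0
            for i in range(T):
                flip = rng.random() < e      # the local predictor errs with probability e
                prev = prev ^ int(flip)      # otherwise it copies the previous prediction
                prediction[i] = prev
            losses.append(int(prediction.sum()))  # the true labels are all zero
        return float(np.mean(losses))

    for T in (10, 20, 40):
        print(T, chained_hamming_loss(T, e=0.01), "vs. independent-predictor loss", 0.01 * T)

For small e the simulated loss grows roughly like e*T(T+1)/2, the O(eT^2) behavior described above, compared with the Te achieved by the independent predictors.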

So, we have two forms of modularity. One is driven by the loss function while the other is driven by simplicity of the prediction descriptions. Each has advantages and disadvantages from a practical viewpoint. Can these different approaches be reconciled? Is there a compelling algorithm for solving structured prediction which incorporates both intuitions?

Regret minimizing vs error limiting reductions

This post is about a reductions-related problem that I find mysterious. There are two kinds of reductions analysis currently under consideration.

  1. Error limiting reductions. Here, the goal is to bound the error rate of the created classifier in terms of the error rate of the binary classifiers that you reduce to. A very simple example is error correcting output codes (ECOC), where it is possible to prove that, for certain codes, the multiclass error rate is at most 4 * the binary classifier error rate. (A minimal sketch of this style of reduction appears just after this list.)
  2. Regret minimizing reductions. Here, the goal is to bound the regret of the created classifier in terms of the regret of the binary classifiers reduced to. The regret is the error rate minus the minimum error rate; when the learning problem is noisy, the minimum error rate may not be 0. An analogous result for regret is that for a probabilistic error correcting output code, multiclass regret is at most 4 * (binary regret)^0.5.
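For concreteness, here is a minimal ECOC-style reduction sketch. It assumes scikit-learn; the random code matrix, the Hamming-distance decoding, and the LogisticRegression base learner are illustrative choices of mine, and this is not the specific code construction for which the factor-of-4 bounds are proved.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    class ECOCSketch:
        """Reduce k-class classification to n_bits binary problems via a code matrix."""
        def __init__(self, n_classes, n_bits, seed=0):
            rng = np.random.default_rng(seed)
            # One random binary codeword per class. (Sketch only: a column that is
            # constant across the classes appearing in the data would have to be dropped.)
            self.code = rng.integers(0, 2, size=(n_classes, n_bits))

        def fit(self, X, y):
            # One binary classifier per column (bit) of the code matrix.
            self.models = [LogisticRegression().fit(X, self.code[y, b])
                           for b in range(self.code.shape[1])]
            return self

        def predict(self, X):
            bits = np.column_stack([m.predict(X) for m in self.models])
            # Decode each example to the class whose codeword is nearest in Hamming distance.
            distances = (bits[:, None, :] != self.code[None, :, :]).sum(axis=2)
            return distances.argmin(axis=1)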

The use of “regret” is more desirable than the use of error rates because (for example) the ECOC error rate bound implies nothing when there is enough noise that the binary classifiers always have error rate 0.25. However, the square root dependence introduced when analyzing regret is not desirable. A basic question is: can we have the best of both worlds? Can we find some algorithm doing multiclass classification with binary classifiers such that an average regret of r for the binary classifiers implies an average regret bounded by 4r for the multiclass classifier?
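To make the vacuity claim concrete, here is the arithmetic under the two bounds quoted above (my own worked example) for the case where the binary error rate of 0.25 is entirely due to noise, i.e. the binary regret is 0:

    % error rate bound: vacuous, since every classifier has error rate at most 1
    e_{\text{multiclass}} \le 4\, e_{\text{binary}} = 4 \times 0.25 = 1
    % regret bound: still informative, since it forces near-optimal multiclass behavior
    r_{\text{multiclass}} \le 4 \sqrt{r_{\text{binary}}} = 4 \sqrt{0} = 0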

If the answer is “yes”, that reduction algorithm may be empirically superior to the one we use now.
If the answer is “no”, that is a sharp and unexpected distinction between error rate analysis and regret analysis.

Learning Reductions are Reductionist

This is about a fundamental motivation for the investigation of reductions in learning. It applies to many pieces of work other than my own.

The reductionist approach to problem solving is characterized by taking a problem, decomposing it into as-small-as-possible subproblems, discovering how to solve the subproblems, and then discovering how to use the solutions to the subproblems to solve larger problems. The reductionist approach to solving problems has often paid off very well. Computer science related examples of the reductionist approach include:

  1. Reducing computation to the transistor. All of our CPUs are built from transistors.
  2. Reducing rendering of images to rendering a triangle (or other simple polygons). Computers can now render near-realistic scenes in real time. The big breakthrough came from learning how to render many triangles quickly.

This approach to problem solving extends well beyond computer science. Many fields of science focus on theories making predictions about very simple systems. These predictions are then composed to make predictions about where spacecraft go, how large a cannonball needs to be, etc… Obviously this approach has been quite successful.

It is an open question whether or not this approach can really succeed at learning.

  1. Against: We know that successful learning requires the incorporation of prior knowledge in fairly arbitrary forms. This suggests that we cannot easily decompose the process of learning.
  2. For: We know that humans can succeed at general purpose learning. It may be that arbitrary prior knowledge is required to solve arbitrary learning problems, but perhaps there are specific learning algorithms incorporating specific prior knowledge capable of solving the specific problems we encounter.
  3. Neutral: We know that learning reductions sometimes work. We don’t yet have a good comparison of how well they work with other approaches.

Problem: Reductions and Relative Ranking Metrics

This, again, is something of a research direction rather than a single problem.

There are several metrics people care about which depend upon the relative ranking of examples, and there are sometimes good reasons to care about such metrics. Examples include AROC, “F1”, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class (Google’s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these.
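As a hypothetical illustration of what several of these metrics look like operationally, the following helper (my own naming, assuming scikit-learn for AROC and F1) computes them from real-valued classifier scores used to rank examples:

    import numpy as np
    from sklearn.metrics import roc_auc_score, f1_score

    def ranking_metrics(labels, scores, k=10, threshold=0.5):
        """labels: 0/1 array; scores: real-valued predictions used to rank examples.
        (Sketch: assumes at least one positive example.)"""
        order = np.argsort(-scores)          # best-ranked examples first
        ranked = labels[order]
        return {
            "AROC": roc_auc_score(labels, scores),
            "F1": f1_score(labels, (scores > threshold).astype(int)),
            "top1_in_class": float(ranked[0] == 1),       # is the top-ranked example in the class?
            "precision_at_k": float(ranked[:k].mean()),   # fraction of the top k in the class
            "lowest_ranked_positive": int(np.where(ranked == 1)[0].max()) + 1,  # 1-indexed rank
        }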

Problem: What does the ability to classify well imply about performance under these metrics?

Past Work

  1. Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC.
  2. Sample complexity bounds for AROC (and here).
  3. A paper on “Learning to Order Things”.

Difficulty: Several of these may be easy. Some of them may be hard.

Impact: Positive or negative results will broaden our understanding of the relationship between different learning goals. It might also yield new algorithms (via the reduction) for solving these problems.

Solution: Reinforcement Learning with Classification

I realized that the tools needed to solve the problem just posted were just created. I tried to sketch out the solution here (also available in .lyx and .tex). It is still quite sketchy (and probably only the few people who understand reductions well can follow it).

One of the reasons why I started this weblog was to experiment with “research in the open”, and this is an opportunity to do so. Over the next few days, I’ll be filling in details and trying to get things to make sense. If you have additions or ideas, please propose them.