Lower Bounds for Learning Reductions

Learning reductions transform a solver of one type of learning problem into a solver of another type of learning problem. When we analyze these for robustness, we can make statements of the form "Reduction R has the property that regret r (or loss) on the subproblems of type A implies regret at most f(r) on the original problem of type B".

A lower bound for a learning reduction would have the form "for all reductions R, there exists a learning problem of type B and a learning algorithm for problems of type A where regret r on the induced problems implies regret at least f(r) on B".
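
To make the difference in quantifier structure explicit, here is one way to write the two statements out formally (the notation is mine, not the post's):

```latex
% A sketch of the quantifier structure (notation introduced here).
% regret_A = average regret on the induced subproblems of type A,
% regret_B = regret on the original problem of type B.

% Upper bound, proved for a particular reduction R:
\[
\forall B,\ \forall \text{base learners for } A:\quad
\mathrm{regret}_A \le r \;\Rightarrow\; \mathrm{regret}_B \le f(r).
\]

% Lower bound, holding against every reduction:
\[
\forall R,\ \exists B,\ \exists \text{base learner for } A:\quad
\mathrm{regret}_A \le r \;\text{ and }\; \mathrm{regret}_B \ge f(r).
\]
```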

The pursuit of lower bounds is often questionable because, unlike upper bounds, they do not yield practical algorithms. Nevertheless, they may be helpful as a tool for thinking about what is learnable and how learnable it is. This has already come up here and here.

At the moment, there is no coherent theory of lower bounds for learning reductions, and we have little understanding of how feasible they are or which techniques may be useful in proving them. Here is a rough summary of what I know:

  1. For structured prediction, we have a partially worked out lower bound for all reductions that use the structure to carry only single bits. A proof for reductions using the structure in other ways seems tricky at the moment.
  2. For reinforcement learning, it may (this is unclear) be possible to prove a lower bound showing that prediction ability alone cannot solve RL well.
  3. There are various results which can be thought of as lower bounds for more limited families of reductions. One example is analyzing exactly how badly margin optimization can underperform for 0-1 loss when there is noise.

Overall, this is a moderately interesting direction of research which has not been much investigated.

Reopening RL->Classification

In research, it’s often the case that solving a problem helps you realize that it wasn’t the right problem to solve. This is the case for the “reduce RL to classification” problem with the solution hinted at here and turned into a paper here.

The essential difficulty is that the method of stating and analyzing the reduction ends up being nonalgorithmic (unlike previous reductions), unless you work with learning from teleoperated robots as Greg Grudic does. This is because the reduction depends on the optimal policy (which a human teleoperator might simulate, but which is otherwise unavailable).

So, this problem is “open” again with the caveat that this time we want a more algorithmic solution.

Whether or not this is feasible at all is still unclear and evidence in either direction would greatly interest me. A positive answer might have many practical implications in the long run.

What is the right form of modularity in structured prediction?

Suppose you are given a sequence of observations x_1,…,x_T from some space and wish to predict a sequence of labels y_1,…,y_T so as to minimize the Hamming loss: sum_{i=1}^T I(y_i != c(x_1,…,x_T)_i), where c(x_1,…,x_T)_i is the i-th predicted component. For simplicity, suppose each label y_i is in {0,1}.

We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component y_i independently, since the loss is a linear combination of losses on each individual component i. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of Te. This breakup into T different prediction problems is not the standard form of modularity in structured prediction.
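
As a concrete sketch of this first form of modularity, here is what the per-component reduction might look like in code (my illustration, not anything from the post; scikit-learn's LogisticRegression is an arbitrary stand-in for the base binary learner):

```python
# A minimal sketch of the "one classifier per component" reduction.
# Assumptions (mine): labels are 0/1, observations are feature vectors,
# and any binary learner could replace LogisticRegression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_position(X, Y):
    """X: (n, T, d) observation sequences; Y: (n, T) binary label sequences.
    Learns an independent classifier for each position i, each one seeing
    the whole observation sequence x_1,...,x_T."""
    n, T, d = X.shape
    X_flat = X.reshape(n, T * d)
    return [LogisticRegression(max_iter=1000).fit(X_flat, Y[:, i]) for i in range(T)]

def expected_hamming_loss(classifiers, X, Y):
    n, T, d = X.shape
    X_flat = X.reshape(n, T * d)
    per_position_error = [np.mean(clf.predict(X_flat) != Y[:, i])
                          for i, clf in enumerate(classifiers)]
    # Average per-position error e gives expected Hamming loss T * e.
    return float(np.sum(per_position_error))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, T, d = 200, 5, 3
    X = rng.normal(size=(n, T, d))
    Y = (X[:, :, 0] > 0).astype(int)  # y_i depends only on x_i here
    clfs = train_per_position(X, Y)
    print("expected Hamming loss per sequence:", expected_hamming_loss(clfs, X, Y))
```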

A more typical form of modularity is to predict y_i given x_i, y_{i-1}, y_{i+1}, where the circularity (predicting given other predictions) is handled in various ways. This is often represented with a graphical model, such as a chain over the labels y_1,…,y_T with each y_i linked to its x_i.

This form of modularity seems to be preferred for several reasons:

  1. Graphical models of this sort are a natural language for expressing what we know (or believe we know) about a problem in advance.
  2. There may be computational advantages to learning to predict from fewer features. (But note that handling the circularity is sometimes computationally difficult.)
  3. There may be sample complexity advantages to learning to predict from fewer features. This is particularly true for many common learning algorithms.

The difficulty with this approach is that "errors accumulate". In particular, an average error rate of e for each of the predictors can easily imply a Hamming loss of O(eT^2). Matti Kaariainen convinced me this is not improvable for predictors of this form.
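
To get a feel for the difference between the Te guarantee above and the O(eT^2) accumulation, here is a small simulation (a toy model of my own, not Kaariainen's construction): the true label sequence is constant, and the chained predictor conditions on its own previous prediction, so a single per-step error corrupts everything downstream until a later error happens to flip it back.

```python
# Toy simulation (my construction, not from the post) contrasting error
# accumulation in chained prediction with the independent per-position
# reduction. True labels are all 0; both predictors err with probability
# e per step, but the chained one errs relative to its previous prediction.
import random

def simulate(T=100, e=0.001, trials=20000, seed=0):
    rng = random.Random(seed)
    indep_loss = chained_loss = 0
    for _ in range(trials):
        # Independent reduction: each position is wrong with probability e.
        indep_loss += sum(rng.random() < e for _ in range(T))
        # Chained predictor: an error flips the running prediction, so a
        # position is wrong whenever an odd number of flips occurred so far.
        pred = 0  # the correct label is always 0 in this toy model
        for _ in range(T):
            if rng.random() < e:
                pred ^= 1
            chained_loss += (pred != 0)
    print(f"independent Hamming loss: {indep_loss / trials:.2f}  (eT = {e * T:.2f})")
    print(f"chained Hamming loss:     {chained_loss / trials:.2f}  (eT^2/2 = {e * T * T / 2:.2f})")

if __name__ == "__main__":
    simulate()  # prints roughly 0.10 vs about 4.7 for T=100, e=0.001
```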

So, we have two forms of modularity. One is driven by the loss function, while the other is driven by the simplicity of the prediction descriptions. Each has advantages and disadvantages from a practical viewpoint. Can these different approaches be reconciled? Is there a compelling algorithm for solving structured prediction which incorporates both intuitions?

Regret minimizing vs error limiting reductions

This post is about a reductions-related problem that I find mysterious. There are two kinds of reductions analysis currently under consideration.

  1. Error limiting reductions. Here, the goal is to bound the error rate of the created classifier in terms of the error rate of the binary classifiers that you reduce to. A very simple example is error correcting output codes (ECOC), where it is possible to prove that, for certain codes, the multiclass error rate is at most 4 * the binary classifier error rate.
  2. Regret minimizing reductions. Here, the goal is to bound the regret of the created classifier in terms of the regret of the binary classifiers reduced to. The regret is the error rate minus the minimum achievable error rate; when the learning problem is noisy, the minimum error rate may not be 0. An analogous result for regret is that for a probabilistic error correcting output code, the multiclass regret is at most 4 * (binary regret)^0.5 (see the small comparison below).
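
To make the contrast concrete before discussing it, here is a quick back-of-the-envelope comparison (the specific numbers are mine, chosen purely for illustration) of what the two guarantees say under heavy label noise:

```python
# Quick numeric comparison (illustrative numbers, not from the post) of the
# two ECOC-style guarantees under label noise.

def ecoc_error_bound(binary_error):
    # Error limiting analysis: multiclass error <= 4 * binary error.
    return 4 * binary_error

def pecoc_regret_bound(binary_regret):
    # Regret analysis (probabilistic ECOC): multiclass regret <= 4 * (binary regret)^0.5.
    return 4 * binary_regret ** 0.5

binary_minimum_error = 0.25  # unavoidable error due to noise (assumed value)
binary_regret = 0.01         # distance of the learned classifiers from optimal (assumed value)
binary_error = binary_minimum_error + binary_regret

print("error-rate bound on multiclass error:", ecoc_error_bound(binary_error))    # 1.04 -> vacuous
print("regret bound on multiclass regret:", pecoc_regret_bound(binary_regret))    # 0.4
print("hoped-for linear regret bound (4r):", 4 * binary_regret)                   # 0.04
```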

The use of "regret" is more desirable than the use of error rates because (for example) the ECOC error rate bound implies nothing when there is enough noise that the binary classifiers always have error rate 0.25. However, the square root dependence introduced when analyzing regret is not desirable. A basic question is: Can we have the best of both worlds? Can we find some algorithm doing multiclass classification with binary classifiers such that average regret r for the binary classifiers implies average regret bounded by 4r for the multiclass classifier?

If the answer is “yes”, that reduction algorithm may be empirically superior to the one we use now.
If the answer is “no”, that is a sharp and unexpected distinction between error rate analysis and regret analysis.

DARPA project: LAGR

Larry Jackel has set up the LAGR ("Learning Applied to Ground Robotics") project (and competition), which seems to be quite well designed. Features include:

  1. Many participants (8 going on 12?)
  2. Standardized hardware. In the DARPA Grand Challenge, contestants entering with motorcycles are at a severe disadvantage compared to those entering with a Hummer. Similarly, contestants using more powerful sensors can gain huge advantages.
  3. Monthly contests, with full feedback (but since the hardware is standardized, only code is shipped). One of the premises of the program is that robust systems are desired. Monthly evaluations at different locations can help measure this and provide data.
  4. Attacks a known hard problem (cross-country driving).