Automated Labeling

One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this:

  1. Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about.
  2. Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way.
  3. Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy.

It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows:

  1. Identify some subset of observed values to predict from the others.
  2. Build a predictor.
  3. Use the output of the predictor to define a new prediction problem.
  4. Repeat…

Examples of this sort seem to come up in robotics very naturally. An extreme version of this is:

  1. Predict nearby things given touch sensor output.
  2. Predict medium distance things given the nearby predictor.
  3. Predict far distance things given the medium distance predictor.

Some of the participants in the LAGR project are using this approach.

A less extreme version was the DARPA grand challenge winner where the output of a laser range finder was used to form a road-or-not predictor for a camera image.

These automated labeling techniques transform an unsupervised learning problem into a supervised learning problem, which has huge implications: we understand supervised learning much better and can bring to bear a host of techniques.

The set of work on automated labeling is sketchy—right now it is mostly just an observed-as-useful technique for which we have no general understanding. Some relevant bits of algorithm and theory are:

  1. Reinforcement learning to classification reductions which convert rewards into labels.
  2. Cotraining which considers a setting containing multiple data sources. When predictors using different data sources agree on unlabeled data, an inferred label is automatically created.

It’s easy to imagine that undiscovered algorithms and theory exist to guide and use this empirically useful technique.

A question of quantification

This is about methods for phrasing and think about the scope of some theorems in learning theory. The basic claim is that there are several different ways of quantifying the scope which sound different yet are essentially the same.

  1. For all sequences of examples. This is the standard quantification in online learning analysis. Standard theorems would say something like “for all sequences of predictions by experts, the algorithm A will perform almost as well as the best expert.”
  2. For all training sets. This is the standard quantification for boosting analysis such as adaboost or multiclass boosting.
    Standard theorems have the form “for all training sets the error rate inequalities … hold”.
  3. For all distributions over examples. This is the one that we have been using for reductions analysis. Standard theorem statements have the form “For all distributions over examples, the error rate inequalities … hold”.

It is not quite true that each of these is equivalent. For example, in the online learning setting, quantifying “for all sequences of examples” implies “for all distributions over examples”, but not vice-versa.

However, in the context of either boosting or reductions these are equivalent because the algorithms operate in an element-wise fashion. To see the equivalence, note that:

  1. “For any training set” is equivalent to “For any sequence of examples” because a training set is a sequence and vice versa.
  2. “For any sequence of examples” is equivalent to “For any distribution over examples” when the theorems are about unconditional example transformations because:
    1. The uniform distribution over a sufficiently long sequence of examples can approximate any distribution we care about arbitrarily well.
    2. If the theorem holds “for all distributions”, it holds for the uniform distribution over the elements in any sequence of examples.

The natural debate here is “how should the theorems be quantified?” It is difficult to answer this debate based upon mathematical grounds because we just showed an equivalence. It is nevertheless important because it strongly influences how we think about algorithms and how easy it is to integrate the knowledge across different theories. Here are the arguments I know.

  1. For all sequences of examples.
    1. Learning theory people (at least) are used to thinking about “For all sequences of examples”.
    2. (Applied) Machine learning people are not so familiar with this form of quantification.
    3. When the algorithm is example-conditional such as in online learning, the quantification is more general than “for all distributions”.
  2. For all training sets.
    1. This is very simple.
    2. It is misleadingly simple. For example, a version of the adaboost theorem also applies to test sets using the test error rates of the base classifiers. It is fairly common for this to be misunderstood.
  3. For all distributions over examples.
    1. Distributions over examples is simply how most people think about learning problems.
    2. “For all distributions over examples” is easily and often confused with “For all distributions over examples accessed by IID draws”. It seems most common to encounter this confusion amongst learning theory folks.

What quantification should be used and why?
(My thanks to Yishay Mansour for clarifying the debate.)

SVM Adaptability

Several recent papers have shown that SVM-like optimizations can be used to handle several large family loss functions.

This is a good thing because it is implausible that the loss function imposed by the world can not be taken into account in the process of solving a prediction problem. Even people used to the hard-core Bayesian approach to learning often note that some approximations are almost inevitable in specifying a prior and/or integrating to achieve a posterior. Taking into account how the system will be evaluated can allow both computational effort and design effort to be focused so as to improve performance.

A current laundry list of capabilities includes:

  1. 2002 multiclass SVM including arbitrary cost matrices
  2. ICML 2003 Hidden Markov Models
  3. NIPS 2003 Markov Networks (see some discussion)
  4. EMNLP 2004 Context free grammars
  5. ICML 2004 Any loss (with much computation)
  6. ICML 2005 Any constrained linear prediction model (that’s my own name).
  7. ICML 2005 Any loss dependent on a contingency table

I am personally interested in how this relates to the learning reductions work which has similar goals, but works at a different abstraction level (the learning problem rather than algorithmic mechanism). The difference in abstraction implies that anything solvable by reduction should be solvable by a direct algorithmic mechanism. However, comparing and constrasting the results I know of it seems that what is solvable via reduction to classification versus what is solvable via direct SVM-like methods is currently incomparable.

  1. Can SVMs be tuned to directly solve (example dependent) cost sensitive classification? Obviously, they can be tuned indirectly via reduction, but it is easy to imagine more tractable direct optimizations.
  2. How efficiently can learning reductions be used to solve structured prediction problems? Structured prediction problems are instances of cost sensitive classification, but the regret transform efficiency which occurs when this embedding is done is too weak to be of interest.
  3. Are there any problems efficiently solvable by SVM-like algorithms which are not efficiently solvable via learning reductions?

Lower Bounds for Learning Reductions

Learning reductions transform a solver of one type of learning problem into a solver of another type of learning problem. When we analyze these for robustness we can make statement of the form “Reduction R has the property that regret r (or loss) on subproblems of type A implies regret at most f ( r ) on the original problem of type B“.

A lower bound for a learning reduction would have the form “for all reductions R, there exists a learning problem of type B and learning algorithm for problems of type A where regret r on induced problems implies at least regret f ( r ) for B“.

The pursuit of lower bounds is often questionable because, unlike upper bounds, they do not yield practical algorithms. Nevertheless, they may be helpful as a tool for thinking about what is learnable and how learnable it is. This has already come up here and here.

At the moment, there is no coherent theory of lower bounds for learning reductions, and we have little understanding of how feasible they are or which techniques may be useful in proving them. Here is a rough summary of what I know:

  1. For structured prediction, we have a partially worked out lower bound for all reductions using the structure to only carry single bits. A proof for reductions using the structure in others ways seems tricky at the moment.
  2. For Reinforcement learning it may (this is unclear) be possible to prove a lower bound showing that prediction ability alone can not solve RL well.
  3. There are various results which can be thought of as lower bounds for more limited families of reductions. One example is analyzing exactly how badly margin optimization can underperform for 0-1 loss when there is noise.

Overall, this is a moderately interesting direction of research which has not been much investigated.

Reopening RL->Classification

In research, it’s often the case that solving a problem helps you realize that it wasn’t the right problem to solve. This is the case for the “reduce RL to classification” problem with the solution hinted at here and turned into a paper here.

The essential difficulty is that the method of stating and analyzing reductions ends up being nonalgorithmic (unlike previous reductions) unless you work with learning from teleoperated robots as Greg Grudic does. The difficulty here is due to the reduction being dependent on the optimal policy (which a human teleoperator might simulate, but which is otherwise unavailable).

So, this problem is “open” again with the caveat that this time we want a more algorithmic solution.

Whether or not this is feasible at all is still unclear and evidence in either direction would greatly interest me. A positive answer might have many practical implications in the long run.