Machine Learning (Theory)


Automated Labeling

One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this:

  1. Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about.
  2. Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way.
  3. Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy.

It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows:

  1. Identify some subset of observed values to predict from the others.
  2. Build a predictor.
  3. Use the output of the predictor to define a new prediction problem.
  4. Repeat…

Examples of this sort seem to come up in robotics very naturally. An extreme version of this is:

  1. Predict nearby things given touch sensor output.
  2. Predict medium distance things given the nearby predictor.
  3. Predict far distance things given the medium distance predictor.

Some of the participants in the LAGR project are using this approach.

A less extreme version was the DARPA Grand Challenge winner, where the output of a laser range finder was used to form a road-or-not predictor for a camera image.

These automated labeling techniques transform an unsupervised learning problem into a supervised learning problem, which has huge implications: we understand supervised learning much better and can bring to bear a host of techniques.
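
The laser-to-camera idea can be sketched in a few lines of Python. Everything below is a hypothetical illustration: the synthetic patches, the `laser_says_road` rule, and the one-feature threshold learner are assumptions made for the example, not a description of any real Grand Challenge system.

```python
import random

random.seed(0)

def laser_says_road(roughness):
    # Hypothetical short-range sensor rule: smooth terrain is road.
    return roughness < 0.5

def make_patch():
    # Synthetic world: road patches are smooth and have a low color value.
    is_road = random.random() < 0.5
    roughness = random.uniform(0.0, 0.4) if is_road else random.uniform(0.6, 1.0)
    color = random.gauss(0.3, 0.1) if is_road else random.gauss(0.7, 0.1)
    return roughness, color, is_road

# Step 1: nearby patches are seen by both sensors; the laser output becomes the label.
nearby = [make_patch() for _ in range(500)]
labeled = [(color, laser_says_road(rough)) for rough, color, _ in nearby]

# Step 2: build a camera-only predictor (a threshold halfway between the class means).
road_colors = [c for c, y in labeled if y]
other_colors = [c for c, y in labeled if not y]
threshold = (sum(road_colors) / len(road_colors)
             + sum(other_colors) / len(other_colors)) / 2

def camera_says_road(color):
    return color < threshold

# Step 3: apply the camera predictor to far patches that the laser cannot reach.
far = [make_patch() for _ in range(500)]
accuracy = sum(camera_says_road(c) == truth for _, c, truth in far) / len(far)
print(f"camera predictor accuracy on far patches: {accuracy:.2f}")
```

No human ever labels a patch here: the only supervision is the output of another sensor, which is what turns the problem into ordinary supervised learning.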

The set of work on automated labeling is sketchy—right now it is mostly just an observed-as-useful technique for which we have no general understanding. Some relevant bits of algorithm and theory are:

  1. Reinforcement learning to classification reductions which convert rewards into labels.
  2. Cotraining which considers a setting containing multiple data sources. When predictors using different data sources agree on unlabeled data, an inferred label is automatically created.
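
The agreement rule in the cotraining item can be sketched as follows. This is a simplified illustration of the description above rather than any specific published variant; the two predictors and the web-page data are made up for the example.

```python
def infer_labels(predict_a, predict_b, unlabeled):
    """Return (example, label) pairs where the two view-specific predictors agree."""
    inferred = []
    for view_a, view_b in unlabeled:
        label_a, label_b = predict_a(view_a), predict_b(view_b)
        if label_a == label_b:
            inferred.append(((view_a, view_b), label_a))
    return inferred

# Hypothetical example: view A is the page text, view B is the anchor text of inbound links.
predict_from_text = lambda text: "course" in text
predict_from_links = lambda anchors: "syllabus" in anchors

pages = [
    ("course notes for ml", "syllabus and homework"),  # both predictors say True
    ("my vacation photos", "holiday pictures"),        # both predictors say False
    ("course catalog", "campus map"),                  # disagreement: no label is created
]
new_examples = infer_labels(predict_from_text, predict_from_links, pages)
print(len(new_examples))
```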

It’s easy to imagine that undiscovered algorithms and theory exist to guide and use this empirically useful technique.


A question of quantification

Tags: Definitions,Online,Reductions jl@ 7:39 am

This is about methods for phrasing and thinking about the scope of some theorems in learning theory. The basic claim is that there are several different ways of quantifying the scope which sound different yet are essentially the same.

  1. For all sequences of examples. This is the standard quantification in online learning analysis. Standard theorems would say something like “for all sequences of predictions by experts, the algorithm A will perform almost as well as the best expert.”
  2. For all training sets. This is the standard quantification for boosting analysis such as AdaBoost or multiclass boosting. Standard theorems have the form “for all training sets the error rate inequalities … hold”.
  3. For all distributions over examples. This is the one that we have been using for reductions analysis. Standard theorem statements have the form “For all distributions over examples, the error rate inequalities … hold”.
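
To make the "for all sequences" style of guarantee concrete, here is a sketch using the deterministic Weighted Majority algorithm as a stand-in for "the algorithm A" (the text does not name a particular algorithm). The expert set, the outcome sequence, and the penalty `beta=0.5` are illustrative choices. For this variant the known guarantee is that the number of mistakes is at most roughly 2.41(m + log2 n), where m is the best expert's mistake count and n is the number of experts.

```python
import math

def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Run deterministic Weighted Majority on binary predictions.

    expert_predictions[t][i] is expert i's 0/1 prediction at round t.
    Returns (algorithm mistakes, list of per-expert mistakes).
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n
    algo_mistakes = 0
    expert_mistakes = [0] * n
    for preds, outcome in zip(expert_predictions, outcomes):
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        guess = 1 if vote_one >= vote_zero else 0
        algo_mistakes += int(guess != outcome)
        for i, p in enumerate(preds):
            if p != outcome:
                weights[i] *= beta        # penalize every expert that erred
                expert_mistakes[i] += 1
    return algo_mistakes, expert_mistakes

# An arbitrary, non-IID sequence: a block pattern that is flipped every 7th round.
rounds = range(56)
block = [int((t // 4) % 2 == 0) for t in rounds]
outcomes = [1 - b if t % 7 == 0 else b for t, b in zip(rounds, block)]
# Experts: always 1, always 0, alternating, and the unflipped block pattern.
experts = [[1, 0, int(t % 2 == 0), block[t]] for t in rounds]

mistakes, per_expert = weighted_majority(experts, outcomes)
bound = 2.41 * (min(per_expert) + math.log2(len(per_expert)))
print(mistakes, min(per_expert), round(bound, 2))
```

Nothing in the analysis of this algorithm assumes that the outcomes are drawn from a distribution, which is why the quantification is over sequences.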

It is not quite true that these are all equivalent. For example, in the online learning setting, quantifying “for all sequences of examples” implies “for all distributions over examples”, but not vice-versa.

However, in the context of either boosting or reductions these are equivalent because the algorithms operate in an element-wise fashion. To see the equivalence, note that:

  1. “For any training set” is equivalent to “For any sequence of examples” because a training set is a sequence and vice versa.
  2. “For any sequence of examples” is equivalent to “For any distribution over examples” when the theorems are about unconditional example transformations because:
    1. The uniform distribution over a sufficiently long sequence of examples can approximate any distribution we care about arbitrarily well.
    2. If the theorem holds “for all distributions”, it holds for the uniform distribution over the elements in any sequence of examples.
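
The second direction of the argument can be checked directly: the error rate on a sequence is the expected error under the uniform distribution over that sequence's elements. The sequence and the threshold classifier below are arbitrary illustrative choices.

```python
from collections import Counter
from fractions import Fraction

def error_on_sequence(classifier, sequence):
    # Error rate quantified "for a sequence of examples".
    wrong = sum(classifier(x) != y for x, y in sequence)
    return Fraction(wrong, len(sequence))

def error_under_distribution(classifier, distribution):
    # Expected error quantified "for a distribution over examples".
    return sum(p for (x, y), p in distribution.items() if classifier(x) != y)

sequence = [(0, 0), (1, 1), (2, 1), (1, 0), (1, 1), (3, 0)]
# Uniform distribution over the elements of the sequence (repeats get more mass).
uniform = {ex: Fraction(c, len(sequence)) for ex, c in Counter(sequence).items()}

classifier = lambda x: int(x >= 1)
seq_error = error_on_sequence(classifier, sequence)
dist_error = error_under_distribution(classifier, uniform)
print(seq_error, dist_error)
```

The check relies on the classifier treating each example on its own, which is the element-wise condition mentioned above.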

The natural debate here is “how should the theorems be quantified?” It is difficult to answer this debate based upon mathematical grounds because we just showed an equivalence. It is nevertheless important because it strongly influences how we think about algorithms and how easy it is to integrate the knowledge across different theories. Here are the arguments I know.

  1. For all sequences of examples.
    1. Learning theory people (at least) are used to thinking about “For all sequences of examples”.
    2. (Applied) Machine learning people are not so familiar with this form of quantification.
    3. When the algorithm is example-conditional such as in online learning, the quantification is more general than “for all distributions”.
  2. For all training sets.
    1. This is very simple.
    2. It is misleadingly simple. For example, a version of the AdaBoost theorem also applies to test sets using the test error rates of the base classifiers. It is fairly common for this to be misunderstood.
  3. For all distributions over examples.
    1. Thinking in terms of distributions over examples is simply how most people approach learning problems.
    2. “For all distributions over examples” is easily and often confused with “For all distributions over examples accessed by IID draws”. It seems most common to encounter this confusion amongst learning theory folks.

What quantification should be used and why?
(My thanks to Yishay Mansour for clarifying the debate.)


SVM Adaptability

Tags: Papers,Reductions,structured jl@ 12:33 am

Several recent papers have shown that SVM-like optimizations can be used to handle several large families of loss functions.

This is a good thing because it is implausible that the loss function imposed by the world can not be taken into account in the process of solving a prediction problem. Even people used to the hard-core Bayesian approach to learning often note that some approximations are almost inevitable in specifying a prior and/or integrating to achieve a posterior. Taking into account how the system will be evaluated can allow both computational effort and design effort to be focused so as to improve performance.

A current laundry list of capabilities includes:

  1. 2002 multiclass SVM including arbitrary cost matrices
  2. ICML 2003 Hidden Markov Models
  3. NIPS 2003 Markov Networks (see some discussion)
  4. EMNLP 2004 Context free grammars
  5. ICML 2004 Any loss (with much computation)
  6. ICML 2005 Any constrained linear prediction model (that’s my own name).
  7. ICML 2005 Any loss dependent on a contingency table

I am personally interested in how this relates to the learning reductions work which has similar goals, but works at a different abstraction level (the learning problem rather than algorithmic mechanism). The difference in abstraction implies that anything solvable by reduction should be solvable by a direct algorithmic mechanism. However, comparing and contrasting the results I know of, it seems that what is solvable via reduction to classification versus what is solvable via direct SVM-like methods is currently incomparable.

  1. Can SVMs be tuned to directly solve (example dependent) cost sensitive classification? Obviously, they can be tuned indirectly via reduction, but it is easy to imagine more tractable direct optimizations.
  2. How efficiently can learning reductions be used to solve structured prediction problems? Structured prediction problems are instances of cost sensitive classification, but the regret transform efficiency which occurs when this embedding is done is too weak to be of interest.
  3. Are there any problems efficiently solvable by SVM-like algorithms which are not efficiently solvable via learning reductions?


Lower Bounds for Learning Reductions

Tags: Problems,Reductions jl@ 10:40 pm

Learning reductions transform a solver of one type of learning problem into a solver of another type of learning problem. When we analyze these for robustness we can make statements of the form “Reduction R has the property that regret r (or loss) on subproblems of type A implies regret at most f(r) on the original problem of type B”.

A lower bound for a learning reduction would have the form “for all reductions R, there exists a learning problem of type B and learning algorithm for problems of type A where regret r on induced problems implies at least regret f(r) for B”.

The pursuit of lower bounds is often questionable because, unlike upper bounds, they do not yield practical algorithms. Nevertheless, they may be helpful as a tool for thinking about what is learnable and how learnable it is. This has already come up here and here.

At the moment, there is no coherent theory of lower bounds for learning reductions, and we have little understanding of how feasible they are or which techniques may be useful in proving them. Here is a rough summary of what I know:

  1. For structured prediction, we have a partially worked out lower bound for all reductions using the structure to only carry single bits. A proof for reductions using the structure in other ways seems tricky at the moment.
  2. For reinforcement learning, it may (this is unclear) be possible to prove a lower bound showing that prediction ability alone cannot solve RL well.
  3. There are various results which can be thought of as lower bounds for more limited families of reductions. One example is analyzing exactly how badly margin optimization can underperform for 0-1 loss when there is noise.

Overall, this is a moderately interesting direction of research which has not been much investigated.


Reopening RL->Classification

In research, it’s often the case that solving a problem helps you realize that it wasn’t the right problem to solve. This is the case for the “reduce RL to classification” problem with the solution hinted at here and turned into a paper here.

The essential difficulty is that the method of stating and analyzing reductions ends up being nonalgorithmic (unlike previous reductions) unless you work with learning from teleoperated robots as Greg Grudic does. The difficulty here is due to the reduction being dependent on the optimal policy (which a human teleoperator might simulate, but which is otherwise unavailable).

So, this problem is “open” again with the caveat that this time we want a more algorithmic solution.

Whether or not this is feasible at all is still unclear and evidence in either direction would greatly interest me. A positive answer might have many practical implications in the long run.


What is the right form of modularity in structured prediction?

Tags: Problems,Reductions,structured jl@ 10:57 pm

Suppose you are given a sequence of observations $x_1, \dots, x_T$ from some space and wish to predict a sequence of labels $y_1, \dots, y_T$ so as to minimize the Hamming loss: $\sum_{i=1}^{T} I(y_i \neq c(x_1, \dots, x_T)_i)$ where $c(x_1, \dots, x_T)_i$ is the $i$th predicted component. For simplicity, suppose each label $y_i$ is in $\{0,1\}$.

We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component $y_i$ independently, since the loss is a linear combination of losses on each individual component $i$. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of $e$ over these classifiers implies an expected Hamming loss of $Te$. This breakup into $T$ different prediction problems is not the standard form of modularity in structured prediction.
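
A small sketch of the linearity argument: with one classifier per position, the Hamming loss equals T times the average per-position error. The per-position classifiers and the data below are hypothetical and chosen only to make the arithmetic visible.

```python
from fractions import Fraction

def hamming_loss(y_true, y_pred):
    # Number of positions where the predicted label differs from the true label.
    return sum(int(a != b) for a, b in zip(y_true, y_pred))

# Hypothetical per-position classifiers, each of which sees the whole input sequence.
def make_position_classifier(i):
    return lambda xs: int(xs[i] > 0.5)

xs = [0.9, 0.2, 0.7, 0.4, 0.6]
ys = [1, 0, 0, 0, 1]
T = len(xs)
classifiers = [make_position_classifier(i) for i in range(T)]
prediction = [c(xs) for c in classifiers]

per_position_errors = [int(p != y) for p, y in zip(prediction, ys)]
average_error = Fraction(sum(per_position_errors), T)
loss = hamming_loss(ys, prediction)
# Linearity: the Hamming loss is exactly T times the average per-position error.
print(loss, T * average_error)
```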

A more typical form of modularity is to predict $y_i$ given $x_i$, $y_{i-1}$, $y_{i+1}$ where the circularity (predicting given other predictions) is handled in various ways. This is often represented with a graphical model like so:

This form of modularity seems to be preferred for several reasons:

  1. Graphical models of this sort are a natural language for expressing what we know (or believe we know) about a problem in advance.
  2. There may be computational advantages to learning to predict from fewer features. (But note that handling the circularity is sometimes computationally difficult.)
  3. There may be sample complexity advantages to learning to predict from fewer features. This is particularly true for many common learning algorithms.

The difficulty with this approach is that “errors accumulate”. In particular, an average error rate of $e$ for each of the predictors can easily imply a Hamming loss of $O(eT^2)$. Matti Kaariainen convinced me this is not improvable for predictors of this form.

So, we have two forms of modularity. One is driven by the loss function while the other is driven by simplicity of prediction descriptions. Each has advantages and disadvantages from a practical viewpoint. Can these different approaches be reconciled? Is there a compelling algorithm for solving structured prediction which incorporates both intuitions?


Regret minimizing vs error limiting reductions

Tags: Problems,Reductions jl@ 2:21 pm

This post is about a reductions-related problem that I find mysterious. There are two kinds of reductions analysis currently under consideration.

  1. Error limiting reductions. Here, the goal is to bound the error rate of the created classifier in terms of the error rate of the binary classifiers that you reduce to. A very simple example of this is error correcting output codes, where it is possible to prove that for certain codes, the multiclass error rate is at most 4 times the binary classifier error rate.
  2. Regret minimizing reductions. Here, the goal is to bound the regret of the created classifier in terms of the regret of the binary classifiers reduced to. The regret is the error rate minus the minimum error rate. When the learning problem is noisy the minimum error rate may not be 0. An analogous result for regret is that for a probabilistic error correcting output code, multiclass regret is at most $4 \cdot (\text{binary regret})^{0.5}$.

The use of “regret” is more desirable than the use of error rates, because (for example) the ECOC error rate bound implies nothing when there is enough noise so that the binary classifiers always have error rate 0.25: the bound is then 4 * 0.25 = 1, which holds trivially. However, the square root dependence introduced when analyzing regret is not desirable. A basic question is: Can we have the best of both worlds? Can we find some algorithm doing multiclass classification with binary classifiers such that average regret r for the binary classifiers implies average regret bounded by 4r for the multiclass classifier?

If the answer is “yes”, that reduction algorithm may be empirically superior to the one we use now.
If the answer is “no”, that is a sharp and unexpected distinction between error rate analysis and regret analysis.
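
The arithmetic behind the two bounds can be sketched as follows. The noise model (labels equal to 1 with probability 0.75 regardless of the input) is an assumption chosen so that the best possible binary error rate is 0.25; the constants 4 and the square root are the ones quoted in the list above.

```python
import math

def min_error_rate(p_positive):
    # If P(y=1|x) = p for every x, the best classifier errs with probability min(p, 1-p).
    return min(p_positive, 1 - p_positive)

def regret(error_rate, p_positive):
    # Regret is the error rate minus the minimum achievable error rate.
    return error_rate - min_error_rate(p_positive)

# Noisy binary problem: an optimal classifier still has error rate 0.25.
binary_error = 0.25
binary_regret = regret(binary_error, p_positive=0.75)

error_bound = min(1.0, 4 * binary_error)      # error-limiting bound, capped at the trivial value 1
regret_bound = 4 * math.sqrt(binary_regret)   # regret bound with the square root dependence
print(error_bound, regret_bound)

# For small nonzero regret the square root is costly compared with the hoped-for 4r.
for r in (0.0001, 0.01):
    print(r, 4 * math.sqrt(r), 4 * r)
```

In this noisy case the error bound says nothing, while the regret bound correctly reports that an optimal binary classifier yields an optimal multiclass classifier.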


Learning Reductions are Reductionist

Tags: Reductions jl@ 8:23 am

This is about a fundamental motivation for the investigation of reductions in learning. It applies to many pieces of work other than my own.

The reductionist approach to problem solving is characterized by taking a problem, decomposing it into as-small-as-possible subproblems, discovering how to solve the subproblems, and then discovering how to use the solutions to the subproblems to solve larger problems. The reductionist approach to solving problems has often paid off very well. Computer science related examples of the reductionist approach include:

  1. Reducing computation to the transistor. All of our CPUs are built from transistors.
  2. Reducing rendering of images to rendering a triangle (or other simple polygons). Computers can now render near-realistic scenes in real time. The big breakthrough came from learning how to render many triangles quickly.

This approach to problem solving extends well beyond computer science. Many fields of science focus on theories making predictions about very simple systems. These predictions are then composed to make predictions about where spacecraft go, how large a cannonball needs to be, etc. Obviously this approach has been quite successful.

It is an open question whether or not this approach can really succeed at learning.

  1. Against: We know that successful learning requires the incorporation of prior knowledge in fairly arbitrary forms. This suggests that we cannot easily decompose the process of learning.
  2. For: We know that humans can succeed at general purpose learning. It may be that arbitrary prior knowledge is required to solve arbitrary learning problems, but perhaps there are specific learning algorithms incorporating specific prior knowledge capable of solving the specific problems we encounter.
  3. Neutral: We know that learning reductions sometimes work. We don’t yet have a good comparison of how well they work with other approaches.


Problem: Reductions and Relative Ranking Metrics

Tags: Problems,Reductions jl@ 11:34 am

This, again, is something of a research direction rather than a single problem.

There are several metrics people care about which depend upon the relative ranking of examples and there are sometimes good reasons to care about such metrics. Examples include AROC, “F1”, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class (Google’s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these.
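
Two of these metrics are easy to state in code. The sketch below uses the pairwise definition of AROC (the fraction of positive/negative pairs that the scores order correctly, with ties counted as half) and a top-k proportion; the scores and labels are made up for illustration.

```python
def aroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

def top_k_proportion(scores, labels, k):
    """Proportion of the k highest-scored examples that are in class 1."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 1, 0, 0]
print(aroc(scores, labels), top_k_proportion(scores, labels, k=3))
```

Both quantities depend only on the order induced by the scores, not on their values, which is what separates them from per-example losses such as classification error.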

Problem: What does the ability to classify well imply about performance under these metrics?

Past Work

  1. Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC.
  2. Sample complexity bounds for AROC (and here).
  3. A paper on “Learning to Order Things”.

Difficulty: Several of these may be easy. Some of them may be hard.

Impact: Positive or negative results will broaden our understanding of the relationship between different learning goals. It might also yield new algorithms (via the reduction) for solving these problems.


Solution: Reinforcement Learning with Classification

I realized that the tools needed to solve the problem just posted were just created. I tried to sketch out the solution here (also in .lyx and .tex). It is still quite sketchy (and probably only the few people who understand reductions well can follow).

One of the reasons why I started this weblog was to experiment with “research in the open”, and this is an opportunity to do so. Over the next few days, I’ll be filling in details and trying to get things to make sense. If you have additions or ideas, please propose them.


Problem: Reinforcement Learning with Classification

Tags: Problems,Reductions,Reinforcement jl@ 12:49 pm

At an intuitive level, the question here is “Can reinforcement learning be solved with classification?”

Problem: Construct a reinforcement learning algorithm with near-optimal expected sum of rewards in the direct experience model given access to a classifier learning algorithm which has a small error rate or regret on all posed classification problems. The definition of “posed” here is slightly murky. I consider a problem “posed” if there is an algorithm for constructing labeled classification examples.

Past Work

  1. There exists a reduction of reinforcement learning to classification given a generative model. A generative model is an inherently stronger assumption than the direct experience model.
  2. Other work on learning reductions may be important.
  3. Several algorithms for solving reinforcement learning in the direct experience model exist. Most, such as E3, Factored-E3, metric-E3, and Rmax, require that the observation be the state. Recent work extends this approach to POMDPs.
  4. This problem is related to predictive state representations, because we are trying to solve reinforcement learning with prediction ability.

Difficulty: It is not clear whether this problem is solvable or not. A proof that it is not solvable would be extremely interesting, and even partial success one way or another could be important.

Impact: At the theoretical level, it would be very nice to know if the ability to generalize implies the ability to solve reinforcement learning because (in a general sense) all problems can be cast as reinforcement learning. Even if the solution is exponential in the horizon time, it can only motivate relaxations of the core algorithm like RLgen.


The State of the Reduction

Tags: Reductions jl@ 8:58 pm

What? Reductions are machines which turn solvers for one problem into solvers for another problem.
Why? Reductions are useful for several reasons.

  1. Laziness. Reducing a problem to classification makes at least 10 learning algorithms available to solve a problem. Inventing 10 learning algorithms is quite a bit of work. Similarly, programming a reduction is often trivial, while programming a learning algorithm is a great deal of work.
  2. Crystallization. The problems we often want to solve in learning are worst-case-impossible, but average case feasible. By reducing all problems onto one or a few primitives, we can fine tune these primitives to perform well on real-world problems with greater precision due to the greater number of problems to validate on.
  3. Theoretical Organization. By studying what reductions are easy vs. hard vs. impossible, we can learn which problems are roughly equivalent in difficulty and which are much harder.

What we know now.

Typesafe reductions. In the beginning, there was the observation that every complex object on a computer can be written as a sequence of bits. This observation leads to the notion that a classifier (which predicts a single bit) can be used to predict any complex object. Using this observation, we can make the following statements:

  1. Any prediction problem which can be broken into examples can be solved with a classifier.
  2. In particular, reinforcement learning can be decomposed into examples given a generative model (see Lagoudakis & Parr and Fern, Yoon, & Givan).

This observation also often doesn’t work well in practice, because the classifiers are sometimes wrong, so at least one of many classifiers is often wrong.
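
The bit-by-bit idea can be written as a function that turns a binary learner into a learner for larger label spaces. The names `bitwise_reduction` and `memorizing_learner` are hypothetical, and the final calculation assumes that the bit classifiers err independently, which is a simplification made only to show the scale of the problem.

```python
def bitwise_reduction(learn_binary, num_bits):
    """Turn a binary learner into a learner for integer labels in [0, 2**num_bits)."""
    def learn(examples):
        # One binary problem per output bit.
        bit_classifiers = [
            learn_binary([(x, (y >> b) & 1) for x, y in examples])
            for b in range(num_bits)
        ]
        def predict(x):
            # Reassemble the predicted label from the predicted bits.
            return sum(clf(x) << b for b, clf in enumerate(bit_classifiers))
        return predict
    return learn

# Hypothetical base learner: memorize the training set, default to 0 on unseen inputs.
def memorizing_learner(binary_examples):
    table = dict(binary_examples)
    return lambda x: table.get(x, 0)

learn_multiclass = bitwise_reduction(memorizing_learner, num_bits=3)
predict = learn_multiclass([("a", 5), ("b", 2), ("c", 7)])
print(predict("a"), predict("b"), predict("c"))

# Assuming independent errors at rate e per bit, the chance that at least one
# of k bits is wrong is 1 - (1 - e)**k, which grows quickly with k.
e, k = 0.05, 32
prob_any_wrong = 1 - (1 - e) ** k
print(prob_any_wrong)
```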

Error Transform Reductions. Worrying about errors leads to the notion of robust reductions (= ways of using simple predictors such as classifiers to make complex predictions). Error correcting output codes were proposed in analogy to coding theory. These were analyzed in terms of error rates on training sets and general losses on training sets. The robustness can be (more generally) analyzed with respect to arbitrary test distributions, and algorithms optimized with respect to this notion are often very simple and yield good performance. Solving created classification problems up to error rate e implies:

  1. Solving importance weighted classifications up to error rate eN where N is the expected importance. Costing
  2. Solving multiclass classification up to error rate 4e using ECOC. Error limiting reductions paper
  3. Solving Cost sensitive classification up to loss 2eZ where Z is the sum of costs. Weighted All Pairs algorithm
  4. Finding a policy within expected reward (T+1)e/2 of the optimal policy for T step reinforcement learning with a generative model. RLgen paper
  5. The same statement holds much more efficiently when the distribution of states of a near optimal policy is also known. PSDP paper
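
One common way to hand an importance weighted problem to an ordinary classifier is rejection sampling in proportion to importance. The sketch below assumes that approach; the function name `costing_sample`, the data, and the 10-to-1 importance ratio are illustrative, and details of the published algorithm (such as combining several resampled sets) are left out.

```python
import random

def costing_sample(weighted_examples, rng):
    """Rejection-sample an unweighted training set from (x, y, importance) triples."""
    # Normalizer: any upper bound on the importances works; the maximum keeps the most data.
    z = max(c for _, _, c in weighted_examples)
    return [(x, y) for x, y, c in weighted_examples if rng.random() < c / z]

rng = random.Random(0)
# Hypothetical data: examples of class 1 are ten times as important as those of class 0.
data = [(i, i % 2, 10.0 if i % 2 else 1.0) for i in range(10000)]
sample = costing_sample(data, rng)
kept_ones = sum(y for _, y in sample)
kept_zeros = len(sample) - kept_ones
print(kept_ones, kept_zeros)
```

Any classifier learner can then be trained on `sample` without knowing about importances, since the resampled set already reflects them.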

A new problem arises: sometimes the subproblems created are inherently hard, for example when estimating class probability from a classifier. In this situation saying “good performance implies good performance” is vacuous.

Regret Transform Reductions. To cope with this, we can analyze how good performance minus the best possible performance (called “regret”) is transformed under reduction. Solving created binary classification problems to regret r implies:

  1. Solving importance weighted regret up to $rN$ using the same algorithm as for errors. Costing
  2. Solving class membership probability up to $\ell_2$ regret $2r$. Probing paper
  3. Solving multiclass classification to regret $4r^{0.5}$. SECOC paper
  4. Predicting costs in cost sensitive classification up to $\ell_2$ regret $4r$. SECOC again
  5. Solving cost sensitive classification up to regret $4(rZ)^{0.5}$ where $Z$ is the sum of the costs of each choice. SECOC again

There are several reduction-related problems currently being worked on which I’ll discuss in the future.
