SVM Adaptability

Several recent papers have shown that SVM-like optimizations can be used to handle several large families of loss functions.

This is a good thing because it is implausible that the loss function imposed by the world can be ignored in the process of solving a prediction problem. Even people used to the hard-core Bayesian approach to learning often note that some approximations are almost inevitable in specifying a prior and/or integrating to achieve a posterior. Taking into account how the system will be evaluated can allow both computational effort and design effort to be focused so as to improve performance.

A current laundry list of capabilities includes:

  1. 2002 multiclass SVM including arbitrary cost matrices
  2. ICML 2003 Hidden Markov Models
  3. NIPS 2003 Markov Networks (see some discussion)
  4. EMNLP 2004 Context free grammars
  5. ICML 2004 Any loss (with much computation)
  6. ICML 2005 Any constrained linear prediction model (that’s my own name).
  7. ICML 2005 Any loss dependent on a contingency table
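
To make the first item a bit more concrete, here is a minimal sketch of one way a cost matrix can enter an SVM-style loss: margin rescaling, where the required margin against each wrong label grows with the cost of predicting it. This is an illustration under my own assumptions, not the formulation of any particular paper in the list; the function names, the linear per-class scorer, and the toy numbers are all hypothetical.

```python
def scores(weights, x):
    """Linear score for each class: one weight vector per class (an assumed model)."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]


def predict(weights, x):
    """Predict the class with the highest score."""
    s = scores(weights, x)
    return max(range(len(s)), key=lambda k: s[k])


def cost_sensitive_hinge(weights, x, y, cost):
    """Margin-rescaled hinge loss: max over labels k of cost[y][k] + s[k] - s[y].

    With cost[y][y] == 0 this is nonnegative, and it upper-bounds the cost of
    the predicted label, which is what makes it a usable convex surrogate.
    """
    s = scores(weights, x)
    return max(cost[y][k] + s[k] - s[y] for k in range(len(s)))


# Toy example: 3 classes, 2 features, confusing class 0 with class 2 is expensive.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
cost = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]
x = [2.0, 1.0]
loss_if_true_label_is_1 = cost_sensitive_hinge(weights, x, 1, cost)
```

The loss is convex in the weights, so it can be minimized (with a regularizer) by the usual subgradient or quadratic programming machinery. Whether that is the most tractable direct optimization is a separate question.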

I am personally interested in how this relates to the learning reductions work, which has similar goals but works at a different abstraction level (the learning problem rather than the algorithmic mechanism). The difference in abstraction implies that anything solvable by reduction should be solvable by a direct algorithmic mechanism. However, comparing and contrasting the results I know of, it seems that what is solvable via reduction to classification versus what is solvable via direct SVM-like methods is currently incomparable.

  1. Can SVMs be tuned to directly solve (example dependent) cost sensitive classification? Obviously, they can be tuned indirectly via reduction, but it is easy to imagine more tractable direct optimizations.
  2. How efficiently can learning reductions be used to solve structured prediction problems? Structured prediction problems are instances of cost sensitive classification, but the regret transform efficiency which occurs when this embedding is done is too weak to be of interest.
  3. Are there any problems efficiently solvable by SVM-like algorithms which are not efficiently solvable via learning reductions?

Interesting papers at ACL

A recent discussion indicated that one goal of this blog might be to allow people to post comments about recent papers that they liked. I think this could potentially be very useful, especially for those with diverse interests but only finite time to read through conference proceedings. ACL 2005 recently completed, and here are four papers from that conference that I thought were either good or perhaps of interest to a machine learning audience.

David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation. (Best paper award.) This paper takes the standard phrase-based MT model that is popular in our field (basically, translate a sentence by individually translating phrases and reordering them according to a complicated statistical model) and extends it to take into account hierarchy in phrases, so that you can learn things like “X ‘s Y” -> “Y de X” in Chinese, where X and Y are arbitrary phrases. This takes a step toward linguistic syntax for MT, which our group is working strongly on, but doesn’t require any linguists to sit down and write out grammars or parse sentences.

Rie Kubota Ando and Tong Zhang, A High-Performance Semi-Supervised Learning Method for Text Chunking. This is more of a machine learning style paper, where they improve a sequence labeling task by augmenting it with models from related tasks for which data is free. For example, I might train a model that, given a context with a missing word, will predict the word (e.g., “The ____ gave a speech” might want you to insert “president”). By doing so, you can use these other models to give additional useful information to your main task.

Noah A. Smith and Jason Eisner, Contrastive Estimation: Training Log-Linear Models on Unlabeled Data. This paper talks about training sequence labeling models in an unsupervised fashion, basically by contrasting what the model does on the correct string with what the model does on a corrupted version of the string. They get significantly better results than just by using EM in an HMM, and the idea is pretty nice.

Patrick Pantel, Inducing Ontological Co-occurrence Vectors. This is a pretty neat idea (though I’m biased — Patrick is a friend) where one attempts to come up with feature vectors that describe nodes in a semantic hierarchy (ontology) that could enable you to figure out where to insert new words that are not in your ontology. The results are pretty good, and the method is fairly simple; I’d imagine that a more complex model/learning framework could improve the model even further.

Text Entailment at AAAI

Rajat Raina presented a paper on the technique they used for the PASCAL Recognizing Textual Entailment challenge.

“Text entailment” is the problem of deciding if one sentence implies another. For example the previous sentence entails:

  1. Text entailment is a decision problem.
  2. One sentence can imply another.

The challenge was of the form: given an original sentence and another sentence, predict whether there was an entailment. All current techniques for predicting correctness of an entailment are at the “flail” stage: accuracies are around 58% where humans could achieve near 100% accuracy, so there is much room to improve. Apparently, there may be another PASCAL challenge on this problem in the near future.

Not EM for clustering at COLT

One standard approach for clustering data with a set of Gaussians is using EM. Roughly speaking, you pick a set of k random Gaussians and then use alternating expectation maximization to (hopefully) find a set of Gaussians that “explain” the data well. This process is difficult to work with because EM can become “stuck” in local optima. There are various hacks like “rerun with t different random starting points”.
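
For readers who have not seen it written down, here is a rough sketch of the “rerun with t different random starting points” hack for a one dimensional mixture of Gaussians, keeping whichever run ends with the highest log-likelihood. Everything here (the function names, the 1D restriction, the fixed iteration count, the variance floor) is an assumption made to keep the example short, not a recommended implementation.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_density(x, weights, means, variances):
    # The tiny constant guards against log(0) and division by zero.
    parts = (w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return sum(parts) + 1e-300


def em_once(data, k, rng, iters=50, min_var=1e-3):
    """One EM run from a random start. Returns (log_likelihood, means, variances, weights)."""
    n = len(data)
    center = sum(data) / n
    spread = max(min_var, sum((x - center) ** 2 for x in data) / n)
    means = rng.sample(data, k)       # k random data points as initial means
    variances = [spread] * k          # start every component as wide as the data
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            total = mixture_density(x, weights, means, variances)
            resp.append([w * gaussian_pdf(x, m, v) / total
                         for w, m, v in zip(weights, means, variances)])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-300
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, var_j)  # floor avoids collapsing components
            weights[j] = nj / n
    log_lik = sum(math.log(mixture_density(x, weights, means, variances)) for x in data)
    return log_lik, means, variances, weights


def em_with_restarts(data, k, restarts, seed=0):
    """Run EM `restarts` times from different random starts and keep the best run."""
    rng = random.Random(seed)
    runs = [em_once(data, k, rng) for _ in range(restarts)]
    return max(runs, key=lambda run: run[0])
```

Note that restarts only raise the odds of landing in a good basin; nothing here certifies that the best of t runs is a global optimum.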

One cool observation is that this can often be solved via other algorithms which do not suffer from local optima. This is an early paper which shows this. Ravi Kannan presented a new paper showing this is possible in a much more adaptive setting.

A very rough summary of these papers is that by projecting into a lower dimensional space, it is computationally tractable to pick out the gross structure of the data. It is unclear how well these algorithms work in practice, but they might be effective, especially if used as a subroutine of the form:

  1. Project to low dimensional space.
  2. Pick out gross structure.
  3. Project gross structure into the high dimensional space.
  4. Run EM (or some other local improvement algorithm) to find a final fit.

The effect of steps 1-3 is to “seed” the local optimization algorithm in a good place from which a global optimum is plausibly reachable.
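
The four steps might look something like the following sketch. To be clear about the assumptions: I use a random Gaussian projection for step 1, a farthest-point traversal for step 2, and Lloyd iterations (a hard-assignment local improvement) in place of full EM for step 4. These are simple stand-ins of my own choosing, not the algorithms from the papers, and all function names are hypothetical. The sketch also assumes at least k distinct points.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def nearest(point, centers):
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def random_projection(points, low_dim, rng):
    """Step 1: project every point onto low_dim random Gaussian directions."""
    high_dim = len(points[0])
    directions = [[rng.gauss(0, 1) for _ in range(high_dim)] for _ in range(low_dim)]
    return [[sum(d_i * p_i for d_i, p_i in zip(d, p)) for d in directions]
            for p in points]


def gross_structure(low_points, k):
    """Step 2: pick k far-apart points in the low dimensional space, label by nearest."""
    centers = [low_points[0]]
    while len(centers) < k:
        centers.append(max(low_points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    return [nearest(p, centers) for p in low_points]


def lift(points, labels, k):
    """Step 3: average the original high dimensional points within each group."""
    return [mean([p for p, lab in zip(points, labels) if lab == j]) for j in range(k)]


def local_improve(points, centers, iters=20):
    """Step 4: Lloyd iterations started from the seed (EM could be used instead)."""
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        # Keep the old center if a group happens to be empty.
        centers = [mean([p for p, lab in zip(points, labels) if lab == j] or [centers[j]])
                   for j in range(len(centers))]
    return centers


def seeded_fit(points, k, low_dim, seed=0):
    rng = random.Random(seed)
    low_points = random_projection(points, low_dim, rng)   # step 1
    labels = gross_structure(low_points, k)                # step 2
    seed_centers = lift(points, labels, k)                 # step 3
    return local_improve(points, seed_centers)             # step 4
```

The expensive part (the final local improvement) still runs in the original space, but it starts from centers chosen where distances are cheap to compute and the gross structure is hopefully still visible.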