Progress in Machine Translation

I just visited ISI where Daniel Marcu and others are working on machine translation. Apparently, machine translation is rapidly improving. A particularly dramatic year was 2002->2003 when systems switched from word-based translation to phrase-based translation. From a (now famous) slide by Charles Wayne at DARPA (which funds much of the work on machine translation) here is some anecdotal evidence:

2002:

insistent Wednesday may recurred her trips to Libya tomorrow for flying.

Cairo 6-4 ( AFP ) – An official announced today in the Egyptian lines company for flying Tuesday is a company “insistent for flying” may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.

And said the official “the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air, a situation her recieving replying are so a trip will pull to Libya a morning Wednesday.”

2003:

Egyptair has tomorrow to Resume Its flight to Libya.

Cairo 4-6 (AFP) – said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flight to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

“The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the firt take off a trip to Libya on Wednesday morning”.

Machine translation systems are becoming effective at the “produces mostly understandable although broken output” level. Two obvious applications arise:

  1. Web browsing. A service might deliver translations of web pages into your native language. Babelfish is a first attempt. When properly integrated into the web browser, it will appear as if every webpage uses your native language (although maybe in a broken-but-understandable way).
  2. Instant messaging. An instant message service might deliver translations into whichever language you specify, allowing communication with more people.

At this point, the feasibility of these applications is a matter of engineering and “who pays for it” coordination rather than technology development. There remain significant research challenges in tackling unstudied language pairs and in improving the existing technology. We could imagine a point in the near future (10 years?) where the machine translation version of a Turing test is passed: humans cannot distinguish between a machine-translated sentence and a human-translated sentence. A key observation here is that machine translation does not require full machine understanding of natural language.

The source of machine translation success seems to be a combination of better models (switching to phrase-based translation made a huge leap), application of machine learning technology, and big increases in the quantity of data available.

Interesting papers at ACL

A recent discussion indicated that one goal of this blog might be to allow people to post comments about recent papers that they liked. I think this could potentially be very useful, especially for those with diverse interests but only finite time to read through conference proceedings. ACL 2005 recently concluded, and here are four papers from that conference that I thought were either good or perhaps of interest to a machine learning audience.

David Chiang, A Hierarchical Phrase-Based Model for Statistical Machine Translation. (Best paper award.) This paper takes the standard phrase-based MT model that is popular in our field (basically, translate a sentence by individually translating phrases and reordering them according to a complicated statistical model) and extends it to take into account hierarchy in phrases, so that you can learn things like “X 's Y” -> “Y de X” in Chinese, where X and Y are arbitrary phrases. This takes a step toward linguistic syntax for MT, which our group is working strongly on, but doesn’t require any linguists to sit down and write out grammars or parse sentences.
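To make the idea of a rule with gaps concrete, here is a minimal sketch of applying one hierarchical rule to a token sequence. It is only a toy: the function names `match` and `apply_rule` are my own, the rule is the one quoted above, and the gap fillers are "translated" by a placeholder that just wraps them in brackets. The paper's actual model learns weighted rules from parallel text and searches over many derivations, none of which is shown here.

```python
def match(pattern, tokens, gaps=("X", "Y")):
    """Bind each gap symbol in `pattern` to a non-empty span of `tokens`.

    Returns a dict such as {"X": [...], "Y": [...]}, or None if the pattern
    does not cover the whole token sequence. Assumes each gap occurs once.
    """
    if not pattern:
        return {} if not tokens else None
    head, rest = pattern[0], pattern[1:]
    if head in gaps:
        # Try every non-empty prefix as the filler for this gap.
        for i in range(1, len(tokens) + 1):
            sub = match(rest, tokens[i:], gaps)
            if sub is not None:
                return {head: tokens[:i], **sub}
        return None
    if tokens and tokens[0] == head:
        return match(rest, tokens[1:], gaps)
    return None


def apply_rule(source_side, target_side, tokens, translate_gap):
    """Rewrite `tokens` with one synchronous rule, or return None if it does not apply."""
    binding = match(source_side, tokens)
    if binding is None:
        return None
    output = []
    for symbol in target_side:
        if symbol in binding:
            # A real system would translate the filler recursively.
            output.extend(translate_gap(binding[symbol]))
        else:
            output.append(symbol)
    return output


def bracket(words):
    """Placeholder 'translation' (an assumption): mark the span instead of translating it."""
    return ["[" + " ".join(words) + "]"]


result = apply_rule(["X", "'s", "Y"], ["Y", "de", "X"],
                    "the prime minister 's visit".split(), bracket)
print(" ".join(result))  # [visit] de [the prime minister]
```

The point of the sketch is only that X and Y can be whole phrases, so a single rule captures a reordering that a flat phrase table would have to memorize case by case.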

Rie Kubota Ando and Tong Zhang, A High-Performance Semi-Supervised Learning Method for Text Chunking. This is more of a machine learning style paper, where they improve a sequence labeling task by augmenting it with models from related tasks for which data is free. For example, I might train a model that, given a context with a missing word, will predict the word (e.g., for “The ____ gave a speech” it might want you to insert “president”). By doing so, you can use these other models to give additional useful information to your main task.
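The "data is free" part is easy to see in code. The sketch below only shows how unlabeled text turns into labeled examples for a missing-word task; the function name `make_auxiliary_examples` and the choice of target words are my own assumptions, and the step where the paper combines the resulting predictors with the main task is not shown.

```python
def make_auxiliary_examples(sentences, target_words, blank="____"):
    """Turn raw sentences into (context with a blank, missing word) pairs.

    No human annotation is needed: the label is simply the word we hid.
    """
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token in target_words:
                context = tokens[:i] + [blank] + tokens[i + 1:]
                examples.append((" ".join(context), token))
    return examples


unlabeled = [
    "The president gave a speech",
    "The senator gave a speech about taxes",
]
for context, word in make_auxiliary_examples(unlabeled, {"president", "senator"}):
    print(context, "->", word)
```

Any classifier can then be trained on these pairs, and what it learns about contexts is what gets passed along to the chunker.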

Noah A. Smith and Jason Eisner, Contrastive Estimation: Training Log-Linear Models on Unlabeled Data. This paper talks about training sequence labeling models in an unsupervised fashion, basically by contrasting what the model does on the correct string with what the model does on a corrupted version of the string. They get significantly better results than just by using EM in an HMM, and the idea is pretty nice.
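Here is a rough sketch of the contrast being described, under simplifying assumptions of my own: the "corrupted versions" are all strings obtained by swapping one adjacent pair of words, and the model scores the observed string directly instead of summing over hidden label sequences as the paper's models do. The names `transpose_neighborhood` and `contrastive_log_likelihood` and the toy bigram weights are illustrative only.

```python
import math


def transpose_neighborhood(tokens):
    """The observed sentence plus every variant with one adjacent pair swapped."""
    neighborhood = [tuple(tokens)]
    for i in range(len(tokens) - 1):
        swapped = list(tokens)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        neighborhood.append(tuple(swapped))
    return neighborhood


def contrastive_log_likelihood(tokens, score):
    """Log-probability of the observed string relative to its neighborhood.

    `score` maps a tuple of tokens to an unnormalized log-linear score.
    Normalizing over a small neighborhood replaces the sum over all strings.
    """
    neighborhood = transpose_neighborhood(tokens)
    log_z = math.log(sum(math.exp(score(n)) for n in neighborhood))
    return score(tuple(tokens)) - log_z


# Toy log-linear model (an assumption): one weight per bigram feature.
weights = {("the", "dog"): 1.0}


def bigram_score(tokens):
    return sum(weights.get(pair, 0.0) for pair in zip(tokens, tokens[1:]))


print(contrastive_log_likelihood(["the", "dog", "barks"], bigram_score))
```

Training would adjust the weights to raise this quantity over a corpus, which pushes probability mass from the corrupted strings toward the ones actually observed.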

Patrick Pantel, Inducing Ontological Co-occurrence Vectors. This is a pretty neat idea (though I’m biased — Patrick is a friend) where one attempts to come up with feature vectors that describe nodes in a semantic hierarchy (ontology) that could enable you to figure out where to insert new words that are not in your ontology. The results are pretty good, and the method is fairly simple; I’d imagine that a more complex model/learning framework could improve the model even further.

Text Entailment at AAAI

Rajat Raina presented a paper on the technique they used for the PASCAL Recognizing Textual Entailment challenge.

“Text entailment” is the problem of deciding if one sentence implies another. For example, the previous sentence entails:

  1. Text entailment is a decision problem.
  2. One sentence can imply another.

The challenge was of the form: given an original sentence and another sentence, predict whether there was an entailment. All current techniques for predicting correctness of an entailment are at the “flail” stage: accuracies are around 58%, where humans could achieve near 100% accuracy, so there is much room to improve. Apparently, there may be another PASCAL challenge on this problem in the near future.

Maximum Margin Mismatch?

John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. I’m not alone: the paper won the best student paper award at NIPS in 2003.

There are some things I find odd about the paper. For instance, it says that probabilistic models

“cannot handle high dimensional feature spaces and lack strong theoretical guarantees.”

I’m aware of no such limitations. Also:

“Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.”

This is quite interesting and contradicts my own experience as well as that of a number of people I greatly respect. I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with?

The elegance of M3N, I think, is unrelated to this probabilistic/margin distinction. M3N provided the first implementation of the margin concept that was computationally efficient for multiple output variables and provided a sample complexity result with a much weaker dependence than previous approaches. Further, the authors carry out some nice experiments that speak well for the practicality of their approach. In particular, M3Ns outperform Conditional Random Fields (CRFs) in terms of per-variable (Hamming) loss. And I think this gets us to the crux of the matter, and ties back to John’s post. CRFs are trained by a MAP approach that is effectively per sequence, while the loss function we care about at run time is per variable.

The mismatch the post title refers to is that, at test time, M3Ns are Viterbi decoded: a per-sequence decoding. Intuitively, Viterbi is an algorithm that only gets paid for its services when it classifies an entire sequence correctly. This seems an odd mismatch, and makes one wonder: how well does a per-variable approach, like the variable marginal likelihood approach of Roweis, Kakade, and Teh mentioned previously, combined with runtime belief propagation, compare with the M3N procedure? Does the mismatch matter, and if so, is there a more appropriate decoding procedure, like BP, for margin-trained methods? And finally, it seems we need to answer John’s question convincingly: if you really care about per-variable probabilities or classifications, isn’t it possible that structuring the output space actually hurts? (It seems clear to me that it can help when you insist on getting the entire sequence right, although perhaps others don’t concur with that.)
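A tiny example makes the per-sequence versus per-variable distinction concrete. The sketch below uses a made-up two-step chain with three states (the numbers are mine, not from any paper) and brute-force enumeration in place of Viterbi and belief propagation, which would compute the same answers efficiently on a chain. It is a probabilistic toy, so it says nothing directly about margin-trained models; it only shows that the two decoders can disagree.

```python
from itertools import product

# Hypothetical chain: initial distribution and transition matrix.
INIT = [0.4, 0.6, 0.0]
TRANS = [
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
LENGTH = 2
STATES = range(3)


def joint(seq):
    """Probability of a full label sequence under the chain."""
    p = INIT[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= TRANS[a][b]
    return p


def map_decode():
    """Per-sequence decoding: the single most probable sequence (what Viterbi finds)."""
    return max(product(STATES, repeat=LENGTH), key=joint)


def marginals():
    """Per-position label marginals (what belief propagation computes on a chain)."""
    table = [[0.0] * len(STATES) for _ in range(LENGTH)]
    for seq in product(STATES, repeat=LENGTH):
        p = joint(seq)
        for position, state in enumerate(seq):
            table[position][state] += p
    return table


def marginal_decode():
    """Per-variable decoding: pick the most probable label at each position."""
    return tuple(max(STATES, key=lambda s: row[s]) for row in marginals())


def expected_hamming_accuracy(seq):
    """Expected fraction of positions that `seq` labels correctly."""
    table = marginals()
    return sum(table[t][s] for t, s in enumerate(seq)) / LENGTH


for name, seq in [("per-sequence", map_decode()), ("per-variable", marginal_decode())]:
    print(name, seq, "joint:", joint(seq),
          "expected per-variable accuracy:", expected_hamming_accuracy(seq))
```

On this toy chain the per-sequence decoder returns (0, 0) and the per-variable decoder returns (1, 0). The per-variable answer has the higher expected per-variable accuracy even though its joint probability is zero, so which decoder is "right" depends entirely on which loss you are paid on.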