Progress in Machine Translation

I just visited ISI, where Daniel Marcu and others are working on machine translation. Apparently, machine translation is rapidly improving. A particularly dramatic year was 2002 to 2003, when systems switched from word-based translation to phrase-based translation. From a (now famous) slide by Charles Wayne at DARPA (which funds much of the work on machine translation), here is some anecdotal evidence:

2002 system output:
insistent Wednesday may recurred her trips to Libya tomorrow for flying.

Cairo 6-4 ( AFP ) – An official announced today in the Egyptian lines company for flying Tuesday is a company “insistent for flying” may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.

And said the official “the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air, a situation her recieving replying are so a trip will pull to Libya a morning Wednesday.”

2003 system output:

Egyptair has tomorrow to Resume Its flight to Libya.

Cairo 4-6 (AFP) – said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flight to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

“The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the firt take off a trip to Libya on Wednesday morning”.

The machine translation systems are becoming effective at the “produces mostly understandable although broken output” level. Two obvious applications arise:

  1. Web browsing. A service might deliver translations of web pages into your native language. babelfish is a first attempt. When properly integrated into the web browser, it will appear as if every webpage uses your native language (although maybe in a broken-but-understandable way).
  2. Instant messaging. An instant messaging service might deliver translations into whichever language you specify, allowing communication with more people.

At this point, the feasibility of these applications is a matter of engineering and “who pays for it” coordination rather than technology development. There remain significant research challenges in tackling unstudied language pairs and in improving the existing technology. We could imagine a point in the near future (10 years?) where the machine translation version of a Turing test is passed: humans cannot distinguish between a machine-translated sentence and a human-translated sentence. A key observation here is that machine translation does not require full machine understanding of natural language.

The source of machine translation success seems to be a combination of better models (switching to phrase-based translation made a huge leap), application of machine learning technology, and big increases in the quantity of data available.
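
To make the word-based vs. phrase-based distinction concrete, here is a toy sketch in Python. The tiny translation tables and the greedy longest-match decoding are invented purely for illustration; real phrase-based systems (including those at ISI) learn phrase tables from data and search over segmentations, reorderings, and a language model.

```python
# Toy contrast between word-based and phrase-based translation.
# The tables below are invented for illustration only.

word_table = {"pomme": "apple", "de": "of", "terre": "earth"}
phrase_table = {("pomme", "de", "terre"): "potato"}

def translate_word_based(tokens):
    # Translate each word independently; multi-word context is ignored.
    return " ".join(word_table.get(t, t) for t in tokens)

def translate_phrase_based(tokens):
    # Greedy longest-match over known phrases; fall back to single words.
    out, i = [], 0
    while i < len(tokens):
        for length in range(len(tokens) - i, 0, -1):
            chunk = tuple(tokens[i:i + length])
            if chunk in phrase_table:
                out.append(phrase_table[chunk])
                i += length
                break
        else:
            out.append(word_table.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out)

print(translate_word_based("pomme de terre".split()))    # "apple of earth"
print(translate_phrase_based("pomme de terre".split()))  # "potato"
```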

Bounds greater than 1

Nati Srebro and Shai Ben-David have a paper at COLT which, in the appendix, proves something very striking: several previous error bounds are always greater than 1.

Background One branch of learning theory focuses on theorems which

  1. Assume samples are drawn IID from an unknown distribution D.
  2. Fix a set of classifiers.
  3. Find a high probability bound on the maximum true error rate (with respect to D) as a function of the empirical error rate on the training set.

Many of these bounds become extremely complex and hairy.
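
For contrast, the simplest bound of this type, for a finite set H of classifiers, has a clean closed form. (This is the standard Occam/Hoeffding bound, stated here only for orientation; it is not one of the bounds criticized in the paper.) With probability at least 1 − δ over the draw of m IID samples from D,

```latex
\forall h \in H: \quad
e_D(h) \;\le\; \hat{e}(h) \;+\; \sqrt{\frac{\ln |H| + \ln(1/\delta)}{2m}}
```

where e_D(h) is the true error rate with respect to D and ê(h) is the empirical error rate on the training set. The complex and hairy bounds arise when ln|H| is replaced by data-dependent complexity terms for infinite classifier sets, and it is in those elaborations that the interrelated parameters discussed below creep in.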

Current Everyone working on this subject wants “tighter bounds”; however, there are different definitions of “tighter”. Some groups focus on “functional tightness” (getting the right functional dependency between the size of the training set and a parameterization of the hypothesis space), while others focus on “practical tightness” (finding bounds which work well on practical problems). (I am definitely in the second camp.)

One of the dangers of striving for “functional tightness” is that the bound can depend on strangely interrelated parameters. In fact, these interrelations can apparently become so complex that the resulting bounds end up always larger than 1 (some bounds here and here).

It seems we should ask the question: “Why are we doing the math?” If the math is done just to get a paper accepted under review, that is unsatisfying. The real value of the math comes when it guides us in designing learning algorithms. Math that yields bounds greater than 1 is a dangerously weak motivation for learning algorithm design.

There is a significant danger in taking this “oops” too strongly.

  1. There exist some reasonable arguments (not made here) for aiming at functional tightness.
  2. The value of the research a person does is more related to the best they have done than the worst.

What is state?

In reinforcement learning (and sometimes other settings), there is a notion of “state”. Based upon the state, various predictions are made, such as “Which action should be taken next?” or “How much cumulative reward do I expect if I take some action from this state?” Given the importance of state, it is worth examining what it means. There are actually several distinct options, and the variation in definition turns out to be important in motivating different pieces of work.

  1. Newtonian State. State is the physical pose of the world. Under this definition, there are very many states, often too many for explicit representation. This is also the definition typically used in games.
  2. Abstracted State. State is an abstracted physical state of the world. “Is the door open or closed?” “Are you in room A or not?” The number of states is much smaller here. A basic issue here is: “How do you compute the state from observations?”
  3. Mathematical State. State is a sufficient statistic of observations for making necessary predictions.
  4. Internal State. State is the internal belief/understanding/etc… which changes an agent’s actions in different circumstances. A natural question is: “How do you learn a state?” This is like the mathematical version of state, except that portions of the statistic which cannot be learned are irrelevant.
  5. There are no states. There are only observations (one of which might be a reward) and actions. This is more reasonable than it might sound because state is a derived quantity and the necessity of that derivation is unclear. PSRs are an example.

The different questions these notions of state motivate can have large practical consequences for the algorithms used. It is not yet clear what the “right” notion of state is; we just don’t know what works well. I am most interested in “internal state” at the moment.
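
As a minimal sketch of the “internal state” view (the particular update rule, dimensions, and class name below are illustrative assumptions, not a specific published algorithm): the state is simply whatever summary the agent carries forward from past actions and observations, and predictions are made from that summary alone.

```python
import numpy as np

# Sketch of an agent whose "state" is an internal summary vector, updated from
# the previous action and observation; the weights here are placeholders that
# would themselves be learned from experience.

class InternalStateAgent:
    def __init__(self, obs_dim, n_actions, state_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = state_dim + obs_dim + n_actions
        self.W = rng.normal(scale=0.1, size=(state_dim, in_dim))     # state update weights
        self.V = rng.normal(scale=0.1, size=(n_actions, state_dim))  # prediction weights
        self.state = np.zeros(state_dim)
        self.n_actions = n_actions

    def update(self, action, observation):
        # New internal state = squashed linear function of (old state, observation, action).
        one_hot = np.zeros(self.n_actions)
        one_hot[action] = 1.0
        x = np.concatenate([self.state, observation, one_hot])
        self.state = np.tanh(self.W @ x)

    def predict_values(self):
        # Predictions (e.g. expected cumulative reward per action) come from the
        # internal state alone; anything not useful for prediction is irrelevant.
        return self.V @ self.state
```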

Mad (Neuro)science

One of the questions facing machine learning as a field is “Can we produce a generalized learning system that can solve a wide array of standard learning problems?” The answer is trivial: “yes, just have children”.

Of course, that wasn’t really the question. The refined question is “Are there simple-to-implement generalized learning systems that can solve a wide array of standard learning problems?” The answer to this is less clear. The ability of animals (and people) to learn might be due to megabytes encoded in the DNA. If this algorithmic complexity is necessary to solve machine learning, the field faces a daunting task in replicating it on a computer.

This observation suggests a possibility: if you can show that few bits of DNA are needed for learning in animals, then this provides evidence that machine learning (as a field) has a hope of big success with relatively little effort.

It is well known that specific portions of the brain have specific functionality across individuals. There are two ways this observation can be explained:

  1. Maybe the specific functionality areas are encoded in the DNA.
  2. Maybe the specific functionality areas arise from the learning process of the brain. This is the answer that machine learning would like to hear because it agrees with the hypothesis that a simple general learning system exists.

It’s important to realize that these choices actually specify a spectrum rather than a dichotomy. There are surely some problem-specific learning hacks in the brain and there is surely some generalized learning ability. The question is: to what degree is learning encoded by genetic heritage vs personal experience?

It is anecdotally well known that people (especially children) can recover from fairly severe brain damage, but of course we would prefer to avoid anecdotal evidence.

There are also neuroscience experiments addressing this question. This paper by Jitendra Sharma, Alessandra Angelucci, and Mriganka Sur provides some evidence. In a nutshell, they rewire the optic nerve of ferrets into the auditory region of the brain. They observe that structures similar to the visual specific region of the brain arise in the auditory region after rewiring (although the new regions may be less capable).

There are doubtless many other experiments addressing this question, but my knowledge of Neuroscience is lacking. (Thanks to Maneesh for pointing this one out.)

Gradients everywhere

One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talks, and several others may have been approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance.
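
As a minimal sketch of that decoupling (squared loss on a linear predictor, chosen only because it keeps the example short): the objective is global, a sum over all examples, yet each parameter is updated using nothing but its own component of the gradient.

```python
import numpy as np

# Global objective: mean squared error of a linear predictor over a dataset.
# The gradient turns this single global number into one local update per parameter.

def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

def gradient(w, X, y):
    # Vector of partial derivatives d(loss)/d(w_j), one entry per parameter.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=100)

w = np.zeros(5)
for _ in range(200):
    w = w - 0.1 * gradient(w, X, y)   # each coordinate moves along its own partial derivative

print(loss(w, X, y))   # close to the noise floor; no coordination beyond the gradient was needed
```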

It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down.

  1. Sometimes there are clever approximations of the gradient. (Simon Osindero)
  2. Sometimes we can compute constrained gradients via iterated gradient/project steps. (Ben Taskar)
  3. Sometimes we can compute gradients anyway over mildly nondifferentiable functions; a generic subgradient example is sketched just after this list. (Drew Bagnell)
  4. Even given a gradient, the choice of update is unclear and might be cleverly chosen. (Nic Schraudolph)
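
Here is one generic instance of point 3, a plain subgradient for the hinge loss; this is an illustration of the general idea, not the specific technique from the talk.

```python
import numpy as np

# The hinge loss max(0, 1 - y * (w . x)) has a kink at margin 1, so it is not
# differentiable everywhere, but a subgradient exists at every point and can be
# used in place of the gradient.

def hinge_subgradient(w, x, y):
    margin = y * np.dot(w, x)
    if margin < 1:
        return -y * x              # ordinary gradient away from the kink
    return np.zeros_like(w)        # at or beyond the kink, 0 is a valid subgradient
```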

Perhaps a more extreme example of this is Adaboost, which repeatedly reuses a classifier learner to implicitly optimize a gradient. Viewed as a gradient optimization algorithm, Adaboost is a sublinear algorithm (in the number of implicit parameters) when applied to decision trees.
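
A minimal sketch of that reading (the `weak_learn` interface is a hypothetical stand-in for any weak learner, such as a depth-limited decision tree): the example weights carry the gradient information of the exponential loss, and each round the weak learner picks which implicit coordinate to move along.

```python
import numpy as np

# Adaboost read as gradient-style optimization: the example weights w act as the
# (functional) gradient of the exponential loss, and each round the weak learner
# selects an implicit coordinate (a classifier) to step along.
# Assumed interface: weak_learn(X, y, w) returns a function h with h(X) in {-1, +1};
# labels y are in {-1, +1}.

def adaboost(X, y, weak_learn, rounds):
    n = len(y)
    w = np.ones(n) / n                            # example weights = gradient information
    ensemble = []
    for _ in range(rounds):
        h = weak_learn(X, y, w)
        pred = h(X)
        err = np.sum(w * (pred != y))             # weighted error of the chosen coordinate
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * np.log((1 - err) / err)     # step size along that coordinate
        ensemble.append((alpha, h))
        w = w * np.exp(-alpha * y * pred)         # exponential-loss reweighting
        w = w / w.sum()
        # Only one new classifier is fit per round, which is why the optimization is
        # sublinear in the number of implicit parameters (all possible trees).
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(alpha * h(X) for alpha, h in ensemble))
```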