John Langford – Page 68 – Machine Learning (Theory)

4/5/20064/5/2006

What is state?

In reinforcement learning (and sometimes other settings), there is a notion of “state”. Based upon the state various predictions are made such as “Which action should be taken next?” or “How much cumulative reward do I expect if I take some action from this state?” Given the importance of state, it is important to examine the meaning. There are actually several distinct options and it turns out the definition variation is very important in motivating different pieces of work.

Newtonian State. State is the physical pose of the world. Under this definition, there are very many states, often too many for explicit representation. This is also the definition typically used in games.
Abstracted State. State is an abstracted physical state of the world. “Is the door open or closed?” “Are you in room A or not?” The number of states is much smaller here. A basic issue here is: “How do you compute the state from observations?”
Mathematical State. State is a sufficient statistic of observations for making necessary predictions.
Internal State. State is the internal belief/understanding/etc… which changes an agent’s actions in different circumstances. A natural question is: “How do you learn a state?” This is like the mathematical version of state, except that portions of the statistic which can not be learned are irrelevant.
There are no states. There are only observations (one of which might be a reward) and actions. This is more reasonable than it might sound because state is a derived quantity and the necessity of that derivation is unclear. PSRs are an example.

The different questions these notions of state motivate can have large practical consequences on the algorithms used. It is not clear yet what the “right” notion of state is—we just don’t know what works well. I am most interested in “internal state” at the moment.

4/2/20064/2/2006

Mad (Neuro)science

One of the questions facing machine learning as a field is “Can we produce a generalized learning system that can solve a wide array of standard learning problems?” The answer is trivial: “yes, just have children”.

Of course, that wasn’t really the question. The refined question is “Are there simple-to-implement generalized learning systems that can solve a wide array of standard learning problems?” The answer to this is less clear. The ability of animals (and people ) to learn might be due to megabytes encoded in the DNA. If this algorithmic complexity is necessary to solve machine learning, the field faces a daunting task in replicating it on a computer.

This observation suggests a possibility: if you can show that few bits of DNA are needed for learning in animals, then this provides evidence that machine learning (as a field) has a hope of big success with relatively little effort.

It is well known that specific portions of the brain have specific functionality across individuals. There are two ways this observation can be explained:

Maybe the specific functionality areas are encoded in the DNA.
Maybe the specific functionality areas arise from the learning process of the brain. This is the answer that machine learning would like to hear because it agrees with the hypothesis that a simple general learning system exists.

It’s important to realize that these choices actually specify a spectrum rather than a dichotomy. There are surely some problem-specific learning hacks in the brain and there is surely some generalized learning ability. The question is: to what degree is learning encoded by genetic heritage vs personal experience?

It is anecdotally well known that people (especially children) can recover from fairly severe brain damage, but of course we would prefer to avoid anecdotal evidence.

There are also neuroscience experiments addressing this question. This paper by Jitendra Sharma, Alessandra Angelucci, and Mriganka Sur provides some evidence. In a nutshell, they rewire the optic nerve of ferrets into the auditory region of the brain. They observe that structures similar to the visual specific region of the brain arise in the auditory region after rewiring (although the new regions may be less capable).

There are doubtless many other experiments addressing this question, but my knowledge of Neuroscience is lacking. (Thanks to Maneesh for pointing this one out.)

3/27/20063/27/2006

Gradients everywhere

One of the basic observations from the atomic learning workshop is that gradient-based optimization is pervasive. For example, at least 7 (of 12) speakers used the word ‘gradient’ in their talk and several others may be approximating a gradient. The essential useful quality of a gradient is that it decouples local updates from global optimization. Restated: Given a gradient, we can determine how to change individual parameters of the system so as to improve overall performance.

It’s easy to feel depressed about this and think “nothing has happened”, but that appears untrue. Many of the talks were about clever techniques for computing gradients where your calculus textbook breaks down.

Sometimes there are clever approximations of the gradient. (Simon Osindero)
Sometimes we can compute constrained gradients via iterated gradient/project steps. (Ben Taskar)
Sometimes we can compute gradients anyways over mildly nondifferentiable functions. (Drew Bagnell)
Even given a gradient, the choice of update is unclear, and might be cleverly chosen (Nic Schraudolph)

Perhaps a more extreme example of this is Adaboost which repeatedly reuses a classifier learner to implicitly optimize a gradient. Viewed as a gradient optimization algorithm, Adaboost is a sublinear algorithm (in the number of implicit parameters) when applied to decision trees.

3/24/2006

NLPers

Hal Daume has started the NLPers blog to discuss learning for language problems.

3/23/20063/23/2006

The Approximation Argument

An argument is sometimes made that the Bayesian way is the “right” way to do machine learning. This is a serious argument which deserves a serious reply. The approximation argument is a serious reply for which I have not yet seen a reply².

The idea for the Bayesian approach is quite simple, elegant, and general. Essentially, you first specify a prior P(D) over possible processes D producing the data, observe the data, then condition on the data according to Bayes law to construct a posterior:

P(D|x) = P(x|D)P(D)/P(x)
After this, hard decisions are made (such as “turn left” or “turn right”) by choosing the one which minimizes the expected (with respect to the posterior) loss.

This basic idea is reused thousands of times with various choices of P(D) and loss functions which is unsurprising given the many nice properties:

There is an extremely strong associated guarantee: If the actual distribution generating the data is drawn from P(D) there is no better method. One way to think about this is that in the Bayesian setting, the worst case analysis is the average case analysis.
The Bayesian method is a straightforward extension of the engineering method for designing a solution to a problem.
The Bayesian method is modular. The three information sources are prior P(D), data x, and loss, but loss only interacts with P(D) and x via the posterior P(D|x).

The fly in the ointment is approximation. The basic claim of the approximation argument is that approximation is unavoidable in all real-world problems that we care about. There are several ways in which approximation necessarily invades applications of Bayes Law.

When specifying the prior, the number of bits needed to describe the “real” P(D) is typically too large. The meaning of “real” P(D) actually varies, but this statement appears to hold true across all of them. What happens instead is that people take short-cuts specifying something which isn’t quite the real prior.
Even if the real P(D) is somehow specifiable, computing the posterior P(D|x) is often computationally intractable. Again, the common short-cut is to alter the prior so as to make it computationally tractable. (There are a few people who instead attempt to approximately compute the posterior via monte carlo methods.)

Consider for example the problem of speech recognition. A “real” prior P(D) (according to some definitions) might involve a distribution over the placement of air molecules, the shape of the throat producing the sound, and what is being pronounced. This prior might be both inarticulable (prior elicitation is nontrivial) and unrepresentable (because too many bits are required to store on a modern machine).

If the necessity of approximation is accepted, the question becomes “what do you do about it?” There are many answers:

Ignore the problem. This works well sometimes but can not be a universal prescription.
Avoid approximation and work (or at least work a computer) very hard. This also can work well, at least for some problems.
Use an approximate Bayesian method and leave a test set on the side to sanity check results. This is a common practical approach.
Violate the modularity of loss and attempt to minimize approximation errors near the decision boundary. There seems to be little deep understanding of the viability and universality of this approach but there are examples where this approach can provide significant benefits.

Some non-Bayesian approaches can be thought of as attempts at (4).