The essential problem here is the large gap between experimental observation and theoretical understanding of cross validation.
Method K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K. For each fold, a classifier is trained on the other folds and then tested on the held-out fold.
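For concreteness, here is a minimal sketch of the procedure in Python. It assumes a scikit-learn-style classifier with fit/predict methods and numpy arrays for the data; the function and parameter names are illustrative, not part of the problem statement.

```python
import numpy as np

def k_fold_cross_validation(make_classifier, X, y, K=10, seed=0):
    """Train K classifiers, each on all folds but one, and test each
    on its held-out fold. Returns the K classifiers and the average
    held-out error."""
    m = len(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)          # shuffle before splitting
    folds = np.array_split(order, K)    # K folds of size roughly m/K
    classifiers, fold_errors = [], []
    for k in range(K):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        clf = make_classifier()
        clf.fit(X[train_idx], y[train_idx])        # train on the other folds
        err = np.mean(clf.predict(X[test_idx]) != y[test_idx])  # test on fold k
        classifiers.append(clf)
        fold_errors.append(err)
    return classifiers, float(np.mean(fold_errors))
```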
Problem Assume only that the samples are drawn independently. Derive a single classifier from the K fold classifiers with a small bound on its true error rate.
Past Work (I’ll add more as I remember/learn.)
- Devroye, Rogers, and Wagner analyzed cross validation and found algorithm-specific bounds. Not all of this work is online, but here is one paper.
- Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K.
- Avrim Blum, Adam Kalai, and I analyzed cross validation and found that, with no additional assumptions, you can do at least as well as a test set of size m/K using the randomized classifier which draws uniformly from the set of K fold classifiers (see the sketch after this list).
- Yoshua Bengio and Yves Grandvalet analyzed cross validation and concluded that there is no unbiased estimator of the variance of K-fold cross validation.
- Matti Kääriäinen noted that you can safely derandomize a stochastic classifier (such as one that randomizes over the K folds) using unlabeled data without additional assumptions.
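The randomized classifier in the third bullet is easy to state in code. The sketch below assumes the list of fold classifiers produced by the procedure above and a scikit-learn-style predict interface; both are assumptions for illustration.

```python
import random

class RandomizedFoldClassifier:
    """The stochastic classifier from the Blum-Kalai-Langford result:
    for each prediction, draw one of the K fold classifiers uniformly
    at random and use its prediction."""

    def __init__(self, classifiers):
        self.classifiers = classifiers

    def predict_one(self, x):
        clf = random.choice(self.classifiers)  # uniform draw over the K folds
        return clf.predict([x])[0]
```

Kääriäinen’s observation is that a stochastic classifier of this form can then be safely derandomized using unlabeled data.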
Some Extreme Cases to Sharpen Intuition
- Suppose on every fold the learned classifier is the same. Then, the cross-validation error should behave something like a test set of size m. This is radically superior to a test set of size m/K.
- Consider leave-one-out cross validation. Suppose we have a “learning” algorithm that uses the classification rule “always predict the parity of the labels on the training set”. Suppose the learning problem is defined by a distribution which picks y=1 with probability 0.5, independent of the features. The prediction for example i is then the parity of the other m-1 labels, which errs exactly when the parity of all m labels is 1. So, with probability 0.5, all leave-one-out errors will be 0 and otherwise all 1 (like a single coin flip, despite using m examples; see the simulation sketch below).
(some discussion is here)
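A small simulation makes the second extreme case concrete. It uses the stated setup (labels uniform on {0,1}, features irrelevant); the function name and parameters are illustrative.

```python
import numpy as np

def loo_parity_errors(m=100, trials=10000, seed=0):
    """Simulate the leave-one-out error of the 'predict the parity of
    the training labels' rule. Each trial's error estimate is 0 or 1,
    so across trials the estimate has mean ~0.5 and std ~0.5: a single
    coin flip, no matter how large m is."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(trials):
        y = rng.integers(0, 2, size=m)
        total = y.sum() % 2
        # Prediction for example i is the parity of the other m-1 labels,
        # i.e. total XOR y[i]; it disagrees with y[i] exactly when total == 1.
        preds = total ^ y
        estimates.append(np.mean(preds != y))
    return float(np.mean(estimates)), float(np.std(estimates))
```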
Difficulty I consider this a hard problem. I’ve worked on it without success and it’s an obvious problem (due to the pervasive use of cross validation) that I suspect other people have considered. Analyzing the dependency structure of cross validation is quite difficult.
Impact On any individual problem, solving this might have only a small impact due to slightly improved judgement of success. But, because cross validation is used extensively, the overall impact of a good solution might be very significant.