Workshop Summary—Principles of Learning Problem Design

This is a summary of the workshop on Learning Problem Design which Alina and I ran at NIPS this year.

The first question many people have is “What is learning problem design?” This workshop is about admitting that solving learning problems does not start with labeled data, but rather somewhere before. When humans are hired to produce labels, this is usually not a serious problem, because you can tell them precisely what semantics you want the labels to have and fix some set of features in advance. When the labels come from other sources, things become more problematic. This focus is important for machine learning because very large quantities of data are not labeled by a hired human.

The title of the workshop was a bit ambitious, because a workshop is not long enough to synthesize a diversity of approaches into a coherent set of principles. For me, the posters at the end of the workshop were quite helpful in getting approaches to gel.

Here are some answers to “where do the labels come from?”:

  1. Simulation: Use a simulator (which need not be that good) to predict the cost of various choices and turn that into label information. Ashutosh had some cool demos showing the power of this approach. Gregory also presented a poster which might be viewed this way.
  2. Agreement: A label is a point of agreement. Luis often used an agreement mechanism to induce labels with games. Sham discussed the power of agreement to constrain learning algorithms. Huzefa's work on bioprediction can be thought of as partly using agreement with previous structures to simulate the label of a new structure.
  3. Compilation: Labels can be found by compiling one learning problem into another. Mark and I both talked about reductions a bit, which come with some nice formal guarantees (see the sketch after this list).
  4. Backprop: Labels are the signals in generalized backpropagation (David Bradley's talk).
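
To make the compilation idea concrete, here is a toy sketch of perhaps the simplest reduction: compiling multiclass classification into several binary classification problems via one-vs-all. This is my own illustration rather than anything presented at the workshop; the synthetic data and the choice of scikit-learn's logistic regression as the base binary learner are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_all_train(X, y, n_classes):
    """Compile one multiclass problem into n_classes binary problems."""
    learners = []
    for c in range(n_classes):
        binary_labels = (y == c).astype(int)   # label: "is this example class c?"
        learners.append(LogisticRegression().fit(X, binary_labels))
    return learners

def one_vs_all_predict(learners, X):
    """Predict the class whose binary learner is most confident."""
    scores = np.column_stack([clf.predict_proba(X)[:, 1] for clf in learners])
    return scores.argmax(axis=1)

# Tiny synthetic example (placeholder data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0).astype(int)   # 4 classes
learners = one_vs_all_train(X, y, n_classes=4)
print((one_vs_all_predict(learners, X) == y).mean())
```

The appeal of working this way is that a performance guarantee for the binary subproblems translates into a guarantee for the original problem, which is the kind of formal statement reductions provide.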

Some answers to “where do the data come from” are:

  1. Everywhere: The essential idea is to integrate as many data sources as possible. Rakesh had several algorithms which (in combination) allowed him to use a large number of diverse data sources in a text domain.
  2. Sparsity: A representation is formed by finding a sparse set of basis functions on otherwise totally unlabeled data. Rajat discussed self-taught learning algorithms which achieve this.
  3. Self-prediction: A representation is formed by learning to self-predict a set of raw features. Hal's talk covered this idea (see the sketch after this list).
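
As one concrete (and entirely my own) toy rendering of the self-prediction idea: predict each raw feature from the remaining features using only unlabeled data, and use the stacked predictions as a new representation. The synthetic data and the use of ridge regression are arbitrary choices, not Hal's method.

```python
import numpy as np
from sklearn.linear_model import Ridge

def self_prediction_features(X):
    """For each raw feature, learn to predict it from the remaining features
    (unlabeled data only); the stacked predictions form a new representation."""
    n, d = X.shape
    Z = np.zeros_like(X, dtype=float)
    for j in range(d):
        rest = np.delete(X, j, axis=1)            # all features except j
        model = Ridge(alpha=1.0).fit(rest, X[:, j])
        Z[:, j] = model.predict(rest)             # self-predicted version of feature j
    return Z

# Unlabeled data with correlated features (placeholder).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
Z = self_prediction_features(X)   # structure-aware representation of the raw features
print(Z.shape)
```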

A workshop like this is successful if it informs the questions we ask (and answer) in the future. Some natural questions (some of which were discussed) are:

  1. What is a natural, sufficient language for adding prior information into a learning system? Which languages are insufficient? Shai described a sense in which kernels are insufficient as a language for prior information. Bayesian analysis emphasizes reasoning about the parameters of the model, but the language of examples or maybe label expectations may be more natural.
  2. What is missing from the above lists? And are the elements of the lists actually distinct?
  3. How do we modularize? Many of the approaches use problem-specific tricks. That’s to be expected for a direction of research which is just starting, but it’s important to modularize these techniques so they can be repeatedly and easily applied. Achieving modularity in a manner which supports prior information properly seems tricky.
  4. How do we formalize and analyze? Of the items listed above, I feel like we only have some reasonable understanding of the compilation approach. The other approaches and questions are essentially unexplored territory where some serious thinking may be helpful.

Learning Track of International Planning Competition

The International Planning Competition (IPC) is a biennial event organized in the context of the International Conference on Automated Planning and Scheduling (ICAPS). This year, for the first time, there will be a learning track of the competition. For more information you can go to the competition web-site.

The competitions are typically organized around a number of planning domains that can vary from year to year, where a planning domain is simply a class of problems that share a common action schema—e.g. Blocksworld is a well-known planning domain that contains a problem instance for each possible initial tower configuration and goal configuration. Some other domains have included Logistics, Airport, Freecell, PipesWorld, and many others. For each domain the competition includes a number of problems (say 40-50), and the planners are run on each problem with a per-problem time limit (around 30 minutes). The problems are hard enough that many are not solved within the time limit.
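
For readers unfamiliar with the setup, here is a heavily simplified, hypothetical Python rendering of what it means for Blocksworld problem instances to share one action schema. Actual competition domains are written in PDDL; this sketch only illustrates the structure.

```python
# Hypothetical, heavily simplified Blocksworld encoding: one action schema
# ("move block b from x to y") shared by every problem instance; instances
# differ only in their initial and goal tower configurations.
from dataclasses import dataclass

@dataclass(frozen=True)
class Move:
    block: str
    src: str   # a block or "table"
    dst: str   # a block or "table"

def legal_moves(towers):
    """Enumerate applicable moves: only the top block of a tower can move."""
    tops = [t[-1] for t in towers if t]
    moves = []
    for b in tops:
        for dst in tops + ["table"]:
            if dst != b:
                src_tower = next(t for t in towers if t and t[-1] == b)
                src = src_tower[-2] if len(src_tower) > 1 else "table"
                moves.append(Move(b, src, dst))
    return moves

# Two problem instances from the same domain (same schema, different configurations).
instance_1 = {"init": [["a", "b", "c"]],     "goal": [["c", "b", "a"]]}
instance_2 = {"init": [["a"], ["b"], ["c"]], "goal": [["a", "b", "c"]]}
print(legal_moves(instance_1["init"]))
```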

Given that the planners are asked to solve many problems from individual domains, and that problems within a domain generally have common solution structures, it makes sense to consider learning from previous problem-solving experience in a domain to better solve future problems in the same domain. Here “better solve” could mean either solving the problems significantly more quickly or finding better-quality plans in a similar time frame. However, no planner in any of the competitions has included a learning component. Rather, to quote Foreigner, for these planners each problem “feels like the first time”.

Perhaps one reason that planners have not incorporated learning into the competition setting is that there has not been much overlap between the ICML and ICAPS communities, although that is changing. Another reason is perhaps that the structure of the competition would deduct any “learning time” from a planner's 30 minutes per problem, which could reduce the potential benefits.

The learning track for the 2008 competition is being designed so that learning time is not counted against planners. Rather, there will be a distinct learning phase and a distinct evaluation phase. During the learning phase the planners will be provided with the set of domains to be used in the competition and example problems from each. The evaluation phase will be conducted like the current competition, with the exception that the learning-based planners will be allowed to read in a learned domain-specific “knowledge file” when solving the problems. This structure is designed to help answer the following question:

Do we have techniques that can leverage a learning period to outperform state-of-the-art non-learning techniques across a wide range of domains?

My current belief is that the answer is “no”. I certainly have never seen any such demonstration. This is not because of a lack of work in the area of “learning to plan”; there is a long history dating back to some of the early planners (see my horribly outdated resource page for a taste). While many of the learning approaches have shown some degree of success, the evaluations have typically been very narrow, focusing on only 2 to 3 domains and often only a few problems. My intuition, grounded in personal experience, is that most (all) of these systems would be quite brittle when taken to new domains. The hope is that the learning track of the competition will force us to take the issue of robustness seriously and soon lead to learning systems that convincingly outperform non-learning planners across a wide range of domains, given proper training experience.

I hope to see a wide range of approaches entered into the competition. I’ll mention two styles of approaches that might be particularly interesting to readers of this blog.

First, one might consider applying reinforcement learning to learn “generalized policies” that can be applied to any problem from a domain. Recall that here the domain model is provided to us, so applying RL would mean that the domain model is used as a sort of simulator in which an RL algorithm is run. RL is particularly difficult in these domains due to the challenges in developing an appropriate representation for learning value functions and/or policies—the states can be viewed as sets of ground relational atoms, rather than the more typical n-dimensional vectors common in RL. Another challenge is the extremely sparse reward, which is obtained only at goal states. There has been some work on applying RL to IPC-style domains (e.g. relational reinforcement learning, approximate policy iteration, policy gradient) but much improvement is needed to compete with non-learning planners.
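
To make the “domain model as simulator” idea concrete, here is a rough sketch of Q-learning with a linear function over relational-atom features. The hooks applicable, apply_action, is_goal, and atom_features are hypothetical stand-ins for a real domain model, and the sketch does nothing clever about the representation and sparse-reward difficulties just mentioned.

```python
import random
from collections import defaultdict

# Hedged sketch: the planning domain model serves as the simulator for Q-learning.
# States are frozensets of ground relational atoms; reward appears only at goal states.
# `applicable`, `apply_action`, `is_goal`, and `atom_features` are hypothetical hooks.

def q_learning(init_state, applicable, apply_action, is_goal, atom_features,
               episodes=1000, horizon=50, alpha=0.1, gamma=0.95, epsilon=0.1):
    w = defaultdict(float)   # linear Q-function over (atom feature, action) pairs

    def q(state, action):
        return sum(w[(f, action)] for f in atom_features(state))

    for _ in range(episodes):
        state = init_state
        for _ in range(horizon):
            actions = applicable(state)
            if not actions:
                break
            if random.random() < epsilon:
                action = random.choice(actions)            # explore
            else:
                action = max(actions, key=lambda a: q(state, a))
            next_state = apply_action(state, action)       # simulator step
            reward = 1.0 if is_goal(next_state) else 0.0   # sparse goal reward
            next_actions = applicable(next_state)
            target = reward + gamma * max((q(next_state, a) for a in next_actions),
                                          default=0.0)
            td_error = target - q(state, action)
            for f in atom_features(state):                 # gradient step on w
                w[(f, action)] += alpha * td_error
            state = next_state
            if is_goal(state):
                break
    return w
```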

Second, one might consider structured-classification techniques for this problem. Here one might view the planning problem as an input X and the plan as the structured output Y. Training data can be generated by solving example planning problems using state-of-the-art planners, perhaps with significant resources. This approach has been studied under the name max-margin planning, but applied to a very different class of planning problems. Other work has considered applying the Learning as Search Optimization (LaSO) framework to IPC-style domains with some success. Some of the challenges here are automatically producing an appropriate feature set for a given planning domain and handling ambiguity in the training data. Ambiguity here refers to the fact that there are often a huge number of equally good plans for a given problem, while the training data contains only one or a small number of them, making the training data incomplete.
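
Here is a rough, LaSO-flavored sketch of what structured classification could look like in this setting (not the published algorithm): learn a linear scoring function so that greedy search reproduces demonstration plans produced by a conventional planner. The domain hooks and feature function are hypothetical.

```python
from collections import defaultdict

# Hedged sketch: perceptron-style training of a linear scoring function over
# (state, action) features so greedy action selection follows demonstration plans.
# `features`, `applicable`, and `apply_action` are hypothetical domain hooks;
# each demonstration is (initial_state, [action_1, action_2, ...]).

def train_greedy_policy(demonstrations, features, applicable, apply_action,
                        epochs=5, lr=1.0):
    w = defaultdict(float)

    def score(state, action):
        return sum(w[f] * v for f, v in features(state, action).items())

    for _ in range(epochs):
        for init_state, plan in demonstrations:
            state = init_state
            for target_action in plan:
                predicted = max(applicable(state), key=lambda a: score(state, a))
                if predicted != target_action:
                    # Update toward the demonstrated action, away from the prediction.
                    for f, v in features(state, target_action).items():
                        w[f] += lr * v
                    for f, v in features(state, predicted).items():
                        w[f] -= lr * v
                state = apply_action(state, target_action)  # follow the demonstration
    return w
```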

The Netflix Crack

A couple of security researchers claim to have cracked the Netflix dataset. The claims of success appear somewhat overstated to me, but the method of attack is valid and could plausibly be substantially improved so as to reveal the movie preferences of a small fraction of Netflix users.

The basic idea is to use a heuristic similarity function between ratings in a public database (from IMDB) and an anonymized database (Netflix) to link ratings in the private database to public identities (in IMDB). They claim to have linked two of a few dozen IMDB users to anonymized Netflix users.
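
To give a flavor of how such a linkage attack works, here is a toy sketch, not the researchers' actual scoring function: score each anonymized record against a public record by counting similarly rated movies, and accept a match only when the best candidate clearly beats the second best.

```python
# Toy linkage sketch (illustrative only): ratings are dicts mapping movie -> score.

def similarity(public, anon, rating_tol=1):
    """Count movies rated similarly in the public and anonymized records."""
    shared = set(public) & set(anon)
    return sum(1 for m in shared if abs(public[m] - anon[m]) <= rating_tol)

def link(public, anon_records, margin=2):
    """Return the anonymized record id matching the public record, if unambiguous."""
    scores = sorted(((similarity(public, r), rid) for rid, r in anon_records.items()),
                    reverse=True)
    if len(scores) >= 2 and scores[0][0] - scores[1][0] >= margin:
        return scores[0][1]   # confident match
    return None               # too ambiguous to link

imdb_user = {"Brazil": 5, "Memento": 4, "Gigli": 1}
netflix = {
    "user_17": {"Brazil": 5, "Memento": 4, "Gigli": 1, "Heat": 3},
    "user_42": {"Titanic": 5, "Heat": 2},
}
print(link(imdb_user, netflix))   # -> "user_17"
```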

The claims seem a bit inflated to me, because (a) knowing the IMDB identity isn’t equivalent to knowing the person and (b) the claims of statistical significance are with respect to a model of the world they created (rather than with respect to the world itself).

Overall, this is another example showing that complete privacy is hard. It may be worth remembering that there are some substantial benefits from the Netflix challenge as well—we (as a society) have learned something about how to do collaborative filtering which is useful beyond just recommending movies.

Slashdot has some discussion.

Computational Consequences of Classification

In the regression vs classification debate, I’m adding a new “pro” to classification. It seems there are computational shortcuts available for classification which simply aren’t available for regression. This arises in several situations.

  1. In active learning it is sometimes possible to find an e-error classifier with just log(1/e) labeled samples (see the sketch after this list). Only much more modest improvements appear to be achievable for squared loss regression. The essential reason is that the loss function on many examples is flat with respect to large variations in the parameter space of a learned classifier, which implies that many of these classifiers do not need to be considered. In contrast, for squared loss regression, most substantial variations in the parameter space influence the loss at most points.
  2. In budgeted learning, where there is either a computational time constraint or a feature cost constraint, a classifier can sometimes be learned to very high accuracy under the constraints while a squared loss regressor cannot. For example, if there is one feature which determines whether a binary label has probability less than or greater than 0.5, a great classifier exists using just one feature. Because squared loss is sensitive to the exact probability, many more features may be required to learn well with respect to squared loss.
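
To illustrate the active learning claim in item 1, here is a toy sketch assuming a noiseless one-dimensional threshold class: binary search recovers the threshold to accuracy e using about log(1/e) label queries, while a passive scheme that spreads labels uniformly needs on the order of 1/e.

```python
import math

# Toy sketch of the active-learning point (noiseless 1-D threshold class assumed):
# binary search finds the threshold to accuracy eps with about log2(1/eps) label
# queries, versus on the order of 1/eps labels for a uniform passive scheme.

def true_label(x, threshold=0.618):   # hidden labeling function (unknown to the learner)
    return int(x >= threshold)

def active_learn_threshold(eps, label_oracle=true_label):
    lo, hi = 0.0, 1.0
    queries = 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if label_oracle(mid) == 1:
            hi = mid    # the threshold is at or below mid
        else:
            lo = mid    # the threshold is above mid
    return (lo + hi) / 2, queries

eps = 1e-3
estimate, queries = active_learn_threshold(eps)
print(f"threshold ~ {estimate:.4f} with {queries} labels "
      f"(a uniform passive scheme needs ~{math.ceil(1 / eps)})")
```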