SVM Adaptability

A number of recent papers have shown that SVM-like optimizations can handle several large families of loss functions.

This is a good thing, because it is implausible that the loss function imposed by the world can simply be ignored in the process of solving a prediction problem. Even people used to the hard-core Bayesian approach to learning often note that some approximations are almost inevitable in specifying a prior and/or integrating to achieve a posterior. Taking into account how the system will be evaluated can allow both computational effort and design effort to be focused so as to improve performance.

A current laundry list of capabilities includes the following (a sketch of the common formulation appears after the list):

  1. 2002 multiclass SVM including arbitrary cost matrices
  2. ICML 2003 Hidden Markov Models
  3. NIPS 2003 Markov Networks (see some discussion)
  4. EMNLP 2004 Context free grammars
  5. ICML 2004 Any loss (with much computation)
  6. ICML 2005 Any constrained linear prediction model (the name is my own).
  7. ICML 2005 Any loss dependent on a contingency table
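
Despite the variety, many of these entries fit a common template: choose parameters so that the correct structured output beats every alternative by a margin scaled by how bad that alternative is. The Python sketch below only illustrates that template rather than any particular paper’s method; the feature map phi, the task loss, and the explicit enumeration of candidate outputs Y are placeholders, since real methods replace the enumeration with clever loss-augmented inference.

    import numpy as np

    def structured_hinge_loss(w, x, y, Y, phi, task_loss):
        """Margin-rescaled structured hinge loss (an illustrative sketch only).

        w         -- parameter vector
        x, y      -- an example and its correct structured output
        Y         -- candidate outputs, assumed small enough to enumerate here;
                     real methods replace this with loss-augmented inference
        phi       -- joint feature map phi(x, y) returning a vector
        task_loss -- loss between two outputs (e.g. a cost matrix entry)
        """
        score_true = np.dot(w, phi(x, y))
        # The most violated alternative: an output scored too close to the
        # truth relative to how bad it is.
        worst = max(task_loss(y, y_hat) + np.dot(w, phi(x, y_hat)) - score_true
                    for y_hat in Y)
        return max(0.0, worst)

For entry (1), for instance, phi(x, k) would place x in the k-th block of w and task_loss(y, k) would be the corresponding cost matrix entry; roughly speaking, the later entries differ mainly in how the maximization over Y is carried out.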

I am personally interested in how this relates to the learning reductions work, which has similar goals but works at a different level of abstraction (the learning problem rather than the algorithmic mechanism). The difference in abstraction implies that anything solvable by reduction should be solvable by a direct algorithmic mechanism. However, comparing and contrasting the results I know of, it seems that what is solvable via reduction to classification versus what is solvable via direct SVM-like methods is currently incomparable.

  1. Can SVMs be tuned to directly solve (example dependent) cost sensitive classification? Obviously, they can be tuned indirectly via reduction, but it is easy to imagine more tractable direct optimizations.
  2. How efficiently can learning reductions be used to solve structured prediction problems? Structured prediction problems are instances of cost sensitive classification, but the regret transform efficiency which occurs when this embedding is done is too weak to be of interest.
  3. Are there any problems efficiently solvable by SVM-like algorithms which are not efficiently solvable via learning reductions?

Why Manifold-Based Dimension Reduction Techniques?

Manifold-based dimension reduction algorithms share the following general outline.

Given: a metric d() and a set of points S

  1. Construct a graph with a node for every point and edges connecting each node to the nodes of its k nearest neighbors. Associate with each edge a weight equal to the distance between the points at its endpoints.
  2. Digest the graph. This might include computing shortest paths between all pairs of points or figuring out how to linearly interpolate each point from its neighbors.
  3. Find a set of points in a low dimensional space which preserve the digested properties.

Examples include LLE, Isomap (which I worked on), Hessian-LLE, SDE, and many others. The hope with these algorithms is that they can recover the low dimensional structure of point sets in high dimensional spaces. Many of them can be shown to work in interesting ways, producing various compelling pictures.
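
To make the outline concrete, here is a minimal Isomap-style sketch of steps 1-3 (k-nearest-neighbor graph, all-pairs shortest paths, classical multidimensional scaling). It assumes numpy and scipy, assumes the neighborhood graph comes out connected, and skips the robustness details any real implementation needs.

    import numpy as np
    from scipy.sparse.csgraph import shortest_path
    from scipy.spatial.distance import pdist, squareform

    def isomap_sketch(S, k=10, out_dim=2):
        """Steps 1-3 above on a point set S (an n x d array)."""
        D = squareform(pdist(S))                # the metric d() on all pairs
        n = D.shape[0]

        # 1. Graph: a node per point, edges to each point's k nearest
        #    neighbors, weighted by distance.  np.inf marks "no edge".
        W = np.full((n, n), np.inf)
        for i in range(n):
            nbrs = np.argsort(D[i])[1:k + 1]    # index 0 is the point itself
            W[i, nbrs] = D[i, nbrs]
            W[nbrs, i] = D[nbrs, i]

        # 2. Digest: all-pairs shortest paths approximate geodesic distances
        #    (assumes the neighborhood graph is connected).
        G = shortest_path(W, method="D", directed=False)

        # 3. Embed: classical MDS on the geodesic distance matrix.
        H = np.eye(n) - np.ones((n, n)) / n     # double-centering matrix
        B = -0.5 * H @ (G ** 2) @ H
        vals, vecs = np.linalg.eigh(B)
        top = np.argsort(vals)[::-1][:out_dim]  # largest eigenvalues
        return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))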

Despite doing some early work in this direction, I suffer from a motivational problem: Why do we want to recover the low dimensional structure? One answer is “for better data visualization”. This is compelling if you have data visualization problems. However, I don’t; I want to make machines that can better predict the future, which generally appears to be a sound goal of learning. Reducing the dimensionality of a dataset is not obviously helpful in accomplishing this. In fact, doing so violates one of the basic intuitions of applied learning algorithms: “avoid double approximation”. (One approximation is the projection into the low dimensional space; another is the classifier learned on that space.)

Another answer is “for robots”. Several people have experimented with using a vision sensor and a dimension reduction technique in an attempt to extract the manifold of pose space. These attempts have not generally worked well, basically because the Euclidean distance on pixels is not particularly good at predicting which things are “nearby”. However, we might be able to do considerably better if we learn the distance. At the 1-bit level, we might learn a predictor from image pairs to “nearby” or “far”. Any stream S of images i1, i2, i3, …, in can be transformed into a binary problem according to: {((ij, ik), 1 – I(j = k+1 or k = j+1)) : ij, ik in S}. In unmath, “the binary problem formed by predicting whether images are adjacent in the chain of experience”. (*) A good solution to this binary problem would give us an interesting 1-bit metric. Using regression and counting numbers of transitions might provide a more conventional multibit metric.
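
For concreteness, here is one way the binary problem in (*) might be constructed from an image stream; the negatives_per_positive knob is my own addition, in the spirit of the footnote below about the positive/negative ratio.

    import random

    def adjacency_pairs(S, negatives_per_positive=1, seed=0):
        """Turn an image stream S = [i1, i2, ..., in] into the binary problem (*).

        Following the convention in the text, adjacent frames get label 0 and
        non-adjacent frames get label 1.  Subsampling the label-1 pairs (the
        negatives_per_positive knob) is my own addition to keep the ratio sane.
        """
        rng = random.Random(seed)
        n = len(S)
        pairs = []
        for j in range(n - 1):
            pairs.append(((S[j], S[j + 1]), 0))   # adjacent in the chain of experience
            for _ in range(negatives_per_positive):
                k = rng.randrange(n)
                if abs(k - j) > 1:                # skip self and accidental adjacency
                    pairs.append(((S[j], S[k]), 1))
        return pairs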

This metric, if learned well, has a concrete meaning: the minimum distance, in terms of actuator transitions, between positions. A shortest path in this space is a sequence of actuator movements leading from a position A to a position B. A projection of this space into low dimensions provides some common format which both the human and the robot can understand. Commanding the robot to go to some location is just a matter of pointing out that location in the low dimensional projection.

This is a possible use for manifold based dimension reduction techniques which I find compelling, if it works out. (Anyone interested in playing with this should talk to Dana Wilkinson who is considering experimenting with this approach.)

(*) We probably would want to tweak the positive/negative ratio to reflect the pattern encountered in usage.

Apprenticeship Reinforcement Learning for Control

Pieter Abbeel presented a paper with Andrew Ng at ICML on Exploration and Apprenticeship Learning in Reinforcement Learning. The basic idea of this algorithm is (a code sketch follows the steps):

  1. Collect data from a human controlling a machine.
  2. Build a transition model based upon the experience.
  3. Build a policy which optimizes the transition model.
  4. Evaluate the policy. If it works well, halt; otherwise, add the new experience to the pool and go to (2).
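
In code, the loop is roughly the sketch below; fit_model, plan, rollout, and good_enough are placeholders for the model estimation, planning, and evaluation steps, not the paper’s actual procedures.

    def apprenticeship_loop(human_trajectories, fit_model, plan, rollout,
                            good_enough, max_iters=50):
        """Sketch of the loop above.

        fit_model(data)  -> transition model estimated from all experience so far
        plan(model)      -> policy optimizing the (approximate) model
        rollout(policy)  -> (return, trajectory) from running the policy for real
        good_enough(ret) -> True if the measured return is acceptable
        """
        data = list(human_trajectories)            # 1. expert demonstrations
        policy = None
        for _ in range(max_iters):
            model = fit_model(data)                # 2. build a transition model
            policy = plan(model)                   # 3. optimize against the model
            ret, trajectory = rollout(policy)      # 4. evaluate on the real system
            if good_enough(ret):
                break
            data.append(trajectory)                # otherwise grow the experience pool
        return policy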

The paper proves that this technique will converge to some policy with expected performance near the human’s expected performance, provided the world satisfies certain assumptions (MDP or linear dynamics).

This general idea of apprenticeship learning (i.e. incorporating data from an expert) seems very compelling because (a) humans often learn this way and (b) much harder problems can be solved. For (a), the notion of teaching is about transferring knowledge from an expert to novices, often via demonstration. To see (b), note that we can create intricate reinforcement learning problems where a particular sequence of actions must be taken to achieve a goal. A novice might be able to memorize this sequence given just one demonstration even though it would require experience exponential in the length of the sequence to discover the key sequence accidentally.

Andrew Ng’s group has exploited this to make this very fun picture.
(Yeah, that’s a helicopter flying upside down, under computer control.)

As far as this particular paper goes, one question occurs to me. There is a general principle of learning which says we should avoid “double approximation”, such as occurs in step (3) where we build an approximate policy on an approximate model. Is there a way to fuse steps (2) and (3) to achieve faster or better learning?

Why Reinforcement Learning is Important

One prescription for solving a problem well is:

  1. State the problem, in the simplest way possible. In particular, this statement should involve no contamination with or anticipation of the solution.
  2. Think about solutions to the stated problem.

Stating a problem in a succinct and crisp manner tends to invite a simple elegant solution. When a problem cannot be stated succinctly, we wonder if the problem is even understood. (And when a problem is not understood, we wonder if a solution can be meaningful.)

Reinforcement learning does step (1) well. It provides a clean, simple language to state general AI problems. In reinforcement learning there is a set of actions A, a set of observations O, and a reward r. The reinforcement learning problem, in general, is defined by a conditional measure D(o, r | (o,r,a)*) which produces an observation o and a reward r given a history (o,r,a)*. The goal in reinforcement learning is to find a policy pi: (o,r,a)* -> a mapping histories to actions so as to maximize (or approximately maximize) the expected sum of observed rewards.
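
As an interface, that is all there is to it. The sketch below is just this definition transcribed into code; the env_step stand-in for sampling from D and the fixed horizon T are simplifying assumptions rather than part of the formulation.

    from typing import Callable, List, Tuple

    Obs, Action, Reward = str, str, float          # stand-ins; O and A can be anything
    History = List[Tuple[Obs, Reward, Action]]     # (o,r,a)* as in the text

    # pi: (o,r,a)* -> a, with the newest (o, r) passed separately for convenience.
    Policy = Callable[[History, Obs, Reward], Action]

    def episode_return(env_step, pi: Policy, T: int) -> float:
        """Interact with a world for T steps and sum the observed rewards.

        env_step(history) -> (o, r) stands in for sampling D(o, r | (o,r,a)*).
        """
        history: History = []
        total = 0.0
        for _ in range(T):
            o, r = env_step(history)               # the world draws (o, r) given the history
            a = pi(history, o, r)                  # the policy maps everything seen to an action
            history.append((o, r, a))
            total += r
        return total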

This formulation is capable of capturing almost any (all?) AI problem. (Are there any other formulations capable of capturing a similar generality?) I don’t believe we yet have good RL solutions from step (2), but that is unsurprising given the generality of the problem.

Note that solving RL in this generality is impossible (for example, it can encode classification). The two approaches that can be taken are:

  1. Simplify the problem. It is very common to consider the restricted problem where the history is summarized by the previous observation (aka a “Markov Decision Process”); see the sketch after this list. In many cases, other restrictions are added.
  2. Think about relativized solutions (such as reductions).
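
At the type level, the first option is simply a restriction on what the policy (and the world) may depend on. A minimal sketch, with the same stand-in types as the previous sketch:

    from typing import Callable, List, Tuple

    Obs, Action, Reward = str, str, float
    History = List[Tuple[Obs, Reward, Action]]

    # The general problem: the policy may consult everything seen so far.
    GeneralPolicy = Callable[[History, Obs, Reward], Action]

    # The Markov restriction: the policy (and the world) depend only on the
    # most recent observation.
    MarkovPolicy = Callable[[Obs], Action]

    def lift(markov_pi: MarkovPolicy) -> GeneralPolicy:
        """Every Markov policy is a (forgetful) special case of a general policy."""
        return lambda history, o, r: markov_pi(o)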

Both approaches are under active investigation.

Peekaboom

Luis has released Peekaboom, a successor to ESPgame (game site). The purpose of the game is similar: using the actions of people playing a game to gather data helpful in solving AI.

Peekaboom gathers more detailed, and perhaps more useful, data about vision. For ESPgame, the byproduct of the game was mutually agreed upon labels for common images. For Peekaboom, the location of the subimage generating the label is revealed by the game as well. Given knowledge about what portion of the image is related to a label, it may be more feasible to learn to recognize the appropriate parts.

There isn’t a dataset yet available for this game as there is for ESPgame, but hopefully a significant number of people will play and we’ll have one to work with soon.