(Dis)similarities between academia and open source programmers

Martin Pool and I recently discussed the similarities and differences between academia and open source programming.

Similarities:

  1. Cost profile Research and programming share approximately the same cost profile: A large upfront effort is required to produce something useful, and then “anyone” can use it. (The “anyone” is not quite right for either group because only sufficiently technical people could use it.)
  2. Wealth profile A “wealthy” academic or open source programmer is someone who has contributed a lot to other people in research or programs. Much of academia is a “gift culture”: whoever gives the most is most respected.
  3. Problems Both academia and open source programming suffer from similar problems.
    1. Whether (and which) open source programs are used is perhaps too often personality-driven rather than driven by capability or usefulness. Similar phenomena can happen in academia with respect to directions of research.
    2. Funding is often a problem for both groups. Academics often invest many hours in writing grants, while open source programmers often are simply not paid.
  4. Both groups of people work in a mixed competitive/collaborative environment.
  5. Both groups use conferences as a significant mechanism of communication.

Given the similarities, it is not too surprising that there is significant cooperation between academia and open source programming, and it is relatively common to cross over from one to the other.

The differences are perhaps more interesting to examine because they may point out where one group can learn from the other.

  1. A few open source projects have achieved significantly larger scale than academia in terms of coordinating many people over a long time. Big project examples include Linux, Apache, and Mozilla. Groups of people of this scale in academia are typically things like “the ICML community” or “people working on Bayesian learning”, which are significantly less tightly coupled than any of the above projects. This suggests it may be possible to achieve significantly larger close collaborations in academia.
  2. Academia has managed to secure significantly more funding than open source programmers. Funding typically comes from a mixture of student tuition and government grants. Part of the reason for better funding in academia is that it has been around longer and so has been able to accomplish more. Perhaps governments will start funding open source programming more seriously if it produces an equivalent (with respect to societal impact) of the atom bomb.
  3. Academia has a relatively standard career path: grade school education, undergraduate education, graduate education, then apply for a job as a professor at a university. In contrast, the closest thing to a career path for open source programmers is something like “do a bunch of open source projects and become so wildly successful that some company hires you to do the same thing”. This is a difficult path, but perhaps it is slowly becoming easier, and there is still much room for improvement.
  4. Open source programmers take significantly more advantage of modern tools for communication. As an example of this, Martin mentioned that perhaps half the people working on Ubuntu have blogs. In academia, they are still a rarity.
  5. Open source programmers have considerably more freedom of location. Academic research is almost always tied to a particular university or lab, while many people who work on open source projects can choose to live essentially anywhere with reasonable internet access.

Do you believe in induction?

Foster Provost gave a talk at the ICML metalearning workshop on “metalearning” and the “no free lunch theorem” which seems worth summarizing.

As a review: the no free lunch theorem is the most complicated way we know of to say that a bias is required in order to learn. The simplest way to see this is in a nonprobabilistic setting. If you are given examples of the form (x,y) and you wish to predict y from x, then any prediction mechanism errs half the time in expectation over all sequences of examples. The proof of this is very simple: on every example a predictor must make some prediction, and for every sequence on which that prediction is right there is a matching sequence with the label flipped on which it is wrong, so by symmetry over the set of sequences it is wrong half the time and right half the time. The basic idea of this proof has been applied to many other settings.
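To make the symmetry concrete, here is a small sketch (my own toy illustration, not from the talk): enumerate every binary label sequence of a fixed length and run any fixed prediction rule online over each one; the average error rate over all sequences comes out to exactly one half no matter which rule you choose.

```python
import itertools

def average_error(predictor, n=10):
    """Average online error of a fixed prediction rule over all 2^n binary label sequences."""
    total_errors = 0
    for labels in itertools.product([0, 1], repeat=n):
        history = []
        for y in labels:
            total_errors += int(predictor(history) != y)
            history.append(y)
    return total_errors / (n * 2 ** n)

# Any fixed rule -- majority vote over the history, always predict 0, etc. -- averages to 0.5,
# because for every sequence where a prediction is right there is a twin sequence where it is wrong.
majority_vote = lambda history: int(sum(history) > len(history) / 2)
print(average_error(majority_vote))        # 0.5
print(average_error(lambda history: 0))    # 0.5
```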

The simplistic interpretation of this theorem which many people jump to is “machine learning is dead”, since there can be no single learning algorithm which solves all learning problems. This is the wrong way to think about it. In the real world, we do not care about the expectation over all possible sequences, but perhaps instead about some (weighted) expectation over the set of problems we actually encounter. It is entirely possible that we can form a prediction algorithm with good performance over this set of problems.

This is one of the fundamental reasons why experiments are done in machine learning. If we want to access the set of problems we actually encounter, we must do this empirically. Although we must work with the world to understand what a good general-purpose learning algorithm is, quantifying how good the algorithm is may be difficult. In particular, performing well on the last 100 encountered learning problems may say nothing about performing well on the next encountered learning problem.

This is where induction comes in. It has been noted by Hume that there is no mathematical proof that the sun will rise tomorrow which does not rely on unverifiable assumptions about the world. Nevertheless, the belief in sunrise tomorrow is essentially universal. A good general purpose learning algorithm is similar to ‘sunrise’: we can’t prove that we will succeed on the next learning problem encountered, but nevertheless we might believe it for inductive reasons. And we might be right.

SVM Adaptability

Several recent papers have shown that SVM-like optimizations can be used to handle several large families of loss functions.

This is a good thing because it is implausible that the loss function imposed by the world cannot be taken into account in the process of solving a prediction problem. Even people used to the hard-core Bayesian approach to learning often note that some approximations are almost inevitable in specifying a prior and/or integrating to achieve a posterior. Taking into account how the system will be evaluated can allow both computational effort and design effort to be focused so as to improve performance.
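As one concrete illustration of folding the evaluation loss into an SVM-style optimization, here is a toy cost-sensitive multiclass hinge loss (margin rescaling with a cost matrix) trained by subgradient descent. This is my own minimal sketch with made-up data, constants, and function names, not the formulation from any of the papers listed below.

```python
import numpy as np

def cost_sensitive_hinge_sgd(X, y, cost, epochs=20, lr=0.1, reg=1e-3, seed=0):
    """Toy subgradient descent on a cost-sensitive multiclass hinge loss (margin rescaling).

    cost[i, j] = cost of predicting class j when the true class is i (cost[i, i] = 0).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = cost.shape[0]
    W = np.zeros((k, d))
    for _ in range(epochs):
        for i in rng.permutation(n):
            scores = W @ X[i]
            # Loss-augmented inference: violations are offset by the cost of each confusion.
            viol = scores - scores[y[i]] + cost[y[i]]
            viol[y[i]] = 0.0
            j = int(np.argmax(viol))
            W *= 1.0 - lr * reg              # shrinkage from an L2 regularizer
            if viol[j] > 0:                  # subgradient step on the hinge term
                W[j] -= lr * X[i]
                W[y[i]] += lr * X[i]
    return W

# Usage on synthetic data where confusing class 0 with class 2 is especially expensive.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)   # three classes: 0, 1, 2
cost = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 1.0],
                 [4.0, 1.0, 0.0]])
W = cost_sensitive_hinge_sgd(X, y, cost)
print("training accuracy:", np.mean((X @ W.T).argmax(axis=1) == y))
```

The effect of the cost matrix is that expensive confusions must be separated by larger margins, so the optimization effort concentrates where the evaluation says it matters.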

A current laundry list of capabilities includes:

  1. 2002 multiclass SVM including arbitrary cost matrices
  2. ICML 2003 Hidden Markov Models
  3. NIPS 2003 Markov Networks (see some discussion)
  4. EMNLP 2004 Context free grammars
  5. ICML 2004 Any loss (with much computation)
  6. ICML 2005 Any constrained linear prediction model (that’s my own name for it).
  7. ICML 2005 Any loss dependent on a contingency table

I am personally interested in how this relates to the learning reductions work, which has similar goals but works at a different abstraction level (the learning problem rather than the algorithmic mechanism). The difference in abstraction implies that anything solvable by reduction should be solvable by a direct algorithmic mechanism. However, comparing and contrasting the results I know of, it seems that what is solvable via reduction to classification and what is solvable via direct SVM-like methods are currently incomparable.

  1. Can SVMs be tuned to directly solve (example dependent) cost sensitive classification? Obviously, they can be tuned indirectly via reduction, but it is easy to imagine more tractable direct optimizations.
  2. How efficiently can learning reductions be used to solve structured prediction problems? Structured prediction problems are instances of cost sensitive classification, but the regret transform efficiency which occurs when this embedding is done is too weak to be of interest.
  3. Are there any problems efficiently solvable by SVM-like algorithms which are not efficiently solvable via learning reductions?

Why Manifold-Based Dimension Reduction Techniques?

Manifold-based dimension reduction algorithms share the following general outline (a toy sketch in code follows the list).

Given: a metric d() and a set of points S

  1. Construct a graph with a node for every point and an edge connecting each node to the nodes of its k nearest neighbors. Associate with each edge a weight equal to the distance between the points at the connected nodes.
  2. Digest the graph. This might include computing the shortest path between all points or figuring out how to linearly interpolate a point from its neighbors.
  3. Find a set of points in a low dimensional space which preserve the digested properties.
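Here is a toy rendering of these three steps in code, closest in spirit to Isomap. Everything in it (the function name, parameter choices, and the spiral example) is my own illustrative sketch rather than the exact procedure of any published algorithm; it assumes a Euclidean input metric, distinct points, and a connected k-nearest-neighbor graph.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def toy_manifold_embedding(X, k=10, out_dim=2):
    """Steps 1-3 above: k-NN graph, geodesic digestion, low dimensional placement (Isomap-like)."""
    n = X.shape[0]
    # Step 1: weighted k-nearest-neighbor graph (weights = pairwise Euclidean distances).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    G = np.full((n, n), np.inf)                     # inf marks "no edge" for scipy's csgraph
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]        # skip column 0 (each point's self-distance)
    for i in range(n):
        G[i, nbrs[i]] = D[i, nbrs[i]]
    G = np.minimum(G, G.T)                          # symmetrize the neighbor relation
    # Step 2: digest the graph -- all-pairs shortest paths approximate geodesic distances.
    geo = shortest_path(G, directed=False)          # assumes the k-NN graph is connected
    # Step 3: classical MDS on the geodesic distances to place points in low dimension.
    H = np.eye(n) - np.ones((n, n)) / n             # double-centering matrix
    B = -0.5 * H @ (geo ** 2) @ H
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:out_dim]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

# Usage: a noisy spiral in 3 dimensions unrolls into 2 dimensions.
rng = np.random.default_rng(0)
t = np.linspace(0, 3 * np.pi, 400)
X = np.c_[t * np.cos(t), t * np.sin(t), 0.1 * rng.normal(size=t.size)]
Y = toy_manifold_embedding(X, k=10, out_dim=2)
print(Y.shape)   # (400, 2)
```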

Examples include LLE, Isomap (which I worked on), Hessian-LLE, SDE, and many others. The hope with these algorithms is that they can recover the low dimensional structure of point sets in high dimensional spaces. Many of them can be shown to work in interesting ways, producing various compelling pictures.

Despite doing some early work in this direction, I suffer from a motivational problem: Why do we want to recover the low dimensional structure? One answer is “for better data visualization”. This is compelling if you have data visualization problems. However, I don’t: I want to make machines that can better predict the future, which generally appears to be a sound goal of learning. Reducing the dimensionality of a dataset is not obviously helpful in accomplishing this. In fact, doing so violates one of the basic intuitions of applied learning algorithms: “avoid double approximation”. (One approximation = the projection into the low dimensional space, another approximation = the classifier learned on that space.)

Another answer is “for robots”. Several people have experimented with using a vision sensor and a dimension reduction technique in an attempt to extract the manifold of pose space. These attempts have not generally worked well, basically because the Euclidean distance on pixels is not particularly good at predicting which things are “nearby”. However, we might be able to do considerably better if we learn the distance. At the 1-bit level, we might learn a predictor from image pairs to “nearby” or “far”. Any stream S of images i1, i2, i3, …, in can be transformed into a binary problem according to:
{((ij, ik), 1 – I(j = k+1 or k = j+1)) : ij, ik in S}. In unmath: “the binary problem formed by predicting whether images are adjacent in the chain of experience”. (*) A good solution to this binary problem would give us an interesting 1-bit metric. Using regression and counting the number of transitions might provide a more conventional multibit metric.
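As a sketch of how that binary dataset might be constructed from an ordered image stream (the function name, the sampling scheme, and the balance parameter p_adjacent are my own illustrative choices, not anything from a specific system):

```python
import numpy as np

def adjacency_pairs(images, n_pairs=1000, p_adjacent=0.5, seed=0):
    """Build the binary dataset: label 0 for temporally adjacent images, 1 otherwise,
    matching the 1 - I(j = k+1 or k = j+1) convention above."""
    rng = np.random.default_rng(seed)
    n = len(images)
    X, y = [], []
    for _ in range(n_pairs):
        if rng.random() < p_adjacent:              # (*) tune this ratio to match usage
            j = rng.integers(0, n - 1)
            k = j + 1                              # adjacent pair
        else:
            j, k = rng.integers(0, n, size=2)      # random pair, usually far apart
        X.append(np.concatenate([np.ravel(images[j]), np.ravel(images[k])]))
        y.append(0 if abs(j - k) == 1 else 1)
    return np.stack(X), np.array(y)

# Usage with a fake stream of 8x8 "images"; a classifier trained on (X, y)
# plays the role of the learned 1-bit metric discussed above.
stream = np.random.default_rng(1).random((500, 8, 8))
X, y = adjacency_pairs(stream)
print(X.shape, y.mean())   # (1000, 128), label balance set by p_adjacent
```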

This metric, if learned well, has a concrete meaning: the minimum distance in terms of actuator transitions between positions. A shortest path in this space is a sequence of actuator movements leading from a position A to a position B. A projection of this space into low dimensions provides a common format which both the human and the robot can understand. Commanding the robot to go to some location is then just a matter of pointing out that location in the low dimensional projection.

This is a possible use for manifold based dimension reduction techniques which I find compelling, if it works out. (Anyone interested in playing with this should talk to Dana Wilkinson who is considering experimenting with this approach.)

(*) We probably would want to tweak the positive/negative ratio to reflect the pattern encountered in usage.
(**) Post tweaked to fix an oversight.

Apprenticeship Reinforcement Learning for Control

Pieter Abbeel presented a paper with Andrew Ng at ICML on Exploration and Apprenticeship Learning in Reinforcement Learning. The basic idea of this algorithm is:

  1. Collect data from a human controlling a machine.
  2. Build a transition model based upon the experience.
  3. Build a policy which optimizes the transition model.
  4. Evaluate the policy. If it works well, halt; otherwise add the experience into the pool and go to (2). (A toy instantiation of this loop appears below.)
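Here is a toy instantiation of the loop on a one dimensional linear system, just to make the four steps concrete. Everything in it (the dynamics constants, the hand-coded noisy “expert”, the least-squares model, the scalar LQR planner, and the stopping rule) is a made-up illustration, not the algorithm or experiments from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A_TRUE, B_TRUE, NOISE = 0.9, 0.5, 0.05            # real dynamics, unknown to the learner

def real_step(x, u):
    return A_TRUE * x + B_TRUE * u + NOISE * rng.normal()

def rollout(policy, horizon=50, x0=2.0):
    xs, us = [x0], []
    for _ in range(horizon):
        u = policy(xs[-1])
        us.append(u)
        xs.append(real_step(xs[-1], u))
    return np.array(xs), np.array(us)

def cost(xs, us, q=1.0, r=0.1):                   # how policies are judged on the real system
    return q * np.sum(xs[:-1] ** 2) + r * np.sum(us ** 2)

expert = lambda x: -0.5 * x + 0.1 * rng.normal()  # step 1: the "human" demonstrator

def fit_model(data):                              # step 2: least-squares transition model
    XU = np.array([(x, u) for x, u, _ in data])
    Xnext = np.array([xn for _, _, xn in data])
    a, b = np.linalg.lstsq(XU, Xnext, rcond=None)[0]
    return a, b

def plan(a, b, q=1.0, r=0.1):                     # step 3: scalar LQR policy for the fitted model
    p = q
    for _ in range(200):                          # iterate the Riccati recursion to a fixed point
        gain = (b * p * a) / (r + b * p * b)
        p = q + a * p * a - a * p * b * gain
    return lambda x, gain=gain: -gain * x

xs, us = rollout(expert)                          # collect demonstration data
data = list(zip(xs[:-1], us, xs[1:]))
expert_cost = cost(xs, us)
for it in range(5):
    policy = plan(*fit_model(data))
    xs, us = rollout(policy)                      # step 4: evaluate on the real system
    print(f"iteration {it}: cost {cost(xs, us):.2f} vs expert {expert_cost:.2f}")
    if cost(xs, us) <= expert_cost:
        break                                     # performance matches the expert: halt
    data += list(zip(xs[:-1], us, xs[1:]))        # otherwise add the experience and go to step 2
```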

The paper proves that this technique will converge to some policy with expected performance near human expected performance, assuming the world satisfies certain assumptions (MDP or linear dynamics).

This general idea of apprenticeship learning (i.e. incorporating data from an expert) seems very compelling because (a) humans often learn this way and (b) much harder problems can be solved. For (a), the notion of teaching is about transferring knowledge from an expert to novices, often via demonstration. To see (b), note that we can create intricate reinforcement learning problems where a particular sequence of actions must be taken to achieve a goal. A novice might be able to memorize this sequence given just one demonstration even though it would require experience exponential in the length of the sequence to discover the key sequence accidentally.

Andrew Ng’s group has exploited this to make this very fun picture.
(Yeah, that’s a helicopter flying upside down, under computer control.)

As for this particular paper, one question occurs to me. There is a general principle of learning which says we should avoid “double approximation”, such as occurs in step (3), where we build an approximate policy on an approximate model. Is there a way to fuse steps (2) and (3) to achieve faster or better learning?