All Models of Learning have Flaws

Attempts to abstract and study machine learning take place within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve outlined (below) the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.

  1. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing, and particularly important for students.
  2. Algorithms which conform to multiple approaches can have substantial value. “I don’t really understand it yet, because I only understand it one way”. Reinterpretation alone is not the goal—we want algorithmic guidance.
  3. We need to remain constantly open to new mathematical models of machine learning. It’s common to forget the flaws of the model you are most familiar with while evaluating other models, so the flaws of new models get exaggerated. The best way to avoid this is simply education.
  4. The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary of what is wrong with various frameworks for learning. To avoid being entirely negative, I included what’s right about each as well.

Bayesian Learning
  Methodology: You specify a prior probability distribution over data-makers, P(datamaker), then use Bayes’ law to find a posterior P(datamaker|x). True Bayesians integrate over the posterior to make predictions, while many simply use the world with the largest posterior directly.
  What’s right: Handles the small-data limit. Very flexible. Interpolates to engineering.
  What’s wrong:
    1. Information-theoretically problematic. Explicitly specifying a reasonable prior is often hard.
    2. Computationally difficult problems are commonly encountered.
    3. Human intensive. Partly due to the difficulties above, and partly because “first specify a prior” is built into the framework, this approach is not very automatable.

Graphical/Generative Models
  Methodology: Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed- or varying-length data, and are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist.
  What’s right: Relative to pure Bayesian systems, this approach is sometimes computationally tractable. More importantly, the graph language is natural, which aids prior elicitation.
  What’s wrong:
    1. Often (still) fails to fix the problems of the Bayesian approach.
    2. In real-world applications, true conditional independence is rare, and results degrade rapidly with systematic misspecification of conditional independence.

Convex Loss Optimization
  Methodology: Specify a loss function, related to the world-imposed loss function, which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optimum. (A sketch of this approach appears after this list.)
  What’s right: Mathematically clean solutions where computational tractability is partly taken into account. Relatively automatable.
  What’s wrong:
    1. The temptation to forget that the world imposes nonconvex loss functions is sometimes overwhelming, and the mismatch is always dangerous.
    2. Limited models. Although switching to a convex loss means that some optimizations become convex, optimization on representations which aren’t single-layer linear combinations is often difficult.

Gradient Descent
  Methodology: Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters.
  What’s right: Relatively computationally tractable due to (a) the modularity of gradient descent and (b) directly optimizing the quantity you want to predict.
  What’s wrong:
    1. Finicky. There are issues with parameter initialization, step size, and representation. It helps a great deal to have accumulated experience using this sort of system, and there is little theoretical guidance.
    2. Overfitting is a significant issue.

Kernel-based Learning
  Methodology: You choose a kernel K(x,x’) between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning.
  What’s right: People often find the specification of a similarity function between objects a natural way to incorporate prior information for machine learning problems. Algorithms (like SVMs) for training are reasonably practical (O(n^2), for instance).
  What’s wrong: Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n^2) is not efficient enough when there is much data.

Boosting
  Methodology: You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor.
  What’s right: A focus on getting something that works quickly is natural. This approach is relatively automated and (hence) easy for beginners to apply.
  What’s wrong: The boosting framework tells you nothing about how to build the initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.

Online Learning with Experts
  Methodology: You make many base predictors, and then a master algorithm automatically switches between them so as to minimize regret. (A sketch of the exponentiated-weights version appears after this list.)
  What’s right: This is an effective automated method to extract performance from a pool of predictors.
  What’s wrong: Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.

Learning Reductions
  Methodology: You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner.
  What’s right: The reductions approach can yield highly automated learning algorithms.
  What’s wrong: The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.

PAC Learning
  Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner.
  What’s right: The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute.
  What’s wrong: There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways.

Statistical Learning Theory
  Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses.
  What’s right: There are substantially more positive results than for PAC Learning, and there are a few examples of practical algorithms directly motivated by this analysis.
  What’s wrong: The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).

Decision Tree Learning
  Methodology: Learning is a process of cutting up the input space and assigning predictions to pieces of the space.
  What’s right: Decision tree algorithms are well automated and can be quite fast.
  What’s wrong: There are learning problems which cannot be solved by decision trees but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.

Algorithmic Complexity
  Methodology: Learning is about finding a program which correctly predicts the outputs given the inputs.
  What’s right: Any reasonable problem is learnable with a number of samples related to the description length of the program.
  What’s wrong: The theory literally suggests solving halting problems to solve machine learning.

RL, MDP Learning
  Methodology: Learning is about finding and acting according to a near-optimal policy in an unknown Markov Decision Process.
  What’s right: We can learn and act with an amount of summed regret related to O(SA), where S is the number of states and A is the number of actions per state.
  What’s wrong: Has anyone counted the number of states in real-world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.

RL, POMDP Learning
  Methodology: Learning is about finding and acting according to a near-optimal policy in a Partially Observed Markov Decision Process.
  What’s right: In a sense, we’ve made no assumptions, so algorithms have wide applicability.
  What’s wrong: All known algorithms scale badly with the number of hidden states.
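
To make the Convex Loss Optimization and Gradient Descent entries concrete, here is a minimal sketch in Python (not from the original list; it assumes NumPy, synthetic data, and a hand-picked step size): a linear predictor is trained by batch gradient descent on the convex logistic surrogate, then scored on the nonconvex 0/1 loss the world actually imposes.

    # Minimal sketch: batch gradient descent on the convex logistic surrogate loss
    # for a linear predictor, then evaluation on the nonconvex 0/1 loss.
    # Synthetic data; the step size and iteration count are hand-picked, not tuned.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 5
    X = rng.normal(size=(n, d))
    y = (X @ rng.normal(size=d) > 0).astype(float)    # labels in {0, 1}

    w = np.zeros(d)
    step = 0.5
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))            # predicted probabilities
        w -= step * (X.T @ (p - y)) / n               # gradient of mean logistic loss

    print("training 0/1 error:", np.mean(((X @ w) > 0) != y))

The last line is the point of the corresponding "what's wrong" entries: the quantity being optimized (logistic loss) is only a convex stand-in for the 0/1 error we actually care about.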
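
The Online Learning with Experts entry also has a compact concrete instance: an exponentiated-weights master algorithm. The sketch below is illustrative rather than canonical; the number of experts, their simulated accuracies, and the learning rate eta are assumptions made for the example.

    # Minimal sketch of an exponentiated-weights master algorithm: keep a weight per
    # expert, predict with the weighted majority, and shrink the weight of any expert
    # that errs. Experts are simulated as noisy predictors of a random binary label.
    import numpy as np

    rng = np.random.default_rng(1)
    T, K = 1000, 4                                    # rounds, number of experts
    expert_acc = np.array([0.6, 0.7, 0.55, 0.9])      # hypothetical expert accuracies
    eta = np.sqrt(2 * np.log(K) / T)                  # standard learning-rate choice

    w = np.ones(K)
    master_mistakes = 0
    expert_mistakes = np.zeros(K)
    for _ in range(T):
        truth = rng.integers(2)
        preds = np.where(rng.random(K) < expert_acc, truth, 1 - truth)
        master = int((w / w.sum()) @ preds > 0.5)     # weighted-majority prediction
        master_mistakes += int(master != truth)
        expert_mistakes += (preds != truth)
        w *= np.exp(-eta * (preds != truth))          # downweight experts that erred

    print("master mistakes:", master_mistakes)
    print("best expert mistakes:", int(expert_mistakes.min()))

The master's mistake count tracks the best expert's up to an additive regret term, but nothing here says how to construct good experts, which is exactly the complaint in the "what's wrong" entry.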

This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)

The Forgetting

How many papers do you remember from 2006? 2005? 2002? 1997? 1987? 1967? One way to judge this would be to look at the citations of the papers you write—how many came from which year? For myself, the answers on recent papers are:

year   2006  2005  2002  1997  1987  1967
count     4    10     5     1     0     0

This spectrum is fairly typical of papers in general. There are many reasons that citations are focused on recent papers.

  1. The number of papers being published continues to grow. This is not a very significant effect, because the rate of publication has not grown nearly as fast as the citation counts fall off.
  2. Dead men don’t reject your papers for not citing them. This reason seems lame, because it’s a distortion from the ideal of science. Nevertheless, it must be stated because the effect can be significant.
  3. In 1997, I started as a PhD student. Naturally, papers after 1997 are better remembered because they were absorbed in real time. A large fraction of people writing papers and attending conferences haven’t been doing it for 10 years.
  4. Old papers aren’t on the internet. This is a huge effect for any papers prior to 1995 (or so). The ease of examining a paper greatly influences the ability of an author to read and understand it. There are a number of journals which essentially have “internet access for the privileged elite who are willing to pay”. In my experience, this is only marginally better than having them stuck in the library.
  5. The recent past is more relevant to the present than the far past. There is a lot of truth in this: people discover and promote various problems or techniques which take off for a while, until their turn to be forgotten arrives.

Should we be disturbed by this forgetting? There are a few good effects. For example, when people forget, they reinvent, and sometimes they reinvent better. Nevertheless, it seems like the effect of forgetting is bad overall, because it causes wasted effort. There are two implications:

  1. For paper writers, it is very common to overestimate the value of a paper, even though we know that the impact of most papers is bounded in time. Perhaps by looking at those older papers, we can get an idea of what is important in the long term. For example, looking at my own older citations, simplicity stands out. If you want a paper to have a long-term impact, it needs to have a simple algorithm, analysis method, or setting. Fundamentally, only those things which are teachable survive. Was your last paper simple? Could you teach it in a class? Are other people going to start doing so? Are the review criteria promoting the papers which have a hope of survival?
  2. For conference organizers, it’s important to understand the way science has changed. Originally, you had to be a giant to succeed at science. Then, you merely had to stand on the shoulders of giants to succeed. Now, it seems that even the ability to peer over the shoulders of people standing on the shoulders of giants might be helpful. This is generally a good thing, because it means more people can help on a very hard task. Nevertheless, it seems that much of this effort is getting wasted in forgetting, because we do not have the right mechanisms to remember the information. Which is going to be the first conference to switch away from an ordered list of papers to something with structure? Wouldn’t it be great if all the content at a conference was organized in a wikipedia-like easy-for-outsiders-to-understand style?

Best Practices for Collaboration

Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote people, can be tricky. Here are some observations of what has worked for me on collaborations involving a few people.

  1. Travel and Discuss. Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet.
  2. Enable your collaborator. A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think.
    1. Avoid Duplication. Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this is to be explicit about who has write permission to what. Most of the time, a write lock is held for the entire document, just to be sure.
    2. Don’t keep the write lock unnecessarily. Some people are perfectionists, so they have a real problem giving up the write lock on a draft until it is perfect. This prevents other collaborators from doing things. Releasing the write lock (at least) when you sleep is a good idea.
    3. Send all necessary materials. Some people try to save space or bandwidth by not passing ‘.bib’ files or other auxiliary components. Forcing your collaborator to deal with the missing-subdocument problem is disabling. Space and bandwidth are cheap while your collaborator’s time is precious. (Sending may be pass-by-reference rather than attach-to-message in most cases.)
    4. Version Control. This doesn’t mean “use version control software”, although that’s fine. Instead, it means: have a version number for drafts passed back and forth. This means you can talk about “draft 3” rather than “the draft that was passed last Tuesday”. Coupled with “send all necessary materials”, this implies that you naturally back up previous work.
  3. Be Generous. It’s common for people to feel insecure about what they have done or how much “credit” they should get.
    1. Coauthor standing. When deciding who should have a chance to be a coauthor, the rule should be “anyone who has helped produce a result conditioned on previous work”. “Helped produce” is often interpreted too narrowly—a theoretician should be generous about crediting experimental results and vice-versa. Potential coauthors may decline (and senior ones often do so). Control over who is a coauthor is best (and most naturally) exercised by the choice of who you talk to.
    2. Author ordering. Author ordering is the wrong thing to worry about, so don’t. The CS theory community has a substantial advantage here because they default to alpha-by-author ordering, as is understood by everyone.
    3. Who presents. A good default for presentations at a conference is “student presents” (or suitable generalizations). This gives young people a real chance to get involved and learn how things are done. Senior collaborators already have plentiful alternative methods to present research at workshops or invited talks.
  4. Communicate by default. Not cc’ing a collaborator is a bad idea. Even if you have a very specific question for one collaborator and not another, it’s a good idea to cc everyone. In the worst case, this is a few-second annoyance for the other collaborator. In the best case, the exchange answers unasked questions. It also prevents the “conversation shifts into subjects interesting to everyone, but oops! you weren’t cc’ed” problem.

These practices are imperfectly followed even by me, but they are a good ideal to strive for.

Parallel Machine Learning Problems

Parallel machine learning is a subject rarely addressed at machine learning conferences. Nevertheless, it seems likely to increase in importance because:

  1. Data set sizes appear to be growing substantially faster than computation. Essentially, this happens because more and more sensors of various sorts are being hooked up to the internet.
  2. Serial speedups of processors appear to have stalled. The new trend is to make processors more powerful by making them multicore.
    1. Both AMD and Intel are making dual core designs standard, with plans for more parallelism in the future.
    2. IBM’s Cell processor has (essentially) 9 cores.
    3. Modern graphics chips can have an order of magnitude more separate execution units.

    The meaning of ‘core’ varies a bit from processor to processor, but the overall trend seems quite clear.

So, how do we parallelize machine learning algorithms?

  1. The simplest and most common technique is to simply run the same learning algorithm with different parameters on different processors. Cluster management software like OpenMosix, Condor, or Torque is helpful here. This approach doesn’t speed up any individual run of a learning algorithm.
  2. The next simplest technique is to decompose a learning algorithm into an adaptive sequence of statistical queries and parallelize the queries over the sample. This paper (updated from the term paper according to a comment) points out that statistical integration can be implemented via MapReduce, which Google popularized (the open source version is Hadoop). The general technique of parallel statistical integration is already used by many people, including IBM’s Parallel Machine Learning Toolbox. The disadvantage of this approach is that it is mostly good at speeding up slow algorithms. One example of a fast sequential algorithm is the perceptron: it works on a per-example basis, making individual updates extremely fast, and it is explicitly not a statistical query algorithm. (A minimal sketch of this style of parallelism appears after this list.)
  3. The most difficult method for speeding up an algorithm is fine-grained structural parallelism. The brain works in this way: each individual neuron operates on its own. The reason this is difficult is that the programming is particularly tricky: you must carefully optimize to avoid latency in network communication. The full complexity of parallel programming is exposed.
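
To illustrate the statistical-query style of parallelism from item (2), here is a minimal sketch. It is not the method of the cited paper, just an illustration under simple assumptions: synthetic data, the standard-library multiprocessing module, and a logistic loss whose gradient is a sum over examples, so each worker can compute its shard’s contribution and the master combines them, MapReduce-style.

    # Minimal sketch of data-parallel statistical queries: each worker computes the
    # logistic-loss gradient over its own shard of the data, and the master sums the
    # per-shard gradients to take one batch step. Shard count, step size, and
    # iteration count are illustrative assumptions.
    import numpy as np
    from multiprocessing import Pool

    def shard_gradient(args):
        X, y, w = args
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        return X.T @ (p - y)                          # un-normalized shard gradient

    def parallel_logistic_fit(X, y, workers=4, steps=100, step=0.5):
        w = np.zeros(X.shape[1])
        shards = list(zip(np.array_split(X, workers), np.array_split(y, workers)))
        with Pool(workers) as pool:
            for _ in range(steps):
                grads = pool.map(shard_gradient, [(Xs, ys, w) for Xs, ys in shards])
                w -= step * sum(grads) / len(y)       # combine the shard statistics
        return w

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 10))
        y = (X @ rng.normal(size=10) > 0).astype(float)
        w = parallel_logistic_fit(X, y)
        print("training 0/1 error:", np.mean((X @ w > 0) != y))

The shape of the computation (map a statistic over shards, reduce by summation) is what makes the MapReduce framing natural; a per-example algorithm like the perceptron does not decompose this way.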

A basic question is: are there other approaches to speeding up learning algorithms which don’t incur the full complexity of option (3)? One approach is discussed here.

The Machine Learning Department

Carnegie Mellon School of Computer Science has the first academic Machine Learning department. This department already existed as the Center for Automated Learning and Discovery, but recently changed its name.

The reason for changing the name is obvious: very few people think of themselves as “Automated Learners and Discoverers”, but there are a number of people who think of themselves as “Machine Learners”. Machine learning is both more succinct and more recognizable, which are good properties for a name.

A more interesting question is “Should there be a Machine Learning Department?”. Tom Mitchell has a relevant whitepaper claiming that machine learning is answering a different question than other fields or departments. The fundamental debate here is “Is machine learning different from statistics?”

At a cultural level, there is no real debate: they are different. Machine learning is characterized by several very active large peer reviewed conferences, operating in a computer science mode. Statistics tends to function with a greater emphasis on journals and a lesser emphasis on conferences which often implies a much longer publishing cycle.

In terms of the basic questions driving the field, the answer seems less clear. It is true that the core problems of statistics in the past have typically differed from the core problems of machine learning today. Yet, there has been some substantial overlap, and there are a number of statisticians nowadays that are actively doing machine learning. It’s reasonably plausible that in the long term statistics departments will adopt the core problems of machine learning, removing the reasons for a separate machine learning department.

The parallel question for computer science comes up less often perhaps because computer science is a notoriously broad field.

The practical implication of a new department is the ability to create a more specific curriculum, admit more specific students, and hire faculty based upon more specific interests. Compared to a computer science program, classes on programming languages, computer architecture, or graphics might be dropped in favor of classes on learning theory, statistics, etc. Compared to a statistics program, classes on advanced parameter estimation and measure theory might be dropped in favor of algorithms and programming experience.

An alternative solution like “learn everything from computer science and statistics” is personally appealing to me, and I have benefited from and recommend a broad education. However, this is not practical for everyone. In my experience, a machine learning skill set is an effective specialization with which people can do important things in the world. Given this, having a department with a machine-learning-centered curriculum seems like a good idea. At Carnegie Mellon, this is the Machine Learning department. In the future and elsewhere it may have a different name, but the value of the machine learning skill set should grow with research, improving computers, and improving data sources.