Videolectures.net

Davor has been working to set up videolectures.net, the new site for the many lectures mentioned here. (Tragically, they seem to be available only in Windows Media format.) I went through my own projects and added a few links to the videos. The day when every result is a set of {paper, slides, video} isn’t quite here yet, but it’s within sight. (For many papers, of course, code is a fourth component.)

What to do with an unreasonable conditional accept

Last year about this time, we received a conditional accept for the Searn paper, which asked us to reference a paper that was not reasonable to cite because there was strictly more relevant work by the same authors which we had already cited. We wrote a response explaining this and didn’t cite it in the final draft, giving the SPC an excuse to reject the paper and leading to unhappiness for all.

Later, Sanjoy Dasgupta suggested that an alternative was to talk to the PC chair instead, as soon as you see that a conditional accept is unreasonable. William Cohen and I spoke about this by email, the relevant bit of which is:

If an SPC asks for a revision that is inappropriate, the correct action is to contact the chairs as soon as the decision is made, clearly explaining what the problem is, so we can decide whether or not to over-rule the SPC. As you say, this is extra work for us chairs, but that’s part of the job, and we’re willing to do that sort of work to improve the overall quality of the reviewing process and the conference.

In short, Sanjoy was right.

At the time, I operated under the belief that the PC chair’s job was simply too demanding to bother with something like this, but that was wrong. William invited me to post this, and I hope we all learn a little bit from it. Obviously, this recourse should only be used if there is a real flaw in the conditions of a conditionally accepted paper.

Contextual Scaling

Machine learning has a new kind of “scaling to larger problems” to worry about: scaling with the amount of contextual information. The standard development path for a machine learning application in practice seems to be the following:

  1. Marginal. In the beginning, there was “majority vote”. At this stage, it isn’t necessary to understand that you have a prediction problem. People just realize that one answer is right sometimes and another answer other times. In machine learning terms, this corresponds to making a prediction without side information.
  2. First context. A clever person realizes that some bit of information x1 could be helpful. If x1 is discrete, they condition on it and make a predictor h(x1), typically by counting. If they are clever, then they also do some smoothing. If x1 is some real-valued parameter, it’s very common to make a threshold cutoff. Often, these tasks are simply done by hand. (A minimal sketch of this stage appears just after this list.)
  3. Second context. Another clever person (or perhaps the same one) realizes that some other bit of information x2 could be helpful. As long as the space of (x1, x2) remains small and discrete, they continue to form a predictor by counting. When (x1, x2) are real-valued, the space remains visualizable, so a hand-crafted decision boundary works fine.
  4. The previous step repeats for information x3,…,x100. It’s no longer possible to visualize the data, but a human can still function as a learning algorithm, carefully tweaking parameters and testing with the right software support to learn h(x1,…,x100). Graphical models can sometimes help scale up counting-based approaches. Overfitting becomes a very serious issue. The “human learning algorithm” approach starts breaking down, because it becomes hard to integrate new information sources in the context of all the others.
  5. Automation. People realize “we must automate this process of including new information to keep up”, and a learning algorithm is adopted. The precise choice depends on the characteristics of the learning problem (How many examples are there at training time? Is this online or batch? How fast must it be at test time?) and the familiarity of the people involved. This can be a real breakthrough—automation can greatly ease the inclusion of new information, and sometimes it even improves results on the original, limited information.
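
As a minimal sketch of the counting-with-smoothing stage described in steps 2 and 3 (the names and data here are invented purely for illustration):

```python
from collections import defaultdict

def train_counting_predictor(samples, alpha=1.0):
    """samples: iterable of (x1, y) pairs with y in {0, 1}.
    Returns a predictor h(x1) built by smoothed counting."""
    counts = defaultdict(lambda: [alpha, alpha])  # smoothed [y=0, y=1] counts per x1 value
    for x1, y in samples:
        counts[x1][y] += 1

    def h(x1):
        neg, pos = counts.get(x1, [alpha, alpha])  # unseen x1 falls back to the prior
        return pos / (neg + pos)  # smoothed estimate of P(y = 1 | x1)

    return h

# Example: predicting a binary outcome from a single discrete feature.
data = [("winter", 1), ("winter", 1), ("winter", 0), ("summer", 0)]
h = train_counting_predictor(data)
print(h("winter"))  # 3/5 = 0.6 with alpha = 1
print(h("autumn"))  # 0.5: no counts yet, so only the smoothing speaks
```

Scaling this to (x1, x2, …) by enlarging the table key is exactly what breaks down around step 4: the table grows exponentially while the counts per cell shrink.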

Understanding the process of contextual scaling seems particularly helpful for teaching about machine learning. It’s often the case that the switch to the last step could and should have happened before the 100th bit of information was integrated.

We can also judge learning algorithms by their ease of contextual scaling. In order from least to most easily scaled, we might have:

  1. Counting-based approaches. The number of examples required is generally exponential in the number of features.
  2. Counting-based approaches with smoothing. Still exponential, but with saner defaults.
  3. Counting-based approaches with smoothing and some prior language (graphical models, Bayes nets, etc…). The number of examples required is no longer exponential, but can still be intractably large. Prior specification from a human is required.
  4. Prior-based systems (many Bayesian learning algorithms). No particular number of examples is required, but sane prior specification from a human may be required.
  5. Similarity-based systems (nearest neighbor, kernel-based algorithms). A similarity measure is a weaker form of prior information which can be substantially easier to specify.
  6. Highly automated approaches. “Just throw the new information in as a feature and let the learning algorithms sort it out”.

At each step in this order, less effort is required to integrate new information.
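
To make the two ends of this ordering concrete, here is a hedged sketch (invented names, pure illustration): an online linear learner in the spirit of item 6, where integrating a new information source is just one more input column, versus a counting table whose size doubles with each new binary feature.

```python
def train_perceptron(samples, n_features, epochs=10):
    """samples: list of (x, y) with x a list of n_features floats and y in {-1, +1}."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:  # mistake-driven perceptron update
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

# Adding information source x3 is one extra column per example...
old_data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]
new_data = [(x + [1.0], y) for x, y in old_data]  # append the new feature x3
w = train_perceptron(new_data, n_features=3)

# ...while a counting table over 30 binary features would need
# 2**30 (about a billion) cells, nearly all of them empty.
```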

Designing a learning algorithm which is useful at many different contextual scales is fundamentally challenging. Obviously, when specific prior information is available, we want to incorporate it. Equally obviously, when specific prior information is not available, we want to be able to take advantage of whatever new information happens to be useful. And when we have so much information that counting could work, a learning algorithm should behave similarly to counting (with smoothing).

Alternative Machine Learning Reductions Definitions

A type of prediction problem is specified by the type of samples produced by a data source (examples: $X \times \{0,1\}$, $X \times [0,1]$, $X \times \{1,2,3,4,5\}$, etc…) and a loss function (0/1 loss, squared error loss, cost-sensitive losses, etc…). For simplicity, we’ll assume that all losses have a minimum of zero.

For this post, we can think of a learning reduction as

  1. A mapping R from samples of one type T (like multiclass classification) to another type T’ (like binary classification).
  2. A mapping Q from predictors for type T’ to predictors for type T.
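
To make this concrete, here is the familiar one-against-all reduction from k-class classification (type T) to binary classification (type T’) written in the (R, Q) form above; the code and names are illustrative rather than canonical:

```python
def R(sample, k):
    """Map one type-T sample (x, y), with y in {0, ..., k-1}, to k binary samples."""
    x, y = sample
    return [((x, i), 1 if y == i else 0) for i in range(k)]

def Q(b, k):
    """Map a binary predictor b for the induced problem to a k-class predictor."""
    def h(x):
        # Predict the class whose binary subproblem b judges most positive.
        return max(range(k), key=lambda i: b((x, i)))
    return h

# Usage: train any binary learner on the union of R(sample, k) over the
# training set, then predict with h = Q(b, k).
```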

The simplest sort of learning reduction is a “loss reduction”. The idea in a loss reduction is to prove a statement of the form:
Theorem. For all base predictors b, for all distributions D over examples of type T:

$$E_{(x,y) \sim D}\, L_T(y, Q(b,x)) \;\le\; f\!\left(E_{(x',y') \sim R(D)}\, L_{T'}(y', b(x'))\right)$$

Here $L_T$ is the loss for the type T problem and $L_{T'}$ is the loss for the type T’ problem. Also, R(D) is the distribution over samples induced by first drawing from D and then mapping the sample via R. The function f() is the loss transform function—we try to find reductions R,Q which minimize its value.
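
As a concrete instance (quoting the constant from memory, so treat it as indicative rather than definitive): the classical one-against-all reduction from k-class classification to binary classification admits a loss transform of the form

$$f(e) = (k-1)\,e,$$

so a binary error rate of e costs at most (k-1)e in multiclass 0/1 loss. Reductions with slowly growing f are the ones we want.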

If R,Q are deterministic, then there always exists a choice of D,b such that the loss rate on the right-hand side is 0. However, it’s common to encounter real-world learning problems D which are inherently noisy, implying that the induced problem D’ is often inherently noisy. Distinguishing between errors due to environmental noise and errors due to base predictor mistakes seems important (and experimentally, it has been). Regret transform reductions can get at this. They have theorems of the form:
Theorem. For all base predictors b, for all distributions D over examples of type T:

$$E_{(x,y) \sim D}\, L_T(y, Q(b,x)) - \min_{c}\, E_{(x,y) \sim D}\, L_T(y, c(x)) \;\le\; f\!\left(E_{(x',y') \sim R(D)}\, L_{T'}(y', b(x')) - \min_{b'}\, E_{(x',y') \sim R(D)}\, L_{T'}(y', b'(x'))\right)$$

The essential idea in regret transform reductions is that we subtract off the inherent noise in both the induced and original problem, and bound the excess loss due to suboptimal prediction directly.
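
A small worked example with made-up numbers may make the difference concrete. Suppose the induced problem is noisy enough that no binary predictor can achieve loss below 0.10, and our base predictor b achieves 0.15. A regret transform charges the reduction only for the excess:

$$0.15 - 0.10 = 0.05, \qquad \text{so the regret on the original problem is at most } f(0.05),$$

whereas a loss transform would bound the original loss by f(0.15), penalizing the reduction for noise that no base predictor could remove.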

The skeletons of the theory for these families of reductions have been laid out at this point. There remain some open problems, but another interesting direction to consider is other families of reductions. The hope is that by placing more stringent requirements on reductions, we limit ourselves to algorithms which tend to perform better in practice. This hope is pretty reasonable—empirically, we have observed a consistent step up in performance going from loss transform to regret transform reductions.

  1. Limited Regret Transform Reductions. The fact that the minimum is taken over all predictors in regret transforms is counterintuitive to some people, who are used to “Empirical Risk Minimization” statements where a minimum is taken over a limited set of predictors. We could imagine theorem statements of the form:
    Theorem. For all sets of base predictors B, for all base predictors b, for all distributions D over examples of type T:
    $$E_{(x,y) \sim D}\, L_T(y, Q(b,x)) - \min_{b' \in B}\, E_{(x,y) \sim D}\, L_T(y, Q(b',x)) \;\le\; f\!\left(E_{(x',y') \sim R(D)}\, L_{T'}(y', b(x')) - \min_{b' \in B}\, E_{(x',y') \sim R(D)}\, L_{T'}(y', b'(x'))\right)$$

    This is a more general statement than a regret transform reduction—when B is the set of all base predictors, we recover standard regret transforms.
    One case where it’s easy to see that this kind of statement holds is the reduction from importance weighted binary classification to binary classification (a sketch of that reduction appears after this list). However, little more is currently known.
  2. Reversible Reductions. This is an idea which Russell Impagliazzo first mentioned to me. Essentially, we limit ourselves to reductions with the property that they are reversible. Reversibility can be tested by mapping from one problem to another, and then back. There are several variant theorem statements we could imagine. The most tractable variant for analysis might be the following:
    Theorem. There exist $R^{-1}, Q^{-1}$ such that for all base predictors b, for all base learning problems D’:
    $$E_{(x',y') \sim D'}\, L_{T'}(y', b(x')) = E_{(x',y') \sim R(R^{-1}(D'))}\, L_{T'}(y', b(x'))$$

    and $Q^{-1}(Q(b)) = b$.

    Closely related (but different) is the following:
    Theorem. There exist $R^{-1}, Q^{-1}$ such that for all type T predictors h, for all type T distributions D:
    $$E_{(x,y) \sim D}\, L_T(y, h(x)) = E_{(x,y) \sim R^{-1}(R(D))}\, L_T(y, h(x))$$

    and $Q(Q^{-1}(h)) = h$.
  3. Bayesian Reductions. This is an idea which Simon Osindero mentioned. The basic observation is that Bayes’ Law is pretty important to the process of learning. We would like it to be the case that Bayes’ Law and reductions compose. A theorem statement of the following form might be about right:
    Theorem. For some large family of priors P over distributions D of type T:
    $$\mathrm{Bayes}(P, (x,y) \sim D \sim P) = Q\!\left(\mathrm{Bayes}(R(P), (x',y') \sim D' \sim R(P))\right)$$

    Here “Bayes” is a learning algorithm which takes as input a prior P (or R(P)), and a sample (x,y) drawn by first drawing a D from P and then drawing from D (and similarly for the induced problem). Also, R(P) is the prior induced by mapping D to R(D) after drawing from P.
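
As promised in item 1 above, here is a sketch of the reduction from importance weighted binary classification to binary classification, implemented via rejection sampling on the weights (one known way to do it; the code and names are illustrative):

```python
import random

def R(weighted_samples, w_max):
    """Map importance-weighted samples (x, y, w), with 0 <= w <= w_max,
    to plain binary samples by keeping each with probability w / w_max."""
    return [(x, y) for x, y, w in weighted_samples
            if random.random() < w / w_max]

def Q(b):
    # Q is the identity: a predictor that does well on the induced
    # rejection-sampled distribution also does well on the weighted problem.
    return b
```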

The two missing components for these kinds of reductions are:

  1. Theoretical evidence that we can satisfy these definitions of reduction between interesting types of learning problems.
  2. Empirical evidence that algorithmic modifications driven by the theory are useful.

My experience is that analyzing reductions has yielded significant insight into how to solve learning problems, so I would encourage anyone with a bit of theoretical inclination in Machine Learning to consider the above (or other) families of reductions.