# Machine Learning (Theory)

## 3/15/2007

### Alternative Machine Learning Reductions Definitions

A type of prediction problem is specified by the type of samples produced by a data source (examples: X × {0,1}, X × [0,1], X × {1,2,3,4,5}, etc.) and a loss function (0/1 loss, squared error loss, cost sensitive losses, etc.). For simplicity, we’ll assume that all losses have a minimum of zero.

For this post, we can think of a learning reduction as

1. A mapping R from samples of one type T (like multiclass classification) to another type T’ (like binary classification).
2. A mapping Q from predictors for type T’ to predictors for type T.

The simplest sort of learning reduction is a “loss reduction”. The idea in a loss reduction is to prove a statement of the form:
**Theorem.** For all base predictors b, for all distributions D over examples of type T:

$$E_{(x,y) \sim D}\, L_T(y, Q(b,x)) \le f\left(E_{(x',y') \sim R(D)}\, L_{T'}(y', b(x'))\right)$$

Here $L_T$ is the loss for the type T problem and $L_{T'}$ is the loss for the type T’ problem. Also, R(D) is the distribution over samples induced by first drawing from D and then mapping the sample via R. The function f() is the loss transform function; we try to find reductions R,Q which minimize its value.
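As a concrete illustration (not taken from the post), here is a minimal Python sketch of a one-against-all style reduction from K-class classification to binary classification. The names `K`, `make_noisy_base`, the toy source `y = x mod K`, and the deterministic tie-breaking rule are all assumptions made for the example. With this tie-breaking, a multiclass mistake requires at least one of the K induced binary predictions to be wrong, so the statement above holds with f(ε) = K·ε; other tie-breaking rules can give different f.

```python
import random

K = 4  # number of classes (labels 0..K-1); an arbitrary choice for the demo

def R(sample):
    """Map one multiclass sample (x, y) to K binary samples ((x, i), 1[y == i]).

    R(D) is then: draw (x, y) from D and pick one of the K outputs uniformly.
    """
    x, y = sample
    return [((x, i), int(y == i)) for i in range(K)]

def Q(b):
    """Turn a binary predictor b into a multiclass predictor."""
    def h(x):
        votes = [b((x, i)) for i in range(K)]
        # Deterministic tie-breaking (an assumption): first label claimed by b,
        # falling back to label 0 when no label is claimed.
        return votes.index(1) if 1 in votes else 0
    return h

def multiclass_loss(h, samples):
    """Empirical 0/1 loss on the original problem."""
    return sum(h(x) != y for x, y in samples) / len(samples)

def binary_loss(b, samples):
    """Empirical 0/1 loss on the induced problem R(D)."""
    induced = [pair for s in samples for pair in R(s)]
    return sum(b(xp) != yp for xp, yp in induced) / len(induced)

def make_noisy_base(true_label, flip_prob, rng):
    """Hypothetical base predictor: answers 'is the label i?' with random mistakes."""
    cache = {}
    def b(xp):
        if xp not in cache:
            x, i = xp
            correct = int(true_label(x) == i)
            cache[xp] = 1 - correct if rng.random() < flip_prob else correct
        return cache[xp]
    return b

rng = random.Random(0)
samples = [(x, x % K) for x in range(2000)]  # noiseless toy source: y = x mod K
b = make_noisy_base(lambda x: x % K, 0.05, rng)

lhs = multiclass_loss(Q(b), samples)
rhs = K * binary_loss(b, samples)
print(f"multiclass loss {lhs:.4f} <= K * binary loss {rhs:.4f}")
```

The inequality here holds on every sample set, not just in expectation, because it follows from counting errors point by point.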

If R,Q are deterministic, then there always exists a choice of D,b such that the loss rate on the right hand side is 0. However, it’s common to encounter real-world learning problems D which are inherently noisy, implying that the induced problem D’ is often inherently noisy. Distinguishing between errors due to environmental noise and errors due to base predictor mistakes seems important (and experimentally, it has been). Regret transform reductions can get at this. They have theorems of the form:
**Theorem.** For all base predictors b, for all distributions D over examples of type T:

$$E_{(x,y) \sim D}\, L_T(y, Q(b,x)) - \min_c E_{(x,y) \sim D}\, L_T(y, c(x)) \le f\left(E_{(x',y') \sim R(D)}\, L_{T'}(y', b(x')) - \min_{b'} E_{(x',y') \sim R(D)}\, L_{T'}(y', b'(x'))\right)$$

The essential idea in regret transform reductions is that we subtract off the inherent noise in both the induced and original problem, and bound the excess loss due to suboptimal prediction directly.
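To make "subtracting off the inherent noise" concrete, the following small Python sketch computes loss, noise floor, and regret for a made-up noisy binary source under 0/1 loss. The probabilities in `p_x` and `p_y1` and the predictor `always_one` are invented for illustration; the minimum over all predictors c is found by brute-force enumeration, which is only feasible because the toy input space has three points.

```python
from itertools import product

# Toy noisy source (made-up numbers): P(x) and P(y = 1 | x) over three points.
p_x = {0: 0.5, 1: 0.3, 2: 0.2}
p_y1 = {0: 0.9, 1: 0.5, 2: 0.2}

def expected_loss(h):
    """Expected 0/1 loss of predictor h under the toy source."""
    return sum(p_x[x] * ((1 - p_y1[x]) if h(x) == 1 else p_y1[x]) for x in p_x)

def all_predictors():
    """Enumerate every function from {0, 1, 2} to {0, 1}."""
    xs = list(p_x)
    for labels in product([0, 1], repeat=len(xs)):
        table = dict(zip(xs, labels))
        yield table.__getitem__

# The min over all predictors c: the loss that no predictor can avoid.
min_loss = min(expected_loss(c) for c in all_predictors())

def regret(h):
    """Excess loss of h over the best possible predictor."""
    return expected_loss(h) - min_loss

def bayes_optimal(x):
    """Predict the more likely label at each x."""
    return int(p_y1[x] >= 0.5)

always_one = lambda x: 1
print(f"loss(always_one)   = {expected_loss(always_one):.2f}")
print(f"noise floor        = {min_loss:.2f}")
print(f"regret(always_one) = {regret(always_one):.2f}")
```

In this toy case the loss of 0.36 splits into 0.24 of unavoidable noise and 0.12 of regret, and only the regret reflects a mistake by the predictor.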

The skeletons of the theory for these families of reductions have been laid out at this point. There remain some open problems, but another interesting direction to consider is other families of reductions. The hope is that by placing more stringent requirements on reductions, we limit ourselves to algorithms which tend to perform better in practice. This hope is pretty reasonable: empirically, we have observed a consistent step up in performance going from loss transform to regret transform reductions.

1. Limited Regret Transform Reductions. The fact that the minimum is taken over all predictors in regret transforms is counterintuitive to some people, who are used to “Empirical Risk Minimization” statements where a minimum is taken over a limited set of predictors. We could imagine theorem statements of the form:
**Theorem.** For all sets of base predictors B, for all base predictors b, for all distributions D over examples of type T:
$$E_{(x,y) \sim D}\, L_T(y, Q(b,x)) - \min_{b' \in B} E_{(x,y) \sim D}\, L_T(y, Q(b',x)) \le f\left(E_{(x',y') \sim R(D)}\, L_{T'}(y', b(x')) - \min_{b' \in B} E_{(x',y') \sim R(D)}\, L_{T'}(y', b'(x'))\right)$$

This is a more general statement than a regret transform reduction: when B is the set of all base predictors, we recover standard regret transforms.
One case where it’s easy to see that this kind of statement holds is for the reduction from importance weighted binary classification to binary classification. However, little more is currently known.
2. Reversible Reductions. This is an idea which Russell Impagliazzo first mentioned to me. Essentially, we limit ourselves to reductions with the property that they are reversible. Reversibility can be tested by mapping from one problem to another, and then back. There are several variant theorem statements we could imagine. The most tractable variant for analysis might be the following:
**Theorem.** There exist $R^{-1}, Q^{-1}$ such that for all base predictors b, for all base learning problems D’:
$$E_{(x',y') \sim D'}\, L_{T'}(y', b(x')) = E_{(x',y') \sim R(R^{-1}(D'))}\, L_{T'}(y', b(x'))$$

and $Q^{-1}(Q(b)) = b$.

Closely related (but different) is the following:
**Theorem.** There exist $R^{-1}, Q^{-1}$ such that for all type T predictors h, for all type T distributions D:
$$E_{(x,y) \sim D}\, L_T(y, h(x)) = E_{(x,y) \sim R^{-1}(R(D))}\, L_T(y, h(x))$$

and $Q(Q^{-1}(h)) = h$.
3. Bayesian Reductions. This is an idea which Simon Osindero mentioned. The basic observation is that Bayes Law is pretty important to the process of learning. We would like it to be the case that Bayes Law and reductions compose. A theorem statement of the following form might be about right.
**Theorem.** For some large family of priors P over distributions D of type T:
$$\mathrm{Bayes}(P, (x,y) \sim D \sim P) = Q(\mathrm{Bayes}(R(P), (x',y') \sim D' \sim R(P)))$$

Here “Bayes” is a learning algorithm which takes as input a prior P (or R(P)), and a sample (x,y) drawn by first drawing a D from P and then drawing from D (and similarly for the induced problem). Also, R(P) is the prior induced by mapping D to R(D) after drawing from P.
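Returning to the first family above, the importance weighted case can be checked numerically. The post does not spell out which reduction it has in mind, so the Python sketch below assumes the reweighting construction: R reweights each sample (x, y, w) in proportion to w and drops the weight (the distribution that rejection sampling with acceptance probability w / max w draws from), and Q is the identity. The toy distribution `D` and the three predictors in `B` are made up. Under these assumptions the importance weighted loss of any predictor equals E[w] times its binary loss on R(D), so the limited regret statement holds with equality for f(r) = E[w]·r.

```python
# Toy importance weighted source: (x, y, w) -> probability (made-up numbers).
D = {
    (0, 1, 1.0): 0.4,
    (0, 0, 3.0): 0.1,
    (1, 0, 0.5): 0.3,
    (1, 1, 2.0): 0.2,
}

def R(D):
    """Induced binary distribution: reweight each sample by w, then drop w."""
    total = sum(p * w for (x, y, w), p in D.items())
    induced = {}
    for (x, y, w), p in D.items():
        induced[(x, y)] = induced.get((x, y), 0.0) + p * w / total
    return induced

def Q(b):
    """The binary predictor is used unchanged on the original problem."""
    return b

def weighted_loss(h, D):
    """Importance weighted loss: E[w * 1(h(x) != y)]."""
    return sum(p * w * (h(x) != y) for (x, y, w), p in D.items())

def binary_loss(b, D_prime):
    """Plain 0/1 loss on the induced distribution."""
    return sum(p * (b(x) != y) for (x, y), p in D_prime.items())

# A limited set B of base predictors (hypothetical).
B = {
    "always_0": lambda x: 0,
    "always_1": lambda x: 1,
    "identity": lambda x: x,
}

mean_w = sum(p * w for (_, _, w), p in D.items())  # E[w]
D_prime = R(D)

def limited_regret(loss, name):
    """Loss of B[name] minus the best loss achievable within B."""
    return loss(B[name]) - min(loss(b) for b in B.values())

for name in B:
    lhs = limited_regret(lambda h: weighted_loss(Q(h), D), name)
    rhs = mean_w * limited_regret(lambda b: binary_loss(b, D_prime), name)
    print(f"{name}: original regret {lhs:.4f}, E[w] * induced regret {rhs:.4f}")
```

Because the two losses differ only by the constant factor E[w], the best predictor in B is the same on both sides, which is why restricting the minimum to B causes no trouble in this particular case.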

The two missing components for these kinds of reductions are:

1. Theoretical evidence that we can satisfy these definitions of reduction between interesting types of learning problems.
2. Empirical evidence that algorithmic modifications driven by the theory are useful.

My experience is that analyzing reductions has yielded significant insight into how to solve learning problems, so I would encourage anyone with a bit of theoretical inclination in Machine Learning to consider the above (or other) families of reductions.

## 3/3/2007

### All Models of Learning have Flaws

Attempts to abstract and study machine learning take place within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I’ve created a table (below) outlining the major flaws in some common models of machine learning.

The point here is not simply “woe unto us”. There are several implications which seem important.

1. The multitude of models is a point of continuing confusion. It is common for people to learn about machine learning within one framework which often becomes their “home framework” through which they attempt to filter all machine learning. (Have you met people who can only think in terms of kernels? Only via Bayes Law? Only via PAC Learning?) Explicitly understanding the existence of these other frameworks can help resolve the confusion. This is particularly important when reviewing and particularly important for students.
2. Algorithms which conform to multiple approaches can have substantial value. “I don’t really understand it yet, because I only understand it one way”. Reinterpretation alone is not the goal—we want algorithmic guidance.
3. We need to remain constantly open to new mathematical models of machine learning. It’s common to forget the flaws of the model that you are most familiar with in evaluating other models while the flaws of new models get exaggerated. The best way to avoid this is simply education.
4. The value of theory alone is more limited than many theoreticians may be aware. Theories need to be tested to see if they correctly predict the underlying phenomena.

Here is a summary of what is wrong with various frameworks for learning. To avoid being entirely negative, I added a column about what’s right as well.