A type of prediction problem is specified by the type of samples produced by a data source (Example: X x {0,1}, X x [0,1], X x {1,2,3,4,5}, etc…) and a loss function (0/1 loss, squared error loss, cost sensitive losses, etc…). For simplicity, we’ll assume that all losses have a minimum of zero.
For this post, we can think of a learning reduction as
- A mapping R from samples of one type T (like multiclass classification) to another type T’ (like binary classification).
- A mapping Q from predictors for type T’ to predictors for type T.
The simplest sort of learning reduction is a “loss reduction”. The idea in a loss reduction is to prove a statement of the form:
Theorem For all base predictors b, for all distributions D over examples of type T:
E(x,y) ~ D LT(y,Q(b,x)) <= f(E(x’,y’)~R(D) LT’(y’,b(x’)))
Here
LT is the loss for the type
T problem and
LT’ is the loss for the type
T’ problem. Also,
R(D) is the distribution over samples induced by first drawing from
D and then mapping the sample via
R. The function
f() is the loss transform function—we try to find reductions
R,Q which minimize it’s value.
If R,Q are deterministic, then there always exists a choice of D,b such that the loss rate on the right hand side is 0. However, it’s common to encounter real-world learning problems D which are inherently noisy, implying that the induced problem D’ is often inherently noisy. Distinguishing between errors due to environmental noise and errors due to base predictor mistakes seems important (and experimentally, it has been). Regret transform reductions can get at this. They have theorems of the form:
Theorem For all base predictors b, for all distributions D over examples of type T:
E(x,y) ~ D LT(y,Q(b,x)) – minc E(x,y) ~ D LT(y,c(x)) <= f(E(x’,y’)~R(D) LT’(y’,b(x’)) – minb’ E(x’,y’)~R(D) LT’(y’,b'(x’)))
The essential idea in regret transform reductions is that we subtract off the inherent noise in both the induced and original problem, and bound the excess loss due to suboptimal prediction directly.
The skeletons of the theory for these families of reductions have been layed out at this point. There remain some open problems, but another interesting direction to consider is other families of reductions. The hope is that by placing more stringent requirements on reductions, we limit ourselves to algorithms which tend to perform better in practice. This hope is pretty reasonable—empirically, we have observed a consistent step up in performance going from loss transform to regret transform reductions.
- Limited Regret Transform Reductions. The fact that the minimum is taken over all predictors in regret transforms is counterintuitive to some people, who are used to “Empirical Risk Minimization” statements where a minimum is taken over a limited set of predictors. We could imagine theorem statements of the form:
Theorem For all sets of base predictors B, For all base predictors b, for all distributions D over examples of type T:
E(x,y) ~ D LT(y,Q(b,x)) – minb’ in B E(x,y) ~ D LT(y,Q(b’,x)) <= f(E(x’,y’)~R(D) LT’(y’,b(x’)) – minb’ in B E(x’,y’)~R(D) LT’(y’,b'(x’)))
This is a more general statement than a regret transform reduction—when B is the set of all base predictors, we recover standard regret transforms
One case where it’s easy to see that this kind of statement holds is for the reduction from importance weighted binary classification to binary classification. However, little more is currently known.
- Reversible Reductions. This is an idea which Russell Impagliazzo first mentioned to me. Essentially, we limit ourselves to reductions with the property that they are reversible. Reversibility can be tested by mapping from one problem to another, and then back. There are a several variant theorem statements we could imagine. The most tractable variant for analysis might be the following:
Theorem There exists R-1,Q-1 such that for all base predictors b, for base learning problems D’:
E(x’,y’)~D’ LT’(y’,b(x’)) = E(x’,y’) ~ R(R-1(D’)) LT’(y’,b(x’))
and Q-1(Q(b))=b
Closely related (but different) is the following:
Theorem There exists R-1,Q-1 such that for all type T predictors h, for all type T distributions D:
E(x,y) ~ D LT(y,h(x)) = E(x,y)~R-1(R(D)) LT(y,h(x))
and Q(Q-1(h)) = h
- Bayesian Reductions This is an idea which Simon Osindero mentioned. The basic observation is that Bayes Law is pretty important to the process of learning. We would like it to be the case that Bayes Law and reductions compose. A theorem statement of the following form might be about right.
Theorem For some large family of priors P over distributions D of type T:
Bayes(P,(x,y)~D~P) = Q(Bayes(R(P),(x’,y’)~D’~R(P)))
Here “Bayes” is a learning algorithm which takes as input a prior P (or R(P)), and a sample (x,y) drawn by first drawing a D from P and then drawing from D (and similarly for the induced problem). Also, R(P) is the prior induced by mapping D to R(D) after drawing from P.
The two missing components for these kinds of reductions are:
- Theoretical evidence that we can satisfy these definitions of reduction between interesting types of learning problems.
- Empirical evidence that algorithmic modifications driven by the theory are useful.
My experience is that analyzing reductions has yielded significant insight into how to solve learning problems, so I would encourage anyone with a bit of theoretical inclination in Machine Learning to consider the above (or other) families of reductions.