Nearly all natural problems require nonlinearity

One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation over hand-crafted features capturing substantial nonlinearities, or learning directly on a nonlinear representation.

There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail.
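To make the "linear predictor on an enormous sparse feature space" reading concrete, here is a minimal bigram sketch in Python (the names and the add-alpha smoothing are illustrative choices, not from any particular system): each (previous word, candidate word) pair is a feature, the weight on that feature is a smoothed log count, and scoring a candidate word is a sparse dot product.

```python
# A sketch of the "n-gram = sparse linear predictor" view. Each (previous word,
# candidate word) pair is one feature; the weight on that feature is a smoothed
# log count, so scoring a candidate word is a sparse dot product (here, a single
# table lookup). Names and the add-alpha smoothing are illustrative choices.
from collections import defaultdict
import math

def train_bigram_weights(corpus, alpha=1.0):
    """Return sparse weights w[(prev, cand)] = log of smoothed P(cand | prev)."""
    counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        vocab.update(tokens)
        for prev, cand in zip(tokens, tokens[1:]):
            counts[prev][cand] += 1
    weights = {}
    for prev, row in counts.items():
        total = sum(row.values()) + alpha * len(vocab)
        for cand in vocab:
            weights[(prev, cand)] = math.log((row.get(cand, 0) + alpha) / total)
    return weights, vocab

def predict_next(weights, vocab, prev):
    """The feature vector for (prev, cand) has one nonzero entry, so the linear
    score w . phi(prev, cand) reduces to looking up a single weight."""
    return max(vocab, key=lambda cand: weights.get((prev, cand), float("-inf")))

corpus = ["the cat sat", "the dog sat", "the cat ran"]
weights, vocab = train_bigram_weights(corpus)
print(predict_next(weights, vocab, "the"))  # "cat" on this toy corpus
```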

In contrast, if you go to a machine learning conference, a large number of the new algorithms are variations of learning on a linear representation. This claim should be understood broadly to include (for example) kernel methods, random projection methods, and more traditionally linear representations such as the perceptron. A basic question is: Why is the study of linear representations so prevalent?

There are several reasons for investigating the linear viewpoint.

  1. Linear learning is sufficient. As discussed above, this is really only true in practice if you have sufficiently capable humans hand-engineering features. On one hand, there is a compelling directness to that approach, but on the other it’s not the kind of approach which transfers well to new problems.
  2. Linear learning is a compelling primitive. Many of the effective approaches for nonlinear learning use some combination of linear primitives connected by nonlinearities to make a final prediction. As such, there is a plausible hope that improvements in linear learning can be applied repeatedly in these more complex structures.
  3. Linear learning is the only thing tractable, empirically. This has a grain of truth to it, but it appears to be uncompelling when you get down to the nitty-gritty details. On a dataset large enough to require efficient algorithms, you often want to use online learning. And, when you use online learning with a pure linear representation, the limiting factor is the speed at which data can be read into the CPU from the network or the disk (a minimal sketch of such an update appears after this list). If you aren't doing something more interesting than plain vanilla linear prediction, you are wasting most of your CPU cycles.
  4. Linear learning is the only thing tractable, theoretically. There are certainly many statements and guarantees that we only know how to make with linear representations and (typically) convex losses. However, there are fundamental limits to the extent that a well understood tool can be misused, and it’s important to understand that these theorems do not (and cannot) say that learning on a linear representation will solve some concrete problem like (say) face recognition from 10000 labeled examples. In addition, there are some analysis methods which apply to nonlinear learning systems—my favorite example is learning reductions, but there are others also.
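As a concrete illustration of point 3, here is a sketch of online learning with a pure linear representation over sparse features (the function and names are illustrative, not taken from any particular system). The per-example update is a handful of multiply-adds over the nonzero features, which is why data I/O rather than arithmetic tends to dominate the cost.

```python
# A sketch of online learning with a pure linear representation over sparse
# features (logistic loss, stochastic gradient descent). Names are illustrative.
# The per-example update only touches the nonzero features, so data I/O, not
# arithmetic, is typically the bottleneck.
import math
from collections import defaultdict

def online_linear_learner(stream, learning_rate=0.1):
    """stream yields (features, label): features is a sparse dict {index: value},
    label is +1 or -1. Returns the learned weights as a dict."""
    w = defaultdict(float)
    for features, label in stream:
        margin = sum(w[i] * v for i, v in features.items())
        # gradient of the logistic loss log(1 + exp(-label * margin)) w.r.t. margin
        g = -label / (1.0 + math.exp(label * margin))
        for i, v in features.items():
            w[i] -= learning_rate * g * v
    return w

# toy usage with two sparse examples
examples = [({0: 1.0, 3: 2.0}, +1), ({1: 1.0, 3: -1.0}, -1)]
print(dict(online_linear_learner(examples)))
```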

Some of the reasons for linear investigations appear sound, while others are simply variants of “looking where the light is”, which comes from an often retold story:
At night you see someone searching the ground under a streetlight.
You ask, “What happened?”
They say, “I’m looking for the keys I dropped in the bushes.”
“But there aren’t any bushes where you are searching.”
“Yes, but I can’t see over there.”

Watchword: Supervised Learning

I recently discovered that supervised learning is a controversial term. The two definitions are:

  1. Known Loss Supervised learning corresponds to the situation where you have unlabeled examples plus knowledge of the loss of each possible predicted choice. This is the definition I'm familiar and comfortable with. One reason to prefer this definition is that the analyses of sample complexity for this class of learning problems are all pretty similar.
  2. Any kind of signal Supervised learning corresponds to the situation where you have unlabeled examples plus any source of side information about what the right choice is. This notion of supervised learning seems to subsume reinforcement learning, which makes me uncomfortable, because it means there are two terms for the same class of problems. It also leaves no convenient term for the first definition.

Reviews suggest there are people out there who are dedicated to the second definition, so it can be important to be clear about which you mean.

Alternative Machine Learning Reductions Definitions

A type of prediction problem is specified by the type of samples produced by a data source (Example: X × {0,1}, X × [0,1], X × {1,2,3,4,5}, etc…) and a loss function (0/1 loss, squared error loss, cost sensitive losses, etc…). For simplicity, we'll assume that all losses have a minimum of zero.

For this post, we can think of a learning reduction as

  1. A mapping R from samples of one type T (like multiclass classification) to another type T’ (like binary classification).
  2. A mapping Q from predictors for type T’ to predictors for type T.

The simplest sort of learning reduction is a “loss reduction”. The idea in a loss reduction is to prove a statement of the form:
Theorem For all base predictors b, for all distributions D over examples of type T:

$$E_{(x,y) \sim D}\, L_T(y, Q(b,x)) \;\le\; f\!\left(E_{(x',y') \sim R(D)}\, L_{T'}(y', b(x'))\right)$$

Here $L_T$ is the loss for the type T problem and $L_{T'}$ is the loss for the type T' problem. Also, R(D) is the distribution over samples induced by first drawing from D and then mapping the sample via R. The function f() is the loss transform function—we try to find reductions R,Q which minimize its value.
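As a concrete and standard instance of this template, consider the one-against-all reduction from k-class classification to binary classification: R produces one binary example per class, and Q predicts the class whose binary predictor is most confident. A minimal sketch (names illustrative):

```python
# A sketch of the one-against-all reduction in the (R, Q) form above.
# R maps one k-class sample to k binary samples; Q turns a binary predictor
# into a multiclass predictor. Names are illustrative.

def R(sample, k):
    """Map a multiclass sample (x, y) to k binary samples ((x, c), 1 if c == y)."""
    x, y = sample
    return [((x, c), 1 if c == y else 0) for c in range(k)]

def Q(binary_predictor, k):
    """Wrap a binary predictor b((x, c)) -> score into a multiclass predictor
    that outputs the class whose binary predictor is most confident."""
    def multiclass_predictor(x):
        return max(range(k), key=lambda c: binary_predictor((x, c)))
    return multiclass_predictor

# usage sketch: train any binary learner on the union of R(sample, k) over the
# training set, then wrap it with Q. A dummy binary predictor for illustration:
b = lambda xc: 1.0 if xc[0] % 3 == xc[1] else 0.0
print(R((7, 1), k=3))   # [((7, 0), 0), ((7, 1), 1), ((7, 2), 0)]
print(Q(b, k=3)(7))     # 1, since 7 % 3 == 1
```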

If R,Q are deterministic, then there always exists a choice of D,b such that the loss rate on the right hand side is 0. However, it’s common to encounter real-world learning problems D which are inherently noisy, implying that the induced problem D’ is often inherently noisy. Distinguishing between errors due to environmental noise and errors due to base predictor mistakes seems important (and experimentally, it has been). Regret transform reductions can get at this. They have theorems of the form:
Theorem For all base predictors b, for all distributions D over examples of type T:

$$E_{(x,y) \sim D}\, L_T(y, Q(b,x)) \;-\; \min_{c}\, E_{(x,y) \sim D}\, L_T(y, c(x)) \;\le\; f\!\left(E_{(x',y') \sim R(D)}\, L_{T'}(y', b(x')) \;-\; \min_{b'}\, E_{(x',y') \sim R(D)}\, L_{T'}(y', b'(x'))\right)$$

The essential idea in regret transform reductions is that we subtract off the inherent noise in both the induced and original problem, and bound the excess loss due to suboptimal prediction directly.
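For intuition about what is being subtracted, consider binary classification with 0/1 loss and $\eta(x) = \Pr(y = 1 \mid x)$: the minimum over all predictors c is achieved by the Bayes rule, so (notation mine, but the fact is standard)

$$\min_{c}\, E_{(x,y) \sim D}\, \mathbf{1}[y \ne c(x)] \;=\; E_{x}\, \min\{\eta(x),\, 1 - \eta(x)\},$$

which is exactly the unavoidable noise. The regret of b is its loss minus this quantity, so the theorem bounds excess loss rather than raw loss.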

The skeletons of the theory for these families of reductions have been laid out at this point. There remain some open problems, but another interesting direction to consider is other families of reductions. The hope is that by placing more stringent requirements on reductions, we limit ourselves to algorithms which tend to perform better in practice. This hope is pretty reasonable—empirically, we have observed a consistent step up in performance going from loss transform to regret transform reductions.

  1. Limited Regret Transform Reductions. The fact that the minimum is taken over all predictors in regret transforms is counterintuitive to some people, who are used to “Empirical Risk Minimization” statements where a minimum is taken over a limited set of predictors. We could imagine theorem statements of the form:
    Theorem For all sets of base predictors B, For all base predictors b, for all distributions D over examples of type T:
    $$E_{(x,y) \sim D}\, L_T(y, Q(b,x)) \;-\; \min_{b' \in B}\, E_{(x,y) \sim D}\, L_T(y, Q(b',x)) \;\le\; f\!\left(E_{(x',y') \sim R(D)}\, L_{T'}(y', b(x')) \;-\; \min_{b' \in B}\, E_{(x',y') \sim R(D)}\, L_{T'}(y', b'(x'))\right)$$

    This is a more general statement than a regret transform reduction—when B is the set of all base predictors, we recover standard regret transforms.
    One case where it’s easy to see that this kind of statement holds is for the reduction from importance weighted binary classification to binary classification. However, little more is currently known.
  2. Reversible Reductions. This is an idea which Russell Impagliazzo first mentioned to me. Essentially, we limit ourselves to reductions with the property that they are reversible. Reversibility can be tested by mapping from one problem to another, and then back. There are several variant theorem statements we could imagine. The most tractable variant for analysis might be the following:
    Theorem There exists $R^{-1}, Q^{-1}$ such that for all base predictors b, for all base learning problems D':
    $$E_{(x',y') \sim D'}\, L_{T'}(y', b(x')) = E_{(x',y') \sim R(R^{-1}(D'))}\, L_{T'}(y', b(x'))$$

    and $Q^{-1}(Q(b)) = b$

    Closely related (but different) is the following:
    Theorem There exists $R^{-1}, Q^{-1}$ such that for all type T predictors h, for all type T distributions D:
    $$E_{(x,y) \sim D}\, L_T(y, h(x)) = E_{(x,y) \sim R^{-1}(R(D))}\, L_T(y, h(x))$$

    and $Q(Q^{-1}(h)) = h$
  3. Bayesian Reductions This is an idea which Simon Osindero mentioned. The basic observation is that Bayes Law is pretty important to the process of learning. We would like it to be the case that Bayes Law and reductions compose. A theorem statement of the following form might be about right.
    Theorem For some large family of priors P over distributions D of type T:
    $$\mathrm{Bayes}(P, (x,y) \sim D \sim P) = Q\!\left(\mathrm{Bayes}(R(P), (x',y') \sim D' \sim R(P))\right)$$

    Here “Bayes” is a learning algorithm which takes as input a prior P (or R(P)), and a sample (x,y) drawn by first drawing a D from P and then drawing from D (and similarly for the induced problem). Also, R(P) is the prior induced by mapping D to R(D) after drawing from P.

The two missing components for these kinds of reductions are:

  1. Theoretical evidence that we can satisfy these definitions of reduction between interesting types of learning problems.
  2. Empirical evidence that algorithmic modifications driven by the theory are useful.

My experience is that analyzing reductions has yielded significant insight into how to solve learning problems, so I would encourage anyone with a bit of theoretical inclination in Machine Learning to consider the above (or other) families of reductions.

The Call of the Deep

Many learning algorithms used in practice are fairly simple. Viewed representationally, many prediction algorithms either compute a linear separator of basic features (perceptron, winnow, weighted majority, SVM) or perhaps a linear separator of slightly more complex features (2-layer neural networks or kernelized SVMs). Should we go beyond this, and start using “deep” representations?

What is deep learning?
Intuitively, deep learning is about learning to predict in ways which can involve complex dependencies between the input (observed) features.

Specifying this more rigorously turns out to be rather difficult. Consider the following cases:

  1. SVM with Gaussian Kernel. This is not considered deep learning, because an SVM with a Gaussian kernel can't succinctly represent certain decision surfaces. One of Yann LeCun's examples is recognizing objects based on pixel values. An SVM will need a new support vector for each significantly different background. Since the number of distinct backgrounds is large, this isn't easy.
  2. K-Nearest neighbor. This is not considered deep learning for essentially the same reason as the Gaussian SVM. The number of representative points required to recognize an image in any background is very large.
  3. Decision Tree. A decision tree might be considered a deep learning system. However, there exist simple learning problems that defeat decision trees using axis-aligned splits: rotating a linear separator through many dimensions is an easy way to construct one.
  4. 2-layer neural networks. A two layer neural network isn't considered deep learning because it isn't a deep architecture. More importantly, perhaps, the object recognition with occluding background problem implies that the hidden layer must be very large to do general purpose detection.
  5. Deep neural networks (for example, convolutional neural networks). A neural network with several layers might be considered deep.
  6. Deep Belief networks are “deep”.
  7. Automated feature generation and selection systems might be considered deep since they can certainly develop deep dependencies between the input and the output.

One test for a deep learning system is: are there well-defined learning problems which the system cannot solve but a human easily could? If the answer is 'yes', then it's perhaps not a deep learning system.

Where might deep learning be useful?
There are several theorems of the form: "nearest neighbor can learn any measurable function", "2 layer neural networks can represent any function", "a support vector machine with a Gaussian kernel can learn any function". These theorems imply that deep learning is only interesting in the bounded data or computation case.
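For reference, one representative formalization is the universal approximation theorem (Cybenko's version, roughly stated): for a continuous sigmoidal function $\sigma$, sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j\, \sigma(w_j^\top x + b_j)$$

are dense in the continuous functions on $[0,1]^n$, so any continuous target can be approximated arbitrarily well by a large enough single-hidden-layer network. The catch is that $N$ may need to be enormous, which is exactly why the bounded data and computation case is the interesting one.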

And yet, for the small data situation (think "30 examples"), problems with overfitting become so severe that it's difficult to imagine using more complex learning algorithms than the shallow systems commonly in use.

So the domain where a deep learning system might be most useful involves large quantities of data with computational constraints.

What are the principles of design for deep learning systems?
The real answer here is “we don’t know”, and this is an interesting but difficult direction of research.

  1. Is (approximate) gradient descent the only efficient training algorithm?
  2. Can we learn an architecture on the fly or must it be prespecified?
  3. What are the limits of what can be learned?

The value of the orthodox view of Boosting

The term "boosting" comes from the idea of using a meta-algorithm which takes "weak" learners (that may only predict slightly better than random) and turns them into strongly capable learners (which predict very well). Adaboost in 1995 was the first widely used (and useful) boosting algorithm, although there were theoretical boosting algorithms floating around since 1990 (see the bottom of this page).
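To keep the weak-to-strong framing concrete, here is a minimal Adaboost sketch (illustrative, not a faithful reproduction of any particular implementation). The base learner is called as a subroutine on reweighted data, and its weighted error determines both its vote and the next round's weights.

```python
# A minimal Adaboost sketch in the "weak to strong" style: the base learner is an
# arbitrary subroutine trained on weighted examples. Names are illustrative.
import math

def adaboost(examples, labels, weak_learner, rounds):
    """examples: list of inputs; labels: list of +1/-1 values;
    weak_learner(examples, labels, weights) -> predictor h with h(x) in {+1, -1}.
    Returns a strong predictor formed as a weighted vote of the weak predictors."""
    n = len(examples)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, h) pairs
    for _ in range(rounds):
        h = weak_learner(examples, labels, weights)
        # weighted error of this round's weak hypothesis
        eps = sum(w for w, x, y in zip(weights, examples, labels) if h(x) != y)
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)  # guard against 0 or 1
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        ensemble.append((alpha, h))
        # upweight mistakes, downweight correct predictions, renormalize
        weights = [w * math.exp(-alpha * y * h(x))
                   for w, x, y in zip(weights, examples, labels)]
        z = sum(weights)
        weights = [w / z for w in weights]

    def strong(x):
        return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
    return strong
```

A decision stump trained on the weighted sample is the classic choice for weak_learner here.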

Since then, many different interpretations of why boosting works have arisen. There is significant discussion about these different views in the Annals of Statistics, including a response by Yoav Freund and Robert Schapire.

I believe there is a great deal of value to be found in the original view of boosting (meta-algorithm for creating a strong learner from a weak learner). This is not a claim that one particular viewpoint obviates the value of all others, but rather that no other viewpoint seems to really capture important properties.

Comparing with all other views of boosting is too clumsy, so I will pick one, "boosting is coordinate-wise gradient descent (CWGD for short) on an exponential loss function" (which started here), and compare it with Adaboost.
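For readers unfamiliar with the CWGD view, a compressed version of the (standard) derivation runs as follows; treat the details as a sketch. Each candidate weak hypothesis plays the role of a coordinate of the ensemble $F(x) = \sum_t \alpha_t h_t(x)$, and the loss is $\sum_i e^{-y_i F(x_i)}$. Adding a new hypothesis h with coefficient $\alpha$ and minimizing the loss over $\alpha$ gives

$$\alpha = \frac{1}{2} \ln \frac{1 - \epsilon}{\epsilon},$$

where $\epsilon$ is the weighted error of h under weights $w_i \propto e^{-y_i F(x_i)}$. These are exactly Adaboost's coefficient and example-weight updates, which is the sense in which Adaboost "is" CWGD on the exponential loss.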

There are two advantages of the “weak to strong learning” view:

  1. Automatic computational speedups. In the "weak to strong learning" view, you automatically think about using a learning algorithm as a subroutine. As a consequence, we know the computation can be quite fast. In the CWGD view, using C4.5 (or some other algorithm) to pick the coordinate is an unmotivated decision. The straightforward thing to do is simply check each coordinate in turn, which yields no computational speedups.
  2. Meta-algorithm based performance gains. Using a nontrivial base learning algorithm seems to improve performance. This is unsurprising—simply consider the limit where only one round of boosting is done. This is not well-accounted for by the CWGD view.

The point here is not that the old view subsumes the CWGD view, but rather that the CWGD view does not account for all the value in the old view. (The CWGD view does suggest that a broader family of algorithms may be useful than the weak-to-strong view implies.)

This might seem like a “too meta” discussion, but it is very relevant to the process of research. We as researchers in machine learning have a choice of many methods of thinking about developing algorithms. Some methods are harder to use than others, so it is useful to think about what works well. Gradient descent is a core algorithmic tool in machine learning. After making a sequence of more-or-less unmotivated steps, we can derive Adaboost (and other algorithms) as an application of gradient descent. Or, we can study the notion of boosting weak learning to achieve strong learning and derive Adaboost. My impression is that the “weak learning to achieve strong learning” view is significantly more difficult to master than gradient descent, but it is also a significantly more precise mechanism for deriving useful algorithms. There are many gradient descent algorithms which are not very useful in machine learning. Amongst other things, the “weak to strong” view significantly informed some of the early development of learning reductions. It is no coincidence that Adaboost can be understood in this framework.