One conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. This conventional wisdom appears unsupported by empirical evidence as far as I can tell. In nearly all vision, language, robotics, and speech applications I know where machine learning is effectively applied, the approach involves either a linear representation on hand crafted features capturing substantial nonlinearities or learning directly on nonlinear representations.

There are a few exceptions to this—for example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail.

In contrast, if you go to a machine learning conference, a large number of the new algorithms are variations of learning on a linear representation. This claim should be understood broadly to include (for example) kernel methods, random projection methods, and more traditionally linear representations such as the perceptron. A basic question is: Why is the study of linear representations so prevalent?

There are several reasons for investigating the linear viewpoint.

- Linear learning is sufficient. As discussed above, this is really only true in practice if you have sufficiently capable humans hand-engineering features. On one hand, there is a compelling directness to that approach, but on the other it’s not the kind of approach which transfers well to new problems.
- Linear learning is a compelling primitive. Many of the effective approaches for nonlinear learning use some combination of linear primitives connected by nonlinearities to make a final prediction. As such, there is a plausible hope that improvements in linear learning can be applied repeatedly in these more complex structures.
- Linear learning is the only thing tractable, empirically. This has a grain of truth to it, but it appears to be uncompelling when you get down to the nitty-gritty details. On a dataset large enough to require efficient algorithms, you often want to use online learning. And, when you use online learning with a pure linear representation, the limiting factor is the speed that data can be sucked into the CPU from the network or the disk. If you aren’t doing something more interesting than plain vanilla linear prediction, you are wasting most of your CPU cycles.
- Linear learning is the only thing tractable, theoretically. There are certainly many statements and guarantees that we only know how to make with linear representations and (typically) convex losses. However, there are fundamental limits to the extent that a well understood tool can be misused, and it’s important to understand that these theorems do not (and cannot) say that learning on a linear representation will solve some concrete problem like (say) face recognition from 10000 labeled examples. In addition, there are some analysis methods which apply to nonlinear learning systems—my favorite example is learning reductions, but there are others also.

Some of the reasons for linear investigations appear sound, while others are simply variants of “looking where the light is”, which comes from an often retold story:

At night you see someone searching the ground under a streetlight.

You ask, “What happened?”

They say, “I’m looking for the keys I dropped in the bushes.”

“But there aren’t any bushes where you are searching.”

“Yes, but I can’t see over there.”

Yes , and many times we found first the solution and then we understand why, so we must search also without the light.

There seems to be much too much of a focus on linear learning, but even for non-linear learning there is often a focus on so called “shallow” architectures with convex loss functions. That is why I take every opportunity to proselytize for deep learning, with things like Bengio and LeCun’s paper “Scaling Learning Algorithms towards AI” (http://yann.lecun.com/exdb/publis/pdf/bengio-lecun-07.pdf).

I’m surprised nobody mentioned kernel methods, radial basis functions, etc. Aren’t they a way of introducing nonlinearity into linear classifiers?

Yes, but unless the kernel is learned they apply a fixed nonlinearity and are essentially linear classifiers on top of a bunch of template matchers. So they are usually one large layer of simple template matchers with a single layer of learnable weights. See the paper I mention above for a more fleshed out version of this remark.

What about small n large p problems which are very common in bioinformatics? In many cases you are limited to essentially linear models, since you have to use such extreme penalization (e.g., the centroid classifier is often as good as a linear-kernel SVM).

I’m assuming p is the number of features. In this setting, I expect bringing any external knowledge to bare by any means might yield substantial improvements. For example, methods including feature engineering, Bayesian integration, and external databases might be helpful. If you really have no extra information, I can see how a severely restricted linear representation might be a reasonable choice. I’d be tempted to try out some of the exponentiated gradient style algorithms, in the hope that it can quickly pick out a relevant feature.