One piece of conventional wisdom is that learning algorithms with linear representations are sufficient to solve natural learning problems. As far as I can tell, empirical evidence does not support it. In nearly all vision, language, robotics, and speech applications I know of where machine learning is effectively applied, the approach involves either a linear representation on hand-crafted features that capture substantial nonlinearities, or learning directly on nonlinear representations.
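To make the role of hand-crafted features concrete, here is a minimal sketch of my own (not taken from any of the applications mentioned). It assumes a textbook perceptron and the XOR problem with inputs in {-1, +1}. The names `train_perceptron`, `raw`, and `crafted` are illustrative. On the raw inputs no linear separator exists, so the perceptron never finishes a mistake-free pass; adding the single hand-crafted product feature `x1 * x2` makes the same linear learner succeed.

```python
from itertools import product


def train_perceptron(examples, epochs=50):
    """Textbook perceptron. Returns (weights, bias, converged).

    `converged` is True only if some full pass makes no mistakes,
    which can happen only when the data is linearly separable.
    """
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in examples:  # y is -1 or +1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # wrong side of (or on) the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False


# XOR on {-1, +1} inputs: label +1 when the two inputs differ.
raw = [((x1, x2), 1 if x1 != x2 else -1)
       for x1, x2 in product((-1, 1), repeat=2)]

# Same data with one hand-crafted nonlinear feature appended.
crafted = [((x1, x2, x1 * x2), y) for (x1, x2), y in raw]

_, _, raw_converged = train_perceptron(raw)
_, _, crafted_converged = train_perceptron(crafted)
print(raw_converged, crafted_converged)
```

The learner is identical in both runs; all of the nonlinearity lives in the feature a human chose to add.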
There are a few exceptions to this pattern. For example, if the problem of interest to you is predicting the next word given previous words, n-gram methods have been shown to be effective. Viewed the right way, n-gram methods are essentially linear predictors on an enormous sparse feature space, learned from an enormous number of examples. Hal’s post here describes some of this in more detail.
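One way to read the "linear predictor on a sparse feature space" view is sketched below, under my own assumptions: a bigram model with no smoothing, one indicator feature per (previous word, next word) pair, and a weight equal to the log of the observed conditional frequency. Since exactly one indicator is active for any candidate, the dot product reduces to a table lookup. The function names and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def train_bigram_weights(tokens):
    """Return one weight per (previous word, next word) indicator feature.

    Each weight is the log of the unsmoothed conditional frequency
    count(prev, next) / count(prev).
    """
    pair_counts = Counter(zip(tokens, tokens[1:]))
    prev_counts = Counter(tokens[:-1])
    return {(p, n): math.log(c / prev_counts[p])
            for (p, n), c in pair_counts.items()}


def score(weights, prev, candidate):
    # The sparse feature vector has a single active indicator,
    # (prev, candidate), so the linear score is one weight lookup.
    return weights.get((prev, candidate), float("-inf"))


def predict_next(weights, prev, vocabulary):
    return max(vocabulary, key=lambda w: score(weights, prev, w))


tokens = "the cat sat on the mat and the cat slept".split()
weights = train_bigram_weights(tokens)
vocabulary = sorted(set(tokens))
prediction = predict_next(weights, "the", vocabulary)
print(prediction)
```

The feature space here has one dimension per word pair, which is why real n-gram models need so many examples: most of those dimensions are rarely or never observed.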
In contrast, if you go to a machine learning conference, a large number of the new algorithms are variations of learning on a linear representation. This claim should be understood broadly to include (for example) kernel methods, random projection methods, and more traditionally linear representations such as the perceptron. A basic question is: Why is the study of linear representations so prevalent?
There are several reasons for investigating the linear viewpoint.
- Linear learning is sufficient. As discussed above, this is really only true in practice if you have sufficiently capable humans hand-engineering features. On one hand, there is a compelling directness to that approach, but on the other it’s not the kind of approach which transfers well to new problems.
- Linear learning is a compelling primitive. Many of the effective approaches for nonlinear learning use some combination of linear primitives connected by nonlinearities to make a final prediction. As such, there is a plausible hope that improvements in linear learning can be applied repeatedly in these more complex structures.
- Linear learning is the only thing tractable, empirically. This has a grain of truth to it, but it is not compelling once you get down to the nitty-gritty details. On a dataset large enough to require efficient algorithms, you often want to use online learning. And when you use online learning with a pure linear representation, the limiting factor is the speed at which data can be read into the CPU from the network or the disk. If you aren’t doing something more interesting than plain vanilla linear prediction, you are wasting most of your CPU cycles.
- Linear learning is the only thing tractable, theoretically. There are certainly many statements and guarantees that we only know how to make with linear representations and (typically) convex losses. However, there are fundamental limits to the extent to which a well-understood tool can be misused, and it’s important to understand that these theorems do not (and cannot) say that learning on a linear representation will solve some concrete problem like (say) face recognition from 10,000 labeled examples. In addition, there are some analysis methods which apply to nonlinear learning systems. My favorite example is learning reductions, but there are others also.
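To illustrate the empirical-tractability point above, here is a rough sketch of online learning with a pure linear representation. It assumes squared loss, plain stochastic gradient descent, and sparse examples stored as dictionaries; the names `sgd_update` and `example_stream` and the toy data are hypothetical. The work per example is a few multiply-adds per nonzero feature, which is why reading the data tends to cost more than learning from it.

```python
def sgd_update(weights, features, label, learning_rate=0.1):
    """One online squared-loss step on a sparse example.

    `features` maps feature name -> value. Only the features present
    in the example are read or written. Returns the prediction made
    before the update.
    """
    prediction = sum(weights.get(name, 0.0) * value
                     for name, value in features.items())
    error = prediction - label
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - learning_rate * error * value
    return prediction


def example_stream(passes=200):
    # Toy stand-in for examples arriving from disk or the network.
    # The labels follow label = 2*a - 1*b exactly.
    data = [({"a": 1.0}, 2.0),
            ({"b": 1.0}, -1.0),
            ({"a": 1.0, "b": 1.0}, 1.0)]
    for _ in range(passes):
        yield from data


weights = {}
for features, label in example_stream():
    sgd_update(weights, features, label)
print(weights)
```

Each call to `sgd_update` touches only the nonzero features of one example, so the arithmetic is trivial next to the cost of fetching and parsing that example.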
Some of the reasons for linear investigations appear sound, while others are simply variants of “looking where the light is”, a phrase that comes from an often-retold story:
At night you see someone searching the ground under a streetlight.
You ask, “What happened?”
They say, “I’m looking for the keys I dropped in the bushes.”
“But there aren’t any bushes where you are searching.”
“Yes, but I can’t see over there.”