Use of Learning Theory – Machine Learning (Theory)

I’ve had serious conversations with several people who believe that the theory in machine learning is “only useful for getting papers published”. That’s a compelling statement, as I’ve seen many papers where the algorithm clearly came first, and the theoretical justification for it came second, purely as a perceived means to improve the chance of publication.

Naturally, I disagree and believe that learning theory has much more substantial applications.

Even in core learning algorithm design, I’ve found learning theory to be useful, although it’s application is more subtle than many realize. The most straightforward applications can fail, because (as expectation suggests) worst case bounds tend to be loose in practice (*). In my experience, considering learning theory when designing an algorithm has two important effects in practice:

It can help make your algorithm behave right at a crude level of analysis, leaving finer details to tuning or common sense. The best example I have of this is the Isomap, where the algorithm was informed by the analysis yielding substantial improvements in sample complexity over earlier algorithmic ideas.
An algorithm with learning theory considered in it’s design can be more automatic. I’ve gained more respect for Rifkin’s claim: that the one-against-all reduction, when tuned well, can often perform as well as other approaches. The “when tuned well” caveat is however substantial, because learning algorithms may be applied by nonexperts or by other algorithms which are computationally constrained. A reasonable and worthwhile hope for other methods of addressing multiclass problems is that they are more automatic and computationally faster. The subtle issue here is: How do you measure “more automatic”?

In my experience, learning theory is most useful in it’s crudest forms. A good example comes in the architecting problem: how do you go about solving a learning problem? I mean this in the broadest sense imaginable:

Is it a learning problem or not? Many problems are most easily solved via other means such as engineering, because that’s easier, because there is a severe data gathering problem, or because there is so much data that memorization works fine. Learning theory such as statistical bounds and online learning with experts helps substantially here because it provides guidelines about what is possible to learn and what not.
What type of learning problem is it? Is it a problem where exploration is required or not? Is it a structured learning problem? A multitask learning problem? A cost sensitive learning problem? Are you interested in the median or the mean? Is active learning useable or not? Online or not? Answering these questions correctly can easily make a difference between a succesful application and not. Answering these questions is partly definition checking, and since the answer is often “all of the above”, figuring out which aspect of the problem to address first or next is helpful.
What is the right learning algorithm to use? Here the relative capacity of a learning algorithm and it’s computational efficiency are most important. If you have few features and many examples, a nonlinear algorithm with more representational capacity is a good idea. If you have many features and little data, linear representations or even exponentiated gradient style algorithms are important. If you have very large amounts of data, the most scalable algorithms (so far) use a linear representation. If you have little data and few features, a Bayesian approach may be your only option. Learning theory can help in all of the above by quantifying “many”, “little”, “most”, and “few”. How do you deal with the overfitting problem? One thing I realized recently is that the overfitting problem can be a concern even with very large natural datasets, because some examples are naturally more important than others.

As might be clear, I think of learning theory as somewhat broader than might be traditional. Some of this is simply education. Many people have only been exposed to one piece of learning theory, often VC theory or it’s cousins. After seeing this, they come to think of it as learning theory. VC theory is a good theory, but it is not complete, and other elements of learning theory seem at least as important and useful. Another aspect is publishability. Simply sampling from the learning theory in existing papers does not necessarily give a good distribution of subjects for teaching, because the goal of impressing reviewers does not necessarily coincide with the clean simple analysis that is teachable.

(*) There is significant investigation into improving the tightness of bounds to the point of usefulness, and maybe it will pay off.

11 Replies to “Use of Learning Theory”

Great post, John, thanks!

Thank you for the great post.
I am interested in which aspects of ML Theory should be taught. I recently took an ML course and we spent most of our time (in the learning theory domain) talking about VC Dimensions. Obviously, VC dimensions are very useful especially when expressing the complexity of hypothesis sets (with infinite cardinality), and the (somewhat crude) upper bound you can establish using the VC dimension of set is very important. But besides these two points, I didn’t get to fully enjoy this aspect of ML. I am interested in teaching ML in college and would like to focus more on ML theory rather than its applications during the earlier parts of the course. I am interested in hearing the readers’ opinion on useful, teachable, learning theory concepts to both undergrad and graduate students.
Thanks

Regarding the first sentence of your post – I don’t know of any other subfield of computer science theory that gets such disrespect from outsiders. I wish I understood why learning theory has this problem.

There are two reasons I know of: (1) the value of learning theory for machine learning is more subtle than in other fields, as discussed above. (2) there are multiple sources of theory colliding, from computer science, from statistics, and from mathematics. Given this, some tug-of-war happens.

This is one example from a previous post. For more detail on some subjects, Avrim’s notes are a fine source.

Nice post

John, that explanation makes sense, but it doesn’t explain why other areas of CS theory don’t also get such a harsh treatment. e.g. computational mechanism design – most papers are not well targeted at the alleged practical application (e-commerce) but the field still seems to elicit much more general enthusiasm.

I am not sure if this is what your post implies, but it is a real shame that Learning Theory is not perceived well within Theory, e.g. FOCS/STOC/SODA.

I wasn’t thinking about this explicitly.

A fundamental difficulty with STOC/FOCS/SODA is that learning theory is partly computational and partly informational. STOC/FOCS/SODA are almost entirely oriented around computation, so it’s easy for them to miss or misunderstand the informational aspect.

Pingback: Union Bound, Hoeffding Inequality and Some Error Bounds in Learning Theory – Part I « Onionesque Reality

Pingback: Union Bound, Hoeffding Inequality and Some Bounds in Learning Theory – Part I | BlogHalt.com (Pre-Launch)

Comments are closed.