Funding Research

The funding of research (and of machine learning research in particular) is an issue which seems to have become more significant in the United States over the last decade. The word “research” is used broadly here to cover science, mathematics, and engineering.

There are two essential difficulties with funding research:

  1. Longshot: Paying a researcher is often a big gamble. Most research projects don’t pan out, but a few big payoffs can make it all worthwhile.
  2. Information Only: Much of research is about finding the right way to think about or do something, so the output is information rather than a product.

The Longshot difficulty means that there is high variance in payoffs. This can be compensated for by funding many different research projects, reducing variance.

The Information-Only difficulty means that it’s hard to extract a profit directly from many types of research, so companies have difficulty justifying basic research. (Patents are one mechanism for doing this, but they are often extraordinarily clumsy or simply not applicable.)

These two difficulties together imply that research is often chronically underfunded compared to what would be optimal for any particular nation. They also imply that funding research makes more sense for larger nations, and as government (rather than private) investment.

The United States has a history of significant research, and significant benefits from research, but that seems to be under attack.

  1. Historically, the old phone monopoly ran Bell Labs, which employed thousands of people doing science and engineering research. It made great sense for them because research was a place to stash money (and evade regulators) that might also yield some return. They invented the transistor, the laser, and Unix. With the breakup of the phone monopoly, this no longer made sense, so Bell Labs has been broken apart and its research staff has shrunk by orders of magnitude.
  2. On a smaller scale, Xerox PARC (inventors of the mouse, Ethernet, and other basic pieces of computer technology) has been radically scaled back.
  3. IBM and HP, which have historically been strong funders of computer-related research, have been forced to shift towards more directed research. (Some great research still happens at these places, but the overall trend seems clear.)
  4. The NSF has had funding cut.

What’s Left
The new monopoly on the block is Microsoft, which has been a significant funder of new research, some of which is basic. IBM is still managing to do some basic research. Many companies are funding directed research (with short term expected payoffs). The NSF still has a reasonable budget, even if it is a bit reduced. Many other branches of the government fund directed research of one sort or another. From the perspective of a researcher, this isn’t as good as NSF because it is “money with strings attached”, including specific topics, demos, etc…

Much of the funding available falls into two or three categories: directed into academia, very directed, or both. These have difficulties from a research viewpoint.

  1. Into Academia: The difficulty with funding directed into academia is that the professors it is directed at are incredibly busy with non-research. Teaching and running a university are full-time jobs. It takes an extraordinary individual to manage all of this and still get research done. (Amazingly, many people do manage, but the workload can be crushing.) From the perspective of funding research, this is problematic, because the people being funded are forced to devote much of their time to non-research. As an example from machine learning, AT&T inherited the machine learning group from Bell Labs, consisting of Leon Bottou, Michael Collins, Sanjoy Dasgupta, Yoav Freund, Michael Kearns, Yann LeCun, David McAllester, Robert Schapire, Satinder Singh, Peter Stone, Rich Sutton, Vladimir Vapnik (and maybe a couple more I’m forgetting). The environment there was almost pure research, without other responsibilities. It would be extremely difficult to argue that a similar-sized group drawn randomly from academia has had as significant an impact on machine learning. This group is gone now, scattered to many different locations.
  2. Very Directed: It’s a basic fact of research that it is difficult to devote careful and deep attention to something that does not interest you. Attempting to do so simply means that many researchers aren’t efficient. (I’m not arguing against any direction here. It makes sense for any nation to invest in directions which seem important.)

The Good News (for researchers, anyways)
The good news at the moment is outside of the US. NICTA, in Australia, seems to be a well-made attempt to do research right. India is starting to fund research more. China is starting to fund research more. Japan is starting to fund basic research more. With the rise of the EU, more funding for research makes sense because the benefit applies to a much larger pool of people. In machine learning, this is being realized with the PASCAL project. On the engineering side, centers like the Mozilla Foundation and OSDL (which are funded by corporate contributions) provide some funding for open source programmers.

We can hope for improvements in the US; there is certainly room for them. For example, the NSF budget is roughly 0.3% of the federal government budget, so the budgetary impact of more funding for basic research would be relatively trivial in the big picture. However, it’s never easy to trade off immediate needs against the silent loss of the future.

The Big O and Constants in Learning

The notation g(n) = O(f(n)) means that there exists a constant C such that g(n) is at most Cf(n) for all sufficiently large n.
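In symbols, one standard way to state this is:

```latex
g(n) = O(f(n)) \iff \exists\, C > 0,\ \exists\, n_0,\ \forall\, n \ge n_0 :\; g(n) \le C\, f(n)
```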

In learning theory, there are many statements about learning algorithms of the form “under assumptions x,y, and z, the classifier learned has an error rate of at most O(f(m))“.

There is one very good reason to use O(): it helps you understand the big picture and neglect the minor details which are not important in the big picture. However, there are some important reasons not to do this as well.

  1. Unspeedup: In algorithm analysis, the use of O() for time complexity is pervasive and well-justified: determining the exact value of C is inherently dependent on the computer architecture. (The “C” for x86 processors might differ from the “C” on PowerPC processors.) Since many learning theorists come from a CS theory background, the O() notation gets applied to generalization error as well. The O() abstraction breaks down here: you cannot generally change the learning algorithm and decrease your error rate by some constant factor.
  2. You’re fired: When you go and run a learning algorithm to acquire some predictor, the performance it achieves is a key quantity of significant interest. Using an algorithm with merely twice the error rate of a better algorithm can easily make the difference between “it works” and “it doesn’t”. This is rarely true when programming, as evidenced by the large number of people who use computationally inefficient languages to solve problems.
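To make the point about constants concrete, consider a hypothetical pair of bounds with the same O(sqrt(d/m)) rate but different constants. The rate, the complexity d, and the sample size m below are invented purely for illustration:

```python
# Hypothetical illustration: two error bounds of the form C * sqrt(d / m)
# that are identical up to the constant C.
from math import sqrt

d, m = 100, 10_000           # illustrative complexity measure and sample count
for C in (1.0, 2.0):
    bound = C * sqrt(d / m)  # same O(sqrt(d/m)) rate, different constant
    print(f"C = {C}: error bound = {bound:.0%}")

# C = 1.0 gives a 10% guarantee, C = 2.0 gives 20%. Both are O(sqrt(d/m)),
# but for many applications 10% error is usable and 20% is not -- exactly
# the "you're fired" distinction above.
```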

We can’t say “never use O()”, because sometimes abstracting away details is the right thing to do. However, any use of O() in learning should be scrutinized for reasonableness more thoroughly than is customary in computational complexity, and correspondingly more attention should go to improving the constants in such analyses than is usual in algorithms research.

Prior, “Prior” and Bias

Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differ from one framework to another.

  1. Prior (Bayesian): A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data.
  2. “Prior” (Prediction bounds & online learning): The “prior” is a measure over a set of classifiers which expresses the degree to which you hope each classifier will predict well.
  3. Bias (Regularization, early termination of neural network training, etc…): The bias is some way, often specified implicitly by an algorithm, of preferring one predictor to another.
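To line the three notions up side by side, here is one illustrative formalization. The Occam’s razor bound is just a convenient standard example of a prediction bound; e(c,D) denotes the true error rate of c and e(c,S) the empirical error rate on an i.i.d. sample S of m points:

```latex
% An illustrative side-by-side formalization of the three notions.
\begin{itemize}
  \item Prior: a distribution $P$ over hypotheses, updated by Bayes' rule,
        $P(h \mid S) \propto P(S \mid h)\, P(h)$.
  \item ``Prior'': a measure $q$ over classifiers entering a bound such as the
        Occam's razor bound: with probability at least $1-\delta$, for all $c$,
        $e(c,D) \le e(c,S) + \sqrt{\frac{\ln(1/q(c)) + \ln(1/\delta)}{2m}}$.
  \item Bias: a penalty $R$ on predictors, chosen explicitly or implicitly, e.g.
        $\hat{c} = \arg\min_c\; e(c,S) + \lambda\, R(c)$.
\end{itemize}
```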

This only scratches the surface; there are yet more subtleties. For example, the notion of probability itself (as mentioned in the earlier discussion of the meaning of probability) shifts from one viewpoint to another.

Regularization

Yaroslav Bulatov says that we should think about regularization a bit. It’s a complex topic which I only partially understand, so I’ll try to explain it from a couple of viewpoints.

  1. Functionally: Regularization means optimizing some representation to fit the data while also minimizing some notion of predictor complexity. This notion of complexity is often the l1 or l2 norm on a set of parameters, but the term can be used much more generally. Empirically, this often works much better than simply fitting the data. (A small sketch of this view appears after this list.)
  2. Statistical Learning Viewpoint: Regularization is about the failure of statistical learning theory to adequately predict generalization error. Let e(c,D) be the expected error rate of classifier c with respect to D and e(c,S) the observed error rate on a sample S. There are numerous bounds of the form: assuming i.i.d. samples, with high probability over the drawn sample S, e(c,D) ≤ e(c,S) + f(complexity), where complexity is some measure of the size of a set of functions. Unfortunately, we have never convincingly nailed down the exact value of f(). We can note that f() is always monotonically increasing in the complexity measure, so there exists a unique constant C such that f(complexity) = C*complexity at the value of complexity which minimizes the bound. Empirical parameter tuning, such as of the C constant in a support vector machine, can be regarded as searching for this “right” tradeoff.
  3. Computationally: Regularization can be thought of as a computational shortcut to computing the f() above. Hence, smoothness, convexity, and other computational constraints are important issues.
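As a concrete instance of the functional view, here is a minimal self-contained sketch. It uses synthetic data and an l2 (ridge) penalty as my choice of illustration, with the tradeoff constant tuned on a held-out set:

```python
# Sketch: fit the data while penalizing an l2 notion of predictor complexity
# (ridge regression), then pick the tradeoff constant by held-out validation.
# All data here is synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 30
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = 1.0                          # only a few relevant features
y = X @ w_true + 0.5 * rng.normal(size=n)  # noisy labels

X_tr, y_tr, X_va, y_va = X[:40], y[:40], X[40:], y[40:]

def ridge_fit(X, y, lam):
    # argmin_w ||Xw - y||^2 + lam * ||w||^2  (closed form)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w = ridge_fit(X_tr, y_tr, lam)
    print(f"lambda={lam:6.1f}  train={mse(w, X_tr, y_tr):.3f}  val={mse(w, X_va, y_va):.3f}")

# Typically lambda=0 fits the training data best but validates worse than some
# intermediate lambda: regularizing often works better than simply fitting.
```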

One thing which should be clear is that there is no one best method of regularization for all problems. “What is a good regularizer for my problem?” is another “learning complete” question, since solving it perfectly implies solving the learning problem. (For example, consider the “regularizer” which assigns complexity 0 to the best prediction function and infinity to all others.) Similarly, “What is an empirically useful regularizer?” is like asking “What is a good learning algorithm?” The choice of regularizer used when solving empirical problems is a degree of freedom with which prior information and biases can be incorporated in order to improve performance.

Antilearning: When proximity goes bad

Joel Predd mentioned “Antilearning” by Adam Kowalczyk, which is interesting from a foundational intuitions viewpoint.

There is a pervasive intuition that “nearby things tend to have the same label”. This intuition is instantiated in SVMs, nearest neighbor classifiers, decision trees, and neural networks. It turns out there are natural problems where this intuition is the opposite of the truth.

One natural situation where this occurs is in competition. For example, when Intel fails to meet its earnings estimate, is this evidence that AMD is doing badly also? Or evidence that AMD is doing well?

This violation of the proximity intuition means that when the number of examples is small, negating a classifier which attempts to exploit proximity can provide predictive power (hence the term “antilearning”).
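A tiny toy construction (invented here, not one of the datasets from the paper) makes this concrete: 1-D points with strictly alternating labels, so every point’s nearest neighbors carry the opposite label. Plain 1-nearest-neighbor then gets every leave-one-out prediction wrong, and its negation gets every one right:

```python
# Toy illustration of the proximity intuition failing: points with strictly
# alternating labels, so each point's nearest neighbors have the opposite label.
import numpy as np

X = np.arange(10, dtype=float)                 # points 0, 1, ..., 9
y = np.where(X.astype(int) % 2 == 0, 1, -1)    # labels alternate +1, -1, +1, ...

def nn_predict(x, X_train, y_train):
    # plain 1-nearest-neighbor prediction
    return y_train[np.argmin(np.abs(X_train - x))]

# leave-one-out evaluation of 1-NN and of its negation ("antilearning")
correct_nn, correct_anti = 0, 0
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    pred = nn_predict(X[i], X[mask], y[mask])
    correct_nn += (pred == y[i])
    correct_anti += (-pred == y[i])

print(f"1-NN leave-one-out accuracy:   {correct_nn / len(X):.0%}")   # 0%
print(f"negated 1-NN ('antilearning'): {correct_anti / len(X):.0%}") # 100%
```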