Breaking Abstractions

Sam Roweis’s comment reminds me of a more general issue that comes up in doing research: abstractions always break.

  1. Real numbers aren’t. Most real numbers cannot be represented by any machine. One implication is that many real-number-based algorithms run into difficulties when implemented with floating point numbers (see the sketch after this list).
  2. The box on your desk is not a Turing machine. A Turing machine can compute anything computable, given sufficient time. A typical computer fails terribly when the state required for the computation exceeds some limit.
  3. Nash equilibria aren’t equilibria. This comes up when trying to predict human behavior based on the result of an equilibrium computation. Often, it doesn’t work.
  4. The probability isn’t. Probability is an abstraction expressing either our lack of knowledge (the Bayesian viewpoint) or fundamental randomization (the frequentist viewpoint). From the frequentist viewpoint the lack of knowledge typically precludes actually knowing the fundamental randomization. From the Bayesian viewpoint, precisely specifying our lack of knowledge is extremely difficult and typically not done.
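
As a concrete illustration of the first break, here is a minimal Python sketch (the particular numbers are only illustrative) of real-number identities failing in 64-bit floating point:

    # Real numbers vs. 64-bit floats: three identities that fail.
    print(0.1 + 0.2 == 0.3)                          # False: none of these values has an exact binary representation

    # Addition is not associative in floating point.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))    # False

    # An update far smaller than the accumulator is silently lost, which is
    # one way iterative real-number algorithms stall when implemented naively.
    w = 1e8
    w += 1e-9
    print(w == 1e8)                                  # True: the increment was absorbed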

So, what should we do when we learn that our basic tools can break? The answer, of course, is to keep using them until something better comes along. However, the uncomfortable knowledge that our tools can break is useful in a few ways:

  1. When considering a new abstraction, the existence of a break does not imply that it is a useless abstraction. (Just as the existence of the breaks above does not imply that they are useless abstractions.)
  2. When using an abstraction in some new way, we must generally consider “is this a reasonable use?”
  3. Some tools break more often than other tools. We should actively consider the “rate of breakage” when deciding amongst tools.

Bad Reviewing

This is a difficult subject to talk about for many reasons, but a discussion may be helpful.

Bad reviewing is a problem in academia. The first step in understanding this is admitting to the problem, so here is a short list of examples of bad reviewing.

  1. Reviewer disbelieves a theorem’s proof (ICML), or disbelieves a theorem, offering a trivially false counterexample (COLT).
  2. Reviewer internally swaps quantifiers in a theorem, concludes it has been done before and is trivial. (NIPS)
  3. Reviewer believes a technique will not work despite experimental validation. (COLT)
  4. Reviewers fail to notice a flaw in a theorem statement (CRYPTO).
  5. Reviewer erroneously claims that it has been done before, complete with references! (NIPS, SODA, JMLR)
  6. Reviewer inverts the message of a paper and concludes it says nothing important. (NIPS*2)
  7. Reviewer fails to distinguish between a DAG and a tree (SODA).
  8. Reviewer is enthusiastic about the paper but clearly does not understand it (ICML).
  9. Reviewer erroneously believes that the “birthday paradox” is relevant (CCS).

The above covers only cases where there were sufficient reviewer comments to actually understand the reviewer’s failure mode. Many reviewers fail to leave sufficient comments, and it’s easy to imagine they commit similar mistakes.

Bad reviewing should be clearly distinguished from rejections—note that some of the above examples are actually accepts.

The standard psychological reaction to any rejected paper is trying to find fault with the reviewers. You, as a paper writer, have invested significant work (weeks? months? years?) in the process of creating a paper, so it is extremely difficult to step back and read the reviews objectively. One characteristic distinguishing a bad review from a mere rejection is that it still bothers you years later.

If we accept that bad reviewing happens and want to address the issue, we are left with a very difficult problem. Many smart people have thought about improving this process, yielding the system we observe now. There are many subtle issues here and several solutions that (naively) appear obvious don’t work.

Fast Physics for Learning

While everyone is silently working on ICML submissions, I found this discussion about a fast physics simulator chip interesting from a learning viewpoint. In many cases, learning attempts to predict the outcome of physical processes. Access to a fast simulator for these processes might be quite helpful in predicting the outcome. Bayesian learning in particular may directly benefit while many other algorithms (like support vector machines) might have their speed greatly increased.
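
To make the Bayesian point concrete, here is a minimal sketch (in Python, with a hypothetical simulate function standing in for the fast hardware simulator) of rejection-style approximate Bayesian computation. Posterior inference here reduces almost entirely to running the forward simulator many times, so the simulator’s speed is the bottleneck:

    import random

    def simulate(theta):
        """Stand-in for a fast physics simulator: a forward model from parameters to an outcome."""
        # Hypothetical toy dynamics; a hardware simulator call would replace this line.
        return theta * 9.8 + random.gauss(0.0, 0.1)

    def rejection_abc(observed, prior_sample, tolerance, n_draws):
        """Keep parameter draws whose simulated outcome lands near the observation."""
        accepted = []
        for _ in range(n_draws):
            theta = prior_sample()
            if abs(simulate(theta) - observed) < tolerance:
                accepted.append(theta)
        return accepted  # samples from an approximate posterior

    # Every posterior sample costs at least one simulator call, so a
    # several-orders-of-magnitude faster simulator buys a correspondingly
    # larger n_draws (or a tighter tolerance) in the same wall-clock time.
    posterior = rejection_abc(observed=4.9,
                              prior_sample=lambda: random.uniform(0.0, 1.0),
                              tolerance=0.05,
                              n_draws=100000)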

The biggest drawback is that writing software for these odd architectures is always difficult and time consuming, but a several-orders-of-magnitude speedup might make that worthwhile.

Funding Research

The funding of research (and machine learning research) is an issue which seems to have become more significant in the United States over the last decade. The word “research” is applied broadly here to science, mathematics, and engineering.

There are two essential difficulties with funding research:

  1. Longshot: Paying a researcher is often a big gamble. Most research projects don’t pan out, but a few big payoffs can make it all worthwhile.
  2. Information Only: Much of research is about finding the right way to think about or do something, so the primary output is information.

The Longshot difficulty means that there is high variance in payoffs. This can be compensated for by funding many different research projects, reducing variance.
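
As a rough illustration of the variance-reduction point (toy payoff numbers, assuming independent projects), a short Python simulation of a longshot portfolio:

    import random
    import statistics

    def project_payoff():
        """Toy longshot: a 5% chance of a payoff of 100, otherwise 0 (expected value 5)."""
        return 100.0 if random.random() < 0.05 else 0.0

    def portfolio_average(n_projects):
        return sum(project_payoff() for _ in range(n_projects)) / n_projects

    for n in (1, 10, 100, 1000):
        runs = [portfolio_average(n) for _ in range(2000)]
        # The mean stays near 5 while the spread shrinks roughly as 1/sqrt(n).
        print(n, round(statistics.mean(runs), 2), round(statistics.stdev(runs), 2))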

The Information-Only difficulty means that it’s hard to extract a profit directly from many types of research, so companies have difficulty justifying basic research. (Patents are one mechanism for capturing such profit, but they are often extraordinarily clumsy or simply not applicable.)

These two difficulties together imply that research is often chronically underfunded compared to what would be optimal for any particular nation. They also imply that funding research makes more sense for larger nations, and more sense as government (rather than private) investment.

The United States has a history of significant research, and significant benefits from research, but that seems to be under attack.

  1. Historically, the old phone monopoly ran Bell Labs, which employed thousands doing science and engineering research. It made great sense for them because research was a place to stash money (and evade regulators) that might have some return. They invented the transistor, the laser, and Unix. With the breakup of the phone monopoly, this no longer made sense, so the lab has been broken apart and its research staff has shrunk by orders of magnitude.
  2. On a smaller scale, Xerox PARC (inventor of the mouse, Ethernet, and other basic bits of computer technology) has been radically scaled back.
  3. IBM and HP, which have historically been strong funders of computer-related research, have been forced to shift towards more directed research. (Some great research still happens at these places, but the overall trend seems clear.)
  4. The NSF has had funding cut.

What’s Left
The new monopoly on the block is Microsoft, which has been a significant funder of new research, some of which is basic. IBM is still managing to do some basic research. Many companies are funding directed research (with short-term expected payoffs). The NSF still has a reasonable budget, even if it is a bit reduced. Many other branches of the government fund directed research of one sort or another. From the perspective of a researcher, this isn’t as good as NSF funding because it is “money with strings attached”, including specific topics, demos, etc.

Much of the funding available falls into two or three categories: directed into academia, very directed, or both. These have difficulties from a research viewpoint.

  1. Into Academia: The difficulty with funding directed into academia is that the professors it is directed at are incredibly busy with nonresearch. Teaching and running a university are full-time jobs. It takes an extraordinary individual to manage all of this and get research done. (Amazingly, many people do manage, but the workload can be crushing.) From the perspective of funding research, this is problematic, because the people being funded are forced to devote much time to nonresearch. As an example from machine learning, AT&T inherited the machine learning group from Bell Labs, consisting of Leon Bottou, Michael Collins, Sanjoy Dasgupta, Yoav Freund, Michael Kearns, Yann LeCun, David McAllester, Robert Schapire, Satinder Singh, Peter Stone, Rich Sutton, Vladimir Vapnik (and maybe a couple more I’m forgetting). The environment there was almost pure research without other responsibilities. It would be extremely difficult to argue that a similar-sized group drawn randomly from academia has had as significant an impact on machine learning. This group is gone now, scattered to many different locations.
  2. Very directed: It’s a basic fact of research that it is difficult to devote careful and deep attention to something that does not interest you. Forcing researchers to attempt it simply means that many of them aren’t efficient. (I’m not arguing against any direction here. It makes sense for any nation to invest in directions which seem important.)

The Good News (for researchers, anyways)
The good news at the moment is outside of the US. NICTA, in Australia, seems to be a well-made attempt to do research right. India and China are starting to fund research more, and Japan is starting to fund basic research more. With the rise of the EU, more funding for research makes sense because the benefit applies to a much larger pool of people. In machine learning, this is being realized with the PASCAL project. On the engineering side, centers like the Mozilla Foundation and OSDL (which are funded by corporate contributions) provide some funding for open source programmers.

We can hope for improvements in the US; there is certainly room for them. For example, the NSF budget is roughly 0.3% of the Federal government budget, so the budgetary impact of substantially more funding for basic research would be relatively trivial in the big picture. However, it’s never easy to trade off immediate needs against the silent loss of the future.

The Big O and Constants in Learning

The notation g(n) = O(f(n)) means that there exist constants C and n0 such that g(n) ≤ C f(n) for all n ≥ n0; in other words, g eventually grows no faster than a constant multiple of f. For example, 7n log(n) + 50n is O(n log(n)).

In learning theory, there are many statements about learning algorithms of the form “under assumptions x, y, and z, the classifier learned has an error rate of at most O(f(m))”.

There is one very good reason to use O(): it helps you understand the big picture and neglect the minor details which are not important in the big picture. However, there are some important reasons not to do this as well.

  1. Unspeedup: In algorithm analysis, the use of O() for time complexity is pervasive and well-justified, because the exact value of C is inherently dependent on the computer architecture. (The “C” for x86 processors might differ from the “C” on PowerPC processors.) Since many learning theorists come from a CS theory background, the O() notation gets applied to generalization error as well. The O() abstraction breaks here: you cannot generally change the learning algorithm and decrease your error rate by some constant factor in the way buying a faster machine decreases running time.
  2. You’re fired: When you run a learning algorithm to acquire some predictor, the performance it achieves is a quantity of significant interest. Using an algorithm with merely twice the error rate of a better algorithm can easily make the difference between “it works” and “it doesn’t” (see the sketch after this list). Constant factors matter far less often when programming, as evidenced by the large number of people who use computationally inefficient languages to solve their problems.
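
A small worked example of the constants issue (the bound shape and the numbers below are illustrative, not drawn from any particular theorem): a bound of the form O(sqrt(d/m)) says nothing about whether the hidden constant is 0.5 or 10, and at realistic sample sizes that difference decides whether the guarantee is useful or vacuous.

    import math

    def bound(constant, d, m):
        """Illustrative generalization bound of the form: error <= constant * sqrt(d / m)."""
        return constant * math.sqrt(d / m)

    d, m = 100, 10000   # hypothetical complexity measure and sample size
    for C in (0.5, 1.0, 5.0, 10.0):
        # Every row below is "O(sqrt(d/m))"; only the constant differs.
        print(f"C = {C:4}: error bound = {bound(C, d, m):.2f}")
    # C = 0.5 gives 0.05 (plausibly useful); C = 10.0 gives 1.00 (vacuous).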

We can’t say “never use O()”, because sometimes abstracting away details is the right thing to do. However, any use of O() in a learning bound should be analyzed for “reasonableness” more thoroughly than is customary in computational complexity. Similarly, more attention should be paid to improving the constants in such analyses than is usual in algorithm analysis.