The unrealized potential of the research lab

I attended the IBM research 60th anniversary. IBM research is, by any reasonable account, the industrial research lab which has managed to bring the most value to its parent company over the long term. This can be seen by simply counting the survivors: IBM research is the only older research lab which has not gone through a period of massive firing. (Note that there are also new research labs.)

Despite this impressive record, IBM research has failed, by far, to achieve its potential. Examples which came up in this meeting include:

  1. It took about a decade for IBM to produce DRAM after it was invented in the lab. (In fact, Intel produced it first.)
  2. Relational databases and SQL were invented and then languished. It was only under external competition that IBM released its own relational database. Why didn’t IBM grow an Oracle division?
  3. An early lead in IP networking hardware did not result in IBM growing a Cisco division. Why not?

And remember … IBM research is a stark success story compared to its competitors.

Why is there such a pervasive failure to recognize the really big things when they come along in a research lab? This failure is huge by capitalization standards: several ideas created in IBM labs have resulted in companies with a stock market valuation greater than IBM’s. This is also of fundamental concern to researchers everywhere, because that failure is much of the reason why research is chronically underfunded.

A reasonable argument is “it’s much harder to predict the big ones in advance than in hindsight”. This is certainly true, but I don’t believe that is a full explanation. Too many ideas have succeeded because someone else outside of the original company recognized the value of the idea and exploited it. There is no fundamental reason why VCs should have an inherent advantage at recognizing good ideas within a research lab over the company running that lab.

Muthu Muthukrishnan says “it’s the incentives”. In particular, people who invent something within a research lab have little personal incentive in seeing its potential realized, so they fail to pursue it as vigorously as they might in a startup setting. This is a very reasonable argument: incentives for success at a research lab are typically a low percentage of base salary, while startup founders have been known to become billionaires.

A third argument is that researchers at a research lab are likely to find new and better ways of doing things that the company already does. This creates a problem because a large fraction of the company is specifically invested in the current-to-become-old way of doing things. When a debate happens, it is always easy to find the faults and drawbacks of the new method while ignoring the accepted faults of the old (this is common to new directions in research as well).

None of these three reasons are fundamentally unremovable, and it seems plausible that the first industrial research lab to remove these obstacles will reap huge benefits. My current candidates for ‘most likely to remove barriers’ are google and NICTA.

Predictive Search is Coming

“Search” is the other branch of AI research which has been successful. Concrete examples include Deep Blue, which beat the world chess champion, and Chinook, the champion checkers program. A set of core search techniques exists, including A*, alpha-beta pruning, and others, which can be applied to many different search problems.
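For concreteness, here is a minimal sketch of alpha-beta pruning on a toy game tree. It is only an illustration of the core technique named above, not a description of how Deep Blue or Chinook were actually engineered.

# Alpha-beta pruning on a toy game tree: internal nodes are lists of
# children, leaves are static evaluations. Purely illustrative.
def alphabeta(node, alpha, beta, maximizing):
    if not isinstance(node, list):        # leaf: return its evaluation
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:             # beta cutoff: the opponent avoids this line
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:             # alpha cutoff
                break
        return value

tree = [[3, 5], [2, 9], [0, 7]]           # a depth-2 game with leaf values
print(alphabeta(tree, float("-inf"), float("inf"), True))   # prints 3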

Given this, it may be surprising to learn that there has been relatively little successful work on combining prediction and search. Given also that humans typically solve search problems using a number of predictive heuristics to narrow in on a solution, we might be surprised again. However, the big successful search-based systems have typically not used “smart” search algorithms. Instead, they have optimized for very fast search. This is not for lack of trying… many people have tried to synthesize search and prediction with varying degrees of success. For example, Knightcap achieves good-but-not-stellar chess playing performance, and TD-gammon has achieved near-optimal Backgammon play.
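To give a flavor of the prediction side, here is a minimal TD(0)-style update with a linear position evaluator. TD-gammon itself used a neural network trained by self-play, so the feature map and trajectory below are made-up placeholders standing in for a real game.

# TD(0) with a linear value function: move V(state) toward
# reward + gamma * V(next_state), bootstrapping from the current estimate.
def features(state):
    return [1.0, float(state)]            # hypothetical feature map

def value(w, state):
    return sum(wi * fi for wi, fi in zip(w, features(state)))

def td0_update(w, state, next_state, reward, terminal, lr=0.1, gamma=1.0):
    bootstrap = 0.0 if terminal else value(w, next_state)
    error = reward + gamma * bootstrap - value(w, state)
    return [wi + lr * error * fi for wi, fi in zip(w, features(state))]

w = [0.0, 0.0]
trajectory = [0, 1, 2, 3]                 # a fake sequence of positions
for i, (s, s_next) in enumerate(zip(trajectory, trajectory[1:])):
    terminal = (i == len(trajectory) - 2)
    reward = 1.0 if terminal else 0.0     # a win signal at the end of the game
    w = td0_update(w, s, s_next, reward, terminal)
print(w)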

The basic claim of this post is that we will see more and stronger successes of the sort exemplified by Knightcap and TD-gammon and less of those exemplified by the “pure” search techniques. Here are two reasons:

  1. The strengths of computers are changing. Computers have long held a huge advantage over humans in two areas:

    1. Short term memory. A typical human can remember 7ish random numbers presented in a sequence. A computer can easily remember 10^7 random numbers. This strength is continuing to grow.
    2. Sequential execution speed. What I mean by ‘sequential execution speed’ is simply time ordered instruction execution with arbitrary data dependencies. Neurons in a human brain might be able to fire at a rate of 10^2 or 10^3 cycles/second (depending on who you ask), while computers can run at 10^9 cycles/second. Sequential execution speed growth appears stalled for the moment due to heat issues.

    These two points of excellence are precisely what search algorithms typically require for good performance, which partially explains why the human-preferred method of search is so different from the computer-preferred method. The advantages of humans have traditionally been:

    1. Significant long term memory. For example, humans might have a significant advantage using a smarter search technique because they can use previous search-based successes and failures to bias search on new problems. Computers are gaining this advantage as well due to massive hard drives and networking.
    2. Massively parallel hardware. While the speed at which brains work may not be very impressive, the total amount of computation done by a brain is enormous. For example, estimates of 10^15 neural connections are credible. Chip designers, who are stymied in increasing sequential execution speed, are parallelizing via speculative execution, special instruction sets, “hyperthreading”, dual core processors, SMP machines, and clusters. It will be a while before computers can execute as many parallel instructions as a human brain, but perhaps not as long as you might expect: the Cell processor claims 0.25*10^12 floating point operations per second.

    Putting these observations together, we see that computers are becoming more ‘rounded’ in their strengths, which suggests that highly parallelizable algorithms using large quantities of memory may become more viable solutions for search problems. These qualities of ‘highly parallelizable’ and ‘incorporating long term memory’ characterize many of our prediction algorithms.

  2. We have gained a finer understanding of what learning is and how it can take place.

    In the beginning, on the applied side, learning was simply understood at a ‘type’ level with a relatively sparse set of types such as ‘binary classification’ and ‘multiclass classification’. Evidence of this can be seen in the UCI machine learning repository where most learning problems are ‘projected’ (via throwing away extra information) into binary or multiclass prediction even when the original problem was significantly different.

    What we have learned since then is that it is critical to consider:

    1. The natural distribution over instances. What is meant here is that instances come from some process and it is only necessary to perform well with respect to the samples produced by that process. This observation guides where learning machines should focus their attention.

      At the theoretical level, it is difficult to even think about learning problems without considering some natural distribution over the instances we need to predict. At the practical level, there are many cases where people have enjoyed greatly improved success simply by using the natural distribution. This occurs repeatedly in the learning reductions analysis (an exemplar is here), but this core observation is common elsewhere as well.

    2. Richer classification problems. It turns out that many of the details lost in translating a problem to multiclass or binary classification are very important. For example, if one failure mode is much more severe than another, exposing that fact to the learning machine can make a huge difference in performance (a small sketch of this appears after the list). We now have a number of techniques for exposing this information to learning machines and can use much richer side information.

    The basic claim here is that understanding these two concepts is key to mixing search and prediction. Considering the ‘natural distribution’ is important because search systems create their own distribution over instances. This property of ‘self-creating distribution’ implies that it is easy to misdesign a combined learning/search system. For example, it is easy to train the learning algorithm on samples drawn from the wrong process.

    The value of searching in some direction vs. another direction is not easily described as a simple classification problem. Consequently, projecting away this information, as might have been done a decade ago, can result in a large performance loss.
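Here is the small sketch promised above: a toy importance-weighted logistic regression in which the cost of erring on an example simply scales its gradient. The data, costs, and learning rate are invented for illustration; the point is only that error severity can be exposed directly to the learning machine rather than projected away.

import math, random

# Toy importance-weighted logistic regression trained by gradient descent.
# Each example carries a cost which multiplies its gradient, so costly
# mistakes push the decision boundary harder. Data and costs are made up.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, lr=0.1, epochs=100):
    # examples: list of (features, label in {0,1}, cost of erring on it)
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y, cost in examples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            g = p - y                     # gradient of the log loss
            w = [wi - lr * cost * g * xi for wi, xi in zip(w, x)]
    return w

random.seed(0)
# Two overlapping classes; errors on class-1 examples are ten times costlier.
data = [([1.0, random.gauss(-1, 1)], 0, 1.0) for _ in range(50)] + \
       [([1.0, random.gauss(+1, 1)], 1, 10.0) for _ in range(50)]
print(train(data))                        # bias and slope of the learned separator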

We have a winner

The DARPA Grand Challenge is a big contest for autonomous robot vehicle driving. It was first run in 2004, and all teams did badly. This year was notably different, with the Stanford and CMU teams successfully completing the course. A number of details are here and wikipedia has continuing coverage.

A formal winner hasn’t been declared yet although Stanford completed the course quickest.

The Stanford and CMU teams deserve a large round of applause as they have strongly demonstrated the feasibility of autonomous vehicles.

The good news for machine learning is that the Stanford team (at least) is using some machine learning techniques.

On-line learning of regular decision rules

Many decision problems can be represented in the form
FOR n=1,2,…:
— Reality chooses a datum x_n.
— Decision Maker chooses his decision d_n.
— Reality chooses an observation y_n.
— Decision Maker suffers loss L(y_n, d_n).
END FOR.
The observation y_n can be, for example, tomorrow’s stock price and the decision d_n the number of shares Decision Maker chooses to buy. The datum x_n ideally contains all information that might be relevant in making this decision. We do not want to assume anything about the way Reality generates the observations and data.
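The protocol translates directly into code. Here is a minimal sketch in which Reality and Decision Maker are made-up placeholders; the only purpose is to fix the order of moves and the bookkeeping of loss, using the absolute loss that appears below.

import random

# The decision protocol above, with a toy Reality and a trivial
# Decision Maker that just repeats the previous observation.
random.seed(0)
total_loss = 0.0
previous_y = 0.5
for n in range(1, 11):
    x_n = random.random()                  # Reality chooses a datum x_n
    d_n = previous_y                       # Decision Maker chooses a decision d_n
    y_n = (x_n + random.random()) / 2      # Reality chooses an observation y_n
    total_loss += abs(y_n - d_n)           # loss L(y_n, d_n) = |y_n - d_n|
    previous_y = y_n
print(total_loss)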

Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D(x). Can we perform as well, or almost as well, as D, without knowing it? This is essentially a special case of the problem of on-line learning.

Here is a simple result of this kind. Suppose the data x_n are taken from [0,1] and L(y,d) = |y - d|. A norm ||h|| of a function h on [0,1] is defined by

||h||^2 = (Integral_0^1 h(t) dt)^2 + Integral_0^1 (h'(t))^2 dt.
Decision Maker has a strategy that guarantees, for all N and all D with finite ||D||,
Sum_{n=1}^N L(y_n, d_n) is at most Sum_{n=1}^N L(y_n, D(x_n)) + (||2D - 1|| + 1) N^{1/2}.
Therefore, Decision Maker is competitive with D provided the “mean slope” (Integral_0^1 (D'(t))^2 dt)^{1/2} of D is significantly less than N^{1/2}. This can be extended to general reproducing kernel Hilbert spaces of decision rules and to many other loss functions.

It is interesting that the only known way of proving this result uses a non-standard (although very old) notion of probability. The standard definition (“Kolmogorov’s axiomatic”) is that probability is a normalized measure. The alternative is to define probability in terms of games rather than measure (this was started by von Mises and greatly advanced by Ville, who in 1939 replaced von Mises’s awkward gambling strategies with what he called martingales). This game-theoretic probability allows one to state the most important laws of probability as continuous strategies for gambling against the forecaster: the gambler becomes rich if the forecasts violate the law. In 2004 Akimichi Takemura noticed that for every such strategy the forecaster can play in such a way as not to lose a penny. Takemura’s procedure is simple and constructive: for any continuous law of probability we have an explicit forecasting strategy that satisfies that law perfectly. We call this procedure defensive forecasting. Takemura’s version was about binary observations, but it has been greatly extended since.

Now let us see how we can achieve the goal Sum_{n=1}^N L(y_n, d_n) < Sum_{n=1}^N L(y_n, D(x_n)) + (something small). This would be easy if we knew the true probabilities for the observations. At each step we would choose the decision with the smallest expected loss and the law of large numbers would allow us to prove that our goal would be achieved. Alas, we do not know the true probabilities (if they exist at all). But with defensive forecasting at our disposal we do not need them: the fake probabilities produced by defensive forecasting are ideal as far as the law of large numbers is concerned. The rest of the proof is technical details.
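To make the “smallest expected loss” step concrete, here is a tiny sketch for binary observations and the loss L(y,d) = |y - d| used above. The forecast probabilities fed to it are made-up placeholders standing in for the output of defensive forecasting, which is not implemented here.

# Given a forecast probability p that y = 1, expected absolute loss is
# p*|1 - d| + (1 - p)*|0 - d|, which is linear in d, so the minimizer
# sits at an endpoint of [0,1].
def best_decision(p):
    return 1.0 if p > 0.5 else 0.0

forecasts = [0.2, 0.7, 0.9, 0.4]           # placeholders, not defensive forecasts
observations = [0, 1, 1, 1]                # Reality's binary observations
total = sum(abs(y - best_decision(p)) for p, y in zip(forecasts, observations))
print(total)                               # 1.0: only the last round is wrong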

Not ICML

Alex Smola showed me this ICML 2006 webpage. This is NOT the ICML we know, but rather some people at “Enformatika”. Investigation shows that they registered with an anonymous yahoo email account from dotregistrar.com, the “Home of the $6.79 wholesale domain!”, and that their nameservers are run by Turkticaret, a Turkish internet company.

It appears the website has since been altered to “ICNL” (the above link uses the google cache).

They say that imitation is the sincerest form of flattery, so the organizers of the real ICML 2006 must feel quite flattered.