John Langford – Page 76 – Machine Learning (Theory)

10/20/200510/20/2005

Machine Learning in the News

The New York Times had a short interview about machine learning in datamining being used pervasively by the IRS and large corporations to predict who to audit and who to target for various marketing campaigns. This is a big application area of machine learning. It can be harmful (learning + databases = another way to invade privacy) or beneficial (as google demonstrates, better targeting of marketing campaigns is far less annoying). This is yet more evidence that we can not rely upon “I’m just another fish in the school” logic for our expectations about treatment by government and large corporations.

10/19/200510/19/2005

Workshop: Atomic Learning

We are planning to have a workshop on atomic learning Jan 7 & 8 at TTI-Chicago.

Details are here.

The earlier request for interest is here.

The primary deadline is abstracts due Nov. 20 to jl@tti-c.org.

10/13/200510/14/2005

Site tweak

Several people have had difficulty with comments which seem to have an allowed language significantly poorer than posts. The set of allowed html tags has been increased and the markdown filter has been put in place to try to make commenting easier. I’ll put some examples into the comments of this post.

10/12/200510/12/2005

The unrealized potential of the research lab

I attended the IBM research 60th anniversary. IBM research is, by any reasonable account, the industrial research lab which has managed to bring the most value to it’s parent company over the long term. This can be seen by simply counting the survivors: IBM research is the only older research lab which has not gone through a period of massive firing. (Note that there are also new research labs.)

Despite this impressive record, IBM research has failed, by far, to achieve it’s potential. Examples which came up in this meeting include:

It took about a decade to produce DRAM after it was invented in the lab. (In fact, Intel produced it first.)
Relational databases and SQL were invented and then languished. It was only under external competition that IBM released it’s own relational database. Why didn’t IBM grow an Oracle division?
An early lead in IP networking hardware did not result in IBM growing a Cisco division. Why not?

And remember … IBM research is a stark success story compared to it’s competitors.

Why is there such a pervasive failure to recognize the really big things when they come along in a research lab? This is a huge by capitolization standards: several ideas created in IBM labs have resulted in companies with a stock market valuation greater than IBM. This is also of fundamental concern to researchers everywhere, because that failure is much of the reason why research is chronically underfunded.

A reasonable argument is “it’s much harder to predict the big ones in advance than in hindsight”. This is certainly true, but I don’t believe that is a full explanation. Too many ideas have succeeded because someone else outside of the orignal company recognized the value of the idea and exploited it. There is no fundamental reason why VCs should have an inherent advantage over the company running a research lab at recognizing good ideas within it’s own research lab.

Muthu Muthukrishnan says “it’s the incentives”. In particular, people who invent something within a research lab have little personal incentive in seeing it’s potential realized so they fail to pursue it as vigorously as they might in a startup setting. This is a very reasonable argument: incentives for success at a research lab are typically a low percentage of base salary while startup founders have been known to become billionaires.

A third argument is that researchers at a research lab are likely to find new and better ways of doing things that the company already does. This creates a problem because a large fraction of the company is specifically invested in the current-to-become-old way of doing things. When a debate happens, it is always easy to find the faults and drawbacks of the new method while ignoring the accepted faults of the old (this is common to new directions in research as well).

None of these three reasons are fundamentally unremovable, and it seems plausible that the first industrial research lab to remove these obstacles will reap huge benefits. My current candidates for ‘most likely to remove barriers’ are google and NICTA.

10/10/200510/10/2005

Predictive Search is Coming

“Search” is the other branch of AI research which has been succesful. Concrete examples include Deep Blue which beat the world chess champion and Chinook the champion checkers program. A set of core search techniques exist including A^*, alpha-beta pruning, and others that can be applied to any of many different search problems.

Given this, it may be surprising to learn that there has been relatively little succesful work on combining prediction and search. Given also that humans typically solve search problems using a number of predictive heuristics to narrow in on a solution, we might be surprised again. However, the big successful search-based systems have typically not used “smart” search algorithms. Insteady they have optimized for very fast search. This is not for lack of trying… many people have tried to synthesize search and prediction to various degrees of success. For example, Knightcap achieves good-but-not-stellar chess playing performance, and TD-gammon has achieved near-optimal Backgammon play.

The basic claim of this post is that we will see more and stronger successes of the sort exemplified by Knightcap and TD-gammon and less of those exemplified by the “pure” search techniques. Here are two reasons:

The strengths of computers are changing. Computers have long held a huge advantage over humans in two areas:
1. Short term memory. A typical human can remember 7ish random numbers presented in a sequence. A computer can easily remember 10⁷ random numbers. This strength is continuing to grow.
2. Sequential execution speed. What I mean by ‘sequential execution speed’ is simply time ordered instruction execution with arbitrary data dependencies. Neurons in a human brain might be able to fire at a rate of 10² or 10³ cycles/second (depending on who you ask), while computers can run at 10⁹ cycles/second. Sequential execution speed growth appears stalled for the moment due to heat issues.
These two points of excellence are precisely what search algorithms typically require for good performance which partially explains why the human-preferred method of search is so different from the computer-preferred method. The advantages of humans have traditionally been:
1. Significant long term memory For example humans might have a significant advantage using a smarter search technique becaue they can use previous search-based successes and failures to bias search on new problems. Computers are gaining this advantage as well due to massive hard drives and networking.
2. Massively parallel hardware While the speed that brains work at may not be very significant, the total amount of computation done by a brain is enormous. For example, estimates of 10¹⁵ neural connections are credible. Chip designers, who are stymied in the production of sequential execution speed are parallelizing via speculative execution, special instruction sets, “hyperthreading”, dual core processors, SMP machines, and clusters. It will be awhile before computers can execute as many parallel instructions as computer processors, but not as long as you might expect—the cell processor claims 0.25*10¹² floating point operations per second.
Putting these observations together, we see that computers are becoming more ’rounded’ in their strengths which suggests highly parallelizable algorithms using large quantities of memory may become more viable solutions for search problems. These qualities of ‘highly parallelizable’ and ‘incorporating long term memory’ characterize many of our prediction algorithms.
We have gained a finer understanding of what learning is and how it can take place.
In the beginning, on the applied side, learning was simply understood at a ‘type’ level with a relatively sparse set of types such as ‘binary classification’ and ‘multiclass classification’. Evidence of this can be seen in the UCI machine learning repository where most learning problems are ‘projected’ (via throwing away extra information) into binary or multiclass prediction even when the original problem was significantly different.

What we have learned since then is that it is critical to consider:
1. The natural distribution over instances. What is meant here is that instances come from some process and it is only necessary to perform well with respect to the samples produced by that process. This observation guides us to focus attention in learning machines.
  At the theoretical level, it is difficult to even think about learning problems without considering some natural distribution over the instances we need to predict. At the practical level, there are many cases where people have enjoyed grealy improved success simply by using the natural distribution. This occurs repeatedly in the learning reductions analysis (an exemplar is here), but this core observation is common elsewhere as well.
2. Richer classification problems. It turns out that many of the details lost in translation of a problem to multiclass or binary classification are very important. For example, if one failure mode is much more severe than another exposing that to the learning machine can make a huge difference in performance. We now have a number of techniques for exposing this information to learning machines and can use much richer side information.
The basic claim here is that understanding these two concepts is key to mixing search and prediction. Considering the ‘natural distribution’ is important because search systems create their own distribution over instances. This property of ‘self-creating distribution’ implies that it is easy to misdesign a combined learning/search system. For example, it is easy to train the learning algorithm on samples drawn from the wrong process.

The value of searching in some direction vs. another direction is not easily described as a simple classification problem. Consequently, projecting away this information, as might have been done a decade ago, can result in a large performance loss.