Predictive Search is Coming

“Search” is the other branch of AI research which has been successful. Concrete examples include Deep Blue, which beat the world chess champion, and Chinook, the champion checkers program. A set of core search techniques exists, including A*, alpha-beta pruning, and others, which can be applied to many different search problems.
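To make the core techniques concrete, here is a minimal sketch of alpha-beta pruning in Python. The toy game tree and the `children`/`value` placeholders are invented for illustration; systems like Deep Blue pair this style of pruning with heavily engineered move ordering and evaluation.

```python
# A minimal alpha-beta pruning sketch.  `children(s)` and `value(s)` stand in
# for a real game's move generator and static evaluation function.
def alphabeta(state, depth, alpha, beta, maximizing, children, value):
    """Return the minimax value of `state`, skipping branches that cannot
    change the final decision."""
    kids = children(state)
    if depth == 0 or not kids:
        return value(state)
    if maximizing:
        best = float("-inf")
        for child in kids:
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False,
                                       children, value))
            alpha = max(alpha, best)
            if alpha >= beta:  # remaining siblings cannot affect the decision
                break
        return best
    best = float("inf")
    for child in kids:
        best = min(best, alphabeta(child, depth - 1, alpha, beta, True,
                                   children, value))
        beta = min(beta, best)
        if alpha >= beta:
            break
    return best

# Toy tree: the maximizing player moves first; leaves carry static values.
tree = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
leaves = {"a1": 3, "a2": 5, "b1": 2, "b2": 9}
print(alphabeta("root", 3, float("-inf"), float("inf"), True,
                lambda s: tree.get(s, []), lambda s: leaves.get(s, 0)))  # prints 3
```

On this toy tree the second subtree is abandoned after its first leaf, since it can no longer beat the value already secured in the first subtree.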

Given this, it may be surprising to learn that there has been relatively little successful work on combining prediction and search. Given also that humans typically solve search problems using a number of predictive heuristics to narrow in on a solution, we might be surprised again. However, the big successful search-based systems have typically not used “smart” search algorithms. Instead, they have optimized for very fast search. This is not for lack of trying… many people have tried to synthesize search and prediction, with varying degrees of success. For example, Knightcap achieves good-but-not-stellar chess-playing performance, and TD-gammon has achieved near-optimal backgammon play.

The basic claim of this post is that we will see more and stronger successes of the sort exemplified by Knightcap and TD-gammon, and fewer of the sort exemplified by the “pure” search techniques. Here are two reasons:

  1. The strengths of computers are changing. Computers have long held a huge advantage over humans in two areas:

    1. Short-term memory. A typical human can remember 7ish random numbers presented in a sequence. A computer can easily remember 10^7 random numbers. This strength is continuing to grow.
    2. Sequential execution speed. What I mean by ‘sequential execution speed’ is simply time-ordered instruction execution with arbitrary data dependencies. Neurons in a human brain might be able to fire at a rate of 10^2 or 10^3 cycles/second (depending on who you ask), while computers can run at 10^9 cycles/second. Sequential execution speed growth appears stalled for the moment due to heat issues.

    These two points of excellence are precisely what search algorithms typically require for good performance, which partially explains why the human-preferred method of search is so different from the computer-preferred method. The advantages of humans have traditionally been:

    1. Significant long-term memory. For example, humans might have a significant advantage using a smarter search technique because they can use previous search-based successes and failures to bias search on new problems. Computers are gaining this advantage as well due to massive hard drives and networking.
    2. Massively parallel hardware. While the speed at which brains work may not be very impressive, the total amount of computation done by a brain is enormous. For example, estimates of 10^15 neural connections are credible. Chip designers, who are stymied in the production of sequential execution speed, are parallelizing via speculative execution, special instruction sets, “hyperthreading”, dual-core processors, SMP machines, and clusters. It will be awhile before computers can execute as many parallel instructions as a human brain, but not as long as you might expect: the Cell processor claims 0.25*10^12 floating point operations per second.

    Putting these observations together, we see that computers are becoming more ‘rounded’ in their strengths, which suggests that highly parallelizable algorithms using large quantities of memory may become more viable solutions for search problems. These qualities of ‘highly parallelizable’ and ‘incorporating long-term memory’ characterize many of our prediction algorithms.
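    A crude back-of-envelope comparison, using only the rough figures quoted above (and keeping in mind that a neural ‘operation’ is not directly comparable to a floating point operation), gives a feel for the size of the remaining gap:

    ```python
    # Back-of-envelope arithmetic with the rough figures quoted above.
    brain_connections = 1e15   # estimated neural connections
    firing_rate_hz = 1e2       # ~10^2 firings/second (low end of the estimate)
    cell_flops = 0.25e12       # the Cell processor's claimed peak

    brain_ops_per_sec = brain_connections * firing_rate_hz   # ~10^17
    print(f"rough gap: {brain_ops_per_sec / cell_flops:.0e}x")   # ~4e+05x
    ```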

  2. We have gained a finer understanding of what learning is and how it can take place.

    In the beginning, on the applied side, learning was understood simply at a ‘type’ level, with a relatively sparse set of types such as ‘binary classification’ and ‘multiclass classification’. Evidence of this can be seen in the UCI machine learning repository, where most learning problems are ‘projected’ (by throwing away extra information) into binary or multiclass prediction even when the original problem was significantly different.

    What we have learned since then is that it is critical to consider:

    1. The natural distribution over instances. What is meant here is that instances come from some process, and it is only necessary to perform well on the samples that process produces. This observation tells us where a learning machine should focus its attention.

      At the theoretical level, it is difficult to even think about learning problems without considering some natural distribution over the instances we need to predict. At the practical level, there are many cases where people have enjoyed greatly improved success simply by using the natural distribution. This occurs repeatedly in the learning reductions analysis (an exemplar is here), but the core observation is common elsewhere as well.

    2. Richer classification problems. It turns out that many of the details lost in translating a problem to multiclass or binary classification are very important. For example, if one failure mode is much more severe than another, exposing that to the learning machine can make a huge difference in performance (a small sketch of this idea follows below). We now have a number of techniques for exposing this information to learning machines and can use much richer side information.
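    As a toy illustration of the failure-severity point above (a hypothetical setup, not drawn from any particular system), one simple device is an importance weight per example: a mistake on a heavily weighted example costs proportionally more. An importance-weighted perceptron is about the smallest learner that can use this information:

    ```python
    # A toy importance-weighted perceptron: each example carries a weight saying
    # how costly a mistake on it is, and updates are scaled by that weight.
    def weighted_perceptron(examples, epochs=10, lr=0.1):
        """examples: list of (features, label, importance) with label in {-1, +1}."""
        dim = len(examples[0][0])
        w = [0.0] * dim
        for _ in range(epochs):
            for x, y, importance in examples:
                margin = y * sum(wi * xi for wi, xi in zip(w, x))
                if margin <= 0:  # a mistake: correct it, scaled by how costly it is
                    for i in range(dim):
                        w[i] += lr * importance * y * x[i]
        return w

    # Suppose a false negative is deemed 10x worse than a false positive:
    # positive examples then carry importance 10.
    data = [([1.0, 0.2], +1, 10.0), ([0.9, 0.1], +1, 10.0),
            ([0.2, 1.0], -1, 1.0), ([0.1, 0.8], -1, 1.0)]
    print(weighted_perceptron(data))
    ```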

    The basic claim here is that understanding these two concepts is key to mixing search and prediction. Considering the ‘natural distribution’ is important because search systems create their own distribution over instances. This ‘self-created distribution’ property makes it easy to misdesign a combined learning/search system. For example, it is easy to train the learning algorithm on samples drawn from the wrong process.

    The value of searching in some direction vs. another direction is not easily described as a simple classification problem. Consequently, projecting away this information, as might have been done a decade ago, can result in a large performance loss.
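    To make the pitfall concrete, here is a minimal sketch of a combined search/learning loop; the greedy search, the toy domain, and all helper names are invented for illustration. The evaluator is trained on the states the guided search itself visits (the distribution the system creates for itself), and the training target is a real-valued outcome rather than a projected binary label.

    ```python
    import random

    # Illustrative only: greedy search guided by a learned evaluator, which is
    # retrained on the states the guided search itself visits.
    def learn_while_searching(starts, predictor, fit, successors, features, outcome,
                              rounds=10, depth=20):
        dataset = []
        for _ in range(rounds):
            for s in starts:
                for _ in range(depth):
                    dataset.append((features(s), outcome(s)))  # the search's own distribution
                    kids = successors(s)
                    if not kids:
                        break
                    # Step toward whichever successor the current evaluator scores best.
                    s = max(kids, key=lambda k: predictor(features(k)))
            predictor = fit(dataset)  # real-valued target, not a projected binary label
        return predictor

    # Invented toy domain: states are integers, and being near 7 is good.
    def successors(s): return [s - 1, s + 1] if -20 < s < 20 else []
    def features(s): return (s,)
    def outcome(s): return -abs(s - 7)

    def fit(dataset):
        # A 1-nearest-neighbor "regressor" over the collected (features, value) pairs.
        return lambda f: min(dataset, key=lambda d: abs(d[0][0] - f[0]))[1]

    model = learn_while_searching(random.sample(range(-10, 11), 5), lambda f: 0.0,
                                  fit, successors, features, outcome)
    print(model((6,)))  # the learned evaluator's estimate of a state near the goal
    ```

    Training the same evaluator on, say, uniformly random states instead of the visited ones would be exactly the misdesign described above: learning from samples drawn from the wrong process.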