Machine Learning (Theory)

10/26/2005

Fallback Analysis is a Secret to Useful Algorithms

Tags: Bayesian, Online, Prediction Theory jl@ 1:54 pm

The ideal of theoretical algorithm analysis is to construct an algorithm with accompanying optimality theorems proving that it is a useful algorithm. This ideal often fails, particularly for learning algorithms and theory. The general form of a theorem is:

If preconditions Then postconditions
When we design learning algorithms it is very common to come up with precondition assumptions such as “the data is IID”, “the learning problem is drawn from a known distribution over learning problems”, or “there is a perfect classifier”. All of these example preconditions can be false for real-world problems in ways that are not easily detectable. This means that algorithms derived and justified by these very common forms of analysis may be prone to catastrophic failure in routine (mis)application.

We can hope for better. Several different kinds of learning algorithm analysis have been developed some of which have fewer preconditions. Simply demanding that these forms of analysis be used may be too strong—there is an unresolved criticism that these algorithm may be “too worst case”. Nevertheless, it is possible to have a learning algorithm that simultaneously provides strong postconditions given strong preconditions, reasonable postconditions given reasonable preconditions, and weak postconditions given weak preconditions. Some examples of this I’ve encountered include:

  1. Sham, Matthias and Dean showing that some Bayesian regression is robust in a minimax online learning analysis.
  2. The cover tree which creates an O(n) datastructure for nearest neighbor queries while simultaneously guaranteeing O(log(n)) query time when the metric obeys a dimensionality constraint.

The basic claim is that algorithms with a good fallback analysis are significantly more likely to achieve the theoretical algorithm analysis ideal. Both of the above algorithms have been tested in practice and found capable.

Several significant difficulties occur for anyone working on fallback analysis.

  1. It’s harder. This is probably the most valid reason—people have limited time to do things. Nevertheless, it is reasonable to hope that the core techniques used by many people have had this effort put into them.
  2. It is psychologically difficult to both assume and not assume a precondition, for a researcher. A critical valuable resource here is observing multiple forms of analysis.
  3. It is psychologically difficult for a reviewer to appreciate the value of both assuming and not assuming some precondition. This is a matter of education.
  4. It is neither “sexy” nore straightforward. In particular, theoretically inclined people 1) get great joy from showing that something new is possible and 1) routinely work on papers of the form “here is a better algorithm to do X given the same assumptions”. A fallback analysis requires a change in assumption invalidating (2) and the new thing that it shows for (1) is subtle: that two existing guarantees can hold for the same algorithm. My hope here is that this subtlety becomes better appreciated in time—making useful algorithms has a fundamental sexiness of it’s own.

10/20/2005

Machine Learning in the News

Tags: General jl@ 10:55 am

The New York Times had a short interview about machine learning in datamining being used pervasively by the IRS and large corporations to predict who to audit and who to target for various marketing campaigns. This is a big application area of machine learning. It can be harmful (learning + databases = another way to invade privacy) or beneficial (as google demonstrates, better targeting of marketing campaigns is far less annoying). This is yet more evidence that we can not rely upon “I’m just another fish in the school” logic for our expectations about treatment by government and large corporations.

10/19/2005

Workshop: Atomic Learning

Tags: General jl@ 12:23 pm

We are planning to have a workshop on atomic learning Jan 7 & 8 at TTI-Chicago.

Details are here.

The earlier request for interest is here.

The primary deadline is abstracts due Nov. 20 to jl@tti-c.org.

10/16/2005

Complexity: It’s all in your head

One of the central concerns of learning is to understand and to
prevent overfitting. Various notion of “function complexity” often
arise: VC dimension, Rademacher complexity, comparison classes of
experts, and program length are just a few.

The term “complexity” to me seems somehow misleading; the terms never
capture something that meets my intuitive notion of complexity. The
Bayesian notion clearly captures what’s going on. Functions aren’t
“complex”– they’re just “surprising”: we assign to them low
probability. Most (all?) complexity notions I know boil down
to some (generally loose) bound on the prior probability of the function.

In a sense, “complexity” fundementally arises because probability
distributions must sum to one. You can’t believe in all possibilities
at the same time, or at least not equally. Rather you have to
carefully spread the probability mass over the options you’d like to
consider. Large complexity classes means that beliefs are spread
thinly. In it’s simplest form, this phenomenom give the log (1\n) for
n hypotheses in classic PAC bounds.

In fact, one way to think about good learning algorithms is that they
are those which take full advantage of their probability mass.
In the language of Minimum Description Length, they correspond to
“non-defective distributions”.

So this raises a question: are there notions of complexity (preferably finite,
computable ones) that differ fundementally from the notions of “prior”
or “surprisingness”? Game-theoretic setups would seem to be promising,
although much of the work I’m familiar with ties it closely to the notion
of prior as well.

10/13/2005

Site tweak

Tags: General jl@ 10:43 am

Several people have had difficulty with comments which seem to have an allowed language significantly poorer than posts. The set of allowed html tags has been increased and the markdown filter has been put in place to try to make commenting easier. I’ll put some examples into the comments of this post.

10/12/2005

The unrealized potential of the research lab

Tags: General jl@ 10:35 am

I attended the IBM research 60th anniversary. IBM research is, by any reasonable account, the industrial research lab which has managed to bring the most value to it’s parent company over the long term. This can be seen by simply counting the survivors: IBM research is the only older research lab which has not gone through a period of massive firing. (Note that there are also new research labs.)

Despite this impressive record, IBM research has failed, by far, to achieve it’s potential. Examples which came up in this meeting include:

  1. It took about a decade to produce DRAM after it was invented in the lab. (In fact, Intel produced it first.)
  2. Relational databases and SQL were invented and then languished. It was only under external competition that IBM released it’s own relational database. Why didn’t IBM grow an Oracle division?
  3. An early lead in IP networking hardware did not result in IBM growing a Cisco division. Why not?

And remember … IBM research is a stark success story compared to it’s competitors.

Why is there such a pervasive failure to recognize the really big things when they come along in a research lab? This is a huge by capitolization standards: several ideas created in IBM labs have resulted in companies with a stock market valuation greater than IBM. This is also of fundamental concern to researchers everywhere, because that failure is much of the reason why research is chronically underfunded.

A reasonable argument is “it’s much harder to predict the big ones in advance than in hindsight”. This is certainly true, but I don’t believe that is a full explanation. Too many ideas have succeeded because someone else outside of the orignal company recognized the value of the idea and exploited it. There is no fundamental reason why VCs should have an inherent advantage over the company running a research lab at recognizing good ideas within it’s own research lab.

Muthu Muthukrishnan says “it’s the incentives”. In particular, people who invent something within a research lab have little personal incentive in seeing it’s potential realized so they fail to pursue it as vigorously as they might in a startup setting. This is a very reasonable argument: incentives for success at a research lab are typically a low percentage of base salary while startup founders have been known to become billionaires.

A third argument is that researchers at a research lab are likely to find new and better ways of doing things that the company already does. This creates a problem because a large fraction of the company is specifically invested in the current-to-become-old way of doing things. When a debate happens, it is always easy to find the faults and drawbacks of the new method while ignoring the accepted faults of the old (this is common to new directions in research as well).

None of these three reasons are fundamentally unremovable, and it seems plausible that the first industrial research lab to remove these obstacles will reap huge benefits. My current candidates for ‘most likely to remove barriers’ are google and NICTA.

10/10/2005

Predictive Search is Coming

Tags: General jl@ 10:08 am

“Search” is the other branch of AI research which has been succesful. Concrete examples include Deep Blue which beat the world chess champion and Chinook the champion checkers program. A set of core search techniques exist including A*, alpha-beta pruning, and others that can be applied to any of many different search problems.

Given this, it may be surprising to learn that there has been relatively little succesful work on combining prediction and search. Given also that humans typically solve search problems using a number of predictive heuristics to narrow in on a solution, we might be surprised again. However, the big successful search-based systems have typically not used “smart” search algorithms. Insteady they have optimized for very fast search. This is not for lack of trying… many people have tried to synthesize search and prediction to various degrees of success. For example, Knightcap achieves good-but-not-stellar chess playing performance, and TD-gammon has achieved near-optimal Backgammon play.

The basic claim of this post is that we will see more and stronger successes of the sort exemplified by Knightcap and TD-gammon and less of those exemplified by the “pure” search techniques. Here are two reasons:

  1. The strengths of computers are changing. Computers have long held a huge advantage over humans in two areas:

    1. Short term memory. A typical human can remember 7ish random numbers presented in a sequence. A computer can easily remember 107 random numbers. This strength is continuing to grow.
    2. Sequential execution speed. What I mean by ’sequential execution speed’ is simply time ordered instruction execution with arbitrary data dependencies. Neurons in a human brain might be able to fire at a rate of 102 or 103 cycles/second (depending on who you ask), while computers can run at 109 cycles/second. Sequential execution speed growth appears stalled for the moment due to heat issues.

    These two points of excellence are precisely what search algorithms typically require for good performance which partially explains why the human-preferred method of search is so different from the computer-preferred method. The advantages of humans have traditionally been:

    1. Significant long term memory For example humans might have a significant advantage using a smarter search technique becaue they can use previous search-based successes and failures to bias search on new problems. Computers are gaining this advantage as well due to massive hard drives and networking.
    2. Massively parallel hardware While the speed that brains work at may not be very significant, the total amount of computation done by a brain is enormous. For example, estimates of 1015 neural connections are credible. Chip designers, who are stymied in the production of sequential execution speed are parallelizing via speculative execution, special instruction sets, “hyperthreading”, dual core processors, SMP machines, and clusters. It will be awhile before computers can execute as many parallel instructions as computer processors, but not as long as you might expect—the cell processor claims 0.25*1012 floating point operations per second.

    Putting these observations together, we see that computers are becoming more ’rounded’ in their strengths which suggests highly parallelizable algorithms using large quantities of memory may become more viable solutions for search problems. These qualities of ‘highly parallelizable’ and ‘incorporating long term memory’ characterize many of our prediction algorithms.

  2. We have gained a finer understanding of what learning is and how it can take place.

    In the beginning, on the applied side, learning was simply understood at a ‘type’ level with a relatively sparse set of types such as ‘binary classification’ and ‘multiclass classification’. Evidence of this can be seen in the UCI machine learning repository where most learning problems are ‘projected’ (via throwing away extra information) into binary or multiclass prediction even when the original problem was significantly different.

    What we have learned since then is that it is critical to consider:

    1. The natural distribution over instances. What is meant here is that instances come from some process and it is only necessary to perform well with respect to the samples produced by that process. This observation guides us to focus attention in learning machines.

      At the theoretical level, it is difficult to even think about learning problems without considering some natural distribution over the instances we need to predict. At the practical level, there are many cases where people have enjoyed grealy improved success simply by using the natural distribution. This occurs repeatedly in the learning reductions analysis (an exemplar is here), but this core observation is common elsewhere as well.

    2. Richer classification problems. It turns out that many of the details lost in translation of a problem to multiclass or binary classification are very important. For example, if one failure mode is much more severe than another exposing that to the learning machine can make a huge difference in performance. We now have a number of techniques for exposing this information to learning machines and can use much richer side information.

    The basic claim here is that understanding these two concepts is key to mixing search and prediction. Considering the ‘natural distribution’ is important because search systems create their own distribution over instances. This property of ’self-creating distribution’ implies that it is easy to misdesign a combined learning/search system. For example, it is easy to train the learning algorithm on samples drawn from the wrong process.

    The value of searching in some direction vs. another direction is not easily described as a simple classification problem. Consequently, projecting away this information, as might have been done a decade ago, can result in a large performance loss.

10/8/2005

We have a winner

Tags: General jl@ 6:55 pm

The DARPA grandchallenge is a big contest for autonomous robot vehicle driving. It was run once in 2004 for the first time and all teams did badly. This year was notably different with the Stanford and CMU teams succesfully completing the course. A number of details are here and wikipedia has continuing coverage.

A formal winner hasn’t been declared yet although Stanford completed the course quickest.

The Stanford and CMU teams deserve a large round of applause as they have strongly demonstrated the feasibility of autonomous vehicles.

The good news for machine learning is that the Stanford team (at least) is using some machine learning techniques.

10/7/2005

On-line learning of regular decision rules

Tags: Online, Universal Learning Volodya@ 4:08 pm

Many decision problems can be represented in the form
FOR n=1,2,…:
— Reality chooses a datum xn.
— Decision Maker chooses his decision dn.
— Reality chooses an observation yn.
— Decision Maker suffers loss L(yn,dn).
END FOR.
The observation yn can be, for example, tomorrow’s stock price and the decision dn the number of shares Decision Maker chooses to buy. The datum xn ideally contains all information that might be relevant in making this decision. We do not want to assume anything about the way Reality generates the observations and data.

Suppose there is a good and not too complex decision rule D mapping each datum x to a decision D(x). Can we perform as well, or almost as well, as D, without knowing it? This is essentially a special case of the problem of on-line learning.

This is a simple result of this kind. Suppose the data xn are taken from [0,1] and L(y,d)=|yd|. A norm ||h|| of a function h on [0,1] is defined by

||h||2 = (Integral01 h(t) dt)2 + Integral01 (h'(t))2 dt.
Decision Maker has a strategy that guarantees, for all N and all D with finite ||D||,
Sumn=1N L(yn,dn) is at most Sumn=1N L(yn,D(xn)) + (||2D–1|| + 1) N1/2.
Therefore, Decision Maker is competitive with D provided the “mean slope” (Integral01 (D'(t))2dt)1/2 of D is significantly less than N1/2. This can be extended to general reproducing kernel Hilbert spaces of decision rules and to many other loss functions.

It is interesting that the only known way of proving this result uses a non-standard (although very old) notion of probability. The standard definition (”Kolmogorov’s axiomatic“) is that probability is a normalized measure. The alternative is to define probability in terms of games rather than measure (this was started by von Mises and greatly advanced by Ville, who in 1939 replaced von Mises’s awkward gambling strategies with what he called martingales). This game-theoretic probability allows one to state the most important laws of probability as continuous strategies for gambling against the forecaster: the gambler becomes rich if the forecasts violate the law. In 2004 Akimichi Takemura noticed that for every such strategy the forecaster can play in such a way as not to lose a penny. Takemura’s procedure is simple and constructive: for any continuous law of probability we have an explicit forecasting strategy that satisfies that law perfectly. We call this procedure defensive forecasting. Takemura’s version was about binary observations, but it has been greatly extended since.

Now let us see how we can achieve the goal Sumn=1 N L(yn,dn) < Sumn=1N L(yn,D(xn)) + (something small). This would be easy if we knew the true probabilities for the observations. At each step we would choose the decision with the smallest expected loss and the law of large numbers would allow us to prove that our goal would be achieved. Alas, we do not know the true probabilities (if they exist at all). But with defensive forecasting at our disposal we do not need them: the fake probabilities produced by defensive forecasting are ideal as far as the law of large numbers is concerned. The rest of the proof is technical details.

10/3/2005

Not ICML

Tags: General jl@ 4:00 am

Alex Smola showed me this ICML 2006 webpage. This is NOT the ICML we know, but rather some people at “Enformatika”. Investigation shows that they registered with an anonymous yahoo email account from dotregistrar.com the “Home of the $6.79 wholesale domain!” and their nameservers are by Turkticaret, a Turkish internet company.

It appears the website has since been altered to “ICNL” (the above link uses the google cache).

They say that imitation is the sincerest form of flattery, so the organizers of the real ICML 2006 must feel quite flattered.

Powered by WordPress