January 2008 – Machine Learning (Theory)

1/28/2008

Sufficient Computation

Do we have computer hardware sufficient for AI? This question is difficult to answer, but here’s a try:

One way to achieve AI is by simulating a human brain. A human brain has about 10¹⁵ synapses, each operating at about 10² operations per second, implying about 10¹⁷ bit ops per second.

A modern computer runs at 10⁹ cycles/second and operates on 10² bits per cycle, implying 10¹¹ bits processed per second.

The gap here is only 6 orders of magnitude, which can be plausibly surpassed via cluster machines. For example, the BlueGene/L operates 10⁵ nodes (one order of magnitude short). Its peak recorded performance is about 0.5*10¹⁵ FLOPS, which translates to about 10¹⁶ bit ops per second, which is nearly 10¹⁷.
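The arithmetic above can be checked with a short sketch. The constant and function names are illustrative only, and the numbers are the rough estimates quoted in the text, not measurements.

```python
import math

# Order-of-magnitude figures quoted in the text.
SYNAPSES = 1e15          # synapses in a human brain
SYNAPSE_RATE = 1e2       # operations per synapse per second
CPU_CYCLES = 1e9         # cycles per second of one modern machine
BITS_PER_CYCLE = 1e2     # bits operated on per cycle
BLUEGENE_BIT_OPS = 1e16  # BlueGene/L estimate from the text


def orders_of_magnitude(required, available):
    """Return log10(required / available), rounded to the nearest integer."""
    return round(math.log10(required / available))


brain_bit_ops = SYNAPSES * SYNAPSE_RATE        # about 1e17
machine_bit_ops = CPU_CYCLES * BITS_PER_CYCLE  # about 1e11

gap_single = orders_of_magnitude(brain_bit_ops, machine_bit_ops)
gap_cluster = orders_of_magnitude(brain_bit_ops, BLUEGENE_BIT_OPS)
print(gap_single, gap_cluster)
```

Under these assumptions the single-machine gap is 6 orders of magnitude and the BlueGene/L gap is 1.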

There are many criticisms (both positive and negative) of this argument:

Simulation of a human brain might require substantially more detail. Perhaps an additional factor of 10² is required per neuron.
We may not need to simulate a human brain to achieve AI. There are certainly many examples where we have been able to design systems that work much better than evolved systems.
The internet can be viewed as a supercluster with 10⁹ or so CPUs, easily satisfying the computational requirements.
Satisfying the computational requirement is not enough—bandwidth and latency requirements must also be satisfied.

These sorts of order-of-magnitude calculations appear sloppy, but they work out a remarkable number of times when tested elsewhere. I wouldn’t be surprised to see it work out here.

Even with sufficient hardware, we are missing a vital ingredient: knowing how to do things.

1/25/2008, 1/26/2008

Turing’s Club for Machine Learning

Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions:

How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m²) (naive nearest neighbor for self-prediction) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled.
The above question can also be asked for test cases. In some applications, test-time performance is of great importance.
How does the algorithm scale with the number of features n per example? Many second order gradient descent algorithms are O(n²) or worse, which becomes unacceptable as the number of parameters grows. Nonsparse algorithms applied to sparse datasets have an undefined dependence, which is typically terrible.
What are the memory requirements of the learning algorithm? Something linear in the number of features (or less) is nice. Nearest neighbor and kernel methods can be problematic, because the memory requirement is uncontrolled.
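As an illustration of the O(m²) case mentioned above, here is a minimal sketch of naive nearest neighbor for self-prediction. The function name and the toy data are hypothetical; the point is only that the double loop over examples makes the quadratic cost explicit, and that all m examples must be kept in memory.

```python
def nearest_neighbor_self_prediction(points):
    """For each point, find the index of its nearest other point.

    Returns (neighbors, distance_evaluations). Every point is compared
    against every other point, so the work is m * (m - 1) distance
    evaluations, each costing O(n) in the number of features.
    """
    neighbors = []
    evaluations = 0
    for i, p in enumerate(points):
        best_j, best_d = None, float("inf")
        for j, q in enumerate(points):
            if i == j:
                continue  # a point may not be its own neighbor
            d = sum((a - b) ** 2 for a, b in zip(p, q))  # squared Euclidean
            evaluations += 1
            if d < best_d:
                best_j, best_d = j, d
        neighbors.append(best_j)
    return neighbors, evaluations


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (6.0, 5.0)]
neighbors, evaluations = nearest_neighbor_self_prediction(points)
print(neighbors, evaluations)
```

With m = 4 toy examples this performs 4 * 3 = 12 distance evaluations; doubling m roughly quadruples the work.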

One unfortunate aspect of big-O notation is that it doesn’t give a good intuitive sense of the scale of problems solvable by a machine. A simple trick is to pick a scale, and ask what size problem can be solved given the big-O dependence. For various reasons (memory size, number of web pages, FLOPS of a modern machine), a scale of 10¹⁰ is currently appropriate. Computing scales, you get:

O(m)	O(m log(m))	O(m²)	O(m³)	O(e^m)
10¹⁰	5*10⁸	10⁵	2*10³	25
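The trick of fixing a budget and solving for the problem size can be sketched as follows. This is an illustration under stated assumptions: constant factors are taken to be 1, the logarithm is natural, and `largest_m` is a hypothetical helper, not something from the post.

```python
import math

BUDGET = 1e10  # operations, the scale chosen in the text


def largest_m(cost, budget=BUDGET):
    """Largest integer m with cost(m) <= budget.

    Assumes cost is increasing in m and cost(1) <= budget.
    Uses doubling to bracket the answer, then bisection.
    """
    lo, hi = 1, 2
    while cost(hi) <= budget:
        lo, hi = hi, hi * 2
    # Invariant: cost(lo) <= budget < cost(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cost(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


costs = {
    "O(m)": lambda m: m,
    "O(m log(m))": lambda m: m * math.log(m),
    "O(m^2)": lambda m: m ** 2,
    "O(m^3)": lambda m: m ** 3,
    "O(e^m)": lambda m: math.exp(m),
}
sizes = {name: largest_m(cost) for name, cost in costs.items()}
for name, m in sizes.items():
    print(f"{name:12s} {m}")
```

Under these assumptions the exact cutoffs are 10¹⁰, just under 5*10⁸, 10⁵, 2154, and 23, which agree with the rounded entries in the table to within the precision an order-of-magnitude estimate aims for.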

There is good reason to stick with big-O notation over the long term, because the scale of problems we tackle keeps growing. Having a good understanding of the implied scale remains very handy for understanding the practicality of algorithms for problems.

There are various depths to which we can care about computation. The Turing’s Razor application would be “a learning algorithm isn’t interesting unless it runs in time linear in the number of bytes input”. This isn’t crazy: for people with a primary interest in large scale learning (where you explicitly have large datasets) or AI (where any effective system must scale to very large amounts of experience), an O(mn log(mn)) or better dependence is the target.

For someone deeply steeped in computer science algorithms and complexity theory, the application is: “a learning algorithm isn’t interesting unless it has a polynomial dependence on the number of bytes input”. This is mismatched for machine learning. It’s too crude because O(m^9) algorithms are interesting to basically no one. It’s too fine because (a) there are a number of problems of interest with only a small amount of data where algorithms with unquantifiable computation may be of interest (think of Bayesian integration) and (b) some problems simply have no solution yet, so the existence of a solution (which is not necessarily efficient) is of substantial interest.

The right degree of care about computation I’ll call “Turing’s club”. Computation is a primary but not overriding concern. Every algorithm should be accompanied by some statement about its computational and space costs. Algorithms in the “no known computational bound” category are of interest if they accomplish something never before done, but are otherwise of little interest. Algorithms with controlled guarantees on computational requirements are strongly preferred. Linear time algorithms are strongly preferred. Restated: there are often many algorithms capable of solving a particular problem reasonably well, so fast algorithms with controlled resource guarantees distinguish themselves by requiring less TLC to make them work well.

1/23/2008

Why Workshop?

I second the call for workshops at ICML/COLT/UAI.

Several times before, details of why and how to run a workshop have been mentioned.

There is a simple reason to prefer workshops here: attendance. The Helsinki colocation has placed workshops directly between ICML and COLT/UAI, which is optimal for getting attendees from any conference. In addition, last year ICML had relatively few workshops and NIPS workshops were overloaded. In addition to those that happened, a similar number were rejected. The overload has strange consequences: for example, the best attended workshop wasn’t an official NIPS workshop. Aside from intrinsic interest, the Deep Learning workshop benefited greatly from being off schedule.

1/18/2008

Datasets

David Pennock notes the impressive set of datasets at datawrangling.

1/7/2008

2008 Summer Machine Learning Conference Schedule

Conference	Paper due date	Conference Date	Location
AAAI	January 22/23/25/30	July 13-17	Chicago, Illinois
ICML	Feb 8	July 5-9	Helsinki, Finland
COLT	Feb 20	July 9-12	Helsinki, Finland
KDD	Feb 23/29	August 24-27	Las Vegas, Nevada
UAI	Feb 27/Feb 29	July 9-12	Helsinki, Finland

Helsinki is a fun place to visit.