Sufficient Computation

Do we have computer hardware sufficient for AI? This question is difficult to answer, but here’s a try:

One way to achieve AI is by simulating a human brain. A human brain has about 10^15 synapses which operate at about 10^2 per second, implying about 10^17 bit ops per second.

A modern computer runs at 10^9 cycles/second and operates on 10^2 bits per cycle, implying 10^11 bits processed per second.

The gap here is only 6 orders of magnitude, which can plausibly be closed with clusters of machines. For example, the BlueGene/L operates 10^5 nodes (one order of magnitude short). Its peak recorded performance is about 0.5*10^15 FLOPS, which translates to roughly 10^16 bit ops per second (counting a few tens of bit operations per floating point operation), which is nearly 10^17.
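
As a sanity check, here is a minimal sketch of the arithmetic above in Python. Every constant is one of the rough order-of-magnitude assumptions from the text, except the bits-per-FLOP factor, which is my own guess rather than a measured figure.

```python
# Back-of-the-envelope check of the estimate above. All constants are rough
# order-of-magnitude assumptions, not measurements.
import math

# Human brain: ~10^15 synapses, each operating ~10^2 times per second.
brain_bit_ops = 1e15 * 1e2          # ~10^17 bit ops / second

# A modern machine: ~10^9 cycles/second, ~10^2 bits touched per cycle.
machine_bit_ops = 1e9 * 1e2         # ~10^11 bit ops / second

print(f"single machine gap: ~{math.log10(brain_bit_ops / machine_bit_ops):.0f} orders of magnitude")

# A BlueGene/L-scale cluster: ~0.5*10^15 FLOPS, and (my assumption) a few
# tens of bit operations per floating point operation.
cluster_bit_ops = 0.5e15 * 20       # ~10^16 bit ops / second
print(f"cluster gap: ~{math.log10(brain_bit_ops / cluster_bit_ops):.1f} orders of magnitude")
```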

There are many criticisms (both positive and negative) of this argument.

  1. Simulation of a human brain might require substantially more detail. Perhaps an additional 10^2 is required per neuron.
  2. We may not need to simulate a human brain to achieve AI. There are certainly many examples where we have been able to design systems that work much better than evolved systems.
  3. The internet can be viewed as a supercluster with 10^9 or so CPUs, easily satisfying the computational requirements (see the sketch after this list).
  4. Satisfying the computational requirement is not enough—bandwidth and latency requirements must also be satisfied.
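
Criticisms 1 and 3 are themselves order-of-magnitude adjustments, so they can be plugged back into the same arithmetic. The sketch below does exactly that, using the rough figures from the list; it says nothing about the bandwidth and latency concern in criticism 4.

```python
# Plugging criticisms 1 and 3 into the same estimate. The 10^2 detail factor
# and the 10^9 CPU count are the rough figures from the list, not measurements.
import math

brain_bit_ops = 1e15 * 1e2              # baseline brain estimate: ~10^17
machine_bit_ops = 1e9 * 1e2             # single machine: ~10^11

detailed_brain = brain_bit_ops * 1e2    # criticism 1: ~10^2 more detail per neuron
internet = machine_bit_ops * 1e9        # criticism 3: ~10^9 CPUs on the internet

print(f"internet vs. detailed brain: ~{math.log10(internet / detailed_brain):.0f} order(s) of magnitude to spare")
```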

These sorts of order-of-magnitude calculations appear sloppy, but they work out a remarkable number of times when tested elsewhere. I wouldn’t be surprised to see them work out here.

Even with sufficient hardware, we are missing a vital ingredient: knowing how to do things.

Turing’s Club for Machine Learning

Many people in Machine Learning don’t fully understand the impact of computation, as demonstrated by a lack of big-O analysis of new learning algorithms. This is important—some current active research programs are fundamentally flawed w.r.t. computation, and other research programs are directly motivated by it. When considering a learning algorithm, I think about the following questions:

  1. How does the learning algorithm scale with the number of examples m? Any algorithm using all of the data is at least O(m), but in many cases this is O(m^2) (naive nearest neighbor for self-prediction, as in the sketch after this list) or unknown (k-means or many other optimization algorithms). The unknown case is very common, and it can mean (for example) that the algorithm isn’t convergent or simply that the amount of computation isn’t controlled.
  2. The above question can also be asked for test cases. In some applications, test-time performance is of great importance.
  3. How does the algorithm scale with the number of features n per example? Many second order gradient descent algorithms are O(n^2) or worse, which becomes unacceptable as the number of parameters grows. Nonsparse algorithms applied to sparse datasets have an undefined dependence, which is typically terrible.
  4. What are the memory requirements of the learning algorithm? Something linear in the number of features (or less) is nice. Nearest neighbor and kernel methods can be problematic, because the memory requirement is uncontrolled.
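
To make question 1 concrete, here is a toy contrast between an O(mn) learner (one pass of stochastic gradient descent) and an O(m^2 n) one (leave-one-out nearest neighbor self-prediction). Both the data and the learners are synthetic stand-ins for illustration, not any particular published algorithm.

```python
# Toy illustration of question 1: how training cost depends on m.
import numpy as np

rng = np.random.default_rng(0)
m, n = 2000, 20
X = rng.normal(size=(m, n))
y = (X @ rng.normal(size=n) > 0).astype(float)

def sgd_epoch(X, y, lr=0.1):
    """One pass of logistic-loss SGD: O(m * n) time, O(n) memory."""
    w = np.zeros(X.shape[1])
    for i in range(X.shape[0]):
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        w -= lr * (p - y[i]) * X[i]
    return w

def nn_self_prediction_error(X, y):
    """Leave-one-out 1-nearest-neighbor: O(m^2 * n) time, O(m * n) memory."""
    errors = 0
    for i in range(X.shape[0]):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        errors += y[np.argmin(d)] != y[i]
    return errors / X.shape[0]

w = sgd_epoch(X, y)
print("1-NN self-prediction error:", nn_self_prediction_error(X, y))
```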

One unfortunate aspect of big-O notation is that it doesn’t give a good intuitive sense of the scale of problems solvable by a machine. A simple trick is to pick a scale and ask what size problem can be solved given the big-O dependence. For various reasons (memory size, number of web pages, FLOPS of a modern machine), a scale of 10^10 is currently appropriate. Computing the solvable problem size for each dependence, you get:

Dependence    O(m)     O(m log(m))   O(m^2)   O(m^3)   O(e^m)
Solvable m    10^10    5*10^8        10^5     2*10^3   25
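
The table entries can be recomputed mechanically: for each dependence, search for the largest m whose cost fits within a budget of about 10^10 operations. The sketch below does this with natural logs, so the exponential entry comes out near 23 rather than 25, a difference well within order-of-magnitude rounding.

```python
# Recompute the table: the largest m whose cost stays within ~10^10 operations.
import math

budget = 1e10

def largest_m(cost, hi=1e12):
    """Binary search for the largest m with cost(m) <= budget."""
    lo = 1.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if cost(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

for name, cost in [("O(m)",        lambda m: m),
                   ("O(m log(m))", lambda m: m * math.log(m)),
                   ("O(m^2)",      lambda m: m ** 2),
                   ("O(m^3)",      lambda m: m ** 3),
                   ("O(e^m)",      lambda m: math.exp(min(m, 700)))]:
    print(f"{name:12s} ~ {largest_m(cost):.3g}")
```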

There is good reason to stick with big-O notation over the long term, because the scale of problems we tackle keeps growing. Having a good understanding of the implied scale remains very handy for understanding the practicality of algorithms for problems.

There are various depths to which we can care about computation. The Turing’s Razor application would be “a learning algorithm isn’t interesting unless it runs in time linear in the number of bytes input”. This isn’t crazy: for people with a primary interest in large scale learning (where you explicitly have large datasets) or AI (where any effective system must scale to very large amounts of experience), an O(mn log(mn)) or better dependence is the target.

For someone deeply steeped in computer science algorithms and complexity theory, the application is: “a learning algorithm isn’t interesting unless it has a polynomial dependence on the number of bytes input”. This is mismatched for machine learning. It’s too crude because O(m^9) algorithms are interesting to basically no one. It’s too fine because (a) there are a number of problems of interest with only a small amount of data, where algorithms with unquantifiable computation may be of interest (think of Bayesian integration), and (b) some problems simply have no solution yet, so the existence of a solution (which is not necessarily efficient) is of substantial interest.

The right degree of care about computation I’ll call “Turing’s club”. Computation is a primary but not overriding concern. Every algorithm should be accompanied by some statement about its computational and space costs. Algorithms in the “no known computational bound” category are of interest if they accomplish something never before done, but are otherwise of little interest. Algorithms with controlled guarantees on computational requirements are strongly preferred. Linear time algorithms are strongly preferred. Restated: there are often many algorithms capable of solving a particular problem reasonably well, so fast algorithms with controlled resource guarantees distinguish themselves by requiring less TLC to make them work well.

Why Workshop?

I second the call for workshops at ICML/COLT/UAI.

The details of why and how to run a workshop have been discussed here several times before.

There is a simple reason to prefer workshops here: attendance. The Helsinki colocation has placed workshops directly between ICML and COLT/UAI, which is optimal for drawing attendees from any of the conferences. Furthermore, last year ICML had relatively few workshops while NIPS workshops were overloaded: in addition to those that happened, a similar number were rejected. The overload has strange consequences; for example, the best attended workshop wasn’t an official NIPS workshop. Aside from intrinsic interest, the Deep Learning workshop benefited greatly from being off schedule.