Parallel machine learning is a subject rarely addressed at machine learning conferences. Nevertheless, it seems likely to increase in importance because:
- Data set sizes appear to be growing substantially faster than computation. Essentially, this happens because more and more sensors of various sorts are being hooked up to the internet.
- Serial speedups of processors seem are relatively stalled. The new trend is to make processors more powerful by making them multicore.
- Both AMD and Intel are making dual core designs standard, with plans for more parallelism in the future.
- IBM’s Cell processor has (essentially) 9 cores.
- Modern graphics chips can have an order of magnitude more separate execution units.
The meaning of ‘core’ varies a bit from processor to processor, but the overall trend seems quite clear.
So, how do we parallelize machine learning algorithms?
- The simplest and most common technique is to simply run the same learning algorithm with different parameters on different processors. Cluster management software like OpenMosix, Condor, or Torque are helpful here. This approach doesn’t speed up any individual run of a learning algorithm.
- The next simplest technique is to decompose a learning algorithm into an adaptive sequence of statistical queries and parallelize the queries over the sample. This paper (updated from the term paper according to a comment) points out that statistical integration can be implemented via MapReduce which Google popularized (the open source version is Hadoop). The general technique of parallel statistical integration is already used by many people including IBM’s Parallel Machine Learning Toolbox. The disadvantage of this approach is that it is particularly good at speeding up slow algorithms. One example of a fast sequential algorithm is perceptron. The perceptron works on a per example basis making individual updates extremely fast. It is explicitly not a statistical query algorithm.
- The most difficult method for speeding up an algorithm is fine-grained structural parallelism. The brain works in this way: each individual neuron operates on it’s own. The reason why this is difficult is that the programming is particularly tricky—you must carefully optimize to avoid latency in network communication. The full complexity of parallel programming is exposed.
A basic question is: are there other approaches to speeding up learning algorithms which don’t incur the full complexity of option (3)? One approach is discussed here.