Previously, we discussed parallel machine learning a bit. As parallel ML is rather difficult, I’d like to describe my current thinking and ask the rest of the world for advice. This is particularly relevant right now, as I’m attending a workshop on parallel ML tomorrow.
Parallelizing slow algorithms seems uncompelling. Parallelizing many algorithms also seems uncompelling, because the effort required to parallelize is substantial. This leaves the question: which one fast algorithm is best to parallelize? And what is a substantially different second choice?
One compellingly fast simple algorithm is online gradient descent on a linear representation. This is the core of Leon’s sgd code and Vowpal Wabbit. Antoine Bordes showed a variant was competitive in the large scale learning challenge. It’s also a decades-old primitive which has been reused in many algorithms and continues to be reused. It also applies to online learning rather than just online optimization, implying the algorithm can be used in a host of situations where batch algorithms are awkward or unviable.
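For concreteness, here is a minimal sketch of the primitive — not the actual implementation in Leon’s sgd code or Vowpal Wabbit. The squared loss and the eta/sqrt(t) decay schedule are illustrative choices, not the only reasonable ones:

```python
import numpy as np

def online_gradient_descent(examples, dim, eta=0.1):
    """One pass of online gradient descent with squared loss on a
    linear representation: the weight vector is updated after every
    example, so the algorithm never needs a batch over the data."""
    w = np.zeros(dim)
    for t, (x, y) in enumerate(examples, start=1):
        grad = (w.dot(x) - y) * x        # gradient of 0.5 * (w.x - y)^2
        w -= (eta / np.sqrt(t)) * grad   # one common decay schedule
    return w
```

The per-example update is what makes this an online learning algorithm rather than just an optimizer: it produces a usable predictor at every point in the stream.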
If we start with a fast learning algorithm as a baseline, there seem to be two directions we can go with parallel ML:
- (easier) Try to do more in the same amount of time. For example, Paul and Neil suggest using parallelism to support ensemble methods (see the sketch after this list).
- (harder) Try to use parallelism to reduce the time required to learn effectively on large amounts of data. For this approach, bandwidth and latency become uncomfortably critical implementation details. Due to these issues, it appears important to at least loosen the goal to merely competing with (rather than matching) sequential learning on large amounts of data. Alternatively, we could take this as evidence that some other learning primitive is desirable, although I’m not sure which one.
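To make the easier direction concrete, here is a minimal sketch that trains one linear learner per data shard in parallel and combines them by averaging weight vectors. The names sgd_on_shard and parallel_ensemble are hypothetical, and averaging is just one combination rule among many — a real ensemble might vote or stack instead:

```python
import numpy as np
from multiprocessing import Pool

def sgd_on_shard(shard):
    """Train one linear learner (online gradient descent with
    squared loss) on its own shard of the data."""
    dim = len(shard[0][0])
    w = np.zeros(dim)
    for t, (x, y) in enumerate(shard, start=1):
        w -= (0.1 / np.sqrt(t)) * (w.dot(x) - y) * x
    return w

def parallel_ensemble(data, n_members=4):
    """Partition the data into disjoint shards, train one ensemble
    member per shard in parallel, and average the weight vectors."""
    shards = [data[i::n_members] for i in range(n_members)]
    with Pool(n_members) as pool:
        members = pool.map(sgd_on_shard, shards)
    return np.mean(members, axis=0)
```

Because the members never communicate during training, bandwidth and latency barely matter here — which is exactly what makes this the easier direction.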
The choice of algorithm to parallelize seems to depend on the trade-off between the ease of parallelizing it (if that is possible at all) and the eventual payoff in performance after parallelizing. Hence, the choice for a particular application should probably rest on this trade-off, rather than on picking either the best sequential algorithm or the easiest-to-parallelize algorithm for the task.
Ensemble methods not only allow old tasks to run faster through parallelizing, but also introduce new benefits like converting old algorithms into online learners and performing data fusion. Ensemble learning would seem the easiest way to implement parallelism, as it focuses on task parallelism only, ignoring the many new programming bugs possible in parallel computing. How Amdahl’s law can be applied to estimate the speedup isn’t entirely clear to me, but it might be that the speedup is proportional to the reduction in training set size and dimensionality for each ensemble member, compared to one learner on all data. For problems with medium amounts of data, this could quickly put a cap on the speedup we can expect.
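To make the Amdahl’s law worry concrete, here is a small sketch of the standard bound, assuming a fraction p of the training work parallelizes perfectly and the rest stays serial:

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Amdahl's law: if a fraction p of the work parallelizes
    perfectly and (1 - p) is serial, the speedup on n workers
    is 1 / ((1 - p) + p / n)."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_workers)

# Even with 90% of the work parallelizable, 16 workers
# yield only a 6.4x speedup:
print(amdahl_speedup(0.9, 16))  # 6.4
```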
With reference to your previous post, machine learning framed as robust search would at first glance also be a clear candidate for parallelizing, and perhaps scale better than ensembles with regard to speedup?
For some probabilistic models like Naïve Bayes and similar, the second (harder) direction is actually straightforward and easy to implement with little overhead.
This may be relevant because although these models are usually inferior to SVM-type models, parallelism and additional data may actually make them competitive at fixed training time.
I don’t think these were considered at all in the large scale learning challenge, though.
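To illustrate why count-based models parallelize so easily: Naïve Bayes training reduces to summing (label, feature) counts, and sums over disjoint shards merge exactly with no approximation. A minimal sketch, with hypothetical names count_shard and parallel_counts:

```python
from collections import Counter
from multiprocessing import Pool

def count_shard(shard):
    """Collect sufficient statistics (label, feature) counts on one
    shard of the data; shards can be processed independently."""
    counts = Counter()
    for features, label in shard:
        counts[(label, None)] += 1   # class prior count
        for f in features:
            counts[(label, f)] += 1  # per-feature count
    return counts

def parallel_counts(shards):
    """Because training reduces to summing counts, the per-shard
    statistics combine exactly by addition."""
    with Pool(len(shards)) as pool:
        partials = pool.map(count_shard, shards)
    total = Counter()
    for c in partials:
        total.update(c)  # Counter.update adds counts
    return total
```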
Hi,
Your site is really useful for me. I have recently started my PhD, working on parallel and distributed machine learning algorithms.
May I ask whether you have any paper or thesis describing the challenges of parallel ML? Do you have any resources on the challenges of ML on grid computing?
I don’t have a particular paper on parallel ML yet.
In my experience, the biggest challenge for ML on grid computing is speeding up fast learning algorithms rather than slow ones.