November 2005 – Page 2 – Machine Learning (Theory)

This post is about the design of a computing cluster from the viewpoint of applied machine learning using current technology. We just built a small one at TTI, so it offers some evidence of what is feasible, along with thoughts about the design choices.

Architecture: There are several architectural choices.
1. AMD Athlon64 based system. This seems to offer the most bang for the buck. Maximum RAM is typically 2-3GB.
2. AMD Opteron based system. Opterons provide the additional capability to buy an SMP motherboard with two chips, and the motherboards often support 16GB of RAM. The RAM is also the more expensive error-correcting type.
3. Intel PIV or Xeon based system. These are the Intel analogs of the two options above. For architectural reasons, these chips tend to run a bit hotter and be a bit more expensive.
4. Dual core chips. Both Intel and AMD have chips that actually have two processors embedded in them.
Network: Gigabit Ethernet is cheap, easy, and even built into the motherboard.
Operating System: The options are
1. Windows
2. Linux
We chose Linux (in particular, the Fedora Core 3 variant) because it is cheaper, involves fewer licensing hassles, and has been used in clusters for much longer. An additional issue is the version of Linux:
1. 32-bit Linux: The advantage here is that everything “just works”. The disadvantage is that using more than 4GB of RAM is awkward and you lose out on some minor architectural speedups of 64-bit mode.
2. 64-bit Linux
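The 4GB figure follows from pointer width: a 32-bit address can name at most 2^32 distinct bytes. A quick check of the arithmetic, as a sketch of the address-space limit only (the function name is made up for illustration, and the sketch ignores mechanisms such as PAE that let a 32-bit kernel, though not a single process, use more physical memory):

```python
def addressable_gib(pointer_bits: int) -> float:
    """Return the size, in GiB, of the address space a pointer of this width spans."""
    return 2 ** pointer_bits / 2 ** 30


print(addressable_gib(32))  # 4.0
print(addressable_gib(64))  # 17179869184.0
```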
Programming paradigm: There are several paradigms for how to use a parallel machine. These are not exclusive.
1. Use a standard language with a specialized parallel programming library such as mpich. This choice can significantly increase programming time, but is necessary to eke every bit of performance out of the system.
2. Use a batch control system to match jobs with nodes. There are several custom systems around for doing this, and it’s not hard to make up your own script. There is some learning curve here, although it is necessarily smaller than for option 1. With this approach, you can achieve near maximal performance as long as the individual processes do not need to communicate.
3. Turn the cluster into a large virtual machine via openMosix and then simply launch several processes. This is the worst option performance-wise and the best option convenience-wise. To use it, you simply start processes and the system takes care of distributing them across the cluster. Processes must, however, be designed to minimize disk IO (as well as program IO) in order to achieve high-efficiency computation.
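To give a feel for the "make up your own script" version of the batch approach (option 2), here is a minimal sketch that pairs independent jobs with nodes round-robin and builds the ssh command for each. The hostnames, the `./train` command, and both function names are hypothetical. This is not the setup used at TTI, and it assumes passwordless ssh to each node.

```python
import shlex
from itertools import cycle


def assign_jobs(jobs, nodes):
    """Pair each job command with a node, cycling through the nodes round-robin."""
    return list(zip(cycle(nodes), jobs))


def ssh_command(node, job):
    """Build the argv list that would run `job` on `node` over ssh."""
    return ["ssh", node, job]


nodes = ["node01", "node02", "node03"]            # hypothetical hostnames
jobs = [f"./train --fold {i}" for i in range(5)]  # hypothetical job commands

# Print the commands rather than running them, so the sketch is safe to try anywhere.
for node, job in assign_jobs(jobs, nodes):
    print(shlex.join(ssh_command(node, job)))
```

Passing each argv list to `subprocess.Popen` would actually launch the jobs. A static round-robin assignment ignores load, so a real batch system would instead hand out the next job whenever a node becomes free.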
At TTI, we focused on the openMosix approach because this fits well with standard machine learning programs. In addition, starting at the ease-of-use end and then graduating to more difficult paradigms as necessary seems reasonable.
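From the user's side, the openMosix style of use amounts to starting several ordinary processes and waiting for them. The sketch below shows only that pattern. Nothing in it is openMosix-specific: any migration across nodes would be done by the kernel, and on an ordinary machine the processes simply run locally. The workload is a stand-in and the helper names are invented for illustration.

```python
import subprocess
import sys


def launch(commands):
    """Start every command as its own OS process and return the Popen handles."""
    return [subprocess.Popen(cmd) for cmd in commands]


def wait_all(procs):
    """Block until every process exits; return exit codes in launch order."""
    return [p.wait() for p in procs]


# Stand-in workload: each "experiment" is a small CPU-bound Python process
# that does no disk IO, which is the kind of process that distributes well.
commands = [
    [sys.executable, "-c", f"import sys; sum(i * i for i in range({n})); sys.exit(0)"]
    for n in (10_000, 20_000, 30_000)
]

exit_codes = wait_all(launch(commands))
print(exit_codes)  # [0, 0, 0]
```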

There are a couple of things which did not quite work out. Ideally, each of the nodes would be rackmounted (for ease of maintenance) and, except for the “master node”, use Ethernet boot on startup. The rackmounting was relatively easy, but the combination of Ethernet boot, openMosix, and Linux was frustrating. Instead, Don ordered some very small hard drives for each node and simply installed Linux on them. Another minor surprise is that the Opteron motherboard required a video card in order to boot.

In the end, the total cost was about $1000 per CPU and it took (perhaps) a man-week to set up. There are many caveats here: prices change rapidly, software improves, and how you want to use the cluster is important to consider. Nevertheless, this design point is hopefully of some help in calibrating your own designs. (Feel free to add in any of your own experience below.)

Several bits of progress have been made since Sanjoy pointed out the significant lack of theoretical understanding of active learning. This is an update on the progress I know of. As a refresher, active learning as meant here is:

There is a source of unlabeled data.
There is an oracle from which labels can be requested for unlabeled data produced by the source.
The goal is to perform well with minimal use of the oracle.
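A toy instance of this setup, as a hedged illustration: learning a one-dimensional threshold classifier by binary search over the sorted unlabeled points. It assumes noise-free labels from a true threshold (the realizable case) and is not any of the algorithms discussed in this post. The function names and the threshold 0.37 are made up for the example.

```python
import math
import random


def make_oracle(threshold):
    """Return a labeling oracle for a 1-D threshold classifier and a query counter."""
    count = {"queries": 0}

    def oracle(x):
        count["queries"] += 1
        return x >= threshold

    return oracle, count


def active_learn_threshold(points, oracle):
    """Binary-search the sorted unlabeled points for the first one labeled True.

    Returns a threshold consistent with the oracle on every point
    (math.inf if no point is labeled True).
    """
    xs = sorted(points)
    lo, hi = 0, len(xs)
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle(xs[mid]):
            hi = mid       # first positive point is at mid or earlier
        else:
            lo = mid + 1   # first positive point is after mid
    return xs[lo] if lo < len(xs) else math.inf


random.seed(0)
points = [random.random() for _ in range(1000)]  # the source of unlabeled data
oracle, count = make_oracle(threshold=0.37)      # hypothetical true threshold
learned = active_learn_threshold(points, oracle)
print(learned, count["queries"])
```

With 1000 unlabeled points the search needs at most 10 label requests, whereas labeling every point would take 1000, which is the kind of saving the goal above is after.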

Here is what I’ve learned:

Sanjoy has developed sufficient and semi-necessary conditions for active learning given the assumptions of IID data and “realizability” (that one of the classifiers is a correct classifier).
Nina, Alina, and I developed an algorithm for active learning relying on only the assumption of IID data. A draft is here.
Nicolo, Claudio, and Luca showed that it is possible to do active learning in an entirely adversarial setting for linear threshold classifiers here. This was published a year or two ago and I recently learned about it.

All of these results are relatively ‘rough’: they don’t necessarily make good algorithms as stated (although the last one has a few experiments). None of these results are directly comparable because the assumptions vary. Comparing the assumptions and the results leads to a number of remaining questions:

Do the sufficient and semi-necessary conditions apply to the IID-only case? The adversarial case?
Is there a generic algorithm for any hypothesis space that works in the fully adversarial setting?
What are special cases of these algorithms which are computationally tractable and useful?

The Foundations of Active Learning workshop at NIPS should be a good place to discuss these questions.


The design of a computing cluster

Progress in Active Learning