This is about the design of a computing cluster from the viewpoint of applied machine learning using current technology. We just built a small one at TTI so this is some evidence of what is feasible and thoughts about the design choices.
- Architecture There are several architectural choices.
- AMD Athlon64 based system. This seems to have the cheapest bang/buck. Maximum RAM is typically 2-3GB.
- AMD Opteron based system. Opterons provide the additional capability to buy an SMP motherboard with two chips, and the motherboards often support 16GB of RAM. The RAM is also the more expensive error correcting type.
- Intel PIV or Xeon based system. The PIV and Xeon based systems are the intel analog of the above 2. Due to architectural design reasons, these chips tend to run a bit hotter and be a bit more expensive.
- Dual core chips. Both Intel and AMD have chips that actually have 2 processors embedded in them.
In the end, we decided to go with option (2). Roughly speaking, the AMD system seemed like a better value than Intel. The opteron systems were desirable over the Athlon64 systems because halving the number of nodes aides system setup and maintenance while preserving about the same (or slightly more) cost/cpu. Another concern driving us towards the Opteron system was the ability to expand the RAM at a later date. In the last year or so, CPU speeds have not increased signficantly, instead dual core chips have come out. This trend seems likely to continue which implies the time to obselescence of a machine is driven by the maximum RAM capacity.
- Network Gigabit ethernet is cheap, easy, and even built into the motherboard.
- Operating System The options are
- Windows
- Linux
We chose Linux (and in particular the Fedora Core 3 variant) because Linux means cheaper, less licensing hassles, and you get to work with a system that has been used in clusters for much longer. An additional issue is the version of Linux:
- 32bit linux: The advantage here is everything “just works”. The disadvantage is that using more than 4GB of RAM is awkward and you lose out on some minor architectural speedups of 64bit mode.
- 64bit linux
We ended up choosing 32bit linux simply for stability and ease-of-setup reasons. The ease-of-setup was particularly compelling with respect to the openMosix choice (below). The exact variant of linux we used was a matter of some initial exploration determined by Don Coleman (TTIs master of machines). It is very plausible that we will want to switch to 64bit linux at some point in the future.
- Programming paradigm There are several paradigms for how to use a parallel machine. These are not exclusive.
- Use a standard language with a specialized parallel programming library such as mpich. This choice can result in a significant slowdown in programming time, but is necessary to eak every bit of performance out of the system.
- Use a batch control system to match jobs with nodes. There are several custom systems around for doing this, and it’s not hard to make up your own script. There is some learning curve here although it is necessarily smaller. With this approach, you can achieve near maximal performance as long as the individual processes do not need to communicate.
- Turn the cluster into a large virtual machine via openMosix and then simply launch several processes. This is the worst option performance-wise and the best option convenience-wise. To use it, you simply start processes and the system takes care of distributing them across the cluster. Processes must, however, be designed to minimize disk IO (as well as program IO) in order to achieve high efficiency computation.
At TTI, we focused on the openMosix approach because this fits well with standard machine learning programs. In addition, starting at the ease-of-use end and then graduating to more difficult paradigms as necessary seems reasonable.
There are a couple things which did not quite work out. Ideally, each of the nodes would be rackmounted (for ease of maintenance) and, except for the “master node”, use ethernet boot on startup. The rackmounting was relatively easy, but the combination of ethernet boot, openmosix, and linux was frustrating. Instead Don ordered some very small hard drives for each node and simply installed linux on them. Another minor surprise is that the opteron motherboard required a video card in order to boot.
In the end, the total cost was about $1000 per CPU and it took (perhaps) a man-week to setup. There are many caveats here—prices change rapidly, software improves, and how you want to use the cluster is important to consider. Nevertheless, this design point is hopefully of some help in calibrating your own designs. (Feel free to add in any of your own experience below.)