The design of a computing cluster

This post is about the design of a computing cluster from the viewpoint of applied machine learning using current technology. We just built a small one at TTI, so this is some evidence of what is feasible, along with thoughts on the design choices.

  1. Architecture. There are several architectural choices.
    1. AMD Athlon64-based system. This seems to offer the best bang for the buck. Maximum RAM is typically 2-3GB.
    2. AMD Opteron-based system. Opterons provide the additional capability to buy an SMP motherboard with two chips, and the motherboards often support 16GB of RAM. The RAM is also the more expensive error-correcting (ECC) type.
    3. Intel PIV or Xeon based system. The PIV and Xeon based systems are the Intel analog of the two options above. For architectural reasons, these chips tend to run a bit hotter and be a bit more expensive.
    4. Dual-core chips. Both Intel and AMD have chips that contain two processor cores each.
    5. In the end, we decided to go with option (2). Roughly speaking, the AMD systems seemed like a better value than Intel. The Opteron systems were preferable to the Athlon64 systems because halving the number of nodes aids system setup and maintenance while keeping the cost per CPU about the same (or slightly higher). Another concern driving us towards the Opteron system was the ability to expand the RAM at a later date. In the last year or so, CPU speeds have not increased significantly; instead, dual-core chips have come out. This trend seems likely to continue, which implies that the time to obsolescence of a machine is driven by its maximum RAM capacity.

  2. Network. Gigabit ethernet is cheap, easy, and even built into the motherboard.
  3. Operating System. The options are:
    1. Windows
    2. Linux

    We chose Linux (in particular the Fedora Core 3 variant) because it is cheaper, has fewer licensing hassles, and has been used in clusters for much longer. An additional issue is the version of Linux:

    1. 32-bit Linux: The advantage here is that everything “just works”. The disadvantage is that using more than 4GB of RAM is awkward and you lose out on some minor architectural speedups of 64-bit mode.
    2. 64-bit Linux: The reverse trade-off applies.
    3. We ended up choosing 32-bit Linux simply for stability and ease-of-setup reasons. The ease of setup was particularly compelling with respect to the openMosix choice (below). The exact variant of Linux we used was a matter of some initial exploration, determined by Don Coleman (TTI’s master of machines). It is very plausible that we will want to switch to 64-bit Linux at some point in the future.

  4. Programming paradigm. There are several paradigms for how to use a parallel machine. These are not exclusive.
    1. Use a standard language with a specialized parallel programming library such as mpich (sketched below). This choice can result in a significant slowdown in programming time, but is necessary to eke every bit of performance out of the system.
    2. Use a batch control system to match jobs with nodes. There are several custom systems around for doing this, and it’s not hard to make up your own script (sketched below). There is some learning curve here, although it is necessarily smaller than for option (1). With this approach, you can achieve near-maximal performance as long as the individual processes do not need to communicate.
    3. Turn the cluster into a large virtual machine via openMosix and then simply launch several processes (sketched below). This is the worst option performance-wise and the best option convenience-wise. To use it, you simply start processes and the system takes care of distributing them across the cluster. Processes must, however, be designed to minimize disk I/O (as well as program I/O) in order to achieve high-efficiency computation.

    At TTI, we focused on the openMosix approach because this fits well with standard machine learning programs. In addition, starting at the ease-of-use end and then graduating to more difficult paradigms as necessary seems reasonable.
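
    As an illustration of option (1), here is a minimal MPI sketch. The post mentions mpich driven from a standard language (typically C); purely for illustration this uses the mpi4py Python bindings, which is an assumption rather than necessarily what was used at TTI. Each process sums a slice of a toy workload and rank 0 collects the total.

      # Minimal MPI sketch: split a toy sum across processes, reduce to rank 0.
      # (mpi4py bindings assumed here for illustration only.)
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()   # this process's id, 0..size-1
      size = comm.Get_size()   # number of processes in the MPI job

      # Each process sums every size-th integer starting at its own rank.
      partial = sum(range(rank, 1000000, size))

      # Combine the partial sums on rank 0.
      total = comm.reduce(partial, op=MPI.SUM, root=0)
      if rank == 0:
          print("total =", total)

    Launched with something like mpirun -np 4, the same program runs as several coordinating processes, and the explicit coordination is where the extra programming time goes.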
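
    For option (2), a homemade dispatcher really can be a short script. The sketch below is only illustrative: it assumes passwordless ssh to every node, a shared filesystem, and a hypothetical worker script run_experiment.sh that takes a task id; real batch systems add queueing, fairness, and failure handling on top of this.

      # Bare-bones dispatcher: hand tasks to nodes round-robin over ssh.
      # Hostnames and the worker script are hypothetical placeholders.
      import subprocess

      nodes = ["node01", "node02", "node03", "node04"]
      tasks = ["./run_experiment.sh %d" % i for i in range(32)]

      running = []
      for i, task in enumerate(tasks):
          node = nodes[i % len(nodes)]              # round-robin assignment
          running.append(subprocess.Popen(["ssh", node, task]))
          if len(running) == len(nodes):            # one job per node at a time
              for p in running:
                  p.wait()
              running = []

      for p in running:                             # wait for the leftovers
          p.wait()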
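
    For option (3), using the cluster amounts to starting several processes on the master node and letting openMosix migrate them to idle nodes. A minimal sketch, assuming a hypothetical CPU-bound worker script train_model.py that does little disk I/O:

      # Under openMosix the cluster looks like one big machine: just start
      # processes and the kernel migrates them.  Heavy disk I/O pulls a
      # process back to its home node, which is why I/O should be minimized.
      import subprocess

      N_PROCESSES = 16                              # roughly one per CPU
      workers = [subprocess.Popen(["python", "train_model.py", "--seed=%d" % i])
                 for i in range(N_PROCESSES)]
      for w in workers:
          w.wait()

    No scheduler or explicit communication is involved, which is why this is the most convenient and the least controllable of the three approaches.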

There are a couple of things which did not quite work out. Ideally, each of the nodes would be rack-mounted (for ease of maintenance) and, except for the “master node”, use ethernet boot on startup. The rack-mounting was relatively easy, but the combination of ethernet boot, openMosix, and Linux was frustrating. Instead, Don ordered some very small hard drives for each node and simply installed Linux on them. Another minor surprise was that the Opteron motherboard required a video card in order to boot.

In the end, the total cost was about $1000 per CPU and it took (perhaps) a man-week to set up. There are many caveats here: prices change rapidly, software improves, and how you want to use the cluster is important to consider. Nevertheless, this design point is hopefully of some help in calibrating your own designs. (Feel free to add in any of your own experience below.)

8 Replies to “The design of a computing cluster”

  1. We have an old cluster, but the most popular (too popular ;->)
    machines in my group are two rack-mounted Penguin Computing dual
    Opteron 250 boxes with 8GB each (less than $6k each), with 64-bit Linux:

    Linux version 2.6.9-11.ELsmp (root@yort.fnal.gov) (gcc version 3.4.3
    20041212 (Red Hat 3.4.3-9.EL4)) #1 SMP Fri Jun 10 10:48:24 CDT 2005

    These work very, very well. No hardware problems ever, and just a few
    NFS problems in an earlier version that have gone away. They have
    small local disks for scratch space and booting, and a network-attached
    2TB RAID array for larger data.

    We are starting to have resource contention issues (we also have two older
    dual-processor Intel Xeon servers). 64-bit is essential for us; these NLP
    problems have lots of data and lots of features. OpenMosix looks interesting, but we will probably try a simpler batch submission system, like Sun’s open-source Grid Engine. For maintenance and reliability reasons, we prefer not to tamper with kernels.

  2. This is very useful, as I’m looking at building a cluster in the next few months. $1000ish/CPU is reasonably good news.

    Did you consider purchasing compute time from HP or Sun as an alternative? I’m wondering if this might be cost effective for me, especially because when I want to run something, I want a gazillion machines, but there’s a fair amount of time when I don’t need to run anything and my cluster would be sitting idle.

  3. No, we did not consider buying compute time, principally because we did not think of it as an option. It looks like Sun is charging $1/cpu-hour which means 1000 hours (= 41 days) is the break-even point neglecting maintenance costs. It seems plausible that we will load the machines at least an order of magnitude more than that, but this is principally because several different people will be using the cluster. I can see how cpu-hour purchases might be attractive for a single person with variable demand.

  4. Fernando Diaz alerted me to your cluster undertaking–congratulations on getting your system set up.

    We just installed a 30-node (60 processor) Xeon cluster. Each node has 4GB of RAM, and the cluster nodes share approximately 5TB of storage using PVFS. We’re using 64-bit Linux (Rocks) and scheduling jobs with Grid Engine. We designed this cluster based on experiences using Grid Engine on 14 processors with NFS shared storage.

    We chose not to use openMosix because we felt that fair job scheduling and queueing was extremely important. We also have a 16-way Sun SMP system, and students rarely use all the processors, because they try to keep a few free for other users that may have urgent work. Grid Engine takes care of fair scheduling, so users can feel free to queue up as much work as they like. It’s now common for one user to queue up 5000 jobs at once and let them run for a week.

    I wrote to Sun immediately after the Sun Grid was announced, asking for an account. They never wrote me back. I think Sun is currently looking for very big spenders only for their $1/hour grid system.

  5. Your comment is pretty interesting. I participated in the development of an HA cluster, but not an HPC cluster, and I am wondering how you make sure that, if the master node goes down, the other running nodes will be able to connect to a secondary node that takes over as master. I have been looking at OpenSSI, OpenMosix, OSCAR, and Rocks as well, and I am trying to define a topology that would give me both HA and HPC, with Rocks or OpenMosix as an example. So far, I have found that I would need two machines at the top (active/passive) for administering the cluster, with the nodes connecting to the master IP address; if the master fails, the slave takes over its IP so the nodes can continue to work as usual. From your experience, is this a good solution? Also, when using a SAN, is it possible to keep all of the master’s cluster filesystems on the SAN and mount the root filesystem from it?

    Thanks for your comments, Marc.

  6. Interesting stuff. I just built a test Maya 3D rendering cluster with 20 nodes (old Compaq 500MHz SFF PCs with 256MB of RAM) using DrQueue as the manager, Fedora Core 3, and an NFS share on the main server. After doing a few renders I realize that I need at least 2GB of RAM on each PC, but I’m interested to see whether I could just create one large openMosix virtual machine instead. Anyone heard of that being done with Maya? It would actually prove really beneficial, because presently with rendering clusters you pay a per-installation license of around $1000 for render-node software such as Mental Ray and RenderMan.

    brynn 🙂

  7. Hi Brynn,

    OpenSSI could be good for what you want to do; you could create a single system image (SSI) spanning many nodes.

    Also, OpenMosix could be good too, I guess; you can control how the process load is spread over the nodes, what memory to share, etc. It would probably work well.

    With regards,

    Marc.
