Optimizing Machine Learning Programs

Machine learning is often computationally bound, which means the ability to write fast code matters whenever you want to implement a machine learning algorithm. Basic tactical optimizations are covered well elsewhere, but I haven’t seen a reasonable guide to higher-level optimizations, which in my experience are the most important. Here are some of the higher-level optimizations I’ve often found useful.

  1. Algorithmic Improvement First. This is hard, but it is the most important consideration, and it typically yields the largest benefits. Good optimizations here are publishable. In the context of machine learning, you should be familiar with the arguments for online vs. batch learning.
  2. Choice of Language. There are many arguments about the choice of language. Sometimes you don’t have a choice, such as when interfacing with other people’s code. Personally, I favor C/C++ when I want to write fast code. This (admittedly) makes me a slower programmer than when using higher-level languages. (Sometimes I prototype in Ocaml.) Choosing the wrong language can result in large slowdowns.
  3. Avoid Pointer-Based Representations. The way you represent information in your program can have a dramatic impact on performance. My rule of thumb is “for fast programs, it’s all arrays in the end”. As an example, consider maps. STL provides map (a tree-based data structure) and hash_map (an array-of-pointers data structure). Where a hash_map works, it’s common to observe an order-of-magnitude improvement in performance over a map. (Sometimes you must futz with the hash function to observe this.) The Google dense_hash_map replaces the array of pointers with a plain old array and (unsurprisingly) is even faster.

    What’s fundamentally happening here is locality: dereferencing a pointer is a very expensive operation on modern computers because the CPU runs much faster than the latency to RAM. By converting everything into an array, you compute the location of data rather than dereference it. Converting things into arrays is not always easy, but it is often possible with a little thought and care. (A small benchmark sketch of this point appears after the list.)

  4. Cached Parsing. Fast algorithms are required for large quantities of data. Unfortunately, reading and parsing the data is often computationally intensive itself. By caching parsed examples in a machine representation format (either in RAM or on disk), you can achieve a substantial performance boost (sketched in code after the list). This comes from two sources:
    1. You avoid the work of parsing again.
    2. The machine representation can be more concise, which improves system caching effects.
  5. Deshuffle. Avoid copying information. It’s easy to end up copying data from one place to another. Commonly, the best implementation avoids this, which has strong implications for representation choice (see the no-copy tokenizer sketch after the list).
  6. Write less code. There are many reasons to write less code where you can. For the purposes of optimization, having less code in the bottleneck is the surest way to reduce work in the bottleneck. There are many techniques for doing this: some are purely syntactic transformations, while others involve redesigning the algorithm itself.
  7. Don’t trust libraries. In particular, don’t trust library calls inside the bottleneck. It’s often the case that a library function is general-purpose, while you can get away with a much faster hand-crafted (and inlined) function (an example sketch follows the list).
  8. Buffered disk I/O. There is a huge difference in performance between reading and writing the disk directly and doing so through a buffering layer. In general, it’s not obvious which I/O library functions buffer properly. You can experiment, or implement your own buffering system. C++ I/O libraries seem to handle this better than C libraries in general. (A comparison sketch follows the list.)
  9. Amortization. Amortization is a very powerful technique for algorithm optimization. The basic idea is to make sure that any secondary computation is amortized by your primary computation. For example, a quantity the secondary computation needs can often be maintained incrementally during the primary computation rather than recomputed from scratch (see the running-norm sketch after the list).
  10. Optimize while you wait. There is always a question about how much time should be spent on optimization vs. other aspects of programming. A reasonable rule of thumb is to spend time on optimization when you are waiting for the program to finish running. This is amortization, applied to yourself.
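
To make point 3 concrete, here is a minimal benchmark sketch (mine, not from the original post) that counts a stream of feature ids under all three representations. std::unordered_map stands in for the old STL hash_map; absolute numbers depend on your machine and compiler flags, but the map / hash / array ordering is what you should typically observe.

```cpp
// Counting ~10M occurrences of 2^20 feature ids three ways.
#include <chrono>
#include <cstdio>
#include <map>
#include <unordered_map>
#include <vector>

int main() {
  constexpr unsigned kFeatures = 1u << 20;  // power of two, so ids can be masked
  constexpr int kLookups = 10'000'000;

  std::map<unsigned, int> tree_counts;            // pointer-chasing tree
  std::unordered_map<unsigned, int> hash_counts;  // hashed buckets
  std::vector<int> array_counts(kFeatures, 0);    // plain old array

  // Time one counting loop and print the elapsed seconds.
  auto bench = [&](const char* name, auto&& body) {
    auto start = std::chrono::steady_clock::now();
    body();
    std::chrono::duration<double> s = std::chrono::steady_clock::now() - start;
    std::printf("%-9s %.3fs\n", name, s.count());
  };

  // A cheap deterministic stream of feature ids (LCG step + mask).
  auto next_id = [](unsigned& x) {
    x = x * 1664525u + 1013904223u;
    return x & (kFeatures - 1);
  };

  unsigned seed = 1;
  bench("map", [&] { for (int i = 0; i < kLookups; ++i) ++tree_counts[next_id(seed)]; });
  seed = 1;
  bench("hash_map", [&] { for (int i = 0; i < kLookups; ++i) ++hash_counts[next_id(seed)]; });
  seed = 1;
  bench("array", [&] { for (int i = 0; i < kLookups; ++i) ++array_counts[next_id(seed)]; });

  // Touch the results so the loops cannot be optimized away.
  std::printf("distinct ids: %zu %zu\n", tree_counts.size(), hash_counts.size());
}
```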
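Here is a sketch of point 4. The input format (one float per line) and the file handling are illustrative assumptions; the point is that the binary cache turns every later run’s parse into a single bulk read.

```cpp
// Parse a text file of examples once, then serve later runs from a
// binary cache. Assumed (hypothetical) text format: one float per line.
#include <cstdio>
#include <vector>

std::vector<float> load_examples(const char* text_path, const char* cache_path) {
  std::vector<float> values;
  // Fast path: a previous run already left a binary cache behind.
  if (FILE* cache = std::fopen(cache_path, "rb")) {
    size_t n = 0;
    if (std::fread(&n, sizeof n, 1, cache) == 1) {
      values.resize(n);
      if (std::fread(values.data(), sizeof(float), n, cache) != n)
        values.clear();  // truncated cache: fall back to reparsing
    }
    std::fclose(cache);
    if (!values.empty()) return values;
  }
  // Slow path: parse the text representation.
  if (FILE* text = std::fopen(text_path, "r")) {
    float v;
    while (std::fscanf(text, "%f", &v) == 1) values.push_back(v);
    std::fclose(text);
  }
  // Leave a cache behind so the next run takes the fast path.
  if (FILE* cache = std::fopen(cache_path, "wb")) {
    size_t n = values.size();
    std::fwrite(&n, sizeof n, 1, cache);
    std::fwrite(values.data(), sizeof(float), n, cache);
    std::fclose(cache);
  }
  return values;
}
```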
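Point 5 in code: one way to avoid copying is to choose a representation that refers into the original buffer rather than duplicating pieces of it. This hypothetical tokenizer hands back (pointer, length) views instead of freshly allocated strings.

```cpp
// Zero-copy tokenization: tokens are views into the caller's buffer.
#include <cstddef>
#include <vector>

struct Token {
  const char* data;  // points into the original buffer
  std::size_t len;
};

// Splits buf on spaces without copying; buf must outlive the tokens.
std::vector<Token> tokenize(const char* buf, std::size_t n) {
  std::vector<Token> out;
  std::size_t i = 0;
  while (i < n) {
    while (i < n && buf[i] == ' ') ++i;  // skip separators
    std::size_t start = i;
    while (i < n && buf[i] != ' ') ++i;  // scan one token
    if (i > start) out.push_back({buf + start, i - start});
  }
  return out;
}
```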
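For point 7, a sketch of the kind of hand-crafted replacement I have in mind: strtol and friends are general purpose (signs, bases, locales, error reporting), while a bottleneck that only ever sees validated non-negative decimal integers can get away with a few lines the compiler will happily inline.

```cpp
// Parse a run of decimal digits at *s, advancing s past them.
// No sign, base, overflow, or error handling: the input format is
// trusted, which is exactly why this can beat the library call.
#include <cstdint>

inline std::uint64_t fast_parse_uint(const char*& s) {
  std::uint64_t v = 0;
  while (*s >= '0' && *s <= '9')
    v = v * 10 + static_cast<std::uint64_t>(*s++ - '0');
  return v;
}
```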
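And a sketch of point 8, using POSIX open/read for the unbuffered version. Both functions count the bytes in a file; the first pays a system call per byte, the second pays one per 64KB buffer refill.

```cpp
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>

// Unbuffered: one read() system call per byte. Correct but very slow.
long count_unbuffered(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return -1;
  long n = 0;
  char c;
  while (read(fd, &c, 1) == 1) ++n;
  close(fd);
  return n;
}

// Buffered: stdio with an explicit 64KB buffer refilled in bulk.
long count_buffered(const char* path) {
  FILE* f = std::fopen(path, "rb");
  if (!f) return -1;
  std::setvbuf(f, nullptr, _IOFBF, 1 << 16);  // must precede the first read
  long n = 0;
  while (std::fgetc(f) != EOF) ++n;
  std::fclose(f);
  return n;
}
```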
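Finally, a small example of point 9. The squared norm of a weight vector (a secondary computation, useful for a regularizer or a stopping rule) is kept current by the primary computation (sparse updates), so it costs O(1) per touched coordinate instead of an O(d) pass whenever it is needed. The Weights struct here is hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct Weights {
  std::vector<double> w;
  double squared_norm = 0.0;  // maintained incrementally, never recomputed

  explicit Weights(std::size_t d) : w(d, 0.0) {}

  // Primary computation: update one coordinate. The secondary
  // computation rides along: ||w + d*e_i||^2 = ||w||^2 + 2*w[i]*d + d^2.
  void update(std::size_t i, double delta) {
    squared_norm += 2.0 * w[i] * delta + delta * delta;
    w[i] += delta;
  }
};
```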

In all program optimization, it is critical to know where the bottleneck is, and optimize preferentially there. This is fairly tricky on a modern computer because there are many ways that hidden latency can creep in. Tools like gprof and Valgrind can be helpful here, but there is no substitute for having a basic understanding of how a computer works.

Replies to “Optimizing Machine Learning Programs”

  1. I’d be interested in seeing some performance comparisons between C++ and Ocaml with some of the algorithms you’ve implemented. Do you have something like that kicking around? If not, do you have a “gut feel” of the difference?

  2. I might add one more to your list:

    * Profile first, optimize second.

    Too many times, I have optimized the code that I *thought* was the bottleneck, only to discover that the true bottleneck was elsewhere. For me, automatic profiling tools are indispensable, simply because I’m never *quite* sure what the limitations are of the language I’m using (this is particularly true of Matlab — I am often surprised at what is fast and what is slow in Matlab).

  3. A variant of the last, un-numbered point is the most important: don’t optimize until you’ve profiled. Modern compilers and CPU execution architectures are too clever to reason about intuitively or even on paper — can you really predict branch mispredictions and register-load misses? You have to start with good algorithms and reasonable choices of data structure for your architecture. And then test. A great book to read for a set of easy, relevant case studies in this area is Jon Bentley’s “Programming Pearls”. It uses C, so it’s easy to read.

    The reason you can write ML and Java programs that are fast is related to point 3 — it’s all arrays in the end for fast programs, and there’s no longer much difference between languages there. With arrays of primitives, Java needs no boxing and won’t do any garbage collection. And ML can do very aggressive static optimizations because its typing is tighter than C’s. That’s why OCaml tends to win most of the well-run races in array/loop processing (unless we’re on big parallel machines, and then Fortran tends to win, because it’s also pretty narrowly typed and has good compilers).

    For most of the runtime models I use (HMM decoders, language models, KNN classifiers, etc.), once I’ve optimized, the bottleneck winds up being front-side bus cycles, not CPU cycles. This concern will impact any MCMC or EM-type algorithm that requires online estimation. The best advice for helping with caching is to keep data together in memory if it’ll be processed together. For instance, it’s often worth an up-front matrix transposition so that either rows or columns are accessed contiguously.

    Point 6, “write less code”, relates to the common saying “the fastest code is the code that doesn’t execute”. Often, causing code not to execute requires writing other code to trap special conditions. In general, there’s a size/speed tradeoff where unrolling and similar transformations can speed up code.

    All of this advice is about to get pushed down one level of granularity. With multi-core CPUs, the trick is to be able to multi-thread programs or break them up at the process level. The next wave of tips is going to have to deal with multi-threaded caches, lock splitting for large models, and other concurrency issues. With clusters of loosely networked computers, the only programming technique that scales is divide and conquer (e.g. with map/reduce).

  4. I’ve never seen Ocaml be a factor of 10 slower. Occasionally, I’ve seen Ocaml be faster.

    There are a few things limiting Ocaml speed:
    (1) There is no -O flag as you might be familiar with from gcc/g++. This matters a lot with respect to inlining of functions and some of the low-level optimizations.
    (2) There are issues with fully using the hardware’s types. Sometimes you really want a float and sometimes you really want a double. Losing a bit of precision on each int so the garbage collector can tag values is sometimes quite painful.
    (3) Ocaml often boxes data, which forces you to deal with an extra pointer dereference.

  5. Ocaml has an -inline option which makes inlining more aggressive. I always use -inline 100.

    If you care about speed in Ocaml, the biggest problem is boxed float arrays. Use Bigarray and shell out to C for important things (like dot products). Yes, it’s lame that you have to do that, but…

  6. As far as point 3 is concerned, a map is slower than a hash_map because a map is a tree-based data structure, so operations typically take O(log N) time rather than the expected O(1) for a hash_map.

  7. For many interesting problems, optimizing everything to unboxed array-based implementations is simply prohibitively difficult. Consequently, for a given amount of developer time a tree-based solution is often much faster.

    If you’d like to learn more about this, read my posts about the performance of structs vs classes in F# and Set.union in OCaml.

  8. Hal, Ocaml float arrays are not boxed. They’re only boxed if you do something that makes Ocaml think they’re polymorphic. Also, you have the bigarray library, so you don’t have to write your own.
