Patterns for research in machine learning

There are a handful of basic code patterns that I wish I had been more aware of when I started research in machine learning. Each on its own may seem pointless, but collectively they go a long way towards making the typical research workflow more efficient. Here they are:

  1. Separate code from data.
  2. Separate input data, working data and output data.
  3. Save everything to disk frequently.
  4. Separate options from parameters.
  5. Do not use global variables.
  6. Record the options used to generate each run of the algorithm.
  7. Make it easy to sweep options.
  8. Make it easy to execute only portions of the code.
  9. Use checkpointing.
  10. Write demos and tests.

There is a longer discussion with examples for each item; as a small taste, a sketch of items 4 and 6 follows below. Also see Charles Sutton’s and HackerNews’ thoughts on the same topic.
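
As an illustration of items 4 and 6, here is a minimal sketch in Python: keep the options in one structure, separate from the learned parameters, and record them alongside each run. The directory layout and file names are illustrative choices, not something prescribed by the list.

```python
import json
import time
from pathlib import Path

def make_run_dir(options, working_root="working"):
    """Create a fresh directory for this run and record the options used."""
    run_dir = Path(working_root) / time.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    with open(run_dir / "options.json", "w") as f:
        json.dump(options, f, indent=2, sort_keys=True)
    return run_dir

# Options: everything a human chooses up front.
options = {"learning_rate": 0.01, "num_epochs": 50, "seed": 0}

# Parameters: everything the algorithm learns; kept and saved separately,
# e.g. as run_dir / "params.npz" once training has produced them.
run_dir = make_run_dir(options)
```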

My guess is that these patterns will be useful not only for machine learning, but also for any other computational work that involves either a) processing large amounts of data, or b) algorithms that take a significant amount of time to execute. Share this list with your students and colleagues. Trust me, they’ll appreciate it.

8 Replies to “Patterns for research in machine learning”

  1. I wanted to add my thoughts on these.

    (1) Definitely essential.
    (2) I’ve found using a Makefile to organize predictor building quite helpful.
    (3) On larger scale datasets, there is some value in being strategic about what you save to disk.
    (4) I’ve seen plenty of times when model evaluation code is rewritten. Storing only the model is critical.
    (5) I have sinned here in VW. The advantage of global variables is that they make coding faster. The disadvantage is that I now need to deglobalize variables to make a reentrant library. Overall, I’m not sure it was a mistake to use global variables because easing coding is quite valuable when you are experimenting. Nevertheless, it’s clear that the end state has zero global variables.
    (6) The Makefile approach nails this.
    (7) Makefile isn’t particularly good at sweeping options, but it’s not particularly terrible either.
    (8) Modularity is generally useful.
    (9) Makefile again helps here. I’d also add that it’s _critical_ that your code be (effectively) synchronous for easy debugging.
    (10) ‘make test’ is fantastic to make sure you didn’t screw things up.

    I’d add one more detail which seems important: start small. Chop down your dataset to something that takes ~10 seconds or less to run when you are fiddling around with making sure things work. I’ve seen many people with a large dataset launch a job and say “well, I can’t do anything until I get results tomorrow”—it results in an excruciatingly slow development cycle. A good learning algorithm will work well on a small dataset and better on a larger dataset.
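
    A minimal sketch of the “start small” idea in Python, assuming a hypothetical `--smoke-test` flag and data stored in a plain NumPy array (the file name is illustrative):

    ```python
    import argparse
    import numpy as np

    parser = argparse.ArgumentParser()
    parser.add_argument("--smoke-test", action="store_true",
                        help="train on a tiny subset so a run takes ~10 seconds")
    args = parser.parse_args()

    data = np.load("train.npy")  # illustrative input file

    if args.smoke_test:
        # Chop the dataset down to something small enough to iterate on quickly.
        rng = np.random.default_rng(0)
        idx = rng.choice(len(data), size=min(1000, len(data)), replace=False)
        data = data[idx]

    # ... run the usual training code on `data`, unchanged
    ```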

    1. Definitely agree about starting small.

      Another thing is that there is probably a whole other set of tricks about *debugging* ML algorithms. For example, if you are using continuous optimization, check whether your gradient function matches numerical differentiation of your objective function. Another example: One way to debug a complex algorithm is to check that it works in a simpler special case. e.g., If you were trying to debug a forward-backward implementation, you could check that it worked when there weren’t any observations or when the hidden states were actually independent.
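
      A minimal sketch of that gradient check, comparing an analytic gradient against central differences on a toy objective (all names here are illustrative):

      ```python
      import numpy as np

      def numerical_grad(f, x, eps=1e-6):
          """Central-difference approximation of the gradient of f at x."""
          g = np.zeros_like(x)
          for i in range(x.size):
              e = np.zeros_like(x)
              e.flat[i] = eps
              g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
          return g

      # Toy objective f(x) = ||x||^2, whose analytic gradient is 2x.
      f = lambda x: float(np.sum(x ** 2))
      analytic_grad = lambda x: 2 * x

      x = np.random.randn(5)
      assert np.allclose(analytic_grad(x), numerical_grad(f, x), atol=1e-4)
      ```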

  2. I’m a bit of a freak: enterprise software team lead during the day and neural network researcher during the evening. I would take a few pragmatic approaches to writing code, as I am doing with the next version of my algorithm suite. There is a set of principles called SOLID; not all of them are applicable, as they tend to deal with large enterprise applications.

    YAGNI – You Ain’t Gonna Need It. Don’t put anything into the code until you need it. You might think fleshing out a method ahead of time is a good idea, but chances are you won’t need it.
    Separation of Concerns – Split out the algorithm/model from the code that shows its output (the visualisation or view).
    Use Unit Tests – Rather than just tests in general, a unit test exercises a single method or function. If you have a special math class that does a simple calculation, write a load of tests for it (see the sketch after this list).
    Single Responsibility Principle – Every class, method and function should be responsible for one thing. Avoid having superman classes that can do everything.
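
    A minimal sketch of the unit-test point, using Python’s built-in `unittest` on a hypothetical little math class (a running mean):

    ```python
    import unittest

    class RunningMean:
        """Tiny 'special math class': incrementally tracks the mean of a stream."""
        def __init__(self):
            self.n = 0
            self.mean = 0.0

        def update(self, x):
            self.n += 1
            self.mean += (x - self.mean) / self.n
            return self.mean

    class TestRunningMean(unittest.TestCase):
        def test_single_value(self):
            m = RunningMean()
            self.assertEqual(m.update(3.0), 3.0)

        def test_sequence(self):
            m = RunningMean()
            for x in [1.0, 2.0, 3.0]:
                m.update(x)
            self.assertAlmostEqual(m.mean, 2.0)

    if __name__ == "__main__":
        unittest.main()
    ```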

    I am sure there are more; I might have to write a post about it!
