There are a handful of basic code patterns that I wish I was more aware of when I started research in machine learning. Each on its own may seem pointless, but collectively they go a long way towards making the typical research workflow more efficient. Here they are:
- Separate code from data.
- Separate input data, working data and output data.
- Save everything to disk frequently.
- Separate options from parameters.
- Do not use global variables.
- Record the options used to generate each run of the algorithm.
- Make it easy to sweep options.
- Make it easy to execute only portions of the code.
- Use checkpointing.
- Write demos and tests.
Click here for discussion and examples for each item. Also see Charles Sutton’s and HackerNews’ thoughts on the same topic.
My guess is that these patterns will not only be useful for machine learning, but also any other computational work that involves either a) processing large amounts of data, or b) algorithms that take a significant amount of time to execute. Share this list with your students and colleagues. Trust me, they’ll appreciate it.
Yaser points out some nicely videotaped machine learning lectures at Caltech. Yaser taught me machine learning, and I always found the lectures clear and interesting, so I expect many people can benefit from watching. Relative to Andrew Ng‘s ML class there are somewhat different areas of emphasis but the topic is the same, so picking and choosing the union may be helpful.
Since we last discussed the other online learning, Stanford has very visibly started pushing mass teaching in AI, Machine Learning, and Databases. In retrospect, it’s not too surprising that the next step up in serious online teaching experiments are occurring at the computer science department of a university embedded in the land of startups. Numbers on the order of 100000 are quite significant—similar in scale to the number of computer science undergraduate students/year in the US. Although these populations surely differ, the fact that they could overlap is worth considering for the future.
It’s too soon to say how successful these classes will be and there are many easy criticisms to make:
- Registration != Learning … but if only 1/10th complete these classes, the scale of teaching still surpasses the scale of any traditional process.
- 1st year excitement != nth year routine … but if only 1/10th take future classes, the scale of teaching still surpasses the scale of any traditional process.
- Hello, cheating … but teaching is much harder than testing in general, and we already have recognized systems for mass testing.
- Online misses out … sure, but for students not enrolled in a high quality university program, this is simply not a relevant comparison. There are also benefits to being online as well, as your time might be better focused. Anecdotally, at Caltech, they let us take two classes at the same time, which I did a few times. Typically, I had a better grade in the class that I skipped as the instructor had to go through things rather slowly.
- Where’s the beef? The hard nosed will want to know how to make money, which is always a concern. But, a decent expectation is that if you first figure out how to create value, you’ll find some way to make money. And, if you first wait until it’s clear how to make money, you won’t make any.
My belief is that this project will pan out, with allowances for the expected inevitable adjustments that you learn to make from experience. I think the critics miss an understanding of what’s possible and what motivates people.
The prospect of teaching 1 student means you might review some notes. The prospect of teaching ~10 students means you prepare some slides. The prospect of teaching ~100 students means you polish your slides well, trying to anticipate questions, and hopefully drawing on experience from previous presentations. I’ve never directly taught ~1000 students, but at that scale you must try very hard to make the presentation perfect, including serious testing with dry runs. 105 students must make getting out of bed in the morning quite easy.
Stanford has a significant first-mover advantage amongst top research universities, but it’s easy to imagine a few other (but not many) universities operating at a similar scale. Those that have the foresight to start a serious online teaching program soon will have a chance of being among the few. For other research universities, we can expect boutique traditional classes to continue for some time. These boutique classes may have some significant social value, because it’s easy to imagine that the few megaclasses miss important things in developing research areas. And for everyone working at teaching universities, someone is eating your lunch.
(Cross posted at CACM.)
Ron Bekkerman initiated an effort to create an edited book on parallel machine learning that Misha and I have been helping with. The breadth of efforts to parallelize machine learning surprised me: I was only aware of a small fraction initially.
This put us in a unique position, with knowledge of a wide array of different efforts, so it is natural to put together a survey tutorial on the subject of parallel learning for KDD, tomorrow. This tutorial is not limited to the book itself however, as several interesting new algorithms have come out since we started inviting chapters.
This tutorial should interest anyone trying to use machine learning on significant quantities of data, anyone interested in developing algorithms for such, and of course who has bragging rights to the fastest learning algorithm on planet earth
(Also note the Modeling with Hadoop tutorial just before ours which deals with one way of trying to speed up learning algorithms. We have almost no overlap.)
At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered.
There are many different kinds of scaling.
- Scaling in examples This is the most basic kind of scaling.
- Online Gradient Descent This is an old algorithm—I’m not sure if anyone can be credited with it in particular. Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function of which squared loss, logistic loss, Hinge Loss, and Quantile Loss are all worth covering. It’s important to cover the semantics of these loss functions as well. Vowpal Wabbit is a reasonably fast codebase implementing these.
- Second Order Gradient Descent methods For some problems, methods taking into account second derivative information can be more effective. I’ve seen preconditioned conjugate gradient work well, for which Jonathan Shewchuck‘s writeup is reasonable. Nando likes L-BFGS which I don’t have much experience with.
- Map-Reduce I have a love-hate relationship with the Map-Reduce framework. In my experience, it’s an excellent filesystem, but it’s quite frustrating to do machine learning with, since it encourages the parallelization of slow learning algorithms. I liked what Markus said at the LCCC workshop: nobody wants to give up on the idea of distributed fault tolerant storage and moving small amounts of code to large amounts of data rather than vice-versa. The best way to use this for Machine Learning isn’t yet clear to me. Hadoop is probably the most commonly used open source implementation of Map-Reduce.
- Feature Scaling—what do you do when you have very many features?
- Hashing approaches are surprisingly effective in my experience. It’s a good idea to also present Bloom Filters, as they help with the intuition of where this works substantially.
- Online l1 regularization is via truncated gradient. See Bob Carpenter’s discussion. John Duchi’s composite mirror descent generalization is also a useful general treatment.
- Boosting based approaches can also be effective, although training time can become problematic. This is partially mitigated by parallelization algorithms as discussed at the LCCC workshop See Jerry Ye’s talk and Krysta’s talk.. A really good public implementation of this is so far missing, as far as I know.
- Test-time Evaluation Ultrafast and efficient test-time evaluation seems to be a goal independent of training.
- Indicies One way to speed things up is with inverted indicies. Aside from the basic datastructure, I’d cover WAND and predictive indexing.
- GPU The use of GPU’s to make evaluation both more efficient and fast seems to make sense in many applications.
- Sourcing Just acquiring sufficient label information can be problematic.
- Mechanical Turk can be an efficient approach. The basic approach can be improved with some work.
- When you are paying directly for labels, active learning approaches can substantially cut your costs. Burr Settles active learning survey is pretty comprehensive, although if I was to cover one algorithm, it would be this one which enjoys a compelling combination of strong theoretical guarantees, computational tractability, empirical performance, and generality.
- The other common approach is user-feedback information where bias and exploration effects becomes a critical concern. The tutorial Alina and I did on learning and exploration is critical here.
- Many Labels It’s common to need to make a complex prediction.
- Label Tree based approaches are a good starting point. I’d discuss the inconsistency of the naive approach and the Filter Tree, discussed here. Online tree building and conditional probability trees are also potentially extremely useful. Building smarter trees can help, such as with a confusion matrix or in an iterative fashion.
- Label Tree approaches breakdown when the number of labels becomes so large that filtering eliminates too many examples. Here Structured Prediction techniques become particularly important. I’d cover Searn as well as some of Drew Bagnell‘s work such as this one. Many other people are still interested in CRFs or Max-Margin Markov Networks which I find somewhat less compelling for computational reasons.
- Cascade Learning is also a compelling approach. The canonical paper on this is the Viola-Jones Face Detector. I’m sure there’s much other vision-related work on cascades that I’m unfamiliar. A more recent instance is the structured prediction cascade.
What else is essential and missing?