Sebastien Bubeck has a new ML blog focused on optimization and partial feedback which may interest people.
Yann and I have arranged so that people who are interested in our large scale machine learning class and not able to attend in person can follow along via two methods.
- Videos will be posted with about a 1 day delay on techtalks. This is a side-by-side capture of video+slides from Weyond.
- We are experimenting with Piazza as a discussion forum. Anyone is welcome to subscribe to Piazza and ask questions there, where I will be monitoring things. update2: Sign up here.
The first lecture is up now, including the revised version of the slides which fixes a few typos and rounds out references.
The From Data to Knowledge workshop May 7-11 at Berkeley should be of interest to the many people encountering streaming data in different disciplines. It’s run by a group of astronomers who encounter streaming data all the time. I met Josh Bloom recently and he is broadly interested in a workshop covering all aspects of Machine Learning on streaming data. The hope here is that techniques developed in one area turn out useful in another which seems quite plausible. Particularly if you are in the bay area, consider checking it out.
Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy.
Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output a optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation.
It is necessary but not sufficient to have an effective communication infrastructure. It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. It is necessary but not sufficient to have a good optimization approach. The combination says “yikes”, because you need to know many things to design an effective new system.
- MPI suffers because it has no fault tolerance by default and because it has a poor understanding of where data is, implying that data must be either manually placed on local nodes, or the first step in every computation is “partition the data across the cluster” which is very undesirable from a communication complexity and programming complexity standpoint. These significantly limit the scale that you can work at to ~100 nodes in practice, because the economics of clusters make sharing unavoidable at larger scales. When the cluster is shared, preshuffling the data is awkward to impossible and you must expect that some nodes will run slower than others because they will be executing other jobs. This limitation on reliability kicks in much sooner than disk read failures or node failures.
- MapReduce suffers because the setup and teardown costs are significant. Measured directly, this is often on the order of a minute, associated with interacting with the scheduler and communicating the program to a large number of nodes. But indirectly, this can be radically worse, as any map-reduce job can be held in limbo while waiting for free nodes to work on. And commonly we need to execute many MapReduce iterations to achieve high quality prediction.
MapReduce has another more subtle flaw: using it requires refactoring your code into a sequence of map and reduce operations. This is significantly annoying, because right good learning algorithms is pretty difficult in the first place. MapReduce has a third flaw: it encourages inefficient optimization paradigm. In particular, while you can phrase many learning algorithms as statistical query learning algorithms, doing so is energy inefficient, up to O(examples) in extreme cases.
Since the drawbacks of MPI and MapReduce differ, we can try to create a solution which eliminates all of drawbacks, which a Hadoop-compatible AllReduce does. Cherry picking from each we get:
- MPI: The Allreduce function. The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers.
- MapReduce: Conceptual simplicity. One easy to understand function is enough.
- MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
- MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
- MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
- MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
- MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be walked around, but that’s annoying.
- MapReduce: Robustness. We don’t captures all the robustness of MapReduce which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures giving you something with >99% success rate on a running time reliable to within a factor of 2.
One function (all_reduce) is not a programming language. But since it’s written in C, it is easily encapsulated and added to any existing programming language giving you a complete language. To test this hypothesis, I visited Clement for a day, where we connected things to make Allreduce work in Lua twice—once with an online approach and once with an LBFGS optimization approach for convolutional neural networks. As a parallel programming paradigm, it’s amazingly easier than many other approaches, because you take your existing code and figure out which pieces of state to synchronize. It’s superior enough that I’ve now eliminated the multithreaded and parallel online learning approaches within Vowpal Wabbit. This approach is also great in terms of the amount of incremental learning required—you just need to learn one function to be able to create useful parallel machine learning algorithms. The only thing easier than learning one function is learning none, which you can do for linear prediction by just using VW. Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine.
You also need to get optimization approaches right. Two canonical but very different optimization algorithms are stochastic gradient descent and LBFGS. Understanding the weaknesses of these algorithm is critical even though often not discussed by their proponents. SGD approaches tend to have two drawbacks: the right choice of various hyperparameters can be annoying. We’ve mostly eliminated this drawback in VW using a learning rate that is tuned to automatically work in various ways. The other drawback is that they generally aren’t great at dealing with noise. This is tricky to deal with in general, because the algorithms only see one example at a time. Leon Bottou is working to eliminate this last drawback, but my impression is that we’re not quite there yet. LBFGS on the other hand is great at dealing with noise but suffers significantly in it’s early convergence rate where SGD is extremely effective. Again, we can combine these approaches in an obvious way: use online learning at the beginning to warmstart LBFGS to integrate out the noise. In practice, the online learning gets you 95%-99% of the way there and then LBFGS nails the last bit of performance.
For the problem I mentioned at the beginning, we can learn in about an hour using a kilonode, implying an overall throughput of 500 megafeatures/s, which is about a factor of 5 faster than any single network interface (1 gigabit/s). This is substantially greater scaling than any of the other algorithms in the Scaling up Machine Learning book (see here for a comparison).
The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within it’s scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:
- For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
- For smaller datasets with computationally intense models, GPU approaches seem very compelling.
- For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug.
- The model needs to fit into memory. If that’s not the case, then other approaches are required.
I also expect Hadoop Allreduce is useful across many more tasks than just machine learning. Optimization problems are an easy example, but I suspect there are a number of iterative computation problems where allreduce can be very effective. While it might appear a limited operation, you can easily do average, weighted average, max, etc… And, the scope of allreduce is also easily broadened with an arbitrary reduce function, as per MPI’s version. The Allreduce code itself is not yet native in Hadoop, so you’ll need to grab it from the VW source code which has a BSD license. I’ve been encouraged by discussions with Milind suggesting it may become native soon.
Update: CACM Crosspost.
Since we last discussed the other online learning, Stanford has very visibly started pushing mass teaching in AI, Machine Learning, and Databases. In retrospect, it’s not too surprising that the next step up in serious online teaching experiments are occurring at the computer science department of a university embedded in the land of startups. Numbers on the order of 100000 are quite significant—similar in scale to the number of computer science undergraduate students/year in the US. Although these populations surely differ, the fact that they could overlap is worth considering for the future.
It’s too soon to say how successful these classes will be and there are many easy criticisms to make:
- Registration != Learning … but if only 1/10th complete these classes, the scale of teaching still surpasses the scale of any traditional process.
- 1st year excitement != nth year routine … but if only 1/10th take future classes, the scale of teaching still surpasses the scale of any traditional process.
- Hello, cheating … but teaching is much harder than testing in general, and we already have recognized systems for mass testing.
- Online misses out … sure, but for students not enrolled in a high quality university program, this is simply not a relevant comparison. There are also benefits to being online as well, as your time might be better focused. Anecdotally, at Caltech, they let us take two classes at the same time, which I did a few times. Typically, I had a better grade in the class that I skipped as the instructor had to go through things rather slowly.
- Where’s the beef? The hard nosed will want to know how to make money, which is always a concern. But, a decent expectation is that if you first figure out how to create value, you’ll find some way to make money. And, if you first wait until it’s clear how to make money, you won’t make any.
My belief is that this project will pan out, with allowances for the expected inevitable adjustments that you learn to make from experience. I think the critics miss an understanding of what’s possible and what motivates people.
The prospect of teaching 1 student means you might review some notes. The prospect of teaching ~10 students means you prepare some slides. The prospect of teaching ~100 students means you polish your slides well, trying to anticipate questions, and hopefully drawing on experience from previous presentations. I’ve never directly taught ~1000 students, but at that scale you must try very hard to make the presentation perfect, including serious testing with dry runs. 105 students must make getting out of bed in the morning quite easy.
Stanford has a significant first-mover advantage amongst top research universities, but it’s easy to imagine a few other (but not many) universities operating at a similar scale. Those that have the foresight to start a serious online teaching program soon will have a chance of being among the few. For other research universities, we can expect boutique traditional classes to continue for some time. These boutique classes may have some significant social value, because it’s easy to imagine that the few megaclasses miss important things in developing research areas. And for everyone working at teaching universities, someone is eating your lunch.