At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered.
There are many different kinds of scaling.
- Scaling in examples. This is the most basic kind of scaling.
- Online Gradient Descent. This is an old algorithm; I'm not sure anyone in particular can be credited with it. Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function, of which squared loss, logistic loss, hinge loss, and quantile loss are all worth covering. It's important to cover the semantics of these loss functions as well (see the sketch after this list). Vowpal Wabbit is a reasonably fast codebase implementing these.
- Second Order Gradient Descent methods. For some problems, methods taking into account second derivative information can be more effective. I've seen preconditioned conjugate gradient work well, for which Jonathan Shewchuk's writeup is reasonable. Nando likes L-BFGS, which I don't have much experience with.
- Map-Reduce. I have a love-hate relationship with the Map-Reduce framework. In my experience, it's an excellent filesystem, but it's quite frustrating to do machine learning with, since it encourages the parallelization of slow learning algorithms. I liked what Markus said at the LCCC workshop: nobody wants to give up on the idea of distributed fault-tolerant storage and moving small amounts of code to large amounts of data rather than vice versa. The best way to use this for machine learning isn't yet clear to me. Hadoop is probably the most commonly used open source implementation of Map-Reduce.
- Feature Scaling. What do you do when you have very many features?
- Hashing approaches are surprisingly effective in my experience (the sketch after this list illustrates the basic idea). It's a good idea to also present Bloom filters, as they substantially help with the intuition for where this works.
- Online L1 regularization can be done via the truncated gradient. See Bob Carpenter's discussion. John Duchi's composite mirror descent generalization is also a useful general treatment.
- Boosting based approaches can also be effective, although training time can become problematic. This is partially mitigated by parallelization algorithms, as discussed at the LCCC workshop (see Jerry Ye's talk and Krysta's talk). A really good public implementation of this is so far missing, as far as I know.
- Test-time Evaluation. Ultrafast and efficient test-time evaluation seems to be a goal independent of training.
- Indices. One way to speed things up is with inverted indices. Aside from the basic data structure, I'd cover WAND and predictive indexing.
- GPU. The use of GPUs to make evaluation both fast and efficient seems to make sense in many applications.
- Labels
- Sourcing. Just acquiring sufficient label information can be problematic.
- Mechanical Turk can be an efficient approach. The basic approach can be improved with some work.
- When you are paying directly for labels, active learning approaches can substantially cut your costs. Burr Settles' active learning survey is pretty comprehensive, although if I were to cover one algorithm, it would be this one, which enjoys a compelling combination of strong theoretical guarantees, computational tractability, empirical performance, and generality.
- The other common approach is user-feedback information, where bias and exploration effects become a critical concern. The tutorial Alina and I did on learning and exploration is critical here.
- Many Labels. It's common to need to make a complex prediction.
- Label Tree based approaches are a good starting point. I’d discuss the inconsistency of the naive approach and the Filter Tree, discussed here. Online tree building and conditional probability trees are also potentially extremely useful. Building smarter trees can help, such as with a confusion matrix or in an iterative fashion.
- Label Tree approaches break down when the number of labels becomes so large that filtering eliminates too many examples. Here Structured Prediction techniques become particularly important. I'd cover Searn as well as some of Drew Bagnell's work such as this one. Many other people are still interested in CRFs or Max-Margin Markov Networks, which I find somewhat less compelling for computational reasons.
- Cascade Learning is also a compelling approach. The canonical paper on this is the Viola-Jones Face Detector. I'm sure there's much other vision-related work on cascades that I'm unfamiliar with. A more recent instance is the structured prediction cascade.
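To make the online gradient descent and feature hashing items above concrete, here is a minimal sketch of the kind of learner I have in mind. It is not Vowpal Wabbit itself, and names like `hash_features` and `LOSS_GRAD` are purely illustrative: hashed sparse features feed an online gradient descent update with a pluggable loss.

```python
# A minimal sketch (not Vowpal Wabbit): online gradient descent over hashed
# sparse features with a pluggable loss function. All names here are
# illustrative, not from any particular library.
import math
from zlib import crc32

NUM_BITS = 18                      # 2^18 weights; memory is fixed by this choice
WEIGHTS = [0.0] * (1 << NUM_BITS)

def hash_features(pairs):
    """Map (name, value) pairs to (index, value) pairs via a hash function."""
    mask = (1 << NUM_BITS) - 1
    return [(crc32(name.encode()) & mask, value) for name, value in pairs]

# Gradients of common losses with respect to the prediction p, given label y.
LOSS_GRAD = {
    "squared":  lambda p, y: p - y,
    "logistic": lambda p, y: -y / (1.0 + math.exp(y * p)),   # y in {-1, +1}
    "hinge":    lambda p, y: -y if y * p < 1.0 else 0.0,     # y in {-1, +1}
}

def predict(features):
    return sum(WEIGHTS[i] * v for i, v in features)

def update(features, label, loss="logistic", learning_rate=0.1):
    """One online gradient descent step on a single example."""
    p = predict(features)
    g = LOSS_GRAD[loss](p, label)
    for i, v in features:
        WEIGHTS[i] -= learning_rate * g * v
    return p

# Usage: a single pass over a stream of examples.
example = hash_features([("user=alice", 1.0), ("query=ml", 1.0)])
update(example, +1.0, loss="logistic")
```

The point of the hashing is that memory is bounded by the number of hash bits rather than the number of distinct feature names, and swapping the loss changes the semantics of the prediction without touching the update loop.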
What else is essential and missing?
A few things:

Since "large scale learning" is essentially a synonym for linear modeling, complex relationships between the dependent and independent variables usually need to be captured through some sort of feature creation. While this often isn't sexy, feature hacking has been, in my experience, the difference between a genuinely applicable model and one that isn't practically useful.

I'd be pretty shocked if someone was able to build a genuinely large scale (that is, large in the number of examples) ML system using labeled data exclusively from M-turk. While M-turk does allow the collection of data quickly and cheaply, millions of labeled examples is probably beyond its reasonable scope. Current kernel SVM software can handle 100k examples of 5-10k dimensions on a reasonable server in an acceptable amount of time, for some notions of acceptability 🙂

Finally, I've always considered active learning to be a way to AVOID large-scale learning. Since one typically has to pay (in one way or another) for all of the examples, active learning is a good way to reduce total costs. Because active learning can often build a model achieving a given level of performance using far fewer examples than simply sampling randomly from the problem space, both money and time can be saved. Smaller training sets can often enable more complex models to be employed, again giving expressive power beyond what may be achieved with some simple modeling paradigm.

Co-training and semi-supervised learning seem to offer similar promises, but I haven't explored these directions sufficiently (as in, I've never attempted to deploy such a system in practice), and am unable to offer any concrete feedback.
I think asynchronous online algorithms (your paper at NIPS, "Slow learners are fast", and relevant stuff from other groups) definitely belong in a course on this topic. They're a poor fit for MapReduce (and therefore pedagogically interesting as an example of where MR is problematic), but they're absolutely the way to go for a lot of real problems where you want to scale up to large numbers of training examples.
Just a quick comment on L_1: if you want to do it well in an online setting, the recent paper "Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization" by Brendan McMahan is a must-read.
I would also add the “Weighted Sums of Random Kitchen Sinks” paper under the Feature Scaling category.
as it happens i'm running a seminar on "large scale stuff" this semester, where much of the "stuff" is learning-related. there's a lot of overlap between what you listed and what i have. i added: random kernel methods, streaming approaches for data mining, randomized matrix computations, large scale graphical models stuff, fancy data structures, and efficient nearest neighbor approaches.
but thanks for your list — i’ll append it to mine!!!
http://mahout.apache.org/ – You may or may not know about this project. It's an effort to build common ML algorithms on top of the Hadoop stack. Some pretty clever people are working on this, and they appear to have made some really great progress. I haven't yet had the opportunity to try it myself, but I'm looking forward to investigating.
There is a free book on mining massive datasets. It's oriented towards web apps, but some of the underlying ideas might be useful to pick out:
http://infolab.stanford.edu/~ullman/mmds.html
Regarding having too many features, the working draft at
http://www.stanford.edu/~boyd/papers/distr_opt_stat_learning_admm.html
describes a distributed method for splitting up model fitting problems by features rather than training examples, in addition to the usual splitting by training examples. (The approach to feature splitting in the currently available version of the paper is a little complicated, but there’s a much simpler method that will be included in the final publication.) I don’t know whether or not it’s reasonable to include in a course, but some people may find it of interest regardless, as it’s different from the three you list.
Many good comments. A few thoughts:
(1) I believe M-turk was used to create datasets of size ~10^6 for ImageNet. Also, discussing Luis von Ahn's games may be worthwhile.
(2) Streaming datamining is pretty cool, and seems plausibly worthwhile to me. My experience with the cover tree is that the applicability of fast nearest neighbors is somewhat delicate in large-D spaces. You really need to have substantial metric structure for it to work. In some applications, that is lacking.
(3) I’m aware of Mahout, but haven’t yet seen it do things that impress me. What is the best example of something impressive?
(4) Feature-based partitioning is built into VW, and it’s also common to several parallel decision tree implementations. In the long run, I believe this is the way to go, although substantial effort is required to prove it.
Hmm, in my (perhaps biased) view, I also think message passing on factor graphs should go on there. Message passing is incredibly scalable, distributes nicely and allows you to go beyond the linear model. I really think message passing is sometimes a bit undervalued.
What about scaling model structure? If you think of simple classifiers, the next step is sequence classifiers, like HMMs/CRFs. Then parsers, like PCFGs. Then models that combine tagging and parsing with clustering (such as for coreference in natural language processing — check out the LDA-like approach of Haghighi and Klein for a good starting point).
What about scaling model complexity? We’re working on multilevel logistic regressions with only a dozen or so basic predictors (like income, age, etc.), but we’re using lots of interacted features (like the Red-State/Blue-State example in Gelman and Hill’s regression book, only with more features and deeper interactions). Things get complicated because the hyperpriors are covariance matrices which themselves have priors.
What about scaling from point estimates to sampling? For many problems I work on, such as gene splice variant expression, there is substantial covariance among the model parameters. This has serious repercussions for inferences of interest to biologists, such as the differential expression between two variants in different conditions. This problem also requires scaling alignment to 100s of millions of sequences for the current generation of sequencers.
Large-scale recommendation systems would be nice and could tie into probabilistic matrix factorization (one topic Hal's suggesting). Though even Netflix wasn't that large (it was designed so the data could easily fit into a few hundred MB with sensible data structures).
Although also not quite learning per se, factor models are being used more frequently for generating features in a semi-supervised fashion (e.g. the work ranging from Brown et al. at IBM to Ando and Zhang, also at IBM, plus later “deep belief net” stuff from Hinton et al., which also scales structure).
If you're going to do something like KNN, there are all sorts of locality-sensitive hashing tricks and other data structures that are fun to scale; the Rajaraman and Ullman book cited above has a good section on LSH. They also cover document deduplication, which is another nice problem to think about scaling (though, again, more like clustering than machine learning).
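As a concrete illustration (my own sketch, not from the book, with arbitrary choices for DIM and NUM_PLANES), random-hyperplane LSH buckets vectors by a short bit signature so that a query only scans its own bucket rather than the whole collection:

```python
# A minimal random-hyperplane LSH sketch: vectors whose signatures collide are
# likely to be close in cosine distance, so they become candidate neighbors.
import random
from collections import defaultdict

DIM = 100          # dimensionality of the input vectors (assumed)
NUM_PLANES = 16    # signature length in bits

random.seed(0)
PLANES = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def signature(vector):
    """One bit per random hyperplane: which side of the plane is the vector on?"""
    bits = 0
    for plane in PLANES:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0.0 else 0)
    return bits

index = defaultdict(list)   # signature -> list of item ids

def add(item_id, vector):
    index[signature(vector)].append(item_id)

def candidates(vector):
    """Approximate nearest-neighbor candidates: items sharing the query's bucket."""
    return index[signature(vector)]
```

In practice one would use several such tables and tune the number of planes to trade recall against the number of candidates scanned.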
Andrew Ng should know that you can use Mech Turk for substantial machine learning data set collection — he co-authored a paper with Rion Snow et al. on the topic. It's surprisingly easy to collect a huge pile of labeled data — a summer intern collected 50K morphological stems in a few weeks with Mech Turk and 200K token labels for named-entity recognition, both of which turned out to be better than standards we could have bought from the Linguistic Data Consortium.
What about confirming the predictions? For our predictive modeling we are working with k-fold cross-validation, but also statistical testing to a desired confidence level. How should testing be implemented in large scale learning?
There is good news here. Testing is typically trivially parallelizable, and the amount of data that you must hold out to get an epsilon-accurate estimate of predictive power is independent of training set size.
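Concretely, for a loss bounded in [0,1] and n held-out i.i.d. examples, the standard Hoeffding-style calculation behind that claim is:

```latex
\Pr\bigl(|\hat{L}_n - L| > \epsilon\bigr) \le 2 e^{-2 n \epsilon^2},
\qquad\text{so}\qquad
n \ge \frac{\ln(2/\delta)}{2\epsilon^2}
\ \text{ gives } \ \Pr\bigl(|\hat{L}_n - L| > \epsilon\bigr) \le \delta .
```

The required n depends only on epsilon and delta, not on the training set size.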
“I have a love-hate relationship with the Map-Reduce framework. In my experience, it’s an excellent filesystem”
Sorry, but this comment seems misguided. MapReduce itself has little to do with the distributed filesystem it operates on—although it certainly needs it to work well. MapReduce is an excellent abstraction for distributing computation, more powerful than it appears at first sight. We have been using it for large scale language modeling, among many other applications, to do modeling on very large amounts of data. Scaling up with the amount of data is just one side; the ability to estimate and use very large models (that are themselves distributed) is the other very important side of it. It enables the use of 100-1000 times more training data and the estimation of models that are 100-1000 times larger (as measured by the number of features/parameters in the model) than what can be done normally on a single machine/CPU. This typically leads to improvements in performance on task specific metrics (word error rate in automatic speech recognition, or BLEU score in machine translation).
Can you discuss absolute rather than relative sizes? The problem with relative sizes is that it's not clear what they are anchored off of, and my experience is that people have enormous variations in what they consider "large".
With respect to MapReduce, the underlying distributed filesystem is essential to its function. So, I believe you need to replace "work well" with "work at all" in the statement above. Restated, I don't know of any broadly useful MapReduce implementations not implemented on top of a distributed file system.
MapReduce enables us to estimate models that have 10-100 billion n-grams, trained on 100 billion – 1 trillion data points/words: see http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/33278.pdf for MT, and http://research.google.com/pubs/pub37075.html for ASR/voice-search.
Yes, you are right, “work at all” is a better way to phrase it.
Thanks. There is a bit of a vocabulary gap from machine learning. From what I understand the MT paper deals with the largest scale dataset, and both approaches use Map-Reduce for counting.
Let me try to get the vocabulary straight. The MT paper has a ‘Web’ dataset which has 1.8T tokens. I think a ‘token’ is what I would call a sparse feature. In other words, 1.8T is within a constant factor of the size of the dataset. I’m unclear on the meaning of “LM size” as the text claims it’s the number of ngrams, but the table has a different entry for that.
Map-Reduce is good for counting, and counting has been an effective approach for language models. I'm not personally convinced that counting is the way to go in the long term, as my experience is that direct optimization approaches tend to yield smaller/more effective predictors where they can be done. The counter argument for MT is that there is not much MT labeled data compared to unlabeled, so a significant part of the training process must be unsupervised. I appreciate that, but would still pursue more efficient unsupervised approaches. With regard to supervised learning on similar scales of data, I'm skeptical that the counting approach (effectively, naive Bayes) yields high quality predictors.
Re: vocabulary mismatch:
A token is more than just one feature; it is one word as it occurs in the right-most position of nested n-grams, with n = 1 … N, where N is the order of the N-gram model being estimated. Each n-gram counts as a feature, and it comes with its own log-probability and a back-off weight (loosely speaking, the highest order N-grams do not have a back-off weight).
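A small worked example (made-up words, N = 3) of how each token position contributes one feature per n-gram order ending at it:

```python
# Illustration only: enumerate the n-gram features (n = 1..N) ending at each
# token position, which is what makes #features roughly #tokens * N.
def ngram_features(words, N=3):
    feats = []
    for i in range(len(words)):
        for n in range(1, N + 1):
            if i - n + 1 >= 0:
                feats.append(tuple(words[i - n + 1 : i + 1]))
    return feats

# "large scale learning" yields:
# ('large',), ('scale',), ('large', 'scale'),
# ('learning',), ('scale', 'learning'), ('large', 'scale', 'learning')
print(ngram_features(["large", "scale", "learning"]))
```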
Re: unsupervised:
Language modeling is for sure one of the foremost unsupervised estimation/modeling techniques. What one needs is text data to estimate the language model on; for MT that means text in the target language; for ASR that means text in the language of interest. You are probably thinking of the estimation for the channel models (acoustic model in ASR or translation/phrase model in MT); those do indeed require parallel data, but even that can be obtained in large amounts in a semi-supervised way from logs of deployed systems, e.g. ASR/voice-search.
Re: counting:
A bit more than just counting: LM estimation entails smoothing the relative frequency estimates (counts) of n-grams to cope with unseen test n-grams, e.g. Kneser-Ney or Katz (Good-Turing).
The gains we see from scaling up the amount of training data and the model size are quite substantial, and are far from easy to get by “smarter modeling” on the usual amounts of training data. Nobody says we should drop efforts to improve modeling/estimation, but I do not understand why many people prefer turning a blind eye to scaling up models with data?!
So the number of tokens is like the dataset size, but the number of tokens times N, where N is the highest n-gram order, is like the number of sparse features.
It looks like the backoff requires the estimation of 1 or N parameters in the MT paper. There is a bit of optimization required here, but I tend to think there is a phase transition in proper methodology as soon as the curse of dimensionality forbids a guess-and-check approach.
I'm certainly for using more data—I believe that's critical, and I've seen plenty of instances where it has been helpful. The question is: how? Counting approaches are easy, but again, I'm skeptical about their quality, particularly in supervised settings. The Mahout project has attempted a straightforward application of map-reduce to machine learning algorithms, sometimes with quite poor success. Elsewhere, I've been editing a book on parallel learning, where many different groups have tried many different methods. It's unclear what approach is best there. And personally, my belief is that an MPI-style allreduce dominates MapReduce.
Not sure what you mean by "guess and check". Smoothing is a generic term for techniques that need to discount the maximum likelihood estimates such that they generalize well to unseen data, and as such they are rigorous but only up to some point; sort of what ML people would call "regularization". They are most commonly estimated using leave-one-out or cross-validation of some sort.