The Large Scale Learning Survey Tutorial

Ron Bekkerman initiated an effort to create an edited book on parallel machine learning that Misha and I have been helping with. The breadth of efforts to parallelize machine learning surprised me: I was only aware of a small fraction initially.

This put us in a unique position, with knowledge of a wide array of different efforts, so it was natural to put together a survey tutorial on the subject of parallel learning for KDD tomorrow. This tutorial is not limited to the book itself, however, as several interesting new algorithms have come out since we started inviting chapters.

This tutorial should interest anyone trying to use machine learning on significant quantities of data, anyone interested in developing algorithms for that, and of course anyone who wants bragging rights to the fastest learning algorithm on planet earth 🙂

(Also note the Modeling with Hadoop tutorial just before ours which deals with one way of trying to speed up learning algorithms. We have almost no overlap.)

The Ideal Large Scale Learning Class

At NIPS, Andrew Ng asked me what should be in a large scale learning class. After some discussion with him and Nando and mulling it over a bit, these are the topics that I think should be covered.

There are many different kinds of scaling.

  1. Scaling in examples. This is the most basic kind of scaling.
    1. Online Gradient Descent. This is an old algorithm—I’m not sure anyone in particular can be credited with it. Perhaps the Perceptron is a good precursor, but substantial improvements come from the notion of a loss function, of which squared loss, logistic loss, hinge loss, and quantile loss are all worth covering. It’s important to cover the semantics of these loss functions as well. Vowpal Wabbit is a reasonably fast codebase implementing these. (A toy sketch of the update appears after this outline.)
    2. Second Order Gradient Descent methods. For some problems, methods taking second derivative information into account can be more effective. I’ve seen preconditioned conjugate gradient work well, for which Jonathan Shewchuk‘s writeup is reasonable. Nando likes L-BFGS, which I don’t have much experience with. (A brief usage sketch also follows the outline.)
    3. Map-Reduce. I have a love-hate relationship with the Map-Reduce framework. In my experience, it’s an excellent filesystem, but it’s quite frustrating to do machine learning with, since it encourages the parallelization of slow learning algorithms. I liked what Markus said at the LCCC workshop: nobody wants to give up on the idea of distributed fault-tolerant storage and moving small amounts of code to large amounts of data rather than vice versa. The best way to use this for machine learning isn’t yet clear to me. Hadoop is probably the most commonly used open source implementation of Map-Reduce.
  2. Feature Scaling—what do you do when you have very many features?
    1. Hashing approaches are surprisingly effective in my experience. It’s a good idea to also present Bloom filters, as they substantially help with the intuition of where hashing works. (A sketch follows the outline.)
    2. Online L1 regularization via truncated gradient. See Bob Carpenter’s discussion. John Duchi’s composite mirror descent generalization is also a useful general treatment. (A sketch follows the outline.)
    3. Boosting based approaches can also be effective, although training time can become problematic. This is partially mitigated by parallelization algorithms, as discussed at the LCCC workshop (see Jerry Ye’s talk and Krysta’s talk). A really good public implementation of this is so far missing, as far as I know.
  3. Test-time Evaluation. Ultrafast and efficient test-time evaluation seems to be a goal independent of training.
    1. Indices. One way to speed things up is with inverted indices. Aside from the basic data structure, I’d cover WAND and predictive indexing.
    2. GPU. The use of GPUs to make evaluation both faster and more efficient seems to make sense in many applications.
  4. Labels
    1. Sourcing. Just acquiring sufficient label information can be problematic.
      1. Mechanical Turk can be an efficient approach. The basic approach can be improved with some work.
      2. When you are paying directly for labels, active learning approaches can substantially cut your costs. Burr Settles’ active learning survey is pretty comprehensive, although if I were to cover one algorithm, it would be this one, which enjoys a compelling combination of strong theoretical guarantees, computational tractability, empirical performance, and generality.
      3. The other common approach is user-feedback information, where bias and exploration effects become a critical concern. The tutorial Alina and I did on learning and exploration is critical here.
    2. Many Labels. It’s common to need to make a complex prediction.
      1. Label Tree based approaches are a good starting point. I’d discuss the inconsistency of the naive approach and the Filter Tree, discussed here. Online tree building and conditional probability trees are also potentially extremely useful. Building smarter trees can help, such as with a confusion matrix or in an iterative fashion.
      2. Label Tree approaches break down when the number of labels becomes so large that filtering eliminates too many examples. Here Structured Prediction techniques become particularly important. I’d cover Searn as well as some of Drew Bagnell‘s work, such as this one. Many other people are still interested in CRFs or Max-Margin Markov Networks, which I find somewhat less compelling for computational reasons.
      3. Cascade Learning is also a compelling approach. The canonical paper on this is the Viola-Jones Face Detector. I’m sure there’s much other vision-related work on cascades that I’m unfamiliar with. A more recent instance is the structured prediction cascade.
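
Below are a few hedged sketches for items in the outline above. First, online gradient descent: a minimal toy implementation showing the loss-function semantics. This is my own illustration, not Vowpal Wabbit’s code; the sparse dict representation and the 1/sqrt(t) step size are choices made for brevity.

```python
import math

# Gradients of common losses with respect to the prediction p = w . x
def squared_grad(p, y):            # squared loss; y real-valued (mean regression)
    return p - y

def logistic_grad(p, y):           # logistic loss; y in {-1, +1}
    return -y / (1.0 + math.exp(y * p))

def hinge_grad(p, y):              # hinge loss; y in {-1, +1}
    return -y if y * p < 1 else 0.0

def quantile_grad(p, y, tau=0.5):  # quantile (pinball) loss; targets the tau-quantile
    return -tau if y > p else 1.0 - tau

def ogd(examples, grad, dim, eta=0.5):
    """Online gradient descent; each x is a sparse dict of index -> value."""
    w = [0.0] * dim
    for t, (x, y) in enumerate(examples, 1):
        p = sum(w[i] * v for i, v in x.items())  # predict
        g = grad(p, y)                           # derivative of the loss at p
        step = eta / math.sqrt(t)                # decaying learning rate
        for i, v in x.items():                   # sparse update
            w[i] -= step * g * v
    return w
```

The choice of gradient is exactly the choice of semantics: squared loss estimates the conditional mean, quantile loss a conditional quantile, and logistic/hinge target classification.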
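
For the second-order item, a sketch of calling L-BFGS through scipy’s standard interface (scipy.optimize.minimize with method="L-BFGS-B"); the batch logistic objective here is just a stand-in example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def loss_and_grad(w, X, y):
    """Mean logistic loss and its gradient; y in {-1, +1}."""
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))   # log(1 + exp(-margin)), computed stably
    grad = X.T @ (-y * expit(-margins)) / len(y)
    return loss, grad

X, y = np.random.randn(200, 10), np.sign(np.random.randn(200))
# jac=True tells scipy that loss_and_grad returns (loss, gradient)
result = minimize(loss_and_grad, np.zeros(10), args=(X, y),
                  jac=True, method="L-BFGS-B")
```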
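
For the hashing item, a sketch of the hashing trick; the bucket count and the use of crc32 are my choices. The Bloom filter intuition carries over: collisions conflate features, but with enough buckets the conflation rarely hurts a linear learner much.

```python
import zlib

def hash_features(named_features, num_buckets=2**20):
    """Map (feature name, value) pairs into a fixed-size sparse vector."""
    x = {}
    for name, value in named_features:
        i = zlib.crc32(name.encode("utf8")) % num_buckets
        x[i] = x.get(i, 0.0) + value  # colliding features simply add
    return x

x = hash_features([("word=parallel", 1.0), ("word=learning", 1.0)])
```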
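
For the truncated gradient item, a sketch of the simplest variant of online L1 regularization: shrink every step, only on the coordinates touched by the current example. The full algorithm is more careful (shrinking every K steps, up to a threshold), so treat this as intuition rather than a faithful implementation; it reuses the grad functions from the first sketch.

```python
def truncate(w, amount):
    """Shrink w toward zero by `amount`, stopping at zero (soft threshold)."""
    return max(0.0, w - amount) if w > 0 else min(0.0, w + amount)

def ogd_l1(examples, grad, dim, eta=0.1, l1=0.01):
    w = [0.0] * dim
    for x, y in examples:                        # x is a sparse dict
        p = sum(w[i] * v for i, v in x.items())
        g = grad(p, y)
        for i, v in x.items():                   # gradient step, then shrink
            w[i] = truncate(w[i] - eta * g * v, eta * l1)
    return w
```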

What else is essential and missing?

The Other Online Learning

If you search for “online learning” with any major search engine, it’s interesting to note that zero of the results are for online machine learning. This may not be a mistake if you are committed to a global ordering. In other words, the number of people specifically interested in the least interesting top-10 online human learning result might exceed the number of people interested in online machine learning, even given the presence of the other 9 results. The essential observation here is that the process of human learning is a big business (around 5% of GDP) affecting virtually everyone.

The internet is changing this dramatically, by altering the economics of teaching. Consider two possibilities:

  1. The classroom-style teaching environment continues as is, with many teachers for the same subject.
  2. All the teachers for one subject get together, along with perhaps a factor of 2 more people who are experts in online delivery. They spend a factor of 4 more time designing the perfect lecture & learning environment as verified by extensive study.

These two approaches have a similar economic cost, with the additional effort in the second approach being offset by the fact that it is a one-time effort rather than an annual effort.

I’m sure many people prefer the classroom approach, because it’s traditional, because a teacher can adjust dynamically and intelligently to the student, and because a teacher provides ancillary benefits such as day care and child abuse detection. Nevertheless, the second approach represents a compelling alternative for education. For classes commonly taught through high school, it’s difficult to imagine how good a learning experience could be after millions of hours have been spent refining it toward the perfect approach. Imagine repeating a lecture over and over, testing the resulting student understanding a {day, week, month, year, decade} later, to such an extent that every slide, every sentence, and every exercise is optimized for excellent learning. We could even imagine adapting the lecture to the learning style of each student.

The process of converting to the second approach has been slow, but it seems to be picking up. This suggests we can expect several things:

  1. Shakeout. Like all new approaches, there is room for early adopters to win while the established old order suffers. We can expect the most severe impact on pure teaching institutions which do not adopt the newer approaches. Research universities will be insulated in two ways: much of their revenue comes from research grants anyway, while the new approach creates a flight to excellence, to which the research universities can lay some claim. At one extreme, I understand that only 4-5% of the operating budget for Caltech comes from student tuition.
  2. Centralized Testing. Although class lessons can be taught at a distance, and exercises worked out by students, there is great room for cheating. The remedy for this is a strong centralized testing service. This already exists in the form of SAT, GRE, and AP tests, because grade inflation and nonuniform standards are common across schools. If a student can ace these tests after taking online learning classes, then there is a real sense in which colleges accepting students are satisfied by their qualifications. We can expect this to become more true, and perhaps to see more employer-oriented tests. We can also expect that testable subjects have an inherent advantage in online learning. As centralized testing is a difficult market to break into, the existing systems have a substantial advantage here.
  3. Digitization. Doing online learning brings all the advantages and disadvantages of any other digital media. These include perfect replicability, essentially free distribution, and difficult economics—on one hand the approach could be vastly valuable while on the other it’s difficult to charge someone for something they can get free. The economics imply that there is room for a major charity or state government to accomplish a great deal which might be difficult to accomplish in a business model.
  4. Gaps. There are areas of teaching which are not amenable to online instruction. For example, teaching people to do research remains in the apprentice model. Similarly, letters of recommendation remain an aspect of the apprentice model. Subjects of relatively small interest such as individual research directions may not merit the effort of a highly polished online instruction system. Similarly, many elements of our current education system are not related to formal education, but rather are about students meeting students, teachers acting as daycare for students, or simply structuring the day for learning. Mechanisms achieving the same ends with online human learning systems are necessary, and the conflation of goals represented by the traditional education approach will retard (but not stop) the adoption of online learning approaches. This process has already taken a decade, and we can expect more decades to come.

For those of us interested in online machine learning, it’s natural to question the relationship with online human learning. The practices differ entirely, but the theory still applies, as there are no clauses in the theorem statements of the form “if the learning agent is not a human then…” When you examine the theorem statements for applicability to online human learning, there are a few ideas which may transfer well. One of these is the necessity and techniques for handling exploration problems. If there are two ways to teach a subject, then you could simply try both and take the best. But if your resources are limited then a UCB approach provides a more efficient mechanism for doing this testing. Similarly, if a student has a set of known attributes, contextual-bandit approaches suggest a sound mechanism for personalization of lessons.
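
To make the UCB suggestion concrete, here is a minimal UCB1 sketch for splitting a limited testing budget between two ways of teaching a subject. Everything here (the names, the reward being a normalized quiz score in [0,1], and the simulated success rates) is my own illustration.

```python
import math, random

def ucb1(arms, horizon):
    """Try each arm once, then pick the arm maximizing mean + exploration bonus."""
    counts, sums = [0] * len(arms), [0.0] * len(arms)
    for t in range(1, horizon + 1):
        if t <= len(arms):
            a = t - 1
        else:
            a = max(range(len(arms)),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        sums[a] += arms[a]()  # observe one student's outcome
        counts[a] += 1
    return counts

# two hypothetical lesson variants with unknown success rates
lessons = [lambda: float(random.random() < 0.6),
           lambda: float(random.random() < 0.7)]
print(ucb1(lessons, 1000))  # most of the budget concentrates on the better lesson
```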

Much of our other theory about the process of online learning may be helpful in a heuristic-motivating manner, but it typically appears too pessimistic to accurately capture what is possible. For example, a common technique to explain an idea when teaching is to simply cover a few extreme cases from which all others are some interpolation. The closest common machine learning analogue to this is some active learning algorithms, where a learning algorithm chooses which examples to label. But, of course, this is not an accurate model, because it’s not the student but rather the teacher who is choosing the examples. A setting more suitable for student and teacher has been studied in learning theory (see the bibliography here for a link into the citation tree). However, these results are typically rather brittle, so it’s not clear yet that we have understood the right way to formalize this process.

Many Ways to Learn This Summer

There are at least 3 summer schools related to machine learning this summer.

  1. The first is at University of Chicago June 1-11 organized by Misha Belkin, Partha Niyogi, and Steve Smale. Registration is closed for this one, meaning they met their capacity limit. The format is essentially an extended Tutorial/Workshop. I was particularly interested to see Valiant amongst the speakers. I’m also presenting Saturday June 6, on logarithmic time prediction.
  2. Praveen Srinivasan points out the second, at Peking University in Beijing, China, July 20-27. This one differs substantially, as it is about vision, machine learning, and their intersection. The deadline for applications is June 10 or 15. This is another example of the growth of research in China, with active support from the NSF.
  3. The third is at Cambridge, England, August 29-September 10. It’s in the MLSS series. Compared to the Chicago one, this one is more about the Bayesian side of ML, although effort has been made to create a good cross-section of topics. It’s also more focused on tutorials than on workshop-style talks.

Wielding a New Abstraction

This post is partly meant as an advertisement for the reductions tutorial Alina, Bianca, and I are planning to do at ICML. Please come, if you are interested.

Many research programs can be thought of as finding and building new useful abstractions. The running example I’ll use is learning reductions, where I have experience. The basic abstraction here is that we can build a learning algorithm capable of solving classification problems up to a small expected regret. This is used repeatedly to solve more complex problems.

In working on a new abstraction, I think you typically run into many substantial problems of understanding, which make publishing particularly difficult.

  1. It is difficult to seriously discuss the reason behind or mechanism for abstraction in a conference paper with small page limits. People rarely see such discussions and hence have little basis on which to think about new abstractions. Another difficulty is that when building an abstraction, you often don’t know the right way to state things.

    Here’s my current attempt: The process of abstraction for learning reductions can start with sample complexity bounds (or online learning against an adversary analysis). A very simple sample complexity bound is that for all sets of hypotheses H, for all distributions D on examples (x,y), and for all confidence parameters δ:

    $$\Pr_{(x,y)^m \sim D^m}\Big(\forall h \in H:\ |e(h,D) - e(h,(x,y)^m)| < \sqrt{\ln(|H|/\delta)/m}\Big) > 1 - \delta$$

    Here (x,y)^m is a sequence of m IID samples, e(h,D) is the error rate of h on D, and e(h,(x,y)^m) is the empirical error rate of h on the set of IID samples.

    The previous bound is a very simple example, and yet remarkably complex both to state and to interpret—many people have been lost by the meaning of δ. The impact of this complexity is that it is difficult to effectively use these bounds in practical learning algorithm design, particularly in solving more complex learning problems where much more than one bit of prediction is required. This was a central frustration that I ran into in my thesis work. Some progress has been made since then, but it is still quite difficult. The abstraction in the learning reduction setting is:

    1. You throw away δ, because it only has a logarithmic dependence anyway.
    2. You eliminate H and m, on the theory that intelligent choices for H and m are made in practice.
    3. You eliminate the IID assumption, because it is no longer needed to define things.

    The statement is then:

    $$e(A((x,y)^m), D) - e(h^*, D) < \epsilon$$

    where A((x,y)^m) is the hypothesis output by the learning algorithm, h^* is the best possible predictor, and ε parameterizes the theorems. This abstraction is radical in some sense, but something radical was needed to yield tractable and useful analysis on the complex problems people need to solve in practice.

  2. A consequence of lack of familiarity is that people often misread. In reading a paper, there is a temptation to not read carefully and fill in your understanding of things. Most of the time this works out well, but not here. For example, we saw many instances where people inserted IID sample assumptions or other things that simply weren’t there.
  3. Once you get past the lack of familiarity and misunderstandings, there is a feeling that the new abstraction is cheating. To some extent I understand, as I remember learning about abstractions in class, and I remember feeling that they were in some real sense cheating by dropping important details. For example:
    1. Big-O notation provides an upper bound specified up to constants. For example, O(log n) computational complexity means there exists a constant c such that the number of operations required is less than c log n. Big-O can be abused by hiding “constants” larger than the plausible values for the parameters. In machine learning, a particularly egregious case occurs in bandit analysis, where the punchline of some papers is “logarithmic regret”, hiding an arbitrarily large problem-dependent constant.
    2. TCP provides a mechanism for reliable transport over an unreliable network. It is a very commonly used mechanism for sending information over the internet—you used TCP in reading this. TCP is both a programming construct and a mechanism for abstracting communication over a network. The TCP abstraction breaks when the network is too unreliable for it to recover, such as on flaky wireless networks, where a program built against the TCP abstraction fails because the promised reliability isn’t delivered.
    3. Dimensional analysis is a technique for quick analysis in physics. The basic idea is to just look at the units when estimating some quantity and combine them to get the right unit for the answer. For example, to compute the distance d traveled after time t with acceleration a, you simply use at^2, since that is the only way to combine a, with units of distance/time^2, and t, with units of time, to get units of distance. This answer is off by a factor of 2 from what a more detailed analysis using integration yields (d = at^2/2), which is typical. Dimensional analysis can be misleading when the constants are very large. One example is in Gravitation, where there is a table with time and distance equated, since they are related by a constant: the speed of light, 3×10^8 m/s. For example, E=mc^2 becomes E=m.

    Although the above breakages are real, the usefulness of these abstractions, in terms of allowing us to think quickly and make decisions, more than offsets the drawbacks. Indeed, even the breakages stated above are thought-provoking or useful enough that I can’t even say it is wrong to consider them. This property that abstractions can be abused is generically essential to the process of abstraction itself. Abstraction is about neglecting details, and when these details are not neglectable, the abstraction is abused or ineffective. Because of this, any abstraction is insufficient for analyzing and solving real problems where the neglected details matter.

    Just as for these abstractions, the learning reduction abstraction can be abused—the chosen learning algorithm can be pathetic, yielding vacuous bounds, or the reduction can scramble the feature information with an encryption algorithm, making it so no reasonable learning algorithm could yield anything other than pathetic performance. Similarly, there are situations in which I don’t know how to effectively use a learning reduction to build a learning algorithm, and it seems implausible that this observation changes as more is learned in the future.

For a good abstraction, the drawbacks are matched by the advantages. The principal advantage is that there is a new way to examine and solve problems. This has several interesting effects.

  1. A good abstraction can capture a more complete specification of the problem. As an example, the sample complexity view of learning is broken in practice, because when insufficient performance is achieved, people choose a different set of hypotheses H by throwing in additional features or choosing a different learning algorithm. Capturing this process in the sample complexity view requires an additional level of complexity. In the reduction view, this is entirely natural, because any means of achieving better generalization (more/better features, a better learning algorithm, a better prior, sticking a human in the learning process, etc.) are legitimate. This is particularly powerful when architecting solutions, providing a partial answer to the “What?” question Yehuda pointed out.
  2. A higher level abstraction can let you accidentally solve problems in other areas as well. A good example of this is error correcting tournaments which are useful for tournament design to select the best player/team/paper in real tournaments. Recently, I was amused to learn that a standard betting procedure for basketball tournaments exactly mirrors the importance weights suggested for the final elimination of ECTs. The first phase of ECTs provides a sound and practical method to seed a final elimination tournament, eliminating the need for (and biases of) a committee.
  3. Perhaps the most interesting effect is that the new abstraction can aid you in finding effective solutions to new problems. For learning reductions, there are about 3 compelling instances I’ve seen so far.
    1. Given training-time access to a good policy oracle, Searn provides a method for decomposing any complex prediction problem into simple problems, such that low regret solutions to the simple problems imply a low regret solution to the original problem. While Searn competes well (computationally and prediction-wise) with existing methods for linear chain style structured prediction, it really shines on more complex problems. Hal used Searn for automatic document summarization (see section 6.2) which previously wasn’t really solved via ML. More generally, when I learn about the details of other complex prediction systems for machine translation or vision, the base algorithms are tweaked, typically in ways that Searn would suggest. This suggests that Searn formalizes and automates the intuitions of practical people.
    2. The “one step RL” reduction in Bianca‘s thesis (page 119) provided tractable and effective approaches to learning in partial feedback problems where only the loss of a chosen label is learned. An even simpler reduction exists as a matter of folklore—estimate the value of each label and then take an argmax. However, we have found classification approaches generally work better where applicable, just as the theory suggests.
    3. Many commonly used algorithms for prediction have a running time linear (or worse) in the number of labels, with decision trees being a notable exception. While simply predicting faster isn’t normally solving a “new problem”, an exponential improvement in computational time seems to merit this description because it allows entirely new kinds of applications. It turns out that it is both very easy to do logarithmic time prediction wrong, and that this problem is often fixable. Furthermore, it appears logarithmic time prediction can really work in practice over very many labels. (A toy sketch follows this list.)
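
As a toy illustration of logarithmic-time prediction with a label tree (structure only; real filter/conditional probability trees train the node classifiers carefully to control regret, which this sketch omits):

```python
class Node:
    """A binary label-tree node: internal nodes hold a classifier, leaves a label."""
    def __init__(self, label=None, left=None, right=None, classifier=None):
        self.label, self.left, self.right = label, left, right
        self.classifier = classifier

def predict(root, x):
    """Walk root-to-leaf, one binary decision per level: O(log K)
    classifier evaluations for K labels, versus O(K) for one-against-all."""
    node = root
    while node.label is None:
        node = node.right if node.classifier(x) > 0 else node.left
    return node.label

# a toy 4-label tree; two classifier calls pick among 4 labels
leaves = [Node(label=k) for k in range(4)]
root = Node(classifier=lambda x: x[0] + x[1],
            left=Node(classifier=lambda x: x[0], left=leaves[0], right=leaves[1]),
            right=Node(classifier=lambda x: x[1], left=leaves[2], right=leaves[3]))
print(predict(root, [1.0, -1.0]))  # -> 1
```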

When we started working on learning reductions, I had no idea what either the difficulties or rewards were going to be—it simply seemed like a natural and compelling direction of investigation. Given the substantial difficulties encountered, it’s not at all clear that this pursuit was personally worthwhile. It has cost much time which could have been put to good use in other ways.

On the other hand, the advantages are also substantial. I’ve learned something about architecting solutions to problems, both expanding the domain of application for the field and providing a personal edge that I can bring to many conversations about ML. It’s also progress towards the AI goal, which interests me. When I think of what I could have worked on instead to achieve these goals, I don’t have any more compelling answer yet. Learning reductions seem to have accomplished more per unit thought than any other theoretical approach I can identify over the last 5 or 6 years. Furthermore, they are composable by design, so they should stay relevant (and perhaps even become more so) when people use an online active deep semisupervised probabilistic convolutional algorithm to solve a problem, particularly for complex problems.

As I said at the beginning, please join us for the tutorial, if you are interested.