Vowpal Wabbit 7.8 at NIPS

I just created Vowpal Wabbit 7.8, and we are planning to have an increasingly less heretical followup tutorial during the non-“ski break” at the NIPS Optimization workshop. Please join us if interested.

I always feel like things are going slow, but in the last year, but there have been many changes overall. Notes for 7.7 are here. Since then, there are several areas of improvement as well as generalized bug fixes and refactoring.

  1. Learning to Search: Hal completely rewrote the learning to search system, enough that the numbers here are looking obsolete. Kai-Wei has also created several advanced applications for entity-relation and dependency parsing which are promising.
  2. Languages Hal also created a good python library, which includes call-backs for learning to search. You can now develop advanced structured prediction solutions in a nice language. Jonathan Morra also contributed an initial Java interface.
  3. Exploration The contextual bandit subsystem now allows evaluation of an arbitrary policy, and an exploration library is now factored out into an independent library (principally by Luong with help from Sid and Sarah). This is critical for real applications because randomization must happen at the point of decision.
  4. Reductions The learning reductions subsystem has continued to mature, although the perfectionist in me is still dissatisfied. As a consequence, it’s now pretty easy to program new reductions, and the efficiency of these reductions has generally improved. Several news ones are cooking.
  5. Online Learning Alekh added an online SVM implementation based on LaSVM. This is known to parallelize well via the para-active approach.

This project has grown quite a bit—there are about 30 different people contributing to VW since the last release, and there is now a VW meetup (December 8th!) in the bay area that I wish I could attend.

Allreduce (or MPI) vs. Parameter server approaches

In the last 7 years or so there has been quite a bit of work on parallel machine learning approaches, enough that I felt like a summary might be helpful both for myself and others. In each case, I put in the earliest known citation. If I missed something please comment.

One basic dividing line between parallel approaches is single-machine vs. multi-machine. Multi-machine approaches offer the potential for much greater improvements than single-machine approaches, but generally suffer from a lower bandwidth between components of the parallelized process.

Amongst single machine approaches, GPU-based learning is the dominant form of parallelism. For many algorithms, this can provide an easy 10x speedup, with the limits being programming (GPUs are special), the amount of GPU RAM (12GB for a K40), the bandwidth to the GPU interface, and your algorithms needing care as new architectures come out. I’m not sure who first started using GPUs for machine learning.

Another important characteristic of parallel approaches is deterministic vs. nondeterministic. When you run the same algorithm twice, do you always get the same result? Leon Bottou tells me that he thinks reproducibility is worth a factor of 2. I personally would rate it somewhat higher, just because debugging is such an intrinsic part of using machine learning algorithms and the debuggability of nondeterministic algorithms is greatly impaired.

  1. MPI gradient aggregation (See here (2007).) Accumulate gradient statistics in parallel and use a good solver to find a good solution. There are two weaknesses here:
    1. Batch solvers are slow compared to online gradient descent approaches, at least for the first pass.
    2. Large datasets typically do not sit in MPI clusters. There are good reasons for this—MPI clusters are typically not designed for heavy data work.
  2. Map-Reduce statistical query algorithms. The first paper (2007) of this sort was single machine, but it obviously applied to map-reduce clusters of the time starting the Mahout project. This addressed the second problem of the MPI approach, but not the first (batch algorithms are slow), and created a new problem (iteration and communication are slow in a map-reduce setting).
  3. Parameter averaging. (see here (2010)). Use online learning algorithms and then average parameter values. This dealt with both of the drawbacks of the MPI approach as applied to convex learning, but is limited to convex(ish) approaches and may take a long time to converge on datasets where a second order optimization is needed. Iteration in a map-reduce paradigm remains awkward/slow in many cases.
  4. Graph-based approaches. (see here (2010)). Learning algorithms that are represented by graphs can be partitioned across compute nodes and then operated on with parallel algorithms. This allows models larger than the state of a single machine. This addresses many learning algorithms that can be represented this way, but there are many that cannot be effectively represented this way as well.
  5. Parameter server approaches. (see here (2010)). This is distinct from graph based approaches in that parameter storage and updating is broken out as a primary component. This allows models larger than the state of a single machine. Parameter server approaches require nondeterminism to be performant. There has been quite a bit of follow-up work on parameter server approaches including shockingly inefficient systems(2012) and more efficient systems(2014) although they remain less efficient than GPUs.
  6. Allreduce approaches. (see here (2011)) Allreduce is an MPI-primitive which allows normal sequential code to work in parallel, implying very low programming overhead. This allows both parameter averaging, gradient aggregation, and iteration. The fundamental drawbacks are poor performance under misbalanced loads and difficulty with models that exceed working memory in size. A refined version of this approach has been used for speech recognition (2014).
  7. GPU+MPI approaches. (see here (2013)) GPUs are good and MPI is good, so GPU+MPI should be good. It is good, although there are caveats related to the phenomenal amount of computation a GPU provides compared to the communication available, even with a good interconnect. See the speech recognition paper above for a remediation.

Most of these papers are about planting a flag rather than determining what the best approach to parallelization is. This makes determining how to parallelize learning algorithms rather unclear. My present approach remains case-based.

  1. Don’t do it for the sake of parallelization. Have some other real goal in mind that justifies the effort. Parallelization is both simple and subtle which makes it unrewarding unless you really need it. Strongly consider the programming complexity of approaches if you decide to proceed.
  2. If you are locked into a particular piece of hardware or cluster software, then you don’t have much choice—make the best of it. For some people this is an MPI cluster, Hadoop, or an SQL cluster.
  3. If your data can easily be copied onto a single machine, then a GPU based approach seems like a good idea. GPU programming is nontrivial, but many people have done it at this point.
  4. If your data is of a multimachine scale you must do some form of cluster parallelism.
    1. Graph-based approaches can be the right answer when your graph is not too deeply interconnected.
    2. Allreduce-based approaches appear effective and easy to use in many other cases. I wish every cluster came with an allreduce library.
      1. If you are parsing limited (i.e. for linear representations) then a CPU cluster is fine.
      2. If you are compute limited, then a cluster of GPUs seems the way to go.

The above leaves out parameter server approaches, which is controversial since a huge amount of effort has been invested in parameter server approaches and reasonable people can disagree. The caveats that matter for me are:

  1. It might be that the right way to parallelize in a cluster has nothing to do with the right way to parallelize on a single machine, but this seems implausible.
  2. Success/effort appears relatively low. It’s surprising that you can effectively compete with mature parameter server approaches on compute heavy tasks using a single GPU.

Note that I’m not claiming parameter servers are useless—I think they could be effective if applied in situations where you cannot prebalance the compute load of parallel elements. But the extent to which this applies in a datacenter computation seems both manageable and a flaw of the datacenter that will be reduced with time.

Conference on Digitial Experimentation

I just attended CODE. The set of people interested in digital experimentation have very diverse backgrounds encompassing theory, machine learning, social science, economics, and industry so this seems like a good subject for a new conference. I hope it continues.

I found several talks interesting.

  • Eytan Bakshy talked about PlanOut which is language/platform for flexibly specifying experiments.
  • Ron Kohavi talked about EXP which is a heavily used A/B testing platform.
  • Susan Athey talked about long term vs short term metrics which seems both important to address, a constant problem, and not yet systematically solved.

There was a panel about the ongoing Facebook experimentation controversy. The issue here is complex. My understanding is that Facebook users have some expected ownership of the content they create, and hence aren’t comfortable with the content being used in unexpected ways. On the other hand, experimentation is so necessary to the functioning of all large modern internet sites that banning it or slowing down the process by a factor of a million (as some advocated) would badly degrade the future of these sites in practice.

My belief is that what’s lacking is education and trust. W.r.t. education, people need to understand that experimentation is unavoidable when trying to figure out how to optimize an enormously complex system, as there is just no other way to systematically make 1000 right decisions as is necessary for basic things like choosing the best homepage/search result/etc… W.r.t. trust, companies are not particularly good at creating trust in general, but finding the right mechanism for doing so seems critical. I would point out Vanguard as a company that managed to successfully create trust by design.

The perfect candidate

The last several years have seen a phenomonal growth in machine learning, such that this earlier post from 2007 is understated. Machine learning jobs aren’t just growing on trees, they are growing everywhere. The core dynamic is a digitizing world, which makes people who know how to use data effectively a very hot commodity. In the present state, anyone reasonably familiar with some machine learning tools and a master’s level of education can get a good job at many companies while Phd students coming out sometimes have bidding wars and many professors have created startups.

Despite this, hiring in good research positions can be challenging. A good research position is one where you can:

  • Spend the majority of your time working on research questions that interest.
  • Work with other like-minded people.
  • For several years.

I see these as critical—research is hard enough that you cannot expect to succeed without devoting the majority of your time. You cannot hope to succeed without personal interest. Other like-minded people are typically necessary in finding the solutions of the hardest problems. And, typically you must work for several years before seeing significant success. There are exceptions to everything, but these criteria are the working norm of successful research I see.

The set of good research positions is expanding, but at a much slower pace than the many applied scientist types of positions. This makes good sense as the pool of people able to do interesting research grows only slowly, and anyone funding this should think quite hard before making the necessary expensive commitment for success.

But, with the above said, what makes a good candidate for a research position? People have many diverse preferences, so I can only speak for myself with any authority. There are several things I do and don’t look for.

  1. Something new. Any good candidate should have something worth teaching. For a phd candidate, the subject of your research is deeply dependent on your advisor. It is not necessary that you do something different from your advisor’s research direction, but it is necessary that you own (and can speak authoritatively) about a significant advance.
  2. Something other than papers. It is quite possible to persist indefinitely in academia while only writing papers, but it does not show a real interest in what you are doing beyond survival. Why are you doing it? What is the purpose? Some people code. Some people solve particular applications. There are other things as well, but these make the difference.
  3. A difficult long-term goal. A goal suggests interest, but more importantly it makes research accumulate. Some people do research without a goal, solving whatever problems happen to pass by that they can solve. Very smart people can do well in research careers with a random walk amongst research problems. But people with a goal can have their research accumulate in a much stronger fashion than a random walk through research problems. I’m not an extremist here—solving off goal problems is fine and desirable, but having a long-term goal makes a long-term difference.
  4. A portfolio of coauthors. This shows that you are the sort of person able to and interested in working with other people, as is very often necessary for success. This can be particularly difficult for some phd candidates whose advisors expect them to work exclusively with (or for) them. Summer internships are both a strong tradition and a great opportunity here.
  5. I rarely trust recommendations, because I find them very difficult to interpret. When the candidate selects the writers, the most interesting bit is who the writers are. Letters default positive, but the degree of default varies from writer to writer. Occasionally, a recommendation says something surprising, but do you trust the recommender’s judgement? In some cases yes, but in many cases you do not know the writer.

Meeting the above criteria within the context of a phd is extraordinarily difficult. The good news is that you can “fail” with a job that is better in just about every way 🙂

Anytime criteria are discussed, it’s worth asking: should you optimize for them? In another context, Lines of code is a terrible metric to optimize when judging programmer productivity. Here, I believe optimizing for (1), (2), (3), and (4) are all beneficial and worthwhile for phd students.