John Langford – Page 8 – Machine Learning (Theory)

10/30/201510/30/2015

ICML 2016 in NYC and KDD Cup 2016

ICML 2016 is in New York City. I expect it to be the largest ICML by far given the destination—New York is the place which is perhaps easiest to reach from anywhere in the world and it has the largest machine learning meetup anywhere in the world.

I am the general chair this year, which is light in work but heavy in responsibilities. Some things I worry about:

How many people will actually come? Numbers are difficult to guess with the field growing and the conference changing locations. I believe we need capacity for at least 3000 people based on everything I know.
New York is expensive. What can be done about it? One thought is that we should actively setup a roommate finding system so the costs of hotels can be shared. Up to 3 people can share a hotel room for the conference hotel (yes, each with their own bed), and that makes the price much more reasonable. I’m also hoping donations will substantially defray the cost. If others have creative ideas, I’m definitely interested.

Markus Weimer also points out the 2016 KDD Cup which has a submission deadline of December 6. KDD Cup datasets have become common reference for many machine learning papers, so this is a good way to get your problem solved well by many people.

1/7/20151/8/2015

The NIPS experiment

Corinna Cortes and Neil Lawrence ran the NIPS experiment where 1/10th of papers submitted to NIPS went through the NIPS review process twice, and then the accept/reject decision was compared. This was a great experiment, so kudos to NIPS for being willing to do it and to Corinna & Neil for doing it.

The 26% disagreement rate presented at the conference understates the meaning in my opinion, given the 22% acceptance rate. The immediate implication is that between 1/2 and 2/3 of papers accepted at NIPS would have been rejected if reviewed a second time. For analysis details and discussion about that, see here.

Let’s give P(reject in 2nd review | accept 1st review) a name: arbitrariness. For NIPS 2014, arbitrariness was ~60%. Given such a stark number, the primary question is “what does it mean?”

Does it mean there is no signal in the accept/reject decision? Clearly not—a purely random decision would have arbitrariness of ~78%. It is however quite notable that 60% is much closer to 78% than 0%.

Does it mean that the NIPS accept/reject decision is unfair? Not necessarily. If a pure random number generator made the accept/reject decision, it would be ‘fair’ in the same sense that a lottery is fair, and have an arbitrariness of ~78%.

Does it mean that the NIPS accept/reject decision could be unfair? The numbers give no judgement here. It is however a natural fallacy to imagine that random judgements derived from people implies unfairness, so I would encourage people to withhold judgement on this question for now.

Is an arbitrariness of 0% the goal? Achieving 0% arbitrariness is easy: just choose all papers with an md5sum that ends in 00 (in binary). Clearly, there is something more to be desired from a reviewing process.

Perhaps this means we should decrease the acceptance rate? Maybe, but this makes sense only if you believe that arbitrariness is good, as it will almost surely increase the arbitrariness. In the extreme case where only one paper is accepted, the odds of it being the rejected on re-review are near 100%.

Perhaps this means we should increase the acceptance rate? If all papers submmitted were accepted, the arbitrariness would be 0, but as mentioned above arbitrariness 0 is not the goal.

Perhaps this means that NIPS is a very broad conference with substantial disagreement by reviewers (and attendees) about what is important? Maybe. This even seems plausible to me, given anecdotal personal experience. Perhaps small highly-focused conferences have a smaller arbitrariness?

Perhaps this means that researchers submit themselves to an arbitrary process for historical reasons? The arbitrariness is clear, but the reason less so. A mostly-arbitrary review process may be helpful in the sense that it gives authors a painful-but-useful opportunity to debug the easy ways to misinterpret their work. It may also be helpful in that it perfectly rejects the bottom 20% of papers which are actively wrong, and hence harmful to the process of developing knowledge. None of these reasons are confirmed of course.

Is it possible to do better? I believe the answer is “yes”, but it should be understood as a fundamentally difficult problem. Every program chair who cares tries to tweak the reviewing process to be better, and there have been many smart program chairs that tried hard. Why isn’t it better? There are strong nonvisible constraints on the reviewers time and attention.

What does it mean? In the end, I think it means two things of real importance.

The result of the process is mostly arbitrary. As an author, I found rejects of good papers very hard to swallow, especially when the reviews were nonsensical. Learning to accept that the process has a strong element of arbitrariness helped me deal with that. Now there is proof, so new authors need not be so discouraged.
CMT now has a tool for measuring arbitrariness that can be widely used by other conferences. Joelle and I changed ICML 2012 in various ways. Many of these appeared beneficial and some stuck, but others did not. In the long run, it’s the things which stick that matter. Being able to measure the review process in a more powerful way might be beneficial in getting good review practices to stick.

Other commentary from Lance, Bert, and Yisong.

Edit: Cross-posted on CACM.

12/6/2014

Vowpal Wabbit 7.8 at NIPS

I just created Vowpal Wabbit 7.8, and we are planning to have an increasingly less heretical followup tutorial during the non-“ski break” at the NIPS Optimization workshop. Please join us if interested.

I always feel like things are going slow, but in the last year, but there have been many changes overall. Notes for 7.7 are here. Since then, there are several areas of improvement as well as generalized bug fixes and refactoring.

Learning to Search: Hal completely rewrote the learning to search system, enough that the numbers here are looking obsolete. Kai-Wei has also created several advanced applications for entity-relation and dependency parsing which are promising.
Languages Hal also created a good python library, which includes call-backs for learning to search. You can now develop advanced structured prediction solutions in a nice language. Jonathan Morra also contributed an initial Java interface.
Exploration The contextual bandit subsystem now allows evaluation of an arbitrary policy, and an exploration library is now factored out into an independent library (principally by Luong with help from Sid and Sarah). This is critical for real applications because randomization must happen at the point of decision.
Reductions The learning reductions subsystem has continued to mature, although the perfectionist in me is still dissatisfied. As a consequence, it’s now pretty easy to program new reductions, and the efficiency of these reductions has generally improved. Several news ones are cooking.
Online Learning Alekh added an online SVM implementation based on LaSVM. This is known to parallelize well via the para-active approach.

This project has grown quite a bit—there are about 30 different people contributing to VW since the last release, and there is now a VW meetup (December 8th!) in the bay area that I wish I could attend.

11/28/2014

Allreduce (or MPI) vs. Parameter server approaches

In the last 7 years or so there has been quite a bit of work on parallel machine learning approaches, enough that I felt like a summary might be helpful both for myself and others. In each case, I put in the earliest known citation. If I missed something please comment.

One basic dividing line between parallel approaches is single-machine vs. multi-machine. Multi-machine approaches offer the potential for much greater improvements than single-machine approaches, but generally suffer from a lower bandwidth between components of the parallelized process.

Amongst single machine approaches, GPU-based learning is the dominant form of parallelism. For many algorithms, this can provide an easy 10x speedup, with the limits being programming (GPUs are special), the amount of GPU RAM (12GB for a K40), the bandwidth to the GPU interface, and your algorithms needing care as new architectures come out. I’m not sure who first started using GPUs for machine learning.

Another important characteristic of parallel approaches is deterministic vs. nondeterministic. When you run the same algorithm twice, do you always get the same result? Leon Bottou tells me that he thinks reproducibility is worth a factor of 2. I personally would rate it somewhat higher, just because debugging is such an intrinsic part of using machine learning algorithms and the debuggability of nondeterministic algorithms is greatly impaired.

MPI gradient aggregation (See here (2007).) Accumulate gradient statistics in parallel and use a good solver to find a good solution. There are two weaknesses here:
1. Batch solvers are slow compared to online gradient descent approaches, at least for the first pass.
2. Large datasets typically do not sit in MPI clusters. There are good reasons for this—MPI clusters are typically not designed for heavy data work.
Map-Reduce statistical query algorithms. The first paper (2007) of this sort was single machine, but it obviously applied to map-reduce clusters of the time starting the Mahout project. This addressed the second problem of the MPI approach, but not the first (batch algorithms are slow), and created a new problem (iteration and communication are slow in a map-reduce setting).
Parameter averaging. (see here (2010)). Use online learning algorithms and then average parameter values. This dealt with both of the drawbacks of the MPI approach as applied to convex learning, but is limited to convex(ish) approaches and may take a long time to converge on datasets where a second order optimization is needed. Iteration in a map-reduce paradigm remains awkward/slow in many cases.
Graph-based approaches. (see here (2010)). Learning algorithms that are represented by graphs can be partitioned across compute nodes and then operated on with parallel algorithms. This allows models larger than the state of a single machine. This addresses many learning algorithms that can be represented this way, but there are many that cannot be effectively represented this way as well.
Parameter server approaches. (see here (2010)). This is distinct from graph based approaches in that parameter storage and updating is broken out as a primary component. This allows models larger than the state of a single machine. Parameter server approaches require nondeterminism to be performant. There has been quite a bit of follow-up work on parameter server approaches including shockingly inefficient systems(2012) and more efficient systems(2014) although they remain less efficient than GPUs.
Allreduce approaches. (see here (2011)) Allreduce is an MPI-primitive which allows normal sequential code to work in parallel, implying very low programming overhead. This allows both parameter averaging, gradient aggregation, and iteration. The fundamental drawbacks are poor performance under misbalanced loads and difficulty with models that exceed working memory in size. A refined version of this approach has been used for speech recognition (2014).
GPU+MPI approaches. (see here (2013)) GPUs are good and MPI is good, so GPU+MPI should be good. It is good, although there are caveats related to the phenomenal amount of computation a GPU provides compared to the communication available, even with a good interconnect. See the speech recognition paper above for a remediation.

Most of these papers are about planting a flag rather than determining what the best approach to parallelization is. This makes determining how to parallelize learning algorithms rather unclear. My present approach remains case-based.

Don’t do it for the sake of parallelization. Have some other real goal in mind that justifies the effort. Parallelization is both simple and subtle which makes it unrewarding unless you really need it. Strongly consider the programming complexity of approaches if you decide to proceed.
If you are locked into a particular piece of hardware or cluster software, then you don’t have much choice—make the best of it. For some people this is an MPI cluster, Hadoop, or an SQL cluster.
If your data can easily be copied onto a single machine, then a GPU based approach seems like a good idea. GPU programming is nontrivial, but many people have done it at this point.
If your data is of a multimachine scale you must do some form of cluster parallelism.
1. Graph-based approaches can be the right answer when your graph is not too deeply interconnected.
2. Allreduce-based approaches appear effective and easy to use in many other cases. I wish every cluster came with an allreduce library.
  1. If you are parsing limited (i.e. for linear representations) then a CPU cluster is fine.
  2. If you are compute limited, then a cluster of GPUs seems the way to go.

The above leaves out parameter server approaches, which is controversial since a huge amount of effort has been invested in parameter server approaches and reasonable people can disagree. The caveats that matter for me are:

It might be that the right way to parallelize in a cluster has nothing to do with the right way to parallelize on a single machine, but this seems implausible.
Success/effort appears relatively low. It’s surprising that you can effectively compete with mature parameter server approaches on compute heavy tasks using a single GPU.

Note that I’m not claiming parameter servers are useless—I think they could be effective if applied in situations where you cannot prebalance the compute load of parallel elements. But the extent to which this applies in a datacenter computation seems both manageable and a flaw of the datacenter that will be reduced with time.

10/11/2014

Conference on Digitial Experimentation

I just attended CODE. The set of people interested in digital experimentation have very diverse backgrounds encompassing theory, machine learning, social science, economics, and industry so this seems like a good subject for a new conference. I hope it continues.

I found several talks interesting.

Eytan Bakshy talked about PlanOut which is language/platform for flexibly specifying experiments.
Ron Kohavi talked about EXP which is a heavily used A/B testing platform.
Susan Athey talked about long term vs short term metrics which seems both important to address, a constant problem, and not yet systematically solved.

There was a panel about the ongoing Facebook experimentation controversy. The issue here is complex. My understanding is that Facebook users have some expected ownership of the content they create, and hence aren’t comfortable with the content being used in unexpected ways. On the other hand, experimentation is so necessary to the functioning of all large modern internet sites that banning it or slowing down the process by a factor of a million (as some advocated) would badly degrade the future of these sites in practice.

My belief is that what’s lacking is education and trust. W.r.t. education, people need to understand that experimentation is unavoidable when trying to figure out how to optimize an enormously complex system, as there is just no other way to systematically make 1000 right decisions as is necessary for basic things like choosing the best homepage/search result/etc… W.r.t. trust, companies are not particularly good at creating trust in general, but finding the right mechanism for doing so seems critical. I would point out Vanguard as a company that managed to successfully create trust by design.