ICML 2016 in NYC and KDD Cup 2016

ICML 2016 is in New York City. I expect it to be the largest ICML by far given the destination—New York is the place which is perhaps easiest to reach from anywhere in the world and it has the largest machine learning meetup anywhere in the world.

I am the general chair this year, which is light in work but heavy in responsibilities. Some things I worry about:

  1. How many people will actually come? Numbers are difficult to guess with the field growing and the conference changing locations. I believe we need capacity for at least 3000 people based on everything I know.
  2. New York is expensive. What can be done about it? One thought is that we should actively set up a roommate-finding system so the costs of hotels can be shared. Up to 3 people can share a hotel room for the conference hotel (yes, each with their own bed), and that makes the price much more reasonable. I’m also hoping donations will substantially defray the cost. If others have creative ideas, I’m definitely interested.

Markus Weimer also points out the 2016 KDD Cup, which has a submission deadline of December 6. KDD Cup datasets have become a common reference for many machine learning papers, so this is a good way to get your problem solved well by many people.

Hadoop AllReduce and Terascale Learning

Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy.

Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single-machine learning algorithm. The proof of this is simple: We can output an optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation.

It is necessary but not sufficient to have an effective communication infrastructure. It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. It is necessary but not sufficient to have a good optimization approach. The combination says “yikes”, because you need to know many things to design an effective new system.

For communication infrastructures, the two most prevalent approaches are MPI and MapReduce, both of which have substantial drawbacks for machine learning with lots of data.

  1. MPI suffers because it has no fault tolerance by default and because it has a poor understanding of where data is, implying that data must be either manually placed on local nodes, or the first step in every computation is “partition the data across the cluster” which is very undesirable from a communication complexity and programming complexity standpoint. These significantly limit the scale that you can work at to ~100 nodes in practice, because the economics of clusters make sharing unavoidable at larger scales. When the cluster is shared, preshuffling the data is awkward to impossible and you must expect that some nodes will run slower than others because they will be executing other jobs. This limitation on reliability kicks in much sooner than disk read failures or node failures.
  2. MapReduce suffers because the setup and teardown costs are significant. Measured directly, this is often on the order of a minute, associated with interacting with the scheduler and communicating the program to a large number of nodes. But indirectly, this can be radically worse, as any map-reduce job can be held in limbo while waiting for free nodes to work on. And commonly we need to execute many MapReduce iterations to achieve high quality prediction.
    MapReduce has another more subtle flaw: using it requires refactoring your code into a sequence of map and reduce operations. This is significantly annoying, because writing good learning algorithms is pretty difficult in the first place. MapReduce has a third flaw: it encourages an inefficient optimization paradigm. In particular, while you can phrase many learning algorithms as statistical query learning algorithms, doing so is energy inefficient, up to O(examples) in extreme cases.

Since the drawbacks of MPI and MapReduce differ, we can try to create a solution which eliminates all of the drawbacks, which a Hadoop-compatible AllReduce does. Cherry picking from each we get:

  1. MPI: The Allreduce function. The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers.
  2. MapReduce: Conceptual simplicity. One easy to understand function is enough.
  3. MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
  4. MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
  5. MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
  6. MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
  7. MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be worked around, but that’s annoying.
  8. MapReduce: Robustness. We don’t capture all the robustness of MapReduce, which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures, giving you something with >99% success rate on a running time reliable to within a factor of 2.
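To make the AllReduce contract from item 1 concrete, here is a minimal single-process sketch. It only simulates the nodes as list entries, and the binary-tree reduce-then-broadcast layout is an assumption made for illustration, not something the list above specifies. The name `allreduce_sum` is hypothetical.

```python
def allreduce_sum(values):
    """Simulate AllReduce: values[i] is the number held by node i.

    Returns what each node holds afterwards: the sum of all numbers.
    The binary-tree layout (parent of node i is (i - 1) // 2) is an
    illustrative assumption.
    """
    n = len(values)
    if n == 0:
        return []
    partial = list(values)
    # Reduce phase: visit children before parents, adding each node's
    # partial sum into its parent until the root holds the total.
    for i in range(n - 1, 0, -1):
        partial[(i - 1) // 2] += partial[i]
    # Broadcast phase: pass the root's total back down the tree.
    result = [partial[0]] * n
    for i in range(1, n):
        result[i] = result[(i - 1) // 2]
    return result


print(allreduce_sum([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

The start state is n nodes each with a number, and the end state is every node holding the same total, which is all the calling code ever needs to know.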

One function (all_reduce) is not a programming language. But since it’s written in C, it is easily encapsulated and added to any existing programming language giving you a complete language. To test this hypothesis, I visited Clement for a day, where we connected things to make Allreduce work in Lua twice—once with an online approach and once with an LBFGS optimization approach for convolutional neural networks. As a parallel programming paradigm, it’s amazingly easier than many other approaches, because you take your existing code and figure out which pieces of state to synchronize. It’s superior enough that I’ve now eliminated the multithreaded and parallel online learning approaches within Vowpal Wabbit. This approach is also great in terms of the amount of incremental learning required—you just need to learn one function to be able to create useful parallel machine learning algorithms. The only thing easier than learning one function is learning none, which you can do for linear prediction by just using VW. Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine.
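As a rough illustration of "take your existing code and figure out which pieces of state to synchronize", here is a sketch of plain gradient descent for a linear predictor where the only parallel ingredient is one allreduce call on the gradient. Everything here is an assumption for illustration: the nodes are simulated in one process, and `allreduce`, `local_gradient`, and `train` are hypothetical names, not the VW or Lua code.

```python
def allreduce(vectors):
    """Stand-in for all_reduce: every node ends with the element-wise sum."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]


def local_gradient(weights, shard):
    """Squared-loss gradient of a linear predictor over one node's shard."""
    grad = [0.0] * len(weights)
    for features, label in shard:
        error = sum(w * x for w, x in zip(weights, features)) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def train(shards, dim, steps=200, rate=0.5):
    """Run the same single-machine loop on every simulated node."""
    n_examples = sum(len(shard) for shard in shards)
    weights = [[0.0] * dim for _ in shards]  # one model copy per node
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        grads = allreduce(grads)  # the only synchronization point
        for w, g in zip(weights, grads):
            for j in range(dim):
                w[j] -= rate * g[j] / n_examples
    return weights


# Two simulated nodes, each holding a shard of data with label = 2*x1 - x2.
shards = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],
    [([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)],
]
models = train(shards, dim=2)
print(models[0])
```

Because every node sees the same summed gradient, the model copies stay identical without any other coordination.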

You also need to get optimization approaches right. Two canonical but very different optimization algorithms are stochastic gradient descent and LBFGS. Understanding the weaknesses of these algorithms is critical even though often not discussed by their proponents. SGD approaches tend to have two drawbacks. The first is that finding the right choice of various hyperparameters can be annoying. We’ve mostly eliminated this drawback in VW using a learning rate that is tuned to automatically work in various ways. The other drawback is that they generally aren’t great at dealing with noise. This is tricky to deal with in general, because the algorithms only see one example at a time. Leon Bottou is working to eliminate this last drawback, but my impression is that we’re not quite there yet. LBFGS on the other hand is great at dealing with noise but suffers significantly in its early convergence rate where SGD is extremely effective. Again, we can combine these approaches in an obvious way: use online learning at the beginning to warmstart LBFGS to integrate out the noise. In practice, the online learning gets you 95%-99% of the way there and then LBFGS nails the last bit of performance.

For the problem I mentioned at the beginning, we can learn in about an hour using a kilonode cluster, implying an overall throughput of about 500 megafeatures/s, which is about a factor of 5 faster than any single network interface (1 gigabit/s). This is substantially greater scaling than any of the other algorithms in the Scaling up Machine Learning book (see here for a comparison).
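The throughput figure is simple arithmetic on the numbers already given, sketched below. The second function is only a back-of-envelope reading: it asks how many bytes per feature the stated factor of 5 would imply on a 1 gigabit/s link. That per-feature size is inferred from the text's own figures, not a measured number, and both function names are hypothetical.

```python
def throughput_features_per_s(total_features, seconds):
    """Overall features processed per second."""
    return total_features / seconds


def implied_bytes_per_feature(throughput, speedup, link_bits_per_s):
    """Bytes per feature at which one link carries throughput / speedup."""
    single_link_features_per_s = throughput / speedup
    return (link_bits_per_s / 8) / single_link_features_per_s


# 2 terafeatures in about an hour (3600 s), both taken from the text.
rate = throughput_features_per_s(2e12, 3600)
print(f"{rate / 1e6:.0f} megafeatures/s")  # 556, i.e. roughly 500

# Using the rounded 500 megafeatures/s, the factor of 5, and 1 gigabit/s.
print(implied_bytes_per_feature(5e8, 5, 1e9))  # 1.25
```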

The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within its scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:

  1. For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
  2. For smaller datasets with computationally intense models, GPU approaches seem very compelling.
  3. For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug.
  4. The model needs to fit into memory. If that’s not the case, then other approaches are required.

I also expect Hadoop Allreduce is useful across many more tasks than just machine learning. Optimization problems are an easy example, but I suspect there are a number of iterative computation problems where allreduce can be very effective. While it might appear a limited operation, you can easily do average, weighted average, max, etc… And, the scope of allreduce is also easily broadened with an arbitrary reduce function, as per MPI’s version. The Allreduce code itself is not yet native in Hadoop, so you’ll need to grab it from the VW source code which has a BSD license. I’ve been encouraged by discussions with Milind suggesting it may become native soon.
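Here is a small sketch of how average, weighted average, and max fall out of allreduce once the reduce function is a parameter. It simulates the nodes in one process. The function names and the `op` parameter are assumptions for illustration, not the VW interface.

```python
from functools import reduce


def allreduce(values, op):
    """Every node ends with op folded over all nodes' values."""
    total = reduce(op, values)
    return [total for _ in values]


def allreduce_average(values):
    """Average: sum-allreduce, then each node divides by the node count."""
    n = len(values)
    return [s / n for s in allreduce(values, lambda a, b: a + b)]


def allreduce_weighted_average(values, weights):
    """Weighted average: sum-allreduce (weight * value, weight) pairs."""
    pairs = [(w * v, w) for v, w in zip(values, weights)]
    summed = allreduce(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))
    return [num / den for num, den in summed]


print(allreduce([3, 9, 4], max))                           # [9, 9, 9]
print(allreduce_average([1.0, 2.0, 6.0]))                  # [3.0, 3.0, 3.0]
print(allreduce_weighted_average([1.0, 3.0], [1.0, 3.0]))  # [2.5, 2.5]
```

The weighted average needs nothing beyond a sum: each node contributes a numerator and a denominator, and every node divides locally afterwards.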

Update: CACM Crosspost.

Fall Machine Learning Events

Many Machine Learning related events are coming up this fall.

  1. September 9, abstracts for the New York Machine Learning Symposium are due. Send a 2 page pdf, if interested, and note that we:
    1. widened submissions to be from anybody rather than students.
    2. set aside a larger fraction of time for contributed submissions.
  2. September 15, there is a machine learning meetup, where I’ll be discussing terascale learning at AOL.
  3. September 16, there is a CS&Econ day at New York Academy of Sciences. This is not ML focused, but it’s easy to imagine interest.
  4. September 23 and later NIPS workshop submissions start coming due. As usual, there are too many good ones, so I won’t be able to attend all those that interest me. I do hope some workshop makers consider ICML this coming summer, as we are increasing to a 2 day format for you. Here are a few that interest me:
    1. Big Learning is about dealing with lots of data. Abstracts are due September 30.
    2. The Bayes Bandits workshop. Abstracts are due September 23.
    3. The Personalized Medicine workshop
    4. The Learning Semantics workshop. Abstracts are due September 26.
    5. The ML Relations workshop. Abstracts are due September 30.
    6. The Hierarchical Learning workshop. Challenge submissions are due October 17, and abstracts are due October 21.
    7. The Computational Tradeoffs workshop. Abstracts are due October 17.
    8. The Model Selection workshop. Abstracts are due September 24.
  5. October 16-17 is the Singularity Summit in New York. This is for the AIists and only peripherally about ML.
  6. October 16-21 is a Predictive Analytics World in New York. As machine learning goes industrial, we see industrial-style conferences rapidly developing.
  7. October 21, there is the New York ML Symposium. In addition to what’s there, Chris Wiggins is looking into setting up a session for startups and those interested in them to get to know each other, as last year.
  8. December 16-17 NIPS workshops in Granada, Spain.

The Heritage Health Prize

The Heritage Health Prize is potentially the largest prediction prize yet at $3M, which is sure to get many people interested. Several elements of the competition may be worth discussing.

  1. The most straightforward way for HPN to deploy this predictor is in determining who to cover with insurance. This might easily cover the costs of running the contest itself, but the value to the health system as a whole is minimal, as people not covered still exist. While HPN itself is a provider network, they have active relationships with a number of insurance companies, and the right to resell any entry. It’s worth keeping in mind that the research and development may nevertheless end up being useful in the longer term, especially as entrants also keep the right to their code.
  2. The judging metric is something I haven’t seen previously. If a patient has probability 0.5 of being in the hospital 0 days and probability 0.5 of being in the hospital ~53.6 days, the optimal prediction in expectation is ~6.4 days. This is evidence against point (1) above, since cost is probably closer to linear in the number of hospital days. As a starting point, I suspect many people will simply optimize conditional squared loss and then back out an inferred prediction according to p = e^x - 1, with clipping. The standard approach of ensembling should be effective.
  3. The team structure seems a bit strange to me. I’m not sure there is a good reason for it from a prediction point of view and 8 may be too hard a limit on team size, imposing bin packing problems on the entrants.
  4. Privacy is clearly a huge concern. They anonymized the data, require entrants to protect the data, and admonish people to not try to break privacy. Despite that, the data will be released to large numbers of people, so I wouldn’t be surprised if someone attempts a join attack of some sort. Whether or not a join attack succeeds could make a huge difference in how this contest is viewed in the long term.
  5. The Accuracy Threshold is a big deal. If they set it at an out-of-reach point (which they could easily do), the size of the prize becomes $0.5M. This part of the contest is supposed to be determined next month.
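The ~6.4 days figure in item 2 can be checked with a short sketch. It assumes the judging metric is squared error in log(1 + days), which is consistent with the numbers in the text but is an assumption here. The function names and the clipping bounds are hypothetical.

```python
import math


def expected_log_loss(prediction, outcomes):
    """Expected squared error in log(1 + days) over (probability, days) pairs."""
    return sum(
        p * (math.log1p(prediction) - math.log1p(days)) ** 2
        for p, days in outcomes
    )


def optimal_prediction(outcomes):
    """Minimizer of the expected loss: exp(mean of log(1 + days)) - 1."""
    mean_log = sum(p * math.log1p(days) for p, days in outcomes)
    return math.expm1(mean_log)


def back_out(x, low, high):
    """Map a prediction x made in log space back to days: e^x - 1, clipped."""
    return min(max(math.expm1(x), low), high)


outcomes = [(0.5, 0.0), (0.5, 53.6)]
best = optimal_prediction(outcomes)
print(round(best, 1))  # 6.4

# The mean number of days (26.8) does worse under this metric.
print(expected_log_loss(best, outcomes) < expected_log_loss(26.8, outcomes))  # True
```

Under this metric the best prediction sits far below the mean of 26.8 days, which is why the metric looks mismatched with a cost that is roughly linear in hospital days.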

This contest is not a slam-dunk, but it has the potential to become one, and I’ll be interested to see how it turns out.