The Multiworld Testing Decision Service

We made a tool that you can use. It is the first general purpose reinforcement-based learning system 🙂

Reinforcement learning is much discussed these days with successes like AlphaGo. Wouldn’t it be great if Reinforcement Learning algorithms could easily be used to solve all reinforcement learning problems? But there is a well-known problem: It’s very easy to create natural RL problems for which all standard RL algorithms (epsilon-greedy Q-learning, SARSA, etc…) fail catastrophically. That’s a serious limitation which both inspires research and which I suspect many people need to learn the hard way.

Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.

Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”

A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to action in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research. The key design principles are:

  1. Contextual Bandits. Many people have tried to create online learning system that do not take into account the biasing effects of decisions. These fail near-universally. For example they might be very good at predicting what was shown (and hence clicked on) rather that what should be shown to generate the most interest.
  2. Data Lifecycle support. This system supports the entire process of data collection, joining, learning, and deployment. Doing this eliminates many stupid-but-killer bugs that I’ve seen in practice.
  3. Modularity. The system decomposes into pieces: exploration library, client library, online learner, join server, etc… because I’ve seen to many cases where the pieces are useful but the system is not.
  4. Reproducibility. Everything is logged in a fashion which makes online behavior offline reproducible. Consequently, the system is debuggable and hence improvable.

The system we’ve created is open source with system components in mwt-ds and the core learning algorithms in Vowpal Wabbit. If you use everything it enables a fully automatic causally sound learning loop for contextual control of a small number of actions. This is strongly scalable, for example a version of this is in use for personalized news on MSN. It can be either low-latency (with a client side library) or cross platform (with a JSON REST web interface). Advanced exploration algorithms are available to enable better exploration strategies than simple epsilon-greedy baselines. The system autodeploys into a chosen Azure account with a baseline cost of about $0.20/hour. The autodeployment takes a few minutes after which you can test or use the system as desired.

This system is open source and there are many ways for people to help if they are interested. For example, support for the client-side library in more languages, support of other learning algorithms & systems, better documentation, etc… are all obviously useful.

Have fun.

Remote large scale learning class participation

Yann and I have arranged so that people who are interested in our large scale machine learning class and not able to attend in person can follow along via two methods.

  1. Videos will be posted with about a 1 day delay on techtalks. This is a side-by-side capture of video+slides from Weyond.
  2. We are experimenting with Piazza as a discussion forum. Anyone is welcome to subscribe to Piazza and ask questions there, where I will be monitoring things. update2: Sign up here.

The first lecture is up now, including the revised version of the slides which fixes a few typos and rounds out references.

Berkeley Streaming Data Workshop

The From Data to Knowledge workshop May 7-11 at Berkeley should be of interest to the many people encountering streaming data in different disciplines. It’s run by a group of astronomers who encounter streaming data all the time. I met Josh Bloom recently and he is broadly interested in a workshop covering all aspects of Machine Learning on streaming data. The hope here is that techniques developed in one area turn out useful in another which seems quite plausible. Particularly if you are in the bay area, consider checking it out.

Hadoop AllReduce and Terascale Learning

Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy.

Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output a optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation.

It is necessary but not sufficient to have an effective communication infrastructure. It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. It is necessary but not sufficient to have a good optimization approach. The combination says “yikes”, because you need to know many things to design an effective new system.

For communication infrastructures, the two most prevalent approaches are MPI and MapReduce, both of which have substantial drawbacks for machine learning with lots of data.

  1. MPI suffers because it has no fault tolerance by default and because it has a poor understanding of where data is, implying that data must be either manually placed on local nodes, or the first step in every computation is “partition the data across the cluster” which is very undesirable from a communication complexity and programming complexity standpoint. These significantly limit the scale that you can work at to ~100 nodes in practice, because the economics of clusters make sharing unavoidable at larger scales. When the cluster is shared, preshuffling the data is awkward to impossible and you must expect that some nodes will run slower than others because they will be executing other jobs. This limitation on reliability kicks in much sooner than disk read failures or node failures.
  2. MapReduce suffers because the setup and teardown costs are significant. Measured directly, this is often on the order of a minute, associated with interacting with the scheduler and communicating the program to a large number of nodes. But indirectly, this can be radically worse, as any map-reduce job can be held in limbo while waiting for free nodes to work on. And commonly we need to execute many MapReduce iterations to achieve high quality prediction.
    MapReduce has another more subtle flaw: using it requires refactoring your code into a sequence of map and reduce operations. This is significantly annoying, because right good learning algorithms is pretty difficult in the first place. MapReduce has a third flaw: it encourages inefficient optimization paradigm. In particular, while you can phrase many learning algorithms as statistical query learning algorithms, doing so is energy inefficient, up to O(examples) in extreme cases.

Since the drawbacks of MPI and MapReduce differ, we can try to create a solution which eliminates all of drawbacks, which a Hadoop-compatible AllReduce does. Cherry picking from each we get:

  1. MPI: The Allreduce function. The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers.
  2. MapReduce: Conceptual simplicity. One easy to understand function is enough.
  3. MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
  4. MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
  5. MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
  6. MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
  7. MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be walked around, but that’s annoying.
  8. MapReduce: Robustness. We don’t captures all the robustness of MapReduce which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures giving you something with >99% success rate on a running time reliable to within a factor of 2.

One function (all_reduce) is not a programming language. But since it’s written in C, it is easily encapsulated and added to any existing programming language giving you a complete language. To test this hypothesis, I visited Clement for a day, where we connected things to make Allreduce work in Lua twice—once with an online approach and once with an LBFGS optimization approach for convolutional neural networks. As a parallel programming paradigm, it’s amazingly easier than many other approaches, because you take your existing code and figure out which pieces of state to synchronize. It’s superior enough that I’ve now eliminated the multithreaded and parallel online learning approaches within Vowpal Wabbit. This approach is also great in terms of the amount of incremental learning required—you just need to learn one function to be able to create useful parallel machine learning algorithms. The only thing easier than learning one function is learning none, which you can do for linear prediction by just using VW. Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine.

You also need to get optimization approaches right. Two canonical but very different optimization algorithms are stochastic gradient descent and LBFGS. Understanding the weaknesses of these algorithm is critical even though often not discussed by their proponents. SGD approaches tend to have two drawbacks: the right choice of various hyperparameters can be annoying. We’ve mostly eliminated this drawback in VW using a learning rate that is tuned to automatically work in various ways. The other drawback is that they generally aren’t great at dealing with noise. This is tricky to deal with in general, because the algorithms only see one example at a time. Leon Bottou is working to eliminate this last drawback, but my impression is that we’re not quite there yet. LBFGS on the other hand is great at dealing with noise but suffers significantly in it’s early convergence rate where SGD is extremely effective. Again, we can combine these approaches in an obvious way: use online learning at the beginning to warmstart LBFGS to integrate out the noise. In practice, the online learning gets you 95%-99% of the way there and then LBFGS nails the last bit of performance.

For the problem I mentioned at the beginning, we can learn in about an hour using a kilonode, implying an overall throughput of 500 megafeatures/s, which is about a factor of 5 faster than any single network interface (1 gigabit/s). This is substantially greater scaling than any of the other algorithms in the Scaling up Machine Learning book (see here for a comparison).

The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within it’s scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:

  1. For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
  2. For smaller datasets with computationally intense models, GPU approaches seem very compelling.
  3. For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug.
  4. The model needs to fit into memory. If that’s not the case, then other approaches are required.

I also expect Hadoop Allreduce is useful across many more tasks than just machine learning. Optimization problems are an easy example, but I suspect there are a number of iterative computation problems where allreduce can be very effective. While it might appear a limited operation, you can easily do average, weighted average, max, etc… And, the scope of allreduce is also easily broadened with an arbitrary reduce function, as per MPI’s version. The Allreduce code itself is not yet native in Hadoop, so you’ll need to grab it from the VW source code which has a BSD license. I’ve been encouraged by discussions with Milind suggesting it may become native soon.

Update: CACM Crosspost.