Vowpal Wabbit 8.5.0 & NIPS tutorial

Yesterday, I tagged VW version 8.5.0 which has many interactive learning improvements (both contextual bandit and active learning), better support for sparse models, and a new baseline reduction which I’m considering making a part of the default update rule.

If you want to know the details, we’ll be doing a mini-tutorial during the Friday lunch break at the Extreme Classification workshop at NIPS. Please join us if interested.

Edit: also announced at the Learning Systems workshop

The Multiworld Testing Decision Service

We made a tool that you can use. It is the first general purpose reinforcement-based learning system 🙂

Reinforcement learning is much discussed these days with successes like AlphaGo. Wouldn’t it be great if Reinforcement Learning algorithms could easily be used to solve all reinforcement learning problems? But there is a well-known problem: It’s very easy to create natural RL problems for which all standard RL algorithms (epsilon-greedy Q-learning, SARSA, etc…) fail catastrophically. That’s a serious limitation which both inspires research and which I suspect many people need to learn the hard way.

Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.

Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”

A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to action in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research. The key design principles are:

  1. Contextual Bandits. Many people have tried to create online learning system that do not take into account the biasing effects of decisions. These fail near-universally. For example they might be very good at predicting what was shown (and hence clicked on) rather that what should be shown to generate the most interest.
  2. Data Lifecycle support. This system supports the entire process of data collection, joining, learning, and deployment. Doing this eliminates many stupid-but-killer bugs that I’ve seen in practice.
  3. Modularity. The system decomposes into pieces: exploration library, client library, online learner, join server, etc… because I’ve seen to many cases where the pieces are useful but the system is not.
  4. Reproducibility. Everything is logged in a fashion which makes online behavior offline reproducible. Consequently, the system is debuggable and hence improvable.

The system we’ve created is open source with system components in mwt-ds and the core learning algorithms in Vowpal Wabbit. If you use everything it enables a fully automatic causally sound learning loop for contextual control of a small number of actions. This is strongly scalable, for example a version of this is in use for personalized news on MSN. It can be either low-latency (with a client side library) or cross platform (with a JSON REST web interface). Advanced exploration algorithms are available to enable better exploration strategies than simple epsilon-greedy baselines. The system autodeploys into a chosen Azure account with a baseline cost of about $0.20/hour. The autodeployment takes a few minutes after which you can test or use the system as desired.

This system is open source and there are many ways for people to help if they are interested. For example, support for the client-side library in more languages, support of other learning algorithms & systems, better documentation, etc… are all obviously useful.

Have fun.

Randomized experimentation

One good thing about doing machine learning at present is that people actually use it! The back-ends of many systems we interact with on a daily basis are driven by machine learning. In most such systems, as users interact with the system, it is natural for the system designer to wish to optimize the models under the hood over time, in a way that improves the user experience. To ground the discussion a bit, let us consider the example of an online portal, that is trying to present interesting news stories to its user. A user comes to the portal and based on whatever information the portal has on the user, it recommends one (or more) news stories. The user chooses to read the story or not and life goes on. Naturally the portal wants to better tailor the stories it displays to the users’ taste over time, which can be observed if users start to click on the displayed story more often.

A natural idea would be to use the past logs and train a machine learning model which prefers the stories that users click on and discourages the stories which are avoided by the users. This sounds like a simple classification problem, for which we might use an off-the-shelf algorithm. This is indeed done reasonably often, and the offline logs suggest that the newly trained model will result in a lot more clicks than the old one. The new model is deployed, only to find out its performance is not as good as hoped, or even poorer than what was happening before! What went wrong? The natural reaction is typically that (a) the machine learning algorithm needs to be improved, or (b) we need better features, or (c) we need more data. Alas, in most of these cases, the right answer is (d) none of the above. Let us see why this is true through a simple example.

Imagine a simple world where some of our users are from New York and others are from Seattle. Some of our news stories pertain to finance, and others pertain to technology. Let us further imagine that the probability of a click (henceforth CTR for clickthrough rate) on a news article based on city and subject has the following distribution:


Finance CTR

Tech CTR

New York 1 0.6
Seattle 0.4 0.79

Table1: True (unobserved) CTRs

Of course, we do not have this information ahead of time while designing the system, so our starting system recommends articles according to some heuristic rule. Imagine that we user the rule:

  • New York users get Tech stories, Seattle users get Finance stories.

Now we collect the click data according to this system for a while. As we obtain more and more data, we obtain increasingly accurate estimates of the CTR for Tech stories and NY users, as well as Finance stories and Seattle users (0.6 and 0.4 resp.). However, we have no information on the other two combinations. So if we train a machine learning algorithm to minimize the squared loss between predicted CTR on an article and observed CTR, it is likely to predict the average of observed CTRs (that is 0.5) in the other two blocks. At this point, our guess looks like:



Finance CTR

Tech CTR

New York 1 / ? / 0.5 0.6 / 0.6 / 0.6
Seattle 0.4 / 0.4 / 0.4 0.79 / ? / 0.5

Table2: True / observed / estimated CTRs

Note that this would be the case even with infinite data and an all powerful learner, so machine learning is not to be faulted in any way here. Given these estimates, we naturally realize that show finance articles to Seattle users was a mistake, and switch to Tech. But Tech is also looking pretty good in NY, and we stick with it. Our new policy is:

  • Both NY and Seattle users get Tech articles.

Running the new system for a while, we will fix the erroneous estimates for the Tech CTR on Seattle (that is, up 0.5 to 0.79). But we still have no signal that makes us prefer Finance over Tech in NY. Indeed even with infinite data, the system will be stuck with this suboptimal choice at this point, and our CTR estimates will look something like:


Finance CTR

Tech CTR

New York 1 / ? / 0.59 0.6 / 0.6 / 0.6
Seattle 0.4 / 0.4 / 0.4 0.79 / 0.79 / 0.79

Table3: True / observed / estimated CTRs

We can now assess the earlier claims:

  1. More data does not help: Since Observed and True CTRs match wherever we are collecting data
  2. Better learning algorithm does not help: Since Predicted and Observed CTRs coincide wherever we are collecting data
  3. Better data does help!! We should not be having the blank cell in observed column.

This seems simple enough to fix though. We should have really known better than to completely omit observations in one cell of our table. With good intentions, we decide to collect data in all cells. We choose to use the following rule:

  • Seattle users get Tech stories during day and finance stories during night
  • Similarly, NY users get Tech stories during day and finance stories during night

We are now collecting data on each cell, but we find that our estimates still lead us to a suboptimal policy. Further investigation might reveal that users are more likely to read finance stories during the day when the markets are open. So when we only display finance stories during night, we underestimate the finance CTR and end up with wrong estimates. Realizing the error of our ways, we might try to fix this again and then run into another problem and so on.

The issue we have discovered above is that of confounding variables. There is lot of wonderful work and many techniques that can be used to circumvent confounding variables in experimentation. Here, I mention the simplest one and perhaps the most versatile one of them: Randomization. The idea is that instead of recommending stories to users according to a fix deterministic rule, we allow for different articles to be presented to the user according to some distribution. This distribution does not have to be uniform. In fact, good randomization would likely focus on plausibly good articles so as to not degrade the user experience. However, as long as we add sufficient randomization, we can then obtain consistent counterfactual estimates of quantities from our experimental data. There is growing literature on how to do this well. A nice paper which covers some of these techniques and provides an empirical evaluation is http://arxiv.org/abs/1103.4601. A more involved example in the context of computational advertising at Microsoft is discussed in http://leon.bottou.org/papers/bottou-jmlr-2013.



User preferences for search engines

I want to comment on the “Bing copies Google” discussion here, here, and here, because there are data-related issues which the general public may not understand, and some of the framing seems substantially misleading to me.

As a not-distant-outsider, let me mention the sources of bias I may have. I work at Yahoo!, which has started using Bing. This might predispose me towards Bing, but on the other hand I’m still at Yahoo!, and have been using Linux exclusively as an OS for many years, including even a couple minor kernel patches. And, on the gripping hand, I’ve spent quite a bit of time thinking about the basic principles of incorporating user feedback in machine learning. Also note, this post is not related to official Yahoo! policy, it’s just my personal view.

The issue Google engineers inserted synthetic responses to synthetic queries on google.com, then executed the synthetic searches on google.com using Internet Explorer with the Bing toolbar and later noticed some synthetic responses from Bing with the synthetic queries.

There are two kinds of disagreement which people might have with this.

One is the privacy disagreementBig Brother Microsoft is looking at what I search and using it”. I’m sympathetic on this count, but also sympathetic to the counter argument, that the data collected has value and can enhance the results for all users. In the end, I think companies should simply do their best to accept a user’s wishes, so those who want privacy can have it, and those who want to contribute their data towards improving a search engine can do so. The precise manner for achieving this by opt-in, opt-out, differential privacy, anonymization or other techniques is not entirely clear to me.

Let’s assume the privacy issue is dealt with. This is at least partly and possibly grossly untrue, but I want to focus on the other issue, and this assumption simplifies it’s discussion because a user and their internet browser are synonymous when the privacy issue is dealt with, as the agent’s actions are a true reflection of the user’s preferences.

The other issue is an originality disagreement, which much of the discussion focuses on. What I believe happened was a user feedback process, where users queried Google, clicked on a result, informed Microsoft/Bing of the query and clicked result, and their preference was used to promote the search result within Bing. Now, there is a slippery-slope of questions. Should a user be allowed to:

  1. Reveal to their chosen search engine their preferred result?
  2. Reveal to a competitor’s search engine their preferred result?

If you answer ‘no’ to the first, you are deeply against user freedom in a manner I can’t sympathize with. If you answer ‘yes’ to the first, and ‘no’ to the second, then you are still somewhat against user freedom. This isn’t too crazy a stance, as various people sell information and require of their users that it not be retransmitted. One of the more famous examples of this is the Bloomberg Terminal. However, in all instances I’m aware of, users knowingly agree to a contract providing access to the information with limitations. Google never entered into such a contract with it’s users, and I don’t know a sound basis for even an implicit contract. So, my answer are “yes, and yes” here.

But this doesn’t entirely deal with the issue of originality. You could argue that it’s ok for Microsoft to take advantage of revealed user interaction, but it’s still a matter of following rather than leading. This argument is simplistic and wrong, as I expect all informed parties already understand. A basic truth seen in many ways, is that the proper incorporation of new sources of information always improves results. This is true in machine learning where sample complexity results and cotraining formalize mechanisms and values of incorporating additional information, and it was heavily used by all competitive teams in the Netflix Competition. More generally, it’s true in basic knowledge engineering, where people fuse sources of information to create a better system, and I’m virtually certain it’s true of the ranking algorithms behind Google and Bing, which are surely complex beasts taking into account many sources of information. I know no details about the algorithm which Microsoft is using, but it’s quite plausible that they incorporated this information well enough to improve the quality of their results, perhaps in some instances so they are better than Google’s or the earlier version of Bing’s. If that’s the case, Google will either follow Microsoft’s lead taking into account user feedback as Microsoft does, or risk becoming obsolete.

We can also think about things in terms of the future. A basic truth, is that building a successful search engine is extraordinarily difficult. This is revealed by search market share, but also by simply thinking about the logistics involved. You need to crawl the web, have server farms all over the world (because the speed of light just isn’t fast enough), and incorporate many sources of information in just the right way in order to succeed, all while adversaries try to corrupt your results. If we prefer a future where there is a healthy competition amongst search engines, then it’s important to lower these barriers to entry so new people with new ideas can more easily test them out. One way to lower the barrier to entry is to accept that users can share their interaction, even with a competitor’s search engine.

Perhaps it’s inevitable that Amit Singhal has a viewpoint driving towards a monopoly on internet search. However, Google has generally been relatively good about supporting a rich ecosystem of innovation for information technology development, so I am still somewhat surprised. I would be more sympathetic to a position for allowing users of Internet Explorer a built-in means to choose to share their search behavior with Google or other search engines on an equal footing.

The Good News on Exploration and Learning

Consider the contextual bandit setting where, repeatedly:

  1. A context x is observed.
  2. An action a is taken given the context x.
  3. A reward r is observed, dependent on x and a.

Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward.

This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions.

A decade ago the best machine learning techniques for this setting where implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and butter supervised learning problems. Just as for supervised learning, we know how to create and reuse datasets, how to benchmark algorithms, how to reuse existing supervised learning algorithms in this setting, and how to achieve optimal rates of learning quantitatively similar to supervised learning.

This is also one of the times when understanding the basic theory can make a huge difference in your success. There are many wrong ways to attack contextual bandit problems or prepare datasets, and taking a wrong turn can easily mean the difference between failure and success. Understanding how contextual bandit problems differ from basic supervised learning problems is critical to routine success here.

All of the above is not meant to claim that everything is done research-wise here so we’ll try to outline where the current boundary of research lies as best we can. However, we are surely at a point both in terms of application demand (especially for internet applications of search, advertising, page optimization, but also medical applications and surely others) and methodology supply (with basic reliable techniques now easily available or created) where these techniques are shifting from theory esoterica to required education.

Given the above, Alina and I decided to prepare a tutorial to be given at Yahoo! Labs summer school (my first India trip!), ICML, KDD, and hopefully videolectures.net. Please join us. The subjects we plan to cover are essentially the keys to the kingdom of solving shallow interactive learning problems.