Randomized experimentation

One good thing about doing machine learning at present is that people actually use it! The back-ends of many systems we interact with on a daily basis are driven by machine learning. In most such systems, as users interact with the system, it is natural for the system designer to want to optimize the models under the hood over time, in a way that improves the user experience. To ground the discussion a bit, let us consider the example of an online portal that is trying to present interesting news stories to its users. A user comes to the portal and, based on whatever information the portal has on the user, it recommends one (or more) news stories. The user chooses whether to read the story, and life goes on. Naturally, the portal wants to better tailor the stories it displays to users' tastes over time, progress which can be observed if users start to click on the displayed stories more often.

A natural idea would be to use the past logs and train a machine learning model which prefers the stories that users click on and discourages the stories that users avoid. This sounds like a simple classification problem, for which we might use an off-the-shelf algorithm. This is indeed done reasonably often, and the offline logs suggest that the newly trained model will result in many more clicks than the old one. The new model is deployed, only for us to find that its performance is not as good as hoped, or even worse than before! What went wrong? The natural reaction is typically that (a) the machine learning algorithm needs to be improved, or (b) we need better features, or (c) we need more data. Alas, in most of these cases, the right answer is (d) none of the above. Let us see why through a simple example.

Imagine a simple world where some of our users are from New York and others are from Seattle. Some of our news stories pertain to finance, and others pertain to technology. Let us further imagine that the probability of a click (henceforth CTR for clickthrough rate) on a news article based on city and subject has the following distribution:


              Finance CTR   Tech CTR
  New York        1           0.6
  Seattle         0.4         0.79

Table 1: True (unobserved) CTRs

Of course, we do not have this information ahead of time while designing the system, so our starting system recommends articles according to some heuristic rule. Imagine that we use the rule:

  • New York users get Tech stories, Seattle users get Finance stories.

Now we collect the click data under this system for a while. As we obtain more and more data, we obtain increasingly accurate estimates of the CTR for Tech stories shown to NY users and for Finance stories shown to Seattle users (0.6 and 0.4 respectively). However, we have no information on the other two combinations. So if we train a machine learning algorithm to minimize the squared loss between the predicted CTR on an article and the observed CTR, it is likely to predict the average of the observed CTRs (that is, 0.5) in the other two cells. At this point, our guess looks like:



              Finance CTR       Tech CTR
  New York    1 / ? / 0.5       0.6 / 0.6 / 0.6
  Seattle     0.4 / 0.4 / 0.4   0.79 / ? / 0.5

Table 2: True / observed / estimated CTRs
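The mean fill-in behaviour described above can be made concrete with a small sketch. The CTRs are the illustrative numbers from the tables, and the mean fill-in stands in for what a squared-loss learner with no usable signal in the empty cells would predict:

```python
# Sketch: what a squared-loss learner sees under the deterministic policy
# "NY -> Tech, Seattle -> Finance". Numbers are the made-up CTRs from Table 1.

true_ctr = {("NY", "Tech"): 0.6, ("NY", "Finance"): 1.0,
            ("Seattle", "Tech"): 0.79, ("Seattle", "Finance"): 0.4}

policy = {"NY": "Tech", "Seattle": "Finance"}

# Logged data only covers the (city, article) pairs the policy ever shows.
observed = {(city, art): ctr for (city, art), ctr in true_ctr.items()
            if policy[city] == art}

# For cells with no data, the squared-loss minimizer with no distinguishing
# signal is the mean of the observed CTRs (here 0.5).
fallback = sum(observed.values()) / len(observed)

estimates = {cell: observed.get(cell, fallback) for cell in true_ctr}
# estimates[("NY", "Finance")] and estimates[("Seattle", "Tech")] are both
# 0.5, matching the estimated column of Table 2.
```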

Note that this would be the case even with infinite data and an all-powerful learner, so machine learning is not to be faulted in any way here. Given these estimates, we naturally realize that showing Finance articles to Seattle users was a mistake, and switch to Tech. But Tech is also looking pretty good in NY, so we stick with it there. Our new policy is:

  • Both NY and Seattle users get Tech articles.

Running the new system for a while, we will fix the erroneous estimate of the Tech CTR in Seattle (that is, from 0.5 up to 0.79). But we still have no signal that would make us prefer Finance over Tech in NY. Indeed, even with infinite data, the system will remain stuck with this suboptimal choice, and our CTR estimates will look something like:


              Finance CTR       Tech CTR
  New York    1 / ? / 0.59      0.6 / 0.6 / 0.6
  Seattle    0.4 / 0.4 / 0.4    0.79 / 0.79 / 0.79

Table 3: True / observed / estimated CTRs

We can now assess the earlier claims:

  1. More data does not help: the observed and true CTRs already match wherever we are collecting data.
  2. A better learning algorithm does not help: the predicted and observed CTRs coincide wherever we are collecting data.
  3. Better data does help!! We should not have the blank cells in the observed column.

This seems simple enough to fix, though. We really should have known better than to completely omit observations in some cells of our table. With good intentions, we decide to collect data in all cells. We choose to use the following rule:

  • Seattle users get Tech stories during the day and Finance stories at night
  • Similarly, NY users get Tech stories during the day and Finance stories at night

We are now collecting data in every cell, but we find that our estimates still lead us to a suboptimal policy. Further investigation might reveal that users are more likely to read Finance stories during the day, when the markets are open. So when we display Finance stories only at night, we underestimate the Finance CTR and end up with wrong estimates. Realizing the error of our ways, we might try to fix this again, only to run into another problem, and so on.
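A toy calculation makes the bias concrete. The day/night Finance CTRs below are made up; only their ordering (day higher than night, since markets are open during the day) matters:

```python
# Hypothetical illustration of the time-of-day confounder.
finance_ctr = {"day": 0.5, "night": 0.3}   # made-up numbers, day > night
p_day = 0.5                                 # assume half the traffic is daytime

# Overall Finance CTR if Finance articles were shown at all times:
true_avg = p_day * finance_ctr["day"] + (1 - p_day) * finance_ctr["night"]

# But under the rule "Finance stories at night only", every logged
# Finance impression happens at night, so we observe only:
observed = finance_ctr["night"]

# true_avg is 0.4 while observed is 0.3: the logs systematically
# underestimate Finance because time of day confounds the comparison.
```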

The issue we have discovered above is that of confounding variables. There is a lot of wonderful work and many techniques that can be used to circumvent confounding variables in experimentation. Here, I mention the simplest and perhaps most versatile of them: randomization. The idea is that instead of recommending stories to users according to a fixed deterministic rule, we allow different articles to be presented to the user according to some distribution. This distribution does not have to be uniform; in fact, good randomization would likely focus on plausibly good articles so as not to degrade the user experience. However, as long as we add sufficient randomization, we can then obtain consistent counterfactual estimates of quantities of interest from our experimental data. There is a growing literature on how to do this well. A nice paper which covers some of these techniques and provides an empirical evaluation is http://arxiv.org/abs/1103.4601. A more involved example in the context of computational advertising at Microsoft is discussed in http://leon.bottou.org/papers/bottou-jmlr-2013.
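As one concrete instance, here is a sketch of the inverse-propensity-score (IPS) estimator, one of the standard counterfactual estimators covered in the references above. The logging setup is hypothetical, and the CTRs are the made-up numbers from the tables:

```python
import random

random.seed(0)
true_ctr = {("NY", "Finance"): 1.0, ("NY", "Tech"): 0.6,
            ("Seattle", "Finance"): 0.4, ("Seattle", "Tech"): 0.79}

# Randomized logging: each user sees a uniformly random article,
# and we record the propensity with which it was shown.
logs = []
for _ in range(100_000):
    city = random.choice(["NY", "Seattle"])
    article = random.choice(["Finance", "Tech"])
    propensity = 0.5                      # P(article | city) under logging
    click = 1 if random.random() < true_ctr[(city, article)] else 0
    logs.append((city, article, click, propensity))

def ips_value(policy, logs):
    """Estimate the CTR `policy` would achieve, from the randomized logs alone."""
    total = 0.0
    for city, article, click, prop in logs:
        if policy(city) == article:       # importance weighting
            total += click / prop
    return total / len(logs)

# Counterfactually evaluate "show everyone Tech" without deploying it.
est = ips_value(lambda city: "Tech", logs)
# est should be close to the true value (0.6 + 0.79) / 2 = 0.695
```

Because the logging distribution puts probability on every (city, article) cell, the same logs can be reused to evaluate any candidate policy, which is exactly what the deterministic rules earlier in the post could not support.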



12 Replies to “Randomized experimentation”

  1. I think you have a typo in this sentence: “The old model is deployed, only to find out its performance is not as good as hoped [..].” Should “old” be “new”?

  2. Nice article.

    A couple of questions:

    1) In the initial case where we have no data, isn't it better to assume that the CTRs for both Tech and Finance articles would be the same, rather than to only show Tech articles to NYC users and Finance articles to Seattle users?
    2) About randomization: if there is an actual economic impact from these CTRs, can we give a reason why randomization would lead to better CTRs in the long run over just showing the class with the maximum CTR?

    1. These are both reasonable questions.

      1) Yes, you should ideally start by showing both types of articles in both cities. Yet ensuring that alone is not enough, as the last part of the post illustrates. If you decide when to display Tech versus Finance articles based on some rule rather than randomly, then you can still have confounding variables such as time of day. Having a default starting rule is fine, as long as you allow randomization with some non-trivial probability, so that you gather the right data to improve your models in the longer term.

      2) In essence, the benefit of randomization can range from nothing to infinite, depending on how good or bad your starting point is. Good randomization strategies try to ensure that you only explore amongst actions that are likely to be good in a given context, so that the negative impact of randomization can be controlled. There is a funny paradox in your question, though: how do you even know what the best action with the highest CTR is to begin with? Discovering this requires randomization. Since the real world is non-stationary, what is good today might not be good tomorrow, and hence doing the randomization once and then turning it off forever is not enough. There are new webpages, new news articles, new ads, new trends that pop up and that you need to learn about on the go.
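The "default rule plus non-trivial randomization" idea from this reply can be sketched as an epsilon-greedy chooser that also records propensities for later counterfactual evaluation. This is a hypothetical sketch, not the recipe from the post:

```python
import random

EPSILON = 0.1                 # hypothetical exploration rate
ARTICLES = ["Finance", "Tech"]

def choose_article(city, default_rule, rng=random):
    """Follow the default rule most of the time, explore occasionally,
    and return the propensity of the chosen article for later IPS use."""
    default = default_rule(city)
    if rng.random() < EPSILON:
        article = rng.choice(ARTICLES)    # explore: uniform over articles
    else:
        article = default                 # exploit: follow the default rule
    propensity = EPSILON / len(ARTICLES)  # chance via the explore branch
    if article == default:
        propensity += 1 - EPSILON         # plus the exploit branch
    return article, propensity
```

Under this scheme every (city, article) cell receives data with probability at least EPSILON / 2 per impression, so the blank cells in the tables above never arise.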


  3. Great article. To me this sounds a lot like the exploration-exploitation dilemma, a notion usually used in reinforcement learning. In these situations we have to decide how much we want to exploit our current knowledge (and only present articles from the category we deem to be the most interesting) and how much we want to explore the whole space of possibilities to find better solutions.

    Do you know if there is any paper on relating the concepts of confounding variables and the exploration-exploitation dilemma?

    1. This is precisely a manifestation of the explore-exploit dilemma. The first approach which is inconsistent is essentially exploit-only and fails due to lack of exploration. The version of the problem described here is the contextual bandit problem (also called associative reinforcement learning), which is a special case of the general RL problem. If we take away the context (Seattle/NY in this case), then we are left with the standard multi-armed bandit problem.

      I think Leon’s paper mentioned in the post is a good one which presents both the classical causality angle and the more bandit-style angle. I suspect there are earlier papers too which draw on the connection, but I don’t know them off the top of my head. In general, a good rule of thumb is that if your randomization depends on some variables, then you risk having them as confounders. This is why uniform randomization is the simplest (and statistically most inefficient) way of eliminating confounders.


      1. Hi Alekh/anyone,
        Can you explain what exactly is the confounding variable in this example? Is it the flawed assumptions/prior with which articles were served?

        1. The confounding variable initially would be location, in the later example it would be the time-of-day.


  4. Nice post. How do you introduce exploration into an online ads system? Any randomization added to a good (improved over years) ad placement algorithm will reduce revenue.
    I understand that randomization is essential, but how do you convince managers, since randomization loses money in the short term? 😉

    1. It is tricky indeed, since the benefit of randomization is most easily realized in the long term. However, there are some simple things you can do. For instance, if you are able to run even some randomized flights, and show that the offline data with randomization can be used to accurately predict online performance whereas the same is not possible using offline data without randomization, that is a powerful argument. The real question to ask is: how do you even improve reliably over the years if you have no means of predicting what will work better and what will not?


Comments are closed.