One good thing about doing machine learning at present is that people actually use it! The back-ends of many systems we interact with on a daily basis are driven by machine learning. In most such systems, as users interact with the system, it is natural for the system designer to wish to optimize the models under the hood over time, in a way that improves the user experience. To ground the discussion a bit, let us consider the example of an online portal, that is trying to present interesting news stories to its user. A user comes to the portal and based on whatever information the portal has on the user, it recommends one (or more) news stories. The user chooses to read the story or not and life goes on. Naturally the portal wants to better tailor the stories it displays to the users’ taste over time, which can be observed if users start to click on the displayed story more often.
A natural idea would be to use the past logs and train a machine learning model which prefers the stories that users click on and discourages the stories which are avoided by the users. This sounds like a simple classification problem, for which we might use an off-the-shelf algorithm. This is indeed done reasonably often, and the offline logs suggest that the newly trained model will result in a lot more clicks than the old one. The new model is deployed, only to find out its performance is not as good as hoped, or even poorer than what was happening before! What went wrong? The natural reaction is typically that (a) the machine learning algorithm needs to be improved, or (b) we need better features, or (c) we need more data. Alas, in most of these cases, the right answer is (d) none of the above. Let us see why this is true through a simple example.
Imagine a simple world where some of our users are from New York and others are from Seattle. Some of our news stories pertain to finance, and others pertain to technology. Let us further imagine that the probability of a click (henceforth CTR for clickthrough rate) on a news article based on city and subject has the following distribution:
City
|
Finance CTR
|
Tech CTR
|
New York |
1 |
0.6 |
Seattle |
0.4 |
0.79 |
Table1: True (unobserved) CTRs
Of course, we do not have this information ahead of time while designing the system, so our starting system recommends articles according to some heuristic rule. Imagine that we user the rule:
- New York users get Tech stories, Seattle users get Finance stories.
Now we collect the click data according to this system for a while. As we obtain more and more data, we obtain increasingly accurate estimates of the CTR for Tech stories and NY users, as well as Finance stories and Seattle users (0.6 and 0.4 resp.). However, we have no information on the other two combinations. So if we train a machine learning algorithm to minimize the squared loss between predicted CTR on an article and observed CTR, it is likely to predict the average of observed CTRs (that is 0.5) in the other two blocks. At this point, our guess looks like:
City
|
Finance CTR
|
Tech CTR
|
New York |
1 / ? / 0.5 |
0.6 / 0.6 / 0.6 |
Seattle |
0.4 / 0.4 / 0.4 |
0.79 / ? / 0.5 |
Table2: True / observed / estimated CTRs
Note that this would be the case even with infinite data and an all powerful learner, so machine learning is not to be faulted in any way here. Given these estimates, we naturally realize that show finance articles to Seattle users was a mistake, and switch to Tech. But Tech is also looking pretty good in NY, and we stick with it. Our new policy is:
- Both NY and Seattle users get Tech articles.
Running the new system for a while, we will fix the erroneous estimates for the Tech CTR on Seattle (that is, up 0.5 to 0.79). But we still have no signal that makes us prefer Finance over Tech in NY. Indeed even with infinite data, the system will be stuck with this suboptimal choice at this point, and our CTR estimates will look something like:
City
|
Finance CTR
|
Tech CTR
|
New York |
1 / ? / 0.59 |
0.6 / 0.6 / 0.6 |
Seattle |
0.4 / 0.4 / 0.4 |
0.79 / 0.79 / 0.79 |
Table3: True / observed / estimated CTRs
We can now assess the earlier claims:
- More data does not help: Since Observed and True CTRs match wherever we are collecting data
- Better learning algorithm does not help: Since Predicted and Observed CTRs coincide wherever we are collecting data
- Better data does help!! We should not be having the blank cell in observed column.
This seems simple enough to fix though. We should have really known better than to completely omit observations in one cell of our table. With good intentions, we decide to collect data in all cells. We choose to use the following rule:
- Seattle users get Tech stories during day and finance stories during night
- Similarly, NY users get Tech stories during day and finance stories during night
We are now collecting data on each cell, but we find that our estimates still lead us to a suboptimal policy. Further investigation might reveal that users are more likely to read finance stories during the day when the markets are open. So when we only display finance stories during night, we underestimate the finance CTR and end up with wrong estimates. Realizing the error of our ways, we might try to fix this again and then run into another problem and so on.
The issue we have discovered above is that of confounding variables. There is lot of wonderful work and many techniques that can be used to circumvent confounding variables in experimentation. Here, I mention the simplest one and perhaps the most versatile one of them: Randomization. The idea is that instead of recommending stories to users according to a fix deterministic rule, we allow for different articles to be presented to the user according to some distribution. This distribution does not have to be uniform. In fact, good randomization would likely focus on plausibly good articles so as to not degrade the user experience. However, as long as we add sufficient randomization, we can then obtain consistent counterfactual estimates of quantities from our experimental data. There is growing literature on how to do this well. A nice paper which covers some of these techniques and provides an empirical evaluation is http://arxiv.org/abs/1103.4601. A more involved example in the context of computational advertising at Microsoft is discussed in http://leon.bottou.org/papers/bottou-jmlr-2013.