Machine Learning (Theory)


New York Machine Learning Deadlines

There’s a number of different Machine Learning related paper deadlines that may interest.

January 29 (abstract) for March 4 New York ML Symposium Register early because NYAS can only fit 300.
January 27 (abstract)/February 2 (paper) for July 9-15 IJCAI The biggest AI conference
February 5(paper) for June 19-24 ICML Nina and Kilian have 850 well-vetted reviewers. Marek and Peder have increased space to allow 3K people.
February 12(paper) for June 23-26 COLT Vitaly and Sasha are program chairs.
February 12(proposal) for June 23-24 ICML workshops Fei and Ruslan are the workshop chairs. I really like workshops.
February 19(proposal) for June 19 ICML tutorials Bernhard and Alina have invited a few tutorials already but are saving space for good proposals as well.
March 1(paper) for June 25-29 UAI Jersey City isn’t quite New York, but it’s close enough :-)
May ~2 for June 23-24 ICML workshops Varies with the workshop.


Interesting things at NIPS 2015

Tags: Conferences,Machine Learning jl@ 1:43 pm

NIPS is getting big. If you think of each day as a conference crammed into a day, you get a good flavor of things. Here are some of the interesting things I saw.

Two other notable events happened during NIPS.

  1. The Imagenet challenge and MS COCO results came out. The first represents a significant improvement over previous years (details here).
  2. The Open AI initiative started. Concerned billionaires create a billion dollar endowment to advance AI in a public(NOT Private) way. What will be done better than NSF (which has a similar(ish) goal)? I can think of many possibilities.

See also Seb’s post.


CNTK and Vowpal Wabbit tutorials at NIPS

Both CNTK and Vowpal Wabbit have pirate tutorials at NIPS. The CNTK tutorial is 1 hour during the lunch break of the Optimization workshop while the VW tutorial is 1 hour during the lunch break of the Extreme Multiclass workshop. Consider dropping by either if interested.

CNTK is a deep learning system started by the speech people who started the deep learning craze and grown into a more general platform-independent deep learning system. It has various useful features, the most interesting of which is perhaps efficient scalable training. Using GPUs with allreduce and one-bit sgd it achieves both high efficiency and scalability over many more GPUs than could ever fit into a single machine. This capability is unique amongst all open deep learning codebases so everything else looks nerfed in comparison. CNTK was released in April so this is the first chance for many people to learn about it. See here for more details.

The Vowpal Wabbit tutorial just focuses on what is new this year.

  1. The learning to search framework has greatly matured and is now easily used to solve ad-hoc joint(structured) prediction problems. The ICML tutorial covers algorithms/analysis so this is about using the system.
  2. VW has also become the modeling element of a larger system (called the decision service) which gathers data and uses it as per Contextual Bandit learning. This is now generally usable, and is the first general purpose system of this sort.


ICML 2016 in NYC and KDD Cup 2016

ICML 2016 is in New York City. I expect it to be the largest ICML by far given the destination—New York is the place which is perhaps easiest to reach from anywhere in the world and it has the largest machine learning meetup anywhere in the world.

I am the general chair this year, which is light in work but heavy in responsibilities. Some things I worry about:

  1. How many people will actually come? Numbers are difficult to guess with the field growing and the conference changing locations. I believe we need capacity for at least 3000 people based on everything I know.
  2. New York is expensive. What can be done about it? One thought is that we should actively setup a roommate finding system so the costs of hotels can be shared. Up to 3 people can share a hotel room for the conference hotel (yes, each with their own bed), and that makes the price much more reasonable. I’m also hoping donations will substantially defray the cost. If others have creative ideas, I’m definitely interested.

Markus Weimer also points out the 2016 KDD Cup which has a submission deadline of December 6. KDD Cup datasets have become common reference for many machine learning papers, so this is a good way to get your problem solved well by many people.


Randomized experimentation

One good thing about doing machine learning at present is that people actually use it! The back-ends of many systems we interact with on a daily basis are driven by machine learning. In most such systems, as users interact with the system, it is natural for the system designer to wish to optimize the models under the hood over time, in a way that improves the user experience. To ground the discussion a bit, let us consider the example of an online portal, that is trying to present interesting news stories to its user. A user comes to the portal and based on whatever information the portal has on the user, it recommends one (or more) news stories. The user chooses to read the story or not and life goes on. Naturally the portal wants to better tailor the stories it displays to the users’ taste over time, which can be observed if users start to click on the displayed story more often.

A natural idea would be to use the past logs and train a machine learning model which prefers the stories that users click on and discourages the stories which are avoided by the users. This sounds like a simple classification problem, for which we might use an off-the-shelf algorithm. This is indeed done reasonably often, and the offline logs suggest that the newly trained model will result in a lot more clicks than the old one. The new model is deployed, only to find out its performance is not as good as hoped, or even poorer than what was happening before! What went wrong? The natural reaction is typically that (a) the machine learning algorithm needs to be improved, or (b) we need better features, or (c) we need more data. Alas, in most of these cases, the right answer is (d) none of the above. Let us see why this is true through a simple example.

Imagine a simple world where some of our users are from New York and others are from Seattle. Some of our news stories pertain to finance, and others pertain to technology. Let us further imagine that the probability of a click (henceforth CTR for clickthrough rate) on a news article based on city and subject has the following distribution:


Finance CTR

Tech CTR

New York 1 0.6
Seattle 0.4 0.79

Table1: True (unobserved) CTRs

Of course, we do not have this information ahead of time while designing the system, so our starting system recommends articles according to some heuristic rule. Imagine that we user the rule:

  • New York users get Tech stories, Seattle users get Finance stories.

Now we collect the click data according to this system for a while. As we obtain more and more data, we obtain increasingly accurate estimates of the CTR for Tech stories and NY users, as well as Finance stories and Seattle users (0.6 and 0.4 resp.). However, we have no information on the other two combinations. So if we train a machine learning algorithm to minimize the squared loss between predicted CTR on an article and observed CTR, it is likely to predict the average of observed CTRs (that is 0.5) in the other two blocks. At this point, our guess looks like:



Finance CTR

Tech CTR

New York 1 / ? / 0.5 0.6 / 0.6 / 0.6
Seattle 0.4 / 0.4 / 0.4 0.79 / ? / 0.5

Table2: True / observed / estimated CTRs

Note that this would be the case even with infinite data and an all powerful learner, so machine learning is not to be faulted in any way here. Given these estimates, we naturally realize that show finance articles to Seattle users was a mistake, and switch to Tech. But Tech is also looking pretty good in NY, and we stick with it. Our new policy is:

  • Both NY and Seattle users get Tech articles.

Running the new system for a while, we will fix the erroneous estimates for the Tech CTR on Seattle (that is, up 0.5 to 0.79). But we still have no signal that makes us prefer Finance over Tech in NY. Indeed even with infinite data, the system will be stuck with this suboptimal choice at this point, and our CTR estimates will look something like:


Finance CTR

Tech CTR

New York 1 / ? / 0.59 0.6 / 0.6 / 0.6
Seattle 0.4 / 0.4 / 0.4 0.79 / 0.79 / 0.79

Table3: True / observed / estimated CTRs

We can now assess the earlier claims:

  1. More data does not help: Since Observed and True CTRs match wherever we are collecting data
  2. Better learning algorithm does not help: Since Predicted and Observed CTRs coincide wherever we are collecting data
  3. Better data does help!! We should not be having the blank cell in observed column.

This seems simple enough to fix though. We should have really known better than to completely omit observations in one cell of our table. With good intentions, we decide to collect data in all cells. We choose to use the following rule:

  • Seattle users get Tech stories during day and finance stories during night
  • Similarly, NY users get Tech stories during day and finance stories during night

We are now collecting data on each cell, but we find that our estimates still lead us to a suboptimal policy. Further investigation might reveal that users are more likely to read finance stories during the day when the markets are open. So when we only display finance stories during night, we underestimate the finance CTR and end up with wrong estimates. Realizing the error of our ways, we might try to fix this again and then run into another problem and so on.

The issue we have discovered above is that of confounding variables. There is lot of wonderful work and many techniques that can be used to circumvent confounding variables in experimentation. Here, I mention the simplest one and perhaps the most versatile one of them: Randomization. The idea is that instead of recommending stories to users according to a fix deterministic rule, we allow for different articles to be presented to the user according to some distribution. This distribution does not have to be uniform. In fact, good randomization would likely focus on plausibly good articles so as to not degrade the user experience. However, as long as we add sufficient randomization, we can then obtain consistent counterfactual estimates of quantities from our experimental data. There is growing literature on how to do this well. A nice paper which covers some of these techniques and provides an empirical evaluation is A more involved example in the context of computational advertising at Microsoft is discussed in



« Newer PostsOlder Posts »

Powered by WordPress