When the bubble bursts…

Consider the following facts:

  1. NIPS submissions are up 50% this year to ~4800 papers.
  2. There is significant evidence that the process of reviewing papers in machine learning is creaking under several years of exponential growth.
  3. Public figures often overclaim the state of AI.
  4. Money rains from the sky on ambitious startups with a good story.
  5. Apparently, we now even have a fake conference website (https://nips.cc/ is the real one for NIPS).

We are clearly not in a steady-state situation. Is this a bubble or a revolution? The answer surely includes a bit of revolution—the fields of vision and speech recognition have been turned over by great empirical successes created by deep neural architectures, and more generally machine learning has found plentiful real-world uses.

At the same time, I find it hard to believe that we aren’t living in a bubble. There was an AI bubble in the 1980s (before my time), a tech bubble around 2000, and we seem to have a combined AI/tech bubble going on right now. This is great in some ways—many companies are handing out professional-sports-scale signing bonuses to researchers. It’s a little worrisome in other ways—can the field effectively handle the stress of the influx?

It’s always hard to say when and how a bubble bursts. It might happen today or in several years and it may be a coordinated failure or a series of uncoordinated failures.

As a field, we should consider the coordinated failure case a little bit. What fraction of the field is currently at companies, or in units within companies, that are very expensive without yet justifying that expense? It’s no longer a small fraction, so there is a chance of something traumatic for both the people and the field when/where there is a sudden cut-off. My experience is that cuts typically happen quite quickly when they come.

As an individual researcher, consider this an invitation to awareness and a small amount of caution. I’d like everyone to be fully aware that we are in a bit of a bubble right now and consider it in their decisions. Caution should not be overdone—I’d gladly repeat the experience of going to Yahoo! Research even knowing how it ended. There are two natural elements here:

  1. Where do you work as a researcher? The best place to be when a bubble bursts is on the sidelines.
    1. Is it in the middle of a costly venture? Companies are not good places for this in the long term, whether a startup or a business unit. Being a researcher at a place desperately trying to figure out how to make research valuable doesn’t sound pleasant.
    2. Is it in the middle of a clearly valuable venture? That could be a good place. If you are interested, we are hiring.
    3. Is it in academia? Academia has a real claim to stability over time, but at the same time opportunity may be lost. I’ve greatly enjoyed and benefited from the opportunity to work with highly capable colleagues on the most difficult problems. Assembling the capability to do that in an academic setting seems difficult since the typical maximum scale of research in academia is a professor+students.
  2. What do you work on as a researcher? Some approaches are more “bubbly” than others—they might look good, but do they really provide value?
    1. Are you working on intelligence imitation or intelligence creation? Intelligence creation ends up being more valuable in the long term.
    2. Are you solving synthetic or real-world problems? If you are solving real-world problems, you are almost certainly creating value. Synthetic problems can lead to real-world solutions, but the path is often fraught with unforeseen difficulties.
    3. Are you working on a solution to one problem or many problems? A wide applicability for foundational solutions clearly helps when a bubble bursts.

Researchers have a great ability to survive a bubble bursting—a built-up public record of their accomplishments. If you are in a good environment doing valuable things and that environment happens to implode one day, the strength of your publications is an immense aid in landing on your feet.

ICML Board and Reviewer profiles

The outcome of the election for the IMLS (which runs ICML) adds Emma Brunskill and Hugo Larochelle to the board. The current members of the board (and the reason for board membership) are:

President Elect is a 2-year position with little responsibility, but I decided to look into two things. One is the website, which seems relatively difficult to navigate. Ideas for how to improve are welcome.

The other is creating a longitudinal reviewer profile. I keenly remember the day after reviews were due when I was program chair (in 2012), which left a panic-inducing number of unfinished reviews. To help with this, I’m planning to create a profile of reviewers which program chairs can refer to in making decisions about who to ask to review. There are a number of ways to do this wrong, which I’m avoiding with the following procedure (sketched in code after the list):

  1. After reviews are assigned, capture the reviewer/paper assignment. Call this set A.
  2. After reviews are due, capture the completed & incomplete reviews for papers. Call these sets B & C respectively.
  3. Strip the paper ids from B (completed reviews), turning it into a multiset D of reviewers’ completed reviews.
  4. Compute C∩A (the incomplete reviews that were part of the original assignment), then strip the paper ids to turn it into a multiset E of reviewers’ incomplete reviews.
  5. Store D & E for long-term reference.
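
For concreteness, here is a minimal sketch of this bookkeeping in Python. The (reviewer, paper) record format and the function name are illustrative assumptions on my part, not the actual conference-management implementation.

    from collections import Counter

    def reviewer_profile(assigned, completed, incomplete):
        """assigned:   set A of (reviewer, paper) pairs captured at assignment time.
        completed:  set B of (reviewer, paper) pairs finished by the deadline.
        incomplete: set C of (reviewer, paper) pairs unfinished at the deadline.
        Returns multisets D and E of completed/incomplete reviews per reviewer."""
        # Step 3: strip paper ids from B, leaving a multiset of reviewers.
        D = Counter(reviewer for reviewer, _ in completed)
        # Step 4: only count an incomplete review if it was part of the original
        # assignment, so reviewers who receive papers late are never penalized.
        E = Counter(reviewer for reviewer, paper in incomplete
                    if (reviewer, paper) in assigned)
        # Step 5: D and E are what gets stored for long-term reference.
        return D, E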

This approach:

  • Is objectively defined. Approaches based on subjective measurements seem both fraught with judgment issues and inconsistent. Consider for example the impressive variation we all see in review quality.
  • Does not record a review as late for reviewers who are assigned a paper late in the process, via steps (1) and (4). We want to encourage reviewers to take on the unusual but important late tasks that arrive.
  • Does not record a review as late for reviewers who discover they are inappropriate after assignment and ask for reassignment. We want to encourage reviewers to look at their papers early and, if necessary, ask for a paper to be reassigned early.
  • Preserves anonymity of paper/reviewer assignments for authors who later become program chairs. The conversion into a multiset removes the paper id entirely.

Overall, my hope is that several years of this will provide a good and useful tool enabling program chairs and good (or at least not-bad) reviewers to recognize each other.

Pervasive Simulator Misuse with Reinforcement Learning

The surge of interest in reinforcement learning is great fun, but I often see confused choices in applying RL algorithms to solve problems. There are two purposes for which you might use a world simulator in reinforcement learning:

  1. Reinforcement Learning Research: You might be interested in creating reinforcement learning algorithms for the real world and use the simulator as a cheap alternative to actual real-world application.
  2. Problem Solving: You want to find a good policy solving a problem for which you have a good simulator.

In the first instance I have no problem, but in the second instance, I’m seeing many head-scratcher choices.

A reinforcement learning algorithm engaging in policy improvement from a continuous stream of experience needs to solve an opportunity-cost problem. (The RL lingo for opportunity-cost is “advantage”.) Thinking about this in the context of a 2-person game: at a given state, with your existing rollout policy, is taking the first action good or bad if it leads to a win 1/2 the time? It could be good since the player is well behind and every other action is worse. Or it could be bad since the player is well ahead and every other action is better. Understanding one action’s long-term value relative to another’s is the essence of the opportunity-cost trade-off at the core of many reinforcement learning algorithms.
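
In standard RL notation (nothing specific to this post), this opportunity cost is the advantage of an action over the policy’s average behavior:

    % Advantage of action a in state s under policy \pi:
    % how much better (or worse) a is than what \pi achieves on average from s.
    A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)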

If you have a choice between an algorithm that estimates the opportunity cost and one which observes the opportunity cost, which works better? Using the observed opportunity cost is an almost pure winner because it cuts out the effect of estimation error. In the real world, you can’t observe the opportunity cost directly, Groundhog Day style. How many times have you left a conversation and thought to yourself: I wish I had said something else? A simulator is different though—you can reset a simulator. And when you do reset a simulator, you can directly observe the opportunity cost of an action, which can then directly drive learning updates.
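
Here is a minimal sketch of what using the reset looks like in Python. The simulator interface (restore_state, step, observe) is a hypothetical stand-in for whatever your simulator provides, not any particular library’s API.

    def rollout_return(sim, policy, horizon=1000):
        """Follow the rollout policy until termination (or the horizon), summing rewards."""
        total, done, t = 0.0, False, 0
        while not done and t < horizon:
            _, reward, done = sim.step(policy(sim.observe()))
            total += reward
            t += 1
        return total

    def observed_advantages(sim, state, actions, policy, rollouts_per_action=4):
        """Reset the simulator to the same state for each candidate action and
        directly observe the resulting returns instead of estimating them."""
        returns = {}
        for a in actions:
            samples = []
            for _ in range(rollouts_per_action):
                sim.restore_state(state)          # the "reset cheat"
                _, reward, done = sim.step(a)
                samples.append(reward + (0.0 if done else rollout_return(sim, policy)))
            returns[a] = sum(samples) / len(samples)
        baseline = sum(returns.values()) / len(returns)
        # Observed opportunity cost of each action relative to the average action.
        return {a: r - baseline for a, r in returns.items()}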

If you are coming from viewpoint 1, using a “reset cheat” is unappealing since it doesn’t work in the real world and the goal is making algorithms which work in the real world. On the other hand, if you are operating from viewpoint 2, the “reset cheat” is a gigantic opportunity to dramatically improve learning algorithms. So, why are many people with goal 2 using algorithms designed for goal 1? I don’t know, but here are some hypotheses.

  1. Maybe people just aren’t aware that goal 2 style algorithms exist? They are out there. The most prominent examples of goal 2 style algorithms are from Learning to search and AlphaGo Zero.
  2. Maybe people are worried about the additional sample complexity of doing multiple rollouts from reset points? But these algorithms typically require little additional sample complexity in the worst case and can provide gigantic wins. People commonly use a discount factor d that values future rewards t timesteps ahead with a weight of d^t. Alternatively, you can terminate rollouts with probability 1-d and value future rewards with no discount while preserving the expected value (a small sketch of this termination trick appears after this list). Using this approach, a rollout terminates after an expected 1/(1-d) timesteps, bounding the cost of a reset and rollout. Since it is common to use very heavy discounting (e.g. d=0.9), the worst case additional sample complexity is only a small factor larger. On the upside, eliminating estimation error can radically reduce sample complexity in theory and practice.
  3. Maybe the implementation overhead for a second family of algorithms is too difficult? But the choice of whether or not you use resets is far more important than “oh, we’ll just run things for 10x longer”. It can easily make or break the outcome.
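
For hypothesis 2, here is a minimal sketch of the termination trick: terminating with probability 1-d at each step and summing undiscounted rewards matches d^t discounting in expectation, and a rollout lasts 1/(1-d) steps on average, bounding the cost of each reset. The simulator interface is again a hypothetical stand-in.

    import random

    def discounted_return(sim, policy, d, horizon=1000):
        """Classic discounted return: the reward at step t is weighted by d**t."""
        total, weight, done, t = 0.0, 1.0, False, 0
        while not done and t < horizon:
            _, reward, done = sim.step(policy(sim.observe()))
            total += weight * reward
            weight *= d
            t += 1
        return total

    def terminated_return(sim, policy, d):
        """Equal in expectation: stop with probability 1-d at each step, no discounting."""
        total, done = 0.0, False
        while not done:
            _, reward, done = sim.step(policy(sim.observe()))
            total += reward
            if random.random() > d:   # continue with probability d, stop with probability 1-d
                break
        return total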

Maybe there is some other reason? As I said above, this is a head-scratcher that I find myself trying to address regularly.

Vowpal Wabbit 8.5.0 & NIPS tutorial

Yesterday, I tagged VW version 8.5.0, which has many interactive learning improvements (both contextual bandit and active learning), better support for sparse models, and a new baseline reduction which I’m considering making a part of the default update rule.

If you want to know the details, we’ll be doing a mini-tutorial during the Friday lunch break at the Extreme Classification workshop at NIPS. Please join us if interested.

Edit: also announced at the Learning Systems workshop