8.3 should be backwards compatible with the entire 8.x series. There have been big changes since the last version related to:

- Contextual bandits, particularly w.r.t. the decision service.
- Learning to search, for which we have a paper at NIPS.
- Logarithmic time multiclass classification.

I also wanted to share some statistics from registration that might be of general interest.

The total number of people attending: 3103.

Industry: 47% University: 46%

Male: 83% Female: 14%

Local (NY, NJ, or CT): 27%

North America: 70% Europe: 18% Asia: 9% Middle East: 2% Remainder: <1% including 2 from Antarctica

David Silver gave one of the best tutorials I’ve seen on his group’s recent work in “deep” reinforcement learning. I learned about a few new techniques, including the benefits of asynchronous updates in distributed Q-learning https://arxiv.org/abs/1602.01783, which was presented in more detail at the main conference. The new domains being explored were exciting, as were the improvements made on the computational side. I would love to see more pointers to some of the related work from the tutorial, particularly given there was such an exciting mix of new techniques and old staples (e.g. experience replay http://www.dtic.mil/dtic/tr/fulltext/u2/a261434.pdf ), but the talk was so information packed it would have been difficult.

Pieter Abbeel gave an outstanding talk in the Abstraction in RL workshop http://rlabstraction2016.wix.com/icml#!schedule/bx34m, and (I heard) another excellent one during the deep learning workshop.

It was rumored that Aviv Tamar gave an exciting talk (I believe on this http://arxiv.org/abs/1602.02867), but I was forced to miss it to see Rong Ge’s https://users.cs.duke.edu/~rongge/ outstanding talk on a new-ish geometric tool for understanding non-convex optimization, the *strict saddle.* I first read about the approach here http://arxiv.org/abs/1503.02101, but at ICML he and other authors have demonstrated a remarkable number of problems that have this property, which enables efficient optimization via stochastic gradient descent (and other) procedures.

This was a theme of ICML— an incredible amount of good material, so much that I barely saw the posters at all because there was nearly always a talk I wanted to see!

Rocky Duan surveyed some benchmark RL continuous control problems http://jmlr.org/proceedings/papers/v48/duan16.pdf. An interesting theme of the conference, which also came up in conversation with John Schulman and Yann LeCun, was really old methods working well. In fact, this group demonstrated that variants of the natural/covariant policy gradient proposed originally by Sham Kakade (with a derivation here: http://repository.cmu.edu/cgi/viewcontent.cgi?article=1080&context=robotics) are largely at the state-of-the-art on many benchmark problems. There are some clever tricks necessary for large policy classes like neural networks (like using a partial-least squares-style truncated conjugate gradient to solve for the change in policy in the usual F \delta = \nabla one solves in the natural gradient procedure) that dramatically improve performance (https://arxiv.org/abs/1502.05477). I had begun to view these methods as doing little better (or worse) than black-box search, so it’s exciting to see them make a comeback.
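To make the F \delta = \nabla step concrete, here is a minimal sketch of a truncated conjugate gradient solve that touches F only through matrix-vector products, which is what makes the trick feasible for large policy classes. The 2x2 matrix, the gradient values, and all function names are made up for illustration; a real implementation would compute Fisher-vector products from the policy rather than store F explicitly.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products fvp(v) = F v.

    Stopping after a small number of iterations gives the truncated solve.
    """
    x = [0.0] * len(g)
    r = list(g)  # residual g - F x, with x starting at zero
    p = list(g)  # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Fp = fvp(p)
        alpha = rr / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


# Toy stand-in for the Fisher matrix (symmetric positive definite) and gradient.
F = [[4.0, 1.0], [1.0, 3.0]]
grad = [1.0, 2.0]


def fisher_vector_product(v):
    """Hypothetical F v; here just a dense matrix-vector product."""
    return [dot(row, v) for row in F]


delta = conjugate_gradient(fisher_vector_product, grad)
print(delta)  # natural gradient direction for the toy problem
```

For this toy problem the exact answer is (1/11, 7/11), which conjugate gradient reaches in two iterations since the system is 2-dimensional.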

Chelsea Finn http://people.eecs.berkeley.edu/~cbfinn/ gave an outstanding talk on this work https://arxiv.org/abs/1603.00448. She and co-authors (Sergey Levine and Pieter) effectively came up with a technique that lets one apply Maximum Entropy Inverse Optimal Control without the double-loop procedure and using policy gradient techniques. Jonathan Ho described a related algorithm http://jmlr.org/proceedings/papers/v48/ho16.pdf that also appeared to mix policy gradient and an optimization over cost functions. Both are definitely on my reading list, and I want to understand the trade-offs of the techniques.

Both presentations were informative, and both made the interesting connection to Generative Adversarial Nets (GANs) http://arxiv.org/abs/1406.2661. These were also a theme of the conference in both talks and during discussions. A very cool idea getting more traction, and being embraced by the neural net pioneers.

David Belanger https://people.cs.umass.edu/~belanger/belanger_spen_icml.pdf gave an interesting talk on using backprop to optimize a structured output relative to a learned cost function. I left thinking the technique was closely related to inverse optimal control methods and the GANs, and wanting to understand how implicit differentiation wasn’t being used to optimize the energy function parameters.

Speaking of neural net pioneers: there were lots of good talks during both the main conference and workshops on what’s new (and what’s old https://sites.google.com/site/nnb2tf/) in neural network architectures and algorithms.

I was intrigued by http://jmlr.org/proceedings/papers/v48/balduzzi16.pdf and particularly by the well written blog post it mentions http://colah.github.io/posts/2015-09-NN-Types-FP/ by Christopher Olah. The notion that we need language tools to structure the design of learning programs (e.g. http://www.umiacs.umd.edu/~hal/docs/daume14lts.pdf) and have tools to reason about them seems to be gaining currency. After reading these, I began to view some of the recent work of Wen, Arun, Byron, and myself (including at ICML http://jmlr.org/proceedings/papers/v48/sun16.pdf) in this light: generative RNNs “should” have a well defined hidden state whose “type” is effectively (moments of) future observations. I wonder now if there is a larger lesson here in the design of learning programs.

Nando de Freitas and colleagues’ approach of separating value and advantage function predictions in one network http://jmlr.org/proceedings/papers/v48/wangf16.pdf was quite interesting and had a lot of buzz.
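As I understand the paper, the two predictions are recombined into action values by adding the state value to mean-centered advantages. A minimal sketch of just that combination step, with hypothetical names and made-up numbers (the network producing the two streams is omitted):

```python
def dueling_q_values(value, advantages):
    """Combine a scalar state value with per-action advantages.

    Subtracting the mean advantage keeps the split between value and
    advantage identifiable: shifting all advantages by a constant
    leaves the resulting action values unchanged.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [value + a - mean_advantage for a in advantages]


# Made-up outputs of the two streams for one state with three actions.
q = dueling_q_values(2.0, [1.0, 0.0, -1.0])
print(q)  # [3.0, 2.0, 1.0]
```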

Ian Osband gave an amazing talk on another topic that previously made me despair: exploration in RL http://jmlr.org/proceedings/papers/v48/osband16.pdf. This is one of few approaches that combines the ability to use function approximation with rigorous exploration guarantees/sample complexity in the tabular case (and amazingly *better* sample complexity than previous papers that work only in the tabular case). Super cool and also very high on my reading list.

Boaz Barak http://www.boazbarak.org/ gave a truly inspired talk that mixed a kind of coherent computationally-bounded Bayesian-ism (Slogan: “Compute like a frequentist, think like a Bayesian.”) with demonstrating a lower bound for SoS procedures. Well outside of my expertise, but delivered in a way that made you feel like you understood all of it.

Honglak Lee gave an exciting talk on the benefits of semi-supervision in CNNs http://web.eecs.umich.edu/~honglak/icml2016-CNNdec.pdf. The authors demonstrated that a remarkable amount of information needed to reproduce an input image was preserved quite deep in CNNs, and further that encouraging the ability to reconstruct could significantly enhance discriminative performance on real benchmarks.

The problem with this ICML is that I think it would take literally weeks of reading/watching talks to really absorb the high quality work that was presented. I’m *very* grateful to the organizing committee http://icml.cc/2016/?page_id=39 for making it so valuable.

Reinforcement learning is much discussed these days with successes like AlphaGo. Wouldn’t it be great if Reinforcement Learning algorithms could easily be used to solve all reinforcement learning problems? But there is a well-known problem: It’s very easy to create natural RL problems for which all standard RL algorithms (epsilon-greedy Q-learning, SARSA, etc…) fail catastrophically. That’s a serious limitation which both inspires research and which I suspect many people need to learn the hard way.

Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.

Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”
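The logging requirement is the part that is easy to get wrong, so here is a minimal sketch of an exploration step that records features, action, probability of the action, and (later) the outcome. It assumes plain epsilon-greedy exploration; the function name, the record layout, and the toy policy are all hypothetical and are not the interface of the actual system.

```python
import random


def explore_and_log(context, actions, policy, epsilon, log, rng):
    """Pick an action epsilon-greedily and log the probability it was chosen with."""
    greedy = policy(context)
    if rng.random() < epsilon:
        action = rng.choice(actions)  # explore uniformly at random
    else:
        action = greedy               # exploit the current policy
    # Every action gets epsilon / K from exploration; the greedy action
    # additionally gets the (1 - epsilon) exploitation mass.
    probability = epsilon / len(actions)
    if action == greedy:
        probability += 1.0 - epsilon
    record = {
        "features": context,
        "action": action,
        "probability": probability,
        "outcome": None,  # filled in once the reward is observed and joined
    }
    log.append(record)
    return record


rng = random.Random(0)
log = []
actions = ["sports", "politics", "tech"]


def toy_policy(context):
    """Made-up policy mapping features to an action."""
    return "sports" if context["age"] < 30 else "politics"


record = explore_and_log({"age": 25}, actions, toy_policy, 0.3, log, rng)
record["outcome"] = 1.0  # e.g. a click, joined to the decision afterwards
print(record)
```

Recording the probability at decision time, rather than trying to reconstruct it later, is what makes the logged data usable for unbiased offline evaluation and learning.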

A key foundational idea is **Multiworld Testing**: the capability to evaluate large numbers of policies mapping features to actions in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research. The key design principles are:

- Contextual Bandits. Many people have tried to create online learning systems that do not take into account the biasing effects of decisions. These fail near-universally. For example they might be very good at predicting what *was* shown (and hence clicked on) rather than what *should* be shown to generate the most interest.
- Data Lifecycle support. This system supports the entire process of data collection, joining, learning, and deployment. Doing this eliminates many stupid-but-killer bugs that I’ve seen in practice.
- Modularity. The system decomposes into pieces: exploration library, client library, online learner, join server, etc… because I’ve seen too many cases where the pieces are useful but the system is not.
- Reproducibility. Everything is logged in a fashion which makes online behavior offline reproducible. Consequently, the system is debuggable and hence improvable.
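The efficiency claim for Multiworld Testing comes from reusing one exploration log to score many policies offline. A minimal sketch using the standard inverse propensity score estimator, with a made-up four-event log and hypothetical policy names (this illustrates the idea, not the system's code):

```python
def ips_value(log, policy):
    """Inverse propensity score estimate of a policy's average reward.

    Each log entry is (features, action, probability, reward), where
    probability is the chance the logged action was chosen at decision time.
    """
    total = 0.0
    for features, action, probability, reward in log:
        if policy(features) == action:
            # Reweight matching events to correct for how rarely they were explored.
            total += reward / probability
    return total / len(log)


# Made-up exploration data: two actions, each chosen with probability 0.5.
log = [
    ({"age": 25}, "sports", 0.5, 1.0),
    ({"age": 25}, "tech", 0.5, 0.0),
    ({"age": 40}, "sports", 0.5, 0.0),
    ({"age": 40}, "tech", 0.5, 1.0),
]


def always_sports(features):
    return "sports"


def by_age(features):
    return "sports" if features["age"] < 30 else "tech"


# The same log scores both policies; no separate A/B arm per policy is needed.
print(ips_value(log, always_sports))  # 0.5
print(ips_value(log, by_age))         # 1.0
```

An A/B test needs fresh traffic for every policy it compares, while here the cost of evaluating one more policy is just another pass over the log.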

The system we’ve created is open source with system components in mwt-ds and the core learning algorithms in Vowpal Wabbit. If you use everything, it enables a fully automatic causally sound learning loop for contextual control of a small number of actions. This is strongly scalable: for example, a version of this is in use for personalized news on MSN. It can be either low-latency (with a client side library) or cross platform (with a JSON REST web interface). Advanced exploration algorithms are available to enable better exploration strategies than simple epsilon-greedy baselines. The system autodeploys into a chosen Azure account with a baseline cost of about $0.20/hour. The autodeployment takes a few minutes after which you can test or use the system as desired.

This system is open source and there are many ways for people to help if they are interested. For example, support for the client-side library in more languages, support of other learning algorithms & systems, better documentation, etc… are all obviously useful.

Have fun.

At ICML last year and the year before the amount of capacity needed to fit everyone on any single day was about 1500. My advice was to expect 2000 and have capacity for 2500 because “New York” and “Machine Learning”. Was history right? Or New York and buzz?

I was not involved in the venue negotiations, but my understanding is that they were difficult, with liabilities over $1M for IMLS, the nonprofit which oversees ICML year to year. The result was a conference plan with a maximum capacity of 1800 for the main conference, a bit less for workshops, and perhaps 1000 for tutorials.

Then the NIPS registration numbers came in: 3900 last winter. It’s important to understand here that a registration is not a person since not everyone registers for the entire event. Nevertheless, NIPS was very large with perhaps 3K people attending at any one time. Historically, NIPS is the conference most similar to ICML with a history of NIPS being a bit larger. Most people I know treat these conferences as indistinguishable other than timing: ICML in the summer and NIPS in the winter.

Given this, I had to revise my estimate up: We should really have capacity for 3000, not 2500. It also convinced everyone that we needed to negotiate for more space with the Marriott. This again took quite a while with the result being a modest increase in capacity for the conference (to 2100) and the workshops, but nothing for the tutorials.

The situation with tutorials looked terrible while the situation with workshops looked poor. Acquiring more space at the Marriott looked near impossible. Tutorials require a large room, so we looked into the Kimmel Center at NYU, acquiring a large room and increasing capacity to 1450 for the tutorials. We also looked into additional rooms for workshops, finding one at Columbia and another at the Microsoft Technology Center, which has a large public use room 2 blocks from the Marriott. Other leads did not pan out.

This allowed us to cover capacity through early registration (May 7th). Based on typical early vs. late registration distributions I was expecting registrations might need to close a bit early similar to what happened with KDD in 2014.

Then things blew up. Tutorial registration reached capacity the week of May 23rd, and then all registration stopped May 28th, 3 weeks before the conference. Aside from simply failing to meet demand this also creates lots of problems. What do you do with authors? And when I looked into things in detail for workshops I realized we were badly oversubscribed for some workshops. It’s always difficult to guess which distribution of room sizes is needed to support the spectrum of workshop interests in advance so there were serious problems. What could we do?

The first step was tutorial and main conference registration which reopened last Tuesday using some format changes which allowed us to increase capacity further. We will use simulcast to extra rooms to support larger audiences for tutorials and plenary talks allowing us to up the limit for tutorials to 1590 and for the main conference to 2400. We’ve also shifted the poster session to run in parallel with main tracks rather than in the evening. Now, every paper will have 3-4 designated hours during the day (ending at 7pm) for authors to talk to people individually. As a side benefit, this will also avoid the competition between posters and company-sponsored parties which have become common. We’ll see how this works as a format, but it was unavoidable here: even without increasing registration the existing evening poster session plan was a space disaster.

The workshop situation was much more difficult. I walked all over the nearby area on Wednesday, finding various spaces and getting quotes. I also realized that the largest room at the Crowne Plaza could help with our tutorials: it was both bigger and much closer than NYU. On Thursday, we got contract offers from the promising venues and debated into the evening. On Friday morning at 6am the Marriott suddenly gave us a bunch of additional space for the workshops. Looking through things, it was enough to shift us from ‘oversubscribed’ to ‘crowded’ with little capacity to register more given natural interests. We developed a new plan on the fly, changed contracts, negotiated prices down, and signed Friday afternoon.

The local chairs (Marek Petrik and Peder Olsen) and Mary Ellen were working hard with me through this process. Disruptive venue changes 3 weeks before the conference are obviously not the recommended way of doing things:-) And yet it seems to be working out now, much better than I expected last weekend. Here’s the situation:

- Tutorials: ~1600 registered with capacity for 1850. I expect this to run out of capacity, but it will take a little while. I don’t see a good way to increase capacity further.
- The main conference has ~2200 registered with capacity for 2400. Maybe this can be increased a little bit, but it is quite possible the main conference will run out of capacity as well. If it does, only authors will be allowed to register.
- Workshops: ~1900 registered with capacity for 3000. Only the Deep Learning workshop requires a simulcast. It seems very unlikely that we’ll run out of capacity so this should be the least crowded part of the conference. We even have some left-over little rooms (capacity for 125 or less) that are looking for a creative use if you have one.

In this particular case, “New York” was both part of the problem and much of the solution. Where else can you walk around and find large rooms on short notice within 3 short blocks? That won’t generally be true in the future, so we need to think carefully about how to estimate attendance.

The price becomes much more reasonable if you can find roommates to share the cost. For example, the conference hotel can have 3 beds in a room.

This still leaves a coordination problem: How do you find plausible roommates? If only there was a website where the participants in a conference could look for roommates. Oh wait, there is. Conferenceshare.co is something new which might measurably address the cost problem. Obviously, you’ll want to consider roommate possibilities carefully, but now at least there is a place to meet.

Note that the early registration deadline for ICML is May 7th.

The program is shaping up and should be of interest. The 9 Tutorials(**), 4 Invited Speakers, and 23 Workshops are all chosen, with paper decisions due out in a couple of weeks.

| | Early | Full (after May 7) |
|---|---|---|
| Student | 510 | 640 |
| Regular | 840 | 1050 |

These numbers are as aggressively low as the local chairs and I can sleep with at night. The prices are higher than I’d like (New York is expensive), but a bit lower than last year, particularly for students(***).

(*) Relevant facts:

- ICML 2016: submissions up 30% to 1300.
- NIPS 2015 in Montreal: 3900 registrations (way up from last year).
- NIPS 2016 is in Barcelona.
- ICML 2015 in Lille: 1670 registrations.
- KDD 2014 in NYC: closed at 3000 registrations 1 week before the conference.

I tried to figure out how to set up a prediction market to estimate what will happen this year, but didn’t find an easy-enough way to do that.

(**) I kind of wish we could make up the titles. How about: “Go is Too Easy” and “My Neural Network is Deeper than Yours”?

(***) Sponsors are very generous and are mostly giving to defray student costs. Approximately every dollar of the difference between Regular and Student registration is due to company donations. For students, also note that there will be some scholarship opportunities to defray costs coming out soon.

However, some of the discussion around this seems like giddy overstatement. Wired says “Machines have conquered the last games” and Slashdot says “We know now that we don’t need any big new breakthroughs to get to true AI”. The truth is nowhere close.

For Go itself, it’s been well-known for a decade that Monte Carlo tree search (i.e. valuation by assuming randomized playout) is unusually effective in Go. Given this, it’s unclear that the AlphaGo algorithm extends to other board games where MCTS does not work so well. Maybe? It will be interesting to see.
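For readers who have not seen it, valuation by randomized playout is simple to sketch. The example below uses a toy take-away game (take 1 to 3 stones, whoever takes the last stone wins) rather than Go, and it shows only the playout-based valuation of moves, not the tree-building part of Monte Carlo tree search. The game, the function names, and the playout count are all my own choices for illustration.

```python
import random


def random_playout(stones, rng):
    """Play uniformly random moves to the end.

    Returns True if the player to move at the start takes the last stone.
    Requires stones >= 1.
    """
    starter_to_move = True
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return starter_to_move
        starter_to_move = not starter_to_move


def playout_value(stones, rng, n=2000):
    """Estimated win rate for the player to move, averaged over n random playouts."""
    wins = sum(random_playout(stones, rng) for _ in range(n))
    return wins / n


def best_move(stones, rng, n=2000):
    """Choose the move leaving the opponent in the position with the lowest playout value."""
    best, best_value = None, -1.0
    for take in range(1, min(3, stones) + 1):
        remaining = stones - take
        if remaining == 0:
            value = 1.0  # taking the last stone wins immediately
        else:
            value = 1.0 - playout_value(remaining, rng, n)
        if value > best_value:
            best, best_value = take, value
    return best


rng = random.Random(0)
move = best_move(5, rng)
print(move)
```

In this toy game the random-playout values happen to rank moves the same way optimal play does from 5 stones (leave the opponent with 4), which is the sort of agreement that makes the approach work well in Go and is not guaranteed in other games.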

Delving into existing computer games, the Atari results (see figure 3) are very fun but obviously unimpressive on about ¼ of the games. My hypothesis for why is that their solution does only local (epsilon-greedy style) exploration rather than global exploration so they can only learn policies addressing either very short credit assignment problems or with greedily accessible policies. Global exploration strategies are known to result in exponentially more efficient strategies in general for deterministic decision processes (1993), Markov Decision Processes (1998), and for MDPs without modeling (2006).
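A toy calculation shows where the exponential gap comes from. Assume a deterministic "combination lock": n steps, two actions per step, one of them correct, and any wrong action ends the episode with no reward. This environment and both functions are made up for illustration and are not the algorithms from the cited papers.

```python
def expected_random_episodes(n):
    """Expected episodes before uniformly random actions first open an n-step lock.

    Each episode succeeds with probability 2**-n, so the expected number of
    episodes until the first success is 2**n.
    """
    return 2 ** n


def systematic_episodes(lock):
    """Episodes needed by an explorer that remembers what it has already tried.

    lock[i] is the correct action (0 or 1) at step i. Each episode replays the
    known correct prefix and then tries one untried action at the frontier.
    """
    known = []
    episodes = 0
    while len(known) < len(lock):
        for action in (0, 1):
            episodes += 1
            if lock[len(known)] == action:
                known.append(action)
                break
    return episodes


worst_case_lock = [1] * 20  # the explorer always tries action 0 first
print(expected_random_episodes(20))          # 1048576
print(systematic_episodes(worst_case_lock))  # 40
```

With no reward signal until the lock opens, epsilon-greedy has nothing to be greedy about and behaves like the random explorer, while the systematic explorer needs at most 2n episodes.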

The reason these strategies are not used is that they are based on tabular learning rather than function fitting. That’s why I shifted to Contextual Bandit research after the 2006 paper. We’ve learned quite a bit there, enough to start tackling a Contextual Deterministic Decision Process, but that solution is still far from practical. Addressing global exploration effectively is only one of the significant challenges between what is well known now and what needs to be addressed for what I would consider a real AI.

This is generally understood by people working on these techniques but seems to be getting lost in translation to public news reports. That’s dangerous because it leads to disappointment. The field will be better off without an overpromise/bust cycle so I would encourage people to keep and inform a balanced view of successes and their extent. Mastering Go is a great accomplishment, but it is quite far from everything.
