2010 ICML discussion site

A substantial difficulty with the 2009 and 2008 ICML discussion system was a communication vacuum, where authors were not informed of comments, and commenters were not informed of responses to their comments without explicit monitoring. Mark Reid has setup a new discussion system for 2010 with the goal of addressing this.

Mark didn’t want to make it to intrusive, so you must opt-in. As an author, find your paper and “Subscribe by email” to the comments. As a commenter, you have the option of providing an email for follow-up notification.

The Good News on Exploration and Learning

Consider the contextual bandit setting where, repeatedly:

  1. A context x is observed.
  2. An action a is taken given the context x.
  3. A reward r is observed, dependent on x and a.

Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward.

This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions.

A decade ago the best machine learning techniques for this setting where implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and butter supervised learning problems. Just as for supervised learning, we know how to create and reuse datasets, how to benchmark algorithms, how to reuse existing supervised learning algorithms in this setting, and how to achieve optimal rates of learning quantitatively similar to supervised learning.

This is also one of the times when understanding the basic theory can make a huge difference in your success. There are many wrong ways to attack contextual bandit problems or prepare datasets, and taking a wrong turn can easily mean the difference between failure and success. Understanding how contextual bandit problems differ from basic supervised learning problems is critical to routine success here.

All of the above is not meant to claim that everything is done research-wise here so we’ll try to outline where the current boundary of research lies as best we can. However, we are surely at a point both in terms of application demand (especially for internet applications of search, advertising, page optimization, but also medical applications and surely others) and methodology supply (with basic reliable techniques now easily available or created) where these techniques are shifting from theory esoterica to required education.

Given the above, Alina and I decided to prepare a tutorial to be given at Yahoo! Labs summer school (my first India trip!), ICML, KDD, and hopefully videolectures.net. Please join us. The subjects we plan to cover are essentially the keys to the kingdom of solving shallow interactive learning problems.

Google Predict

Slashdot points out Google Predict. I’m not privy to the details, but this has the potential to be extremely useful, as in many applications simply having an easy mechanism to apply existing learning algorithms can be extremely helpful. This differs goalwise from MLcomp—instead of public comparisons for research purposes, it’s about private utilization of good existing algorithms. It also differs infrastructurally, since a system designed to do this is much less awkward than using Amazon’s cloud computing. The latter implies that datasets several order of magnitude larger can be handled up to limits imposed by network and storage.

Aggregation of estimators, sparsity in high dimension and computational feasibility

(I’m channeling for Jean-Yves Audibert here, with some minor tweaking for clarity.)

Since Nemirovski’s Saint Flour lecture notes, numerous researchers have studied the following problem in least squares regression: predict as well as
(MS) the best of d given functions (like in prediction with expert advice; model = finite set of d functions)
(C) the best convex combination of these functions (i.e., model = convex hull of the d functions)
(L) the best linear combination of these functions (i.e., model = linear span of the d functions)
It is now well known (see, e.g., Sacha Tsybakov’s COLT’03 paper) that these tasks can be achieved since there exist estimators having an excess risk of order (log d)/n for (MS), min( sqrt((log d)/n), d/n ) for (C) and d/n for (L), where n is the training set size. Here, “risk” is amount of extra loss per example which may be suffered due to the choice of random sample.

The practical use of these results seems rather limited to trivial statements like: do not use the OLS estimator when the dimension d of the input vector is larger than n (here the d functions are the projections on each of the d components). Nevertheless, it provides a rather easy way to prove that there exists a learning algorithm having an excess risk of order s (log d)/n, with respect to the best linear combination of s of the d functions (s-sparse linear model). Indeed, it suffices to consider the algorithm which

  1. cuts the training set into two parts, say of equal size for simplicity,
  2. uses the first part to train linear estimators corresponding to every possible subset of s features. Here you can use your favorite linear estimator (the empirical risk minimizer on a compact set or robust but more involved ones are possible rather than the OLS), as long as it solves (L) with minimal excess risk.
  3. uses the second part to predict as well as the “d choose s” linear estimators built on the first part. Here you choose your favorite aggregate solving (MS). The one I prefer is described in p.5 of my NIPS’07 paper, but you might prefer the progressive mixture rule or the algorithm of Guillaume Lecué and Shahar Mendelson. Note that empirical risk minimization and cross-validation completely fail for this task with excess risk of order sqrt((log d)/n) instead of (log d)/n.

It is an easy exercise to combine the different excess risk bounds and obtain that the above procedure achieves an excess risk of s (log d)/n. The nice thing compared to works on Lasso, Dantzig selectors and their variants is that you do not need all these assumptions saying that your features should be “not too much” correlated. Naturally, the important limitation of the above procedure, which is often encountered when using classical model selection approach, is its computational intractability. So this leaves open the following fundamental problem:
is it possible to design a computationally efficient algorithm with the s (log d)/n guarantee without assuming low correlation between the explanatory variables?