Prediction Science

One view of machine learning is that it’s about how to program computers to predict well. This suggests a broader research program centered around the more pervasive goal of simply predicting well.
There are many distinct strands of this broader research program which are only partially unified. Here are the ones that I know of:

  1. Learning Theory. Learning theory focuses on several topics related to the dynamics and process of prediction. Convergence bounds like the VC bound give an intellectual foundation to many learning algorithms. Online learning algorithms like Weighted Majority provide an alternate, purely game-theoretic foundation for learning. Boosting algorithms yield methods for purifying prediction ability. Reduction algorithms provide means for changing esoteric problems into well-known ones.
  2. Machine Learning. A great deal of experience has accumulated in practical algorithm design from a mixture of paradigms, including Bayesian, biological, optimization-based, and theoretical approaches.
  3. Mechanism Design. The core focus in game theory is on equilibria, most typically Nash equilibria, but also many other kinds of equilibria. The point of equilibria, to a large extent, is predicting how agents will behave. When this is employed well, principally in mechanism design for auctions, it can be a very powerful concept.
  4. Prediction Markets. The basic idea in a prediction market is that commodities can be designed so that their buy/sell price reflects a form of wealth-weighted consensus estimate of the probability of some event. This is not simply mechanism design, because (a) the thin market problem must be dealt with and (b) the structure of plausible guarantees is limited.
  5. Predictive Statistics. Part of statistics focuses on prediction, essentially becoming indistinguishable from machine learning. The canonical examples of this are tree building algorithms such as CART, random forests, and some varieties of boosting. Similarly, the notions of probability, counting, and estimation are all handy.
  6. Robust Search. I have yet to find an example of robust search which isn’t useful—and there are several varieties. This includes active learning, robust min finding, and (more generally) compressed sensing and error correcting codes.

The lack of unification is fertile territory for new research, so perhaps it’s worthwhile to think about how these different research programs might benefit from each other.

  1. Learning Theory. The concept of mechanism design is mostly missing from learning theory, but it is sure to be essential when interactive agents are learning. We’ve found several applications of robust search, as well as new settings for robust search such as active learning and error-correcting tournaments, but there are surely others.
  2. Machine Learning and Predictive Statistics. Machine learning has been applied to auction design. There is a strong relationship between incentive compatibility and the choice of loss functions, both for choosing proxy losses and for approximating the real loss function imposed by the world. It’s easy to imagine designer loss functions from the study of incentive-compatible mechanisms giving learning algorithms an edge. I found this paper thought-provoking in that way. Since machine learning and information markets share a design goal, are there hybrid approaches which can outperform either?
  3. Mechanism Design. There are some notable similarities between papers in ML and mechanism design. For example there are papers about learning on permutations and pricing in combinatorial markets. I haven’t yet taken the time to study these carefully, but I could imagine that one suggests advances for the other, and perhaps vice versa. In general, the idea of using mechanism design with context information (as is done in machine learning), could also be extremely powerful.
  4. Prediction Markets. Prediction markets are partly an empirical field and partly a mechanism design field. There seems to be relatively little understanding of exactly how information from multiple agents is supposed to interact to derive a good probability estimate, and how well it does so. For example, the current global recession reminds us that excess leverage is a very bad idea. The same problem comes up in machine learning and is solved by the weighted majority algorithm (and even more thoroughly by the hedge algorithm; see the sketch after this list). Can an information market be designed with the guarantee that an imperfect but best player decides the vote after not-too-many rounds? How would this scale as a function of the ratio of a participant’s initial wealth to the total wealth?
  5. Robust Search. Investigations into robust search are extremely diverse, essentially unified only by a mathematically based style of analysis. For people interested in robust search, machine learning and information markets provide a fertile ground for empirical application and new settings. Can all mechanisms for robust search be done with context information, as is common in learning? Do these approaches work empirically in machine learning or information markets?
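The weighted majority / Hedge connection in item 4 can be made concrete. Below is a minimal sketch of the Hedge algorithm over k experts (the learning rate eta, the loss range [0,1], and the toy data are my own choices for illustration); weight concentrates on the expert with the smallest cumulative loss, which is the kind of guarantee the question above asks an information market to replicate.

```python
import numpy as np

def hedge(losses, eta=0.5):
    """Minimal Hedge sketch: losses is a (T, k) array of expert losses in [0, 1].

    Maintains a weight per expert; each round plays the weighted distribution,
    then exponentially down-weights experts in proportion to their loss.
    Returns the sequence of distributions played.
    """
    T, k = losses.shape
    w = np.ones(k)
    played = []
    for t in range(T):
        p = w / w.sum()                # current "wealth-weighted" consensus over experts
        played.append(p)
        w *= np.exp(-eta * losses[t])  # multiplicative update
    return np.array(played)

# Toy usage: 5 experts, one of which is reliably better.
rng = np.random.default_rng(0)
losses = rng.uniform(size=(1000, 5))
losses[:, 2] *= 0.5                    # expert 2 incurs roughly half the loss of the others
dists = hedge(losses)
print(dists[-1])                       # most of the weight concentrates on expert 2
```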

There are almost surely many other interesting research topics and borrowable techniques here, and probably even other communities oriented around prediction. While the synthesis of these fields is almost sure to eventually happen, I’d like to encourage it sooner rather than later. For someone working on one of these branches, attending a conference on one of the other branches might be a good start. At a lesser time investment, following Oddhead is worthwhile.

How do we get weak action dependence for learning with partial observations?

This post is about contextual bandit problems where, repeatedly:

  1. The world chooses features x and rewards r_1, …, r_k for each action, then announces the features x (but not the rewards).
  2. A policy chooses an action a.
  3. The world announces the reward r_a.

The goal in these situations is to efficiently learn a policy which maximizes r_a in expectation. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round, and whether they involve past logged data or rapid learning via interaction.
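For concreteness, here is a minimal sketch of this interaction protocol. The `world` and `policy` interfaces are illustrative placeholders, not a real API; the point is that step 3 only ever exposes the reward of the chosen action.

```python
import numpy as np

def run_contextual_bandit(world, policy, T):
    """Sketch of the partial-observation protocol above.

    world(t) -> (x, r): features x and a length-k reward vector r; only r[a] is
    ever shown to the learner.  policy.act(x) -> a; policy.update(x, a, r_a)
    consumes the single observed reward.  Names are illustrative, not a real API.
    """
    total = 0.0
    for t in range(T):
        x, r = world(t)            # step 1: world draws features and hidden rewards
        a = policy.act(x)          # step 2: policy picks an action from the features
        total += r[a]              # step 3: only the chosen action's reward is revealed
        policy.update(x, a, r[a])
    return total / T               # average observed reward
```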

One common drawback of all algorithms for solving this setting is that they have a poor dependence on the number of actions. For example, if k is the number of actions, EXP4 (page 66) has a dependence on k^{0.5}, epoch-greedy (and the simpler epsilon-greedy) have a dependence on k^{1/3}, and the offset tree has a dependence on k−1. These results aren’t directly comparable because different things are being analyzed. The fact that all analyses have a poor dependence on k is troublesome. The lower bounds in the EXP4 paper and the Offset Tree paper demonstrate that this isn’t a matter of lazy proof writing or a poor choice of algorithms: it’s essential to the nature of the problem.

In supervised learning, it’s typical to get no dependence or very weak dependence on the number of actions/choices/labels. For example, if we do empirical risk minimization over a finite hypothesis space H, the dependence is at most ln |H| using an Occam’s Razor bound. Similarly, the PECOC algorithm (page 12) has dependence bounded by a constant. This kind of dependence is great for the feasibility of machine learning: it means that we can hope to tackle seemingly difficult problems.
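For concreteness, a standard form of the Occam’s Razor bound: with m IID examples and a finite hypothesis space H, with probability at least 1 − δ, simultaneously for every h in H,

```latex
\Pr_{(x,y) \sim D}\big[\, h(x) \neq y \,\big]
  \;\le\;
  \hat{e}(h) \;+\; \sqrt{\frac{\ln|H| + \ln(1/\delta)}{2m}}
```

where \hat{e}(h) is the empirical error rate of h on the m examples. There is no explicit dependence on the number of labels; only ln |H| appears.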

Why is there such a large contrast between these settings? At the level of this discussion, they differ only in step 3, where for supervised learning, all of the rewards are revealed instead of just one.

One of the intuitions you develop after working with supervised learning is that holistic information is often better. As an example, given a choice between labeling the same point multiple times (perhaps revealing and correcting noise) or labeling other points once, an algorithm which labels other points typically exists and typically yields as good or better performance, both in theory and in practice. This appears untrue when we have only partial observations.

For example, consider the following problem (*): “Find an action with average reward greater than 0.5 with probability at least 0.99” and consider two algorithms:

  1. Sample actions at random until we can prove (via Hoeffding bounds) that one of them has large reward.
  2. Pick an action at random, sample it 100 times, and if we can prove (via a Hoeffding bound) that it has large average reward, return it; otherwise pick another action at random and repeat.

When there are 10^10 actions and 10^9 of them have average reward 0.6, it’s easy to prove that algorithm 2 is much better than algorithm 1.
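A quick Monte Carlo sketch of this comparison (a scaled-down version with my own parameters; the per-action Hoeffding check also ignores the union bound over actions that a fully rigorous 0.99 guarantee would need):

```python
import numpy as np

rng = np.random.default_rng(1)

# Scaled-down illustration: 10^5 actions, 10% of them with average reward 0.6 and
# the rest 0.4 (the post uses 10^10 and 10^9; only the good/bad ratio matters here).
K, DELTA, BUDGET = 100_000, 0.01, 200_000
means = np.full(K, 0.4)
means[rng.choice(K, K // 10, replace=False)] = 0.6

def lower_bound(total, n):
    # Per-action Hoeffding lower confidence bound.  A fully rigorous 0.99
    # guarantee would also union-bound over all actions tried; omitted here.
    return total / n - np.sqrt(np.log(1 / DELTA) / (2 * n))

def algorithm_1():
    """Sample uniformly random actions until some action's bound clears 0.5."""
    n, s = np.zeros(K), np.zeros(K)
    for pulls in range(1, BUDGET + 1):
        a = rng.integers(K)
        n[a] += 1
        s[a] += float(rng.random() < means[a])
        if lower_bound(s[a], n[a]) > 0.5:
            return pulls
    return None   # never certified within the budget

def algorithm_2():
    """Pick one action at random, pull it 100 times, certify it or move on."""
    pulls = 0
    while pulls < BUDGET:
        a = rng.integers(K)
        rewards = (rng.random(100) < means[a]).astype(float)
        pulls += 100
        if lower_bound(rewards.sum(), 100) > 0.5:
            return pulls
    return None

print("algorithm 1:", algorithm_1())   # typically None: no certification within the budget
print("algorithm 2:", algorithm_2())   # typically certifies within a few thousand pulls
```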

Lower bounds for the partial observation settings imply that more tractable algorithms only exist under additional assumptions. Two papers which do this without context features are:

  1. Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandit problems in metric spaces, STOC 2008. Here the idea is that you have access to a covering oracle on the actions where actions with similar average rewards cover each other.
  2. Deepak Agarwal and Deepayan Chakrabarti, Multi-armed Bandit Problems with Dependent Arms, ICML 2007. Here the idea is that the values of actions are generated recursively, preserving structure through the recursion.

Basic questions: Are there other kinds of natural structure which allow a good dependence on the total number of actions? Can these kinds of structures be extended to the setting with features? (This seems essential for real applications.)

(*) Developed in discussion with Yisong Yue and Bobby Kleinberg.

Three levels of addressing the Netflix Prize

In October 2006, the online movie renter Netflix announced the Netflix Prize contest. They published a comprehensive dataset including more than 100 million movie ratings, performed by about 480,000 real customers on 17,770 movies. Competitors in the challenge are required to estimate a few million ratings. To win the “grand prize,” they need to deliver a 10% improvement in prediction error compared with the results of Cinematch, Netflix’s proprietary recommender system. The best current results deliver a 9.12% improvement, quite close to the 10% goal, yet still painfully distant.

The Netflix Prize breathed new life and excitement into recommender systems research. The competition allowed the wide research community to access a large-scale, real-life dataset. Beyond this, the competition changed the rules of the game. Claiming that your nice idea could outperform some mediocre algorithms on some toy dataset is no longer acceptable. Researchers now face a new gold standard and must check how their seemingly elegant ideas measure up against the best known results on an objective yardstick. I believe this is a blessed change, which can help shift the focus to the few really useful ideas, rather than flooding us with a myriad of papers of questionable practical contribution. Well, time will tell…

So where does one start truly meaningful research? What can really make a difference in perfecting a recommender system? I do not pretend to have a real answer, but I will try to give some personal impressions. While working on the Netflix Prize, sifting through many ideas and implementing maybe a hundred different algorithms, we have come to recognize the few things that really matter. I will concentrate here on high level lessons that will hopefully help other practitioners in coming up with developments of true practical value.

 I would like to characterize algorithms at three different levels. The first level answers the “what?” question – What do we want to model? Here we decide which features of the data to address. Do we want to model the numerical value of ratings, or maybe which movies people rate (regardless of rating value)? Do we want to address the date-dependent dynamics of users’ behavior? Some will want to model certain pieces of metadata associated with the movies, such as interactions with actors, directors, etc. Or, maybe, we would like to analyze the demographics of the users?

 The next level, the second one, answers the “which?” question – Which model are we going to pick? Will we model ratings through a neighborhood model or through a latent factor model? Within a neighborhood model, should we look at relationships between users, between movies, or maybe both? Within latent factor models we also have plenty of further choices – should we stick with the good old SVD, or move to fancier probabilistic models (e.g., pLSA, LDA)? Or maybe, we should jump to neural networks such as RBMs?

 Finally the last level answers the “how?” question – How are we going to implement the chosen model? Even after choosing a model, we have much flexibility in deciding how to optimize it. For example, nearest neighbor models can vary from quite simplistic correlation based models, to more sophisticated models that try to derive parameters directly from the data. Likewise, there are many ways to fit an SVD model, ranging from gradient descent and alternating least squares to deeper formulations such as EM, MAP, MCMC, Gibbs sampling and more.
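As a small illustration of the gradient descent option at this third level, here is a sketch of fitting a biased matrix factorization (SVD-style) model by stochastic gradient descent on (user, movie, rating) triples. The hyperparameters are illustrative placeholders, not values tuned for the Netflix data.

```python
import numpy as np

def fit_svd(ratings, n_users, n_items, dim=50, lr=0.005, reg=0.02, epochs=20):
    """SGD for a biased matrix-factorization model: r_ui ~ mu + b_u + b_i + p_u . q_i.

    ratings: list of (user, item, rating) triples with 0-based ids.
    Hyperparameters here are illustrative, not tuned for the Netflix data.
    """
    rng = np.random.default_rng(0)
    mu = np.mean([r for _, _, r in ratings])           # global rating average
    b_u, b_i = np.zeros(n_users), np.zeros(n_items)    # user and item biases
    p = rng.normal(scale=0.01, size=(n_users, dim))    # user latent factors
    q = rng.normal(scale=0.01, size=(n_items, dim))    # item latent factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + p[u] @ q[i])
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            # Update both factor vectors using their pre-update values.
            p[u], q[i] = (p[u] + lr * (err * q[i] - reg * p[u]),
                          q[i] + lr * (err * p[u] - reg * q[i]))
    return mu, b_u, b_i, p, q
```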

When designing an algorithm, one should go through the three levels, likely, but not necessarily, in the order I listed them. A major question is where most effort should be invested. Which level has the most influence on the quality of the outcome?

My impression is that quite often most effort is allocated in the wrong direction. Most papers appear to concentrate on the third level, designing the best techniques for optimizing a single model or a particular cost function on which they are fixated. This is not very surprising, because the third level is the most technical one and offers the most flexibility. In particular, it allows researchers to express their prowess. Here, we can find papers with mathematical breakthroughs that allow squeezing some extra points from a model, getting us closer to the optimum, in a shorter time and with less overfitting. Well, no doubt, that’s wonderful… However, the practical value of these developments is quite limited, especially when using an ensemble of various models, where the gains from squeezing the best out of a single model rarely reach the bottom line.

Concentrating efforts on the second level is more fruitful. Not all models are built equal for the task at hand. For example, user-based neighborhood models were found to be vastly inferior to item (movie) based ones. Moreover, latent factor models were proven to be more accurate than neighborhood ones (provided that you use the right latent factor model, which happens to be SVD). Most importantly, the design of a good ensemble blending complementary predictors should mostly be done at this level. It is very beneficial to blend SVD with a neighborhood technique and with an RBM. A simple mixture like this, involving quick and straightforward implementations, would probably vastly outperform some very well tuned and elaborate individual models. So this level is certainly important and receives quite a bit of attention in the literature, but it is not nearly as important as the first level.
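A sketch of the kind of second-level blend described above: collect each model's predictions on a held-out validation set and fit linear blending weights by least squares. The column names in the usage comment (SVD, item neighborhood, RBM) are just the example from the paragraph; the helper names are my own.

```python
import numpy as np

def fit_blend(preds, targets):
    """Least-squares blending weights for an ensemble.

    preds: (n_validation, n_models) matrix of each model's predicted ratings
    on a held-out set; targets: the true ratings.  Returns one weight per model.
    """
    w, *_ = np.linalg.lstsq(preds, targets, rcond=None)
    return w

def blend(preds, w):
    return preds @ w

# Usage sketch: columns might be SVD, an item-neighborhood model, and an RBM.
# preds_val = np.column_stack([svd_val, knn_val, rbm_val])
# w = fit_blend(preds_val, ratings_val)
# final = blend(np.column_stack([svd_test, knn_test, rbm_test]), w)
```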

The first level, which decides the aspects of the data to be modeled, is where the most pivotal choices are made. Selecting the right features will make a huge impact on the quality of the results. For example, going beyond the numerical values of the ratings to analyzing which movies are chosen to be rated has a tremendous effect on prediction accuracy. On the other hand, modeling metadata associated with movies, such as the identity of actors or associated keywords, is not a prudent choice for the Netflix data. Similarly, modeling the date-dependent dynamics of users’ behavior is very useful. This first level receives less attention in the literature, perhaps because it is somewhat application dependent and harder to generalize. However, I can’t emphasize its importance enough.

In practice, the borders between the three levels that I describe may be quite fuzzy. Moreover, these three levels can sometimes be strongly interlaced with each other, since in the end a single implementation should fulfill all three levels. However, these days, whatever I think or hear about the Netflix data, I immediately try to relate to those three levels. The more it relates to the first level, the more interested I become, whereas I tend to almost completely ignore improvements related to the third level (well, that’s after exploring that level enough in the past). Just my 2 cents…

COLT Open Problems

COLT has a call for open problems due March 21. I encourage anyone with a specifiable open problem to write it down and send it in. Just the effort of specifying an open problem precisely and concisely has been very helpful for my own solutions, and there is a substantial chance others will solve it. To increase the chance someone will take it up, you can even put a bounty on the solution. (Perhaps I should raise the $500 bounty on the K-fold cross-validation problem as it hasn’t yet been solved).

Contextual Bandits

One of the fundamental underpinnings of the internet is advertising-based content. This has become much more effective due to targeted advertising, where ads are specifically matched to interests. Everyone is familiar with this, because everyone uses search engines and all search engines try to make money this way.

The problem of matching ads to interests is a natural machine learning problem in some ways since there is much information in who clicks on what. A fundamental problem with this information is that it is not supervised—in particular a click-or-not on one ad doesn’t generally tell you if a different ad would have been clicked on. This implies we have a fundamental exploration problem.

A standard mathematical setting for this situation is “k-Armed Bandits”, often with various relevant embellishments. The k-Armed Bandit setting works on a round-by-round basis. On each round:

  1. A policy chooses arm a from 1 of k arms (i.e. 1 of k ads).
  2. The world reveals the reward r_a of the chosen arm (i.e. whether the ad is clicked on).

As information is accumulated over multiple rounds, a good policy might converge on a good choice of arm (i.e. ad).

This setting (and its variants) fails to capture a critical phenomenon: each of these ads is displayed in the context of a search or other webpage. To model this, we might think of a different setting where on each round:

  1. The world announces some context information x (think of this as a high dimensional bit vector if that helps).
  2. A policy chooses arm a from 1 of k arms (i.e. 1 of k ads).
  3. The world reveals the reward r_a of the chosen arm (i.e. whether the ad is clicked on).

We can check that this is a critical distinction in 2 ways. First, note that policies using x can encode much richer decisions than a policy not using x. Just think about: “if a search has the word flowers, display a flower advertisement”. Second, we can try to reduce this setting to the k-Armed Bandit setting, and note that it cannot be done well. There are two methods that I know of:

  1. Run a different k-Armed Bandit for every value of x. The amount of information required to do well scales linearly in the number of contexts. In contrast, good supervised learning algorithms often require information which is (essentially) independent of the number of contexts.
  2. Take some set of policies and treat every policy h(x) as a different arm. This removes an explicit dependence on the number of contexts, but it creates a linear dependence on the number of policies. Via Occam’s razor/VC dimension/Margin bounds, we already know that supervised learning requires experience much smaller than the number of policies.

We know these are bad reductions by contrast with direct methods for solving the problem. The first algorithm for solving this problem is EXP4 (page 19 = 66), which has regret O(T^{0.5} (ln |H|)^{0.5}) with respect to the best policy in a set of policies H, where T is the number of rounds and |H| is the number of policies. (Dividing by T gives error-rate-like quantities.) This result is independent of the number of contexts x and only weakly dependent (similar to supervised learning) on the number of policies.
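For reference, a minimal sketch of the EXP4 update (following the exponential-weighting formulation; the exploration rate gamma and the representation of each policy as a mapping from context to a distribution over arms are assumptions of this sketch, and rewards are taken to lie in [0, 1]):

```python
import numpy as np

class EXP4:
    """Sketch of EXP4 over a finite set of policies (experts).

    Each policy, given context x, returns a probability distribution over k arms.
    gamma is the exploration rate; rewards are assumed to lie in [0, 1].
    """
    def __init__(self, policies, k, gamma=0.1):
        self.policies, self.k, self.gamma = policies, k, gamma
        self.w = np.ones(len(policies))

    def act(self, x):
        self.xi = np.array([pi(x) for pi in self.policies])    # (n_policies, k) advice
        mix = (self.w / self.w.sum()) @ self.xi                 # weighted expert advice
        self.p = (1 - self.gamma) * mix + self.gamma / self.k   # forced exploration
        return np.random.choice(self.k, p=self.p)

    def update(self, a, r):
        r_hat = r / self.p[a]                                   # importance-weighted reward
        gains = self.xi[:, a] * r_hat                           # each policy's estimated gain
        self.w *= np.exp(self.gamma * gains / self.k)           # exponential update
```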

EXP4 has a number of drawbacks—it has severe computational requirements and doesn’t work for continuously parameterized policies (*). Tong and I worked out a reasonably simple meta-algorithm, Epoch-Greedy, which addresses these drawbacks (**), at the cost of sometimes worsening the regret bound to O(T^{2/3} S^{1/3}) where S is related to the representational complexity of supervised learning on the set of policies.
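A rough sketch of the Epoch-Greedy idea follows. The ERM step is abstracted into a `learn` oracle and the exploitation-phase length into `exploit_steps`; in the paper the latter is tied to a sample complexity bound, and as footnote (**) notes, the ERM could be replaced by an efficient approximate ERM. All interface names here are hypothetical.

```python
import numpy as np

def epoch_greedy(world, learn, k, n_epochs, exploit_steps):
    """Sketch of Epoch-Greedy: one uniform-exploration step per epoch, then
    exploit the policy learned (via an ERM oracle) on the exploration data.

    world(t) -> (x, reward_of): context x and a function giving the reward of a
    chosen arm.  learn(data) -> policy mapping x to an arm.  exploit_steps(epoch)
    gives the length of the exploitation phase; here it is left as a parameter.
    """
    data, t = [], 0
    for epoch in range(1, n_epochs + 1):
        # Exploration step: pull a uniformly random arm, keep an unbiased record.
        x, reward_of = world(t); t += 1
        a = np.random.randint(k)
        data.append((x, a, reward_of(a)))       # (context, action, observed reward)
        policy = learn(data)                    # ERM (or approximate ERM) over policies
        # Exploitation phase: follow the learned policy.
        for _ in range(exploit_steps(epoch)):
            x, reward_of = world(t); t += 1
            _ = reward_of(policy(x))
```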

This T dependence is of great concern to people who have worked on bandit problems in the past (where, basically, only the dependence on T could be optimized). In many applications, the S dependence is more important. However, this does leave an important open question: Is it possible to get the best properties of EXP4 and Epoch-Greedy?

Reasonable people could argue about which setting is more important: k-Armed Bandits or Contextual Bandits. I favor Contextual Bandits, even though there has been far more work in the k-Armed Bandit setting. There are several reasons:

  1. I’m having difficulty finding interesting real-world k-Armed Bandit settings which aren’t better thought of as Contextual Bandits in practice. For myself, bandit algorithms are (at best) motivational because they can not be applied to real-world problems without altering them to take context into account.
  2. Doing things in context is one of the underlying (and very successful) tenets of machine learning. Applying this tenet here seems wise.
  3. If we want to eventually solve big problems, we must have composable subelements. Composition doesn’t work without context, because there is no “input” for an I/O diagram.

Any insights into the open question above or Contextual Bandits in general are of great interest to me.

(*) There are some simple modifications to deal with the second issue but not the first.
(**) You have to read between the lines a little bit to see this in the paper. The ERM-style algorithm in the paper could be replaced with an efficient approximate ERM algorithm which is often possible in practice.