A Variance-only Deviation Bound

At the PAC-Bayes workshop earlier this week, Olivier Catoni described a result that I hadn’t believed was possible: a deviation bound depending only on the variance of a random variable.

For people not familiar with deviation bounds, this may be hard to appreciate. Deviation bounds are one of the core components of the foundations of machine learning theory, so developments here have the potential to alter our understanding of how to learn and what is learnable. My understanding is that the basic proof techniques started with Bernstein and have evolved into several variants specialized for various applications. All of the variants I knew had a dependence on the range, with some also having a dependence on the variance of an IID or martingale random variable. This one is the first I know of with a dependence on only the variance.

The basic idea is to use a biased estimator of the mean which is not influenced much by outliers. Then, a deviation bound can be proved by using the exponential moment method, with the sum of the bias and the deviation bounded. The use of a biased estimator is clearly necessary, because an unbiased empirical average is inherently unstable—which was precisely the reason I didn’t think this was possible.
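I can't do the construction justice from memory, but to give a flavor of "biased yet outlier-insensitive", here is a minimal median-of-means sketch in Python. This is an illustration of the general idea only, not Catoni's estimator (which, as I understand it, works by softly truncating inside the exponential moment method); the function and parameter names are mine.

```python
import numpy as np

def median_of_means(samples, num_blocks=10):
    """Estimate a mean robustly: average within blocks, then take the median.

    A deliberately biased estimator that a few extreme outliers cannot move
    much, illustrating the trade of bias for stability (not Catoni's estimator).
    """
    samples = np.asarray(samples, dtype=float)
    blocks = np.array_split(np.random.permutation(samples), num_blocks)
    block_means = [b.mean() for b in blocks if len(b) > 0]
    return np.median(block_means)

# Example: a heavy-tailed sample where the plain empirical average is unstable.
rng = np.random.default_rng(0)
data = rng.standard_t(df=2.1, size=10000)  # finite variance, heavy tails
print("empirical mean:  ", data.mean())
print("median of means: ", median_of_means(data))
```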

Precisely how this is useful for machine learning isn’t clear yet, but it opens up possibilities. For example, it’s common to suffer from large ranges in exploration settings, such as contextual bandits or active learning.

The Efficient Robust Conditional Probability Estimation Problem

I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be 🙂

The Problem

The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions.

In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.

A learning reduction consists of two algorithms R and R^{-1} which transform examples from the original input problem into examples for the oracle and then transform the oracle’s predictions into a prediction for the original problem.

The algorithm R takes as input a single example (x,y) where x is a feature vector and y is a discrete variable taking values in {1,…,k}. R then specifies a training example (x’,y’) for the oracle B. R can then create another training example for B based on all available information. This process repeats some finite number of times before halting without returning information.

A basic observation is that for any oracle algorithm, a distribution D(x,y) over multiclass examples and a reduction R induce a distribution over a sequence (x’,y’)* of oracle examples. We collapse this into a distribution D’(x’,y’) over oracle examples by drawing uniformly from the sequence.

The algorithm R^{-1} takes as input a single example (x,y) and returns a value in [0,1] after using (only) the testing interface of B zero or more times.
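To make the interfaces concrete, here is a minimal sketch of the oracle and reduction signatures in Python. The class and method names are mine, not part of the problem statement.

```python
from typing import Protocol, Sequence

class BinaryRegressionOracle(Protocol):
    """The oracle B with its two interfaces."""

    def train(self, x_prime: Sequence[float], y_prime: float) -> None:
        """Training interface B(x', y'): y' is a scalar in [0, 1]; nothing is returned."""

    def predict(self, x_prime: Sequence[float]) -> float:
        """Testing interface B(x'): returns a value in [0, 1]."""

class Reduction(Protocol):
    """The pair of algorithms R and R^{-1}."""

    def R(self, oracle: BinaryRegressionOracle,
          x: Sequence[float], y: int) -> None:
        """Consume one multiclass example (x, y) with y in {1, ..., k}, feeding
        finitely many training examples (x', y') to the oracle; nothing is returned."""

    def R_inverse(self, oracle: BinaryRegressionOracle,
                  x: Sequence[float], y: int) -> float:
        """Return an estimate of D(y|x) in [0, 1], using only the oracle's
        testing interface, zero or more times."""
```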

We measure the power of an oracle and a reduction according to squared-loss regret. In particular we have:

reg(D, R^{-1}) = E_{(x,y) ~ D} [ (R^{-1}(x,y) - D(y|x))^2 ]

and similarly, letting m_{x’} = E_{(x’,y’) ~ D’}[y’ | x’]:

reg(D’, B) = E_{(x’,y’) ~ D’} [ (B(x’) - m_{x’})^2 ]

The open problem is to specify R and R^{-1} satisfying the following theorem:

For all multiclass distributions D(x,y) and all binary oracles B: the computational complexities of R and R^{-1} are O(log k), and

reg(D, R^{-1}) ≤ C reg(D’, B)

where C is a universal constant.

Alternatively, this open problem is satisfied by proving there exist no deterministic algorithms R, R^{-1} satisfying the above theorem statement.

Motivation

The problem of conditional probability estimation is endemic to machine learning applications. In fact, in some branches of machine learning, this is simply considered “the problem”. Typically, conditional probability estimation is done in situations where the conditional probability of only one bit is required; however, there is a growing number of applications where a well-estimated conditional probability over a more complex object is required. For example, all known methods for solving general contextual bandit problems require knowledge of, or good estimation of, P(a | x) where a is an action.

There is a second intrinsic motivation, which is matching the lower bound. No method faster than O(log k) can be imagined, because the label y requires log_2 k bits to specify and hence read. Similarly, it’s easy to prove no learning reduction can provide a regret ratio with C < 1.

The motivation for using the learning reduction framework to specify this problem is a combination of generality and the empirical effectiveness of learning reductions in applications. Any solution to this will be general because any oracle B can be plugged in, even ones which use many strange kinds of prior information, features, and active multitask hierarchical (insert your favorite adjective here) structure.

Related Results

The state of the art is summarized here, which shows it’s possible to have a learning reduction satisfying the above theorem with either:

  1. C replaced by (log_2 k)^2 (using a binary tree structure; a rough sketch of the tree idea appears after this list)
  2. or the computational time increased to O(k) (using an error correcting code structure).
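To give a flavor of item 1: labels sit at the leaves of a balanced binary tree, each internal node has a binary regressor estimating the probability of branching right, and P(y|x) is estimated as the product of the O(log k) node probabilities along the root-to-leaf path. The sketch below only conveys this structure; `oracle_at_node` and the node indexing are my inventions, and the cited analysis differs in details.

```python
def estimate_py_given_x(oracle_at_node, x, y, k):
    """Estimate P(y|x) as a product of node probabilities along the path
    from the root of a balanced binary tree over labels {0, ..., k-1}
    to the leaf for y. `oracle_at_node(node_id, x)` is assumed to return
    the learned probability of taking the right branch at that node.
    """
    lo, hi = 0, k          # current label range [lo, hi)
    node_id = 0
    prob = 1.0
    while hi - lo > 1:     # O(log k) oracle calls
        mid = (lo + hi) // 2
        p_right = oracle_at_node(node_id, x)
        if y >= mid:
            prob *= p_right
            lo = mid
            node_id = 2 * node_id + 2
        else:
            prob *= (1.0 - p_right)
            hi = mid
            node_id = 2 * node_id + 1
        # Errors compound along the O(log k)-length path, which is roughly
        # where the (log_2 k)^2 factor in the known analysis comes from.
    return prob
```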

Hence, answering this open problem in the negative shows that there is an inherent computation vs. robustness tradeoff.

There are two other closely related problems, where similar analysis can be done.

  1. For multiclass classification, where the goal is predicting the most likely class, a result analogous to the open problem is provable using error correcting tournaments.
  2. For multiclass classification in a partial label setting, no learning reduction can provide a constant regret guarantee.

Silly tricks that don’t work

Because learning reductions are not familiar to everyone, it’s helpful to note certain tricks which do not work here, both to prevent false leads and to provide some intuition.

Ignore B’s predictions and use your favorite learning algorithm instead.

This doesn’t work, because the quantification is for all D. Any specified learning algorithm will have some D on which it has nonzero regret. On the other hand, because R calls the oracle at least once, there is a well-defined induced distribution D’. Since the theorem must hold for all D and B, it must hold for a D your specified learning algorithm fails on and for a B for which reg(D’,B) = 0, implying the theorem is not satisfied.

Feed random examples into B and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant.

This doesn’t work because the theorem is stated in terms of squared loss regret rather than squared loss. In particular, if the oracle is given examples of the form (x’,y’) where y’ is uniformly at random either 0 or 1, any oracle specifying B(x’)=0.5 has zero regret.

Feed pseudorandom examples into B and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant.

This doesn’t work, because the quantification is “for all binary oracles B”, and there exists one which, knowing the pseudorandom seed, can achieve zero loss (and hence zero regret).

Just use Boosting to drive the LHS to zero.

Boosting theorems require a stronger oracle—one which provides an edge over some constant baseline for each invocation. The oracle here is not limited in this fashion since it could completely err for a small fraction of invocations.

Take an existing structure, parameterize it, randomize over the parameterization, and then average over the random elements.

Employing this approach is not straightforward, because the average in D’ is over an increased number of oracle examples. Hence, at a fixed expected (over oracle examples) regret, the number of examples allowed to have a large regret is increased.

Netflix Challenge 2 Canceled

The second Netflix prize is canceled due to privacy problems. I continue to believe my original assessment of this paper, that the privacy break was somewhat overstated. I still haven’t seen any serious privacy failures on the scale of the AOL search log release.

I expect privacy concerns to continue to be a big issue when dealing with data releases by companies or governments. The theory of maintaining privacy while using data is improving, but it is not yet in a state where the limits of what’s possible are clear let alone how to achieve these limits in a manner friendly to a prediction competition.

Yahoo! ML events

Yahoo! is sponsoring two machine learning events that might interest people.

  1. The Key Scientific Challenges program (due March 5) for Machine Learning and Statistics offers $5K (plus bonuses) for graduate students working on a core problem of interest to Y! If you are already working on one of these problems, there is no reason not to submit, and if you aren’t you might want to think about it for next year, as I am confident they all press the boundary of the possible in Machine Learning. There are 7 days left.
  2. The Learning to Rank challenge (due May 31) offers an $8K first prize for the best ranking algorithm on a real (and really used) dataset for search ranking, with presentations at an ICML workshop. Unlike the Netflix competition, there are prizes for 2nd, 3rd, and 4th place, perhaps avoiding the heartbreak the ensemble encountered. If you think you know how to rank, you should give it a try, and we might all learn something. There are 3 months left.

Specializations of the Master Problem

One thing which is clear on a little reflection is that there exists a single master learning problem capable of encoding essentially all learning problems. This problem is of course a very general sort of reinforcement learning where the world interacts with an agent as:

  1. The world announces an observation x.
  2. The agent makes a choice a.
  3. The world announces a reward r.

The goal here is to maximize the sum of rewards over the agent’s lifetime. No particular structure relating x to a or a to r is implied by this setting, so we do not know effective general algorithms for the agent. It’s very easy to prove lower bounds showing that an agent cannot hope to succeed here—just consider the case where actions are unrelated to rewards. Nevertheless, there is a real sense in which essentially all forms of life are agents operating in this setting, somehow succeeding. The gap between these observations drives research—how can we find tractable specializations of the master problem general enough to provide effective solutions to real problems?
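As an interface, the master problem is tiny. Here is a minimal sketch of the interaction loop (all names are mine) that every setting below specializes:

```python
from typing import Protocol, Any

class Agent(Protocol):
    def act(self, observation: Any) -> Any: ...
    def learn(self, observation: Any, action: Any, reward: float) -> None: ...

class World(Protocol):
    def observe(self) -> Any: ...
    def reward(self, action: Any) -> float: ...

def run(agent: Agent, world: World, horizon: int) -> float:
    """The general interaction loop; the specializations below restrict
    how observations, actions, and rewards may relate to each other."""
    total = 0.0
    for _ in range(horizon):
        x = world.observe()   # step 1: the world announces an observation x
        a = agent.act(x)      # step 2: the agent makes a choice a
        r = world.reward(a)   # step 3: the world announces a reward r
        agent.learn(x, a, r)
        total += r
    return total
```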

The process of specializing is a tricky business, as you want to simultaneously achieve tractable analysis, sufficient generality to be useful, and yet capture a new aspect of the master problem not otherwise addressed. Consider: How is it even possible to choose a setting where analysis is tractable before you even try to analyze it? What follows is my mental map of different specializations.

Online Learning

At present, the online learning setting is perhaps the most satisfying specialization more general than standard batch learning, because it turns out to additionally provide tractable algorithms for many batch learning settings.

Standard online learning models specialize in two ways: You assume that the choice of action in step 2 does not influence future observations and rewards, and you assume additional information is available in step 3, a retrospectively available reward for each action. The algorithm for an agent in this setting typically has a given name—gradient descent, weighted majority, Winnow, etc…

The general algorithm here is a more refined version of follow-the-leader than in batch learning, with online update rules. An awesome discovery about this setting is that it’s possible to compete with a set of predictors even when the world is totally adversarial, substantially strengthening our understanding of what learning is and where it might be useful. For this adversarial setting, the algorithm alters into a form of follow-the-perturbed-leader, where the learning algorithm randomizes its action amongst the set of plausible alternatives in order to defeat an adversary.

The standard form of argument in this setting is a potential argument, where at each step you show that if the learning algorithm performs badly, there is some finite budget from which the adversary’s ability is deducted. The form of the final theorem is that you compete with the accumulated reward of a set of one-step policies h: X -> A, with regret (a measure of the failure to compete) depending on log(#policies) or weaker.

A good basic paper to read here is:
Nick Littlestone and Manfred Warmuth, The Weighted Majority Algorithm, which shows the basic information-theoretic claim clearly. Vovk‘s page on aggregating algorithms is also relevant, although somewhat harder to read.
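For concreteness, here is a minimal sketch of deterministic weighted majority for binary prediction with expert advice; the penalty parameter beta and the array layout are my choices.

```python
import numpy as np

def weighted_majority(expert_predictions, labels, beta=0.5):
    """Deterministic weighted majority vote over n experts for T rounds.

    expert_predictions: array of shape (T, n) with 0/1 predictions.
    labels: array of shape (T,) with the true 0/1 outcomes.
    Experts that err have their weight multiplied by beta, giving a mistake
    bound of O(log n) plus a constant times the best expert's mistakes.
    """
    expert_predictions = np.asarray(expert_predictions)
    labels = np.asarray(labels)
    T, n = expert_predictions.shape
    w = np.ones(n)
    mistakes = 0
    for t in range(T):
        preds = expert_predictions[t]
        vote_one = w[preds == 1].sum()
        vote_zero = w[preds == 0].sum()
        guess = 1 if vote_one >= vote_zero else 0
        if guess != labels[t]:
            mistakes += 1
        w[preds != labels[t]] *= beta   # penalize experts that were wrong
    return mistakes
```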

Provably computationally tractable special cases all have linear structure, either on rewards or policies. Good results are often observed empirically by applying backpropagation for nonlinear architectures, with the danger of local minima understood.

Bandit Analysis

In the bandit setting, step (1) is omitted, and the difficulty of the problem is weakened by assuming that actions in step (2) don’t alter future rewards. The goal is generally to compete with all constant arm strategies.

Analysis in this basic setting started very specialized with Gittins indices and gradually generalized over time to include IID and fully adversarial settings, with EXP3 a canonical algorithm. If there are k strategies available, the standard theorem states that you can compete with the set of all constant strategies up to regret k. The most impressive theoretical discovery in this setting is that the dependence on T, the number of timesteps, is not substantially worse than in supervised learning despite the need to explore.

Given the dependence on k, all of these algorithms are computationally tractable.
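As a concrete reference point, here is a minimal EXP3 sketch for k arms over T rounds; `pull(arm)` is a stand-in for the environment returning a reward in [0, 1], and the particular constants are my choices.

```python
import numpy as np

def exp3(pull, k, T, gamma=0.1):
    """EXP3: exponential weights with importance-weighted reward estimates.

    pull(arm) is assumed to return a reward in [0, 1]; gamma controls the
    amount of uniform exploration mixed into the sampling distribution.
    """
    rng = np.random.default_rng(0)
    log_w = np.zeros(k)
    total_reward = 0.0
    for _ in range(T):
        w = np.exp(log_w - log_w.max())         # numerically stabilized weights
        p = (1 - gamma) * w / w.sum() + gamma / k
        arm = rng.choice(k, p=p)
        r = pull(arm)
        total_reward += r
        log_w[arm] += gamma * (r / p[arm]) / k  # importance-weighted update
    return total_reward
```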

However, the setting is flawed, because the set of constant strategies is inevitably too weak in practice—it’s an example of optimal decision making given that you ignore almost all information. Adding back the observation in step (1) allows competing with a large set of policies, while the regret grows only as log(#policies) or weaker. Canonical algorithms here are EXP4 (computationally intractable, but information-theoretically near-optimal), Epoch-Greedy (computationally tractable given an oracle optimizer), and the Offset Tree, which provides a reduction to supervised binary classification.

MDP analysis

A substantial fraction of reinforcement learning has specialized on the Markov Decision Process setting, where the observation x is a state s, which is a sufficient statistic for predicting all future observations. Compared to the previous settings, dealing with time dependence is explicitly required, but learning typically exists in only primitive forms.

The first work here was in the 1950s, where the actual MDP was assumed known and the problem was simply computing a good policy, typically via dynamic programming style solutions. More recently, principally in the 1990s, the setting where the MDP was not assumed known was analyzed. A very substantial theoretical advancement was the E^3 algorithm, which requires only O(S^2 A) experience to learn a near-optimal policy when the world is an MDP with S states and A actions per state. A further improvement on this is Delayed Q-Learning, where only O(SA) experience is required. There are many variants on the model-based approach and not much for the model-free approach. Lihong Li’s thesis probably has the best detailed discussion at present.
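For the known-MDP case, the dynamic programming solution is short enough to state. Here is a minimal value iteration sketch; the discount factor and tolerance are my choices.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Compute a near-optimal policy for a known MDP by dynamic programming.

    P: transition probabilities of shape (S, A, S), P[s, a, s'] = Pr(s' | s, a).
    R: expected rewards of shape (S, A).
    Returns (V, policy). Each sweep costs O(S^2 A) time.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)      # shape (S, A): one-step lookahead values
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V, Q.argmax(axis=1)
```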

There are some unsatisfactory elements of the analysis here. First, I’ve suppressed the dependence on the definition of “approximate” and the typical time horizon, for which the dependence is often bad and the optimality is unclear. The second is the dependence on S, which is intuitively unremovable, with this observation formalized in the lower bound Sham and I worked on (section 8.6 of Sham’s thesis). Empirically, these and related algorithms are often finicky, because in practice the observation isn’t a sufficient statistic and the number of states isn’t small, so approximating things as such is often troublesome.

A very different variant of this setting is given by control theory, which I know less about than I should. The canonical setting for control theory is with a known MDP having linear transition dynamics. More exciting are the system identification problems where the system must first be identified. I don’t know of any good, relatively assumption-free results for this setting.

Oracle Advice Shortcuts

Techniques here specialize the setting to situations in which some form of oracle advice is available when a policy is being learned. A good example of this is an oracle which provides samples from the distribution of observations visited by a good policy. Using this oracle, conservative policy iteration is guaranteed to perform well, so long as a base learning algorithm can predict well. This algorithm was refined and improved a bit by PSDP, which works via dynamic programming, improving guarantees to work with regret rather than errors.

An alternative form of oracle is provided by access to a good policy at training time. In this setting, Searn has similar provable guarantees with a similar analysis.

The oracle based algorithms appear to work well anywhere these oracles are available.

Uncontrolled Delay

In the uncontrolled delay setting, step (2) is removed, and typically steps (1) and (3) are collapsed into one observation, where the goal becomes state tracking. Most of the algorithms for state tracking are heavily model dependent, implying good success within particular domains. Examples include Kalman filters, hidden Markov models, and particle filters, which typically operate according to an explicit probabilistic model of world dynamics.
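As one model-dependent example, the scalar Kalman filter update is only a few lines. Here is a minimal sketch for a one-dimensional random walk observed with Gaussian noise, with all model parameters assumed known.

```python
def kalman_1d(observations, process_var=1e-2, obs_var=1.0):
    """Track a scalar latent state under a random-walk model.

    State model:  x_t = x_{t-1} + N(0, process_var)
    Observation:  z_t = x_t + N(0, obs_var)
    Returns the filtered state estimates, one per observation.
    """
    mean, var = 0.0, 1.0          # prior over the initial state
    estimates = []
    for z in observations:
        # Predict: the random walk inflates uncertainty.
        var += process_var
        # Update: blend prediction and observation by their precisions.
        gain = var / (var + obs_var)
        mean = mean + gain * (z - mean)
        var = (1.0 - gain) * var
        estimates.append(mean)
    return estimates
```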

Relatively little is known for a nonparametric version of this problem. One observation is that the process of predicting adjacent observations well forms states as a byproduct when the observations are sufficiently rich, as detailed here.

A basic question is: What’s missing from the above? A good answer is worth a career.