March 2010 – Machine Learning (Theory)

At the PAC-Bayes workshop earlier this week, Olivier Catoni described a result that I hadn’t believed was possible: a deviation bound depending only on the variance of a random variable.

For people not familiar with deviation bounds, this may be hard to appreciate. Deviation bounds, are one of the core components for the foundations of machine learning theory, so developments here have a potential to alter our understanding of how to learn and what is learnable. My understanding is that the basic proof techniques started with Bernstein and have evolved into several variants specialized for various applications. All of the variants I knew had a dependence on the range, with some also having a dependence on the variance of an IID or martingale random variable. This one is the first I know of with a dependence on only the variance.

The basic idea is to use a biased estimator of the mean which is not influenced much by outliers. Then, a deviation bound can be proved by using the exponential moment method, with the sum of the bias and the deviation bounded. The use of a biased estimator is clearly necessary, because an unbiased empirical average is inherently unstable—which was precisely the reason I didn’t think this was possible.

Precisely how this is useful for machine learning isn’t clear yet, but it opens up possibilities. For example, it’s common to suffer from large ranges in exploration settings, such as contextual bandits or active learning.

I’m offering a reward of $1000 for a solution to this problem. This joins the cross validation problem which I’m offering a $500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be 🙂

The Problem

The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions.

In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.

A learning reduction consists of two algorithms R and R^-1 which transform examples from the original input problem into examples for the oracle and then transform the oracle’s predictions into a prediction for the original problem.

The algorithm R takes as input a single example (x,y) where x is an feature vector and y is a discrete variable taking values in {1,…,k}. R then specifies a training example (x’,y’) for the oracle B. R can then create another training example for B based on all available information. This process repeats some finite number of times before halting without returning information.

A basic observation is that for any oracle algorithm, a distribution D(x,y) over multiclass examples and a reduction R induces a distribution over a sequence (x’,y’)^* of oracle examples. We collapse this into a distribution D'(x’,y’) over oracle examples by drawing uniformly from the sequence.

The algorithm R^-1 takes as input a single example (x,y) and returns a value in [0,1] after using (only) the testing interface of B zero or more times.

We measure the power of an oracle and a reduction according to squared-loss regret. In particular we have:

reg(D,R^-1)=E_{(x,y)~ D}[(R^-1(x,y)-D(y|x))²]
and similarly letting m_x’=E_{(x’,y’)~ D’}[y’].

reg(D’,B)=E_{(x’,y’)~ D’}(B(x’) – m_x’)²

The open problem is to specify R and R^-1 satisfying the following theorem:

For all multiclass distributions D(x,y), for all binary oracles B: The computational complexity of R and R^-1 are O(log k)
and

reg(D,R^-1) < = C reg(D’,B)
where C is a universal constant.

Alternatively, this open problem is satisfied by proving there exists no deterministic algorithms R,R^-1 satisfying the above theorem statement.

Motivation

The problem of conditional probability estimation is endemic to machine learning applications. In fact, in some branches of machine learning, this is simply considered “the problem”. Typically conditional probability estimation is done in situations where the conditional probability of only one bit is required, however there are a growing number of applications where a well-estimated conditional probability over a more complex object is required. For example, all known methods for solving general contextual bandit problems require knowledge of or good estimation of P(a | x) where a is an action.

There is a second intrinsic motivation which is matching the lower bound. No method faster than O(log k) can be imagined because the label y requires log₂ k bits to specify and hence read. Similarly it’s easy to prove no learning reduction can provide a regret ratio with C<1.

The motivation for using the learning reduction framework to specify this problem is a combination of generality and the empirical effectiveness in application of learning reductions. Any solution to this will be general because any oracle B can be plugged in, even ones which use many strange kinds of prior information, features, and active multitask hierachical (insert your favorite adjective here) structure.

Silly tricks that don’t work

Because Learning reductions are not familiar to everyone, It’s helpful to note certain tricks which do not work here to prevent false leads and provide some intuition.

Ignore B‘s predictions and use your favorite learning algorithm instead.

This doesn’t work, because the quantification is for all D. Any specified learning algorithm will have some D on which it has nonzero regret. On the other hand, because R calls the oracle at least once, there is a defined induced distribution D’. Since the theorem must hold for all D and B, it must hold for a D your specified learning algorithm fails on and for a B for which reg(D’,B)=0 implying the theorem is not satisfied.

Feed random examples into B and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant.

This doesn’t work because the theorem is stated in terms of squared loss regret rather than squared loss. In particular, if the oracle is given examples of the form (x’,y’) where y’ is uniformly at random either 0 or 1, any oracle specifying B(x’)=0.5 has zero regret.

Feed pseudorandom examples into B and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant.

This doesn’t work, because the quantification is “for all binary oracles B”, and there exists one which, knowing the pseudorandom seed, can achieve zero loss (and hence zero regret).

Just use Boosting to drive the LHS to zero.

Boosting theorems require a stronger oracle—one which provides an edge over some constant baseline for each invocation. The oracle here is not limited in this fashion since it could completely err for a small fraction of invocations.

Take an existing structure, parameterize it, randomize over the parameterization, and then average over the random elements.

Employing this approach is not straightforward, because the average in D’ is over an increased number of oracle examples. Hence, at a fixed expected (over oracle examples) regret, the number of examples allowed to have a large regret is increased.

Month: March 2010

A Variance only Deviation Bound

The Efficient Robust Conditional Probability Estimation Problem

The Problem

Motivation

Related Results

Silly tricks that don’t work

Netflix Challenge 2 Canceled