# Machine Learning (Theory)

## 3/15/2010

### The Efficient Robust Conditional Probability Estimation Problem

I’m offering a reward of \$1000 for a solution to this problem. This joins the cross validation problem which I’m offering a \$500 reward for. I believe both of these problems are hard but plausibly solvable, and plausibly with a solution of substantial practical value. While it’s unlikely these rewards are worth your time on an hourly wage basis, the recognition for solving them definitely should be

## The Problem

The problem is finding a general, robust, and efficient mechanism for estimating a conditional probability P(y|x) where robustness and efficiency are measured using techniques from learning reductions.

In particular, suppose we have access to a binary regression oracle B which has two interfaces—one for specifying training information and one for testing. Training information is specified as B(x’,y’) where x’ is a feature vector and y’ is a scalar in [0,1] with no value returned. Testing is done according to B(x’) with a value in [0,1] returned.

A learning reduction consists of two algorithms R and R-1 which transform examples from the original input problem into examples for the oracle and then transform the oracle’s predictions into a prediction for the original problem.

The algorithm R takes as input a single example (x,y) where x is an feature vector and y is a discrete variable taking values in {1,…,k}. R then specifies a training example (x’,y’) for the oracle B. R can then create another training example for B based on all available information. This process repeats some finite number of times before halting without returning information.

A basic observation is that for any oracle algorithm, a distribution D(x,y) over multiclass examples and a reduction R induces a distribution over a sequence (x’,y’)* of oracle examples. We collapse this into a distribution D'(x’,y’) over oracle examples by drawing uniformly from the sequence.

The algorithm R-1 takes as input a single example (x,y) and returns a value in [0,1] after using (only) the testing interface of B zero or more times.

We measure the power of an oracle and a reduction according to squared-loss regret. In particular we have:

reg(D,R-1)=E(x,y)~ D[(R-1(x,y)-D(y|x))2]

and similarly letting mx’=E(x’,y’)~ D’[y’].

reg(D’,B)=E(x’,y’)~ D’(B(x’) – mx’)2

The open problem is to specify R and R-1 satisfying the following theorem:

For all multiclass distributions D(x,y), for all binary oracles B: The computational complexity of R and R-1 are O(log k)
and

reg(D,R-1) < = C reg(D’,B)

where C is a universal constant.

Alternatively, this open problem is satisfied by proving there exists no deterministic algorithms R,R-1 satisfying the above theorem statement.

## Motivation

The problem of conditional probability estimation is endemic to machine learning applications. In fact, in some branches of machine learning, this is simply considered “the problem”. Typically conditional probability estimation is done in situations where the conditional probability of only one bit is required, however there are a growing number of applications where a well-estimated conditional probability over a more complex object is required. For example, all known methods for solving general contextual bandit problems require knowledge of or good estimation of P(a | x) where a is an action.

There is a second intrinsic motivation which is matching the lower bound. No method faster than O(log k) can be imagined because the label y requires log2 k bits to specify and hence read. Similarly it’s easy to prove no learning reduction can provide a regret ratio with C<1.

The motivation for using the learning reduction framework to specify this problem is a combination of generality and the empirical effectiveness in application of learning reductions. Any solution to this will be general because any oracle B can be plugged in, even ones which use many strange kinds of prior information, features, and active multitask hierachical (insert your favorite adjective here) structure.

## Related Results

The state of the art is summarized here which shows it’s possible to have a learning reduction satisfying the above theorem with either:

1. C replaced by (log2 k)2 (using a binary tree structure)
2. or the computational time increased to O(k) (using an error correcting code structure).

Hence, answering this open problem in the negative shows that there is an inherent computation vs. robustness tradeoff.

There are two other closely related problems, where similar analysis can be done.

1. For multiclass classification, where the goal is predicting the most likely class, a result analogous to the open problem is provable using error correcting tournaments.
2. For multiclass classification in a partial label setting, no learning reduction can provide a constant regret guarantee.

## Silly tricks that don’t work

Because Learning reductions are not familiar to everyone, It’s helpful to note certain tricks which do not work here to prevent false leads and provide some intuition.

Ignore B‘s predictions and use your favorite learning algorithm instead.

This doesn’t work, because the quantification is for all D. Any specified learning algorithm will have some D on which it has nonzero regret. On the other hand, because R calls the oracle at least once, there is a defined induced distribution D’. Since the theorem must hold for all D and B, it must hold for a D your specified learning algorithm fails on and for a B for which reg(D’,B)=0 implying the theorem is not satisfied.

Feed random examples into B and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant.

This doesn’t work because the theorem is stated in terms of squared loss regret rather than squared loss. In particular, if the oracle is given examples of the form (x’,y’) where y’ is uniformly at random either 0 or 1, any oracle specifying B(x’)=0.5 has zero regret.

Feed pseudorandom examples into B and vacuously satisfy the theorem by making sure that the right hand side is larger than a constant.

This doesn’t work, because the quantification is “for all binary oracles B”, and there exists one which, knowing the pseudorandom seed, can achieve zero loss (and hence zero regret).

Just use Boosting to drive the LHS to zero.

Boosting theorems require a stronger oracle—one which provides an edge over some constant baseline for each invocation. The oracle here is not limited in this fashion since it could completely err for a small fraction of invocations.

Take an existing structure, parameterize it, randomize over the parameterization, and then average over the random elements.

Employing this approach is not straightforward, because the average in D’ is over an increased number of oracle examples. Hence, at a fixed expected (over oracle examples) regret, the number of examples allowed to have a large regret is increased.

###### 19 Comments to “The Efficient Robust Conditional Probability Estimation Problem”
1. x10000year says:

How about a naive approach:

R(x,y):
First oracle example: x’=(x,y),y’=1
Second oracle example: x’=(x,y0),y’=0. y0\neq y is randomly picked with probability 1/(k-1)

let q=B'((x,y))
R'(x,y)=q/(q+(1-q)*(k-1))

• John Langford says:

This approach is clearly efficient and consistent.

Suppose that k is 1000 and P(y=1|x) = 1. Suppose B’ errs slightly, by predicting B'((x,1))=1-g for g small. Then, the output is (1-g)/(1-g + g*(k-1)) ~= 1-g*(k-1). So, a regret of g2 becomes a regret of (g*(k-1))2 and so C is at least (k-1)2, implying the approach is not robust.

2. x10000year says:

How about another naive approach:

R(x,y):
oracle example: x’=(x,y0), 1<=y0<=k is randomly picked
when y0=y, y'=1, otherwise, y'=0

R'(x,y)=B'((x,y))

• jl says:

This again isn’t robust.

Suppose k is large and P(y=1|x)=1, with B((x,1))=1-eps and B((x,y))=0 for y not equal to 1. Then, the average regret of B is eps^2/k and the regret of the final prediction is eps^2, implying that C is at least k for this construction.

(My apologies—for unclear reasons this comment ended up in the spam bucket.)

3. idolumone says:

I wonder whether C is a reasonable measure for the robustness of the reduction. If we could construct a distribution D’ that is essentially more difficult to learn than D, achieving reg(D,R-1)<=Creg(D’,B) means nothing important.

• jl says:

This is a fair concern, reminiscent of “is polynomial time enough?” from algorithms development.

So far, in every case where some complex construction plausibly realizing this fear exists, another simpler construction plausibly not realizing this fear has been found.

In the longer term we’ll see, but in the shorter term I don’t know a good mechanism for avoiding this fear so it isn’t relevant to the challenge.

4. JB says:

Could you explain how this is incorrect? It would help me to better understand the problem. I added a square root so the solution is robust to small errors in the oracle.

For k=2:
First oracle example: x’=(x,y1), y’=1 with probability sqrt(p(y1|x))
Second oracle example: x’=(x,y2), y’=0 with probability 1-sqrt(p(y1|x))
R'(x,y1) = sqrt(B'((x,y1),1))
R'(x,y2) = 1-sqrt(B'((x,y1),0))

For k>2
First subdivide into problems, y=T where T is a threshold in the middle, then solve using “k=2” method
Then solve the subproblem either using “k=2” method or “k>2” method

5. JB says:

One correction to above,
R’(x,y1) = B’(x,y1)^2
R’(x,y2) = 1-B’(x,y1)^2

• jl says:

I don’t see how to implement this construction, because you don’t know P(y1|x), but rather have samples from it.

6. JB says:

Let’s call N1 the number of samples provided to algorithm R. If N1=infinity, the algorithm R could determine p(y1|x). Based on your comment, I assume the problem requires N1 < infinity.

I would like to show this problem can't be solved. This is done by showing there is no algorithm R and R-1 where reg(D,R-1)=0 for all distributions, when reg(D',B) = 0.

We use k=2 where distribution D has
(x=0, y1), with p(y1|x)
(x=0, y2), with p(y2|x)

where
p(y1|x) = 0 or Z, where Z is arbitrarily small (e.g. 1/1000)
p(y2|x) = 1 – p(y1|x)

If we pick N1 samples, there is a possibility that every R input sample is y2 and never y1. For p(y1|x) = Z, this probability is (1-Z)^N1.

If every input sample to R is always y2, then whether p(y1|x) = 0 or Z,
– D' is always the same, because R is deterministic
– Oracle B is always the same, because we use reg(D',B) = 0 and D' is always the same
– R-1 output is always the same, because oracle B is the same

There are (1-Z)^N1 odds that the output of R-1 is identical whether p(y1|x)=0 or Z. However for non-zero reg(D,R-1), the R-1 required for p(y1|x)=0 is different than the R-1 required for p(y1|x)=Z.

This shows there is no R & R-1 where reg(D,R-1)=0 for both p(y1|x)=0 and Z. In this scenario, there is no C where reg(D,R-1) <= Creg(D',B) where reg(D',B) = 0.

What do you think?

• x10000year says:

You don’t understand the problem. You are saying that there is no consistent algorithm, which is clearly wrong. Your sentence “D’ is always the same” is problematic. Both D and D’ are unobserved distribution. D’ depends on D, instead of the instances of input samples.

• jl says:

== x10000year here. Having a D_1 and D_2 which are indistinguishable with high probability on a finite sample doesn’t imply that D_1′ and D_2′ are equal (although it would imply they are close).

7. jb says:

Thanks for the comments. I think the following algorithm works assuming the problem allows sufficient samples for R to converge. Otherwise, this won’t work.

R(x, y):
total(x)++
hits(x,y)++
target_prob = hits(x,y)/total(x)
current_prob = y’_sum(x,y)/hits(x,y)
y’ = (target_prob – current_prob)*hits(x,y)
if y’ > 1, y’ = 1
if y’ < 0, y' = 0
y'_sum(x,y) += y'
generate example x'=(x,y), y'

R(x,y) initialization:
For all x, total(x) = 0
For all x,y, hits(x,y) = 0
For all x,y, y'_sum(x,y) = 0

R-1(x,y) = B-1((x,y))

• x10000year says:

It obviously doesn’t work.
The problem requires you to achieve reg(D,R-1)<=Creg(D’,B) for each call to R-1.

• jb says:

Are you saying R-1 is called before R converges? In the above algorithm, after R converges, D’ has E[y’|(x,y)] = p(y|x), and I think each call to R-1 should be efficient and robust.

• x10000year says:

If R really converges, you can just output hits(x,y)/total(x) for R-1, resulting to zero regret. That doesn’t make sense at all. Note that, the samples are alway finite, but the convergence is just an asymptotic notation.

8. JB says:

I think algorithms which use a random number 1<=y0<=k share the same trait as the algorithm I described. Both require extra time to converge.

In a sense, any algorithm takes time to converge. The tree training algorithm described in John's paper converges as fast as possible. Every input to R updates affects all p(y|x) estimated by R-1.

The 1<=y0<=k algorithms converge k times slower than the tree training algorithm. This is because, for every example provided to R, only one p(y|x) will be adjusted instead of all p(y|x) estimated by R-1.

It is easy to create a deterministic version of the algorithms which use a random number 1<=y0<=k. The rate of convergence is the same as the tree training algorithm, but the complexity increases to k.

In other words, we can reduce complexity of R by slowing down the convergence rate.

I suspect the intention is for the solution algorithm to converge as quickly as possible (such as with the tree training algorithm). Maybe John can clarify.

• x10000year says:

I think you misunderstand the problem. It has nothing to do with the convergence.

The training time complexity (computational complexity of R) is not important, since anyway we can reduce the number of oracle examples by setting a small sample rate. This sampling strategy doesn’t affect the upper bound of regret ratio. The bottlenect is in R-1 — the number of times B-1 is called in each R-1.

In addition, you cannot expect that the regret ratio bound would decrease or converge when you have accumulated sufficient training data. In particular, x might be continuous value, in which case each R(x,y) may introduce a new x that occurs the first time and never repeats itself. Thus, to guarantee robustness, the prediction from R-1 can only rely on B-1.

9. Robert says:

Whenever I hear about solving something “for all distributions” I think of the no free lunch theorem by Wolpert and Macready and tend to dismiss it. I immediately assume that for any R and R^-1 where the requirement holds for some distributions, it must be possible to construct a distribution where it does not.
But this is based on very coarse grained intuition and I’m sure you’r just as familiar with this obstacle.

Understanding how the NFL theorem does NOT lead to a non-existence proof in this case could perhaps help rule out a wide class of possible approaches.

Sorry, the comment form is closed at this time.