Binomial Weighting

Suppose we have a set of classifiers c making binary predictions from an input x, and we see examples in an online fashion. In particular, we repeatedly see an unlabeled example x, make a prediction y’ (possibly based on the classifiers c), and then see the correct label y.

When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with all previous examples. This is called the Halving algorithm. It makes at most log2 |c| mistakes, since on any mistake at least half of the remaining consistent classifiers are eliminated.
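
As a concrete reference point, here is a minimal sketch of the Halving algorithm, assuming (for illustration only) that classifiers are represented as callables mapping an input x to a prediction in {0, 1}:

```python
# Minimal sketch of the Halving algorithm: predict by majority vote over
# the classifiers consistent with all previous examples, then eliminate
# any classifier that erred. On every mistake of the majority vote, at
# least half of the surviving classifiers are removed, giving the
# log2 |c| mistake bound when some classifier is perfect.

def halving(classifiers, examples):
    consistent = list(classifiers)  # the surviving "version space"
    mistakes = 0
    for x, y in examples:
        votes = sum(c(x) for c in consistent)
        prediction = 1 if 2 * votes >= len(consistent) else 0
        if prediction != y:
            mistakes += 1
        consistent = [c for c in consistent if c(x) == y]
    return mistakes
```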

Obviously, we can’t generally hope that there exists a classifier which never errs. The Binomial Weighting algorithm is an elegant technique allowing a variant of the Halving algorithm to cope with errors by creating, for every classifier, a set of virtual classifiers which occasionally disagree with the original. The Halving algorithm on this set of virtual classifiers satisfies a theorem of the form:

errors of the Binomial Weighting algorithm ≤ min_c f(number of errors of c, number of experts)
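
Here is a back-of-the-envelope sketch of where such an f comes from, under assumptions introduced only for this sketch: a horizon of T rounds and a best classifier making at most k mistakes. Each classifier spawns one virtual classifier per way of flipping its predictions on at most k of the T rounds, and one of those virtual classifiers is perfect, so the Halving bound applies to the enlarged set:

```latex
% Each classifier yields \sum_{i=0}^{k} \binom{T}{i} virtual classifiers.
% Some virtual classifier is consistent, so Halving over all of them gives:
\[
  \text{mistakes} \;\le\; \log_2\!\Big( |c| \sum_{i=0}^{k} \binom{T}{i} \Big)
  \;\le\; \log_2 |c| \,+\, k \log_2 \frac{eT}{k}.
\]
```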

The Binomial Weighting algorithm takes as a parameter an upper bound on the number of mistakes made by the best classifier. By introducing a “prior” over the number of mistakes, it can be made parameter free. Similarly, introducing a “prior” over the set of classifiers is easy and makes the algorithm sufficiently flexible for common use.

However, there is a problem. The minimal value of f() is 2 times the number of errors of the best classifier, regardless of the number of classifiers. This is frustrating because a parameter-free learning algorithm taking an arbitrary “prior” and achieving good performance on an arbitrary (not even IID) set of examples would be compelling for implementation and use, if only we had a good technique for removing the factor of 2. How can we do that?

See the Weighted Majority algorithm for an example of a similar algorithm which can remove the factor of 2 using randomization, at the expense of introducing a parameter. There are known techniques for eliminating this parameter, but they appear not as tight (and therefore not as practically useful) as introducing a “prior” over the number of errors.
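
For contrast, here is a minimal sketch of the randomized variant of Weighted Majority; beta is exactly the parameter the previous paragraph refers to, and the callable-classifier representation is again an assumption for illustration:

```python
import random

# Minimal sketch of Randomized Weighted Majority. Assumption (not from
# the post): classifiers are callables x -> {0, 1}. beta in (0, 1) is the
# penalty parameter; randomizing the prediction is what removes the
# factor of 2 in expectation.

def randomized_weighted_majority(classifiers, examples, beta=0.5):
    weights = [1.0] * len(classifiers)
    mistakes = 0
    for x, y in examples:
        predictions = [c(x) for c in classifiers]
        # Predict by following a single classifier drawn with probability
        # proportional to its current weight.
        prediction = random.choices(predictions, weights=weights, k=1)[0]
        if prediction != y:
            mistakes += 1
        # Multiplicatively penalize every classifier that erred.
        weights = [w * beta if p != y else w
                   for w, p in zip(weights, predictions)]
    return mistakes
```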

Problem: Reductions and Relative Ranking Metrics

This, again, is something of a research direction rather than a single problem.

There are several metrics people care about which depend upon the relative ranking of examples, and there are sometimes good reasons to care about such metrics. Examples include AROC (the area under the ROC curve), “F1”, the proportion of the time that the top ranked element is in some class, the proportion of the top 10 examples in some class (google’s problem), the lowest ranked example of some class, and the “sort distance” from a predicted ranking to a correct ranking. See here for an example of some of these.
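
For concreteness, here is a minimal sketch of one of these metrics, AROC, computed as the fraction of positive–negative pairs ranked correctly; the (score, label) representation is a placeholder chosen for this sketch:

```python
# Minimal sketch: AROC as the probability that a uniformly random
# positive example is scored above a uniformly random negative example.
# Ties count half. Assumes both classes are present.

def aroc(scored_examples):
    """scored_examples: list of (score, label) pairs with label in {0, 1}."""
    positives = [s for s, y in scored_examples if y == 1]
    negatives = [s for s, y in scored_examples if y == 0]
    correct = sum((p > n) + 0.5 * (p == n)
                  for p in positives for n in negatives)
    return correct / (len(positives) * len(negatives))
```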

Problem What does the ability to classify well imply about performance under these metrics?

Past Work

  1. Probabilistic classification under squared error can be solved with a classifier. A counterexample shows this does not imply a good AROC.
  2. Sample complexity bounds for AROC (and here).
  3. A paper on “Learning to Order Things”.

Difficulty Several of these may be easy. Some of them may be hard.

Impact Positive or negative results will broaden our understanding of the relationship between different learning goals. It might also yield new algorithms (via the reduction) for solving these problems.

Problem: Online Learning

Despite my best intentions, this is not a fully specified problem, but rather a research direction.

Competitive online learning is one of the more compelling pieces of learning theory because typical statements of the form “this algorithm will perform almost as well as a large set of other algorithms” rely only on fully-observable quantities, and are therefore applicable in many situations. Examples include Winnow, Weighted Majority, and Binomial Weighting. Algorithms with this property haven’t taken over the world yet. Here are some possible reasons:

  1. Lack of caring. Many people working on learning theory don’t care about particular applications much. This means constants in the algorithm are not optimized, usable code is often not produced, and empirical studies aren’t done.
  2. Inefficiency. Viewed from the perspective of other learning algorithms, online learning is terribly inefficient. It requires that every hypothesis (called an expert in the online learning setting) be enumerated and tested on every example. (This is similar to the inefficiency of using Bayes law as an algorithm directly, and there are strong similarities in the algorithms.)

For an example of (1), there is a mysterious factor of 2 in the Binomial Weighting algorithm which has not been fully resolved. Some successful applications also exist, such as those based on SNoW.

The way to combat problem (2) is to introduce more structure into the hypotheses/experts. Some efforts have already been made in this direction. For example, it’s generally feasible to introduce an arbitrary bias or “prior” over the experts in the form of some probability distribution, and perform well with respect to that bias. Another nice piece of work by Adam Kalai and Santosh Vempala discusses how to efficiently handle several forms of structured experts. At an intuitive level, further development discovering how to efficiently work with new forms of structure might pay off well. One simple instance of their style of algorithm is sketched below.
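
As a rough illustration of the Kalai–Vempala style, here is a minimal “follow the perturbed leader” sketch over unstructured experts; the payoff of their approach is that for structured experts the argmin below can be replaced by an efficient combinatorial optimizer. The loss representation and the perturbation scale eps are assumptions made for this sketch:

```python
import random

# Minimal "follow the perturbed leader" sketch in the spirit of Kalai and
# Vempala. Assumptions (not from the post): per-round losses lie in [0, 1]
# and eps is a tunable perturbation-scale parameter.

def follow_the_perturbed_leader(num_experts, loss_stream, eps=0.1):
    """loss_stream yields, per round, a list of losses (one per expert)."""
    cumulative = [0.0] * num_experts
    total_loss = 0.0
    for losses in loss_stream:
        # Perturb each expert's cumulative loss with exponential noise,
        # then follow the apparent leader.
        perturbed = [cumulative[i] - random.expovariate(eps)
                     for i in range(num_experts)]
        leader = min(range(num_experts), key=lambda i: perturbed[i])
        total_loss += losses[leader]
        for i in range(num_experts):
            cumulative[i] += losses[i]
    return total_loss
```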

The basic research direction here is to address the gap between theory and practice, possibly by solving the above problems and possibly by discovering and addressing other problems.

Problem: Reinforcement Learning with Classification

At an intuitive level, the question here is “Can reinforcement learning be solved with classification?”

Problem Construct a reinforcement learning algorithm with near-optimal expected sum of rewards in the direct experience model, given access to a classifier learning algorithm which has a small error rate or regret on all posed classification problems. The definition of “posed” here is slightly murky. I consider a problem “posed” if there is an algorithm for constructing labeled classification examples.
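
To make “posed” slightly more concrete, here is one hypothetical interface sketch; every name and type here is an assumption for illustration, not part of the problem statement:

```python
# Hypothetical interface for the sought-after reduction. Nothing here is
# from the post; it only illustrates what "access to a classifier learning
# algorithm" and a "posed" problem might mean.

from typing import Callable, List, Tuple

Example = Tuple[list, int]            # (feature vector, binary label)
Classifier = Callable[[list], int]    # learned predictor
Learner = Callable[[List[Example]], Classifier]

def pose_problem(experience) -> List[Example]:
    """A problem is "posed" if some algorithm can construct labeled
    classification examples like these from direct experience."""
    raise NotImplementedError  # part of what the open problem asks for

def rl_via_classification(env, learner: Learner):
    """The reduction being asked for: interact with env, pose
    classification problems, and turn the learned classifiers into a
    policy with near-optimal expected sum of rewards."""
    raise NotImplementedError
```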

Past Work

  1. There exists a reduction of reinforcement learning to classification given a generative model. A generative model is an inherently stronger assumption than the direct experience model.
  2. Other work on learning reductions may be important.
  3. Several algorithms for solving reinforcement learning in the direct experience model exist. Most, such as E3, Factored-E3, metric-E3, and Rmax, require that the observation be the state. Recent work extends this approach to POMDPs.
  4. This problem is related to predictive state representations, because we are trying to solve reinforcement learning with prediction ability.

Difficulty It is not clear whether this problem is solvable or not. A proof that it is not solvable would be extremely interesting, and even partial success one way or another could be important.

Impact At the theoretical level, it would be very nice to know whether the ability to generalize implies the ability to solve reinforcement learning, because (in a general sense) all problems can be cast as reinforcement learning. Even if the solution is exponential in the horizon time, it could still motivate relaxations of the core algorithm, like RLgen.

Problem: Cross Validation

The essential problem here is the large gap between experimental observation and theoretical understanding.

Method K-fold cross validation is a commonly used technique which takes a set of m examples and partitions them into K sets (“folds”) of size m/K. For each fold, a classifier is trained on the other K-1 folds and then tested on the held-out fold.
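
For concreteness, here is a minimal sketch of the procedure; the learner interface learn(training_examples) -> classifier is a hypothetical placeholder:

```python
# Minimal sketch of K-fold cross validation. Assumption (not from the
# post): learn(training_examples) returns a callable classifier x -> label.

def k_fold_cross_validation(examples, K, learn):
    m = len(examples)
    folds = [examples[i * m // K:(i + 1) * m // K] for i in range(K)]
    classifiers, fold_errors = [], []
    for i in range(K):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i
                    for ex in fold]
        c = learn(training)  # train on the other K-1 folds
        errors = sum(c(x) != y for x, y in held_out)
        classifiers.append(c)
        fold_errors.append(errors / len(held_out))
    # The open problem: derive from these K classifiers a single
    # classifier with a small bound on its true error rate.
    return classifiers, fold_errors
```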

Problem Assume only independent samples. Derive a classifier from the K classifiers with a small bound on the true error rate.

Past Work (I’ll add more as I remember/learn.)

  1. Devroye, Rogers, and Wagner analyzed cross validation and found algorithm-specific bounds. Not all of this is online, but here is one paper.
  2. Michael Kearns and Dana Ron analyzed cross validation and found that under additional stability assumptions the bound for the classifier which learns on all the data is not much worse than for a test set of size m/K.
  3. Avrim Blum, Adam Kalai, and I analyzed cross validation and found that you can do at least as well as a test set of size m/K, with no additional assumptions, using the randomized classifier which draws uniformly from the K trained classifiers.
  4. Yoshua Bengio and Yves Grandvalet analyzed cross validation and concluded that there was no unbiased estimator of variance.
  5. Matti Kääriäinen noted that you can safely derandomize a stochastic classifier (such as one that randomizes over the K folds) using unlabeled data without additional assumptions.

Some Extreme Cases to Sharpen Intuition

  1. Suppose on every fold the learned classifier is the same. Then, the cross-validation error should behave something like a test set of size m. This is radically superior to a test set of size m/K.
  2. Consider leave-one-out cross validation. Suppose we have a “learning” algorithm that uses the classification rule “always predict the parity of the labels on the training set”. Suppose the learning problem is defined by a distribution which picks y=1 with probability 0.5. Each leave-one-out prediction is the parity of the other m-1 labels, which equals the held-out label exactly when the total parity of all m labels is even. So with probability 0.5 all leave-one-out errors will be 0, and otherwise they will all be 1, like a single coin flip (see the simulation below).

(some discussion is here)
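
Here is a minimal simulation of the parity example above, just to verify the all-or-nothing behavior of the leave-one-out estimate:

```python
import random

# Minimal simulation of the leave-one-out parity example: the "classifier"
# always predicts the parity of the training labels.

def loo_error_estimate(labels):
    total_parity = sum(labels) % 2
    errors = 0
    for y in labels:
        prediction = (total_parity - y) % 2  # parity of the other labels
        errors += (prediction != y)
    return errors / len(labels)

# Each run yields an estimate of exactly 0 or exactly 1 (a coin flip),
# even though the true error rate of every such predictor is 0.5.
for _ in range(5):
    labels = [random.randint(0, 1) for _ in range(100)]
    print(loo_error_estimate(labels))
```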

Difficulty I consider this a hard problem. I’ve worked on it without success and it’s an obvious problem (due to the pervasive use of cross validation) that I suspect other people have considered. Analyzing the dependency structure of cross validation is quite difficult.

Impact On any individual problem, solving this might have only a small impact due to slightly improved judgement of success. But, because cross validation is used extensively, the overall impact of a good solution might be very significant.