Conditional Tournaments for Multiclass to Binary

This problem has been cracked (but not quite completely solved) by Alina, Pradeep, and me. The problem is essentially finding a better way to reduce multiclass classification to binary classification. The solution is to use a carefully crafted tournament, the simplest version of which is a single elimination tournament where the “players” are the different classes. An example of the structure is here:


For the single elimination tournament, we can prove that:
For all multiclass problems D and all learned binary classifiers c, the regret of the induced multiclass classifier is bounded by the regret of the binary classifier times log_2 k. Restated:

reg_multiclass(D, Filter_tree_test(c)) <= log_2 k * reg_binary(Filter_tree_train(D), c)

Here:

  1. Filter_tree_train(D) is the induced binary classification problem.
  2. Filter_tree_test(c) is the induced multiclass classifier.
  3. reg_multiclass is the multiclass regret (= the difference between the error rate and the minimum possible error rate; written out just below).
  4. reg_binary is the binary regret.
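
Spelled out, both regrets share the same form. Writing e(D, h) for the error rate of a hypothesis h on distribution D (notation chosen here for illustration; the draft may use different symbols):

```latex
% Regret: realized error rate minus the best achievable error rate.
\mathrm{reg}(D, h) = e(D, h) - \min_{h'} e(D, h')
```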

This result has a slight dependence on k which we suspect is removable. The current conjecture is that this dependence can be removed by using higher order tournaments such as double elimination, triple elimination, up to log_2 k-elimination.

The key insight which makes the result possible is conditionally defining the prediction problems at interior nodes. In essence, we use the learned classifiers from the first level of the tree to filter the distribution over examples reaching the second level of the tree. This process repeats until the root node is reached. Further details, including a more precise description and some experimental results, are in the draft paper.
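
To make this concrete, here is a minimal sketch of the single elimination version (my own naming and layout, not the implementation from the draft paper; it assumes k is a power of 2 and one learned binary classifier per internal node, with output 0 meaning the left player wins):

```python
# Sketch of a single elimination filter tree (illustrative only).
# players: the k class labels at the leaves, k a power of 2.
# classifiers: one list of binary classifiers per level of the tree;
# classifiers[level][match](x) returns 0 (left player wins) or 1 (right wins).

def predict_tournament(x, players, classifiers):
    """Play the tournament bottom-up; the class winning at the root is the
    multiclass prediction."""
    level = 0
    while len(players) > 1:
        winners = []
        for match, (left, right) in enumerate(zip(players[::2], players[1::2])):
            winners.append(right if classifiers[level][match](x) else left)
        players, level = winners, level + 1
    return players[0]

def filter_to_next_level(S, players, level_classifiers):
    """One conditional filtering step: an example (x, y) is passed up to train
    the next level only if its true label y survives the current round, i.e.
    the learned classifier in y's match picks y as the winner."""
    survivors = []
    for x, y in S:
        winners = []
        for match, (left, right) in enumerate(zip(players[::2], players[1::2])):
            winners.append(right if level_classifiers[match](x) else left)
        if y in winners:
            survivors.append((x, y))
    return survivors
```

Training then proceeds level by level: at each node the binary label for a surviving example is simply which of the two incoming players equals its true label, and examples whose label has already been eliminated are filtered out for the levels above.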

Multitask learning is Black-Boxable

Multitask learning is the problem of jointly predicting multiple labels simultaneously with one system. A basic question is whether or not multitask learning can be decomposed into one (or more) single prediction problems. It seems the answer to this is “yes”, in a fairly straightforward manner.

The basic idea is that a controlled input feature is equivalent to an extra output. Suppose we have some process generating examples: (x,y1,y2) in S where y1 and y2 are labels for two different tasks. Then, we could reprocess the data into the form Sb(S) = {((x,i),yi): (x,y1,y2) in S, i in {1,2}} and then learn a classifier c:X x {1,2} -> Y. Note that (x,i) is the (composite) input. At testing time, given an input x, we can query c for the predicted values of y1 and y2 using (x,1) and (x,2).
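
In code, the reprocessing and the test-time queries are one-liners (a minimal sketch; c stands for any learned classifier on the composite input space):

```python
# Sketch of the reduction S_b (illustrative): each multitask example
# (x, y1, y2) becomes two single-task examples ((x, i), y_i), with the
# controlled feature i identifying the task.

def Sb(S):
    return [((x, i), y) for (x, y1, y2) in S for (i, y) in ((1, y1), (2, y2))]

def predict_both(c, x):
    """At test time, query the single classifier once per task."""
    return c((x, 1)), c((x, 2))
```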

A strong form of equivalence can be stated between these tasks. In particular, suppose we have a multitask learning algorithm ML which learns a multitask predictor m:X -> Y x Y. Then the following theorem can be proved:

For all ML and for all S, there exists an inverse reduction Sm such that ML(S) = ML(Sm(Sb(S))).

In other words, no information is lost in the transformation Sb, which means everything that was learnable previously remains learnable.
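
To see why, note that the inverse reduction is mechanical: the controlled feature i is preserved in the composite input, so the original multitask samples can be reassembled exactly (a sketch that assumes the record ordering produced by Sb above):

```python
def Sm(transformed):
    """Invert S_b: pair each ((x, 1), y1) record with its ((x, 2), y2)
    neighbor to rebuild (x, y1, y2). Assumes Sb's interleaved ordering."""
    multitask = []
    for ((x1, i1), y1), ((x2, i2), y2) in zip(transformed[::2], transformed[1::2]):
        assert i1 == 1 and i2 == 2  # adjacent records describe the same x
        multitask.append((x1, y1, y2))
    return multitask
```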

This may not be the final answer to the question because there may be some algorithm-dependent (mis)behavior associated with the controlled feature i. It may also be the case that single-task classification is computationally distinguishable from multitask classification. Certainly, computational concerns are one of the reasons specialized multitask classification algorithms exist.

Progress in Active Learning

Several bits of progress have been made since Sanjoy pointed out the significant lack of theoretical understanding of active learning. This is an update on the progress I know of. As a refresher, active learning as meant here (sketched as a loop after the list) is:

  1. There is a source of unlabeled data.
  2. There is an oracle from which labels can be requested for unlabeled data produced by the source.
  3. The goal is to perform well with minimal use of the oracle.
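
Rendered as code, the interaction looks like the sketch below (illustrative only; should_query stands in for whatever label-requesting rule a particular algorithm uses, and none of the algorithms listed next are this simple):

```python
# Generic active learning loop (illustrative). source yields unlabeled
# examples, oracle(x) returns a label on request, and should_query decides
# whether a label is worth paying for.

def active_learn(source, oracle, should_query, fit, budget):
    labeled, model = [], None  # model is None until the first refit
    for x in source:
        if budget == 0:
            break
        if should_query(model, x):          # algorithm-specific decision rule
            labeled.append((x, oracle(x)))  # spend one oracle call
            budget -= 1
            model = fit(labeled)            # refit on all labels so far
    return model
```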

Here is what I’ve learned:

  1. Sanjoy has developed sufficient and semi-necessary conditions for active learning given the assumptions of IID data and “realizability” (that one of the classifiers is a correct classifier).
  2. Nina, Alina, and I developed an algorithm for active learning relying on only the assumption of IID data. A draft is here.
  3. Nicolo, Claudio, and Luca showed that it is possible to do active learning in an entirely adversarial setting for linear threshold classifiers here. This was published a year or two ago and I recently learned about it.

All of these results are relatively ‘rough’: they don’t necessarily make good algorithms as stated (although the last one has a few experiments). None of these results are directly comparable because the assumptions vary. Comparing the assumptions and the results leads to a number of remaining questions:

  1. Do the sufficient and semi-necessary conditions apply to the IID-only case? The adversarial case?
  2. Is there a generic algorithm for any hypothesis space that works in the fully adversarial setting?
  3. What are special cases of these algorithms which are computationally tractable and useful?

The Foundations of Active Learning workshop at NIPS should be a good place to discuss these questions.

Exact Online Learning for Classification

Jacob Abernethy and I have found a computationally tractable method for computing an optimal (or near optimal, depending on the setting) master algorithm combining expert predictions, addressing this open problem. A draft is here.

The effect of this improvement seems to be about a factor of 2 decrease in the regret (= error rate minus best possible error rate) for the low error rate situation. (At large error rates, there may be no significant difference.)
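
The algorithm itself is in the draft; as a reference point for readers unfamiliar with the setting, the classical weighted majority master (a textbook sketch, not the draft's algorithm) combines expert predictions like this:

```python
# Textbook weighted majority (illustrative; NOT the draft's algorithm).
# Experts predict labels in {0, 1}; the master predicts by weighted vote and
# multiplies the weight of every mistaken expert by beta in (0, 1).

def weighted_majority(expert_streams, outcomes, beta=0.5):
    weights = [1.0] * len(expert_streams)
    mistakes = 0
    for preds, y in zip(zip(*expert_streams), outcomes):
        vote_1 = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_0 = sum(w for w, p in zip(weights, preds) if p == 0)
        mistakes += int((1 if vote_1 >= vote_0 else 0) != y)
        weights = [w * (beta if p != y else 1.0)
                   for w, p in zip(weights, preds)]
    return mistakes
```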

There are some unfinished details still to consider:

  1. When we remove all of the approximation slack from online learning, is the result a satisfying learning algorithm in practice? I consider online learning to be one of the more compelling methods of analyzing and deriving algorithms, but that expectation must be either met or not by this algorithm.
  2. Some extra details: the algorithm is optimal given a small amount of side information (k in the draft). What is the best way to remove this side information? The removal is necessary for a practical algorithm. One mechanism may be the k -> infinity limit.

Solution: Reinforcement Learning with Classification

I realized that the tools needed to solve the problem just posted have now been created. I tried to sketch out the solution here (also in .lyx and .tex). It is still quite sketchy (and probably only the few people who understand reductions well can follow it).

One of the reasons why I started this weblog was to experiment with “research in the open”, and this is an opportunity to do so. Over the next few days, I’ll be filling in details and trying to get things to make sense. If you have additions or ideas, please propose them.