What is the best regret transform reduction from multiclass to binary?

This post is about an open problem in learning reductions.

Background A reduction might transform a a multiclass prediction problem where there are k possible labels into a binary learning problem where there are only 2 possible labels. On this induced binary problem we might learn a binary classifier with some error rate e. After subtracting the minimum possible (Bayes) error rate b, we get a regret r = e – b. The PECOC(Probabilistic Error Correcting Output Code) reduction has the property that binary regret r implies multiclass regret at most 4r0.5.

The problem This is not the “rightest” answer. Consider the k=2 case, where we reduce binary to binary. There exists a reduction (the identity) with the property that regret r implies regret r. This is substantially superior to the transform given by the PECOC reduction, which suggests that a better reduction may exist for general k. For example, we can not rule out the possibility that a reduction R exists with regret transform guaranteeing binary regret r implies at most multiclass regret c(k) r where c(k) is a k dependent constant between 1 and 4.

Difficulty I believe this is a solvable problem, given some serious thought.

Impact The use of some reduction from multiclass to binary is common practice, so a good solution could be widely useful. One thing to be aware of is that there is a common and reasonable concern about the ‘naturalness’ of induced problems. There seems to be no way to address this concern other than via empirical testing. On the theoretical side, a better reduction may help us understand whether classification or l2 regression is the more natural primitive for reduction. The PECOC reduction essentially first turns a binary classifier into an l2 regressor and then uses the regressor repeatedly to make multiclass predictions.

Some background material which may help:

  1. Dietterich and Bakiri introduce Error Correcting Output Codes.
  2. Guruswami and Sahai analyze ECOC as an error transform reduction. (see lemma 2)
  3. Allwein, Schapire, and Singer generalize ECOC to use loss-based decoding.
  4. Beygelzimer and Langford showed that ECOC is not a regret transform and proved the PECOC regret transform.

Yahoo’s Learning Problems.

I just visited Yahoo Research which has several fundamental learning problems near to (or beyond) the set of problems we know how to solve well. Here are 3 of them.

  1. Ranking This is the canonical problem of all search engines. It is made extra difficult for several reasons.
    1. There is relatively little “good” supervised learning data and a great deal of data with some signal (such as click through rates).
    2. The learning must occur in a partially adversarial environment. Many people very actively attempt to place themselves at the top of
      rankings.
    3. It is not even quite clear whether the problem should be posed as ‘ranking’ or as ‘regression’ which is then used to produce a
      ranking.
  2. Collaborative filtering Yahoo has a large number of recommendation systems for music, movies, etc… In these sorts of systems, users specify how they liked a set of things, and then the system can (hopefully) find some more examples of things they might like
    by reasoning across multiple such sets.
  3. Exploration with Generalization The cash cow of
    search engines is displaying advertisements which are relevant to search along with search results. Better targeting these advertisements makes money (a small improvement might be worth $millions) and improves the value of the search engine for the user.

    It is natural to predict the set of advertisements which maximize the advertising payoff. This natural idea is stymied by both the extreme
    multiplicity of advertisements under contract (think millions) and a lack of ability to measure hypotheticals like “What would have
    happened if we had displayed a different set of advertisements for this (query,user) pair instead?” This is a combined exploration and
    generalization problem.

Good solutions to any of these problems would be extremely useful (and not just at Yahoo). Even further small improvements on the existing solutions may be very useful.

For those interested, Yahoo (as an organization) knows these are learning problems and is very actively interested in solving them. Yahoo Research is committed to a relatively open method of solving these problems. Dennis DeCoste is one contact point for machine learning research at Yahoo Research.

Is Multitask Learning Black-Boxable?

Multitask learning is the learning to predict multiple outputs given the same input. Mathematically, we might think of this as trying to learn a function f:X -> {0,1}n. Structured learning is similar at this level of abstraction. Many people have worked on solving multitask learning (for example Rich Caruana) using methods which share an internal representation. On other words, the the computation and learning of the ith prediction is shared with the computation and learning of the jth prediction. Another way to ask this question is: can we avoid sharing the internal representation?

For example, it might be feasible to solve multitask learning by some process feeding the ith prediction f(x)i into the jth predictor f(x,f(x)i)j,

If the answer is “no”, then it implies we can not take binary classification as a basic primitive in the process of solving prediction problems. If the answer is “yes”, then we can reuse binary classification algorithms to solve multitask learning problems.

Finding a satisfying answer to this question at a theoretical level appears tricky. If you consider the infinite data limit with IID samples for any finite X, the answer is “yes” because any function can be learned. However, this does not take into account two important effects:

  1. Using a shared representation alters the bias of the learning process. What this implies is that fewer examples may be required to learn all of the predictors. Of course, changing the input features also alters the bias of the learning process. Comparing these altered biases well enough to distinguish their power seems very tricky. For reference, Jonathon Baxter has done some related analysis (which still doesn’t answer the question).
  2. Using a shared representation may be computationally cheaper.

One thing which can be said about multitask learning (in either black-box or shared representation form), is that it can make learning radically easier. For example, predicting the first bit output by a cryptographic circuit is (by design) extraordinarily hard. However, predicting the bits of every gate in the circuit (including the first bit output) is easily done based upon a few examples.

Text Entailment at AAAI

Rajat Raina presented a paper on the technique they used for the PASCAL Recognizing Textual Entailment challenge.

“Text entailment” is the problem of deciding if one sentence implies another. For example the previous sentence entails:

  1. Text entailment is a decision problem.
  2. One sentence can imply another.

The challenge was of the form: given an original sentence and another sentence predict whether there was an entailment. All current techniques for predicting correctness of an entailment are at the “flail” stage—accuracies of around 58% where humans could achieve near 100% accuracy, so there is much room to improve. Apparently, there may be another PASCAL challenge on this problem in the near future.