Multitask learning is the learning to predict multiple outputs given the same input. Mathematically, we might think of this as trying to learn a function f:X -> {0,1}n. Structured learning is similar at this level of abstraction. Many people have worked on solving multitask learning (for example Rich Caruana) using methods which share an internal representation. On other words, the the computation and learning of the ith prediction is shared with the computation and learning of the jth prediction. Another way to ask this question is: can we avoid sharing the internal representation?
For example, it might be feasible to solve multitask learning by some process feeding the ith prediction f(x)i into the jth predictor f(x,f(x)i)j,
If the answer is “no”, then it implies we can not take binary classification as a basic primitive in the process of solving prediction problems. If the answer is “yes”, then we can reuse binary classification algorithms to solve multitask learning problems.
Finding a satisfying answer to this question at a theoretical level appears tricky. If you consider the infinite data limit with IID samples for any finite X, the answer is “yes” because any function can be learned. However, this does not take into account two important effects:
- Using a shared representation alters the bias of the learning process. What this implies is that fewer examples may be required to learn all of the predictors. Of course, changing the input features also alters the bias of the learning process. Comparing these altered biases well enough to distinguish their power seems very tricky. For reference, Jonathon Baxter has done some related analysis (which still doesn’t answer the question).
- Using a shared representation may be computationally cheaper.
One thing which can be said about multitask learning (in either black-box or shared representation form), is that it can make learning radically easier. For example, predicting the first bit output by a cryptographic circuit is (by design) extraordinarily hard. However, predicting the bits of every gate in the circuit (including the first bit output) is easily done based upon a few examples.
Is this true? If we take our “internal representation” to also be in the form of binary prediction (ala boosting, stage-wise additive logistic regression, etc…) it might still be reasonable to talk about the problem in terms of binary classification as the base element. Perhaps it will just limit what can consider internal representation.
Hi John ……. Just came across your blog. This is good stuff. Have been looking for something like this for a while. Have subscribed to the RSS feed. We’ve met a while back when you had visited Tony Jebara’s lab at Columbia. I am doing my PhD in Computational Biology and Machine learning there.
A “no” answer certainly does not forbid using classifiers internally.
“can not” may be too strong, but a “no” answer seems to suggest that we should think about designing multitask algorithms directly rather than indirectly. And once we have done that, perhaps we should reuse it as much as possible.
Aha! The multitask case seems strikingly familiar to a previous discussion about cross-time-dependent features in sequence labeling with Hamming loss (see the last 3-4 comments…is there any way to link to them directly?). Specifically, consider sequence labeling with Hamming loss as individual prediction problems, which is the argument you made in that other post. But it is very easy to consider this as a multitask learning problem: these tasks are very related. Then apply your above argument for multitask learning, and you have an argument for why (for instance) Markov features might be useful, even if the loss function doesn’t care about them.