Multitask learning is the problem of jointly predicting multiple labels simultaneously with one system. A basic question is whether or not multitask learning can be decomposed into one (or more) single prediction problems. It seems the answer to this is “yes”, in a fairly straightforward manner.

The basic idea is that a controlled input feature is equivalent to an extra output. Suppose we have some process generating examples: *(x,y _{1},y _{2}) in S* where *y _{1}* and *y _{2}* are labels for two different tasks. Then, we could reprocess the data to the form *S _{b}(S) = {((x,i),y _{i}): (x,y _{1},y _{2}) in S, i in {1,2}}* and then learn a classifier *c:X x {1,2} -> Y*. Note that *(x,i)* is the (composite) input. At testing time, given an input *x*, we can query *c* for the predicted values of *y _{1}* and *y _{2}* using *(x,1)* and *(x,2)*.
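As a concrete sketch (names and data are hypothetical), the reprocessing *S _{b}* just splits each multitask example into two single-task examples tagged with the controlled input feature:

```python
def S_b(S):
    """Reprocess multitask examples (x, y1, y2) into single-task
    examples ((x, i), y_i), where i is the controlled input feature."""
    out = []
    for (x, y1, y2) in S:
        out.append(((x, 1), y1))
        out.append(((x, 2), y2))
    return out

# Toy data: two examples, each with labels for both tasks.
S = [((0.5, 1.2), 0, 1), ((0.1, 0.3), 1, 1)]
for example in S_b(S):
    print(example)
```

A single classifier trained on this output answers task 1 queries via inputs of the form (x,1) and task 2 queries via (x,2).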

A strong form of equivalence can be stated between these tasks. In particular, suppose we have a multitask learning algorithm *ML* which learns a multitask predictor *m:X -> Y x Y*. Then the following theorem can be proved:

For all *ML* and all *S*, there exists an inverse reduction *S _{m}* such that

*ML(S) = ML(S _{m}(S _{b}(S)))*.

In other words, no information is lost in the transformation *S _{b}*, which means everything which was learnable previously remains learnable.
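A minimal sketch (hypothetical names) of why no information is lost: since *S* is a sequence and *S _{b}* emits, for each example, the pair ((x,1),y _{1}) immediately followed by ((x,2),y _{2}), an inverse *S _{m}* can re-pair consecutive elements and recover *S* exactly, so any *ML* run on *S _{m}(S _{b}(S))* sees the same input as *ML* run on *S*:

```python
def S_b(S):
    # Forward reduction: one multitask example -> two single-task examples.
    return [((x, i), y) for (x, y1, y2) in S for (i, y) in ((1, y1), (2, y2))]

def S_m(Sb):
    # Inverse reduction: re-pair consecutive examples, relying on order.
    return [(Sb[j][0][0], Sb[j][1], Sb[j + 1][1]) for j in range(0, len(Sb), 2)]

# Order is preserved, so repeated x with different labels is handled fine.
S = [((0.5, 1.2), 0, 1), ((0.5, 1.2), 1, 0)]
assert S_m(S_b(S)) == S
```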

This may not be the final answer to the question because there may be some algorithm-dependent (mis)behavior associated with controlled feature *i*. It may also be the case that single task classification is computationally distinguishable from multitask classification. Certainly, computational concerns are one of the reasons specialized multitask classification algorithms exist.

I apologize if I am saying something silly (because I know next to nothing about learning), but as I see it this is trivial or wrong. If for each x at most one tuple (x,y1,y2) may appear, then it is trivial that you can express precisely the same information by (x,1,y1), (x,2,y2), whereas if there can be (x,y1,y2) as well as (x,y1′,y2′) then it is wrong, since you won’t know how to recombine (x,1,y1), (x,2,y2) and (x,1,y1′) and (x,2,y2′). Is there more to it?

I think the observation is trivial (once made), but it isn’t quite what you think. The thing to understand is that *S* is not really a set (ordering matters in machine learning in general), but rather a sequence. Consequently, we can handle examples with the same *x* but different *y*.

If *S* is a sequence, is *S _{b}* a sequence too? In case it is, it seems that the inverse transformation *S _{m}* is easy.

Also, I think the claim falls apart when you restrict the class of classifiers (which I suppose you always do). So (x,y _{1}) and (x,y _{2}) may both be perfectly predicted by hyperplane classifiers while no hyperplane classifier predicts ((x,i),y _{i}) within a reasonable error rate. I think the reduction would only be interesting if, for a reasonable class of classifiers C or a learning algorithm A, it is the case that whenever (x,y _{1}) is C-classifiable (A-learnable) and (x,y _{2}) is C-classifiable (A-learnable), then ((x,i),y _{i}) is also C-classifiable (A-learnable).

What you point out is a general bugaboo with the reductions approach: you could reduce to a problem which is somehow hard to learn for a particular learning algorithm. An extreme case of this would be encrypting the input feature *x*. I have not figured out how to eliminate this difficulty in general.

For your particular example, there is an input encoding which preserves information and has the property that linear separability is preserved. Essentially, you map *(x,1)* to *(x,0)* and *(x,2)* to *(0,x)*, where *0* is a zero vector.
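A toy check of this block encoding (all data hypothetical): if task 1 is separated by hyperplane w1 and task 2 by w2, then the concatenated weight vector (w1, w2) separates the combined single-task problem under the encoding, even when the two task hyperplanes disagree on every point:

```python
def encode(x, i):
    """Map (x,1) to (x,0) and (x,2) to (0,x), with 0 a zero vector."""
    zero = [0.0] * len(x)
    return list(x) + zero if i == 1 else zero + list(x)

def sign(z):
    return 1 if z > 0 else -1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

w1, w2 = [1.0, -1.0], [-2.0, 0.5]  # separate hyperplanes for the two tasks
w = w1 + w2                        # one hyperplane for the encoded problem

for x in [(0.3, -0.7), (-1.0, 0.2), (2.0, 2.5)]:
    for i, wi in ((1, w1), (2, w2)):
        y = sign(dot(wi, x))                 # label from task i's hyperplane
        assert sign(dot(w, encode(x, i))) == y
```

The key point is that the zero block prevents any interference between the two tasks: on an encoded task 1 input, w2's coordinates multiply zeros, and vice versa.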