Multitask learning is Black-Boxable

Multitask learning is the problem of jointly predicting multiple labels simultaneously with one system. A basic question is whether or not multitask learning can be decomposed into one (or more) single prediction problems. It seems the answer to this is “yes”, in a fairly straightforward manner.

The basic idea is that a controlled input feature is equivalent to an extra output. Suppose we have some process generating examples: (x,y₁,y₂) in S where y₁ and y₂ are labels for two different tasks. Then, we could reprocess the data to the form S_b(S) = {((x,i),y_i): (x,y₁,y₂) in S, i in {1,2}} and then learn a classifier c:X x {1,2} -> Y. Note that (x,i) is the (composite) input. At testing time, given an input x, we can query c for the predicted values of y₁ and y₂ using (x,1) and (x,2).

A strong form of equivalence can be stated between these tasks. In particular, suppose we have a multitask learning algorithm ML which learns a multitask predictor m:X -> Y x Y. Then the following theorem can be proved:

For all ML for all S, there exists an inverse reduction S_m such that ML(S) = ML(S_m(S_b(S)).

In other words, no information is lost in the transformation S_b which means everything which was learnable previously remains learnable.

This may not be the final answer to the question because there may be some algorithm-dependent (mis)behavior associated with controlled feature i. It may also be the case that single task classification is computationally distinguishable from multitask classification. Certainly, computational concerns are one of the reasons specialized multitask classification algorithms exist.

5 Replies to “Multitask learning is Black-Boxable”

andrej says:

3/18/2006 at 4:21 pm

I apologize if I am saying something silly (because I know next to nothing about learning), but as I see it this is trivial or wrong. If for each x at most one tuple (x,y1,y2) may appear, then it is trivial that you can express precisely the same information by (x,1,y1), (x,2,y2), whereas if there can be (x,y1,y2) as well as (x,y1′,y2′) then it is wrong, since you won’t know how to recombine (x,1,y1), (x2,y2) and (x,1,y1′) and (x,2,y2′). Is there more to it?
jl says:

3/20/2006 at 6:33 pm

I think the observation is trivial (once made), but it isn’t quite what you think. The thing to understand is that S is not really a set (ordering matters in machine learning in general), but rather a sequence. Consequently, we can handle examples with the same x but different y.
dmitry says:

3/22/2006 at 6:44 pm

If S is a sequence, is Sb a sequence too? In case it is, it seems that the inverse transformation Sm is easy.
Anonymous says:

3/23/2006 at 12:34 am

Also, I think the claim falls apart when you restrict the class of classifiers (which I suppose you always do). So (x,y₁) and (x,y₂) may both be perfectly predicted by hyperplane classifiers while no hyperplane classifier predicts ((x,i),y_i) within a reasonable error rate.
I think the reduction would only be interesting if for a reasonable class of classifier C or a learning algorithm A, it is the case that whenever (x,y₁) is C-classifiable (A-learnable) and (x,y₂) is C-classifiable (A-learnable), then ((x,i),y_i) is also C-Classifiable (A-learnable).
jl says:

3/23/2006 at 6:21 pm

What you point out is a generable bugaboo with the reductions approach: you could reduce to a problem which is somehow hard to learn for a particular learning problem. An extreme case of this would be encrypting the input feature x. I have not figured out how to eliminate this difficulty in general.

For your particular example, there is an input encoding which preserves information and has the property that linear separability is preserved. Essentially, you map (x,1) to (x,0) and (x,2) to (0,x) where 0 is a zero vector.

Comments are closed.