Fernando Pereira pointed out Ando and Zhang’s paper on “structural” learning. Structural learning is multitask learning on subproblems created from unlabeled data.

The basic idea is to take a look at the unlabeled data and create many supervised problems. On text data, which they test on, these subproblems might be of the form “Given surrounding words predict the middle word”. The hope here is that successfully predicting on these subproblems is relevant to the prediction of your core problem.
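As a rough illustration of "given surrounding words predict the middle word", here is a minimal sketch that turns unlabeled sentences into binary subproblems, one per frequent word. The function name `make_aux_problems`, the window size, and the choice of "most frequent words" as targets are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter

def make_aux_problems(sentences, top_k=2, window=1):
    """Build binary subproblems of the form 'is the middle word w?'
    from unlabeled, tokenized sentences (hypothetical helper)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    targets = [w for w, _ in counts.most_common(top_k)]
    problems = {w: [] for w in targets}
    for sent in sentences:
        for i in range(window, len(sent) - window):
            # Features are the surrounding words only; the middle word is hidden.
            context = tuple(sent[i - window:i] + sent[i + 1:i + 1 + window])
            for w in targets:
                problems[w].append((context, int(sent[i] == w)))
    return problems

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
problems = make_aux_problems(sentences)
```

Each entry in `problems` is an ordinary supervised dataset whose labels came for free from the raw text, which is what makes it possible to create many of them.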

In the long run, the precise mechanism used (essentially, linear predictors with parameters tied by a common matrix) and the precise problems formed may not be critical. What seems critical is that the hope is realized: the technique provides a significant edge in practice.
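To make "parameters tied by a common matrix" concrete, here is a simplified sketch under stated assumptions. It fits one linear predictor per subproblem, stacks the weight vectors, and takes the top singular directions as a shared matrix `Theta` with orthonormal rows. This is a one-pass simplification using ridge regression on synthetic data, not the alternating optimization described in the paper; the names `fit_linear` and `shared_structure` are hypothetical.

```python
import numpy as np

def fit_linear(X, y, reg=1e-2):
    """Ridge regression weights for a single subproblem (assumed loss)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)

def shared_structure(X, Y_aux, h, reg=1e-2):
    """Return an h x d matrix with orthonormal rows spanning the
    dominant directions of the subproblem weight vectors."""
    # Column l of W holds the weights learned for subproblem l (d x m).
    W = np.column_stack(
        [fit_linear(X, Y_aux[:, l], reg) for l in range(Y_aux.shape[1])]
    )
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :h].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Synthetic subproblem targets that all depend on the same 2 hidden directions.
B = rng.normal(size=(10, 2))
Y_aux = (X @ B) @ rng.normal(size=(2, 6)) + 0.01 * rng.normal(size=(200, 6))

Theta = shared_structure(X, Y_aux, h=2)
Z = X @ Theta.T  # low-dimensional features to add when learning the core problem
```

The core problem would then be learned on the original features plus `Z`, so whatever the subproblems have in common is available to it as a compact representation.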

Some basic questions about this approach are:

- Are there effective automated mechanisms for creating the subproblems?
- Is it necessary to use a shared representation?

John,

I attended Zhang’s talk. My main questions were: (a) Is generating sub-problems a matter of art in the algorithm? For example, it is unclear to me how to apply the method to a digit recognition problem. In what sense is one set of sub-problems better than another in practice? (b) The orthogonality constraint on theta makes the problem non-convex. Doesn’t the alternating procedure run into local minima issues? (c) It is not immediately clear to me how to cleanly kernelize the method. (d) I wonder how the method performs in comparison to something like a transductive SVM on textual domains.

Vikas

The answers I know are:

(a) Human input is required in question formation right now.

(b) Apparently not.

Going back to your post, I believe that in order to realize the hope that a technique provides a significant edge in practice, one has to look at the precise mechanisms, optimization, scalability issues, etc., since all of these become major factors in practice. Another question is about assumptions for semi-supervised learning. My main intuition is that semi-supervised learning algorithms (e.g., transductive SVM, graph regularization methods) work well if the so-called cluster or manifold assumptions hold true. Can one interpret co-training/structural learning in those terms? A particular semi-supervised method is likely to work better than others on a domain if the assumptions it makes are satisfied to a greater degree on that domain.

I don’t know how to interpret co-training/structural learning in those terms.