Suppose you are given a sequence of observations x1,…,xT from some space and wish to predict a sequence of labels y1,…,yT so as to minimize the Hamming loss: sumi=1 to T I(yi != c(x1,…,xT)i) where c(x1,…,xT)i is the ith predicted component. For simplicity, suppose each label yi is in {0,1}.
We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component yi independently since the loss is a linear combination of losses on each individual component i. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of Te. This breakup into T different prediction problems is not the standard form of modularity in structured prediction.
A more typical form of modularity is to predict yi given xi, yi-1, yi+1 where the circularity (predicting given other predictions) is handled in various ways. This is often represented with a graphical model like so:
This form of modularity seems to be preferred for several reasons:
- Graphical models of this sort are a natural language for expressing what we know (or believe we know) about a problem in advance.
- There may be computational advantages to learning to predict from fewer features. (But note that handling the circularity is sometimes computationally difficult.)
- There may be sample complexity advantages to learning to predict from fewer features. This is particularly true for many common learning algorithms.
The difficulty with this approach is that “errors accumulate”. In particular, an average error rate of e for each of the predictors can easily imply a hamming loss of O(eT2). Matti Kaariainen convinced me this is not improvable for predictors of this form.
So, we have two forms of modularity. One is driven by the loss function while the other driven by simplicity of prediction descriptions. Each has advantages and disadvantages from a practical viewpoint. Can these different approaches be reconciled? Is there a compelling algorithm for solving structured prediction which incorporated both intuitions?