Suppose you are given a sequence of observations x_1,…,x_T from some space and wish to predict a sequence of labels y_1,…,y_T so as to minimize the Hamming loss: \sum_{i=1}^T I(y_i \neq c(x_1,…,x_T)_i), where c(x_1,…,x_T)_i is the i-th predicted component. For simplicity, suppose each label y_i is in {0,1}.
We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component y_i independently, since the loss is a linear combination of the losses on the individual components. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of eT. This breakup into T different prediction problems is not the standard form of modularity in structured prediction.
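In code, this reduction is about as simple as it sounds. Here is a minimal sketch, assuming fixed-length sequences and using scikit-learn's LogisticRegression as a stand-in for an arbitrary base classifier (any binary learner would do):

```python
# Minimal sketch of the per-component reduction (illustration only):
# one classifier per position i, each trained on the *entire* observation
# sequence x_1..x_T to predict y_i.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_per_position(X_seqs, Y_seqs):
    """X_seqs: (n, T, d) observation sequences; Y_seqs: (n, T) 0/1 labels.
    Returns a list of T classifiers, one per position."""
    n, T, d = X_seqs.shape
    flat = X_seqs.reshape(n, T * d)   # every classifier sees all of x_1..x_T
    return [LogisticRegression(max_iter=1000).fit(flat, Y_seqs[:, i])
            for i in range(T)]

def predict_per_position(classifiers, X_seqs):
    n, T, d = X_seqs.shape
    flat = X_seqs.reshape(n, T * d)
    return np.stack([clf.predict(flat) for clf in classifiers], axis=1)

# If each classifier has error rate e on its own problem, linearity of
# expectation gives an expected Hamming loss of at most e*T.
```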
A more typical form of modularity is to predict y_i given x_i, y_{i-1}, y_{i+1}, where the circularity (predicting given other predictions) is handled in various ways. This is often represented with a chain-structured graphical model: each label y_i is connected to its observation x_i and to the neighboring labels y_{i-1} and y_{i+1}.
This form of modularity seems to be preferred for several reasons:
- Graphical models of this sort are a natural language for expressing what we know (or believe we know) about a problem in advance.
- There may be computational advantages to learning to predict from fewer features. (But note that handling the circularity is sometimes computationally difficult.)
- There may be sample complexity advantages to learning to predict from fewer features. This is particularly true for many common learning algorithms.
The difficulty with this approach is that “errors accumulate”. In particular, an average error rate of e for each of the predictors can easily imply a Hamming loss of O(eT^2). Matti Kaariainen convinced me this is not improvable for predictors of this form.
So, we have two forms of modularity. One is driven by the loss function while the other is driven by the simplicity of prediction descriptions. Each has advantages and disadvantages from a practical viewpoint. Can these different approaches be reconciled? Is there a compelling algorithm for solving structured prediction which incorporates both intuitions?
I am not an expert in this domain, but I am interested in this problem. However, like you were before, I am not really convinced by the Hamming loss of O(eT^2). Do you have some sort of proof or paper which describes how you arrived at it?
I should have explained that more thoroughly. The basic idea is that one factor of T comes because there are T classification problems. The other factor of T comes because an error, once introduced, can propagate down the entire length of the sequence.
Perhaps Matti can post a more thorough version of this intuition.
The intuition is roughly as John explained, but it breaks down if T is “large” compared to 1/e, because the propagating errors start interacting and cancelling each other out. Actually, it is sort of evident that a lower bound behaving like e*T^2 cannot hold in general, as the Hamming loss is always bounded by T.
The exact lower bound for the expected Hamming loss is T/2 - [(1-2e) - (1-2e)^{T+1}]/(4e), which behaves roughly like 0.5*e*T^2 for “small” T. For “large” T, it is about T/2. The lower bound applies if the reduction is forced to respect the dependence structure depicted in John’s original post and predict y_i based on x_i and the prediction for y_{i-1} alone. The proof is by exhibiting a learning problem respecting the comb structure on which it is impossible to do any better.
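One easy way to sanity-check the closed form (this is not the lower-bound construction itself, just a toy process whose expected Hamming loss happens to equal the same expression): start in agreement with the truth, and let each of the T local predictors independently toggle that agreement with probability e.

```python
# Toy check of the closed form above (not the actual lower-bound construction):
# the prediction chain starts in agreement with the truth, and each of the T
# steps flips that agreement independently with probability e.
import random

def closed_form(e, T):
    return T / 2 - ((1 - 2 * e) - (1 - 2 * e) ** (T + 1)) / (4 * e)

def simulate(e, T, trials=100_000):
    total = 0
    for _ in range(trials):
        agree, hamming = True, 0
        for _ in range(T):
            if random.random() < e:   # a local mistake toggles agreement
                agree = not agree
            hamming += not agree      # count positions predicted incorrectly
        total += hamming
    return total / trials

e, T = 0.005, 20
print(closed_form(e, T))   # ~0.5*e*T^2 = 1.0 when e*T is small
print(simulate(e, T))      # Monte Carlo estimate, should be close
```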
With the admission that I’m not 100% positive I’m understanding John’s suggestion to optimize the individual components (though I think I have it right), I’m curious how such an approach would work for models with more complex “graphical structure.” Linear chains are nice and simple, but they pretty much represent the simplest possible structure above simple classification, and are really only applicable to a very small number of problems. What happens, for instance, if you have two chains (à la Sutton+McCallum ICML04), or long-distance dependencies (Sutton+McCallum, ICML Workshop on Relational Learning 04), or even more complex models such as the nearly fully connected graphs you encounter in record linkage? In these cases (and going to even harder problems, which I would finally describe as “real” problems), the GM framework is nice primarily (IMO) because of John’s reason #1 (with the addendum that we have some sense of how to do inference in general models, even if not efficiently). How could you do this in the reductionist sense?
Just to throw something else into the mix: when you try to train a structured predictor (such as a directed or undirected chain), you can explicitly take into account the possibility of making errors on previous/future labels and somewhat avoid the extra factor of T that comes from error propagation. One very simple way to do this is to optimize an objective function closer to the Hamming loss John proposed, namely, in a probabilistic model, optimize \sum_i log p(y_i|X) as opposed to the traditional log p(Y|X). This is exactly the difference between maximizing the probability of getting *all* the labels exactly right and maximizing the *expected number of correct labels* (implicitly summing over all possible settings of the other labels you aren’t considering). Sham Kakade, YeeWhye Teh and I wrote a paper about this at ICML02 (Australia) which is available here: http://www.cs.toronto.edu/~roweis/papers/newcost_draft.ps.gz
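To make the difference concrete, here is a toy sketch (not the training procedure from the paper, just an evaluation of the two objectives) on a short chain with made-up HMM parameters; the per-position marginals p(y_i|X) come from the standard forward-backward recursions:

```python
# Toy illustration of the two objectives on a short chain with made-up HMM
# parameters: log p(Y|X) scores the whole label sequence jointly, while
# sum_i log p(y_i|X) scores each label against its posterior marginal.
import numpy as np

pi = np.array([0.6, 0.4])                     # initial state distribution (toy)
A  = np.array([[0.8, 0.2], [0.3, 0.7]])       # transition matrix A[s, s'] (toy)
B  = np.array([[0.9, 0.1], [0.2, 0.8]])       # emission matrix B[s, o] (toy)

def posteriors_and_loglik(obs):
    """Forward-backward: per-position marginals p(y_t | X) and log p(X)."""
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S)); beta = np.ones((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    log_px = np.log(alpha[-1].sum())
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # p(y_t = s | X)
    return gamma, log_px

def joint_objective(obs, labels):
    """log p(Y | X) = log p(X, Y) - log p(X)."""
    log_pxy = np.log(pi[labels[0]]) + np.log(B[labels[0], obs[0]])
    for t in range(1, len(obs)):
        log_pxy += np.log(A[labels[t - 1], labels[t]]) + np.log(B[labels[t], obs[t]])
    return log_pxy - posteriors_and_loglik(obs)[1]

def marginal_objective(obs, labels):
    """sum_i log p(y_i | X): the Hamming-flavored objective."""
    gamma, _ = posteriors_and_loglik(obs)
    return sum(np.log(gamma[t, labels[t]]) for t in range(len(obs)))

obs, labels = [0, 0, 1, 1, 0], [0, 0, 1, 1, 0]
print(joint_objective(obs, labels), marginal_objective(obs, labels))
```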
[Incidentally, after that ICML, John, Sham and I toured around Australia for a bit, and those two showed me that they are just as hardcore when climbing mountains as when deriving equations.]
The ICML02 paper can be understood as optimizing Hamming loss rather than optimizing to get the entire sequence right. That’s certainly useful in practice, but it doesn’t dodge the lower bound in general.
For Hal’s question about more complex structures, the (trivial) reduction solution outlined (learn based on _all_ the x to predict each y_i) seems to be what the mathematics suggests. How reasonable this is in practice seems unclear. A very large amount of work has gone into optimizing the graphical models approach, so it would be nice to reconcile these viewpoints and win from both strengths.
I may be misunderstanding here, but it seems to me that inference in the HMM carefully combines both kinds of modularity. I’d love to see if/where I’m confused:
Offline inference in the HMM (assert all the x_i, find beliefs about the y_i) is very close to the first scenario: you end up with a marginal over each y_i, which depends on every x_i, and you can derive a separate predictor for each y_i (based indirectly on every x_i) from this, according to whatever loss function you like. However, the dependence on every x_i takes a particularly nice form (thanks to the independence assumptions in the HMM, exploited by e.g. sum-product): for predicting y_i, all we need are x_i and our complete beliefs about y_{i-1} and y_{i+1}. This is a weaker statement of modularity than “y_i can be predicted from x_i and a single prediction for y_{i-1}”, but the actual predictions should still be optimal.
Online inference, on the other hand (successively observe each x_i and make predictions about the corresponding y_i), comes closer to the prediction approach with the bad lower bound. It’s slightly different: one makes predictions based on a distribution over y_{i-1}, not a single best prediction, in order to retain correctness of the final answer. When one does approximate inference, however, because the exact distribution over y_{i-1} isn’t represented, one runs into precisely the error propagation problem you mentioned. I’m not a DBN expert, but I thought the purpose of e.g. the Boyen-Koller algorithm was to exploit specific conditions where this error propagation can be controlled. Its bound depends on the mixing rate of the hidden chain, corresponding to the idea that errors in prediction ‘wash out’ at a certain rate.
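For concreteness, a small sketch (same kind of made-up HMM parameters as in the earlier snippet) contrasting exact filtering, which carries the full belief over y_{i-1} forward, with a hard-decision variant that commits to a single label at each step, which is the setting the lower bound speaks to:

```python
# Contrast (illustration only, toy HMM parameters): exact filtering propagates
# the full belief over the previous label; the hard-decision variant keeps only
# a single committed prediction, which is where error propagation bites.
import numpy as np

pi = np.array([0.6, 0.4])
A  = np.array([[0.8, 0.2], [0.3, 0.7]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])

def filter_predict(obs):
    """Exact filtering: belief b_t(s) = p(y_t = s | x_1..x_t)."""
    belief = pi * B[:, obs[0]]
    belief /= belief.sum()
    preds = [int(belief.argmax())]
    for x in obs[1:]:
        belief = (belief @ A) * B[:, x]
        belief /= belief.sum()
        preds.append(int(belief.argmax()))
    return preds

def hard_decision_predict(obs):
    """Commit to a single label each step and condition only on that label."""
    y = int((pi * B[:, obs[0]]).argmax())
    preds = [y]
    for x in obs[1:]:
        y = int((A[y] * B[:, x]).argmax())  # only the previous *prediction* survives
        preds.append(y)
    return preds

obs = [0, 1, 1, 0, 1, 1, 1, 0]
print(filter_predict(obs))
print(hard_decision_predict(obs))
```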
Is there something in the semantics of the modularity statements you were making that I’m missing?
There are two difficulties with the HMM approach.
1) It optimizes the wrong thing. Sam’s paper discusses a fix for this which I recommend reading.
2) It accumulates error. At each step in an HMM, there are probability tables governing emission and transition probabilities. These probability tables are nearly always wrong in practice (sometimes they are “a lot wrong” and other times “nearly right”). These wrong probabilities can accumulate quickly to give poor predictions via the HMM inference mechanism. Avoiding this is what half of this post is about. Restated, the HMM inference mechanism is not robust to either perturbations or assumption failure.
Ah, I see. I suppose if you worked in the “HMM with unknown parameters” model, did full inference, and chose predictions via minimum Hamming error, you’d preserve the first kind of modularity you mentioned, at least in principle: predictions for label i would be based on just its marginal. However, you’d lose on the second kind of modularity, since the hidden parameters would end up coupling everything. Thanks.
I have two examples that I think break the “optimize each output separately” paradigm, but I may be missing something.
(1) Consider a sequence labeling problem where we can essentially divide the output alphabet into two “partitions,” but where there is a bijection between the two (i.e., alphabet_1 = {A_1, B_1, …, Z_1} and alphabet_2 = {A_2, B_2, …, Z_2}). Suppose also that any valid output sequence will contain tags from exactly one alphabet. Now, suppose that there is some input element that deterministically selects an element from alphabet_1, some element that deterministically selects an element from alphabet_2, and each input example contains exactly one of these. Now, regardless of what your loss function is, if you optimize on a per-label basis, you are going to have to include features for every position in the input in order to learn to predict the output (otherwise you might miss the deterministic input element and get every other label wrong with 50% probability). This will lead to an enormous number of features, which can lead to bad generalization. On the other hand, if we optimize the full structured output, it should be easy to learn the deterministic element, and then, using neighborhoods of size 1, we will always predict with the right alphabet. (I know I’ve done a poor job of explaining this, but hopefully it is possible to reconstruct what I mean from what I’ve said.)
(2) What if the output is self-constraining? There are two simple examples of this. One is the labeling+segmenting problem, for instance syntactic chunking. Here we have the constraint that, for instance, a label I-NP cannot follow anything but B-NP or I-NP (in particular, it cannot follow B-VP or I-VP). Another example is the supervised clustering problem, where we are trying to learn to partition a set into clusters; in this case, we cannot say that elements x_1 and x_2 are in the same cluster, x_2 and x_3 are in the same cluster, but x_1 and x_3 are in different clusters. Sure, you can treat the outputs of such predictors as noisy and use error-correction techniques to repair them, but I believe this opens up a bigger can of worms than optimizing against the entire prediction.
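For illustration, a tiny checker for the chunking constraint (standard BIO-style tag names; the helper functions are hypothetical):

```python
# Illustration of the self-constraining outputs in BIO-style chunking:
# "I-X" may only follow "B-X" or "I-X" of the same chunk type.
def valid_transition(prev_tag, tag):
    if not tag.startswith("I-"):
        return True                     # "O" and "B-*" are always allowed
    chunk_type = tag[2:]
    return prev_tag in ("B-" + chunk_type, "I-" + chunk_type)

def valid_sequence(tags):
    return all(valid_transition(p, t) for p, t in zip(["O"] + tags, tags))

print(valid_sequence(["B-NP", "I-NP", "B-VP", "I-VP", "O"]))   # True
print(valid_sequence(["B-VP", "I-NP"]))                        # False: I-NP after B-VP
```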
Hal’s (1) is pointing out that for some learning algorithms, doing “optimize each entry separately” is a bad idea because it leads to severe overfitting. This is correct, but not fundamentally so. In particular, there exist learning algorithms capable of coping with an arbitrary number of features that are robust against overfitting.
Hal’s point (2) seems to be that when the “real” loss isn’t defined by Hamming loss, “optimize each separately” isn’t optimal. That is correct.
I think point 1 is even more subtle than just overfitting. I believe that you must either (a) do structured learning or (b) [effectively] use a poly2 kernel. Let me try to be more concrete. Say our input alphabet is {A1, A2, B, C} and our output alphabet is {a1, a2, b1, b2, c1, c2}. Suppose that A1 always gets label a1, A2 always gets a2, B gets one of {b1, b2}, and C gets one of {c1, c2}. Every input sequence always contains exactly one of A1 and A2, and any output sequence must contain either only 1-subscripted labels or only 2-subscripted labels.
If we use structured learning with standard “emission” features and 1st order Markov features (with these disregarding the input entirely), we can learn to predict with 100% accuracy.
However, if we predict one position at a time and we get an input sequence like:
A1 B C B B B B C B C
For predicting that the final C must be “c1” instead of “c2”, we’d need to consider a cross-product of features, so that we have a feature on the last element that says something like “our input element is C *and* there is an A1 in the sequence.” Of course, if we know that A1 is the deterministic element and build this into our model, this is trivial and actually reduces the feature space (since we needn’t care about the Markov features). OTOH, if we don’t know which elements behave like A1 and A2, we need to consider a cross-product feature set (I believe).
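For illustration, roughly what that cross-product feature looks like in code (Hal's toy alphabet; the feature names are made up):

```python
# Sketch of the cross-product features described above: a per-position
# classifier conjoins the local token with a global "which of A1/A2 appears
# anywhere" indicator in order to resolve b1 vs b2 and c1 vs c2.
def local_features(sequence, i):
    token = sequence[i]
    global_flag = "A1" if "A1" in sequence else "A2"
    return {
        "token=" + token: 1.0,
        # the conjunction feature that does the real work:
        "token=%s&seq_has=%s" % (token, global_flag): 1.0,
    }

seq = ["A1", "B", "C", "B", "B", "B", "B", "C", "B", "C"]
print(local_features(seq, len(seq) - 1))
# {'token=C': 1.0, 'token=C&seq_has=A1': 1.0}
```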
I’m not entirely understanding the import of Hal’s point (which appears correct). I don’t fear operating in the cross-product feature space.