John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. I’m not alone: the paper won the best student paper award at NIPS in 2003.
There are some things I find odd about the paper. For instance, it says of probabilistic models
“cannot handle high dimensional feature spaces and lack strong theoretical guarrantees.”
I’m aware of no such limitations. Also:
“Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.”
This is quite interesting and contradicts my own experience as well as that of a number of people I greatly
respect. I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with?
The elegance of M3N, I think, is unrelated to this probabilistic/margin distinction. M3N provided the first implementation of the margin concept that was computationally efficient for multiple output variables and provided a sample complexity result with a much weaker dependence than previous approaches. Further, the authors carry out some nice experiments that speak well for the practicality of their approach. In particular, M3N’s outperform Conditional Random Fields (CRFs) in terms of per-variable (Hamming) loss. And I think this gets us to the crux of the matter, and ties back to John’s post. CRFs are trained by a MAP approach that is effectively per sequence, while the loss function at run time we care about is per variable.
The mismatch the post title refers to is that, at test time, M3N’s are viterbi decoded: a per sequence decoding. Intuitively, viterbi is an algorithm that only gets paid for its services when it classifies an entire sequence correctly. This seems an odd mismatch, and makes one wonder: How well does a per-variable approach like the variable marginal likelihood approach mentioned previously of Roweis,Kakade, and Teh combined with runtime belief propagation compare with the M3N procedure? Does the mismatch matter, and if so, is there a more appropriate decoding procedure like BP, appropriate for margin-trained methods? And finally, it seems we need to answer John’s question convincingly: if you really care about per-variable probabilities or classifications, isn’t it possible that structuring the output space actually hurts? (It seems clear to me that it can help when you insist on getting the entire sequence right, although perhaps others don’t concur with that.)