John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. I’m not alone: the paper won the best student paper award at NIPS in 2003.

There are some things I find odd about the paper. For instance, it says of probabilistic models

“cannot handle high dimensional feature spaces and lack strong theoretical guarrantees.”

I’m aware of no such limitations. Also:

“Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.”

This is quite interesting and contradicts my own experience as well as that of a number of people I greatly

respect. I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with?

The elegance of M3N, I think, is unrelated to this probabilistic/margin distinction. M3N provided the first implementation of the margin concept that was computationally efficient for multiple output variables and provided a sample complexity result with a much weaker dependence than previous approaches. Further, the authors carry out some nice experiments that speak well for the practicality of their approach. In particular, M3N’s outperform Conditional Random Fields (CRFs) in terms of per-variable (Hamming) loss. And I think this gets us to the crux of the matter, and ties back to John’s post. CRFs are trained by a MAP approach that is effectively per sequence, while the loss function at run time we care about is per variable.

The mismatch the post title refers to is that, at test time, M3N’s are viterbi decoded: a per sequence decoding. Intuitively, viterbi is an algorithm that only gets paid for its services when it classifies an entire sequence correctly. This seems an odd mismatch, and makes one wonder: How well does a per-variable approach like the variable marginal likelihood approach mentioned previously of Roweis,Kakade, and Teh combined with runtime belief propagation compare with the M3N procedure? Does the mismatch matter, and if so, is there a more appropriate decoding procedure like BP, appropriate for margin-trained methods? And finally, it seems we need to answer John’s question convincingly: if you really care about per-variable probabilities or classifications, isn’t it possible that structuring the output space actually hurts? (It seems clear to me that it can help when you insist on getting the entire sequence right, although perhaps others don’t concur with that.)

The CRFs labeling algorithm computes the MAP label sequence. If you’re interested in getting the best label for each element in the sequence, then you should instead compute the maximum marginals. In other words, for each element in the sequence, compute the most likely label (marginalizing over the rest of the labels in the sequence). This should improve performance on a per-variable cost function.

Theoretically, computing marginals should give you all the benefits of structuring the output space and none of the disadvantages (aside from practical ones).

The M3N cost function optimizes a per-variable score, so it makes sense that it performs well on this score. The labeling is defined as a maximization w.r.t. the entire label sequence, so it makes sense that a Viterbi-like method would work for this. The interesting observation is that defining an objective function akin to max-marginals might perform better.

Just to be a little pedantic: there are two kinds of BP, max-product and sum-product. Viterbi is a form of max-product, and Forward-Backward is a form of sum-product. It seems that the difference between these two approaches (sequence likelihood vs. marginal likelihood-per-variable) is central to this discussion.

The comments about “high-dimensional feature spaces” and “strong theoretical guarantees” in the M3N paper seem to saying that CRFs don’t have SVM-like properties. “High-dimensional feature spaces” seems to refer to the Hilbert spaces implicit in the kernel trick, although I don’t see why using Hilbert spaces is desirable

a priori. (Kernel functions in Gaussian Processes achieve similar behavior to the kernel trick, at least in some cases). “Strong theoretical guarantees” appears to refer to existence of generalization bounds, which are interesting, but more interesting to frequentists than Bayesians.Hmmm. Perhaps I was unclear: I am using BP as synomymous with sum-prod and viterbi as synomous with max-prod. Other people mean different things for BP.

Exactly what I was suggesting. Maximizing the marginal likelihoods is the train time equivalent of this and is exactly what the Roweis,Kakade, and Teh paper recommends.

To me CRFs define a training procedure, not an inference procedure. You can run Viterbi or BP at inference time depending on what you want as an answer. The paper mentioned describes a different training procedure that better matches (?) the BP for test time procedure.

It’s possible we’re on the same page: to me Hamming loss (or variants) for training and viterbi for inference seem to be rather different. I agree that a different

inference procedure might be right, but to my mind M3N is optimizing the right thing given the over-all objective. There is something slightly funny going on here…

Right– there is nothing that keeps any of the probabilistic models from being kernelized, from logitistic regression to CRFs, which is why I find the comment odd. As for the bounds, I don’t see any reason that a max-margin method should have tighter bounds than the probabilistic models: they are rather close. I defer to John on such discussion, but I presume it’s really attributable to a cultural difference not anything fundemental.

Aha, sorry for missing the subtext. For what it’s worth, I’ve heard a number of people in the computer vision community use BP as synonymous with max-product!

I just looked at the Teh et al. paper. They are doing something that looks funny to me, namely, learning by optimizing the max marginals (C_1) rather than the likelihood of the sequence (C_0). This doesn’t seem consistent from a purely probabilistic view. In learning, our goal should be to get the best (most likely) estimate of the model parameters given the data, which entails maximizing C_0, the likelihood of the training data (assuming uniform priors). Then, max-marginals (C_1) can be computed on new data, in order to get the best estimate for each variable.

Of course, if one’s notion of “best” is determined by test-set performance, then one might not want the MAP model parameters after all, one might want to optimize a cost function like C_1. Still, I wonder if the max-marginal cost function (C_1) skews the learning procedure at all towards a less-likely set of model parameters. This cost function apparently ignores information about adjacent labels in the training sequence at each variable and may thus choose a weaker model than is justified. In other words, there is no uncertainty in the training labels, so C_0 seems more appropriate for learning. (If there is uncertainty in the labels, this should be modeled separately).

I suppose that you could argue that maximizing C_1 is an approximation to optimizing the expected cost on new data, provided that your training set is a fair sample in some sense. The paper argues something related to this.

Your right: different communities use BP differently. Sum-product is less ambigious, but certainly less sexy as well. Sigh…

The optimization your calling C_1 appears to me to be the consistent thing to do if you are going to try to infer marginals at test time. (I.e. you ought to maximize the marginal probabilities not the joint if you care about the marginal probabilities and not the joint.) I agree that I see no probabilistic argument that backs that up.

Interestingly, I believe M3N does the opposite of what the standard probabilistic algorithm does: it maximizes the “marginals” (via a per node loss function) at training, but does inference by the sequence loss. That’s a pretty stark contrast and I’d much appreciate some clear intuition on it.

For the record, I disagree with this and think many other people do as well. In many situations, the goal really is good prediction according to some other loss function. For those situations, telling your boss “I can’t predict very well but I got the most likely model parameters” is not satisfying to either of you. Achieving that situation should not be your goal.

Yes, you’re right, I should have qualified that with “from a probabilistic point of view” or something similar. FWIW, I did refer later to that goal, namely, optimizing with respect to expected loss (cost).

I suppose what I am really asking is: even if your goal is to optimize C_1 at test time, could learning w.r.t. C_0 still give better performance? For one thing, optimizing C_1 on the training set isn’t what you really want; what you want is C_1 on the (as yet unseen) test set. Optimizing C_0 gives the parameters that best explain the data, and I conjecture that this set of parameters will generalize better than ones that optimize C_1. Perhaps the results depend on the data set and whether the training set is a fair sample in some sense. It would be intriguing to see how the two approaches compare. It is possible that optimizing C_0 will be more prone to overfitting, although the intuition for this is fuzzy and may also depend on the quantity of training data.

Aaron and others,

As far I understood what you mean by C_0 and C_1, this distinction is sometimes called

generativevs.discriminativelearning. The difference between them has been considered in various contexts in papers such as Ng and Jordan (NIPS-2001). You can find more references in an upcoming paper of ours in Machine Learning.The general message is that generative tends to be better for small sample sizes but discriminative is asymptotically (for large sample sizes) never worse and often better. The generative approach has problems when the model is misspecified.

I believe this discussion is quite orthogonal to generative vs. discriminative. All cost functions under consideration here including what Aaron is calling C_0, C_1, the regular margin and the per-variable margin are discriminative. The question seems to revolve around per-sequence vs. per-variable and whether testing and training should match.

Drew,

I realized that the main issue in the discussion involved more refined distinctions than generative vs. discriminative. Anyway, I would argue that comments 5, 7, and 8 are related to it (correct me if I’m misinterpreting your comments). In particular, I would argue that C_0 (optimizing ”the likelihood of the sequence”) is generative. It agrees with discriminative only if the loss function is log-loss.

I don’t understand your use of “generative” then. Generative to me implies a proper generative probabilistic model: here all models are

conditional.

Oops, I made a mistake, conditional likelihood maximization

isdiscriminative (at least ‘more discriminative’ than joint likelihood maximization).Maybe I still see everything through my ‘generative-discriminative’ glasses. (I’ve got my hammer, where are the nails!?) Going back to Aaron’s comment 8, I believe C_0 is in fact

lessprone to overfitting than C_1. To me, C_0 has a generative ‘flavor’ in that, as Aaron says, “from a probabilistic point of view” we should estimate the model and use it for prediction, which basically amounts to C_0. (As you mentioned Radford Neal, I noticed that he uses conditional likelihood in a Bayesian way in the Hard Linear Classifier example in his NIPS-2004 tutorial. I daresay, most Bayesians are very generative-minded creatures.) However, in general, minimizing training set loss, which more or less amounts to the same as maximizing the per-symbol margin, is asymptotically never worse and sometimes better than maximizing conditional likelihood, although it may over-fit for small sample sizes. Per-sequence margin and C_1 should be somewhere inbetween.Slightly off topic, this opens up a question I’ve been curious about for a while: when is the right time to insert the loss function. The Bayesian in me says to do it during prediction, in a decision theoretic perspective. That is, estimate your posteriors and then compute the minimum Bayes risk solution, dependent on the loss/utility function we care about (importantly, one doesn’t have to be Bayesian to take this approach).

On the other hand, it is also very reasonable to insert the loss function into the training criterea and optimize weights so as to minimize the training loss (or some approximation thereof) plus a regularization term.

Which is right? Which is better? How can we combine them?

Regarding Hal’s question: you can fertilize your likelihood function with the loss function. Kenji Yamanishi and Jorma Rissanen have written on this topic, but in the context of MDL. One can take their expressions almost verbatim and make them into likelihood functions. For a subjective Bayesian the likelihood function is one of the subjective but explicit assumptions anyway.

Aleks — Can you provide a reference for the Yamanishi and Rissanen paper (or are you citing multiple papers by each individually)? I can’t find a jointly authored paper anywhere.

Hal – here are three possible starting points:

Kenji Yamanishi: A Decision-Theoretic Extension of Stochastic Complexity and Its Applications to Learning. IEEE Transactions on Information Theory 44(4): 1424-1439 (1998)

Jorma Rissanen. Complexity of Nonlogarithmic Loss Functions, IEEE Trans. Information Theory, vol. 49(2), pp. 476-484

Peter Grunwald’s thesis (see the stuff on entropification).

When it comes to Bayesian inference, H. Rubin’s has a paper that ends with “the non-separability of utility from prior”, and he stated “likelihood function must come from the investigator, not the statistician, as well as the loss-prior combination”. So, the choice of the prior depends on your loss function. Secondly, one can conjecture that the likelihood function is very similar to a loss-of-parameters-on-the-data, and use the above stochastic complexity-like expressions to replace the likelihood function.

Can anyone explain to me why CRFs seem to be predominantly used for tasks with discrete input spaces and trivially determinable segmentation (e.g. natural language processing)? More specifically, is there a reason they don’t perform well on large scale sequence labeling tasks with ambiguous segmentation and real-valued inputs (such as continuous speech recognition or cursive handwriting recognition)? Or maybe they do and I can’t find the papers…

The M3N paper mentions difficulties with high dimensional input spaces (although the posts here dispute that). Elsewhere I read that CRFs require explicit feature representations. Are either of these relevant here?

This paper seems to address both continuous features and ambiguous segmentation with CRF

http://www.cis.hut.fi/inips2005/inips2005-do.pdf

Standard CRF is a form of logistic regression, so there’s nothing limiting it to discrete input space. Output space on other hand is more limited because you need to be able to sum over all possible outputs efficiently to compute the gradient. Some people however are able to be pretty flexible in the choice of output space by using approximations to the gradient (ie, Chapter 5 of http://www.cs.toronto.edu/~liam/pubs/thesis.pdf)

Well, one reason there have been so many NLP CRF applications is that the first people who worked on CRFs were in NLP. Since then, there have been applications to many other areas (all discrete outputs, some continuous inputs. Several people have applied CRFs to computer vision. Additionally, the standard technique for automatic speech recognition (training an HMM by maximum mutual information) can be viewed as training a “directed” CRF. Finally, several people at Microsoft (Paul Viola, Martin Szummer, Tom Minka, and others) have applied Bayesian CRFs to recognizing digital ink.

Finally, it is possible to kernelize CRFs (just as it is logistic regression). See this paper: http://www.cs.cmu.edu/~lafferty/pub/kcrf.ps

Excuse me, I am confused with the definition of hamming loss, for example, for 1st order markov chain model , what is the hamming loss between 2 chains, the number of different edges or the number of different nodes? If the former one is true, how about the hamming loss between 2 single nodes? Thanks!

For example,

hamming loss of “0-0-0” and “1-1-1” = 2 or 3?

hamming loss of “0” and “1” = 0 or 1?

I don’t fully get the question, but for the specific examples:

3

1