Maximum Margin Mismatch?

John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and the M3N paper is one of my favorites. I’m not alone: the paper won the best student paper award at NIPS in 2003.

There are some things I find odd about the paper. For instance, it says of probabilistic models:

“cannot handle high dimensional feature spaces and lack strong theoretical guarantees.”

I’m aware of no such limitations. Also:

“Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.”

This is quite interesting and contradicts my own experience as well as that of a number of people I greatly respect. I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with?

The elegance of M3N, I think, is unrelated to this probabilistic/margin distinction. M3N provided the first implementation of the margin concept that was computationally efficient for multiple output variables, and provided a sample complexity result with a much weaker dependence than previous approaches. Further, the authors carry out some nice experiments that speak well for the practicality of their approach. In particular, M3Ns outperform Conditional Random Fields (CRFs) in terms of per-variable (Hamming) loss. And I think this gets us to the crux of the matter, and ties back to John’s post. CRFs are trained by a MAP approach that is effectively per sequence, while the loss function we care about at run time is per variable.

The mismatch the post title refers to is that, at test time, M3Ns are Viterbi decoded: a per-sequence decoding. Intuitively, Viterbi is an algorithm that only gets paid for its services when it classifies an entire sequence correctly. This seems an odd mismatch, and makes one wonder: how well does a per-variable approach, like the per-variable marginal likelihood approach of Roweis, Kakade, and Teh mentioned previously, combined with runtime belief propagation, compare with the M3N procedure? Does the mismatch matter, and if so, is there a decoding procedure (like BP) more appropriate for margin-trained methods? And finally, it seems we need to answer John’s question convincingly: if you really care about per-variable probabilities or classifications, isn’t it possible that structuring the output space actually hurts? (It seems clear to me that it can help when you insist on getting the entire sequence right, although perhaps others don’t concur with that.)
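To make the decoding question concrete, here is a minimal sketch contrasting per-sequence (Viterbi/MAP) decoding with per-variable (posterior marginal) decoding on a toy two-label chain. The potentials are made up purely to produce a disagreement; this is not M3N or CRF training, just the two decoding rules placed side by side.

```python
# Toy comparison of per-sequence (Viterbi/MAP) decoding versus
# per-variable (posterior marginal) decoding on a small chain model.
import numpy as np

def logsumexp_cols(m):
    """Log-sum-exp down each column of a 2-D array."""
    mx = m.max(axis=0)
    return mx + np.log(np.exp(m - mx).sum(axis=0))

def viterbi(log_unary, log_trans):
    """MAP decoding: argmax over whole label sequences."""
    T, K = log_unary.shape
    score = log_unary[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans + log_unary[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    y = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

def marginal_decode(log_unary, log_trans):
    """Per-variable decoding: argmax of each marginal P(y_t | x), via forward-backward."""
    T, K = log_unary.shape
    fwd = np.zeros((T, K))
    bwd = np.zeros((T, K))
    fwd[0] = log_unary[0]
    for t in range(1, T):
        fwd[t] = log_unary[t] + logsumexp_cols(fwd[t - 1][:, None] + log_trans)
    for t in range(T - 2, -1, -1):
        bwd[t] = logsumexp_cols((log_trans + log_unary[t + 1] + bwd[t + 1]).T)
    return list((fwd + bwd).argmax(axis=1))

# Potentials chosen so the two decodings disagree: the single best sequence
# starts with label 0, but label 1 carries more total mass at the first position.
log_unary = np.log(np.array([[0.4, 0.6],
                             [0.5, 0.5]]))
log_trans = np.log(np.array([[0.9, 0.1],
                             [0.5, 0.5]]))
print("Viterbi (per-sequence) :", viterbi(log_unary, log_trans))         # [0, 0]
print("Marginal (per-variable):", marginal_decode(log_unary, log_trans)) # [1, 0]
```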

Bad ideas

I found these two essays on bad ideas interesting. Neither of these is written from the viewpoint of research, but they are both highly relevant.

  1. Why smart people have bad ideas by Paul Graham
  2. Why smart people defend bad ideas by Scott Berkun (which appeared on slashdot)

In my experience, bad ideas are common and overconfidence in ideas is common. This overconfidence can take the form of either excessive condemnation or excessive praise. Some of this is necessary to the process of research. For example, some overconfidence in the value of your own research is expected and probably necessary to motivate your own investigation. Since research is a rather risky business, much of it does not pan out. Learning to accept when something does not pan out is a critical skill which is sometimes never acquired.

Excessive condemnation can be a real ill when it’s encountered. This has two effects:

  1. When the penalty for being wrong is too large, it means people have a great investment in defending “their” idea. Since research is risky, “their” idea is often wrong (or at least in need of amendment).
  2. A large penalty implies people are hesitant to introduce new ideas.

Both of these effects slow the progress of research. How much, exactly, is unclear and would be very difficult to measure.

While it may be difficult to affect the larger community of research, you can and should take these considerations into account when choosing coauthors, advisors, and other people you work with. The ability to say “oops, I was wrong”, have that be accepted without significant penalty, and move on is very valuable for the process of thinking.

Running A Machine Learning Summer School

We just finished the Chicago 2005 Machine Learning Summer School. The school was 2 weeks long with about 130 (or 140 counting the speakers) participants. For perspective, this is perhaps the largest graduate level machine learning class I am aware of anywhere and anytime (previous MLSSs have been close). Overall, it seemed to go well, although the students are the real authority on this. For those who missed it, DVDs will be available from our Slovenian friends. Email Mrs Spela Sitar of the Jozsef Stefan Institute for details.

The following are some notes for future planning and those interested.

Good Decisions

  1. Acquiring the larger-than-necessary “Assembly Hall” at International House. Our attendance came in well above our expectations, so this was a critical early decision that made a huge difference.
  2. The invited speakers were key. They made a huge difference in the quality of the content.
  3. Delegating early and often was important. One key difficulty here is gauging how much a volunteer can (or should) do. Many people are willing to help a little, so breaking things down into small chunks is important.

Unclear Decisions

  1. Timing (May 16-27, 2005): We wanted to take advantage of the quarter here with its special emphasis on learning. We also wanted to run the summer school in the summer. There was no schedule that satisfied both goals. By starting as late as possible in the quarter, we were in the “summer” for universities on a semester schedule but not those on a quarter schedule. Thus, we traded some students and scheduling conflicts at the University of Chicago for the advantages of the learning quarter.
  2. Location (Hyde Park, Chicago):
    Advantages:

    1. Easy to fly to.
    2. Easy to get funding. (TTI and Uchicago were both significant contributors.)
    3. Easy (on-site) organization.

    Disadvantages:

    1. US visa processing was too slow for, or rejected, 7+ students.
    2. Location in Chicago implied many locals drifted in and out.
    3. The Hyde Park area lacks real hotels, creating housing difficulties.
  3. Workshop colocation: We colocated with two workshops. The advantage of this is more content. The disadvantage was that it forced talks to start relatively early. This meant that attendance at the start of the first lecture was relatively low (60-or-so), ramping up through the morning. Although some students benefitted from the workshop talks, most appeared to gain much more from the summer school.

Things to do Differently Next Time

  1. Delegate harder and better. Doing various things yourself rather than delegating means you feel like you are “doing your part”, but it also means that you are distracted and do not see other things which need to be done… and those things simply don’t get done unless you see them.
  2. Have a ‘sorting session’. With 100+ people in the room, it is difficult to meet people of similar interests. This should be explicitly aided. One good suggestion is “have a poster session for any attendees”. Sorting based on other dimensions might also be helpful. The wiki helped here for social events.
  3. Torture the speakers more. Presenting an excess of content in a minimum of time to an audience of diverse backgrounds is extremely difficult. This difficulty cannot be avoided, but it can be ameliorated. Having presentation slides and suggested reading available well in advance helps. The bad news here is that it is very difficult to get speakers to make materials available in advance: they naturally want to tweak slides at the last minute and include the newest cool discoveries.
  4. Post schedules at the entrance.

The Future

There will almost certainly be future machine learning summer schools, both in this series and otherwise. My impression is that the support due to being “in series” is not critical to success, but it is considerable. For those interested, running one “in series” starts with a proposal consisting of {organizers, time/location, proposed speakers, budget} sent to Alex Smola and Bernhard Schoelkopf. I am sure they are busy, so conciseness is essential.

What is the right form of modularity in structured prediction?

Suppose you are given a sequence of observations x_1,…,x_T from some space and wish to predict a sequence of labels y_1,…,y_T so as to minimize the Hamming loss: sum_{i=1}^{T} I(y_i != c(x_1,…,x_T)_i), where c(x_1,…,x_T)_i is the ith predicted component. For simplicity, suppose each label y_i is in {0,1}.

We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component y_i independently, since the loss is a linear combination of losses on each individual component i. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of Te. This breakup into T different prediction problems is not the standard form of modularity in structured prediction.
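As a small concrete check of this reduction, here is a sketch that trains T independent classifiers, one per component, and verifies that the average Hamming loss is roughly T times the average per-component error. The synthetic data, the choice of scikit-learn’s LogisticRegression as the base learner, and all constants are my own illustrative choices, not from the post.

```python
# Per-component reduction: one independent binary classifier per position i,
# each trained to predict y_i from the whole observation sequence x_1..x_T.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
T, n_train, n_test = 5, 500, 200

def make_data(n):
    X = rng.normal(size=(n, T))                                # each row is x_1..x_T
    Y = (X + 0.5 * rng.normal(size=(n, T)) > 0).astype(int)    # labels y_1..y_T
    return X, Y

X_tr, Y_tr = make_data(n_train)
X_te, Y_te = make_data(n_test)

# Learn T separate classifiers, each seeing the full observation sequence.
clfs = [LogisticRegression().fit(X_tr, Y_tr[:, i]) for i in range(T)]
preds = np.column_stack([clfs[i].predict(X_te) for i in range(T)])

per_component_err = (preds != Y_te).mean(axis=0)   # error rate per position
hamming = (preds != Y_te).sum(axis=1).mean()       # average Hamming loss per sequence
print("average per-component error:", per_component_err.mean())
print("average Hamming loss:       ", hamming)     # ~ T * average per-component error
```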

A more typical form of modularity is to predict y_i given x_i, y_{i-1}, y_{i+1}, where the circularity (predicting given other predictions) is handled in various ways. This is often represented with a chain-structured graphical model linking each y_i to x_i and to its neighboring labels.

This form of modularity seems to be preferred for several reasons:

  1. Graphical models of this sort are a natural language for expressing what we know (or believe we know) about a problem in advance.
  2. There may be computational advantages to learning to predict from fewer features. (But note that handling the circularity is sometimes computationally difficult.)
  3. There may be sample complexity advantages to learning to predict from fewer features. This is particularly true for many common learning algorithms.

The difficulty with this approach is that “errors accumulate”. In particular, an average error rate of e for each of the predictors can easily imply a Hamming loss of O(eT^2). Matti Kaariainen convinced me this is not improvable for predictors of this form.
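To see where the quadratic dependence can come from, here is a small simulation of my own (a simplified construction, not Kaariainen’s argument): a chained predictor errs with probability e when handed a correct previous label and simply propagates any mistake it is handed, so an early error corrupts the rest of the sequence and the expected Hamming loss scales roughly like eT^2/2 rather than eT.

```python
# Error accumulation for a chained predictor that propagates its first mistake.
import random

def chained_hamming(T, e, trials=20000):
    """Average Hamming loss when one early error corrupts all later predictions."""
    total = 0
    for _ in range(trials):
        wrong = False
        for _ in range(T):
            if not wrong and random.random() < e:
                wrong = True      # errs on a correct input with probability e
            total += wrong        # once wrong, every later prediction stays wrong
    return total / trials

T, e = 50, 0.01
print("independent predictors, expected Hamming loss ~ e*T :", e * T)
print("chained predictor, simulated Hamming loss           :", chained_hamming(T, e))
print("worst-case scaling ~ e*T^2/2                        :", e * T * T / 2)
```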

So, we have two forms of modularity. One is driven by the loss function, while the other is driven by simplicity of the prediction descriptions. Each has advantages and disadvantages from a practical viewpoint. Can these different approaches be reconciled? Is there a compelling algorithm for solving structured prediction which incorporates both intuitions?

A Short Guide to PhD Graduate Study

Graduate study is a mysterious and uncertain process. The easiest way to see this is by noting that a very old advisor/student mechanism is still preferred. There is no known successful mechanism for “mass producing” PhDs as is done (in some sense) for undergraduate and masters study. Here are a few hints that might be useful to prospective or current students, based on my own experience.

  1. Masters or PhD? (a) You want a PhD if you want to do research. (b) You want a masters if you want to make money. People wanting (b) will be manifestly unhappy with (a) because it typically means years of low pay. People wanting (a) should try to avoid (b) because it prolongs an already long process.
  2. Attitude. Many students struggle for a while with the wrong attitude towards research. Most students come into graduate school with 16-19 years of schooling where the principal means of success is proving that you know something via assignments, tests, etc… Research does not work this way, and research is what a PhD is about.

    The right attitude is something more like “I have some basic questions about the world and want to understand the answers to them.” The process of discovering the answers, writing them up, and convincing others that you have the right answers is what a PhD is about.

    Let me repeat this another way: you cannot get a PhD by doing homework (even homework assigned by your advisor). Many students fall into this failure mode because it is very natural to continue as you have for the last n years of education. The difficulty is often exacerbated by the mechanics of the PhD process. For example, students are often dependent on their advisors for funding and typical advisors have much more accumulated experience and knowledge than the typical student.

    There are several reasons why you cannot succeed with the “homework” approach:

    1. Your advisor doesn’t have time to micromanage your work.
    2. A very significant part of doing good research is understanding the motivations (and failures of motivations) of different approaches. Offloading thinking about this onto your advisor means that you are missing a critical piece of your own education.
    3. It invites abuse. Advisors are often trapped in their own twisty maze of too many things to do, so they are very tempted to offload work (including nonresearch work) onto students. Students doing some of this can make sense: a bit of help with nonresearch work can be useful here and there, and is even critical when it comes to funding the students. But a student should never be given (or even allowed to take on) more load than comfortably leaves room for significant research.
    4. With respect to the wider research community, a PhD is an opportunity to develop a personality independent of your advisor. Doing your advisor’s homework doesn’t accomplish this.
  3. Advisor. The choice of advisor is the most important choice in a PhD education. You want one who is comfortable with your own independent streak. You want one who is well enough off to fund you and who won’t greatly load you down with nonresearch tasks. You want one whose research style fits yours and who has the good regard of the larger research community. This combination of traits is difficult to come by.
    Even more difficult is coming by it twice. I recommend having two advisors because it gives you twice the sources of good advice, wisdom, and funding.
  4. Institution. The choice of advisor is more important than the choice of institution. A good advisor is a make-or-break decision with respect to success. The institution is a less important choice of the form “make or make well”. A good institution will have sufficient computational resources and sufficient funding to cover student costs. Quality of life outside of school should be a significant concern because you will be spending years in the same place.
  5. Lifestyle. Before choosing to go for a PhD, try to understand the research lifestyle. If it doesn’t fit reasonably well, don’t try. You will just end up unhappy after years of your life wasted. This is a common failure mode.