What is the right form of modularity in structured prediction?

Suppose you are given a sequence of observations x_1,…,x_T from some space and wish to predict a sequence of labels y_1,…,y_T so as to minimize the Hamming loss: sum_{i=1}^{T} I(y_i != c(x_1,…,x_T)_i), where c(x_1,…,x_T)_i is the ith predicted component. For simplicity, suppose each label y_i is in {0,1}.

We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component y_i independently, since the loss is a linear combination of losses on each individual component i. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of eT. This breakup into T different prediction problems is not the standard form of modularity in structured prediction.
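A minimal sketch of the per-component reduction described above. The names here (`train_binary`, `train_per_component`) are placeholders invented for illustration, not anything from the post; `train_binary` stands in for whatever binary learning algorithm you reduce to.

```python
from typing import Callable, List, Sequence, Tuple

def hamming_loss(y_true: Sequence[int], y_pred: Sequence[int]) -> int:
    """Number of positions where the prediction disagrees with the truth."""
    return sum(int(t != p) for t, p in zip(y_true, y_pred))

def train_per_component(examples: List[Tuple[Sequence, Sequence[int]]],
                        T: int,
                        train_binary: Callable) -> list:
    """Train one binary classifier per position i, each mapping the full
    observation sequence x_1..x_T to the single label y_i."""
    classifiers = []
    for i in range(T):
        data = [(x_seq, y_seq[i]) for x_seq, y_seq in examples]
        classifiers.append(train_binary(data))
    return classifiers

def predict_sequence(classifiers: list, x_seq: Sequence) -> List[int]:
    """Predict each label independently; the sequence loss is the sum of
    the per-position losses, so per-position error e gives Hamming loss eT."""
    return [c(x_seq) for c in classifiers]
```

Because Hamming loss decomposes as a sum over positions, any average error rate achieved by the T classifiers translates directly into the sequence-level guarantee with no error propagation between positions.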

A more typical form of modularity is to predict y_i given x_i, y_{i-1}, y_{i+1}, where the circularity (predicting given other predictions) is handled in various ways. This is often represented with a chain-structured graphical model.

This form of modularity seems to be preferred for several reasons:

  1. Graphical models of this sort are a natural language for expressing what we know (or believe we know) about a problem in advance.
  2. There may be computational advantages to learning to predict from fewer features. (But note that handling the circularity is sometimes computationally difficult.)
  3. There may be sample complexity advantages to learning to predict from fewer features. This is particularly true for many common learning algorithms.

The difficulty with this approach is that errors accumulate. In particular, an average error rate of e for each of the predictors can easily imply a Hamming loss of O(eT^2). Matti Kaariainen convinced me this is not improvable for predictors of this form.
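An illustrative calculation (not from the post) of how the accumulation happens. Consider a pessimistic chain where each predictor errs with probability e when fed a correct previous label, and stays wrong once the chain has gone wrong. The probability that position i is wrong then satisfies p_i = p_{i-1} + (1 - p_{i-1})·e, and the expected Hamming loss grows roughly like eT^2/2 while the independent reduction's guarantee stays at eT:

```python
def chained_hamming(e: float, T: int) -> float:
    """Expected Hamming loss for the pessimistic chained model where
    an error at any position persists for the rest of the sequence."""
    p_wrong = 0.0   # probability that prediction i is wrong
    total = 0.0     # expected Hamming loss accumulated so far
    for _ in range(T):
        p_wrong = p_wrong + (1.0 - p_wrong) * e  # wrong stays wrong
        total += p_wrong
    return total

def independent_hamming(e: float, T: int) -> float:
    """The per-component reduction's guarantee: loss is linear in T."""
    return e * T

e, T = 0.01, 100
print(independent_hamming(e, T))  # stays at e*T
print(chained_hamming(e, T))      # grows roughly like e*T^2/2
```

With e = 0.01 and T = 100, the chained model's expected loss is over thirty times the independent model's, matching the O(eT^2) versus eT gap described above.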

So, we have two forms of modularity. One is driven by the loss function while the other is driven by the simplicity of prediction descriptions. Each has advantages and disadvantages from a practical viewpoint. Can these different approaches be reconciled? Is there a compelling algorithm for solving structured prediction which incorporates both intuitions?

A Short Guide to PhD Graduate Study

Graduate study is a mysterious and uncertain process. The easiest way to see this is by noting that a very old advisor/student mechanism is still preferred. There is no known successful mechanism for “mass producing” PhDs as is done (in some sense) for undergraduate and masters study. Here are a few hints that might be useful to prospective or current students, based on my own experience.

  1. Masters or PhD (a) You want a PhD if you want to do research. (b) You want a masters if you want to make money. People wanting (b) will be manifestly unhappy with (a) because it typically means years of low pay. People wanting (a) should try to avoid (b) because it prolongs an already long process.
  2. Attitude. Many students struggle for a while with the wrong attitude towards research. Most students come into graduate school with 16-19 years of schooling where the principal means of success is proving that you know something via assignments, tests, etc… Research does not work this way.

    The right attitude is something more like “I have some basic questions about the world and want to understand the answers to them.” The process of discovering the answers, writing them up, and convincing others that you have the right answers is what a PhD is about.

    Let me repeat this another way: you cannot get a PhD by doing homework (even homework assigned by your advisor). Many students fall into this failure mode because it is very natural to continue as you have for the last n years of education. The difficulty is often exacerbated by the mechanics of the PhD process. For example, students are often dependent on their advisors for funding and typical advisors have much more accumulated experience and knowledge than the typical student.

    There are several reasons why you cannot succeed with the “homework” approach:

    1. Your advisor doesn’t have time to micromanage your work.
    2. A very significant part of doing good research is understanding the motivations (and failures of motivations) of different approaches. Offloading that thinking onto your advisor means that you are missing a critical piece of your own education.
    3. It invites abuse. Advisors are often trapped in their own twisty maze of too many things to do, so they are very tempted to offload work (including nonresearch work) onto students. Having students do some of this can make sense: a bit of help with nonresearch tasks is useful here and there, and even critical when it comes to funding the students. But a student should never be given (or even allowed to take on) more load than comfortably leaves room for significant research.
    4. With respect to the wider research community, a PhD is an opportunity to develop a personality independent of your advisor. Doing your advisor’s homework doesn’t accomplish this.
  3. Advisor. The choice of advisor is the most important choice in a PhD education. You want one who is comfortable with your own independent streak. You want one who is well enough off to fund you and who won’t greatly load you down with nonresearch tasks. You want one whose research style fits yours and who has the good regard of the larger research community. This combination of traits is difficult to come by.
    Even more difficult is coming by it twice. I recommend having two advisors because it gives you twice the sources of good advice, wisdom, and funding.
  4. Institution. The choice of advisor is more important than the choice of institution. A good advisor is a make-or-break decision with respect to success. The institution is a less important choice of the form “make or make well”. A good institution will have sufficient computational resources and sufficient funding to cover student costs. Quality of life outside of school should be a significant concern because you will be spending years in the same place.
  5. Lifestyle. Before choosing to go for a PhD, try to understand the research lifestyle. If it doesn’t fit reasonably well, don’t try. You will just end up unhappy after years of your life wasted. This is a common failure mode.

Regret minimizing vs error limiting reductions

This post is about a reductions-related problem that I find mysterious. There are two kinds of reductions analysis currently under consideration.

  1. Error limiting reductions. Here, the goal is to bound the error rate of the created classifier in terms of the error rate of the binary classifiers that you reduce to. A simple example is error correcting output codes, where it is possible to prove that for certain codes the multiclass error rate is at most 4 * the binary classifier error rate.
  2. Regret minimizing reductions. Here, the goal is to bound the regret of the created classifier in terms of the regret of the binary classifiers reduced to. The regret is the error rate minus the minimum achievable error rate; when the learning problem is noisy, the minimum error rate may not be 0. An analogous result for regret is that for a probabilistic error correcting output code, multiclass regret is at most 4 * (binary regret)^0.5.

The use of “regret” is more desirable than the use of error rates because (for example) the ECOC error rate bound implies nothing when there is enough noise that the binary classifiers always have error rate 0.25: the bound 4 * 0.25 = 1 is vacuous. However, the square root dependence introduced when analyzing regret is not desirable. A basic question is: can we have the best of both worlds? Can we find some algorithm doing multiclass classification with binary classifiers such that average regret r for the binary classifiers implies average regret bounded by 4r for the multiclass classifier?

If the answer is “yes”, that reduction algorithm may be empirically superior to the one we use now.
If the answer is “no”, that is a sharp and unexpected distinction between error rate analysis and regret analysis.
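To make the contrast concrete, here is a small calculation with hypothetical numbers (the error rates below are invented for illustration, not taken from the post):

```python
import math

# Hypothetical numbers: noisy binary problems reduced to by ECOC.
binary_error = 0.25       # each binary classifier's actual error rate
binary_min_error = 0.20   # best achievable binary error (the noise floor)
binary_regret = binary_error - binary_min_error   # about 0.05

# The ECOC error-rate bound is vacuous at this noise level.
error_bound = 4 * binary_error                  # = 1.0

# The probabilistic ECOC regret bound is informative but loose,
# due to the square root dependence.
regret_bound = 4 * math.sqrt(binary_regret)     # about 0.894

# The open question asks for a linear dependence instead.
hoped_for_bound = 4 * binary_regret             # about 0.2

print(error_bound, regret_bound, hoped_for_bound)
```

The gap between the square-root bound and the hoped-for linear bound is exactly what the best-of-both-worlds question above is asking about.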

NIPS

NIPS is the big winter conference of learning.

  1. Paper due date: June 3rd. (Tweaked thanks to Fei Sha.)
  2. Location: Vancouver (main program) Dec. 5-8 and Whistler (workshops) Dec 9-10, BC, Canada

NIPS is larger than all of the other learning conferences, partly because it’s the only one at that time of year. I recommend the workshops which are often quite interesting and energetic.

Math on the Web

Andrej Bauer has set up a Mathematics and Computation Blog. As a first step, he has tried to address the persistent and annoying problem of math on the web. As a basic tool for precisely stating and transferring understanding of technical subjects, mathematics is very necessary. Despite this necessity, every mechanism for expressing mathematics on the web seems unnaturally clumsy. Here are some of the methods and their drawbacks:

  1. MathML. This was supposed to be the answer, but it has two severe drawbacks: Internet Explorer doesn’t render it, and the language is an example of push-XML-to-the-limit which no one would ever consider writing in directly. (In contrast, html is easy to write in.) It’s also very annoying that math fonts must be installed independently of the browser, even for mozilla-based browsers.
  2. Create inline images. This has several big drawbacks: font size is fixed for all viewers, you can’t cut & paste inside the images, and you can’t hyperlink from (say) symbol to definition. Math World is a good example using this approach.
  3. Html extensions. For example, y_i = x^2 via sub and sup tags. The drawback here is that the available language is very limited (no square roots, integrals, sums, etc…). This is what I have been using for posts.
  4. Raw latex. Researchers are used to writing math in latex and compiling it into postscript or pdf, so it is possible to simply communicate in that language. Unfortunately, the language can make simple things like fractions appear (syntactically) much more complicated. More importantly, latex is not nearly as universally known as the mathematics laid out in math books.
  5. Translation. An obvious trick is to translate a human-editable syntax (such as latex) into one of the presentation mechanisms above. There are two difficulties here:
    1. What do you translate to? None of the presentations mechanisms above are fully satisfying.
    2. Lost in translation. For example in latex, it’s hard to make a hyperlink from a variable in one formula to an anchor in the variable definition of another formula and have that translated correctly into (say) MathML.

Approach (5) is what Andrej’s blog is using, with a javascript translator that changes output depending on the destination browser. Ideally, the ‘smart translator’ would use whichever of {MathML, image, html extensions, human-editable format} was best and supported by the destination browser, but that is not yet the case. Nevertheless, it is a good start.