Machine Learning (Theory)

5/29/2005

Maximum Margin Mismatch?

John makes a fascinating point about structured classification (and slightly scooped my post!). Maximum Margin Markov Networks (M3N) are an interesting example of the second class of structured classifiers (where the classification of one label depends on the others), and one of my favorite papers. I’m not alone: the paper won the best student paper award at NIPS in 2003.

There are some things I find odd about the paper. For instance, it says of probabilistic models

“cannot handle high dimensional feature spaces and lack strong theoretical guarrantees.”

I’m aware of no such limitations. Also:

“Unfortunately, even probabilistic graphical models that are trained discriminatively do not achieve the same level of performance as SVMs, especially when kernel features are used.”

This is quite interesting and contradicts my own experience as well as that of a number of people I greatly
respect. I wonder what the root cause is: perhaps there is something different about the data Ben+Carlos were working with?

The elegance of M3N, I think, is unrelated to this probabilistic/margin distinction. M3N provided the first implementation of the margin concept that was computationally efficient for multiple output variables and provided a sample complexity result with a much weaker dependence than previous approaches. Further, the authors carry out some nice experiments that speak well for the practicality of their approach. In particular, M3N’s outperform Conditional Random Fields (CRFs) in terms of per-variable (Hamming) loss. And I think this gets us to the crux of the matter, and ties back to John’s post. CRFs are trained by a MAP approach that is effectively per sequence, while the loss function at run time we care about is per variable.

The mismatch the post title refers to is that, at test time, M3N’s are viterbi decoded: a per sequence decoding. Intuitively, viterbi is an algorithm that only gets paid for its services when it classifies an entire sequence correctly. This seems an odd mismatch, and makes one wonder: How well does a per-variable approach like the variable marginal likelihood approach mentioned previously of Roweis,Kakade, and Teh combined with runtime belief propagation compare with the M3N procedure? Does the mismatch matter, and if so, is there a more appropriate decoding procedure like BP, appropriate for margin-trained methods? And finally, it seems we need to answer John’s question convincingly: if you really care about per-variable probabilities or classifications, isn’t it possible that structuring the output space actually hurts? (It seems clear to me that it can help when you insist on getting the entire sequence right, although perhaps others don’t concur with that.)

Bad ideas

Tags: General jl@ 8:24 am

I found these two essays on bad ideas interesting. Neither of these is written from the viewpoint of research, but they are both highly relevant.

  1. Why smart people have bad ideas by Paul Graham
  2. Why smart people defend bad ideas by Scott Berkun (which appeared on slashdot)

In my experience, bad ideas are common and over confidence in ideas is common. This overconfidence can take either the form of excessive condemnation or excessive praise. Some of this is necessary to the process of research. For example, some overconfidence in the value of your own research is expected and probably necessary to motivate your own investigation. Since research is a rather risky business, much of it does not pan out. Learning to accept when something does not pan out is a critical skill which is sometimes never acquired.

Excessive condemnation can be a real ill when it’s encountered. This has two effects:

  1. When the penalty for being wrong is too large, it means people have a great investment in defending “their” idea. Since research is risky, “their” idea is often wrong (or at least in need of amendment).
  2. A large penalty implies people are hesitant to introduce new ideas.

Both of these effects slow the progress of research. How much, exactly, is unclear and very difficult to imagine measuring.

While it may be difficult to affect the larger community of research, you can and should take these considerations into account when choosing coauthors, advisors, and other people you work with. The ability to say “oops, I was wrong”, have that be accepted without significant penalty, and move on is very valuable for the process of thinking.

5/28/2005

Running A Machine Learning Summer School

Tags: General jl@ 3:34 pm

We just finished the Chicago 2005 Machine Learning Summer School. The school was 2 weeks long with about 130 (or 140 counting the speakers) participants. For perspective, this is perhaps the largest graduate level machine learning class I am aware of anywhere and anytime (previous MLSSs have been close). Overall, it seemed to go well, although the students are the real authority on this. For those who missed it, DVDs will be available from our Slovenian friends. Email Mrs Spela Sitar of the Jozsef Stefan Institute for details.

The following are some notes for future planning and those interested.
Good Decisions

  1. Acquiring the larger-than-necessary “Assembly Hall” at International House. Our attendance came in well above our expectations, so this was a critical early decision that made a huge difference.
  2. The invited speakers were key. They made a huge difference in the quality of the content.
  3. Delegating early and often was important. One key difficulty here is gauging how much a volunteer can (or should) do. Many people are willing to help a little, so breaking things down into small chunks is important.

Unclear Decisions

  1. Timing (May 16-27, 2005): We wanted to take advantage of the special emphasis on learning quarter here. We also wanted to run the summer school in the summer. These goals did not have a good solution. By starting as late as possible in the quarter, we were in the “summer” for universities on a semester schedule but not those on a quarter schedule. Thus, we traded some students and scheduling conflicts at University of chicago for the advantages of the learning quarter.
  2. Location (Hyde Park, Chicago):
    Advantages:
    1. Easy to fly to.
    2. Easy to get funding. (TTI and Uchicago were both significant contributors.)
    3. Easy (on-site) organization.

    Disadvantages:

    1. US visas were too slow or rejected 7+ students.
    2. Location in Chicago implied many locals drifted in and out.
    3. The Hyde Park area lacks real hotels, creating housing difficulties.
  3. Workshop colocation: We colocated with two workshops. The advantage of this is more content. The disadvantage was that it forced talks to start relatively early. This meant that attendance at the start of the first lecture was relatively low (60-or-so), ramping up through the morning. Although some students benefitted from the workshop talks, most appeared to gain much more from the summer school.

Things to do Differently Next Time

  1. Delegate harder and better. Doing various things rather than delegating means you feel like you are “doing your part”, but it also means that you are distracted and do not see other things which need to be done….and they simply don’t get done unless you see it.
  2. Have a ’sorting session’. With 100+ people in the room, it is difficult to meet people of similar interests. This should be explicitly aided. One good suggestion is “have a poster session for any attendees”. Sorting based on other dimensions might also be helpful. The wiki helped here for social events.
  3. Torture the speakers more. Presenting an excess of content in a minimum of time to an audience of diverse backgrounds is extremely difficult. This difficulty can not be avoided, but it can be ameliorated. Having presentation slides and suggested reading well in advance helps. The bad news here is that it is very difficult to get speakers to make materials available in advance. They naturally want to tweak slides at the last minute and include the newest cool discoveries.
  4. Schedules posted at the entrance.

The Future There will almost certainly be future machine learning summer schools in the series and otherwise. My impression is that the support due to being “in series” is not critical to success, but it is considerable. For those interested, running one “in series” starts with a proposal consisting of {organizers,time/location,proposed speakers,budget} sent to Alex Smola and Bernhard Schoelkopf. I am sure they are busy, so conciseness is essential.

5/21/2005

What is the right form of modularity in structured prediction?

Tags: Problems, Reductions, structured jl@ 10:57 pm

Suppose you are given a sequence of observations x1,…,xT from some space and wish to predict a sequence of labels y1,…,yT so as to minimize the Hamming loss: sumi=1 to T I(yi != c(x1,…,xT)i) where c(x1,…,xT)i is the ith predicted component. For simplicity, suppose each label yi is in {0,1}.

We can optimize the Hamming loss by simply optimizing the error rate in predicting each individual component yi independently since the loss is a linear combination of losses on each individual component i. From a learning reductions viewpoint, we can learn a different classifier for each individual component. An average error rate of e over these classifiers implies an expected Hamming loss of Te. This breakup into T different prediction problems is not the standard form of modularity in structured prediction.

A more typical form of modularity is to predict yi given xi, yi-1, yi+1 where the circularity (predicting given other predictions) is handled in various ways. This is often represented with a graphical model like so:

This form of modularity seems to be preferred for several reasons:

  1. Graphical models of this sort are a natural language for expressing what we know (or believe we know) about a problem in advance.
  2. There may be computational advantages to learning to predict from fewer features. (But note that handling the circularity is sometimes computationally difficult.)
  3. There may be sample complexity advantages to learning to predict from fewer features. This is particularly true for many common learning algorithms.

The difficulty with this approach is that “errors accumulate”. In particular, an average error rate of e for each of the predictors can easily imply a hamming loss of O(eT2). Matti Kaariainen convinced me this is not improvable for predictors of this form.

So, we have two forms of modularity. One is driven by the loss function while the other driven by simplicity of prediction descriptions. Each has advantages and disadvantages from a practical viewpoint. Can these different approaches be reconciled? Is there a compelling algorithm for solving structured prediction which incorporated both intuitions?

5/17/2005

A Short Guide to PhD Graduate Study

Tags: General jl@ 3:05 pm

Graduate study is a mysterious and uncertain process. This easiest way to see this is by noting that a very old advisor/student mechanism is preferred. There is no known succesful mechanism for “mass producing” PhDs as is done (in some sense) for undergraduate and masters study. Here are a few hints that might be useful to prospective or current students based on my own experience.

  1. Masters or PhD (a) You want a PhD if you want to do research. (b) You want a masters if you want to make money. People wanting (b) will be manifestly unhappy with (a) because it typically means years of low pay. People wanting (a) should try to avoid (b) because it prolongs an already long process.
  2. Attitude. Many students struggle for awhile with the wrong attitude towards research. Most students come into graduate school with 16-19 years of schooling where the principle means of success is proving that you know something via assignments, tests, etc… Research does not work this way. Research is what a PhD is about.

    The right attitude is something more like “I have some basic questions about the world and want to understand the answer to them.” The process of discovering the answers, writing it up, and convincing others that you have the right answers is what a PhD is about.

    Let me repeat this another way: you cannot get a PhD by doing homework (even homework assigned by your advisor). Many students fall into this failure mode because it is very natural to continue as you have for the last n years of education. The difficulty is often exacerbated by the mechanics of the PhD process. For example, students are often dependent on their advisors for funding and typical advisors have much more accumulated experience and knowledge than the typical student.

    Their are several reasons why you cannot succeed with the “homework” approach:

    1. Your advisor doesn’t have time to micromanage your work.
    2. A very significant part of doing good research is understanding the motivations (and failures of motivations) of different approaches. Offloading thinking about this on your advisor means that you are missing a critical piece of your own education.
    3. It invites abuse. Advisors are often trapped in their own twisty maze of too many things to do, so they are very tempted to offload work (including nonresearch work) onto students. Students doing some of this can make some sense. A bit of help with nonresearch can be useful here and there and even critical when it comes to funding the students. But a student should never be given (or even allowed to take on) more load than comfortably leaves room for significant research.
    4. With respect to the wider research community, a PhD is an opportunity to develop a personality independent of your advisor. Doing your advisor’s homework doesn’t accomplish this.
  3. Advisor. The choice of advisor is the most important choice in a PhD education. You want one that is comfortable with your own independent streak. You want one that is well enough off to fund you and who won’t greatly load you down with nonresearch tasks. You want one who’s research style fits yours and who has the good regard of the larger research community. This combination of traits is difficult to come by.
    Even more difficult is coming by it twice. I recommend having two advisors because it gives you twice the sources of good advice, wisdom, and funding.
  4. Institution. The choice of advisor is more important than the choice of institution. A good advisor is a make-or-break decision with respect to succeess. The institution is a less important choice of the form “make or make well”. A good institution will have sufficient computational resources and sufficient funding to cover student costs. Quality of life outside of school should be a significant concern because you will be spending years in the same place.
  5. Lifestyle. Before choosing to go for a PhD, try to understand the research lifestyle. If it doesn’t fit reasonably well, don’t try. You will just end up unhappy after years of your life wasted. This is a common failure mode.

5/16/2005

Regret minimizing vs error limiting reductions

Tags: Problems, Reductions jl@ 2:21 pm

This post is about a reductions-related problem that I find mysterious. There are two kinds of reductions analysis currently under consideration.

  1. Error limiting reductions. Here, the goal is to bound the error rate of the created classifier in terms of the error rate of the binary classifiers that you reduce to. A very simple example of this is that error correcting output codes where it is possible to prove that for certain codes, the multiclass error rate is at most 4 * the binary classifier error rate.
  2. Regret minimizing reductions. Here, the goal is to bound the regret of the created classifier in terms of the regret of the binary classifiers reduced to. The regret is the error rate minus the minimum error rate. When the learning problem is noisy the minimum error rate may not be 0. An analagous result for reget is that for a probabilistic error correcting output code, multiclass regret is at most 4 * (binary regret)0.5.

The use of “regret” is more desirable than the use of error rates, because (for example) the ECOC error rate bound implies nothing when there is enough noise so that the binary classifiers always have error rate 0.25. However the square root dependence introduced when analyzing regret is not desirable. A basic question is: Can we have the best of both worlds? Can we find some algorithm doing multiclass classification with binary classifiers such that average regret r for the binary classifiers implies average regret bounded by 4r for the multiclass classifier?

If the answer is “yes”, that reduction algorithm may be empirically superior to the one we use now.
If the answer is “no”, that is a sharp and unexpected distinction between error rate analysis and regret analysis.

5/14/2005

NIPS

Tags: Announcements jl@ 1:23 pm

NIPS is the big winter conference of learning.

  1. Paper due date: June 3rd. (Tweaked thanks to Fei Sha.)
  2. Location: Vancouver (main program) Dec. 5-8 and Whistler (workshops) Dec 9-10, BC, Canada

NIPS is larger than all of the other learning conferences, partly because it’s the only one at that time of year. I recommend the workshops which are often quite interesting and energetic.

5/12/2005

Math on the Web

Tags: General jl@ 10:23 am

Andrej Bauer has setup a Mathematics and Computation Blog. As a first step he has tried to address the persistent and annoying problem of math on the web. As a basic tool for precisely stating and transfering understanding of technical subjects, mathematics is very necessary. Despite this necessity, every mechanism for expressing mathematics on the web seems unnaturally clumsy. Here are some of the methods and their drawbacks:

  1. MathML This was supposed to be the answer, but it has two severe drawbacks: “Internet Explorer” doesn’t read it and the language is an example of push-XML-to-the-limit which no one would ever consider writing in. (In contrast, html is easy to write in.) It’s also very annoying that math fonts must be installed independent of the browser, even for mozilla based browsers.
  2. Create inline images. This has several big drawbacks: font size is fixed for all viewers, you can’t cut & paste inside the images, and you can’t hyperlink from (say) symbol to definition. Math World is a good example using this approach.
  3. Html Extensions. For example, yi = x2. The drawback here is that the available language is very limited (no square roots, integrals, sums, etc…). This is what I have been using for posts.
  4. Raw latex. Researchers are used to writing math in latex and compile into postscript or pdf. It is possible to simply communicate in that language. Unfortunately, the language can make simple things like fractions appear (syntactically) much more complicated. More importantly, latex is not nearly as universally known as the mathematics layed out in math books.
  5. Translation. An obvious trick is to translate this human-editable syntax into something. There are two difficulties here:
    1. What do you translate to? None of the presentations mechanisms above are fully satisfying.
    2. Lost in translation. For example in latex, it’s hard to make a hyperlink from a variable in one formula to an anchor in the variable definition of another formula and have that translated correctly into (say) MathML.

Approach (4) is what Andrej’s blog is using, with a javascript translator that changes output depending on the destination browser. Ideally, the ’smart translator’ would use whichever of {MathML, image, html extensions, human-edit format} was best and supported by the destination browser, but that is not yet the case. Nevertheless, it is a good start.

5/11/2005

Visa Casualties

Tags: General jl@ 8:52 am

For the Chicago 2005 machine learning summer school we are organizing, at least 5 international students can not come due to visa issues. There seem to be two aspects to visa issues:

  1. Inefficiency. The system rejected the student simply by being incapable of even starting to evaluate their visa in less than 1 month of time.
  2. Politics. Border controls became much tighter after the September 11 attack. Losing a big chunk of downtown of the largest city in a country will do that.

What I (and the students) learned is that (1) is a much larger problem than (2). Only 1 prospective student seems to have achieved an explicit visa rejection. Fixing problem (1) should be a no-brainer, because the lag time almost surely indicates overload, and overload on border controls should worry even people concerned with (2). The obvious fixes to overload are “spend more money” and “make the system more efficient”.

With respect to (2), (which is a more minor issue by the numbers) it is unclear that the political calculus was done right. There is an obvious demonstrated risk that letting the wrong people through border controls means large buildings can be destroyed. However there is a subtle risk in making acquiring a visa a more uncertain process: it contributes towards shifting science, (human) learning, and technology outside of the US. This shift is economically detrimental to the US. For some anecdotal evidence of this effect, note that this is the first machine learning summer school in the US but the 6th in the series. Less striking, but perhaps a surer measurement is to notice that many of the machine learning related summer conferences are in Europe this year.

5/10/2005

Learning Reductions are Reductionist

Tags: Reductions jl@ 8:23 am

This is about a fundamental motivation for the investigation of reductions in learning. It applies to many pieces of work other than my own.

The reductionist approach to problem solving is characterized by taking a problem, decomposing it into as-small-as-possible subproblems, discovering how to solve the subproblems, and then discovering how to use the solutions to the subproblems to solve larger problems. The reductionist approach to solving problems has often payed off very well. Computer science related examples of the reductionist approach include:

  1. Reducing computation to the transistor. All of our CPUs are built from transistors.
  2. Reducing rendering of images to rendering a triangle (or other simple polygons). Computers can now render near-realistic scenes in real time. The big breakthrough came from learning how to render many triangles quickly.

This approach to problem solving extends well beyond computer science. Many fields of science focus on theories making predictions about very simple systems. These predictions are then composed to make predictions about where space craft go, how large a cannonball needs to be, etc… Obviously this approach has been quite successful.

It is an open question whether or not this approach can really succeed at learning.

  1. Against: We know that succesful learning requires the incorporation of prior knowledge in fairly arbitrary forms. This suggests that we can not easily decompose the process of learning.
  2. For: We know that humans can succeed at general purpose learning. It may be that arbitrary prior knowledge is required to solve arbitrary learning problems, but perhaps there are specific learning algorithms incorporating specific prior knowledge capable of solving the specific problems we encounter.
  3. Neutral: We know that learning reductions sometimes work. We don’t yet have a good comparison of how well they work with other approaches.

5/6/2005

Don’t mix the solution into the problem

Tags: General jl@ 10:37 am

A common defect of many pieces of research is defining the problem in terms of the solution. Here are some examples in learning:

  1. “The learning problem is finding a good seperating hyperplane.”
  2. “The goal of learning is to minimize (y-p)2 + C w2 where y = the observation, p = the prediction and w = a parameter vector.”
  3. Defining the loss function to be the one that your algorithm optimizes rather than the one imposed by the world.

The fundamental reason why this is a defect is that it creates artificial boundaries to problem solution. Artificial boundaries lead to the possibility of being blind-sided. For example, someone committing (1) or (2) above might find themselves themselves surprised to find a decision tree working well on a problem. Example (3) might result in someone else solving a learning problem better for real world purposes, even if it’s worse with respect to the algorithm optimization. This defect should be avoided so as to not artificially limit your learning kungfu.

The way to avoid this defect while still limiting the scope of investigation to something you can manage is to be explicit.

  1. Admit what the real world-imposed learning problem is. For example “The problem is to find a classifier minimizing error rate”.
  2. Be explicit about where the problem ends and the solution begins. For example “We use a support vector machine with a l2 loss to train a classifier. We use the l2 loss because it is an upper bound on the error rate which is computationally tractable to optimize.”
  3. Finish the solution. For example “The error rate on our test set was 0.34.”

It is important to note that this is not a critique about any particular method for solving learning problems, but rather about the process of thinking about learning problems. Eliminating this thinking-bug will leave people more capable of appreciating and using different approaches to solve the real learning problem.

5/3/2005

Conference attendance is mandatory

Tags: Announcements jl@ 8:25 am

For anyone planning to do research, conference attendance is virtually mandatory for success. Aside from exposing yourself to a large collection of different ideas, many interesting conversations leading to new research happen at conferences. If you are a student, you should plan to go to at least one summer conference. Your advisor should cover the costs.

Conference Location Early Registration deadline normal/student cost in US dollars
AAAI Pittsburgh, PA, USA May 13 590/170
IJCAI Edinburgh, Scotland May 21 663/351
COLT Bertinoro, Italy May 30 256/178
KDD Chicago, IL, USA July 15 590/260
ICML Bonn, Germany July 1 448
UAI Edinburgh, Scotland not ready yet ???

5/2/2005

Reviewing techniques for conferences

Tags: Reviewing jl@ 9:28 pm

The many reviews following the many paper deadlines are just about over. AAAI and ICML in particular were experimenting with several reviewing techniques.

  1. Double Blind: AAAI and ICML were both double blind this year. It seemed (overall) beneficial, but two problems arose.
    1. For theoretical papers, with a lot to say, authors often leave out the proofs. This is very hard to cope with under a double blind review because (1) you can not trust the authors got the proof right but (2) a blanket “reject” hits many probably-good papers. Perhaps authors should more strongly favor proof-complete papers sent to double blind conferences.
    2. On the author side, double blind reviewing is actually somewhat disruptive to research. In particular, it discourages the author from talking about the subject, which is one of the mechanisms of research. This is not a great drawback, but it is one not previously appreciated.
  2. Author feedback: AAAI and ICML did author feedback this year. It seemed helpful for several papers. The ICML-style author feedback (more space, no requirement of attacking the review to respond), appeared somewhat more helpful and natural. It seems ok to pass a compliment from author to reviewer.
  3. Discussion Periods: AAAI seemed more natural than ICML with respect to discussion periods. For ICML, there were “dead times” when reviews were submitted but discussions amongst reviewers were not encouraged. This has the drawback of letting people forget their review before discussing it.

Powered by WordPress