2007 Summer Machine Learning Conferences

It’s conference season once again.

| Conference | Due? | When? | Where? | Double blind? | Author feedback? | Workshops? |
|---|---|---|---|---|---|---|
| AAAI | February 1/6 (and 27) | July 22-26 | Vancouver, British Columbia | Yes | Yes | Done |
| UAI | February 28/March 2 | July 19-22 | Vancouver, British Columbia | No | No | No |
| COLT | January 16 | June 13-15 | San Diego, California (with FCRC) | No | No | No |
| ICML | February 7/9 | June 20-24 | Corvallis, Oregon | Yes | Yes | February 16 |
| KDD | February 23/28 | August 12-15 | San Jose, California | Yes | No? | February 28 |

The geowinner this year is the west coast of North America. Last year’s geowinner was the Northeastern US, and the year before it was mostly Europe. It’s notable how tightly the conferences cluster, even when they don’t colocate.

Retrospective

It’s been almost two years since this blog began. In that time, I’ve learned enough to shift my expectations in several ways.

  1. Initially, the idea was for a general purpose ML blog where different people could contribute posts. What has actually happened is that most posts come from me, with a few guest posts that I greatly value. There are a few reasons I see for this.
    1. Overload. A couple years ago, I had not fully appreciated just how busy life gets for a researcher. Making a post is not simply a matter of getting to it, but rather of prioritizing between {writing a grant, finishing an overdue review, writing a paper, teaching a class, writing a program, etc…}. This is a substantial transition away from what life as a graduate student is like. At some point the question is not “when will I get to it?” but rather “will I get to it?” and the answer starts to become “no” most of the time.
    2. Feedback failure. This blog currently receives about 3K unique visitors per day from about 13K unique sites per month. This number of visitors is large enough that it scares me somewhat—having several thousand people read a post is more attention than almost all papers published in academia get. But the nature of things is that only a small fraction of people leave comments, and the rest are essentially invisible. Adding a few counters to the site may help with this.
    3. Content Control. The internet has a huge untapped capacity to support content, so one of the traditional reasons for editorial control (limited space) simply no longer exists. Nevertheless, the time of readers is important and there is a focus-of-attention issue since one blog with all posts on all topics would be virtually useless. In an ideal world, the need for explicit content control would disappear and be replaced by a massive cooperative collaborative filtering process. This shift is already well underway since anyone can start their own blog and read anything they choose. Tighter integration of collaborative filtering into the overall process will surely be useful. I’ve reorganized my links to other blogs to make this a little bit easier. In the last couple years, many new machine learning related blogs have started (just recently: Yee Whye’s), which is great in several ways.
    4. Difficulty. Talking clearly about things you barely understand (and no one else does) is simply very difficult. Expending the effort to write clearly about them in a post is not much different from expending the effort to write clearly about them in a paper, which is the traditional mechanism of publishing. There is no simple way around this problem, although changing people’s expectations may be helpful. Right now, the expectation in academia is (partially) set by the academic paper. A different expectation, more akin to the way we discuss problems with each other in person, may be helpful.

    For the record, I’m always happy to consider posts by others. If you are considering your own blog, trying a guest post or two is a great way to experiment. Many people don’t have the time or inclination to run their own blog, so guest posts are essential.

  2. What is a good post? A good post is fundamentally an interesting post, but “interesting” can be broken down further.
    1. Speak plainly. The review process in academia can sometimes favor the convoluted over the plain. A blog strongly encourages otherwise since the backgrounds of readers are very diverse. If you aren’t self-editing for simplicity, you aren’t being simple enough.
    2. Believe in it.
    3. Lack of comments is not always lack of interest. As an example, the posts of the form “interesting papers at <conference>” tend to get very few comments, but they are some of the most viewed.
    4. Avoid duplication. The most obvious way to use a blog is as a mechanism for posting finished research. It’s ok for this, but the most interesting ways of using a blog are for topics which could not be stated as a research paper.

Interesting Papers at NIPS 2006

Here are some papers that I found surprisingly interesting.

  1. Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, Greedy Layer-wise Training of Deep Networks. Empirically investigates some of the design choices behind deep belief networks.
  2. Long Zhu, Yuanhao Chen, Alan Yuille, Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing. An unsupervised method for detecting objects using simple feature filters that works remarkably well on the (supervised) Caltech-101 dataset.
  3. Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira, Analysis of Representations for Domain Adaptation. This is the first analysis I’ve seen of learning when the samples are drawn from a distribution different from the evaluation distribution, with guarantees that depend on reasonably measurable quantities.

All of these papers turn out to have a common theme—the power of unlabeled data to do generically useful things.

The Spam Problem

The New York Times has an article on the growth of spam. Interesting facts include: 9/10 of all email is spam, spam source identification is nearly useless due to botnet spam senders, and image-based spam (emails which consist of an image only) is on the rise.

Estimates of the cost of spam are almost certainly far too low, because they do not account for the cost in time lost by people.

The image-based spam which is currently penetrating many filters should be catchable with a more sophisticated application of machine learning technology. For the spam I see, the rendered images come in only a few formats, which would be easy to recognize via a support vector machine (with RBF kernel), neural network, or even nearest-neighbor architecture. The mechanics of setting this up to run efficiently is the only real challenge. This is the next step in the spam war.
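The nearest-neighbor variant is the simplest to sketch. Assume each rendered image has been reduced to a small feature vector (a grayscale histogram, say); the toy vectors and labels below are purely illustrative, not real spam data:

```python
import math

def nearest_neighbor(labeled_examples, query):
    """Return the label of the labeled feature vector closest to query."""
    def dist(a, b):
        # Euclidean distance between two feature vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(labeled_examples, key=lambda ex: dist(ex[0], query))[1]

# Hypothetical histogram-style features for a handful of labeled images.
examples = [
    ([0.9, 0.1, 0.0], "spam"),  # bright, text-heavy stock-tip image
    ([0.7, 0.3, 0.1], "spam"),
    ([0.1, 0.5, 0.9], "ham"),   # ordinary photo attachment
]

print(nearest_neighbor(examples, [0.85, 0.15, 0.05]))  # spam
```

Because image spam reuses a few templates, even this crude method should cluster the repeats; the engineering work is in extracting features from rendered images fast enough to run on a mail server.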

The likely response to this system is to make the image-based spam even more random. We should (essentially) expect to see Captcha spam, and our inability to recognize Captcha spam should persist as long as the vision problem is not solved. This hopefully degrades the value of spam to the spammers, but it may not drive that value to zero.

Solutions beyond machine learning may be necessary. One simple economic solution is to transfer a small amount (10 cents?) from a first-time sender to the receiver in a verifiable manner. If the receiver classifies the email as spam, the charge repeats on the next receipt; otherwise it goes away.

There are several difficulties with this approach: How do you change a huge system in heavy use which no one controls? How do you deal with mailing lists? These problems appear surmountable. For example, we could extend the mail protocol to include a payment system (using the “X-” lines) and use the existence of a payment as a feature in existing spam-or-not prediction systems. Over time, this feature may become so useful that every legitimate email user is encouraged to offer a small payment with the first email to a recipient.
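As a sketch of how a payment header could slot into an existing filter as one more feature: the header name `X-Payment-Token` and the hand-set scoring weights below are hypothetical, purely for illustration; a real filter would learn the weights from labeled mail.

```python
def extract_features(headers, body):
    """Reduce a message to a tiny feature dictionary (toy example)."""
    return {
        "has_payment": "X-Payment-Token" in headers,  # hypothetical header
        "mentions_pills": "pills" in body.lower(),
    }

def spam_score(features):
    """Higher score = more spam-like. Weights are assumed, not learned."""
    score = 0.0
    if features["mentions_pills"]:
        score += 2.0
    if features["has_payment"]:
        score -= 3.0  # a verifiable payment is strong evidence of legitimacy
    return score

msg = {"headers": {"X-Payment-Token": "abc123"},
       "body": "Hello, a question about your paper..."}
print(spam_score(extract_features(msg["headers"], msg["body"])))  # -3.0
```

The point of treating payment as a feature rather than a hard requirement is that the protocol can be deployed incrementally: mail without the header still flows, but mail with it gets a strong prior toward the inbox.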

Recruitment Conferences

One of the subsidiary roles of conferences is recruitment. NIPS is optimally placed in time for this because it falls right before the major recruitment season.

I personally found job hunting embarrassing, and was relatively inept at it. I expect this is true of many people, because it is not something done often.

The basic rule is: make the plausible hirers aware of your interest. Any corporate sponsor is a plausible hirer, regardless of whether or not it has a booth. The CRA and ACM job centers are other reasonable sources.

There are substantial differences between the different possibilities. Putting some effort into understanding the distinctions is a good idea, although you should always remember where the other person is coming from.