MLTV

As part of a PASCAL project, the Slovenians have been filming various machine learning events and placing them on the web here. This includes, for example, the Chicago 2005 Machine Learning Summer School as well as a number of other summer schools, workshops, and conferences.

There are some significant caveats here. For example, I can’t access it from Linux. Based upon the webserver logs, I expect that is a problem for most people reading this: computer scientists are particularly nonstandard in their choice of computing platform.

Nevertheless, the core idea here is excellent and details of compatibility can be fixed later. With modern technology toys, there is no fundamental reason why the process of announcing new work at a conference should happen only once and only for the people who could make it to that room in that conference. The problems solved include:

  1. The multitrack vs. single-track debate. (“Sometimes the single track doesn’t interest me” vs. “When it’s multitrack I miss good talks.”)
  2. “I couldn’t attend because I was giving birth/going to a funeral/a wedding”
  3. “What was that? I wish there was a rewind on reality.”

There are some fears here too. For example, maybe a shift towards recording and placing things on the web will result in lower attendance at a conference. Such a fear is confused in a few ways:

  1. People go to conferences for many more reasons than just announcing new work. Other goals include doing research, meeting old friends, worrying about job openings, skiing, and visiting new places. There is also a subtle benefit of going to a conference: it represents a commitment of time to research. It is this commitment which makes two people from the same place start working together at a conference. Given all these benefits of going to a conference, there is plenty of reason for conferences to continue to exist.
  2. It is important to remember that a conference is a process in aid of research. Recording and making available for download the presentations at a conference makes research easier by solving all the problems listed above.
  3. This is just another new information technology. When the web came out, computer scientists and physicists quickly adopted a “place any paper on your webpage” style, even when journals forced them to sign away the rights to a paper in order to publish it. Doing this was simply healthy for researchers because their papers were more easily readable. The same logic applies to making presentations at a conference available on the web.

Deadline Season

Many different paper deadlines are coming up soon so I made a little reference table. Out of curiosity, I also computed the interval between submission deadline and conference.

Conference  Location      Date          Deadline                Interval (days)
COLT        Pittsburgh    June 22-25    January 21              152
ICML        Pittsburgh    June 26-28    January 30/February 6   140
UAI         MIT           July 13-16    March 9/March 16        119
AAAI        Boston        July 16-20    February 16/21          145
KDD         Philadelphia  August 23-26  March 3/March 10        166
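The interval column can be reproduced with a little date arithmetic. Here is a sketch in Python under two assumptions that the table itself does not state: the dates fall in 2006 (any non-leap year gives the same numbers), and the interval runs from the last listed deadline to the first day of the conference. The names `YEAR`, `SCHEDULE`, and `interval_days` are mine, chosen for illustration.

```python
from datetime import date

YEAR = 2006  # assumption: the table does not state the year

# (conference, last listed deadline, first conference day), copied from the table
SCHEDULE = [
    ("COLT", date(YEAR, 1, 21), date(YEAR, 6, 22)),
    ("ICML", date(YEAR, 2, 6), date(YEAR, 6, 26)),
    ("UAI", date(YEAR, 3, 16), date(YEAR, 7, 13)),
    ("AAAI", date(YEAR, 2, 21), date(YEAR, 7, 16)),
    ("KDD", date(YEAR, 3, 10), date(YEAR, 8, 23)),
]


def interval_days(deadline, first_day):
    """Number of days from the submission deadline to the conference start."""
    return (first_day - deadline).days


intervals = {name: interval_days(d, start) for name, d, start in SCHEDULE}
for name, days in intervals.items():
    print(f"{name:5s} {days}")
```

Under those assumptions the computed values match the interval column for all five conferences.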

It looks like the northeastern US is the big winner as far as location goes this year.

Automated Labeling

One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this:

  1. Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about.
  2. Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way.
  3. Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy.
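To make the third item concrete, here is a toy sketch of active learning in Python. It assumes a very friendly setting that the list above does not specify: one-dimensional data, a noise-free labeling oracle, and a true concept that is a simple threshold. Under those assumptions, choosing which example to label by binary search finds the decision boundary with a logarithmic number of label queries. The function name `active_learn_threshold` and the example oracle are hypothetical.

```python
def active_learn_threshold(points, oracle):
    """Locate the first positive point by binary search over label queries.

    Assumes labels are noise-free and monotone: every point at or above
    some unknown threshold is positive, every point below it is negative.
    Returns (sorted points, index of first positive point, queries used).
    """
    points = sorted(points)
    lo, hi = 0, len(points)  # invariant: points[:lo] negative, points[hi:] positive
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1  # one call to the labeling oracle
        if oracle(points[mid]):
            hi = mid
        else:
            lo = mid + 1
    return points, lo, queries


# Hypothetical unlabeled pool and oracle (true threshold is 0.365).
pool = [i / 100 for i in range(100)]
sorted_pool, first_positive, queries_used = active_learn_threshold(
    pool, lambda x: x >= 0.365
)
print(sorted_pool[first_positive], queries_used)
```

Labeling the whole pool would take 100 queries here; the interactive choice needs only a handful. Real active learning has to cope with noise and richer hypothesis classes, which this sketch ignores.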

It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows:

  1. Identify some subset of observed values to predict from the others.
  2. Build a predictor.
  3. Use the output of the predictor to define a new prediction problem.
  4. Repeat…

Examples of this sort seem to come up in robotics very naturally. An extreme version of this is:

  1. Predict nearby things given touch sensor output.
  2. Predict medium distance things given the nearby predictor.
  3. Predict far distance things given the medium distance predictor.

Some of the participants in the LAGR project are using this approach.
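The near-to-far chain above can be sketched in a few lines of Python. Everything here is an assumption made for illustration, not a description of what any LAGR team does: the terrain patches are synthetic, each sensor yields a single number that cleanly separates obstacles from free space, and the learner is a simple nearest-class-mean threshold. The point is only the structure: each predictor's output becomes the label source for the next, longer-range predictor.

```python
import random


def fit_threshold(values, labels):
    """Nearest-class-mean rule in 1-D: threshold halfway between class means."""
    pos = [v for v, y in zip(values, labels) if y]
    neg = [v for v, y in zip(values, labels) if not y]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda v: v > threshold


def make_patches(rng, n):
    """Synthetic patches: (is_obstacle, near, mid, far) sensor readings."""
    patches = []
    for _ in range(n):
        obstacle = rng.random() < 0.5
        base = 0.6 if obstacle else 0.0  # obstacles read high on every sensor
        near, mid, far = (base + 0.4 * rng.random() for _ in range(3))
        patches.append((obstacle, near, mid, far))
    return patches


rng = random.Random(0)
touched = make_patches(rng, 200)    # patches where the touch sensor gave a label
seen_near = make_patches(rng, 200)  # patches with near and mid readings only
seen_mid = make_patches(rng, 200)   # patches with mid and far readings only
held_out = make_patches(rng, 200)   # used only to check the final predictor

# Stage 1: touch sensor output labels the near-range readings.
near_pred = fit_threshold([p[1] for p in touched], [p[0] for p in touched])

# Stage 2: the near predictor's output labels the medium-range readings.
mid_pred = fit_threshold(
    [p[2] for p in seen_near], [near_pred(p[1]) for p in seen_near]
)

# Stage 3: the medium predictor's output labels the far-range readings.
far_pred = fit_threshold(
    [p[3] for p in seen_mid], [mid_pred(p[2]) for p in seen_mid]
)

far_accuracy = sum(far_pred(p[3]) == p[0] for p in held_out) / len(held_out)
print(far_accuracy)
```

Only stage 1 ever sees a ground-truth label. With noisier sensors, errors would compound along the chain, which is one reason a general understanding of the technique would be valuable.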

A less extreme version appeared in the DARPA Grand Challenge winner, where the output of a laser range finder supplied the labels for training a road-or-not predictor on camera images.

These automated labeling techniques transform an unsupervised learning problem into a supervised learning problem, which has huge implications: we understand supervised learning much better and can bring to bear a host of techniques.

The set of work on automated labeling is sketchy—right now it is mostly just an observed-as-useful technique for which we have no general understanding. Some relevant bits of algorithm and theory are:

  1. Reinforcement-learning-to-classification reductions, which convert rewards into labels.
  2. Cotraining, which considers a setting containing multiple data sources. When predictors using different data sources agree on unlabeled data, an inferred label is automatically created.
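The agreement rule in the second item can be sketched as follows. This is a simplified illustration in Python, not the published cotraining algorithm: it assumes two one-dimensional views per example, a nearest-class-mean threshold learner for each view, and a hand-made toy dataset. The names `fit_threshold` and `cotrain` are hypothetical.

```python
def fit_threshold(values, labels):
    """Nearest-class-mean rule in 1-D: threshold halfway between class means."""
    pos = [v for v, y in zip(values, labels) if y]
    neg = [v for v, y in zip(values, labels) if not y]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda v: v > threshold


def cotrain(labeled, unlabeled, max_rounds=5):
    """Grow the labeled set with examples on which both views agree.

    labeled: list of ((view_a, view_b), label); unlabeled: list of (view_a, view_b).
    Returns the enlarged labeled set and the examples left unlabeled.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        labels = [y for _, y in labeled]
        pred_a = fit_threshold([x[0] for x, _ in labeled], labels)
        pred_b = fit_threshold([x[1] for x, _ in labeled], labels)
        agreed = [x for x in pool if pred_a(x[0]) == pred_b(x[1])]
        if not agreed:
            break  # no new inferred labels, so stop
        labeled += [(x, pred_a(x[0])) for x in agreed]
        pool = [x for x in pool if x not in agreed]
    return labeled, pool


# Toy data: four hand-labeled examples and five unlabeled ones.
seed = [((0.9, 0.85), True), ((0.7, 0.95), True),
        ((0.1, 0.2), False), ((0.3, 0.05), False)]
unlabeled = [(0.8, 0.7), (0.2, 0.3), (0.95, 0.9), (0.15, 0.1), (0.55, 0.45)]

grown, leftover = cotrain(seed, unlabeled)
print(len(grown), leftover)
```

On this toy data, four of the five unlabeled examples receive inferred labels, while the ambiguous example (0.55, 0.45), on which the two views disagree, stays unlabeled.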

It’s easy to imagine that undiscovered algorithms and theory exist to guide and use this empirically useful technique.