Workshops as Franchise Conferences

Founding a successful new conference is extraordinarily difficult. As a conference founder, you must attract a significant number of good papers: enough to entice participants into returning next year and (generally) to grow the conference. For someone choosing to participate in a new conference, there is a significant decision to make: do you send a paper to a new conference with no guarantee that the conference will work out, or do you send it to another (possibly less related) conference that you are sure will work?

The conference founding problem is a joint agreement problem with a very significant barrier. Workshops are a way around this problem, and workshops attached to conferences are a particularly effective means of doing so. A workshop at a conference is sure to have people available to speak and attend, and is sure to have a large audience on hand. Presenting work at a workshop is not generally exclusive: the work can also be presented at a conference. For someone considering participation, the only overhead is the direct time and effort involved.

All of the above says that workshops are much easier to launch than conferences, but it does not address a critical question: “Why run a workshop at a conference rather than just a session at the conference?” A session at the conference would have all of the above advantages.

There is one more very significant and direct advantage of a workshop over a special session: workshops are run by people who have a direct and significant interest in their success. The workshop organizers do the hard work of developing a topic, soliciting speakers, and deciding what the program will be. The organizers’ reputations are then built on the success or flop of the workshop. This “direct and significant interest” aspect of a workshop is the basic reason why franchise systems (think 7-Eleven or McDonald’s) are common and successful.

What does this observation imply about how things could be? For example, we could imagine a conference that is “all workshops”. Instead of having a program committee and program chair, the conference might just have a program chair who accepts or rejects workshop chairs, who then organize their own workshop/session. This mode doesn’t seem to exist, which is always cautionary, but on the other hand it’s not clear this mode has even been tried. NIPS is probably the conference closest to using this approach. For example, a significant number of people attend only the workshops at NIPS.

Some NIPS papers

Here is a set of papers that I found interesting (and why).

  1. A PAC-Bayes approach to the Set Covering Machine improves the set covering machine. The set covering machine approach is a new way to do classification characterized by a very close connection between theory and algorithm. At this point, the approach seems to be competing well with SVMs in nearly all dimensions: similar computational speed, similar accuracy, stronger learning theory guarantees, a more general information source (a kernel has strictly more structure than a metric), and more sparsity. Developing a competitive new classification algorithm is not easy, but the results so far are encouraging.
  2. Off-Road Obstacle Avoidance through End-to-End Learning and Learning Depth from Single Monocular Images both effectively showed that depth information can be predicted from camera images (using notably different techniques). This ability is strongly enabling because cameras are cheap, tiny, light, and can potentially provide longer-range distance information than the laser range finders people traditionally use.
  3. The Forgetron: A Kernel-Based Perceptron on a Fixed Budget proved that a bounded memory kernelized perceptron algorithm (which might be characterized as “stochastic functional gradient descent with weight decay and truncation”) competes well with an unbounded memory algorithm when the data contains a significant margin. Roughly speaking, this implies that the perceptron approach can learn arbitrary (via the kernel) reasonably simple concepts from unbounded quantities of data. (A simplified budget-perceptron sketch follows this list.)
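
To make the flavor of the approach concrete, here is a minimal sketch of a budgeted kernel perceptron. It is not the Forgetron algorithm from the paper (which also shrinks the weights of retained support vectors before discarding one); the kernel choice, the budget size, and the data-stream interface are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def budget_kernel_perceptron(stream, budget=100, kernel=rbf_kernel):
    """Online kernel perceptron that keeps at most `budget` support vectors.

    On each mistake the example joins the support set; once the budget is
    exceeded, the oldest support vector is dropped (the truncation step).
    The actual Forgetron also scales down weights before removal, which is
    omitted here for brevity.
    """
    support = []  # list of (x, y) pairs acting as support vectors
    for x, y in stream:  # labels y are +1 or -1
        # The prediction is the sign of the kernel expansion over the support set.
        score = sum(yi * kernel(xi, x) for xi, yi in support)
        if y * score <= 0:  # mistake (or zero margin): update
            support.append((x, y))
            if len(support) > budget:
                support.pop(0)  # forget the oldest example
    return support
```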

In addition, Sebastian Thrun’s “How I won the Darpa Grand Challenge” and Sanjoy Dasgupta’s “Coarse Sample Complexity for Active Learning” talks were both quite interesting.

(Feel free to add any that you found interesting.)

Is the Google way the way for machine learning?

Urs Hoelzle from Google gave an invited presentation at NIPS. In the presentation, he strongly advocated interacting with data in a particular scalable manner, which goes something like the following:

  1. Make a cluster of machines.
  2. Build a unified filesystem. (Google uses GFS, but NFS or other approaches work reasonably well for smaller clusters.)
  3. Interact with data via MapReduce.

Creating a cluster of machines is, by this point, relatively straightforward.

Unified filesystems are a little bit tricky. GFS is designed for essentially unlimited aggregate throughput to disk, while NFS can bottleneck because all of the data has to move through one machine. Nevertheless, this may not be a limiting factor for smaller clusters.

MapReduce is a programming paradigm. Essentially, it is a combination of a data element transform (map) and an aggregator/selector (reduce). These operations are highly parallelizable, and the claim is that they support the forms of data interaction which are necessary. Apparently, the Nutch project has an open source implementation of MapReduce (but this is clearly the most nonstandard element).
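
To make the paradigm concrete, here is a minimal single-process sketch of the map/reduce pattern. It is only an illustration of the programming model, not Google’s (or Nutch’s) implementation: a real system runs many mappers and reducers in parallel across a cluster and shuffles intermediate key/value pairs between machines. The `map_reduce` helper and the word-count example are made up for this sketch.

```python
from collections import defaultdict
from itertools import chain

def map_reduce(records, mapper, reducer):
    """Single-process illustration of the MapReduce programming model.

    `mapper` turns one input record into a list of (key, value) pairs;
    `reducer` combines all values that were emitted under the same key.
    """
    # Map phase: transform every record into intermediate key/value pairs,
    # grouping values by key (the "shuffle" a real system does across machines).
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        grouped[key].append(value)
    # Reduce phase: aggregate the values collected under each key.
    return {key: reducer(key, values) for key, values in grouped.items()}

# Example: word counts over a toy corpus.
docs = ["the quick brown fox", "the lazy dog"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(word, 1) for word in doc.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```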

Shifting towards this paradigm has several effects:

  1. It makes “big data” applications more viable.
  2. It makes some learning algorithms more viable than others. One way to think about this is in terms of statistical query learning algorithms. The (generalized) notion of a statistical query algorithm is an algorithm that relies only upon the results of expectations of a (relatively small) number of functions of the data. Any such algorithm can be implemented via MapReduce (see the sketch after this list). The “naive bayes” algorithm and most decision tree algorithms can be easily phrased as statistical query algorithms. Support vector machines can (technically) be phrased as statistical query algorithms, but the number of queries scales with the number of datapoints. Gradient descent algorithms can also be phrased as statistical query algorithms. Learning algorithms which work on one example at a time are not generally statistical query algorithms.

    Another way to think about this is in terms of the complexity of the computation. Roughly speaking, as the amount of data scales, only O(n) or (perhaps) O(n log(n)) algorithms are tractable. This strongly favors online learning algorithms. Decision trees and naive bayes are (again) relatively reasonable. Support vector machines (or gaussian processes) encounter difficulties related to scaling.
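
As a concrete illustration of the statistical query view, here is a sketch of one batch gradient-descent step for linear regression phrased as map/reduce: the only thing the algorithm needs from the data is the expectation of the per-example gradient, which each shard computes locally (map) before the partial sums are combined (reduce). The shard layout, step size, and squared-loss choice are assumptions made for the example.

```python
import numpy as np

def gradient_step_mapreduce(shards, w, lr=0.1):
    """One gradient-descent step for squared loss, phrased as a statistical
    query: the data is touched only through the expected per-example gradient.

    `shards` is a list of (X, y) chunks standing in for data spread across
    machines; each "map" computes a local gradient sum and count, and the
    "reduce" aggregates them into the overall expectation.
    """
    # Map: every shard independently computes its gradient sum and example count.
    partials = [(X.T @ (X @ w - y), len(y)) for X, y in shards]
    # Reduce: combine the partial results into the expected gradient.
    grad_sum = sum(g for g, _ in partials)
    n = sum(c for _, c in partials)
    return w - lr * grad_sum / n

# Toy usage: a small linear-regression problem split into two shards.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
shards = [(X[:50], y[:50]), (X[50:], y[50:])]
w = np.zeros(3)
for _ in range(200):
    w = gradient_step_mapreduce(shards, w)
print(w)  # approaches [1.0, -2.0, 0.5]
```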

There is a reasonable argument that the “low hanging fruit” of machine learning research lies in the big-data-with-enabling-tools paradigm. This is because (a) the amount of data available has been growing far faster than the amount of computation and (b) until recently, we simply haven’t had the tools to scale to it.

I expect Urs is right: we should look in this direction.

Watchword: model

In everyday use, a model is a description which explains the behavior of some system, hopefully at a level where some alteration of the model predicts a corresponding alteration of the real-world system. In machine learning, “model” has several variant definitions.

  1. Everyday. The common definition is sometimes used.
  2. Parameterized. Sometimes “model” is shorthand for “parameterized model”. Here, it refers to a model with unspecified free parameters. In the Bayesian learning approach, you typically have a prior over (everyday) models.
  3. Predictive. Even further from everyday use is the predictive model. Examples of this are “my model is a decision tree” or “my model is a support vector machine”. Here, there is no real sense in which an SVM explains the underlying process. For example, an SVM tells us nothing in particular about how alterations to the real-world system would create a change.

Which definition is being used at any particular time is important information. For example, if it’s a parameterized or predictive model, this implies some learning is required. If it’s a predictive model, then the set of operations which can be done to the model are restricted with respect to everyday usage. I don’t have any particular advice here other than “watch out”—be aware of the distinctions, watch for this source of ambiguity, and clarify when necessary.