Two more UAI papers of interest

In addition to Ed Snelson’s paper, there were (at least) two other papers that caught my eye at UAI.

One was this paper by Sanjoy Dasgupta, Daniel Hsu, and Nakul Verma at UCSD, which shows in a surprisingly general and strong way that almost all linear projections of any jointly distributed vector random variable with finite first and second moments look spherical and unimodal (in fact, like a scale mixture of Gaussians). Great result, as you’d expect from Sanjoy.
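For intuition (this is not from the paper itself), here is a rough numerical sketch of the flavor of the claim: take a decidedly non-Gaussian distribution with finite first and second moments in a hundred dimensions, project samples onto a few random directions, and check that the one-dimensional projections look roughly symmetric and unimodal. The dimension, sample size, base distribution, and diagnostics are all arbitrary choices for illustration.

```python
# Rough sketch: random 1-d projections of a skewed, non-Gaussian distribution
# tend to look roughly Gaussian (near-zero skewness and excess kurtosis).
import numpy as np

rng = np.random.default_rng(0)
d, n = 100, 50_000

# Skewed, heavy-ish tailed base distribution in d dimensions.
X = rng.exponential(scale=1.0, size=(n, d))

for trial in range(3):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)                # random unit direction
    proj = X @ u                          # 1-d projection of every sample
    z = (proj - proj.mean()) / proj.std()
    skew = np.mean(z ** 3)                # ~0 for a symmetric distribution
    kurt = np.mean(z ** 4) - 3            # ~0 for Gaussian-like tails
    print(f"direction {trial}: skewness={skew:+.3f}, excess kurtosis={kurt:+.3f}")
```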

The other paper, which I found intriguing but just haven’t grokked yet, is this beast by Manfred and Dima Kuzmin. You can check out the (beautiful) slides if that helps. I feel like there is something deep here, but my brain is too small to understand it. The COLT and last NIPS papers/slides are also on Manfred’s page. Hopefully someone here can illuminate.

Upcoming conference

The Workshop for Women in Machine Learning will be held in San Diego on October 4, 2006.

For details see the workshop website:
http://www.seas.upenn.edu/~wiml/

A Winner

Ed Snelson won the Predictive Uncertainty in Environmental Modelling Competition in the temp(erature) category using this algorithm. Some characteristics of the algorithm are:

  1. Gradient descent
  2. … on about 600 parameters
  3. … with local minima
  4. … to solve regression.

This bears a strong resemblance to a neural network. The two main differences seem to be:

  1. The system has a probabilistic interpretation (which may aid design).
  2. There are (perhaps) fewer parameters than a typical neural network might have for the same problem (aiding speed).
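To make the listed characteristics concrete, here is a minimal sketch in that style: a radial-basis-function regressor with roughly 600 free parameters trained by plain gradient descent on squared error, which has local minima because the objective is nonconvex in the basis centers and widths. This is emphatically not Ed Snelson's winning algorithm (there is no probabilistic interpretation here); the model, sizes, and learning rate are made up for illustration.

```python
# Sketch: gradient descent on ~600 parameters, with local minima, for regression.
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 200                           # data points, basis functions
x = rng.uniform(-3, 3, size=n)
y = np.sin(2 * x) + 0.1 * rng.normal(size=n)

# ~600 free parameters: k centers + k log-widths + k output weights.
centers = rng.uniform(-3, 3, size=k)
log_width = np.zeros(k)
weights = 0.1 * rng.normal(size=k)

def predict(x):
    # phi[i, j] = exp(-0.5 * ((x_i - c_j) / w_j)^2); prediction = phi @ weights
    phi = np.exp(-0.5 * ((x[:, None] - centers) / np.exp(log_width)) ** 2)
    return phi @ weights, phi

lr = 0.005
for step in range(3000):
    pred, phi = predict(x)
    g = 2.0 * (pred - y) / n              # d(mean squared error) / d(prediction)
    diff = (x[:, None] - centers) / np.exp(2 * log_width)
    grad_w = phi.T @ g
    grad_c = (phi * diff * weights).T @ g
    grad_lw = (phi * diff * (x[:, None] - centers) * weights).T @ g
    weights -= lr * grad_w
    centers -= lr * grad_c
    log_width -= lr * grad_lw

print("final mean squared error:", np.mean((predict(x)[0] - y) ** 2))
```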

Regression vs. Classification as a Primitive

For learning reductions we have been concentrating on reducing various complex learning problems to binary classification. This choice needs to be actively questioned, because it was not carefully considered.

Binary classification is learning a classifier c: X -> {0,1} so as to minimize the probability of being wrong, Pr_{x,y ~ D}(c(x) ≠ y).

The primary alternative candidate seems to be squared error regression. In squared error regression, you learn a regressor s: X -> [0,1] so as to minimize the squared error, E_{x,y ~ D} (s(x)-y)^2.
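As a concrete (made-up) example of the two objectives, the snippet below computes the empirical version of each loss on a handful of labeled examples; the particular labels and predictions are invented.

```python
# Empirical 0/1 loss and squared error on a toy sample.
import numpy as np

y = np.array([0, 1, 1, 0, 1])                  # labels, with (x, y) ~ D
c_of_x = np.array([0, 1, 0, 0, 1])             # a classifier's predictions c(x) in {0,1}
s_of_x = np.array([0.2, 0.9, 0.4, 0.1, 0.7])   # a regressor's predictions s(x) in [0,1]

zero_one_loss = np.mean(c_of_x != y)           # empirical Pr(c(x) != y)
squared_error = np.mean((s_of_x - y) ** 2)     # empirical E (s(x) - y)^2
print(zero_one_loss, squared_error)
```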

It is difficult to judge one primitive against another. The judgement must at least partially be made on nontheoretical grounds because (essentially) we are evaluating a choice between two axioms/assumptions.

These two primitives are significantly related. Classification can be reduced to regression in the obvious way: you use the regressor to predict D(y=1|x), then threshold at 0.5. For this simple reduction, a squared error regret of r implies a classification regret of at most r^0.5. Regression can also be reduced to classification using the Probing algorithm. (This is much more obvious when you look at an updated proof.) Under this reduction, a classification regret of r implies a squared error regression regret of at most r.
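The classification-via-regression direction is simple enough to sketch directly (the Probing reduction in the other direction is not shown here). The regressor below is a made-up stand-in for any model trained on squared error to estimate D(y=1|x).

```python
# Sketch of the "obvious" reduction: threshold a [0,1]-valued regressor at 0.5.
import numpy as np

def classify_via_regression(regressor_predict, X):
    """Turn a [0,1]-valued regressor into a {0,1}-valued classifier."""
    scores = regressor_predict(X)          # estimates of D(y=1|x)
    return (scores >= 0.5).astype(int)     # threshold at 1/2

# Usage with a made-up regressor (any squared-error-trained model would do):
regressor_predict = lambda X: 1.0 / (1.0 + np.exp(-X @ np.array([1.5, -2.0])))
X = np.array([[0.5, 0.1], [-1.0, 0.3], [2.0, -0.5]])
print(classify_via_regression(regressor_predict, X))
```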

Both primitives enjoy a significant amount of prior work with (perhaps) classification enjoying more work in the machine learning community and regression having more emphasis in the statistics community.

The (nonhistoric) reasons for preferring classification seem to be:

  1. Aesthetically, what could be a more natural primitive than predicting a single bit?
  2. Classification is (potentially) inherently more representationally concise. When translated into transistors, a classification algorithm might use fewer transistors than a regressor, simply because the natural representation is bits rather than reals (~= floats).

There are several reasons to consider regression over classification:

  1. More uniform convergence. For a single squared error regressor, the rate of convergence of the estimate of squared error to the expected squared error goes as 1/m, assuming IID samples. For a single binary classifier, the rate of convergence of the estimate of binary loss to the expected binary loss goes as a function varying between 1/m and 1/m^0.5.
  2. There is a computational inequivalence between the primitives, as far as we know. In particular, the Probing algorithm must call a classifier several times in several ways to make a high precision regression prediction. On the other hand, classification via regression requires one call to the underlying regressor.
  3. Squared error regression learning is often easier than 0/1 classification learning, because the squared error loss is convex in the prediction while the 0/1 loss is not (a small numerical sketch follows this list). Note: this does not imply that squared error regression learning is a convex problem (it isn't for general regression algorithms). Instead, it just means that nonconvexity is not enforced by the loss function.
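Here is the small numerical sketch promised in point 3: as a function of a real-valued prediction on a positive example, squared error is convex, while the thresholded 0/1 loss is a step function, so a midpoint convexity check fails. The particular points checked are arbitrary.

```python
# For a positive example (y = 1): squared error (p - 1)^2 is convex in the
# prediction p; the thresholded 0/1 loss 1[p < 0.5] is a step function and
# violates midpoint convexity.
y = 1
squared = lambda p: (p - y) ** 2
zero_one = lambda p: float(p < 0.5)        # predict 1 iff p >= 0.5

a, b = 0.2, 0.7
mid = 0.5 * (a + b)
for name, loss in [("squared", squared), ("0/1", zero_one)]:
    lhs = loss(mid)
    rhs = 0.5 * loss(a) + 0.5 * loss(b)
    print(f"{name:7s} loss: f(midpoint)={lhs:.3f} vs average={rhs:.3f} "
          f"(convexity needs f(midpoint) <= average)")
```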

The mathematical evidence points toward squared error regression as a better primitive, although doubts still seem reasonable to entertain.

Who is having visa problems reaching US conferences?

Many of the large machine learning conferences were in the US this summer. A common problem encountered by students from abroad is visa issues.
Just getting a visa to visit can be pretty rough: you stand around in lines, sometimes for days. Even worse is the timing with respect to ticket buying: airplane tickets typically need to be bought well in advance, on nonrefundable terms, to secure a reasonable rate. When a visa is denied, as happens reasonably often, a very expensive ticket is burnt.

A serious effort is under way to raise this as an issue in need of fixing. Over the long term, effectively driving research conferences to locate outside of the US seems an unwise policy. Robert Schapire is planning to talk to a congressman. Sally Goldman suggested putting together a list of problem cases, and Phil Long set up an email address, immigration.and.confs@gmail.com, to collect them.

If you (or someone you know) have had insurmountable difficulties reaching a conference in the US, please send an email with:

Name:
Email address:
Conference:
Difficulty:
Details: (be brief please)

We expect most of the problem cases are students, so don’t be shy.