More NIPS Papers II

I thought this was a very good NIPS with many excellent papers. The following are a few NIPS papers which I liked and I hope to study more carefully when I get the chance. The list is not exhaustive and in no particular order…

  • Preconditioner Approximations for Probabilistic Graphical Models.
    Pradeep Ravikumar and John Lafferty.
    I thought the use of preconditioner methods, borrowed from the solution of linear systems, in the context of approximate inference was novel and interesting. The results look good and I’d like to understand the limitations.
  • Rodeo: Sparse nonparametric regression in high dimensions.
    John Lafferty and Larry Wasserman.
    A very interesting approach to feature selection in nonparametric regression from a frequentist framework. The use of lengthscale variables in each dimension reminds me a lot of ‘Automatic Relevance Determination’ in Gaussian process regression — it would be interesting to compare Rodeo to ARD in GPs.
  • Interpolating between types and tokens by estimating power law generators.
    Goldwater, S., Griffiths, T. L., & Johnson, M.
    I had wondered how Chinese restaurant processes and Pitman-Yor processes relate to Zipf plots and power laws for word frequencies. This paper seems to have the answers.
  • A Bayesian spatial scan statistic.
    Daniel B. Neill, Andrew W. Moore, and Gregory F. Cooper.
    When I first learned about spatial scan statistics I wondered what a Bayesian counterpart would be. I liked that their method was simple, more accurate, and much faster than the usual frequentist method.
  • Q-Clustering.
    M. Narasimhan, N. Jojic and J. Bilmes.
    A very interesting application of submodular function optimization to clustering. This feels like a hot area.
  • Worst-Case Bounds for Gaussian Process Models.
    Sham M. Kakade, Matthias W. Seeger, & Dean P. Foster.
    It’s useful for Gaussian process practitioners to know that their approaches don’t do silly things when viewed from a worst-case frequentist setting. This paper provides some relevant theoretical results.

More NIPS Papers

Let me add to John’s post with a few of my own favourites from this year’s conference. First, let me say that Sanjoy’s talk, Coarse Sample Complexity Bounds for Active Learning, was also one of my favourites, as was the Forgetron paper.

I also really enjoyed the last third of Christos’ talk on the complexity of finding Nash equilibria.

And, speaking of tagging, I think the U.Mass Citeseer replacement system Rexa from the demo track is very cool.

Finally, let me add my recommendations for specific papers:

  • Z. Ghahramani, K. Heller: Bayesian Sets
    [no preprint]
    (A very elegant probabilistic information retrieval style model
    of which objects are “most like” a given subset of objects.)
  • T. Griffiths, Z. Ghahramani: Infinite Latent Feature Models and the Indian Buffet Process
    [preprint]
    (A Dirichlet-style prior over infinite binary matrices with beautiful exchangeability properties.)
  • K. Weinberger, J. Blitzer, L. Saul: Distance Metric Learning for Large Margin Nearest Neighbor Classification
    [preprint]
    (A nice idea about how to learn a linear transformation of your feature space which brings nearby points of the same class closer together and sends nearby points of differing classes further apart. Convex. Kilian gave a very nice talk on this.)
  • D. Blei, J. Lafferty: Correlated Topic Models
    [preprint]
    (Nice trick using the logistic normal to induce correlations on the simplex, applied to topic models for text.)

I’ll also post in the comments a list of other papers that caught my eye but
which I haven’t looked at closely enough to be able to out-and-out
recommend.

Some NIPS papers

Here is a set of papers that I found interesting (and why).

  1. A PAC-Bayes approach to the Set Covering Machine improves the set covering machine. The set covering machine approach is a new way to do classification, characterized by a very close connection between theory and algorithm. At this point, the approach seems to be competing well with SVMs in just about every dimension: similar computational speed, similar accuracy, stronger learning theory guarantees, a more general information source (a kernel has strictly more structure than a metric), and more sparsity. Developing a classification algorithm is not very easy, but the results so far are encouraging.
  2. Off-Road Obstacle Avoidance through End-to-End Learning and Learning Depth from Single Monocular Images both effectively showed that depth information can be predicted from camera images (using notably different techniques). This ability is strongly enabling because cameras are cheap, tiny, light, and potentially provide longer-range distance information than the laser range finders people traditionally use.
  3. The Forgetron: A Kernel-Based Perceptron on a Fixed Budget proved that a bounded memory kernelized perceptron algorithm (which might be characterizable as “stochastic functional gradient descent with weight decay and truncation”) competes well with respect to an unbounded memory algorithm when the data contains a significant margin. Roughly speaking, this implies that the perceptron approach can learn arbitrary (via the kernel) reasonably simple concepts from unbounded quantities of data. (A minimal sketch of the general flavor appears after this list.)
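
To make that flavor concrete, here is a minimal sketch of a budget kernel perceptron with weight decay and truncation of the oldest support vector. The kernel, decay constant, and removal rule are illustrative placeholders of mine, not the Forgetron’s actual shrinking schedule or analysis.

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    # Gaussian (RBF) kernel; gamma is an illustrative default.
    return np.exp(-gamma * np.sum((x - z) ** 2))

def budget_kernel_perceptron(stream, budget=100, decay=0.99, kernel=rbf):
    """Simplified budget perceptron: on each mistake, add the example as a
    support vector, decay all weights, and drop the oldest support vector
    if the budget is exceeded. (The real Forgetron uses a carefully chosen
    shrinking step to obtain its mistake bound; `decay` here is just an
    illustrative constant.)"""
    support, alphas = [], []   # support vectors and their signed weights
    mistakes = 0
    for x, y in stream:        # y in {-1, +1}
        score = sum(a * kernel(s, x) for s, a in zip(support, alphas))
        if y * score <= 0:     # mistake: update the hypothesis
            mistakes += 1
            support.append(x)
            alphas.append(float(y))
            alphas = [decay * a for a in alphas]   # weight decay
            if len(support) > budget:              # truncation
                support.pop(0)
                alphas.pop(0)
    return support, alphas, mistakes
```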

In addition, Sebastian Thrun’s “How I won the Darpa Grand Challenge” and Sanjoy Dasgupta’s “Coarse Sample Complexity for Active Learning” talks were both quite interesting.

(Feel free to add any that you found interesting.)

The Everything Ensemble Edge

Rich Caruana, Alexandru Niculescu, Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is:

  1. Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings.
  2. Run the methods on each of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression (see the calibration sketch after this list).
  3. For each loss in {accuracy, area under the ROC curve, cross entropy, squared error, etc…}, evaluate the average performance of each method.
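
As a concrete illustration of the calibration step, here is a minimal sketch using scikit-learn and a toy synthetic dataset (my choices, not the paper’s setup) to map a boosted-tree model’s margins to probabilities with either sigmoid (Platt) fitting or isotonic regression.

```python
# Hedged illustration of margin calibration; the dataset, model, and split
# sizes are placeholders and not taken from the Caruana et al. experiments.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for method in ("sigmoid", "isotonic"):
    # Wrap a boosted-tree model so its scores are calibrated on held-out folds.
    clf = CalibratedClassifierCV(GradientBoostingClassifier(), method=method, cv=3)
    clf.fit(X_train, y_train)
    probs = clf.predict_proba(X_test)[:, 1]
    print(method, "cross entropy:", log_loss(y_test, probs))
```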

A series of conclusions can be drawn from the observations.

  1. (Calibrated) boosted decision trees appear to perform best in general, although support vector machines and neural networks give credible near-best performance.
  2. The metalearning algorithm which simply chooses the best single method (based upon a small validation set) performs much better.
  3. A metalearning algorithm which combines the predictors in an ensemble using stepwise refinement of validation set performance appears to perform even better (a sketch of this greedy selection follows the list).
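
For concreteness, here is a minimal sketch of that greedy stepwise selection, assuming each model is summarized by its vector of validation-set predictions and that models may be added with replacement; the function names and the use of accuracy as the validation metric are my assumptions, not details from the paper.

```python
import numpy as np

def ensemble_selection(val_preds, val_labels, n_steps=50):
    """Greedy forward selection of models into an ensemble.

    val_preds: dict mapping model name -> array of predicted probabilities
               on the validation set (shape [n_examples]).
    val_labels: array of 0/1 validation labels.
    Models may be added repeatedly (selection with replacement), which acts
    like learning ensemble weights in small increments.
    """
    ensemble = []                        # chosen models, with repeats
    running_sum = np.zeros(len(val_labels))

    def accuracy(mean_probs):
        return np.mean((mean_probs > 0.5) == val_labels)

    for _ in range(n_steps):
        best_name, best_score = None, -np.inf
        for name, preds in val_preds.items():
            # Validation performance if this model were added next.
            candidate = (running_sum + preds) / (len(ensemble) + 1)
            score = accuracy(candidate)
            if score > best_score:
                best_name, best_score = name, score
        ensemble.append(best_name)
        running_sum += val_preds[best_name]

    # Final prediction = average of the chosen models' outputs.
    return ensemble
```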

There are a number of caveats to this work: it was only applied on large datasets, there is no guarantee that the datasets are representative of your problem (although efforts were made to be representative in general), and the size of the training set was fixed rather than using the natural size given by the problem. Despite all these caveats, the story told above seems compelling: if you want maximum performance, you must try many methods and somehow combine them.

The most significant drawback of this method is computational complexity. Techniques for reducing the computational complexity are therefore of significant interest. It seems plausible that there exists some learning algorithm which typically performs well whenever any of the above algorithms can perform well, at a computational cost which is significantly less than “run all algorithms on all settings and test”.

A fundamental unanswered question here is “why?” in several forms. Why have the best efforts of many machine learning algorithm designers failed to capture all the potential predictive strength in a single coherent learning algorithm? Why do ensembles give such a significant, consistent edge in practice? A great many papers follow the scheme: invent a new way to create ensembles, test, observe that it improves prediction performance at the cost of more computation, and publish. There are several pieces of theory explaining individual ensemble methods, but we seem to have no convincing theoretical statement explaining why they almost always work.

Fast Gradient Descent

Nic Schraudolph has been developing a fast gradient descent algorithm called Stochastic Meta-Descent (SMD).

Gradient descent is currently untrendy in the machine learning community, but a large number of people continue to use gradient descent on neural networks or other architectures from when it was trendy in the early 1990s. There are three problems with gradient descent.

  1. Gradient descent does not necessarily produce easily reproduced results. Typical algorithms start with “set the initial parameters to small random values”.
  2. The design of the representation that gradient descent is applied to is often nontrivial. In particular, knowing exactly how to build a large neural network so that it will perform well requires knowledge which has not been made easily applicable.
  3. Gradient descent can be slow. Obviously, taking infinitesimal steps in the direction of the gradient would take forever, so some finite step size must be used. What exactly this step size should be is unclear. Many people have developed many algorithms for adjusting the step size (and, to some extent, the step direction). Unfortunately, many of the more sophisticated algorithms are not robust to noise, scale badly with the number of parameters (anything worse than O(n) is unacceptable for big applications), or both. Consequently, many people simply use gradient descent where the step size is adjusted by a simple momentum heuristic (sketched after this list).
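
For reference, that simple momentum heuristic looks roughly like the following sketch; the toy objective, learning rate, and momentum constant are illustrative placeholders.

```python
import numpy as np

def sgd_momentum(grad, w0, lr=0.01, momentum=0.9, n_steps=1000):
    """Gradient descent with a momentum heuristic: the step is a decayed
    running sum of past gradients, which damps noise and speeds progress
    along consistently useful directions. `lr` and `momentum` are
    illustrative values, not recommendations."""
    w = np.array(w0, dtype=float)
    velocity = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)                       # (stochastic) gradient at w
        velocity = momentum * velocity - lr * g
        w += velocity
    return w

# Toy usage on a quadratic bowl (placeholder objective).
w_min = sgd_momentum(lambda w: 2 * w, w0=[5.0, -3.0])
```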

Many people would add point (4): gradient descent on many architectures does not find a global optimum. This seems like a confusion of goals to me. The goal is good performance on future examples in learning, rather than achieving a global optimum on the training set.

SMD addresses point (3). It is an O(n) gradient descent algorithm that can compete with the sophisticated methods where those methods work, while remaining fairly robust to noise. Exactly how well it addresses point (3) is not entirely clear, but a few interesting problems have been solved with the algorithm, and perhaps we will see more evidence in the near future.
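
SMD itself adapts a separate gain for each parameter using an auxiliary vector maintained with Hessian-vector products. As a loose, heavily simplified illustration of the per-parameter gain-adaptation idea only (a generic sign-agreement heuristic, not Schraudolph’s actual update), here is a sketch that also costs O(n) per step.

```python
import numpy as np

def adaptive_gain_sgd(grad, w0, gain0=0.01, up=1.05, down=0.7, n_steps=1000):
    """Per-parameter gain adaptation sketch: each weight keeps its own step
    size, multiplied up when successive gradients agree in sign and down when
    they disagree. This is a generic O(n) heuristic, NOT the actual SMD
    update, which additionally uses curvature (Hessian-vector) information."""
    w = np.array(w0, dtype=float)
    gains = np.full(w.shape, gain0)
    prev_g = np.zeros_like(w)
    for _ in range(n_steps):
        g = grad(w)
        agree = np.sign(g) == np.sign(prev_g)   # did each coordinate keep its direction?
        gains = np.where(agree, gains * up, gains * down)
        w -= gains * g
        prev_g = g
    return w
```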