More NIPS Papers II

I thought this was a very good NIPS with many excellent papers. The following are a few NIPS papers which I liked and I hope to study more carefully when I get the chance. The list is not exhaustive and in no particular order…

  • Preconditioner Approximations for Probabilistic Graphical Models.
    Pradeep Ravikumar and John Lafferty.
    I thought the use of preconditioner methods from solving linear systems in the context of approximate inference was novel and interesting. The results look good and I’d like to understand the limitations.
  • Rodeo: Sparse nonparametric regression in high dimensions.
    John Lafferty and Larry Wasserman.
    A very interesting approach to feature selection in nonparametric regression from a frequentist framework. The use of lengthscale variables in each dimension reminds me a lot of ‘Automatic Relevance Determination’ in Gaussian process regression — it would be interesting to compare Rodeo to ARD in GPs.
  • Interpolating between types and tokens by estimating power law generators.
    Goldwater, S., Griffiths, T. L., & Johnson, M.
    I had wondered how Chinese restaurant processes and Pitman-Yor processes related to Zipf’s plots and power laws for word frequencies. This paper seems to have the answers.
  • A Bayesian spatial scan statistic.
    Daniel B. Neill, Andrew W. Moore, and Gregory F. Cooper.
    When I first learned about spatial scan statistics I wondered what a Bayesian counterpart would be. I liked the fact that their method was simple, more accurate, and much faster than the usual frequentist method.
  • Q-Clustering.
    M. Narasimhan, N. Jojic and J. Bilmes.
    A very interesting application of sub-modular function optimization to clustering. This feels like a hot area.
  • Worst-Case Bounds for Gaussian Process Models.
    Sham M. Kakade, Matthias W. Seeger, & Dean P. Foster.
    It’s useful for Gaussian process practitioners to know that their approaches don’t do silly things when viewed from a worst-case frequentist setting. This paper provides some relevant theoretical results.

More NIPS Papers

Let me add to John’s post with a few of my own favourites from this year’s conference. First, let me say that Sanjoy’s talk, Coarse Sample Complexity Bounds for Active Learning, was also one of my favourites, as was the Forgetron paper.

I also really enjoyed the last third of Christos’ talk on the complexity of finding Nash equilibria.

And, speaking of tagging, I think the U.Mass Citeseer replacement system Rexa from the demo track is very cool.

Finally, let me add my recommendations for specific papers:

  • Z. Ghahramani, K. Heller: Bayesian Sets [no preprint]
    (A very elegant probabilistic information retrieval style model
    of which objects are “most like” a given subset of objects.)
  • T. Griffiths, Z. Ghahramani: Infinite Latent Feature Models and
    the Indian Buffet Process [preprint]
    (A Dirichlet style prior over infinite binary matrices with
    beautiful exchangeability properties.)
  • K. Weinberger, J. Blitzer, L. Saul: Distance Metric Learning for
    Large Margin Nearest Neighbor Classification [preprint]
    (A nice idea about how to learn a linear transformation of your
    feature space which brings nearby points of the same class closer
    together and sends nearby points of differing classes further
    apart. Convex. Kilian gave a very nice talk on this.)
  • D. Blei, J. Lafferty: Correlated Topic Models [preprint]
    (Nice trick using the lognormal to induce correlations on the simplex
    applied to topic models for text.)

I’ll also post in the comments a list of other papers that caught my eye but
which I haven’t looked at closely enough to be able to out-and-out
recommend.

Some NIPS papers

Here is a set of papers that I found interesting (and why).

  1. A PAC-Bayes approach to the Set Covering Machine improves the set covering machine. The set covering machine approach is a new way to do classification characterized by a very close connection between theory and algorithm. At this point, the approach seems to be competing well with SVMs in about all dimensions: similar computational speed, similar accuracy, stronger learning theory guarantees, more general information source (a kernel has strictly more structure than a metric), and more sparsity. Developing a classification algorithm is not very easy, but the results so far are encouraging.
  2. Off-Road Obstacle Avoidance through End-to-End Learning and Learning Depth from Single Monocular Images both effectively showed that depth information can be predicted from camera images (using notably different techniques). This ability is strongly enabling because cameras are cheap, tiny, light, and potentially provide longer-range distance information than the laser range finders people traditionally use.
  3. The Forgetron: A Kernel-Based Perceptron on a Fixed Budget proved that a bounded memory kernelized perceptron algorithm (which might be characterizable as “stochastic functional gradient descent with weight decay and truncation”) competes well with respect to an unbounded memory algorithm when the data contains a significant margin. Roughly speaking, this implies that the perceptron approach can learn arbitrary (via the kernel) reasonably simple concepts from unbounded quantities of data.
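To make the "bounded memory" idea in item 3 concrete, here is a minimal sketch of a kernel perceptron that keeps at most a fixed number of stored examples and discards the oldest one when the budget is full. This is not the Forgetron itself: it implements only the truncation part and omits the weight decay step mentioned above, so none of the paper's guarantees apply to it. The class name `BudgetPerceptron`, the RBF kernel, and the `gamma` value are all my own assumptions for illustration.

```python
import math
from collections import deque


def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


class BudgetPerceptron:
    """Kernel perceptron storing at most `budget` mistake examples.

    Simplification: when the budget is full, the oldest stored example
    is dropped. No weight decay is applied before removal.
    """

    def __init__(self, budget, kernel=rbf):
        # deque(maxlen=...) drops the oldest entry automatically on append
        self.support = deque(maxlen=budget)
        self.kernel = kernel

    def score(self, x):
        # Each stored example votes with its label, weighted by similarity
        return sum(y * self.kernel(sx, x) for sx, y in self.support)

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def update(self, x, y):
        """Perceptron step: store (x, y) only when the current model errs."""
        mistake = y * self.score(x) <= 0
        if mistake:
            self.support.append((x, y))
        return mistake


# Two well separated clusters, labels -1 and +1
stream = [((0.0, 0.0), -1), ((3.0, 3.0), 1), ((0.1, 0.0), -1), ((3.0, 3.1), 1)]
model = BudgetPerceptron(budget=4)
mistakes = [model.update(x, y) for x, y in stream]
```

On this toy stream the model errs on the first example from each cluster and then classifies the remaining points correctly, so only two examples are ever stored.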

In addition, Sebastian Thrun’s “How I won the DARPA Grand Challenge” and Sanjoy Dasgupta’s “Coarse Sample Complexity Bounds for Active Learning” talks were both quite interesting.

(Feel free to add any that you found interesting.)

Is the Google way the way for machine learning?

Urs Hoelzle from Google gave an invited presentation at NIPS. In the presentation, he strongly advocates interacting with data in a particular scalable manner which is something like the following:

  1. Make a cluster of machines.
  2. Build a unified filesystem. (Google uses GFS, but NFS or other approaches work reasonably well for smaller clusters.)
  3. Interact with data via MapReduce.

Creating a cluster of machines is, by this point, relatively straightforward.

Unified filesystems are a little bit tricky: GFS is designed so that aggregate throughput to disk is essentially unlimited, while NFS can bottleneck because all of the data has to move through one machine. Nevertheless, this may not be a limiting factor for smaller clusters.

MapReduce is a programming paradigm. Essentially, it is a combination of a data element transform (map) and an aggregator/selector (reduce). These operations are highly parallelizable and the claim is that they support the forms of data interaction which are necessary.
Apparently, the Nutch project has an open source implementation of MapReduce (but this is clearly the most nonstandard element).
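For readers who have not seen the paradigm, here is a minimal single-process sketch of the map and reduce steps described above. It is only an imitation for illustration: it does not use Google's system or the Nutch implementation, and the helper names (`map_reduce`, `word_mapper`, `add`) are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """Imitate map -> group by key -> reduce in a single process."""
    groups = defaultdict(list)
    for record in records:
        # Map: each record is transformed independently into (key, value) pairs
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: the values for each key are aggregated independently
    return {key: reduce(reducer, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def add(a, b):
    return a + b


docs = ["the cat sat", "the dog sat", "the end"]
word_counts = map_reduce(docs, word_mapper, add)
```

Because the mapper looks at one record at a time and the reducer looks at one key at a time, both loops could be spread over many machines, which is where the parallelism comes from.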

Shifting towards this paradigm has several effects:

  1. It makes “big data” applications more viable.
  2. It makes some learning algorithms more viable than others. One way to think about this is in terms of statistical query learning algorithms. The (generalized) notion of statistical query algorithms covers algorithms that rely only upon the expectations of a (relatively small) number of functions. Any such algorithm can be implemented via MapReduce. The “naive bayes” algorithm and most decision tree algorithms can be easily phrased as statistical query algorithms. Support vector machines can (technically) be phrased as statistical query algorithms, but the number of queries scales with the number of datapoints. Gradient descent algorithms can also be phrased as statistical query algorithms. Learning algorithms which work on one example at a time are not generally statistical query algorithms.

    Another way to think about this is in terms of the complexity of the computation. Roughly speaking, as the amount of data scales, only O(n) or (perhaps) O(n log(n)) algorithms are tractable. This strongly favors online learning algorithms. Decision trees and naive bayes are (again) relatively reasonable. Support vector machines (or Gaussian processes) encounter difficulties related to scaling.
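As a rough illustration of the statistical query point, here is a sketch of batch gradient descent for least squares in which each step needs only an average over the data. Each shard computes a partial sum (the map), and the partial sums are added together (the reduce). The sharding, the function names, the learning rate, and the toy data are all assumptions made for this example, and everything runs in one process.

```python
from functools import reduce


def shard_gradient(shard, w):
    """Map step: summed squared-loss gradient and example count for one shard."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad, len(shard)


def combine(a, b):
    """Reduce step: add two partial (gradient, count) pairs."""
    return [ga + gb for ga, gb in zip(a[0], b[0])], a[1] + b[1]


def gradient_step(shards, w, lr):
    """One descent step using only the data-wide average gradient."""
    grad, n = reduce(combine, (shard_gradient(s, w) for s in shards))
    return [wi - lr * g / n for wi, g in zip(w, grad)]


# Toy data from y = 1 + 2*t, with a constant feature, split over two shards
shards = [
    [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0)],
    [((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)],
]
w = [0.0, 0.0]
for _ in range(2000):
    w = gradient_step(shards, w, lr=0.1)
```

The number of queries per step here is the number of weights, independent of the number of examples, which is what makes this kind of algorithm fit the paradigm; each step still costs a full pass over the data.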

There is a reasonable argument that the “low hanging fruit” of machine learning research is in the big data with enabling tools paradigm. This is because (a) the amount of data available has been growing far faster than the amount of computation and (b) we just haven’t had the tools to scale here, until recently.

I expect Urs is right: we should look in this direction.