Automated Labeling

One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this:

  1. Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about.
  2. Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way.
  3. Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy.

It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows:

  1. Identify some subset of observed values to predict from the others.
  2. Build a predictor.
  3. Use the output of the predictor to define a new prediction problem.
  4. Repeat…
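
To make the loop concrete, here is a minimal sketch in Python. The train routine, the dict-of-observations representation, and the "predicted_" naming are all hypothetical placeholders, not any particular system:

    # Minimal sketch of the automated labeling loop. `train` is a hypothetical
    # stand-in for any supervised learning algorithm: it takes features and
    # labels and returns a predictor (a callable).
    def automated_labeling(data, target, rounds, train):
        """data: list of dicts of observed values; target: key to predict first."""
        predictors = []
        for _ in range(rounds):
            # 1. Identify a subset of observed values (the current target)
            #    to predict from the others.
            X = [{k: v for k, v in x.items() if k != target} for x in data]
            y = [x[target] for x in data]
            # 2. Build a predictor.
            h = train(X, y)
            predictors.append(h)
            # 3. The predictor's outputs become new observed values,
            #    defining the next prediction problem.
            target = "predicted_" + target
            for x, features in zip(data, X):
                x[target] = h(features)
        # 4. Repeat... (handled by the loop above)
        return predictors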

Examples of this sort seem to come up in robotics very naturally. An extreme version of this is:

  1. Predict nearby things given touch sensor output.
  2. Predict medium distance things given the nearby predictor.
  3. Predict far distance things given the medium distance predictor.

Some of the participants in the LAGR project are using this approach.

A less extreme version appeared in the DARPA grand challenge winner, where the output of a laser range finder was used to build a road-or-not predictor for camera images.
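
As a hedged illustration of that label-transfer pattern (an invented toy, not the winning team’s actual pipeline), the laser’s short-range geometry can label image patches which then train a long-range camera predictor:

    # Sketch: a short-range sensor (laser range finder) labels training data
    # for a long-range one (camera). The flatness rule is an invented placeholder.
    def label_patches_with_lidar(patches, flatness_threshold=0.05):
        """patches: list of (pixels, lidar_heights) pairs for image regions
        the laser can see. Returns (pixels, is_road) supervised pairs."""
        labeled = []
        for pixels, heights in patches:
            # A nearly flat height profile under a patch suggests drivable road.
            is_road = max(heights) - min(heights) < flatness_threshold
            labeled.append((pixels, is_road))
        return labeled
    # These pairs feed an ordinary supervised learner, whose camera-based
    # predictions then extend "road or not" far beyond the laser's range.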

These automated labeling techniques transform an unsupervised learning problem into a supervised learning problem, which has huge implications: we understand supervised learning much better and can bring to bear a host of techniques.

The set of work on automated labeling is sketchy—right now it is mostly just an observed-as-useful technique for which we have no general understanding. Some relevant bits of algorithm and theory are:

  1. Reinforcement learning to classification reductions, which convert rewards into labels.
  2. Cotraining, which considers a setting with multiple data sources. When predictors built on different data sources agree on unlabeled data, an inferred label is automatically created (a sketch of this step follows the list).
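
A minimal sketch of the co-training labeling step, with train again a hypothetical stand-in for any supervised learner (real co-training, as in Blum and Mitchell’s formulation, adds confidence thresholds and iterates):

    # Two predictors, each trained on a different view (data source), label an
    # unlabeled example whenever they agree on it.
    def cotrain_label(labeled, unlabeled, train):
        """labeled: list of ((view1, view2), y); unlabeled: list of (view1, view2)."""
        h1 = train([v1 for (v1, _), _ in labeled], [y for _, y in labeled])
        h2 = train([v2 for (_, v2), _ in labeled], [y for _, y in labeled])
        inferred = []
        for v1, v2 in unlabeled:
            # Agreement across views is treated as an automatically created label.
            if h1(v1) == h2(v2):
                inferred.append(((v1, v2), h1(v1)))
        return inferred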

It’s easy to imagine that undiscovered algorithms and theory exist to guide and use this empirically useful technique.

Yes, I am applying

Every year about now, hundreds of people apply for research/teaching jobs, with the timing governed by the university recruitment schedule. This time, it’s my turn—the hat’s in the ring, I am a contender, etc… What I have heard is that this year is good in both directions—both an increased supply of and an increased demand for machine learning expertise.

I consider this post a bit of an abuse as it is neither about general research nor machine learning. Please forgive me this once.

My hope is that I will learn about new places interested in funding basic research—it’s easy to imagine that I have overlooked possibilities.

I am not dogmatic about where I end up. Several earlier posts detail what I think of as a good research environment, so I will avoid a repeat. A few more details seem important:

  1. Application. There is often a tension between basic research and immediate application. This tension is not as strong as might be expected in my case. As evidence, many of my coauthors from the last few years are trying to solve particular learning problems, and I strongly care about whether and where a learning theory is useful in practice.
  2. Duration. I would like my next move to be of indefinite duration.

Feel free to email me (jl@hunch.net) if there is a possibility you think I should consider.

Workshops as Franchise Conferences

Founding a successful new conference is extraordinarily difficult. As a conference founder, you must manage to attract a significant number of good papers—enough to entice participants into returning next year and (generally) to grow the conference. For someone choosing to participate in a new conference, there is a very significant decision to make: do you send a paper to some new conference with no guarantee that the conference will work out? Or do you send it to another (possibly less related) conference that you are sure will work?

The conference founding problem is a joint agreement problem with a very significant barrier. Workshops are a way around this problem, and workshops attached to conferences are a particularly effective means for this. A workshop at a conference is sure to have people available to speak and attend, and is sure to have a large audience. Presenting work at a workshop is not generally exclusive: it can also be presented at a conference. For someone considering participation, the only overhead is the direct time and effort involved.

All of the above says that workshops are much easier than conferences, but it does not address a critical question: “Why run a workshop at a conference rather than just a session at the conference?” A session at the conference would have all the above advantages.

There is one more very significant and direct advantage of a workshop over a special session: workshops are run by people who have a direct and significant interest in their success. The workshop organizers do the hard work of developing a topic, soliciting speakers, and deciding what the program will be. The organizers’ reputations are then built on the success or flop of the workshop. This “direct and significant interest” aspect of a workshop is the basic reason why franchise systems (think 7-Eleven or McDonald’s) are common and successful.

What does this observation imply about how things could be? For example, we could imagine a conference that is “all workshops”. Instead of having a program committee and program chair, the conference might just have a program chair who accepts or rejects workshop chairs, who then organize their own workshop/session. This mode doesn’t seem to exist, which is always cautionary, but on the other hand it’s not clear this mode has even been tried. NIPS is probably the conference closest to using this approach. For example, a significant number of people attend only the workshops at NIPS.

More NIPS Papers II

I thought this was a very good NIPS with many excellent papers. The following are a few NIPS papers which I liked and hope to study more carefully when I get the chance. The list is not exhaustive and in no particular order…

  • Preconditioner Approximations for Probabilistic Graphical Models.
    Pradeep Ravikumar and John Lafferty.
    I thought the use of preconditioner methods from solving linear systems in the context of approximate inference was novel and interesting. The results look good and I’d like to understand the limitations.
  • Rodeo: Sparse nonparametric regression in high dimensions.
    John Lafferty and Larry Wasserman.
    A very interesting approach to feature selection in nonparametric regression from a frequentist framework. The use of lengthscale variables in each dimension reminds me a lot of ‘Automatic Relevance Determination’ in Gaussian process regression — it would be interesting to compare Rodeo to ARD in GPs.
  • Interpolating between types and tokens by estimating power law generators.
    Goldwater, S., Griffiths, T. L., & Johnson, M.
    I had wondered how Chinese restaurant processes and Pitman-Yor processes related to Zipf’s plots and power laws for word frequencies. This paper seems to have the answers. (A brief note on the power-law connection follows this list.)
  • A Bayesian spatial scan statistic.
    Daniel B. Neill, Andrew W. Moore, and Gregory F. Cooper.
    When I first learned about spatial scan statistics I wondered what a Bayesian counterpart would be. I liked the fact that their method was simple, more accurate, and much faster than the usual frequentist method.
  • Q-Clustering.
    M. Narasimhan, N. Jojic and J. Bilmes.
    A very interesting application of submodular function optimization to clustering. This feels like a hot area.
  • Worst-Case Bounds for Gaussian Process Models.
    Sham M. Kakade, Matthias W. Seeger, & Dean P. Foster.
    It’s useful for Gaussian process practitioners to know that their approaches don’t do silly things when viewed from a worst-case frequentist setting. This paper provides some relevant theoretical results.
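
On the types/tokens bullet above: the power-law connection can be stated compactly. As a hedged note from memory (a standard asymptotic for the two-parameter process, not quoted from the paper), under a Pitman-Yor process with discount d and concentration θ, the expected number of distinct types K_n among n tokens grows as

    \mathbb{E}[K_n] \;\sim\; \frac{\Gamma(\theta + 1)}{d\,\Gamma(\theta + d)}\, n^d, \qquad 0 < d < 1,

while the d = 0 case (the Chinese restaurant process of a Dirichlet process) gives only logarithmic growth, \mathbb{E}[K_n] \sim \theta \log n. The n^d rate is the power-law behavior matching Zipf-style word frequency plots.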

More NIPS Papers

Let me add to John’s post with a few of my own favourites from this year’s conference. First, let me say that Sanjoy’s talk, Coarse Sample Complexity Bounds for Active Learning, was also one of my favourites, as was the Forgetron paper.

I also really enjoyed the last third of Christos’ talk on the complexity of finding Nash equilibria.

And, speaking of tagging, I think the U.Mass Citeseer replacement system Rexa from the demo track is very cool.

Finally, let me add my recommendations for specific papers:

  • Z. Ghahramani, K. Heller: Bayesian Sets [no preprint]
    (A very elegant probabilistic information retrieval style model of which objects are “most like” a given subset of objects.)
  • T. Griffiths, Z. Ghahramani: Infinite Latent Feature Models and the Indian Buffet Process [preprint]
    (A Dirichlet style prior over infinite binary matrices with beautiful exchangeability properties.)
  • K. Weinberger, J. Blitzer, L. Saul: Distance Metric Learning for Large Margin Nearest Neighbor Classification [preprint]
    (A nice idea about how to learn a linear transformation of your feature space which brings nearby points of the same class closer together and sends nearby points of differing classes further apart. Convex. Kilian gave a very nice talk on this. A rough sketch of the objective follows this list.)
  • D. Blei, J. Lafferty: Correlated Topic Models [preprint]
    (Nice trick using the lognormal to induce correlations on the simplex, applied to topic models for text.)
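
For the Weinberger, Blitzer, and Saul bullet above, here is a rough sketch of the convex objective, reconstructed from memory rather than quoted from the paper. With a Mahalanobis metric d_M(x, x') = (x - x')^T M (x - x') and M constrained positive semidefinite, one solves roughly

    \min_{M \succeq 0} \; \sum_{i,\, j \leadsto i} d_M(x_i, x_j) \;+\; c \sum_{i,\, j \leadsto i,\, l} (1 - y_{il}) \big[ 1 + d_M(x_i, x_j) - d_M(x_i, x_l) \big]_+

where j ⇝ i marks the target neighbors of x_i, y_{il} indicates whether x_i and x_l share a label, and [z]_+ = max(z, 0). The first term pulls same-class target neighbors together; the hinge term pushes differently labeled points at least a unit margin further away. The positive semidefinite constraint on M makes this a semidefinite program, hence convex.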

I’ll also post in the comments a list of other papers that caught my eye but which I haven’t looked at closely enough to be able to out-and-out recommend.