Machine Learning (Theory)

12/29/2005

Deadline Season

Tags: Announcements, Machine Learning jl@ 5:13 pm

Many different paper deadlines are coming up soon so I made a little reference table. Out of curiosity, I also computed the interval between submission deadline and conference.

Conference Location Date Deadline interval
COLT Pittsburgh June 22-25 January 21 152
ICML Pittsburgh June 26-28 January 30/February 6 140
UAI MIT July 13-16 March 9/March 16 119
AAAI Boston July 16-20 February 16/21 145
KDD Philadelphia August 23-26 March 3/March 10 166

It looks like the northeastern US is the big winner as far as location this year.

12/28/2005

Yet more nips thoughts

Tags: General DrewBagnell@ 12:13 pm

I only managed to make it out to the NIPS workshops this year so
I’ll give my comments on what I saw there.

The Learing and Robotics workshops lives again. I hope it
continues and gets more high quality papers in the future. The
most interesting talk for me was Larry Jackel’s on the LAGR
program (see John’s previous post on said program). I got some
ideas as to what progress has been made. Larry really explained
the types of benchmarks and the tradeoffs that had to be made to
make the goals achievable but challenging.

Hal Daume gave a very interesting talk about structured
prediction using RL techniques, something near and dear to my own
heart. He achieved rather impressive results using only a very
greedy search.

The non-parametric Bayes workshop was great. I enjoyed the entire
morning session I spent there, and particularly (the usually
desultory) discussion periods. One interesting topic was the
Gibbs/Variational inference divide. I won’t try to summarize
especially as no conclusion was reached. It was interesting to
note that samplers are competitive with the variational
approaches for many Dirichlet process problems. One open question
I left with was whether the fast variants of Gibbs sampling could
be made multi-processor as the naive variants can.

I also have a reading list of sorts from the main
conference. Most of the papers mentioned in previous posts on
NIPS are on that list as well as these: (in no particular order)

The Information-Form Data Association Filter
Sebastian Thrun, Brad Schumitsch, Gary Bradski, Kunle Olukotun
[ps.gz][pdf][bibtex]

Divergences, surrogate loss functions and experimental design
XuanLong Nguyen, Martin Wainwright, Michael Jordan [ps.gz][pdf][bibtex]

Generalization to Unseen Cases
Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri [ps.gz][pdf][bibtex]

Gaussian Process Dynamical Models
David Fleet, Jack Wang, Aaron Hertzmann [ps.gz][pdf][bibtex]

Convex Neural Networks
Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau,
Patrice Marcotte [ps.gz][pdf][bibtex]

Describing Visual Scenes using Transformed Dirichlet Processes
Erik Sudderth, Antonio Torralba, William Freeman, Alan Willsky
[ps.gz][pdf][bibtex]

Learning vehicular dynamics, with application to modeling helicopters
Pieter Abbeel, Varun Ganapathi, Andrew Ng [ps.gz][pdf][bibtex]

Tensor Subspace Analysis
Xiaofei He, Deng Cai, Partha Niyogi [ps.gz][pdf][bibtex]

12/27/2005

Automated Labeling

One of the common trends in machine learning has been an emphasis on the use of unlabeled data. The argument goes something like “there aren’t many labeled web pages out there, but there are a huge number of web pages, so we must find a way to take advantage of them.” There are several standard approaches for doing this:

  1. Unsupervised Learning. You use only unlabeled data. In a typical application, you cluster the data and hope that the clusters somehow correspond to what you care about.
  2. Semisupervised Learning. You use both unlabeled and labeled data to build a predictor. The unlabeled data influences the learned predictor in some way.
  3. Active Learning. You have unlabeled data and access to a labeling oracle. You interactively choose which examples to label so as to optimize prediction accuracy.

It seems there is a fourth approach worth serious investigation—automated labeling. The approach goes as follows:

  1. Identify some subset of observed values to predict from the others.
  2. Build a predictor.
  3. Use the output of the predictor to define a new prediction problem.
  4. Repeat…

Examples of this sort seem to come up in robotics very naturally. An extreme version of this is:

  1. Predict nearby things given touch sensor output.
  2. Predict medium distance things given the nearby predictor.
  3. Predict far distance things given the medium distance predictor.

Some of the participants in the LAGR project are using this approach.

A less extreme version was the DARPA grand challenge winner where the output of a laser range finder was used to form a road-or-not predictor for a camera image.

These automated labeling techniques transform an unsupervised learning problem into a supervised learning problem, which has huge implications: we understand supervised learning much better and can bring to bear a host of techniques.

The set of work on automated labeling is sketchy—right now it is mostly just an observed-as-useful technique for which we have no general understanding. Some relevant bits of algorithm and theory are:

  1. Reinforcement learning to classification reductions which convert rewards into labels.
  2. Cotraining which considers a setting containing multiple data sources. When predictors using different data sources agree on unlabeled data, an inferred label is automatically created.

It’s easy to imagine that undiscovered algorithms and theory exist to guide and use this empirically useful technique.

12/22/2005

Yes , I am applying

Tags: General jl@ 6:43 pm

Every year about now hundreds of applicants apply for a research/teaching job with the timing governed by the university recruitment schedule. This time, it’s my turn—the hat’s in the ring, I am a contender, etc… What I have heard is that this year is good in both directions—both an increased supply and an increased demand for machine learning expertise.

I consider this post a bit of an abuse as it is neither about general research nor machine learning. Please forgive me this once.

My hope is that I will learn about new places interested in funding basic research—it’s easy to imagine that I have overlooked possibilities.

I am not dogmatic about where I end up in any particular way. Several earlier posts detail what I think of as a good research environment, so I will avoid a repeat. A few more details seem important:

  1. Application. There is often a tension between basic research and immediate application. This tension is not as strong as might be expected in my case. As evidence, many of my coauthors of the last few years are trying to solve particular learning problems and I strongly care about whether and where a learning theory is useful in practice.
  2. Duration. I would like my next move to be of indefinite duration.

Feel free to email me (jl@hunch.net) if there is a possibility you think I should consider.

12/17/2005

Workshops as Franchise Conferences

Tags: Research jl@ 2:35 pm

Founding a successful new conference is extraordinarily difficult. As a conference founder, you must manage to attract a significant number of good papers—enough to entice the participants into participating next year and to (generally) to grow the conference. For someone choosing to participate in a new conference, there is a very significant decision to make: do you send a paper to some new conference with no guarantee that the conference will work out? Or do you send it to another (possibly less related) conference that you are sure will work?

The conference founding problem is a joint agreement problem with a very significant barrier. Workshops are a way around this problem, and workshops attached to conferences are a particularly effective means for this. A workshop at a conference is sure to have people available to speak and attend and is sure to have a large audience available. Presenting work at a workshop is not generally exclusive: it can also be presented at a conference. For someone considering participation, the only overhead is the direct time and effort involved in participation.

All of the above says that workshops are much easier than conferences, but it does not address a critical question: “Why run a workshop at a conference rather than just a session at the conference?” A session at the conference would have all the above advantages.

There is one more very signficant and direct advantage of a workshop over a special session: workshops are run by people who have a direct and significant interest in their success. The workshop organizers do the hard work of developing a topic, soliciting speakers, and deciding what the program will be. Reputations for the workshop organizer are then built on the success or flop of the workshop. This “direct and signficant interest” aspect of a workshop is the basic reason why franchise systems (think 7-11 or McDonalds) are common and successful.

What does this observation imply about how things could be? For example, we could imagine a conference that is “all workshops”. Instead of having a program committee and program chair, the conference might just have a program chair that accepts or rejects workshop chairs who then organize their own workshop/session. This mode doesn’t seem to exist which is always cautioning, but on the other hand it ’s not clear this mode has even been tried. NIPS is probably the conference closest to using this approach. For example, a significant number of people attend only the workshops at NIPS.

12/14/2005

More NIPS Papers II

Tags: Papers zoubin@ 11:47 pm

I thought this was a very good NIPS with many excellent papers. The following are a few NIPS papers which I liked and I hope to study more carefully when I get the chance. The list is not exhaustive and in no particular order…

  • Preconditioner Approximations for Probabilistic Graphical Models.
    Pradeeep Ravikumar and John Lafferty.
    I thought the use of preconditioner methods from solving linear systems in the context of approximate inference was novel and interesting. The results look good and I’d like to understand the limitations.
  • Rodeo: Sparse nonparametric regression in high dimensions.
    John Lafferty and Larry Wasserman.
    A very interesting approach to feature selection in nonparametric regression from a frequentist framework. The use of lengthscale variables in each dimension reminds me a lot of ‘Automatic Relevance Determination’ in Gaussian process regression — it would be interesting to compare Rodeo to ARD in GPs.
  • Interpolating between types and tokens by estimating power law generators.
    Goldwater, S., Griffiths, T. L., & Johnson, M.
    I had wondered how Chinese restaurant processes and Pitman-Yor processes related to Zipf’s plots and power laws for word frequencies. This paper seems to have the answers.
  • A Bayesian spatial scan statistic.
    Daniel B. Neill, Andrew W. Moore, and Gregory F. Cooper.
    When I first learned about spatial scan statistics I wondered what a Bayesian counterpart would be. I liked the fact they their method was simple, more accurate, and much faster than the usual frequentist method.
  • Q-Clustering.
    M. Narasimhan, N. Jojic and J. Bilmes.
    A very interesting application of sub-modular function optimization to clustering. This feels like a hot area.
  • Worst-Case Bounds for Gaussian Process Models.
    Sham M. Kakade, Matthias W. Seeger, & Dean P. Foster.

    It’s useful for Gaussian process practitioners to know that their approaches don’t do silly things when viewed from a worst-case frequentist setting. This paper provides some relevant theoretical results.

12/11/2005

More NIPS Papers

Tags: Papers roweis@ 1:14 am

Let me add to John’s post with a few of my own favourites
from this year’s conference. First, let me say that
Sanjoy’s talk, Coarse Sample Complexity Bounds for Active
Learning
was also one of my favourites, as was the

Forgettron paper
.

I also really enjoyed the last third of
Christos’ talk
on the complexity of finding Nash equilibria.

And, speaking of tagging, I think
the U.Mass Citeseer replacement system
Rexa from the demo track is very cool.

Finally, let me add my recommendations for specific papers:

  • Z. Ghahramani, K. Heller: Bayesian Sets
    [no preprint]
    (A very elegant probabilistic information retrieval style model
    of which objects are “most like” a given subset of objects.)
  • T. Griffiths, Z. Ghahramani: Infinite Latent Feature Models and
    the Indian Buffet Process

    [
    preprint
    ]
    (A Dirichlet style prior over infinite binary matrices with
    beautiful exchangeability properties.)
  • K. Weinberger, J. Blitzer, L. Saul: Distance Metric Learning for
    Large Margin Nearest Neighbor Classification

    [
    preprint
    ]
    (A nice idea about how to learn a linear transformation of your
    feature space which brings nearby points of the same class closer
    together and sends nearby points of differing classes further
    apart. Convex. Kilian gave a very nice talk on this.)
  • D. Blei, J. Lafferty: Correlated Topic Models
    [
    preprint
    ]
    (Nice trick using the lognormal to induce correlations on the simplex
    applied to topic models for text.)

I’ll also post in the comments a list of other papers that caught my eye but
which I haven’t looked at closely enough to be able to out-and-out
recommend.

12/9/2005

Some NIPS papers

Tags: Papers jl@ 4:46 pm

Here is a set of papers that I found interesting (and why).

  1. A PAC-Bayes approach to the Set Covering Machine improves the set covering machine. The set covering machine approach is a new way to do classification characterized by a very close connection between theory and algorithm. At this point, the approach seems to be competing well with SVMs in about all dimensions: similar computational speed, similar accuracy, stronger learning theory guarantees, more general information source (a kernel has strictly more structure than a metric), and more sparsity. Developing a classification algorithm is not very easy, but the results so far are encouraging.
  2. Off-Road Obstacle Avoidance through End-to-End Learning and Learning Depth from Single Monocular Images both effectively showed that depth information can be predicted from camera images (using notably different techniques). This ability is strongly enabling because cameras are cheap, tiny, light, and potentially provider longer range distance information than the laser range finders people traditionally use.
  3. The Forgetron: A Kernel-Based Perceptron on a Fixed Budget proved that a bounded memory kernelized perceptron algorithm (which might be characterizable as “stochastic functional gradient descent with weight decay and truncation”) competes well with respect to an unbounded memory algorithm when the data contains a significant margin. Roughly speaking, this implies that the perceptron approach can learn arbitary (via the kernel) reasonably simple concepts from unbounded quantities of data.

In addition, Sebastian Thrun’s “How I won the Darpa Grand Challenge” and Sanjoy Dasgupta’s “Coarse Sample Complexity for Active Learning” talks were both quite interesting.

(Feel free to add any that you found interesting.)

Machine Learning Thoughts

Tags: General jl@ 12:04 am

I added a link to Olivier Bousquet’s machine learning thoughts blog. Several of the posts may be of interest.

12/7/2005

Is the Google way the way for machine learning?

Tags: General jl@ 5:36 pm

Urs Hoelzle from Google gave an invited presentation at NIPS. In the presentation, he strongly advocates interacting with data in a particular scalable manner which is something like the following:

  1. Make a cluster of machines.
  2. Build a unified filesystem. (Google uses GFS, but NFS or other approaches work reasonably well for smaller clusters.)
  3. Interact with data via MapReduce.

Creating a cluster of machines is, by this point, relatively straightforward.

Unified filesystems are a little bit tricky—GFS is capable by design of essentially unlimited speed throughput to disk. NFS can bottleneck because all of the data has to move through one machine. Nevertheless, this may not be a limiting factor for smaller clusters.

MapReduce is a programming paradigm. Essentially, it is a combination of a data element transform (map) and an agreggator/selector (reduce). These operations are highly parallelizable and the claim is that they support the forms of data interaction which are necessary.
Apparently, the Nutch project has an open source implementation of mapreduce (but this is clearly the most nonstandard element).

Shifting towards this paradigm has several effects:

  1. It makes “big data” applications more viable.
  2. It makes some learning algorithms more viable than others. One way to think about this is in terms of statistical query learning algorithms. The (generalized) notion of statistical query algorithms is algorithms that rely upon only the results of expections of a (relatively small) number of functions. Any such algorithm can be implemented via mapreduce. The “naive bayes” algorithm and most decision tree algorithms can be easily phrased as statistical query algorithms. Support vector machines can (technically) be phrased as statistical query algorithms, but the number of queries scales with the number of datapoints. Gradient descent algorithms can also be phrased as statistical query algorithms. Learning algorithms which work on one example at a time are not generally statistical query algorithms.

    Another way to think about this is in terms of the complexity of the computation. Roughly speaking, as the amount of data scales, only O(n) or (perhaps) O(n log(n)) algorithms are tractable. This strongly favors online learning algorithms. Decision trees and naive bayes are (again) relatively reasonable. Support vector machines (or gaussian processes) encounter difficulties related to scaling.

There is a reasonable argument that the “low hanging fruit” of machine learning research is in the big data with enabling tools paradigm. This is because (a) the amount of data available has been growing far faster than the amount of computation and (b) we just haven’t had the tools to scale here, until recently.

I expect Urs is right: we should look in this direction.

12/4/2005

Watchword: model

Tags: Bayesian, Definitions jl@ 10:16 pm

In everyday use a model is a system which explains the behavior of some system, hopefully at the level where some alteration of the model predicts some alteration of the real-world system. In machine learning “model” has several variant definitions.

  1. Everyday. The common definition is sometimes used.
  2. Parameterized. Sometimes model is a short-hand for “parameterized model”. Here, it refers to a model with unspecified free parameters. In the Bayesian learning approach, you typically have a prior over (everyday) models.
  3. Predictive. Even further from everyday use is the predictive model. Examples of this are “my model is a decision tree” or “my model is a support vector machine”. Here, there is no real sense in which an SVM explains the underlying process. For example, an SVM tells us nothing in particular about how alterations to the real-world system would create a change.

Which definition is being used at any particular time is important information. For example, if it’s a parameterized or predictive model, this implies some learning is required. If it’s a predictive model, then the set of operations which can be done to the model are restricted with respect to everyday usage. I don’t have any particular advice here other than “watch out”—be aware of the distinctions, watch for this source of ambiguity, and clarify when necessary.

12/1/2005

The Webscience Future

Tags: General jl@ 10:47 am

The internet has significantly effected the way we do research but it’s capabilities have not yet been fully realized.

First, let’s acknowledge some known effects.

  1. Self-publishing By default, all researchers in machine learning (and more generally computer science and physics) place their papers online for anyone to download. The exact mechanism differs—physicists tend to use a central repository (Arxiv) while computer scientists tend to place the papers on their webpage. Arxiv has been slowly growing in subject breadth so it now sometimes used by computer scientists.
  2. Collaboration Email has enabled working remotely with coauthors. This has allowed collaborationis which would not otherwise have been possible and generally speeds research.

Now, let’s look at attempts to go further.

  1. Blogs (like this one) allow public discussion about topics which are not easily categorized as “a new idea in machine learning” (like this topic).
  2. Organization of some subfield of research. This includes Satinder Singh’s Reinforcement Learning pages, and, more generally books that have been placed online such as this one.
  3. Discussion Groups The kernel machines discussions provide a good example of some common format allowing discussion.
  4. Class notes have been placed online such as Avrim’s learning theory lecture notes.
  5. Wikipedia has an article on Machine Learning. The article gives a reasonable quick overview and is surely read by a very large number of people.
  6. Online Proceedings are now being used by several conferences such as NIPS.

Now, let’s consider some futures.

  1. Wikifuture Wikipedia becomes better to the point where it is a completely comprehensive listing of current research in machine learning. At some point, we-the-community realize this and begin to emphasize (and credit) information placed in wikipedia. This process reinforces itself to the point where “if it’s not in wikipedia, it doesn’t exist”.

    This future is significantly more probable than most people understand. As evidence compare the machine learning page three years ago (yep, it didn’t exist), two years ago, one year ago, and today. That progression strongly suggests that wikipedia:machine learning will continue to grow into a significant information resource.

    There are fundamental obstacles to the success of the wikipedia future.

    1. credit Wikipedia has only very weak mechanisms for crediting editors. A list of the changes done by one user account is about as much credit as is available. This is not enough to make career-deciding questions on. We could hope for a stronger link between identity and editor along with tools to track the value of particular edits (Think of counting hyperlinks as an analogue for counting citations).
    2. controversy Wikipedia has grown up in a nonmanipulative environment. When it was little known, the incentive to fabricate entries was not great. Now that it is becoming well known that incentive is growing. Character assasination by false article exists. In science, the thing to worry about is misplaced ideas of the importance of your topic of research since it is very difficult to be sufficiently interested in a research topic and simultaneously view it objectively. Research is about creating new ideas, and the location of these ideas in some general organization is in dispute by default.
  2. Evolutionary Progression Consider the following sequence of steps.
    1. Conference Organization We realize that having a list of online papers isn’t nearly as useful as having an organized list of online papers so the conferences which already have online proceedings create an explorable topic hierarchy.
    2. Time Organization We realize that the organization at one particular year’s conference is sketchy—research is a multiyear endeavor. Consequently, we start adding to last years topic hierarchy rather than creating a new one from scratch each year.
    3. Transformation We realize that it is better if papers are done in the language of the web. For example, it’s very handy to be able to hyperlink inside of a paper. A good solution to the math on the web problem would greatly help here.
    4. Consolidation We realize that there is a lot of redundancy in two papers on the same or a similar topic. They share an introduction, motivation, and (often) definitions. By joining the shared pieces, the contents of both papers can be made clearer.

    Each of these individual steps clearly yields something better. At the end of these steps, creating a paper is simply the process of creating a webpage or altering an existing webpage. We can imagine doing all of this while keeping the peer-review mechanisms of science intact, so the resulting process is simply better in all ways. It’s easier to author because for most papers much of the “filler” introduction/motivation/definition can be reused from previous papers. It’s easier to review, because reviewers can consider the result in context. Much of the difficulty of reviewing is simply due to the author and reviewer not being “on the same page” in how they understand things. An organized topic hierarchy greatly aids this.

  3. The unknown It is difficult to anticipate everything. What other futures might exist?

Which future comes about is dependent on many things—the decisions of community leaders, enabling ‘math-on-the-web’ technologies, etc…, so it is difficult to predict which future and when it will come about. Nevertheless, a potential exists and there are several paths leading towards reaching that potential.

Powered by WordPress