NIPS 2010

I enjoyed attending NIPS this year, and several things interested me. For the conference itself:

  1. Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona, The Multidimensional Wisdom of Crowds. This paper is about using Mechanical Turk to gather label information, with results superior to a simple majority vote approach.
  2. David McAllester, Tamir Hazan, and Joseph Keshet, Direct Loss Minimization for Structured Prediction. This is about another technique for directly optimizing the loss in structured prediction, with an application to speech recognition.
  3. Mohammad Saberian and Nuno Vasconcelos, Boosting Classifier Cascades. This is about an algorithm for simultaneously optimizing loss and computation when constructing a classifier cascade. There were several other papers on cascades which are worth looking at if you are interested.
  4. Alan Fern and Prasad Tadepalli, A Computational Decision Theory for Interactive Assistants. This paper carves out a class of natural non-MDP problems and shows that an RL-style solution for them is tractable. It's good to see people moving beyond MDPs, which at this point are both well understood and limited.
  5. Oliver Williams and Frank McSherry, Probabilistic Inference and Differential Privacy. This paper is about a natural, relatively unexplored, and potentially dominant approach to achieving both differential privacy and learning.

I also attended two workshops, Coarse-To-Fine and LCCC, which were a fine combination. The first was about more efficient (and sometimes more effective) methods for learning that start with coarse information and then refine it, while the second was about parallelization and distribution of learning algorithms. Together, they were about how to learn fast and effective solutions.

The CtF workshop could have been named "Integrating breadth-first search and learning". I was somewhat (I hope not too) pesky, bringing up Searn repeatedly during questions, since it seems quite plausible that a good application of Searn would compete with, and perhaps improve on, results from several of the talks. Eventually, I hope the conventional wisdom shifts to a belief that search and learning must be integrated for efficiency and robustness reasons. The talks in this workshop were uniformly strong in making that case. I was particularly interested in Drew's talk on a plausible improvement on Searn.

The level of agreement in approaches at the LCCC workshop was much lower, with people discussing many radically different approaches.

  1. Should data be organized by feature partition or example partition? Fernando points out that the number of features often scales sublinearly in the number of examples, implying that an example partition addresses scale better. However, basic learning theory tells us that if the number of parameters scales sublinearly in the number of examples, then the value of additional examples asymptotes, implying a mismatched solution design. My experience is that a 'not enough features' problem can be dealt with by throwing in all the features you previously couldn't properly use, for example personalization features. (A small sketch contrasting the two partitions appears after this list.)
  2. How can we best leverage existing robust distributed filesystem/MapReduce frameworks? There was near unanimity on the belief that MapReduce itself is of limited value for machine learning, but the step forward is unclear. I liked what Markus said: that no one wants to abandon the ideas of robustly storing data and moving small amounts of code to large amounts of data. The best way to leverage this capability to build great algorithms remains unclear to me.
  3. Every speaker agreed that their own approach was fast, but there was great disagreement about what "fast" means in an absolute sense. This forced me to think about an absolute measure of (input complexity)/(time), where results between 100 features/s and 10*10^6 features/s were considered "fast" depending on who was speaking. This scale disparity is remarkably extreme. A related detail is that the strength of the baseline algorithms varies greatly.
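To make the partitioning question in item 1 concrete, here is a minimal sketch of the two data layouts for a linear predictor. It is a toy illustration of my own (not code from any of the talks): an example partition lets each node form complete predictions locally, while a feature partition requires summing partial dot products across nodes for every example.

```python
# Toy illustration of item 1's two layouts; all names here are mine.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 50))   # 1000 examples x 50 features
w = rng.random(50)           # weights of a linear model
nodes = 4

# Example partition: each node holds all features for a subset of examples,
# so it can compute complete predictions locally; only gradients or weights
# need to be communicated between nodes.
example_shards = np.array_split(X, nodes, axis=0)
local_predictions = [shard @ w for shard in example_shards]

# Feature partition: each node holds a subset of features for all examples,
# so every prediction requires aggregating partial dot products across nodes.
feature_shards = np.array_split(X, nodes, axis=1)
weight_shards = np.array_split(w, nodes)
partials = [Xj @ wj for Xj, wj in zip(feature_shards, weight_shards)]
predictions = np.sum(partials, axis=0)
```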

I hope we’ll discover convincing answers to these questions in the near future.

Vowpal Wabbit, version 5.0, and the second heresy

I’ve released version 5.0 of the Vowpal Wabbit online learning software. The major number has changed since the last release because I regard all earlier versions as obsolete—there are several new algorithms & features including substantial changes and upgrades to the default learning algorithm.

The biggest changes are new algorithms:

  1. Nikos and I improved the default algorithm. The basic update rule still uses gradient descent, but the size of the update is carefully controlled so that it's impossible to overrun the label. In addition, the normalization has changed. Computationally, these changes are virtually free and yield better results, sometimes much better. Less careful updates can be reenabled with --loss_function classic, although results are still not identical to previous versions due to the normalization changes.
  2. Nikos also implemented per-feature learning rates as described in these two papers. Often, this works better than the default algorithm. It isn't the default because it isn't (yet) as adaptable in terms of learning rate decay. This is enabled with --adaptive, and the learned regressors are compatible with the default. Computationally, you might see a factor of 4 slowdown if using '-q'. Nikos noticed that the phenomenal Quake inverse square root hack applies, making this substantially faster than a naive implementation. (A rough sketch of the flavor of items 1 and 2 appears after this list.)
  3. Nikos and Daniel also implemented active learning derived from this paper, usable via --active_simulation (to test parameters on an existing supervised dataset) or --active_learning (to do the real thing). This runs at full speed, which is much faster than is reasonable in any real active learning scenario. We see this approach dominating supervised learning on all classification datasets so far, often with far fewer labeled examples required, as the theory predicts. The learned predictor is compatible with the default.
  4. Olivier helped me implement preconditioned conjugate gradient based on Jonathan Shewchuk's tutorial. This is a batch algorithm and hence requires multiple passes over any dataset to do something useful. Each step of conjugate gradient requires 2 passes. The advantage of cg is that it converges relatively quickly via the use of second derivative information. This can be particularly helpful if your features are of widely differing scales. The use of --regularization 0.001 (or smaller) is almost required with --conjugate_gradient, as it will otherwise overfit hard. This implementation has two advantages over the basic approach: it implicitly computes Hessian-vector products in O(n) time, where n is the number of features, and it operates out of core, making it applicable to datasets that don't conveniently fit in RAM. The learned predictor is compatible with the default, although you'll notice that a factor of 8 more RAM is required when learning. (A sketch of preconditioned conjugate gradient in this spirit also appears after the list.)
  5. Matt Hoffman and I implemented Online Latent Dirichlet Allocation. This code is still experimental and likely to change over the next week. It really does a minibatch update under the hood. The code appears to be substantially faster than Matt's earlier Python implementation, making this probably the most efficient LDA implementation anywhere. LDA is still much slower than online linear learning, as it is quite computationally heavy in comparison; it is perhaps a good candidate for GPU optimization.
  6. Nikos, Daniel, and I have been experimenting with more online cluster parallel learning algorithms (--corrective, --backprop, --delayed_global). We aren't yet satisfied with these, although they are improving. Details are at the LCCC workshop.
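To give the flavor of the updates in items 1 and 2, here is a minimal sketch under the simplifying assumptions of squared loss and dense feature vectors. It is my own illustration, not VW's actual code (in particular, the normalization changes are not shown).

```python
# Illustrative only; assumes squared loss and dense numpy vectors.
import numpy as np

def capped_sgd_step(w, x, y, eta):
    """Item 1's idea, roughly: a gradient step whose size is capped so the
    new prediction cannot overrun the label y. For squared loss the updated
    prediction is pred - eta*(pred - y)*(x@x), which stays on the same side
    of y whenever eta <= 1/(x@x)."""
    pred = w @ x
    safe_eta = min(eta, 1.0 / max(x @ x, 1e-12))
    return w - safe_eta * (pred - y) * x

def adaptive_sgd_step(w, g2, x, y, eta):
    """Item 2's idea, roughly: per-feature learning rates that shrink with
    each feature's accumulated squared gradient (an AdaGrad-style rule)."""
    pred = w @ x
    grad = (pred - y) * x
    g2 = g2 + grad ** 2                   # running per-feature statistic
    w = w - eta * grad / np.sqrt(g2 + 1e-12)
    return w, g2
```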
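Similarly, here is a compact sketch of preconditioned conjugate gradient for regularized squared loss (item 4). Again this is my own illustration rather than VW's implementation: VW streams over examples instead of holding the data matrix in memory, but the structure of the iteration, including the fact that each Hessian-vector product costs two passes over the data, is the same.

```python
# Jacobi-preconditioned CG for ridge regression: solve (X^T X + lam*I) w = X^T y.
import numpy as np

def pcg_ridge(X, y, lam=1e-3, iters=20):
    d = X.shape[1]
    hess_vec = lambda v: X.T @ (X @ v) + lam * v       # two passes: X@v, then X.T@(.)
    b = X.T @ y
    m_inv = 1.0 / (np.einsum('ij,ij->j', X, X) + lam)  # diagonal preconditioner
    w = np.zeros(d)
    r = b - hess_vec(w)
    z = m_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(iters):
        Ap = hess_vec(p)
        alpha = rz / (p @ Ap)
        w += alpha * p
        r -= alpha * Ap
        z = m_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return w
```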

In addition, Ariel added a test suite, Shravan helped with ngrams, and there are several other minor new features and bug fixes including a very subtle one caught by Vaclav.

The documentation on the website hasn't kept up with the code. I'm planning to rectify that over the next week, and I'll give a new tutorial starting at 2pm in the LCCC room for those interested. Yes, I'll not be skiing 🙂

To Videolecture or not

(update: cross-posted on CACM)

For the first time in several years, ICML 2010 did not have videolecture coverage. Luckily, the tutorial on exploration and learning which Alina and I put together can be viewed, since we also presented it at KDD 2010, which included videolecture support.

ICML didn't cover the cost of a videolecture, because PASCAL didn't provide a grant for it this year. On the other hand, KDD covered it out of registration costs. Videolecture coverage isn't cheap. For a workshop, the baseline quote we have is 270 euro per hour, plus a similar cost for the cameraman's travel and accommodation. This can be reduced substantially by having a volunteer with a camera handle the cameraman duties, then uploading the video and slides to be processed for a quoted 216 euro per hour.

YouTube is the dominant free video site with a cost of $0, but it turns out to be a poor alternative. Its 15 minute upload limit does not match typical talk lengths. Videolectures also provides side-by-side synchronized slides and video, which allows quick navigation of the videostream and acceptable resolution of typical talk slides. Overall, these benefits are substantial enough that YouTube is not presently a serious alternative.

So, if we can't avoid paying the cost, is it worthwhile? One way to judge this is by comparing how much authors currently spend traveling to a conference and presenting research vs. the size of the audience. Costs vary wildly, but for a typical international academic conference, airfare, hotel, and registration are commonly at least $1000 even after scrimping. The size of audiences also varies substantially, but something in the 30-100 range is a typical average. For KDD 2010, the average number of views per presentation is 14.6, but this is misleadingly low, as the KDD presentations were only recently posted. A better number comes from KDD 2009, where the average is presently 74.2 views. This number seems representative, with ICML 2009 presently averaging 115.8. We can argue about the relative merits of online vs. in-person viewing, but the ordering of their value is at least unclear, since in an online system people specifically seek out lectures to view, while at the conference itself people are often opportunistic viewers. Valuing these equally, videolectures increases the size of the audience, and hence the value to authors, by perhaps a factor of 2, for a cost around 1/3 of current presentation costs.
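Spelling that last sentence out as a back-of-the-envelope calculation, using the numbers above (the euro-to-dollar conversion and the comparison of one hour of coverage against one trip are my assumptions):

```python
# Rough numbers from the post; the exchange rate is an assumption.
trip_cost_usd      = 1000         # airfare + hotel + registration, a lower bound
in_person_audience = 65           # midpoint of the 30-100 range
online_views       = 95           # roughly between KDD 2009 (74.2) and ICML 2009 (115.8)
video_cost_usd     = 270 * 1.3    # ~270 EUR/hour at ~1.3 USD/EUR, about $350

audience_ratio = (in_person_audience + online_views) / in_person_audience
cost_ratio = video_cost_usd / trip_cost_usd
print(f"audience grows ~{audience_ratio:.1f}x for ~{cost_ratio:.0%} of the trip cost")
# -> audience grows ~2.5x for ~35% of the trip cost, i.e. roughly a factor of 2
#    in audience for around 1/3 of current presentation costs.
```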

This conclusion is conservative, because a videolecture is almost surely viewed over more than a year, the cost of conference attendance is often higher, and the cost of a presenter's time is not accounted for. Overall, videolecture coverage seems quite worthwhile. Since authors are also typically attendees of a conference, increasing registration fees to cover the cost of videolectures seems reasonable. A videolecture is simply a new publishing format.

We can hope that the price will drop over time, as it’s not clear to me that the 216 euros/hour reflects the real costs of videolectures.net. Some competition of a similar quality would be the surest way to do that. But in the near future, there are two categories of conferences—those that judge the value of their content above 216 euros/hour, and those that do not. Whether or not a conference has videolecture support substantially impacts its desirability as a place to send papers.

NY ML Symposium 2010

About 200 people attended the 2010 NYAS ML Symposium. (It was about 170 last year.) I particularly enjoyed several talks.

  1. Yann has a new live demo of (limited) real-time object recognition learning.
  2. Sanjoy gave a fairly convincing and comprehensible explanation of why a modified form of single-linkage clustering is consistent in higher dimensions, and why consistency is a critical feature for clustering algorithms. I’m curious how well this algorithm works in practice.
  3. Matt Hoffman‘s poster covering online LDA seemed pretty convincing to me as an algorithmic improvement.

This year, we allocated more time towards posters & poster spotlights.

For next year, we are considering some further changes. The format has traditionally been 4 invited Professor speakers, with posters and poster spotlight for students. Demand from other parties to participate is growing, for example from postdocs and startups in the area. Another growing concern is the facility—the location is exceptional, but fitting more people is challenging. Does anyone else have suggestions to improve the event?

New York is a particularly good location for a regional symposium, but it's quite plausible that other places could have one as well. Looking at Meetup groups for evidence of interest, obvious choices are Southern California (San Diego & Los Angeles both have large R meetup groups), Northern California (which has several AI-related Meetup groups), and Sydney, Australia, which has a large data mining meetup group. Relative to meetups, a regional symposium offers a more intense affair with higher participation, and new kinds of participation (for example, via posters). Relative to international conferences, a regional symposium is much easier to attend, and has a high chance of producing future collaborations.