John Langford – Page 33 – Machine Learning (Theory)

8/26/2009

Another 10-year paper in Machine Learning

When I was thinking about the best “10 year paper” for ICML, I also took a look at a few other conferences. Here is one from 10 years ago that interested me:

David McAllester PAC-Bayesian Model Averaging, COLT 1999. 2001 Journal Draft.

Prior to this paper, the only mechanism known for controlling or estimating the necessary sample complexity for learning over continuously parameterized predictors was VC theory and variants, all of which suffered from a basic problem: they were incredibly pessimistic in practice. This meant that only very gross guidance could be provided for learning algorithm design. The PAC-Bayes bound provided an alternative approach to sample complexity bounds which was radically tighter, quantitatively. It also imported and explained many of the motivations for Bayesian learning in a way that learning theory and perhaps optimization people might appreciate. Since this paper came out, there have been a number of moderately successful attempts to drive algorithms directly by the PAC-Bayes bound. We’ve gone from thinking that a bound driven algorithm is completely useless to merely a bit more pessimistic and computationally intense than might be necessary.

The PAC-Bayes bound is related to the “bits-back” argument that Geoff Hinton and Drew van Camp made at COLT 6 years earlier.

What other machine learning or learning theory papers from 10 years ago have had a substantial impact?

8/16/2009

Centmail comments

Centmail is a scheme which makes charity donations have a secondary value, as a stamp for email. When discussed on newscientist, slashdot, and others, some of the comments make the academic review process appear thoughtful :). Some prominent fallacies are:

Costing money fallacy. Some commenters appear to believe the system charges money per email. Instead, the basic idea is that users get an extra benefit from donations to a charity and participation is strictly voluntary. The solution to this fallacy is simply reading the details.
Single solution fallacy. Some commenters seem to think this is proposed as a complete solution to spam, and since not everyone will opt to participate, it won’t work. But a complete solution is not at all necessary or even possible given the flag-day problem. Deployed machine learning systems for fighting spam are great at taking advantage of a partial solution. The solution to this fallacy is learning about machine learning. In the current state of affairs, informed comment about spam fighting without knowing machine learning is difficult to imagine.
Broken crypto fallacy. Some commenters seem to think that stamps can be reused arbitrarily on emails. This ignores the existence of strong hashes. The solution to this fallacy is simply checking the details and possibly learning about cryptographics hashes.

Dan Reeves made a very detailed FAQ trying to address all the failure modes seen in comments, and there is a bit more discussion at messy matters.

My personal opinion is that Centmail is an interesting idea that might work, avoids the failure modes of many other ideas, hasn’t failed yet, and hence is worth trying. It’s a better approach than my earlier thoughts on economic solutions to spam.

8/3/2009

Carbon in Computer Science Research

Al Gore‘s film and gradually more assertive and thorough science has managed to mostly shift the debate on climate change from “Is it happening?” to “What should be done?” In that context, it’s worthwhile to think a bit about what can be done within computer science research.

There are two things we can think about:

Doing Research At a cartoon level, computer science research consists of some combination of commuting to&from work, writing programs, running them on computers, writing papers, and presenting them at conferences. A typical computer has a power usage on the order of 100 Watts, which works out to 2.4 kiloWatt-hours/day. Looking up David MacKay‘s reference on power usage per person, it becomes clear that this is a relatively minor part of the lifestyle, although it could become substantial if many more computers are required. Much larger costs are associated with commuting (which is in common with many people) and attending conferences. Since local commuting is common across many people, and there are known approaches (typically public transportation) for more efficient commuting, I expect researchers can piggyback on improvements in public transportation to reduce commuting costs. In fact, the situation for researchers may be better in general, as the nature of the job may make commuting avoidable, at least on some days.
Presenting at conferences is the remaining problem area, essentially due to travel by airplane to and from a conference. Travel by airplane has an energy cost similar to travel by car over the same distance, but we typically take airplanes for very long distances. Unlike cars, typical airplane usage requires stored energy in a dense form. For example, there are no serious proposals I’m aware of for battery-powered airplanes, because all existing rechargeable batteries have a power density around 1/10th that of hydrocarbon fuel (which makes sense given that about 3/4 of the mass for a hydrocarbon fire is oxygen in the air). This suggests airplane transport may be particularly difficult to adapt towards low or zero carbon usage. The plausible approaches I know involve either using electricity (from where?) to inefficiently crack water for hydrogen, or the biofuel approach where hydrocarbons are made by plants, with neither of these approaches particularly far along in development. If these aren’t developed, it seems we should expect fewer conferences, more regional conferences, Europe with it’s extensive fast train network to be less impacted, and more serious effort towards distributed conferences. For the last, it’s easy to imagine with existing technology having simultaneous regional conferences which are mutually videoconferenced, and we aren’t far from being able to handle a fully interactive videobroadcast amongst an indefinitely large number of participants. As a corollary of fewer conferences, other interactive mechanisms (for example research blogs) seems likely to grow.
Research Topics They keyword for research topics is efficiency, and it is not a trivial concern on a global scale. In computer science, there have been a few algorithms (such as quicksort and hashing) developed which substantially and broadly improved real-world efficiency, but the real driver of efficiency so far is the hardware development, which has phenomenally improved efficiency for several decades.
Many of the efficiency improvements are sure to remain hardware based, but software is becoming an essential component. One basic observation about efficient algorithms is that for problems admitting an efficient parallel solution (counting is a great example), the parallel algorithm is generally more efficient, because energy use is typically superlinear in clock speed. As an extreme example, the human brain which is deeply optimized by evolution for energy efficiency typically runs at at 100Hz or 100KHz.

Although efficiency suggests parallel algorithms, this should not be done blindly. For example, in machine learning the evidence I’ve seen so far suggests that online learning (which is admittedly harder to parallelize) is substantially more efficient than batch style learning, implying that I expect online approaches to be more efficient than map-reduce based machine learning as is typically seen in the Mahout project.

A substantial difficulty with parallel algorithms is the programming itself. In this regard, there is plenty of room for programming language work as well.

7/31/2009

Vowpal Wabbit Open Source Project

Today brings a new release of the Vowpal Wabbit fast online learning software. This time, unlike the previous release, the project itself is going open source, developing via github. For example, the lastest and greatest can be downloaded via:

git clone git://github.com/JohnLangford/vowpal_wabbit.git

If you aren’t familiar with git, it’s a distributed version control system which supports quick and easy branching, as well as reconciliation.

This version of the code is confirmed to compile without complaint on at least some flavors of OSX as well as Linux boxes.

As much of the point of this project is pushing the limits of fast and effective machine learning, let me mention a few datapoints from my experience.

The program can effectively scale up to batch-style training on sparse terafeature (i.e. 10¹² sparse feature) size datasets. The limiting factor is typically i/o.
I started using the the real datasets from the large-scale learning workshop as a convenient benchmark. The largest dataset takes about 10 minutes. (This is using the native features that the organizers intended as a starting point, yet all contestants used. In some cases, that admittedly gives you performance nowhere near to optimal.)
After using this program for awhile, it’s substantially altered my perception of what is a large-scale learning problem. This causes confusion when people brag about computational performance on tiny datasets with only 10⁵ examples

I would also like to emphasize that this is intended as an open source project rather than merely a code drop, as occurred last time. What I think this project has to offer researchers is an infrastructure for implementing fast online algorithms. It’s reasonably straightforward to implant your own tweaked algorithm, automatically gaining the substantial benefits of the surrounding code that deals with file formats, file caching, buffering, etc… In a very real sense, most of the code is this surrounding stuff, which you don’t have to rewrite to benefit from. For people applying machine learning, there is some obvious value in getting very fast feedback in a batch setting, as well as having an algorithm that actually works in a real online setting.

As one example of the ability to reuse the code for other purposes, an effective general purpose online implementation of the Offset Tree is included. I haven’t seen any other implementation of an algorithm for learning in the agnostic partial label setting, so this code may be of substantial interest for people encountering these sorts of problems.

The difference between this version and the previous is a nearly total rewrite. Some bigger changes are:

We dropped SEG for now, because of code complexity reasons.
Multicore parallelization proceeds in a different fashion—parallelization over features instead of examples. This works better with caches. Note that all parallelization of the core algorithm is meaningless unless you use the -q flag, because otherwise you are i/o bound.
The code is more deeply threaded, with a separate thread for parsing.
There is support for several different loss functions, and it’s easy to add your own.

I’m interested in any bug reports or suggestions for the code. I have substantial confidence that this code can do interesting and useful things, but improving it is a constant and ongoing process.

7/11/2009

Interesting papers at KDD

I attended KDD this year. The conference has always had a strong grounding in what works based on the KDDcup, but it has developed a halo of workshops on various subjects. It seems that KDD has become a place where the economy meets machine learning in a stronger sense than many other conferences.

There were several papers that other people might like to take a look at.

Yehuda Koren Collaborative Filtering with Temporal Dynamics. This paper describes how to incorporate temporal dynamics into a couple of collaborative filtering approaches. This was also a best paper award.
D. Sculley, Robert Malkin, Sugato Basu, Roberto J. Bayardo, Predicting Bounce Rates in Sponsored Search Advertisements. The basic claim of this paper is that the probability people immediately leave (“bounce”) after clicking on an advertisement is predictable.
Frank McSherry and Ilya Mironov Differentially Private Recommender Systems: Building Privacy into the Netflix Prize Contenders. The basic claim here is that it is possible to beat the baseline system in Netflix and preserve a nontrivial amount of user privacy. It’s the first demonstration I’ve seen of this sort, and it’s particularly impressive they used a strong algorithm-independent definition of privacy which Cynthia Dwork first stated.

KDD also experimented this year with crowdvine which was interesting. Compared to Mark Reid‘s efforts with ICML, they managed to get substantially more participation. There seemed to be two reasons: the conference organizers more deeply integrated and encouraged the use of crowdvine, and crowdvine has certain handy additional uses—you can create your own personal schedule for instance, which incidentally provides some vague global notion of the popularity of various papers. The biggest drawback I found was that the papers themselves were not integrated into the website.