Machine Learning (Theory)

2/22/2007

Create Your Own ICML Workshop

As usual, ICML 2007 will be hosting a workshop program, to be held this year on June 24th. The success of the program depends on having researchers like you propose interesting workshop topics and then organize the workshops. I’d like to encourage all of you to consider sending a workshop proposal. The proposal deadline has been extended to March 5. See the workshop website for details.

Organizing a workshop is a unique way to gather an international group of researchers together to focus for an entire day on a topic of your choosing. I’ve always found the cost of organizing a workshop to be modest, and very low compared to the benefits. The topic and format of a workshop are limited only by your imagination (and the attractiveness to potential participants) and need not follow the usual model of a mini-conference on a particular ML sub-area. Hope to see some interesting proposals rolling in.

2/16/2007

The Forgetting

How many papers do you remember from 2006? 2005? 2002? 1997? 1987? 1967? One way to judge this would be to look at the citations of the papers you write—how many came from which year? For myself, the answers on recent papers are:

Year   2006  2005  2002  1997  1987  1967
Count     4    10     5     1     0     0
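
For the curious, a count like this is easy to automate. Here is a minimal sketch, assuming your citations live in a BibTeX file named refs.bib (a hypothetical name) with standard year fields:

    # Count how many cited papers come from each year in a BibTeX file.
    # 'refs.bib' is a hypothetical filename; entries are assumed to carry a
    # standard year field such as year = {2006} or year = "2006".
    import re
    from collections import Counter

    with open("refs.bib") as f:
        bib = f.read()

    years = re.findall(r'year\s*=\s*["{]?(\d{4})', bib)
    counts = Counter(int(y) for y in years)
    for year in sorted(counts, reverse=True):
        print(f"{year}: {counts[year]}")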

This spectrum is fairly typical of papers in general. There are many reasons that citations are focused on recent papers.

  1. The number of papers being published continues to grow. This is not a very significant effect, because the rate of publication has not grown nearly as fast as the drop-off in citations.
  2. Dead men don’t reject your papers for not citing them. This reason seems lame, because it’s a distortion from the ideal of science. Nevertheless, it must be stated because the effect can be significant.
  3. In 1997, I started as a PhD student. Naturally, papers after 1997 are better remembered because they were absorbed in real time. A large fraction of people writing papers and attending conferences haven’t been doing it for 10 years.
  4. Old papers aren’t on the internet. This is a huge effect for any papers prior to 1995 (or so). The ease of examining a paper greatly influences a reader’s ability to find and understand it. There are a number of journals which essentially have “internet access for the privileged elite who are willing to pay”. In my experience, this is only marginally better than having the papers stuck in the library.
  5. The recent past is more relevant to the present than the far past. There is a lot of truth in this—people discover and promote various problems or techniques which take off for awhile, until their turn to be forgotten arrives.

Should we be disturbed by this forgetting? There are a few good effects. For example, when people forget, they reinvent, and sometimes they reinvent better. Nevertheless, it seems like the effect of forgetting is bad overall, because it causes wasted effort. There are two implications:

  1. For paper writers, it is very common to overestimate the value of a paper, even though we know that the impact of most papers is bounded in time. Perhaps by looking at those older papers, we can get an idea of what is important in the long term. For example, looking at my own older citations, simplicity is it. If you want a paper to have a long term impact, it needs to have a simple algorithm, analysis method, or setting. Fundamentally, only those things which are teachable survive. Was your last paper simple? Could you teach it in a class? Are other people going to start doing so? Are the review criteria promoting the papers which have a hope of survival?
  2. For conference organizers, it’s important to understand the way science has changed. Originally, you had to be a giant to succeed at science. Then, you merely had to stand on the shoulders of giants to succeed. Now, it seems that even the ability to peer over the shoulders of people standing on the shoulders of giants might be helpful. This is generally a good thing, because it means more people can help on a very hard task. Nevertheless, it seems that much of this effort is getting wasted in forgetting, because we do not have the right mechanisms to remember the information. Which is going to be the first conference to switch away from an ordered list of papers to something with structure? Wouldn’t it be great if all the content at a conference was organized in a wikipedia-like easy-for-outsiders-to-understand style?

2/11/2007

24

To commemorate the Twenty Fourth Annual International Conference on Machine Learning (ICML-07), the FOX Network has decided to launch a new spin-off series in prime time. Through unofficial sources, I have obtained the story arc for the first season, which appears frighteningly realistic.

2/10/2007

Best Practices for Collaboration

Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote people, can be tricky. Here are some observations of what has worked for me on collaborations involving a few people.

  1. Travel and Discuss. Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet.
  2. Enable your collaborator. A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think.
    1. Avoid Duplication. Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this is to be explicit about who has write permission to what. Most of the time, a write lock is held for the entire document, just to be sure.
    2. Don’t keep the write lock unnecessarily. Some people are perfectionists, so they have a real problem giving up the write lock on a draft until it is perfect. This prevents other collaborators from doing things. Releasing the write lock when you sleep (at least) is a good idea.
    3. Send all necessary materials. Some people try to save space or bandwidth by not passing ‘.bib’ files or other auxiliary components. Forcing your collaborator to deal with the missing-subdocument problem is disabling. Space and bandwidth are cheap while your collaborator’s time is precious. (Sending may be pass-by-reference rather than attach-to-message in most cases.)
    4. Version Control. This doesn’t mean “use version control software”, although that’s fine. Instead, it means: have a version number for drafts passed back and forth. This means you can talk about “draft 3” rather than “the draft that was passed last Tuesday”. Coupled with “send all necessary materials”, this implies that you naturally back up previous work (a minimal sketch of this habit appears after the list).
  3. Be Generous. It’s common for people to feel insecure about what they have done or how much “credit” they should get.
    1. Coauthor standing. When deciding who should have a chance to be a coauthor, the rule should be “anyone who has helped produce a result conditioned on previous work”. “Helped produce” is often interpreted too narrowly: a theoretician should be generous about crediting experimental results and vice versa. Potential coauthors may decline (and senior ones often do so). Control over who is a coauthor is best (and most naturally) exercised by the choice of who you talk to.
    2. Author ordering. Author ordering is the wrong thing to worry about, so don’t. The CS theory community has a substantial advantage here because they default to alpha-by-author ordering, as is understood by everyone.
    3. Who presents. A good default for presentations at a conference is “student presents” (or suitable generalizations). This gives young people a real chance to get involved and learn how things are done. Senior collaborators already have plentiful alternative methods to present research at workshops or invited talks.
  4. Communicate by default. Not cc’ing a collaborator is a bad idea. Even if you have a very specific question for one collaborator and not another, it’s a good idea to cc everyone. In the worst case, this is a few-second annoyance for the other collaborator. In the best case, the exchange answers unasked questions. This also prevents the “conversation shifts into subjects interesting to everyone, but oops! you weren’t cc’ed” problem.
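
To make the “version number” and “send all necessary materials” habits concrete, here is a minimal sketch of bundling a numbered draft; the filenames and directory layout are hypothetical, not a description of anyone’s actual setup:

    # Bundle a numbered draft with all auxiliary materials (.tex, .bib,
    # figures) so a collaborator never hits the missing-subdocument problem.
    # Filenames and layout here are hypothetical.
    import glob
    import tarfile

    def bundle_draft(version: int) -> str:
        """Pack the draft sources into draft-<version>.tar.gz."""
        name = f"draft-{version}.tar.gz"
        with tarfile.open(name, "w:gz") as tar:
            for path in glob.glob("*.tex") + glob.glob("*.bib") + glob.glob("figures/*"):
                tar.add(path)
        return name

    print(bundle_draft(3))  # "draft 3": easy to refer to, and a natural backup

The particular tool doesn’t matter: any scheme where drafts get monotone version numbers and each version is complete serves the same purpose.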

These practices are imperfectly followed even by me, but they are a good ideal to strive for.

2/2/2007

Thoughts regarding “Is machine learning different from statistics?”

Given John’s recent posts on CMU’s new machine learning department and “Deep Learning,” I asked for an opportunity to give a computational learning theory perspective on these issues.

To my mind, the answer to the question “Are the core problems of machine learning different from the core problems of statistics?” is a clear Yes. The point of this post is to describe a core problem in machine learning that is computational in nature and will appeal to statistical learning folk (as an extreme example, note that if P=NP, which for all we know is true, then we would suddenly find almost all of our favorite machine learning problems considerably more tractable).

If the central question of statistical learning theory were crudely summarized as “given a hypothesis with a certain loss bound over a test set, how well will it generalize?” then the central question of computational learning theory might be “how can we find such a hypothesis efficiently (e.g., in polynomial time)?”

With this in mind, let’s spend the rest of the post thinking about agnostic learning. The following definition is due to Kearns, Schapire, and Sellie:

Fix a function class C. Given a training set drawn from an arbitrary distribution D and *arbitrarily labeled* (that is, we are given samples from an arbitrary distribution D over some set X x Y), efficiently output a hypothesis that is competitive with the best function in C with respect to D. If there exists a provably efficient solution for the above problem, we say C is agnostically learnable with respect to D. Notice the difference between agnostic and PAC learning: we *do not assume* that the data is labeled according to a fixed function (I find this to be a common criticism of the PAC model).

To make this a little more concrete, let C be the class of halfspaces, let X = R^n, and let Y = {-1,1}. Let opt = min_{h in C} (classification error of h with respect to D). Then the goal is, given an *arbitrarily* labeled data set of points in R^n, to find a hypothesis whose true error with respect to D is at most opt + eps. Furthermore, the solution should provably run in time polynomial in n and 1/eps. Note that the hypothesis we output *need not be a halfspace*; we only require that its error rate is close to the error rate of the optimal halfspace.
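
In symbols (my notation, just restating the paragraph above):

    % Agnostic learning of halfspaces, restating the text above.
    % C = halfspaces over R^n, D an arbitrary distribution on R^n x {-1,1}.
    \mathrm{err}_D(h) = \Pr_{(x,y) \sim D}\left[\, h(x) \neq y \,\right], \qquad
    \mathrm{opt} = \min_{h \in C} \mathrm{err}_D(h),
    % and the algorithm must output, in time poly(n, 1/\epsilon), a hypothesis
    % h' (not necessarily a halfspace) satisfying
    \mathrm{err}_D(h') \le \mathrm{opt} + \epsilon.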

Why choose C to be halfspaces? At the heart of many of our most popular learning tools, such as Support Vector Machines, is an algorithm for learning a halfspace over a set of features in many dimensions. Perhaps surprisingly, *it is still not known*, given samples from an arbitrary distribution on X x Y, how to efficiently ‘find the best fitting halfspace’ over a set of features. In fact, if we require our algorithm to output a halfspace as its hypothesis then the agnostic halfspace learning problem is NP-hard. Recent results have shown that even in the case where there is a halfspace with error rate .0001 with respect to D it is NP-hard to output a halfspace with error rate .49999 (achieving .5 is of course trivial).

The astute reader may comment here that I am ignoring the whole point of the ‘kernel trick,’ which is to provide an implicit representation of a halfspace over some feature set in many dimensions, thereby bypassing the above hardness result. While this is true, I am still not aware of any results in the statistical learning literature that give provably efficient algorithms for agnostically learning halfspaces (regardless of the choice of kernel).
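
To make the distinction concrete: the SVM is efficient because it minimizes a convex surrogate (the hinge loss) rather than the 0-1 classification loss that the agnostic definition measures. Here is a toy sketch of that surrogate minimization, with made-up data and not drawn from any particular paper:

    # The SVM objective: minimize the convex hinge loss (plus regularization)
    # as a surrogate for the 0-1 loss, here via plain subgradient descent.
    # Toy data; a sketch only, not any paper's algorithm.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                        # 200 points in R^5
    y = np.sign(X[:, 0] + 0.3 * rng.normal(size=200))    # noisily labeled by a halfspace

    w = np.zeros(5)
    for t in range(1, 1001):
        hit = y * (X @ w) < 1                            # points violating the margin
        grad = -(y[hit][:, None] * X[hit]).sum(axis=0) / len(X) + 0.01 * w
        w -= grad / np.sqrt(t)                           # standard 1/sqrt(t) step size

    print("0-1 error:", np.mean(np.sign(X @ w) != y))

Minimizing the hinge loss is a convex problem and hence tractable, but a hypothesis with small hinge loss need not come within eps of opt on the 0-1 loss; that gap is exactly where the hardness results above live.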

My larger point is that a fundamental approach from statistical learning theory (running an SVM) is deeply connected to strong negative results from computational complexity theory. Still, as I explain in the next paragraph, finding the ‘best fitting halfspace’ remains an open and important computational problem.

Negative results notwithstanding, it now seems natural to ask ‘well, can we find any efficient algorithms for agnostically learning halfspaces if we allow ourselves to output hypotheses that are not halfspaces?’ Briefly, the answer is ‘to some extent, Yes,’ if we restrict the marginal distribution on X to be one of our favorite nice distributions (such as the Gaussian over R^n). The solution runs in polynomial time for any constant eps > 0 (for various reasons involving the ‘noisy XOR’ topic three posts ago, obtaining efficient algorithms for subconstant eps seems intractable).
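
As a caricature of what such an algorithm looks like (my hedged reconstruction, not a quote from any paper; I believe the result alluded to is the polynomial-regression approach of Kalai, Klivans, Mansour, and Servedio, which fits a low-degree polynomial under L1 loss and thresholds it):

    # A caricature of outputting a non-halfspace hypothesis: expand the data
    # into low-degree monomial features, regress against the +/-1 labels, and
    # take the sign of the fitted polynomial. The degree is a hypothetical
    # choice, and ordinary least squares stands in for the L1 regression the
    # actual analysis uses.
    import numpy as np
    from itertools import combinations_with_replacement

    def poly_features(X, degree):
        """All monomials of the coordinates of X up to the given degree."""
        cols = [np.ones(len(X))]
        for d in range(1, degree + 1):
            for idx in combinations_with_replacement(range(X.shape[1]), d):
                cols.append(np.prod(X[:, list(idx)], axis=1))
        return np.column_stack(cols)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 3))                  # Gaussian marginal on R^3
    y = np.sign(X @ np.array([1.0, -2.0, 0.5]))    # labels from a halfspace...
    y[rng.random(500) < 0.1] *= -1                 # ...with 10% of labels flipped

    Phi = poly_features(X, degree=3)               # hypothetical degree choice
    coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    h = np.sign(Phi @ coef)                        # hypothesis: sign of a polynomial
    print("0-1 error vs. observed labels:", np.mean(h != y))

The distributional assumption is doing real work here: it is what makes low-degree polynomials a good enough approximation to the optimal halfspace.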

The complexity of agnostically learning halfspaces with respect to arbitrary distributions remains open. A solution (if one exists) will require a powerful new algorithmic idea and will have important consequences regarding our most commonly used machine learning tools.

To respond to Andrej Bauer’s comment three posts ago:

‘I guess I have a more general question. I understand that for the purposes of proving lower bounds in computational complexity it’s reasonable to show theorems like “can’t learn XOR with NOT, AND, OR.” But I always thought that machine learning was more “positive”, i.e., you really want to learn whenever possible. So, instead of despairing one should try to use a different bag of tricks.’

I think Andrej is right. Solving future machine learning problems will require a more sophisticated ‘bag of tricks’ than the one we have now, as most of our algorithms are based on learning halfspaces (in a noisy or agnostic setting). Computational learning theory gives us a methodology for how to proceed: find provably efficient solutions to unresolved computational problems; novel learning algorithms will no doubt follow.
