Best Practices for Collaboration

Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote collaborators, can be tricky. Here are some observations of what has worked for me on collaborations involving a few people.

  1. Travel and Discuss. Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet.
  2. Enable your collaborator. A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think.
    1. Avoid Duplication. Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this is to be explicit about who has write permission to what. Most of the time, a write lock is held for the entire document, just to be sure.
    2. Don’t keep the write lock unnecessarily. Some people are perfectionists, so they have a real problem giving up the write lock on a draft until it is perfect. This prevents other collaborators from doing things. Releasing the write lock (at least) when you sleep is a good idea.
    3. Send all necessary materials. Some people try to save space or bandwidth by not passing ‘.bib’ files or other auxiliary components. Forcing your collaborator to deal with the missing subdocument problem is disabling. Space and bandwidth are cheap while your collaborator’s time is precious. (Sending may be pass-by-reference rather than attach-to-message in most cases.)
    4. Version Control. This doesn’t mean “use version control software”, although that’s fine. Instead, it means: have a version number for drafts passed back and forth. This means you can talk about “draft 3” rather than “the draft that was passed last Tuesday”. Coupled with “send all necessary materials”, this implies that you naturally back up previous work.
  3. Be Generous. It’s common for people to feel insecure about what they have done or how much “credit” they should get.
    1. Coauthor standing. When deciding who should have a chance to be a coauthor, the rule should be “anyone who has helped produce a result conditioned on previous work”. “Helped produce” is often interpreted too narrowly—a theoretician should be generous about crediting experimental results and vice-versa. Potential coauthors may decline (and senior ones often do so). Control over who is a coauthor is best (and most naturally) exercised by the choice of who you talk to.
    2. Author ordering. Author ordering is the wrong thing to worry about, so don’t. The CS theory community has a substantial advantage here because they default to alphabetical-by-author ordering, which everyone understands.
    3. Who presents. A good default for presentations at a conference is “student presents” (or suitable generalizations). This gives young people a real chance to get involved and learn how things are done. Senior collaborators already have plentiful alternative methods to present research at workshops or invited talks.
  4. Communicate by default. Not cc’ing a collaborator is a bad idea. Even if you have a very specific question for one collaborator and not another, it’s a good idea to cc everyone. In the worst case, this is a few-second annoyance for the other collaborator. In the best case, the exchange answers unasked questions. This also prevents the “conversation shifts into subjects interesting to everyone, but oops! you weren’t cc’d” problem.

These practices are imperfectly followed even by me, but they are a good ideal to strive for.

Thoughts regarding “Is machine learning different from statistics?”

Given John’s recent posts on CMU’s new machine learning department and “Deep Learning,” I asked for an opportunity to give a computational learning theory perspective on these issues.

To my mind, the answer to the question “Are the core problems of machine learning different from the core problems of statistics?” is a clear Yes. The point of this post is to describe a core problem in machine learning that is computational in nature and will appeal to statistical learning folk (as an extreme example, note that if P=NP, which for all we know is true, then we would suddenly find almost all of our favorite machine learning problems considerably more tractable).

If the central question of statistical learning theory were crudely summarized as “given a hypothesis with a certain loss bound over a training set, how well will it generalize?” then the central question of computational learning theory might be “how can we find such a hypothesis efficiently (e.g., in polynomial time)?”

With this in mind, let’s spend the rest of the post thinking about agnostic learning. The following definition is due to Kearns, Schapire, and Sellie:

Fix a function class C. Given a training set drawn from an arbitrary distribution D and *arbitrarily labeled* (that is, we are given samples from an arbitrary distribution D over some set X x Y), efficiently output a hypothesis that is competitive with the best function in C with respect to D. If there exists a provably efficient solution for the above problem, we say C is agnostically learnable with respect to D. Notice the difference between agnostic and PAC learning: we *do not assume* that the data is labeled according to a fixed function (I find this assumption to be a common criticism of the PAC model).

To make this a little more concrete, let C be the class of halfspaces, let X = R^n, and let Y = {-1,1}. Let opt = min_{h in C} (classification error of h with respect to D). Then the goal is, given an *arbitrarily* labeled data set of points in R^n, find a hypothesis whose true error with respect to D is at most opt + eps. Furthermore, the solution should provably run in time polynomial in n and 1/eps. Note that the hypothesis we output *need not be a halfspace*; we only require that its error rate is close to the error rate of the optimal halfspace.
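
In symbols (just a restatement of the goal above, writing err_D for true classification error under D):

```latex
% Agnostic learning of halfspaces, restated in symbols.
% err_D(h) is the true classification error of hypothesis h under D.
\[
\mathrm{err}_D(h) = \Pr_{(x,y) \sim D}\bigl[h(x) \neq y\bigr],
\qquad
\mathrm{opt} = \min_{h \in C} \mathrm{err}_D(h).
\]
% Given samples from D, output (in time polynomial in n and 1/eps)
% any hypothesis h', not necessarily a halfspace, with
\[
\mathrm{err}_D(h') \le \mathrm{opt} + \epsilon.
\]
```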

Why choose C to be halfspaces? At the heart of many of our most popular learning tools, such as Support Vector Machines, is an algorithm for learning a halfspace over a set of features in many dimensions. Perhaps surprisingly, *it is still not known*, given samples from an arbitrary distribution on X x Y, how to efficiently ‘find the best fitting halfspace’ over a set of features. In fact, if we require our algorithm to output a halfspace as its hypothesis then the agnostic halfspace learning problem is NP-hard. Recent results have shown that even in the case where there is a halfspace with error rate .0001 with respect to D it is NP-hard to output a halfspace with error rate .49999 (achieving .5 is of course trivial).
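
To see where the practical tools sit relative to this hardness: a soft-margin linear SVM fits a halfspace by minimizing a convex surrogate (the hinge loss) rather than the 0/1 classification error above. That optimization runs in polynomial time, but it carries no opt + eps guarantee on 0/1 error, so it does not contradict the hardness results. Here is a minimal sketch of that surrogate view (plain subgradient steps for illustration, not any particular SVM package’s implementation):

```python
import numpy as np

def fit_halfspace_hinge(X, y, reg=0.01, lr=0.1, epochs=100):
    """Fit a halfspace sign(w.x + b) by subgradient descent on the
    regularized hinge loss (the convex surrogate a soft-margin linear SVM
    minimizes), *not* on the 0/1 error the hardness results concern."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1  # points inside the margin or misclassified
        # subgradient of mean hinge loss + (reg/2)*||w||^2
        grad_w = reg * w - (y[active][:, None] * X[active]).sum(axis=0) / n_samples
        grad_b = -y[active].sum() / n_samples
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy usage: the labels are *not* assumed to come from any halfspace (agnostic setting)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=200))
y[y == 0] = 1
w, b = fit_halfspace_hinge(X, y)
print("training 0/1 error:", np.mean(np.sign(X @ w + b) != y))
```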

The astute reader here may comment that I am ignoring the whole point of the ‘kernel trick,’ which is to provide an implicit representation of a halfspace over some feature set in many dimensions, thereby bypassing the above hardness result. While this is true, I am still not aware of any results in the statistical learning literature that give provably efficient algorithms for agnostically learning halfspaces (regardless of the choice of kernel).

My larger point is that a fundamental approach from statistical learning theory (running an SVM) is deeply connected to strong negative results from computational complexity theory. Still, as I explain in the next paragraph, finding the ‘best fitting halfspace’ remains an open and important computational problem.

Negative results notwithstanding, it now seems natural to ask ‘well, can we find any efficient algorithms for agnostically learning halfspaces if we allow ourselves to output hypotheses that are not halfspaces?’ Briefly, the answer is ‘to some extent Yes,’ if we restrict the marginal distribution on X to be one of our favorite nice distributions (such as the Gaussian over R^n). The solution runs in polynomial time for any constant eps > 0 (for various reasons involving the ‘noisy XOR’ topic three posts ago, obtaining efficient algorithms for subconstant eps seems intractable).
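
To give a flavor of how a non-halfspace hypothesis can help, here is a minimal sketch in the spirit of low-degree polynomial regression (an approach I am assuming for illustration; the result alluded to above may use a different or more refined algorithm, e.g. a different loss): expand the coordinates into all monomials up to a constant degree, regress the labels onto the expanded features, and threshold the result. The output hypothesis is a polynomial threshold function rather than a halfspace:

```python
import numpy as np
from itertools import combinations_with_replacement

def expand(X, degree):
    """All monomials of the input coordinates up to the given degree,
    including the constant term."""
    n = X.shape[1]
    cols = [np.ones(X.shape[0])]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(n), d):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

def fit_poly_threshold(X, y, degree=2):
    """Least-squares regression on the expanded features, thresholded at 0.
    The returned hypothesis is a polynomial threshold function, not a halfspace."""
    Z = expand(X, degree)
    coeffs, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return lambda Xnew: np.sign(expand(Xnew, degree) @ coeffs)

# toy usage: labels that no halfspace fits well, but a degree-2 threshold can
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                       # Gaussian marginal on R^3
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=500))
y[y == 0] = 1
h = fit_poly_threshold(X, y, degree=2)
print("training 0/1 error:", np.mean(h(X) != y))
```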

The complexity of agnostically learning halfspaces with respect to arbitrary distributions remains open. A solution (if one exists) will require a powerful new algorithmic idea and will have important consequences regarding our most commonly used machine learning tools.

To respond to Andrej Bauer’s comment three posts ago:

‘I guess I have a more general question. I understand that for the purposes of proving lower bounds in computational complexity it’s reasonable to show theorems like “can’t learn XOR with NOT, AND, OR.” But I always thought that machine learning was more “positive”, i.e., you really want to learn whenever possible. So, instead of despairing one should try to use a different bag of tricks.’

I think Andrej is right. Solving future machine learning problems will require a more sophisticated ‘bag of tricks’ than the one we have now, as most of our algorithms are based on learning halfspaces (in a noisy or agnostic setting). Computational learning theory gives us a methodology for how to proceed: find provably efficient solutions to unresolved computational problems; novel learning algorithms will no doubt follow.

Parallel Machine Learning Problems

Parallel machine learning is a subject rarely addressed at machine learning conferences. Nevertheless, it seems likely to increase in importance because:

  1. Data set sizes appear to be growing substantially faster than computation. Essentially, this happens because more and more sensors of various sorts are being hooked up to the internet.
  2. Serial speedups of processors appear to have relatively stalled. The new trend is to make processors more powerful by making them multicore.
    1. Both AMD and Intel are making dual core designs standard, with plans for more parallelism in the future.
    2. IBM’s Cell processor has (essentially) 9 cores.
    3. Modern graphics chips can have an order of magnitude more separate execution units.

    The meaning of ‘core’ varies a bit from processor to processor, but the overall trend seems quite clear.

So, how do we parallelize machine learning algorithms?

  1. The simplest and most common technique is to simply run the same learning algorithm with different parameters on different processors. Cluster management software like OpenMosix, Condor, or Torque is helpful here. This approach doesn’t speed up any individual run of a learning algorithm.
  2. The next simplest technique is to decompose a learning algorithm into an adaptive sequence of statistical queries and parallelize the queries over the sample. This paper (updated from the term paper according to a comment) points out that statistical integration can be implemented via MapReduce, which Google popularized (the open source version is Hadoop). The general technique of parallel statistical integration is already used by many people, including IBM’s Parallel Machine Learning Toolbox. The disadvantage of this approach is that it mainly helps algorithms that are slow to begin with. One example of a fast sequential algorithm is the perceptron. The perceptron works on a per-example basis, making individual updates extremely fast. It is explicitly not a statistical query algorithm. (A minimal sketch of this data-parallel summation pattern appears after this list.)
  3. The most difficult method for speeding up an algorithm is fine-grained structural parallelism. The brain works in this way: each individual neuron operates on its own. The reason why this is difficult is that the programming is particularly tricky: you must carefully optimize to avoid latency in network communication. The full complexity of parallel programming is exposed.
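
As a rough illustration of option (2), here is a minimal sketch (hypothetical code, not taken from any of the toolkits mentioned above) of the statistical-summation pattern: each worker computes a statistic over its own shard of the sample, in this case a logistic regression gradient, and the per-shard results are summed, roughly the map-then-reduce structure that MapReduce or Hadoop would provide on a real cluster:

```python
import numpy as np
from multiprocessing import Pool

def shard_gradient(args):
    """Map step: the logistic loss gradient summed over one shard of the data."""
    w, X, y = args
    preds = 1.0 / (1.0 + np.exp(-(X @ w)))  # sigmoid predictions
    return X.T @ (preds - y)                # gradient contribution of this shard

def parallel_gradient_step(w, shards, lr=0.1):
    """Map each shard to its gradient, reduce by summing, take one descent step."""
    with Pool(processes=len(shards)) as pool:
        grads = pool.map(shard_gradient, [(w, X, y) for X, y in shards])
    total = sum(grads)
    n = sum(X.shape[0] for X, _ in shards)
    return w - lr * total / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4000, 10))
    y = (X[:, 0] + 0.5 * rng.normal(size=4000) > 0).astype(float)
    # split the sample into 4 shards, one per worker
    shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))
    w = np.zeros(10)
    for _ in range(50):
        w = parallel_gradient_step(w, shards)
    print("training accuracy:", np.mean(((X @ w) > 0) == (y > 0)))
```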

A basic question is: are there other approaches to speeding up learning algorithms which don’t incur the full complexity of option (3)? One approach is discussed here.

The Machine Learning Department

Carnegie Mellon School of Computer Science has the first academic Machine Learning department. This department already existed as the Center for Automated Learning and Discovery, but recently changed its name.

The reason for changing the name is obvious: very few people think of themselves as “Automated Learners and Discoverers”, but there are a number of people who think of themselves as “Machine Learners”. Machine learning is both more succinct and more recognizable, good properties for a name.

A more interesting question is “Should there be a Machine Learning Department?”. Tom Mitchell has a relevant whitepaper claiming that machine learning is answering a different question than other fields or departments. The fundamental debate here is “Is machine learning different from statistics?”

At a cultural level, there is no real debate: they are different. Machine learning is characterized by several very active, large, peer-reviewed conferences, operating in a computer science mode. Statistics tends to function with a greater emphasis on journals and a lesser emphasis on conferences, which often implies a much longer publishing cycle.

In terms of the basic questions driving the field, the answer seems less clear. It is true that the core problems of statistics in the past have typically differed from the core problems of machine learning today. Yet there has been substantial overlap, and there are a number of statisticians nowadays who are actively doing machine learning. It’s reasonably plausible that in the long term statistics departments will adopt the core problems of machine learning, removing the reasons for a separate machine learning department.

The parallel question for computer science comes up less often perhaps because computer science is a notoriously broad field.

The practical implication of a new department is the ability to create a more specific curriculum, admit more specific students, and hire faculty based upon more specific interests. Compared to a computer science program, classes on programming languages, computer architecture, or graphics might be dropped in favor of classes on learning theory, statistics, etc. Compared to a statistics program, classes on advanced parameter estimation and measure theory might be dropped in favor of algorithms and programming experience.

An alternative solution like “learn everything from computer science and statistics” is personally appealing to me, and I have benefitted from and recommend a broad education. However, this is not practical for everyone. In my experience, a machine learning skill set is an effective specialization with which people can do important things in the world. Given this, having a department with a machine learning centered curriculum seems like a good idea. At Carnegie Mellon, this is the Machine Learning department. In the future and elsewhere it may have a different name, but the value of the machine learning skill set should grow with research, improving computers, and improving data sources.