Machine Learning (Theory) – Page 93 – Machine learning and learning theory research

7/1/20057/1/2005

The Role of Impromptu Talks

COLT had an impromptu session which seemed as interesting or more interesting than any other single technical session (despite being only an hour long). There are several roles that an impromptu session can play including:

Announcing new work since the paper deadline. Letting this happen now rather than later helps aid the process of research.
Discussing a paper that was rejected. Reviewers err sometimes and an impromptu session provides a means to remedy that.
Entertainment. We all like to have a bit of fun.

For design, the following seem important:

Impromptu speakers should not have much time. At COLT, it was 8 minutes, but I have seen even 5 work well.
The entire impromptu session should not last too long because the format is dense and promotes restlessness. A half hour or hour can work well.

Impromptu talks are a mechanism to let a little bit of chaos into the schedule. They will be chaotic in content, presentation, and usefulness. The fundamental advantage of this chaos is that it provides a means for covering material that the planned program did not (or could not). This seems like a “bargain use of time” considering the short duration. One caveat is that it is unclear how well this mechanism can scale to large conferences.

6/29/20056/29/2005

Not EM for clustering at COLT

One standard approach for clustering data with a set of gaussians is using EM. Roughly speaking, you pick a set of k random guassians and then use alternating expectation maximization to (hopefully) find a set of guassians that “explain” the data well. This process is difficult to work with because EM can become “stuck” in local optima. There are various hacks like “rerun with t different random starting points”.

One cool observation is that this can often be solved via other algorithm which do not suffer from local optima. This is an early paper which shows this. Ravi Kannan presented a new paper showing this is possible in a much more adaptive setting.

A very rough summary of these papers is that by projecting into a lower dimensional space, it is computationally tractable to pick out the gross structure of the data. It is unclear how well these algorithms work in practice, but they might be effective, especially if used as a subroutine of the form:

Project to low dimensional space.
Pick out gross structure.
Project gross structure into the high dimensional space.
Run EM (or some other local improvement algorithm) to find a final fit.

The effects of steps 1-3 is to “seed” the local optimization algorithm in a good place from which a global optima is plausibly reachable.

6/28/20056/29/2005

A COLT paper

I found Tong Zhang’s paper on Data Dependent Concentration Bounds for Sequential Prediction Algorithms interesting. Roughly speaking, it states a tight bound on the future error rate for online learning algorithms assuming that samples are drawn independently. This bound is easily computed and will make the progressive validation approaches used here significantly more practical.

6/28/20056/29/2005

The cross validation problem: cash reward

I just presented the cross validation problem at COLT.

The problem now has a cash prize (up to $500) associated with it—see the presentation for details.

The write-up for colt.

6/22/20056/22/2005

Languages of Learning

A language is a set of primitives which can be combined to succesfully create complex objects. Languages arise in all sorts of situations: mechanical construction, martial arts, communication, etc… Languages appear to be the key to succesfully creating complex objects—it is difficult to come up with any convincing example of a complex object which is not built using some language. Since languages are so crucial to success, it is interesting to organize various machine learning research programs by language.

The most common language in machine learning are languages for representing the solution to machine learning. This includes:

Bayes Nets and Graphical Models A language for representing probability distributions. The key concept supporting modularity is conditional independence. Michael Kearns has been working on extending this to game theory.
Kernelized Linear Classifiers A language for representing linear separators, possibly in a large space. The key form of modularity here is kernelization.
Neural Networks A language for representing and learning functions. The key concept supporting modularity is backpropagation. (Yann LeCun gave some very impressive demos at the Chicago MLSS.)
Decision Trees Another language for representing and learning functions. The key concept supporting modularity is partitioning the input space.

Many other learning algorithms can be seen as falling into one of the above families.

In addition there are languages related to various aspects of learning.

Reductions A language for translating between varying real-world losses and core learning algorithm optimizations.
Feature Languages Exactly how features are specified varies from on learning algorithm to another. Several people have been working on languages for features that cope with sparsity or the cross-product nature of databases.
Data interaction languages The statistical query model of learning algorithms provides a standardized interface between data and learning algorithm.

These lists surely miss some languages—feel free to point them out below.

With respect to research “interesting” language-related questions include:

For what aspects of learning is a language missing? Anytime adhocery is encountered, this suggests that there is room for a language. Finding what is not there is both hard and valuable.
Are any of these languages fundamentally flawed or fundamentally advantageous with respect to another language?
What are the most easy to use and effective primitives for these languages?