June 2009 – Machine Learning (Theory)

6/26/2009

Netflix nearly done

A $1M qualifying result was achieved on the public Netflix test set by a 3-way ensemble team. This is just in time for Yehuda‘s presentation at KDD, which I’m sure will be one of the best attended ever.

This isn’t quite over—there are a few days for another super-conglomerate team to come together and there is some small chance that the performance is nonrepresentative of the final test set, but I expect not.

Regardless of the final outcome, the biggest lesson for ML from the Netflix contest has been the formidable performance edge of ensemble methods.

6/24/2009

Interesting papers at UAICMOLT 2009

Here’s a list of papers that I found interesting at ICML/COLT/UAI in 2009.

Elad Hazan and Comandur Seshadhri Efficient learning algorithms for changing environments at ICML. This paper shows how to adapt learning algorithms that compete with fixed predictors to compete with changing policies. The definition of regret they deal with seems particularly useful in many situation.
Hal Daume, Unsupervised Search-based Structured Prediction at ICML. This paper shows a technique for reducing unsupervised learning to supervised learning which (a) make a fast unsupervised learning algorithm and (b) makes semisupervised learning both easy and highly effective.
There were two papers with similar results on active learning in the KWIK framework for linear regression, both reducing the sample complexity to . One was Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona Robust Bounds for Classification via Selective Sampling at ICML and the other was Thomas Walsh, Istvan Szita, Carlos Diuk, Michael Littman Exploring compact reinforcement-learning representations with linear regression at UAI. The UAI paper covers application to RL as well.
Ping Li, Improving Compressed Counting at UAI. This paper talks about how to keep track of the moments in a datastream with very little space and computation. I’m not sure I have a use for it yet, but it seems like a cool piece of basic technology.
Mark Reid and Bob Williamson Surrogate Regret Bounds for Proper Losses at ICML. This paper points out that via the integral characterization of proper losses, proper scoring rules can be reduced to binary classification. The results unify and generalize the Probing and Quanting reductions we worked on previously. This paper is also related to Nicolas Lambert‘s work, which is quite thought provoking in terms of specifying what is learnable.
Daniel Hsu, Sham M. Kakade and Tong Zhang. A Spectral Algorithm for Learning Hidden Markov Models COLT. This paper shows that a subset of HMMs can be learned using an SVD-based algorithm.
Samory Kpotufe, Escaping the curse of dimensionality with a tree-based regressor at COLT. This paper shows how to directly applying regression in high dimensional vector spaces and have it succeed anyways because the data is naturally low-dimensional.
Shai Ben-David, David Pal and Shai Shalev-Shwartz. Agnostic Online Learning at COLT. This paper characterizes the ability to learn when an adversary is choosing features in the online setting as the “Littlestone dimension”.

6/15/20096/17/2009

In Active Learning, the question changes

A little over 4 years ago, Sanjoy made a post saying roughly “we should study active learning theoretically, because not much is understood”.

At the time, we did not understand basic things such as whether or not it was possible to PAC-learn with an active algorithm without making strong assumptions about the noise rate. In other words, the fundamental question was “can we do it?”

The nature of the question has fundamentally changed in my mind. The answer is to the previous question is “yes”, both information theoretically and computationally, most places where supervised learning could be applied.

In many situation, the question has now changed to: “is it worth it?” Is the programming and computational overhead low enough to make the label cost savings of active learning worthwhile? Currently, there are situations where this question could go either way. Much of the challenge for the future is in figuring out how to make active learning easier or more worthwhile.

At the active learning tutorial, I stated a set of somewhat more precise research questions that I don’t yet have answer to, and which I believe are worth answering. Here is a bit of an expansion on those questions for those interested.

Is active learning possible in a fully adversarial setting? By fully adversarial, I mean when an adversary controls all the algorithms observations. Some work by Claudio and Nicolo has moved in this direction, but there is not yet a solid answer.
Is there an efficient and effective reduction of active learning to supervised learning? The bootstrap IWAL approach is efficient but not effective in some situations where other approaches can succeed. The algorithm here is a reduction to a special kind of supervised learning where you can specify both examples and constraints. For many supervised learning algorithms, adding constraints seems problematic.
Can active learning succeed with alternate labeling oracles? The ones I see people trying to use in practice often differ because they can provide answers of varying specificity and cost, or because some oracles are good for some questions, but not good for others.
At this point, there have been several successful applications of active learning, but that’s not the same thing as succeeding with more robust algorithms. Can we succeed empirically with more robust algorithms? And is the empirical cost of additional robustness worth the empirical peace-of-mind that your learning algorithm won’t go astray where other more aggressive approaches may do so?

6/3/20096/8/2009

Functionally defined Nonlinear Dynamic Models

Suppose we have a set of observations over time x₁,x₂,…,x_t and want to predict some future event y_t+1. An inevitable problem arises, because learning a predictor h(x₁,…,x_t) of y_t+1 is generically intractable due to the size of the input. To make this problem tractable, what’s necessary is a method for summarizing the relevant information in past observations for the purpose of prediction in the future. In other words, state is required.

Existing approaches for deriving state have some limitations.

Hidden Markov models learned with EM suffer from local minima, use tabular learning approaches which provide dubious generalization ability, and often require substantial a.priori specification of the observations.
Kalman Filters and Particle Filters are very parametric in the sense that substantial information must be specified up front.
Dynamic Bayesian Networks (graphical models through time) require substantial a.priori specification and often require the solution of difficult computational problems to use. Some of these difficulties are representational rather than computational.
The Subspace-ID approach from control theory uses a linear representation, with the basic claim that it works well when all transformations are linear, but not so well when things are nonlinear. (Thanks to Drew for pointing it out.) In making this post, I ran across this two day tutorial which discusses extensions of this idea to nonlinear systems. Unfortunately, I’ll miss the tutorial, and I haven’t found the related paper.

The point of this paper at ICML is that some dynamic systems (those which are “invertible”), can be decomposed into separate bounded resource prediction problems which, when solved, create an implicit definition of state. This allows us to use any general purpose supervised learning algorithm to solve the state formation problem without requiring linearity or any specific representation. When writing papers you don’t generally gush too hard, but it’s fair to say that I’m excited by this approach.

It’s not a known dead end.
It doesn’t require lots of prior specification & information when you have lots of data.
It leverages the huge amount of work that has gone into supervised learning algorithm design.
It works in controlled systems also, where the control is simply another observation.
It works with generalization from the start, rather than requiring the (often awkward) addition of generalization later.
It doesn’t require predicting everything in order to predict what you want.
It can work with very large observation spaces, and can even work better the larger the observation space, because larger observations imply more invertibility.

I expect some people reading this paper will be disappointed that it doesn’t solve all problems. That’s good news for anyone interested in research. For those who aren’t note that this is (in some sense) a generalization of subspace ID, and hence that there are other applications of the approach known to work in practice. Furthermore, we have some sample complexity analysis in the linear case.

It’s relatively rare to have a paper about a new approach to solving a problem as intractable as nonlinear dynamics has proved to be, so if you see a flaw please speak up.

6/1/20096/8/2009

Multitask Poisoning

There are many ways that interesting research gets done. For example it’s common at a conference for someone to discuss a problem with a partial solution, and for someone else to know how to solve a piece of it, resulting in a paper. In some sense, these are the easiest results we can achieve, so we should ask: Can all research be this easy?

The answer is certainly no for fields where research inherently requires experimentation to discover how the real world works. However, mathematics, including parts of physics, computer science, statistics, etc… which are effectively mathematics don’t require experimentation. In effect, a paper can be simply a pure expression of thinking. Can all mathematical-style research be this easy?

What’s going on here is research-by-communication. Someone knows something, someone knows something else, and as soon as someone knows both things, a problem is solved. The interesting thing about research-by-communication is that it is becoming radically easier with improved technology. It’s already possible to communicate with nearly anyone anywhere in the world via {voice, video, text, shared program/website/blog}. As these mechanisms become cheaper in cost and ease of use, we’ll surely see their adoption into many activities. If all mathematical-style research can be sped up by communication, that’s great news for research.

However, I don’t believe it. The essential difficulty is that doing good research often requires the simultaneous understanding of several different things—the problem, all the broken approaches to solving some problem, why they break, and some hint about where to look for a solution. Absorbing and simultaneously keeping in mind this information requires a substantial amount of effort.

The extent of effort required is perhaps most easily understood in collaboration with yourself. Often, a problem is not immediately solved the first time it is thought of, instead a researcher must attack it again and again, until either giving up or finding a solution. A basic parameter in attacking a problem is: How hard is it to resume where you left off?

For many of the problems I’ve worked on, the answer is pretty hard. After weeks of not working on a problem, it might take several hours to refresh my memory with all the simultaneous concerns that go into a solution. Even if I worked on such a problem yesterday, it might take a half hour or an hour to reach a state where I’m prepared to make progress.

Given the difficulty of research, I (and many other people) often struggle in dealing with interruptions. Modern technology has made communication very easy, implying a stream of potential interruptions throughout the day, some of which are undeniably fruitful. And yet, an interruption means the overhead of getting back to thinking must be paid yet again.

Trading off properly between the value of avoiding the overhead and the value of dealing with interruptions is a constant struggle which typically did not exist before before modern communication technology made it prevalent. I think it is common to give in to the interrupts, and effectively cease to be able to do research other than research-by-communication. I’m not ready to do that, so I’ve found it essential to arrange large unused and uninterrupted periods of time for the purpose of doing research.

Although the solution is simple, I’ve often found implementing it challenging in several ways. There is the standard problem that people you deal with don’t understand the overhead of switching tasks in research. However, there is also a more insidious problem. Coping with the modern world requires that at least some portion of our time be devoted to interrupts, which are almost always easier to deal with than research. Dealing with these interrupts therefore can create a bad habit where you seek interrupts to achieve the (short term, easy) gratification of dealing with them. Thus, multitasking creates an internal expectation of multitasking, which makes multitasking preferred, eliminating the ability to do careful research. This is what I think of as multitask poisoning—it’s not just the time lost to swapping some context back in, but also the bad habits of self-interruption that it creates.

I don’t have a good answer to this problem, other than continuing the struggle to preserve substantial contiguous chunks of time for thinking. An alternative sometimes-applicable solution is to reduce the overhead of starting to solve a problem by decomposing the problem into subproblems. Where possible, this is of course valuable, but mathematical research is often almost uniquely undecomposable, because the nature of the problem is that it’s solution is unknown and hence undecomposable. Restated, the real problem is finding a valid decomposition.

I self-interrupted at least 3 times in making this post.