Muthu invited me to the workshop on algorithms in the field, with the goal of providing a sense of where near-term research should go. When the time came, though, I bargained for a post instead, which gives many other people a chance to comment.

There are several things I didn’t fully understand when I went to Yahoo! about 5 years ago. I’d like to repeat them here, as people in academia may not yet understand them intuitively.

- Almost all the big-impact algorithms operate in pseudo-linear or better time. Think about caching, hashing, sorting, filtering, etc., and you have a sense of what some of the most heavily used algorithms are. This matters quite a bit to Machine Learning research, because people often work with superlinear-time algorithms and languages. Two very common examples are graphical models, where inference is often a superlinear operation (think about the *n²* dependence on the number of states in a Hidden Markov Model), and Kernelized Support Vector Machines, where optimization is typically quadratic or worse. There are two basic reasons for this. The most obvious is that linear time allows you to deal with large datasets. A less obvious but critical point is that a superlinear-time algorithm is inherently buggy: it has an unreliable running time that can easily explode if you accidentally give it too much input.
- Almost anything worth doing requires many people working together. This happens for many reasons. One is the time-critical aspect of development: in many places it really is worthwhile to pay more to have something developed faster. Another is that real projects are simply much bigger than you might otherwise expect. A third is that real organizations have people coming and going, and any project maintained by just one person withers when they leave. This observation means that developing systems with clean abstractions can be extraordinarily helpful, as it allows people to work independently. It also means that simple, widely applicable tricks (for example, the hashing trick) can be broadly helpful.
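The hashing trick mentioned above is small enough to sketch. This is a minimal illustration of the idea, not VW's implementation; the MD5 hash and the bucket count are arbitrary choices of mine:

```python
import hashlib

def hashed_vector(tokens, num_buckets=2 ** 18):
    """Map arbitrary string features into a fixed-size vector by hashing.
    Memory stays O(num_buckets) no matter how many distinct features appear,
    and no feature dictionary need be built or shared between workers."""
    vec = [0.0] * num_buckets
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf8")).hexdigest(), 16)
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0  # signed hashing keeps inner products unbiased
        vec[h % num_buckets] += sign
    return vec
```

Collisions add a little noise, but the dimensionality (and thus memory) is fixed in advance, which is what makes the trick compatible with the linear-time, many-people constraints above: every worker maps features identically with no coordination.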

A good way to phrase research directions is with questions. Here are a few of my natural questions.

- How do we efficiently learn in settings where exploration is required? These are settings where the choice of action you take influences the observed reward—ad display and medical testing are two good scenarios. This is deeply critical to many applications, because the learning-with-exploration setting is inherently more natural than the standard supervised learning setting. The tutorial we did details much of the state of the art here, but very significant questions remain. How can we do efficient offline evaluation of algorithms (see here for a first attempt)? How can we be efficient in both sample complexity and computational complexity? Several groups are interested in sampling from a Bayesian posterior to solve these sorts of problems. Where and when can this be proved to work? (There is essentially no analysis.) What is a maximally distributed and incentive-compatible algorithm that remains efficient? The last question is very natural for marketplace design. How can we best construct reward functions operating on different time scales? What is the relationship between the realizable and agnostic versions of this setting, and how can we construct an algorithm which smoothly interpolates between the two?
- How can we learn from lots of data? We’ll be presenting a KDD survey tutorial about what’s been done. Some of the larger-scale learning problems have been addressed effectively using MapReduce. The best example I know of here is Ozgur Cetin’s algorithm at Y!: it’s preconditioned conjugate gradient with a Newton stepsize, using two passes over the examples per step. (A non-Hadoop version is implemented in VW for reference.) But linear predictors are not enough—we would like learning algorithms that can, for example, learn from all the images in the world. Doing this well plausibly requires a new approach and new learning algorithms. A key observation here is that the bandwidth required by the learning algorithm cannot be too great.
- How can we learn to index efficiently? The standard solution in information retrieval is to evaluate (or approximately evaluate) all objects in a database, returning the elements with the largest score according to some learned or constructed scoring function. This is an inherently *O(n)* operation, which is frustrating when it’s plausible that an exponentially faster *O(log n)* solution exists. A good solution involves both theory and empirical work, as we need to think about how to think about the problem, and of course we need to solve it.
- What is a flexible, inherently efficient language for architecting representations for learning algorithms? Right now, graphical models often get (mis)used for this purpose. It’s easy and natural to pose a computationally intractable graphical model, implying many real applications involve approximations. A better solution would be a different representation language which is always computationally tractable yet flexible enough to solve real-world problems. One starting point for this is Searn. Another general approach was the topic of the Coarse to Fine workshop. These are inherently related, as coarse-to-fine is a pruned breadth-first search. Restated: it is not enough to have a language for specifying your prior structural beliefs—we must have a language whose specifications result in computationally tractable solutions.
- The Deep Learning problem remains interesting. How do you effectively learn complex nonlinearities capable of better performance than a basic linear predictor? An effective solution avoids feature engineering. Right now, this is almost entirely dealt with empirically, but theory could easily have a role to play in phrasing appropriate optimization algorithms, for example.
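To make the offline-evaluation question in the first bullet concrete, the simplest estimator is inverse propensity scoring: reweight logged exploration data by the probability with which the logging policy chose each action. A minimal sketch (the record format and names are mine, not from any particular system):

```python
def ips_estimate(logged, policy):
    """Unbiased estimate of a new policy's average reward from exploration logs.
    Each record is (context, action, reward, prob), where prob is the
    probability that the *logging* policy chose `action` in `context`."""
    total = 0.0
    for context, action, reward, prob in logged:
        if policy(context) == action:
            total += reward / prob  # importance weight: matched records count 1/prob times
    return total / len(logged)
```

For example, if the logging policy chose uniformly between two actions, each logged record has prob = 0.5, and a new policy is evaluated using only the records where it agrees with the logged action.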
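On the indexing question above, one way to see why sub-linear retrieval is plausible is branch and bound over a tree of items, pruning any subtree whose score upper bound is already beaten. The sketch below assumes a linear scoring function and uses coordinate-wise bounds; it is an illustration of the pruning idea, not a solution to the open problem:

```python
def build(items):
    """items: list of (id, vector). Build a binary tree storing, at each node,
    coordinate-wise min/max over its subtree (valid bounds for any split)."""
    if len(items) == 1:
        iid, v = items[0]
        return {"leaf": (iid, v), "lo": list(v), "hi": list(v)}
    mid = len(items) // 2
    left, right = build(items[:mid]), build(items[mid:])
    return {"left": left, "right": right,
            "lo": [min(a, b) for a, b in zip(left["lo"], right["lo"])],
            "hi": [max(a, b) for a, b in zip(left["hi"], right["hi"])]}

def search(w, node, best=(float("-inf"), None)):
    """Return (score, id) of the max-scoring item under score(x) = w . x."""
    bound = sum(wj * (h if wj > 0 else l)
                for wj, l, h in zip(w, node["lo"], node["hi"]))
    if bound <= best[0]:
        return best  # prune: nothing in this subtree can beat the current best
    if "leaf" in node:
        iid, v = node["leaf"]
        score = sum(a * b for a, b in zip(w, v))
        return (score, iid) if score > best[0] else best
    best = search(w, node["left"], best)
    return search(w, node["right"], best)
```

With a good partition of the items, most subtrees are pruned and the search touches far fewer than *n* leaves; the hard part, of course, is learning partitions with that property.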

Good solutions to each of the research directions above would result in a revolution in its area, and every one of them would plausibly see wide applicability.

What’s missing?

Cross-posted at CACM.

The main thing I learned in industry that surprised me is that good technical managers are worth their weight in gold. It’s because your points 1 and 2 are related: scaling projects to more people requires managing parallelism and dependencies, and clean abstractions are the best (only?) way to do this. Good tech managers are great at this. They are way better at deciding what I should be doing than I am. Separation of concerns between what should be done (a marketing issue) and how it should be done (an engineering issue) was also eye-opening, when the marketing people are good. It’s not the same tradeoff as in research, because memory and thread utilization, robustness, portability, and maintainability (all related back to good abstractions) are more important than the last 1/2% of accuracy.

Real problems (almost?) always have constraints. I mentioned this in a previous comment on customers who have absolute minimum precision guidelines, but constraints also show up as efficiency requirements. For instance, the telephony platforms at Lucent had almost no memory relative to even my circa-1998 notebook (it’s all about the heat in densely packed chips). Or the latency constraint I had in semantic parsing at SpeechWorks, where we had to handle 20 simultaneous conversations with a maximum latency of about 50ms per semantic processing step. It turned out that a single long-lived JavaScript virtual machine had an average latency of 20ms but occasional GC burps of 2000ms, whereas rebuilding the JavaScript virtual machine every turn had an average latency of around 40ms and no GC burps at all, and so was the right solution for the job.

I also learned that documentation within code is overrated; you almost always have to get down and read the code, because you can’t trust comments not to lie. Interface requirements need to be documented, but the code itself rarely does.

Is there a freely available *Hadoop* implementation of Ozgur Cetin’s CG algorithm?

Not that I know of.

Mahout will have some form of hadoop-ified CG in the next release (track progress on this JIRA: https://issues.apache.org/jira/browse/MAHOUT-672 ), but I’m not sure if this is Ozgur Cetin’s version…

CG here refers to the nonlinear version of CG, an algorithm for optimizing a function that only needs access to gradients and Hessian-vector products.
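To make the Hessian-vector point concrete, here is a minimal sketch of nonlinear CG with a Newton stepsize alpha = -g·d / d·Hd, where Hd comes from a Hessian-vector-product callback and the Hessian itself is never formed. This illustrates the general technique, not Ozgur Cetin's actual implementation:

```python
def cg_minimize(grad, hess_vec, x, iters=50):
    """Nonlinear CG (Fletcher-Reeves) with a Newton stepsize.
    grad(x) -> gradient; hess_vec(x, d) -> H(x) applied to d, so only
    Hessian-vector products are ever computed."""
    g = grad(x)
    d = [-gi for gi in g]
    for _ in range(iters):
        hd = hess_vec(x, d)
        dhd = sum(di * hi for di, hi in zip(d, hd))
        if dhd <= 1e-12:  # direction numerically flat (or converged): stop
            break
        alpha = -sum(gi * di for gi, di in zip(g, d)) / dhd  # Newton stepsize
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = grad(x)
        beta = sum(gi * gi for gi in g_new) / sum(gi * gi for gi in g)
        d = [-gi + beta * di for gi, di in zip(g_new, d)]
        g = g_new
    return x
```

On a quadratic the Newton stepsize is an exact line search, so CG converges in at most dimension-many steps; each step needs only passes over the data to accumulate the gradient and the Hessian-vector product, which is what makes it MapReduce-friendly.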

I was having a conversation with Udi Manber about 12 years ago at a Usenix conference in New Orleans. Udi was a long-time academic and is now a VP at Google; I was one of his students. He told me how we studied so many exotic algorithms, but that in practice anything over O(n log n) was not scalable. In other words, anything more computationally complex than sorting doesn’t work at Internet scale. This is one reason I hope that one day we can apply some version of suffix arrays at Internet scale. I think it is doable.
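For readers who haven't met them: a suffix array is just the lexicographically sorted order of a string's suffixes. The prefix-doubling sketch below runs in O(n log² n), nowhere near Internet scale, but the rank-doubling idea it illustrates is the core of the more scalable constructions:

```python
def suffix_array(s):
    """Suffix array by prefix doubling: sort suffixes by their first k
    characters, doubling k each round until all ranks are distinct."""
    n = len(s)
    if n < 2:
        return list(range(n))
    rank = [ord(c) for c in s]
    sa = list(range(n))
    k = 1
    while True:
        # Sort key: (rank of first k chars, rank of next k chars or -1 past the end).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new = [0] * n
        for j in range(1, n):
            new[sa[j]] = new[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new
        k *= 2
        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            break
    return sa
```

Each round is a sort plus a scan, which is why the approach distributes reasonably well: the bottleneck is repeated large-scale sorting, exactly the operation Internet-scale infrastructure is built around.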

Do you happen to know whether tutorials at this year’s KDD will be recorded and put up on videolectures.net as in previous years?