Predictive Analytics World

Carla Vicens and Eric Siegel contacted me about Predictive Analytics World in San Francisco on February 18 and 19, a conference I wasn’t familiar with. A quick look at the agenda reveals several people I know working on applications of machine learning in business, covering topics in deployed applications. It’s interesting to see a business-focused machine learning conference, as it says that we are succeeding as a field. If you are interested in deployed applications, you might attend.

Eric and I did a quick interview by email.

John >
I’ve mostly published and participated in academic machine learning conferences like ICML, COLT, and NIPS. When I look at the set of speakers and subjects for your conference I think “machine learning for business”. Is that your understanding of things? What I’m trying to ask is: what do you view as the primary goal for this conference?

Eric >
You got it. This is the business event focused on the commercial deployment of technology developed at the research conferences you named. Academics’ term, “machine learning,” is essentially synonymous with the business world’s “predictive modeling”. Predictive Analytics World focuses on business applications of this technology, such as response modeling, churn modeling, email targeting, product recommendations, insurance pricing, and credit scoring. PAW’s goal is to strengthen the business impact delivered by predictive analytics deployment, and establish new opportunities with predictive analytics. The conference delivers case studies, expertise and resources to this end.

The conference program is designed to speak the language of marketing and business professionals using or planning to use predictive analytics to solve business challenges — but for the hands-on practitioner or analytical expert focused on commercial deployment who wishes to speak this same language, it’s an equally valuable event.

John >
People at academic conferences would hope that technology developed there can transfer into business use. In your experience, does this happen? And how fast or difficult is it?

Eric >
The best way to catalyze commercial deployment is to show people it really works outside “the lab” – which is why PAW’s program is packed primarily with named case studies of commercial deployment. These success stories answer your question with a resounding “yes”: the core technology developed academically is indeed put to use.

But predictive analytics has not yet been broadly adopted across all industries, although success stories show at least initial reach in each vertical. So, sure, to one who previously wore a researcher’s hat, commercial deployment can feel slow; having solved the hardest theoretical, mathematical or statistical problems, the rest should go smoothly, right? Not exactly. The main challenges come in ramping up the business “consumer” on the technology so they see its value, positioning the technology in a way that provides business value, and, on the integration side, in preparing corporate data for predictive modeling (that’s a doozy!) and in integrating predictive scores into existing systems and processes. These things take time.

John >
Sometimes people working in the academic world don’t have a good understanding of what the real problems are. Do you have a sense of which areas of research are underemphasized in the academic world?

Eric >
To reach commercial success in deploying predictive analytics for the business applications I listed above, the main challenges are on the process and non-analytical integration side, rather than in core machine learning technology; it’s good enough. So, I don’t consider there to be glaring omissions in the capabilities of core machine learning (I taught the machine learning graduate course at Columbia University and still consider Tom Mitchell’s textbook to be my bible).

On the other hand, there are always places where “real-world” data is going to bring researchers’ attention to as-yet-unsolved problems. A perfect example is the Netflix Prize, the current leader of which (and winner of the recent Progress Prize) will be speaking at PAW-09 – see here.

Interesting Papers at SODA 2009

Several talks seem potentially interesting to ML folks at this year’s SODA.

  1. Maria-Florina Balcan, Avrim Blum, and Anupam Gupta, Approximate Clustering without the Approximation. This paper gives reasonable algorithms with provable approximation guarantees for k-median and other notions of clustering. It’s conceptually interesting, because it’s the second example I’ve seen where NP hardness is subverted by changing the problem definition in a subtle but reasonable way. Essentially, they show that if any near-approximation to an optimal solution is good, then it’s computationally easy to find a near-optimal solution. This subtle shift bears serious thought. A similar one occurred in our ranking paper with respect to minimum feedback arcset. With two known examples, it suggests that many more NP-complete problems might be finessed into irrelevance in this style.
  2. Yury Lifshits and Shengyu Zhang, Combinatorial Algorithms for Nearest Neighbors, Near-Duplicates, and Small-World Design. The basic idea of this paper is that actually creating a metric with a valid triangle inequality is hard for real-world problems, so it’s desirable to have a data structure which works with a relaxed notion of triangle inequality. The precise relaxation is more extreme than you might imagine, implying the associated algorithms give substantial potential speedups in incomparable applications. Yury tells me that a cover tree style “true O(n) space” algorithm is possible. If worked out and implemented, I could imagine substantial use.
  3. Elad Hazan and Satyen Kale, Better Algorithms for Benign Bandits. The basic idea of this paper is that in real-world applications, an adversary is less powerful than is commonly supposed, so carefully taking into account the observed variance can yield an algorithm which works much better in practice, without sacrificing the worst case performance.
  4. Kevin Matulef, Ryan O’Donnell, Ronitt Rubinfeld, Rocco Servedio, Testing Halfspaces. The basic point of this paper is that testing halfspaces is qualitatively easier than finding a good halfspace with respect to 0/1 loss. Although the analysis is laughably far from practical, the result is striking, and it’s plausible that the algorithm works much better than the analysis. The core algorithm is at least conceptually simple: test that two correlated random points have the same sign, with “yes” being evidence of a halfspace and “no” not.
  5. I also particularly liked Yuval Peres’s invited talk The Unreasonable Effectiveness of Martingales. Martingales are endemic to learning, especially online learning, and I suspect we can tighten and clarify several arguments using some of the techniques discussed.
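
The same-sign test in item 4 can be sketched in a few lines. This is only an illustration under my own simplifying assumptions, not the algorithm or analysis from the paper: inputs are standard Gaussian, the halfspace passes through the origin, and the names `agreement_rate` and `halfspace_agreement` are hypothetical. For such a halfspace, two points with correlation rho get the same label with probability exactly 1 - arccos(rho)/pi, so an observed agreement rate well below that value is evidence against a halfspace.

```python
import math
import random

def agreement_rate(f, dim, rho, n_pairs, seed=0):
    """Estimate P[f(x) == f(y)] for rho-correlated standard Gaussian pairs."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)
    agree = 0
    for _ in range(n_pairs):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # y is marginally standard Gaussian, with correlation rho to x per coordinate
        y = [rho * xi + scale * rng.gauss(0.0, 1.0) for xi in x]
        agree += f(x) == f(y)
    return agree / n_pairs

def halfspace_agreement(rho):
    """Exact agreement probability for an origin-centered halfspace."""
    return 1.0 - math.acos(rho) / math.pi

# Hypothetical functions to test: one halfspace, one XOR-like non-halfspace.
w = [1.0, -2.0, 0.5, 3.0, -1.0]
is_halfspace = lambda x: sum(wi * xi for wi, xi in zip(w, x)) >= 0
is_not_halfspace = lambda x: x[0] * x[1] >= 0

if __name__ == "__main__":
    rho = 0.5
    print("halfspace target:", halfspace_agreement(rho))
    print("halfspace observed:", agreement_rate(is_halfspace, 5, rho, 20000))
    print("XOR-like observed:", agreement_rate(is_not_halfspace, 5, rho, 20000))
```

The point of the sketch is that the test only ever evaluates the function on random pairs; it never tries to find the halfspace itself, which is why testing can be so much cheaper than learning.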

Adversarial Academia

One viewpoint on academia is that it is inherently adversarial: there are finite research dollars, positions, and students to work with, implying a zero-sum game between different participants. This is not a viewpoint that I want to promote, as I consider it flawed. However, I know several people believe strongly in this viewpoint, and I have found it to have substantial explanatory power.

For example:

  1. It explains why your paper was rejected based on poor logic. The reviewer wasn’t concerned with research quality, but rather with rejecting a competitor.
  2. It explains why professors rarely work together. The goal of a non-tenured professor (at least) is to get tenure, and a case for tenure comes from a portfolio of work that is indisputably yours.
  3. It explains why new research programs are not quickly adopted. Adopting a competitor’s program is impossible, if your career is based on the competitor being wrong.

Different academic groups subscribe to the adversarial viewpoint in different degrees. In my experience, NIPS is the worst. It is bad enough that the probability of a paper being accepted at NIPS is monotonically decreasing in its quality. This is more than just my personal experience over a number of years, as it’s corroborated by others who have told me the same. ICML (run by IMLS) used to have less of a problem, but since it has become more like NIPS over time, it has inherited this problem. COLT has not suffered from this problem as much in my experience, although it had other problems related to the focus being defined too narrowly. I do not have enough experience with UAI or KDD to comment there.

There are substantial flaws in the adversarial viewpoint.

  1. The adversarial viewpoint makes you stupid. When viewed adversarially, any idea has crippling disadvantages and no advantages. Contorting your viewpoint enough to make this true damages your ability to conduct research. In short, it promotes poor mental hygiene.
  2. Many activities become impossible. Doing research is in general extremely hard, so there are many instances where working with other people can allow you to do things which are otherwise impossible.
  3. The previous two disadvantages apply even more strongly for a community—good ideas are more likely to be missed, change comes slowly, and often with steps backward.
  4. At its most basic level, the assumption that research is zero-sum is flawed, because the process of research is not done in a closed system. If the rest of society at large discovers that research is valuable, then the budget increases.

Despite these disadvantages, there is a substantial advantage as well: you can materially protect and aid your career by rejecting papers, preventing grants, and generally discriminating against key people doing interesting but competitive work.

The adversarial viewpoint has a validity in proportion to the number of people subscribing to it. For those of us who would like to deemphasize the adversarial viewpoint, what’s unclear is: how?

One concrete thing is: use Arxiv. For a long time, physicists have adopted an Arxiv-first philosophy, which I’ve come to respect. Arxiv functions as a universal timestamp which decreases the power of an adversarial reviewer. Essentially, you avoid giving away the power to muddy the track of invention. I’m expecting to use Arxiv for essentially all my past-but-unpublished and future papers.

It is plausible that limiting the scope of bidding, as Andrew McCallum suggested at the last ICML, and as is effectively implemented at this ICML, will help. The system of review at journals might also help for the same reason. In my experience as an author, if an anonymous reviewer wants to kill a paper they usually succeed. Most area chairs or program chairs are more interested in avoiding conflict with the reviewer (who they picked and may consider a friend) than reading the paper to determine the illogic of the review (which is a difficult task that simply cannot be done for all papers). NIPS experimented with a reputation system for reviewers last year, but I’m unclear on how well it worked, as an author’s score for a review and a reviewer’s score for the paper may be deeply correlated, revealing little additional information.

Public discussion of research can help with this, because very poor logic simply doesn’t stand up under public scrutiny. While I hope to nudge people in this direction, it’s clear that most people aren’t yet comfortable with public discussion.

Use of Learning Theory

I’ve had serious conversations with several people who believe that the theory in machine learning is “only useful for getting papers published”. That’s a compelling statement, as I’ve seen many papers where the algorithm clearly came first, and the theoretical justification for it came second, purely as a perceived means to improve the chance of publication.

Naturally, I disagree and believe that learning theory has much more substantial applications.

Even in core learning algorithm design, I’ve found learning theory to be useful, although its application is more subtle than many realize. The most straightforward applications can fail, because (as expectation suggests) worst case bounds tend to be loose in practice (*). In my experience, considering learning theory when designing an algorithm has two important effects in practice:

  1. It can help make your algorithm behave right at a crude level of analysis, leaving finer details to tuning or common sense. The best example I have of this is the Isomap, where the algorithm was informed by the analysis yielding substantial improvements in sample complexity over earlier algorithmic ideas.
  2. An algorithm with learning theory considered in its design can be more automatic. I’ve gained more respect for Rifkin’s claim: that the one-against-all reduction, when tuned well, can often perform as well as other approaches. The “when tuned well” caveat is however substantial, because learning algorithms may be applied by nonexperts or by other algorithms which are computationally constrained. A reasonable and worthwhile hope for other methods of addressing multiclass problems is that they are more automatic and computationally faster. The subtle issue here is: How do you measure “more automatic”?
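
For readers who haven’t seen it, the one-against-all reduction mentioned in item 2 can be sketched as follows. This is a minimal illustration, not Rifkin’s setup: his claim concerns well-tuned binary learners, while the untuned perceptron here is only a stand-in I chose to keep the example self-contained, and all function names are hypothetical.

```python
def train_perceptron(points, signs, epochs=1000):
    """Binary perceptron with a bias term; returns a real-valued scoring function."""
    w = [0.0] * (len(points[0]) + 1)
    for _ in range(epochs):
        mistakes = 0
        for x, s in zip(points, signs):
            xb = list(x) + [1.0]  # append a constant feature for the bias
            score = sum(wi * xi for wi, xi in zip(w, xb))
            if s * score <= 0:
                w = [wi + s * xi for wi, xi in zip(w, xb)]
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return lambda x: sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))

def one_against_all_train(points, labels, train_binary=train_perceptron):
    """Train one 'class c versus everything else' scorer per class."""
    scorers = {}
    for c in sorted(set(labels)):
        signs = [1 if y == c else -1 for y in labels]
        scorers[c] = train_binary(points, signs)
    return scorers

def one_against_all_predict(scorers, x):
    """Predict the class whose binary scorer is most confident."""
    return max(scorers, key=lambda c: scorers[c](x))

# Toy data: three well-separated clusters in the plane.
points = [(5, 0), (6, 1), (5, -1), (-5, 0), (-6, 1), (-5, -1), (0, 5), (1, 6), (-1, 5)]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
scorers = one_against_all_train(points, labels)
predictions = [one_against_all_predict(scorers, x) for x in points]
```

The “when tuned well” caveat shows up here as everything the sketch leaves out: the scores of the separate binary learners are compared directly in the argmax, so how each one is regularized and scaled matters a great deal in practice.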

In my experience, learning theory is most useful in its crudest forms. A good example comes in the architecting problem: how do you go about solving a learning problem? I mean this in the broadest sense imaginable:

  1. Is it a learning problem or not? Many problems are most easily solved via other means such as engineering, because that’s easier, because there is a severe data gathering problem, or because there is so much data that memorization works fine. Learning theory such as statistical bounds and online learning with experts helps substantially here because it provides guidelines about what is possible to learn and what not.
  2. What type of learning problem is it? Is it a problem where exploration is required or not? Is it a structured learning problem? A multitask learning problem? A cost sensitive learning problem? Are you interested in the median or the mean? Is active learning usable or not? Online or not? Answering these questions correctly can easily make the difference between a successful application and not. Answering these questions is partly definition checking, and since the answer is often “all of the above”, figuring out which aspect of the problem to address first or next is helpful.
  3. What is the right learning algorithm to use? Here the relative capacity of a learning algorithm and its computational efficiency are most important. If you have few features and many examples, a nonlinear algorithm with more representational capacity is a good idea. If you have many features and little data, linear representations or even exponentiated gradient style algorithms are important. If you have very large amounts of data, the most scalable algorithms (so far) use a linear representation. If you have little data and few features, a Bayesian approach may be your only option. Learning theory can help in all of the above by quantifying “many”, “little”, “most”, and “few”. How do you deal with the overfitting problem? One thing I realized recently is that the overfitting problem can be a concern even with very large natural datasets, because some examples are naturally more important than others.
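
As one crude illustration of quantifying “many” and “few” in item 3, here is a sketch based on the standard Occam’s razor bound for a finite hypothesis class: with probability at least 1 - delta, every hypothesis has true error at most its training error plus sqrt((ln|H| + ln(1/delta)) / (2m)) on m i.i.d. examples. The way of counting hypotheses for a linear predictor (a fixed number of bits per weight) is my own simplifying assumption, and the function names are hypothetical.

```python
import math

def occam_gap(log_num_hypotheses, num_examples, delta):
    """Bound on (true error - training error), holding for all hypotheses at once."""
    return math.sqrt((log_num_hypotheses + math.log(1.0 / delta)) / (2.0 * num_examples))

def examples_needed(log_num_hypotheses, gap, delta):
    """Smallest number of examples for which occam_gap is at most `gap`."""
    return math.ceil((log_num_hypotheses + math.log(1.0 / delta)) / (2.0 * gap ** 2))

def log_size_linear(num_features, bits_per_weight):
    """ln|H| for linear predictors with discretized weights (an assumption)."""
    return num_features * bits_per_weight * math.log(2.0)

if __name__ == "__main__":
    for num_features in (10, 1000):
        m = examples_needed(log_size_linear(num_features, 32), gap=0.05, delta=0.05)
        print(num_features, "features ->", m, "examples")
```

The bound is loose in exactly the way noted earlier, but the crude scaling it gives (examples needed grow linearly with the number of features for this class) is the kind of guidance that helps when picking an algorithm.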

As might be clear, I think of learning theory as somewhat broader than might be traditional. Some of this is simply education. Many people have only been exposed to one piece of learning theory, often VC theory or its cousins. After seeing this, they come to think of it as learning theory. VC theory is a good theory, but it is not complete, and other elements of learning theory seem at least as important and useful. Another aspect is publishability. Simply sampling from the learning theory in existing papers does not necessarily give a good distribution of subjects for teaching, because the goal of impressing reviewers does not necessarily coincide with the clean simple analysis that is teachable.

(*) There is significant investigation into improving the tightness of bounds to the point of usefulness, and maybe it will pay off.

Summer Conferences

Here’s a handy table for the summer conferences.

Conference        | Deadline     | Reviewer Targeting | Double Blind | Author Feedback | Location         | Date
ICML (wrong ICML) | January 26   | Yes                | Yes          | Yes             | Montreal, Canada | June 14-17
COLT              | February 13  | No                 | No           | Yes             | Montreal         | June 19-21
UAI               | March 13     | No                 | Yes          | No              | Montreal         | June 19-21
KDD               | February 2/6 | No                 | No           | No              | Paris, France    | June 28-July 1

Reviewer targeting is new this year. The idea is that many poor decisions happen because the papers go to reviewers who are unqualified, and the hope is that allowing authors to point out who is qualified results in better decisions. In my experience, this is a reasonable idea to test.

Both UAI and COLT are experimenting this year as well with double blind and author feedback, respectively. Of the two, I believe author feedback is more important, as I’ve seen it make a difference. However, I still consider double blind reviewing a net win, as it’s a substantial public commitment to fairness.