I just attended CODE. The set of people interested in digital experimentation have very diverse backgrounds encompassing theory, machine learning, social science, economics, and industry so this seems like a good subject for a new conference. I hope it continues.
I found several talks interesting.
- Eytan Bakshy talked about PlanOut which is language/platform for flexibly specifying experiments.
- Ron Kohavi talked about EXP which is a heavily used A/B testing platform.
- Susan Athey talked about long term vs short term metrics which seems both important to address, a constant problem, and not yet systematically solved.
There was a panel about the ongoing Facebook experimentation controversy. The issue here is complex. My understanding is that Facebook users have some expected ownership of the content they create, and hence aren’t comfortable with the content being used in unexpected ways. On the other hand, experimentation is so necessary to the functioning of all large modern internet sites that banning it or slowing down the process by a factor of a million (as some advocated) would badly degrade the future of these sites in practice.
My belief is that what’s lacking is education and trust. W.r.t. education, people need to understand that experimentation is unavoidable when trying to figure out how to optimize an enormously complex system, as there is just no other way to systematically make 1000 right decisions as is necessary for basic things like choosing the best homepage/search result/etc… W.r.t. trust, companies are not particularly good at creating trust in general, but finding the right mechanism for doing so seems critical. I would point out Vanguard as a company that managed to successfully create trust by design.
A new version of VW is out. The primary changes are:
- Learning Reductions: I’ve wanted to get learning reductions working and we’ve finally done it. Not everything is implemented yet, but VW now supports direct:
- Multiclass Classification –oaa or –ect.
- Cost Sensitive Multiclass Classification –csoaa or –wap.
- Contextual Bandit Classification –cb.
- Sequential Structured Prediction –searn or –dagger
In addition, it is now easy to build your own custom learning reductions for various plausible uses: feature diddling, custom structured prediction problems, or alternate learning reductions. This effort is far from done, but it is now in a generally useful state. Note that all learning reductions inherit the ability to do cluster parallel learning.
- Library interface: VW now has a basic library interface. The library provides most of the functionality of VW, with the limitation that it is monolithic and nonreentrant. These will be improved over time.
- Windows port: The priority of a windows port jumped way up once we moved to Microsoft. The only feature which we know doesn’t work at present is automatic backgrounding when in daemon mode.
- New update rule: Stephane visited us this summer, and we fixed the default online update rule so that it is unit invariant.
There are also many other small updates including some contributed utilities that aid the process of applying and using VW.
Plans for the near future involve improving the quality of various items above, and of course better documentation: several of the reductions are not yet well documented.
At KDD I enjoyed Stephen Boyd‘s invited talk about optimization quite a bit. However, the most interesting talk for me was David Haussler‘s. His talk started out with a formidable load of biological complexity. About half-way through you start wondering, “can this be used to help with cancer?” And at the end he connects it directly to use with a call to arms for the audience: cure cancer. The core thesis here is that cancer is a complex set of diseases which can be distentangled via genetic assays, allowing attacking the specific signature of individual cancers. However, the data quantity and complex dependencies within the data require systematic and relatively automatic prediction and analysis algorithms of the kind that we are best familiar with.
Some of the papers which interested me are:
- Kai-Wei Chang and Dan Roth, Selective Block Minimization for Faster Convergence of Limited Memory Large-Scale Linear Models, which is about effectively using a hard-example cache to speedup learning.
- Leland Wilkinson, Anushka Anand, and Dang Nhon Tuan, CHIRP: A New Classifier Based on Composite Hypercubes on Iterated Random Projections. The bar on creating new classifiers is pretty high. The approach here uses a combination of random projection and partition which appears to be compelling for some nonlinear and relatively high computation settings. They do a more thorough empirical evaluation than most papers.
- Zhuang Wang, Nemanja Djuric, Koby Crammer, and Slobodan Vucetic Trading Representability for Scalability: Adaptive Multi-Hyperplane Machine for Nonlinear Classification. The paper explores an interesting idea: having lots of weight vectors (effectively infinity) associated with a particular label, showing that algorithms on this representation can deal with lots of data as per linear predictors, but with superior-to-linear performance. The authors don’t use the hashing trick, but their representation is begging for it.
- Michael Bruckner and Tobias Scheffer, Stackelberg Games for Adversarial Prediction Problem. This is about email spam filtering, where the authors use a theory of adversarial equilibria to construct a more robust filter, at least in some cases. Demonstrating this on noninteractive data is inherently difficult.
There were also three papers that were about creating (or perhaps composing) learning systems to do something cool.
- Gideon Dror, Yehuda Koren, Yoelle Maarek, and Idan Szpektor, I Want to Answer, Who Has a Question? Yahoo! Answers Recommender System. This is about how to learn to route a question to the appropriate answerer automatically.
- Yehuda Koren, Edo Liberty, Yoelle Maarek, and Roman Sandler, Automatically Tagging Email by Leveraging Other Users’ Folders. This is about helping people organize their email with machine learning.
- D. Sculley, Matthew Eric Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, Yunkai Zhou, Detecting Adversarial Advertisements in the Wild. The title is an excellent abstract here, and there are quite a few details about the implementation.
I also attended MUCMD, a workshop on the Meaningful Use of Complex Medical Data shortly afterwards. This workshop is about the emergent area of using data to improve medicine. The combination of electronic health records, the economic importance of getting medicine right, and the relatively weak use of existing data implies there is much good work to do.
This finally gave us a chance to discuss radically superior medical trial designs based on work in exploration and learning
Jeff Hammerbacher‘s talk was a hilarilously blunt and well stated monologue about the need and how to gather data in a usable way.
Amongst the talks on using medical data, Suchi Saria‘s seemed the most mature. They’ve constructed a noninvasive test for problem infants which is radically superior to the existing Apgar score according to leave-one-out cross validation.
From the doctor’s side, there was discussion of the deep balkanization of data systems within hospitals, efforts to overcome that, and the (un)trustworthiness of data. Many issues clearly remain here, but it also looks like serious progress is being made.
Overall, the workshop went well, with the broad cross-section of talks providing quite a bit of extra context you don’t normally see. It left me believing that a community centered on MUCMD is rising now, with attendant workshops, conferences, etc… to be expected.
Nikos pointed out this new york times article about poor clinical design killing people. For those of us who study learning from exploration information this is a reminder that low regret algorithms are particularly important, as regret in clinical trials is measured by patient deaths.
Two obvious improvements on the experimental design are:
- With reasonable record keeping of existing outcomes for the standard treatments, there is no need to explicitly assign people to a control group with the standard treatment, as that approach is effectively explored with great certainty. Asserting otherwise would imply that the nature of effective treatments for cancer has changed between now and a year ago, which denies the value of any clinical trial.
- An optimal experimental design will smoothly phase between exploration and exploitation as evidence for a new treatment shows that it can be effective. This is old tech, for example in the EXP3.P algorithm (page 12 aka 59) although I prefer the generalized and somewhat clearer analysis of EXP4.P.
Done the right way, the clinical trial for a successful treatment would start with some initial small pool (equivalent to “phase 1″ in the article) and then simply expanded the pool of participants over time as it proved superior to the existing treatment, until the pool is everyone. And as a bonus, you can even compete with policies on treatments rather than raw treatments (i.e. personalized medicine).
Getting from here to there seems difficult. It’s been 15 years since EXP3.P was first published, and the progress in clinical trial design seems glacial to us outsiders. Partly, I think this is a communication and education failure, but partly, it’s also a failure of imagination within our own field. When we design algorithms, we often don’t think about all the applications, where a little massaging of the design in obvious-to-us ways so as to suit these applications would go a long ways. Getting this right here has a substantial moral aspect, potentially saving millions of lives over time through more precise and fast deployments of new treatments.
Consider the contextual bandit setting where, repeatedly:
- A context x is observed.
- An action a is taken given the context x.
- A reward r is observed, dependent on x and a.
Where the goal of a learning agent is to find a policy for step 2 achieving a large expected reward.
This setting is of obvious importance, because in the real world we typically make decisions based on some set of information and then get feedback only about the single action taken. It also fundamentally differs from supervised learning settings because knowing the value of one action is not equivalent to knowing the value of all actions.
A decade ago the best machine learning techniques for this setting where implausibly inefficient. Dean Foster once told me he thought the area was a research sinkhole with little progress to be expected. Now we are on the verge of being able to routinely attack these problems, in almost exactly the same sense that we routinely attack bread and butter supervised learning problems. Just as for supervised learning, we know how to create and reuse datasets, how to benchmark algorithms, how to reuse existing supervised learning algorithms in this setting, and how to achieve optimal rates of learning quantitatively similar to supervised learning.
This is also one of the times when understanding the basic theory can make a huge difference in your success. There are many wrong ways to attack contextual bandit problems or prepare datasets, and taking a wrong turn can easily mean the difference between failure and success. Understanding how contextual bandit problems differ from basic supervised learning problems is critical to routine success here.
All of the above is not meant to claim that everything is done research-wise here so we’ll try to outline where the current boundary of research lies as best we can. However, we are surely at a point both in terms of application demand (especially for internet applications of search, advertising, page optimization, but also medical applications and surely others) and methodology supply (with basic reliable techniques now easily available or created) where these techniques are shifting from theory esoterica to required education.
Given the above, Alina and I decided to prepare a tutorial to be given at Yahoo! Labs summer school (my first India trip!), ICML, KDD, and hopefully videolectures.net. Please join us. The subjects we plan to cover are essentially the keys to the kingdom of solving shallow interactive learning problems.