Key Scientific Challenges and the Franklin Symposium

For graduate students, the Yahoo! Key Scientific Challenges program including in machine learning is on again, due March 9. The application is easy and the $5K award is high quality “no strings attached” funding. Consider submitting.

Those in Washington DC, Philadelphia, and New York, may consider attending the Franklin Institute Symposium April 25 which has several speakers and an award for V. Attendance is free with an RSVP.

Berkeley Streaming Data Workshop

The From Data to Knowledge workshop May 7-11 at Berkeley should be of interest to the many people encountering streaming data in different disciplines. It’s run by a group of astronomers who encounter streaming data all the time. I met Josh Bloom recently and he is broadly interested in a workshop covering all aspects of Machine Learning on streaming data. The hope here is that techniques developed in one area turn out useful in another which seems quite plausible. Particularly if you are in the bay area, consider checking it out.

2011 ML symposium and the bears

The New York ML symposium was last Friday. Attendance was 268, significantly larger than last year. My impression was that the event mostly still fit the space, although it was crowded. If anyone has suggestions for next year, speak up.

The best student paper award went to Sergiu Goschin for a cool video of how his system learned to play video games (I can’t find the paper online yet). Choosing amongst the submitted talks was pretty difficult this year, as there were many similarly good ones.

By coincidence all the invited talks were (at least potentially) about faster learning algorithms. Stephen Boyd talked about ADMM. Leon Bottou spoke on single pass online learning via averaged SGD. Yoav Freund talked about parameter-free hedging. In Yoav’s case the talk was mostly about a better theoretical learning algorithm, but it has the potential to unlock an exponential computational complexity improvement via oraclization of experts algorithms… but some serious thought needs to go in this direction.

Unrelated, I found quite a bit of truth in Paul’s talking bears and Xtranormal always adds a dash of funny. My impression is that the ML job market has only become hotter since 4 years ago. Anyone who is well trained can find work, with the key limiting factor being “well trained”. In this environment, efforts to make ML more automatic and more easily applied are greatly appreciated. And yes, Yahoo! is still hiring too 🙂

ML Symposium and ICML details

Everyone should have received notice for NY ML Symposium abstracts. Check carefully, as one was lost by our system.

The event itself is October 21, next week. Leon Bottou, Stephen Boyd, and Yoav Freund are giving the invited talks this year, and there are many spotlights on local work spread throughout the day. Chris Wiggins has setup 6(!) ML-interested startups to follow the symposium, which should be of substantial interest to the employment interested.

I also wanted to give an update on ICML 2012. Unlike last year, our deadline is coordinated with AIStat (which is due this Friday). The paper deadline for ICML has been pushed back to February 24 which should allow significant time for finishing up papers after the winter break. Other details may interest people as well:

  1. We settled on using CMT after checking out the possibilities. I wasn’t looking for this, because I’ve often found CMT clunky in terms of easy access to the right information. Nevertheless, the breadth of features and willingness to support new/better approaches to reviewing was unrivaled. We are also coordinating with Laurent, Rich, and CMT to enable their paper/reviewer recommendation system. The outcome should be a standardized interface in CMT for any recommendation system, which others can then code to if interested.
  2. Area chairs have been picked. The list isn’t sacred, so if we discover significant holes in expertise we’ll deal with it. We expect to start inviting PC members in a little while. Right now, we’re looking into invited talks. If you have any really good suggestions, they could be considered.
  3. CCC is interested in sponsoring travel costs for any climate/environment related ML papers, which seems great to us. In general, this seems like an area of growing interest.
  4. We now have a permanent server and the beginnings of the permanent website setup. Much more work needs to be done here.
  5. We haven’t settled yet on how videos will work. Last year, ICML experimented with Weyond with results here. Previously, ICML had used videolectures, which is significantly more expensive. If you have an opinion about cost/quality tradeoffs or other options, speak up.
  6. Plans for COLT have shifted slightly—COLT will start a day early, overlap with tutorials, then overlap with a coordinated first day of ICML conference papers.

KDD and MUCMD 2011

At KDD I enjoyed Stephen Boyd‘s invited talk about optimization quite a bit. However, the most interesting talk for me was David Haussler‘s. His talk started out with a formidable load of biological complexity. About half-way through you start wondering, “can this be used to help with cancer?” And at the end he connects it directly to use with a call to arms for the audience: cure cancer. The core thesis here is that cancer is a complex set of diseases which can be distentangled via genetic assays, allowing attacking the specific signature of individual cancers. However, the data quantity and complex dependencies within the data require systematic and relatively automatic prediction and analysis algorithms of the kind that we are best familiar with.

Some of the papers which interested me are:

  1. Kai-Wei Chang and Dan Roth, Selective Block Minimization for Faster Convergence of Limited Memory Large-Scale Linear Models, which is about effectively using a hard-example cache to speedup learning.
  2. Leland Wilkinson, Anushka Anand, and Dang Nhon Tuan, CHIRP: A New Classifier Based on Composite Hypercubes on Iterated Random Projections. The bar on creating new classifiers is pretty high. The approach here uses a combination of random projection and partition which appears to be compelling for some nonlinear and relatively high computation settings. They do a more thorough empirical evaluation than most papers.
  3. Zhuang Wang, Nemanja Djuric, Koby Crammer, and Slobodan Vucetic Trading Representability for Scalability: Adaptive Multi-Hyperplane Machine for Nonlinear Classification. The paper explores an interesting idea: having lots of weight vectors (effectively infinity) associated with a particular label, showing that algorithms on this representation can deal with lots of data as per linear predictors, but with superior-to-linear performance. The authors don’t use the hashing trick, but their representation is begging for it.
  4. Michael Bruckner and Tobias Scheffer, Stackelberg Games for Adversarial Prediction Problem. This is about email spam filtering, where the authors use a theory of adversarial equilibria to construct a more robust filter, at least in some cases. Demonstrating this on noninteractive data is inherently difficult.

There were also three papers that were about creating (or perhaps composing) learning systems to do something cool.

  1. Gideon Dror, Yehuda Koren, Yoelle Maarek, and Idan Szpektor, I Want to Answer, Who Has a Question? Yahoo! Answers Recommender System. This is about how to learn to route a question to the appropriate answerer automatically.
  2. Yehuda Koren, Edo Liberty, Yoelle Maarek, and Roman Sandler, Automatically Tagging Email by Leveraging Other Users’ Folders. This is about helping people organize their email with machine learning.
  3. D. Sculley, Matthew Eric Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, Yunkai Zhou, Detecting Adversarial Advertisements in the Wild. The title is an excellent abstract here, and there are quite a few details about the implementation.

I also attended MUCMD, a workshop on the Meaningful Use of Complex Medical Data shortly afterwards. This workshop is about the emergent area of using data to improve medicine. The combination of electronic health records, the economic importance of getting medicine right, and the relatively weak use of existing data implies there is much good work to do.

This finally gave us a chance to discuss radically superior medical trial designs based on work in exploration and learning 🙂

Jeff Hammerbacher‘s talk was a hilarilously blunt and well stated monologue about the need and how to gather data in a usable way.

Amongst the talks on using medical data, Suchi Saria‘s seemed the most mature. They’ve constructed a noninvasive test for problem infants which is radically superior to the existing Apgar score according to leave-one-out cross validation.

From the doctor’s side, there was discussion of the deep balkanization of data systems within hospitals, efforts to overcome that, and the (un)trustworthiness of data. Many issues clearly remain here, but it also looks like serious progress is being made.

Overall, the workshop went well, with the broad cross-section of talks providing quite a bit of extra context you don’t normally see. It left me believing that a community centered on MUCMD is rising now, with attendant workshops, conferences, etc… to be expected.