Machine Learning (Theory)

4/18/2011

A paper not at Snowbird

Tags: Conferences, Machine Learning, Papers jl@ 11:20 am

Unfortunately, a scheduling failure meant I missed all of AIStat and most of the learning workshop, otherwise known as Snowbird, when it’s at Snowbird.

At snowbird, the talk on Sum-Product networks by Hoifung Poon stood out to me (Pedro Domingos is a coauthor.). The basic point was that by appropriately constructing networks based on sums and products, the normalization problem in probabilistic models is eliminated, yielding a highly tractable yet flexible representation+learning algorithm. As an algorithm, this is noticeably cleaner than deep belief networks with a claim to being an order of magnitude faster and working better on an image completion task.

Snowbird doesn’t have real papers—just the abstract above. I look forward to seeing the paper. (added: Rodrigo points out the deep learning workshop draft.)

3/20/2011

KDD Cup 2011

Yehuda points out KDD-Cup 2011 which Markus and Gideon helped setup. This is a prediction and recommendation contest for music. In addition to being a fun chance to show your expertise, there are cash prizes of $5K/$2K/$1K.

1/16/2011

2011 Summer Conference Deadline Season

Tags: Announcements, Conferences jl@ 9:20 pm

Machine learning always welcomes the new year with paper deadlines for summer conferences. This year, we have:

Conference Paper Deadline When/Where Double blind? Author Feedback? Notes
ICML February 1 June 28-July 2, Bellevue, Washington, USA Y Y Weak colocation with ACL
COLT February 11 July 9-July 11, Budapest, Hungary N N colocated with FOCM
KDD February 11/18 August 21-24, San Diego, California, USA N N
UAI March 18 July 14-17, Barcelona, Spain Y N

The larger conferences are on the west coast in the United States, while the smaller ones are in Europe.

12/26/2010

NIPS 2010

I enjoyed attending NIPS this year, with several things interesting me. For the conference itself:

  1. Peter Welinder, Steve Branson, Serge Belongie, and Pietro Perona, The Multidimensional Wisdom of Crowds. This paper is about using mechanical turk to get label information, with results superior to a majority vote approach.
  2. David McAllester, Tamir Hazan, and Joseph Keshet Direct Loss Minimization for Structured Prediction. This is about another technique for directly optimizing the loss in structured prediction, with an application to speech recognition.
  3. Mohammad Saberian and Nuno Vasconcelos Boosting Classifier Cascades. This is about an algorithm for simultaneously optimizing loss and computation in a classifier cascade construction. There were several other papers on cascades which are worth looking at if interested.
  4. Alan Fern and Prasad Tadepalli, A Computational Decision Theory for Interactive Assistants. This paper carves out some forms of natural not-MDP problems and shows their RL-style solution is tractable. It’s good to see people moving beyond MDPs, which at this point are both well understood and limited.
  5. Oliver Williams and Frank McSherry Probabilistic Inference and Differential Privacy. This paper is about a natural and relatively unexplored, and potentially dominating approach for achieving differential privacy and learning.

I also attended two workshops—Coarse-To-Fine and LCCC which were a fine combination. The first was about more efficient (and sometimes more effective) methods for learning which start with coarse information and refine, while the second was about parallelization and distribution of learning algorithms. Together, they were about how to learn fast and effective solutions.

The CtF workshop could have been named “Integrating breadth first search and learning”. I was somewhat (I hope not too) pesky, discussing Searn repeatedly during questions, since it seems quite plausible that a good application of Searn would compete with and plausibly improve on results from several of the talks. Eventually, I hope the conventional wisdom shifts to a belief that search and learning must be integrated for efficiency and robustness reasons. The talks in this workshop were uniformly strong in making that case. I was particularly interested in Drew’s talk on a plausible improvement on Searn.

The level of agreement in approaches at the LCCC workshop was much lower, with people discussing many radically different approaches.

  1. Should data be organized by feature partition or example partition? Fernando points out that features often scale sublinearly in the number of examples, implying that an example partition addresses scale better. However, basic learning theory tells us that if the number of parameters scales sublinearly in the number of examples, then the value of additional samples asymptotes, implying a mismatched solution design. My experience is that a ‘not enough features’ problem can be dealt with by throwing all the missing features you couldn’t properly previously use, for example personalization.
  2. How can we best leverage existing robust distributed filesystem/MapReduce frameworks? There was near unanimity on the belief that MapReduce itself is of limited value for machine learning, but the step forward is unclear. I liked what Markus said: that no one wants to abandon the ideas of robustly storing data and moving small amounts of code to large amounts of data. The best way to leverage this capability to build great algorithms remains unclear to me.
  3. Every speaker was in agreement that their approach was faster, but there was great disagreement about what “fast” meant in an absolute sense. This forced me to think about an absolute measure of (input complexity)/(time) where we see results between 100 features/s and 10*106 features/s being considered “fast” depending on who is speaking. This scale disparity is remarkably extreme. A related detail is that the strength of baseline algorithms varies greatly.

I hope we’ll discover convincing answers to these questions in the near future.

10/29/2010

To Vidoelecture or not

Tags: Conferences, Machine Learning jl@ 1:21 pm

(update: cross-posted on CACM)

For the first time in several years, ICML 2010 did not have videolectures attending. Luckily, the tutorial on exploration and learning which Alina and I put together can be viewed, since we also presented at KDD 2010, which included videolecture support.

ICML didn’t cover the cost of a videolecture, because PASCAL didn’t provide a grant for it this year. On the other hand, KDD covered it out of registration costs. The cost of videolectures isn’t cheap. For a workshop the baseline quote we have is 270 euro per hour, plus a similar cost for the cameraman’s travel and accomodation. This can be reduced substantially by having a volunteer with a camera handle the cameraman duties, uploading the video and slides to be processed for a quoted 216 euro per hour.

Youtube is the most predominant free video site with a cost of $0, but it turns out to be a poor alternative. 15 minute upload limits do not match typical talk lengths. Videolectures also have side-by-side synchronized slides & video which allows quick navigation of the videostream and acceptable resolution of typical talk slides. Overall, these benefits are substantial enough that youtube is not presently a serious alternative.

So, if we can’t avoid paying the cost, is it worthwhile? One way to judge this is by comparing how much authors currently spend traveling to a conference and presenting research vs. the size of the audience. In general, costs vary wildly, but for a typical academic international conference, airfare, hotel, and registration are commonly at least $1000 even after scrimping. The sizes of audiences also varies substantially, but something in the 30-100 range is a typical average. For KDD 2010, the average number of views per presentation is 14.6, but this is misleadingly low, as KDD presentations were just put up. A better number is for KDD 2009, where the average view number is presently 74.2. This number is representative with ICML 2009 presently averaging 115.8. We can argue about the relative merits of online vs. in-person viewing, but the order of their value is at least unclear, since in an online system people specifically seek out lectures to view while at the conference itself people are often opportunistic viewers. Valuing these equally, we see that videolectures increases the size of the audience, and (hence) the value to authors by perhaps a factor of 2 for a cost around 1/3 of current presentation costs.

This conclusion is conservative, because a videolecture is almost surely viewed over more than a year, cost of conference attendance are often higher, and the cost in terms of a presenter’s time is not accounted for. Overall, videolecture coverage seems quite worthwhile. Since authors also typically are the attendees of a conference, increasing the registration fees to cover the cost of videolectures seems reasonable. A videolecture is simply a new publishing format.

We can hope that the price will drop over time, as it’s not clear to me that the 216 euros/hour reflects the real costs of videolectures.net. Some competition of a similar quality would be the surest way to do that. But in the near future, there are two categories of conferences—those that judge the value of their content above 216 euros/hour, and those that do not. Whether or not a conference has videolecture support substantially impacts its desirability as a place to send papers.

9/17/2010

New York Area Machine Learning Events

Tags: Announcements, Conferences, Workshop jl@ 7:37 am

On Sept 21, there is another machine learning meetup where I’ll be speaking. Although the topic is contextual bandits, I think of it as “the future of machine learning”. In particular, it’s all about how to learn in an interactive environment, such as for ad display, trading, news recommendation, etc…

On Sept 24, abstracts for the New York Machine Learning Symposium are due. This is the largest Machine Learning event in the area, so it’s a great way to have a conversation with other people.

On Oct 22, the NY ML Symposium actually happens. This year, we are expanding the spotlights, and trying to have more time for posters. In addition, we have a strong set of invited speakers: David Blei, Sanjoy Dasgupta, Tommi Jaakkola, and Yann LeCun. After the meeting, a late hackNY related event is planned where students and startups can meet.

I’d also like to point out the related CS/Econ symposium as I have interests there as well.

9/13/2010

AIStats

Tags: Announcements, Conferences jl@ 5:35 pm

Geoff Gordon points out AIStats 2011 in Ft. Lauderdale, Florida. The call for papers is now out, due Nov. 1. The plan is to experiment with the review process to encourage quality in several ways. I expect to submit a paper and would encourage others with good research to do likewise.

8/22/2010

KDD 2010

Tags: Conferences, Machine Learning jl@ 6:39 pm

There were several papers that seemed fairly interesting at KDD this year. The ones that caught my attention are:

  1. Xin Jin, Mingyang Zhang, Nan Zhang, and Gautam Das, Versatile Publishing For Privacy Preservation. This paper provides a conservative method for safely determining which data is publishable from any complete source of information (for example, a hospital) such that it does not violate privacy rules in a natural language. It is not differentially private, so no external sources of join information can exist. However, it is a mechanism for publishing data rather than (say) the output of a learning algorithm.
  2. Arik Friedman Assaf Schuster, Data Mining with Differential Privacy. This paper shows how to create effective differentially private decision trees. Progress in differentially private datamining is pretty impressive, as it was defined in 2006.
  3. David Chan, Rong Ge, Ori Gershony, Tim Hesterberg, Diane Lambert, Evaluating Online Ad Campaigns in a Pipeline: Causal Models At Scale This paper is about automated estimation of ad campaign effectiveness. The double robust estimation technique seems intuitively appealing and plausibly greatly enhances effectiveness.
  4. Naoki Abe et al. Optimizing Debt Collections Using Constrained Reinforcement Learning This is an application paper about optimizing the New York State income tax collection agency. As you might expect, there are several cludgy aspects due to working within legal and organizational constraints. They deal with them, and expect to end up making NY state around $108/year. Too bad I live in NY :)
  5. Vikas C Raykar, Balaji Krishnapuram, and Shinpeng Yu Designing Efficient Cascaded Classifiers: Tradeoff between Accuracy and Cost This paper is about a continuization based solution to designing a cost-efficient yet accurate classifier cascade. It’s a step beyond the Viola Jones style boosting with cutouts, but I suspect not yet a final solution.
  6. D. Sculley, Combined Regression and Ranking. There are lots of applications where you want both a correct ordering and an estimated value of each item. This paper shows a simple combined-loss approach to getting both which empirically improves on either metric.

In addition, I enjoyed Konrad Feldman’s invited talk on Quantcast’s data and learning systems which sounded pretty slick.

In general, it seems like KDD is substantially maturing as a conference. The work on empirically effective privacy-preserving algorithms and some of the stats-work is ahead of what I’ve seen at other machine learning conferences. Presumably this is due to KDD being closer to the business side of machine learning and hence more aware of what are real problems there. An annoying aspect of KDD as a publishing venue is that they don’t put the papers on the conference website, due to ACM constraints. A substantial compensation is that all talks are scheduled to appear on videolectures.net and, as you can see, most papers can be found on author webpages.

KDD also experimented with crowdvine again this year so people could announce which talks they were interested in and setup meetings. My impression was that it worked a bit less well than last year, partly because it wasn’t pushed as much by the conference organizers. Small changes in the interface might make a big difference—for example, just providing a ranking of papers by interest might make it pretty compelling.

7/18/2010

ICML & COLT 2010

The papers which interested me most at ICML and COLT 2010 were:

  1. Thomas Walsh, Kaushik Subramanian, Michael Littman and Carlos Diuk Generalizing Apprenticeship Learning across Hypothesis Classes. This paper formalizes and provides algorithms with guarantees for mixed-mode apprenticeship and traditional reinforcement learning algorithms, allowing RL algorithms that perform better than for either setting alone.
  2. István Szita and Csaba Szepesvári Model-based reinforcement learning with nearly tight exploration complexity bounds. This paper and anotherrepresent the frontier of best-known algorithm for Reinforcement Learning in a Markov Decision Process.
  3. James Martens Deep learning via Hessian-free optimization. About a new not-quite-online second order gradient algorithm for learning deep functional structures. Potentially this is very powerful because while people have often talked about end-to-end learning, it has rarely worked in practice.
  4. Chrisoph Sawade, Niels Landwehr, Steffen Bickel. and Tobias Scheffer Active Risk Estimation. When a test set is not known in advance, the model can be used to safely aid test set evaluation using importance weighting techniques. Relative to the paper, placing a lower bound on p(y|x) is probably important in practice.
  5. H. Brendan McMahan and Matthew Streeter Adaptive Bound Optimization for Online Convex Optimization and the almost-same paper John Duchi, Elad Hazan, and Yoram Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. These papers provide tractable online algorithms with regret guarantees over a family of metrics rather than just euclidean metrics. They look pretty useful in practice.
  6. Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, Giovanni Zappella, Active Learning on Trees and Graphs Various subsets of these authors have other papers about actively learning graph-obeying functions which in total provide a good basis for understanding what’s possible and how to learn.

The program chairs for ICML did a wide-ranging survey over participants. The results seem to suggest that participants generally agree with the current ICML process. I expect there is some amount of anchoring effect going on where participants have an apparent preference for the known status quo, although it’s difficult to judge the degree of that. Some survey results which aren’t of that sort are:

  1. 7.7% of reviewers say author feedback changed their mind. It would be interesting to know for which fraction of accepted papers reviewers had their mind changed, but that isn’t there.
  2. 85.4% of authors don’t know if the reviewers read their response, believe they read and ignored it, or believe they didn’t read it. Authors clearly don’t feel like they are communicating with reviewers.
  3. 58.6% support growing the conference with the largest fraction suggesting poster-only papers.
  4. Other conferences attended by the ICML community in order are NIPS, ECML/PKDD, AAAI, IJCAI, AIStats, UAI, KDD, ICDM, COLT, SIGIR, ECAI, EMNLP, CoNLL. This is pretty different from the standard colocation list for ICML. Many possibilities are precluded by scheduling, but AAAI, IJCAI, UAI, KDD, COLT, SIGIR are all serious possibilities some of which haven’t been used much in the past.

My experience with Mark’s new paper discussion site is generally positive—having comments emailed to interested parties really helps the discussion. There are a few comments that authors haven’t responded to, so if you are an author you might want to sign up to receive comments.

In addition, I was the workshop chair for ICML&COLT this year. My overall impression was that things went reasonably well, with the exception of internet connectivity at Dan Panorama which was a minidisaster courtesy of a broken per-machine authentication system. One of the things I’m particularly happy about was the Learning to Rank Challenge workshop. I think it would be great if ICML can continue to attract new challenge workshops in the future. If anyone else has comments about the workshops, I’d love to hear them.

6/20/2010

2010 ICML discussion site

A substantial difficulty with the 2009 and 2008 ICML discussion system was a communication vacuum, where authors were not informed of comments, and commenters were not informed of responses to their comments without explicit monitoring. Mark Reid has setup a new discussion system for 2010 with the goal of addressing this.

Mark didn’t want to make it to intrusive, so you must opt-in. As an author, find your paper and “Subscribe by email” to the comments. As a commenter, you have the option of providing an email for follow-up notification.

4/26/2010

Compassionate Reviewing

Most long conversations between academics seem to converge on the topic of reviewing where almost no one is happy. A basic question is: Should most people be happy?

The case against is straightforward. Anyone who watches the flow of papers realizes that most papers amount to little in the longer term. By it’s nature research is brutal, where the second-best method is worthless, and the second person to discover things typically gets no credit. If you think about this for a moment, it’s very different from most other human endeavors. The second best migrant laborer, construction worker, manager, conductor, quarterback, etc… all can manage quite well. If a reviewer has even a vaguely predictive sense of what’s important in the longer term, then most people submitting papers will be unhappy.

But this argument unravels, in my experience. Perhaps half of reviews are thoughtless or simply wrong with a small part being simply malicious. And yet, I’m sure that most reviewers genuinely believe they can predict what will and will not be useful in the longer term. This disparity is a lack of communication. When academics have conversations about reviewing, the presumption of participants in each conversation is that they all share about the same beliefs about what will be useful, and what will take off. Such conversations rarely go into specifics, because the specifics are boring in particular, technical, and because their is a real chance of disagreement on the specifics themselves.

When double blind reviewing was first being considered for ICML, I remember speaking about the experience in the Crypto community, where in my estimate the reviewing was both fairer and less happy. Many conferences in machine learning have shifted to doubleblind reviewing, and I think we have seen this come to pass here as well. Without double blind reviewing, it is common to have an “in” crowd who everyone respects and whose papers are virtually always accepted. These people are happy, and the rest have little voice. With double blind reviewing, everyone suffers substantial rejections.

We might say “fine, at least it’s fair”, but in my experience there is a real problem. From a viewpoint external to the community, when the reviewing is poor and the viewpoint of people in the community highly contradictory, nothing good happens. Outsiders (i.e. most people) viewing the acrimony choose some other way to solve problems, proposals don’t get funded, and the community itself tends to fracture. For example, in cryptography, TCC (not double blind) has started, presumably because the top theory people got tired of having their papers rejected at Crypto (double blind). From a process-of-research standpoint, this seems suboptimal, as different groups using different methods to solve similar problems are particularly the people who you would prefer talking to each other.

What seems to be lost with double blind reviewing is some amount of compassion, unfairly allocated. In a double blind system, any given paper is plausibly from someone you don’t know, and since most papers go nowhere, plausibly not going anywhere. Consequently, the bias starts “against” for all work, a disadvantage which can be quite difficult to overcome. Some time ago, I discussed how I thought motivation should be the responsibility of the reviewer. Aaron Hertzman strongly disagreed on the grounds that this belief could dead end your career as an author. I’ve come to appreciate his viewpoint to an extent. But, it misses the point slightly—the question of “What is good for the community?” differs from “What is good for the author?” In a healthy community, reviewers will actively understand why a piece of work is or is not important, filling in and extending the motivation as they consider the problem.

So, a question is: How can we get compassionate reviewing? (And in a fair way?) It might help somewhat for reviewers to actively consider, as part of their review, the level and mechanism of impact that a paper may have. Reducing reviewing load is certainly helpful, but it is not sufficient alone, because many people naturally interpret a reduced reviewing load as time to work on other things. And, some mechanisms seem to even harm. For example, the two-phase reviewing process that ICML currently uses might save 0.5 reviews/paper, while guaranteeing that for half of the papers, the deciding review is done hastily with no author feedback, a recipe for mistakes.

What creates a great deal of compassion? Public responsibility helps (witness workshops more interesting than conferences). A natural conversation helps (the current method of single round response tends to be very stilted). And time, of course, helps. What else?

4/24/2010

COLT Treasurer is now Phil Long

Tags: Conferences, Funding jl@ 2:14 pm

For about 5 years, I’ve been the treasurer of the Association for Computational Learning, otherwise known as COLT, taking over from John Case before me. A transfer of duties to Phil Long is now about complete. This probably matters to almost no one, but I wanted to describe things a bit for those interested.

The immediate impetus for this decision was unhappiness over reviewing decisions at COLT 2009, one as an author and several as a member of the program committee. I seem to have disagreements fairly often about what is important work, partly because I’m focused on learning theory with practical implications, partly because I define learning theory more broadly than is typical amongst COLT members, and partly because COLT suffers a bit from insider-clique issues. The degree to which these issues come up varies substantially each year so last year is not predictive of this one. And, it’s important to understand that COLT remains healthy with these issues not nearly so bad as they were. Nevertheless, I would like to see them taken more actively into account than I’ve been able to persuade people so far.

After thinking about it for a few days before acting, I decided to go ahead with the transfer for another reason: I’ve been suffering from multitask poisoning. Partly this is Ada, but partly it’s many other things, each of which takes a small bit of my time, in aggregate leaving me disappointing people, myself in particular. The effect of this has been quite obvious in terms of the posting rate on hunch.net.

Fortunately, Phil Long was ready to take up the duties, and he’s well positioned to do so.

Despite the above, I found being treasurer not particularly difficult. The functions of the treasury part of ACL have been

  1. Self-insurance for the conference each year. Prior to the formation of ACL-the-nonprofit (which Bob was instrumental in), COLT used to buy insurance against the possibility that some disaster would strike canceling the conference while leaving the local organizer on the hook for substantial expenses. When I came in, the treasury was a little bit low for this function, and when I left, somewhat too high.
  2. Budget fragmentation avoidance. Local organizers typically have a local account from which they spend for expenses and collect registration fees. Without the ACL, dealing with net positive or negative local accounts from year to year was awkward. With the ACL, it’s easy to square things up at the end of each year.
  3. A stable point of contact for funding related things. COLT is partly sponsored by several big CS-related companies including IBM, Microsoft, and Google. Providing a stable point of contact definitely helps ease this process. This also helps on the publishing side, where Omnipress is the current publisher of proceedings.
  4. Budget advice for local organizers. Somewhat to my surprise, the proper role of the treasurer was typically asking the local organizer to reduce registration fees rather than increase. The essential observation is that local organizers, because they operate out of a local account, tend to be a bit conservative in budget estimates. On the other hand, because ACL has an adequate interest bearing account, we should expect and desire to spend the interest in each typical year. In effect, ACL is naturally in a position to sponsor COLT to a small but nontrivial degree.

After having been treasurer for a little while, I’m convinced that having a nonprofit to back a conference is a good idea easing many difficulties with relatively small effort.

12/9/2009

Future Publication Models @ NIPS

Yesterday, there was a discussion about future publication models at NIPS. Yann and Zoubin have specific detailed proposals which I’ll add links to when I get them (Yann’s proposal and Zoubin’s proposal).

What struck me about the discussion is that there are many simultaneous concerns as well as many simultaneous proposals, which makes it difficult to keep all the distinctions straight in a verbal conversation. It also seemed like people were serious enough about this that we may see some real movement. Certainly, my personal experience motivates that as I’ve posted many times about the substantial flaws in our review process, including some very poor personal experiences.

Concerns include the following:

  1. (Several) Reviewers are overloaded, boosting the noise in decision making.
  2. (Yann) A new system should run with as little built-in delay and friction to the process of research as possible.
  3. (Hanna Wallach(updated)) Double-blind review is particularly important for people who are unknown or from an unknown institution.
  4. (Several) But, it’s bad to take double blind so seriously as to disallow publishing on arxiv or personal webpages.
  5. (Yann) And double-blind is bad when it prevents publishing for substantial periods of time. Apparently, this comes up in CVPR.
  6. (Zoubin) Any new system should appear to outsiders as if it’s the old system, or a journal, because it’s already hard enough to justify CS tenure cases to other disciplines.
  7. (Fernando) There shouldn’t be a big change with a complex bureaucracy, but rather a smaller changes which are obviously useful or at least worth experimenting with.

There were other concerns as well, but these are the ones that I remember.

Elements of proposals include:

  1. (Yann) Everything should go to Arxiv or an arxiv-like system first, as per physics or mathematics. This addresses (1), because it delinks dissemination from review, relieving some of the burden of reviewing. It also addresses (2) since with good authors they can immediately begin building on each other’s work. It conflicts with (3), because Arxiv does not support double-blind submission. It does not conflict if we build our own system.
  2. (Fernando) Create a conference coincident journal in which people can publish at any time. VLDB has apparently done this. It can be done smoothly by allowing submission in either conference deadline mode or journal mode. This proposal addresses (1) by reducing peak demand on reviewing. It also addresses (6) above.
  3. (Daphne) Perhaps we should have a system which only reviews papers for correctness, which is not nearly as subjective as for novelty or interestingness. This addresses (1), by eliminating some concerns for the reviewer. It is orthogonal to the double blind debate. In biology, such a journal exists (pointer updated), because delays were becoming absurd and intolerable.
  4. (Yann) There should be multiple publishing entities (people or groups of people) that can bless a paper as interesting. This addresses (1).

There are many other proposal elements (too many for my memory), which hopefully we’ll see in particular proposals. If other people have concrete proposals, now is probably the right time to formalize them.

10/26/2009

NIPS workshops

Tags: Announcements, Conferences, Workshop jl@ 8:40 am

Many of the NIPS workshops have a deadline about now, and the NIPS early registration deadline is Nov. 6. Several interest me:

  1. Adaptive Sensing, Active Learning, and Experimental Design due 10/27.
  2. Discrete Optimization in Machine Learning: Submodularity, Sparsity & Polyhedra, due Nov. 6.
  3. Large-Scale Machine Learning: Parallelism and Massive Datasets, due 10/23 (i.e. past)
  4. Analysis and Design of Algorithms for Interactive Machine Learning, due 10/30.

And I’m sure many of the others interest others. Workshops are great as a mechanism for research, so take a look if there is any chance you might be interested.

10/10/2009

ALT 2009

Tags: Conferences, Online, Papers jl@ 2:58 pm

I attended ALT (”Algorithmic Learning Theory”) for the first time this year. My impression is ALT = 0.5 COLT, by attendance and also by some more intangible “what do I get from it?” measure. There are many differences which can’t quite be described this way though. The program for ALT seems to be substantially more diverse than COLT, which is both a weakness and a strength.

One paper that might interest people generally is:

Alexey Chernov and Vladimir Vovk, Prediction with Expert Evaluators’ Advice. The basic observation here is that in the online learning with experts setting you can simultaneously compete with several compatible loss functions simultaneously. Restated, debating between competing with log loss and squared loss is a waste of breath, because it’s almost free to compete with them both simultaneously. This might interest anyone who has run into “which loss function?” debates that come up periodically.

6/24/2009

Interesting papers at UAICMOLT 2009

Tags: Conferences, Machine Learning, Papers jl@ 2:36 am

Here’s a list of papers that I found interesting at ICML/COLT/UAI in 2009.

  1. Elad Hazan and Comandur Seshadhri Efficient learning algorithms for changing environments at ICML. This paper shows how to adapt learning algorithms that compete with fixed predictors to compete with changing policies. The definition of regret they deal with seems particularly useful in many situation.
  2. Hal Daume, Unsupervised Search-based Structured Prediction at ICML. This paper shows a technique for reducing unsupervised learning to supervised learning which (a) make a fast unsupervised learning algorithm and (b) makes semisupervised learning both easy and highly effective.
  3. There were two papers with similar results on active learning in the KWIK framework for linear regression, both reducing the sample complexity to . One was Nicolo Cesa-Bianchi, Claudio Gentile, and Francesco Orabona Robust Bounds for Classification via Selective Sampling at ICML and the other was Thomas Walsh, Istvan Szita, Carlos Diuk, Michael Littman Exploring compact reinforcement-learning representations with linear regression at UAI. The UAI paper covers application to RL as well.
  4. Ping Li, Improving Compressed Counting at UAI. This paper talks about how to keep track of the moments in a datastream with very little space and computation. I’m not sure I have a use for it yet, but it seems like a cool piece of basic technology.
  5. Mark Reid and Bob Williamson Surrogate Regret Bounds for Proper Losses at ICML. This paper points out that via the integral characterization of proper losses, proper scoring rules can be reduced to binary classification. The results unify and generalize the Probing and Quanting reductions we worked on previously. This paper is also related to Nicolas Lambert’s work, which is quite thought provoking in terms of specifying what is learnable.
  6. Daniel Hsu, Sham M. Kakade and Tong Zhang. A Spectral Algorithm for Learning Hidden Markov Models COLT. This paper shows that a subset of HMMs can be learned using an SVD-based algorithm.
  7. Samory Kpotufe, Escaping the curse of dimensionality with a tree-based regressor at COLT. This paper shows how to directly applying regression in high dimensional vector spaces and have it succeed anyways because the data is naturally low-dimensional.
  8. Shai Ben-David, David Pal and Shai Shalev-Shwartz. Agnostic Online Learning at COLT. This paper characterizes the ability to learn when an adversary is choosing features in the online setting as the “Littlestone dimension”.

5/24/2009

2009 ICML discussion site

Mark Reid has setup a discussion site for ICML papers again this year and Monica Dinculescu has linked it in from the ICML site. Last year’s attempt appears to have been an acceptable but not wild success as a little bit of fruitful discussion occurred. I’m hoping this year will be a bit more of a success—please don’t be shy :-)

I’d like to also point out that ICML’s early registration deadline has a few hours left, while UAI’s and COLT’s are in a week.

4/21/2009

Interesting Presentations at Snowbird

Here are a few of presentations interesting me at the snowbird learning workshop (which, amusingly, was in Florida with AIStat).

  1. Thomas Breuel described machine learning problems within OCR and an open source OCR software/research platform with modular learning components as well has a 60Million size dataset derived from Google’s scanned books.
  2. Kristen Grauman and Fei-Fei Li discussed using active learning with different cost labels and large datasets for image ontology. Both of them used Mechanical Turk as a labeling system, which looks to become routine, at least for vision problems.
  3. Russ Tedrake discussed using machine learning for control, with a basic claim that it was the way to go for problems involving a medium Reynold’s number such as in bird flight, where simulation is extremely intense.
  4. Yann LeCun presented a poster on an FPGA for convolutional neural networks yielding a factor of 100 speedup in processing. In addition to the graphics processor approach Rajat has worked on, this seems like an effective approach to deal with the need to compute many dot products.

2/18/2009

Decision by Vetocracy

Few would mistake the process of academic paper review for a fair process, but sometimes the unfairness seems particularly striking. This is most easily seen by comparison:

Paper Banditron Offset Tree Notes
Problem Scope Multiclass problems where only the loss of one choice can be probed. Strictly greater: Cost sensitive multiclass problems where only the loss of one choice can be probed. Often generalizations don’t matter. That’s not the case here, since every plausible application I’ve thought of involves loss functions substantially different from 0/1.
What’s new Analysis and Experiments Algorithm, Analysis, and Experiments As far as I know, the essence of the more general problem was first stated and analyzed with the EXP4 algorithm (page 16) (1998). It’s also the time horizon 1 simplification of the Reinforcement Learning setting for the random trajectory method (page 15) (2002). The Banditron algorithm itself is functionally identical to One-Step RL with Traces (page 122) (2003) in Bianca’s thesis with the epsilon greedy strategy and a multiclass perceptron with update scaled by the importance weight.
Computational Time O(k) per example where k is the number of choices O(log k) per example Lower bounds on the sample complexity of learning in this setting are a factor of k worse than for supervised learning, implying that many more examples may be needed in practice. Consequently, learning algorithm speed is more important than in standard supervised learning.
Analysis Incomparable. An online regret analysis showing that if a small hinge loss predictor exists, a bounded number of mistakes occur. Also, an algorithm independent analysis of the fully realizable case. Incomparable. A learning reduction analysis showing how the regret of any base classifier bounds policy regret. Also contains a lower bound and comparable analysis of all plausible alternative reductions.
Experiments 1 dataset, comparing with no other approaches to solving the problem. 13 datasets, comparing with 2 other approaches to solve the problem.
Outcome Accepted at ICML Rejected at ICML, NIPS, UAI, and NIPS.

The reviewers of the Banditron paper made the right call. The subject is interesting, and analysis of a new learning domain is of substantial interest. Real advances in machine learning often come as new domains of application. The talk was well attended and generated substantial interest. It’s also important to remember the reviewers of the two papers probably did not overlap, so there was no explicit preference for A over B.

Why was the Offset Tree rejected? One of these rejections is easily explained as a fluke—we ran into a reviewer at UAI who believes that learning by memorization is the way to go. I, and virtually all machine learning people, disagree but some reviewers at UAI aren’t interested or expert in machine learning.

The striking thing about the other 3 rejects is that they all contain a reviewer who doesn’t read the paper. Instead, the reviewer asserts that learning reductions are bogus because for an alternative notion of learning reduction, made up by the reviewer, an obviously useless approach yields a factor of 2 regret bound. I believe this is the same reviewer each time, because the alternative theorem statement drifted over the reviews fixing bugs we pointed out in the author response.

The first time we encountered this review, we assumed the reviewer was just cranky that day—maybe we weren’t quite clear enough in explaining everything as it’s always difficult to get every detail clear in new subject matter. I have sometimes had a very strong negative impression of a paper which later turned out to be unjustified upon further consideration. Sometimes when a reviewer is cranky, they change their mind after the authors respond, or perhaps later, or perhaps never but you get a new set of reviewers the next time.

The second time the review came up, we knew there was a problem. If we are generous to the reviewer, and taking into account the fact that learning reduction analysis is a relatively new form of analysis, the fear that because an alternative notion of reduction is vacuous our notion of reduction might also be vacuous isn’t too outlandish. Fortunately, there is a way to completely address that—we added an algorithm independent lower bound to the draft (which was the only significant change in content over the submissions). This lower bound conclusively proves that our notion of learning reduction is not vacuous as is the reviewer’s notion of learning reduction.

The review came up a third time. Despite pointing out the lower bound quite explicitly, the reviewer simply ignored it. This more-or-less confirms our worst fears. Some reviewer is bidding for the paper with the intent to torpedo review it. They are uninterested in and unwiling to read the content itself.

Shouldn’t author feedback address this? Not if the reviewer ignores it.

Shouldn’t Double Blind reviewing help? Not if the paper only has one plausible source. The general problem area and method of analysis were freely discussed on hunch.net. We withheld public discussion of the algorithm itself for much of the time (except for a talk at CMU) out of respect for the review process.

Why doesn’t the area chair/program chair catch it? It took us 3 interactions to get it, so it seems unrealistic to expect someone else to get it in one interaction. In general, these people are strongly overloaded and the reviewer wasn’t kind enough to boil down the essence of the stated objection as I’ve done above. Instead, they phrase it as an example and do not clearly state the theorem they have in mind or distinguish the fact that the quantification of that theorem differs from the quantification of our theorems. More generally, my observation is that area chairs rarely override negative reviews because:

  1. It risks their reputation since defending a criticized work requires the kind of confidence that can only be inspired by a thorough personal review they don’t have time for.
  2. They may offend the reviewer they invited to review and personally know.
  3. They figure that the average review is similar to the average perception/popularity by the community anyways.
  4. Even if they don’t agree with the reviewer, it’s hard to fully discount the review in their consideration.

I’ve seen these effects create substantial mental gymnastics elsewhere.

Maybe you just ran into a cranky reviewer 3 times randomly Maybe so. However, the odds seem low enough and the 1/2 year cost of getting another sample high enough, that going with the working hypothesis seems indicated.

Maybe the writing needs improving. Often that’s a reasonable answer for a rejection, but in this case I believe not. We’ve run the paper by several people, who did not have substantial difficulties understanding it. They even understand the draft well enough to make a suggestion or two. More generally, no paper is harder to read than the one you picked because you want to reject it.

What happens next? With respect to the Offset Tree, I’m hopeful that we eventually find reviewers who appreciate an exponentially faster algorithm, good empirical results, or the very tight and elegant analysis, or even all three. For the record, I consider the Offset Tree a great paper. It remains a substantial advance on the state of the art, even 2 years later, and as far as I know the Offset Tree (or the Realizable Offset Tree) consistently beat all reasonable contenders both in prediction and computational performance. This is rare and precious, as many papers tradeoff one for the other. It yields a practical algorithm applicable to real problems. It substantially addresses the RL to classification reduction problem. It also has the first nonconstant algorithm independent lower bound for learning reductions.

With respect to the reviewer, I expect remarkably little. The system is designed to protect reviewers, so they have virtually no responsibility for their decisions. This reviewer has a demonstrated capability to sabotage the review process at ICML and NIPS and a demonstrated willingness to continue doing so indefinitely. The process of bidding for papers and making up reasons to reject them seems tedious, but there is no fundamental reason why they can’t continue doing so for several decades if they remain active in academia.

This experience has substantially altered my understanding and appreciation of the review process at conferences. The bidding mechanism commonly used, coupled with responsibility-free reviewing is an invitation to abuse. A clever abusive reviewer can sabotage perhaps 5 papers per conference (out of 8 reviewed), while maintaining a typical average score. While I don’t believe most people choose papers with intent to sabotage, the capability is there and used by at least one person and possibly others. If, for example, 5% of reviewers are willing to abuse the process this way and there are 100 reviewers, every paper must survive 5 vetoes. If there are 200 reviewers, every paper must survive 10 vetoes. And if there are 400 reviewers, every paper must survive 20 vetoes. This makes publishing any paper that offends someone difficult. The surviving papers are typically inoffensive or part of a fad strong enough that vetoes are held back. Neither category is representative of high quality decision making. These observations suggest that the conference with the most reviewers tend strongly toward faddy and inoffensive papers, both of which often lack impact in the long term. Perhaps this partly explains why NIPS is so weak when people start citation counting. Conversely, this would suggest that smaller conferences and workshops have a natural advantage. Similarly, the reviewing style in theory conferences seems better—the set of bidders for any paper is substantially smaller, implying papers must survive fewer vetos.

This decision making process can be modeled as a group of n decision makers, each of which has the opportunity to veto any action. When n is relatively small, this decision making process might work ok, depending on the decision makers, but as n grows larger, it’s difficult to imagine a worse decision making process. The closest representatives outside of academia I know are deeply bureacratic governments and other large organizations where many people must sign off on something before it takes place. These vetocracies are universally frustrating to interact with. A reasonable conjecture is that any decision making process with a large veto number has poor characteristics.

A basic question is: Is a vetocracy inevitable for large organizations? I believe the answer is no. The basic observation is that the value of n can be logarithmic in the number of participants in an organization rather than linear, as per reviewing under a bidding process. An essential force driving vetocracy creation is a desire to offload responsibility for decisions, so there is no clear decision maker. A large organization not deciding by vetocracy must have a very different structure, with clearly dilineated responsibility.

NIPS provides an almost perfect natural experiment in it’s workshop organization, which involves the very same community of people and subject matter, yet works in a very different manner. There are one or two workshop chairs who are responsible for selecting amongst workshop proposals, after which the content of the workshop is entirely up to the workshop organizers. If a workshop is rejected, it’s clear who is at fault, and if a workshop presentation is rejected, it is often clear by who. Some workshop chairs use a small set of reviewers, but even then the effective veto number remains small. Similarly, if a workshop ends up a flop, it’s relatively easy to see who to blame—either the workshop chair for not predicting it, or the organizers for failing to organize. I can’t think of a single time when I attended both the workshops and the conference that the workshops were less interesting than the conference. My understanding is that this observation is common. Given this discussion, it will be particularly interesting to see how the review process Michael and Leon setup for ICML this year pans out, as it is a system with notably more responsibility assignment than in previous years.

Journals end up looking relatively good with respect to vetocracy avoidance. The ones I’m familiar with have a chief editor who bears responsibility for routing papers to an action editor, who bears responsibility for choosing good reviewers. Every agent except the reviewers is often known by the authors, and the reviewers don’t act as additional vetoers in nearly as strong a manner as reviewers with the opportunity to bid.

This experience has also altered my view of blogging and research. On one hand, I’m very enthusiastic about research in general, and my research in particular, where we are regularly cracking conventionally impossible problems. On the other hand, it seems that some small number of people viewing a discussion silently decide they don’t like it, and veto it given the opportunity. It only takes one to turn strong paper into a years-long odyssey, so public discussion of research directions and topics in a vetocracy is akin to voluntarily wearing a “kick me” sign. While this a problem for me, I expect it to be even worse for the members of a vetocracy in the long term.

It’s hard to imagine any research community surviving without a serious online presence. When a prospective new researcher looks around at existing research, if they don’t find serious online discussion, they’ll assume it doesn’t exist under the “not on the internet so it doesn’t exist” principle. This will starve a field of new people. More generally, there is an opportunity to get feedback about research directions and problems much more rapidly than is otherwise possible, allowing us to avoid research on dead end topics which are pervasive. At some point, it may even seem that people not willing to discuss their research simply avoid doing so because it is critically lacking in one way or another. Since a vetocracy creates a substantial disincentive to discuss research directions online, we can expect that communities sticking with decision by vetocracy to be at a substantial disadvantage.

1/8/2009

Predictive Analytics World

Tags: Conferences, Machine Learning jl@ 7:21 pm

Carla Vicens and Eric Siegel contacted me about Predictive Analytics World in San Francisco February 18&19, which I wasn’t familiar with. A quick look at the agenda reveals several people I know working on applications of machine learning in businesses, covering deployed applications topics. It’s interesting to see a business-focused machine learning conference, as it says that we are succeeding as a field. If you are interested in deployed applications, you might attend.

Eric and I did a quick interview by email.

John >
I’ve mostly published and participated in academic machine learning conferences like ICML, COLT, and NIPS. When I look at the set of speakers and subjects for your conference I think “machine learning for business”. Is that your understanding of things? What I’m trying to ask is: what do you view as the primary goal for this conference?

Eric >
You got it. This is the business event focused on the commercial deployment of technology developed at the research conferences you named. Academics’ term, “machine learning,” is essentially synonymous with the business world’s “predictive modeling”. Predictive Analytics World focuses on business applications of this technology, such as response modeling, churn modeling, email targeting, product recommendations, insurance pricing, and credit scoring. PAW’s goal is to strengthen the business impact delivered by predictive analytics deployment, and establish new opportunities with predictive analytics. The conference delivers case studies, expertise and resources to this end.

The conference program is designed to speak the language of marketing and business professionals using or planning to use predictive analytics to solve business challenges — but for the hands-on practitioner or analytical expert focused on commercial deployment who wishes to speak this same language, it’s an equally valuable event.

John >
People at academic conferences would hope that technology developed there can transfer into business use. In your experience, does this happen? And how fast or difficult is it?

Eric >
The best way to catalyze commercial deployment is to show the people it really works outside “the lab” – which is why PAW’s program is packed primarily with named case studies of commercial deployment. These success stories answer your question with a resounding “yes” that the core technology developed academically is indeed put to use.

But predictive analytics has not yet been broadly adopted across all industries, although success stories show at least initial reach in each vertical. So, sure, as one who previously wore a researcher’s hat, commercial deployment can feel slow; having solved the hardest theoretical, mathematical or statistical problems, the rest should go smoothly, right? Not exactly. The main challenges come in ramping up the business “consumer” on the technology so they see its value, positioning the technology in a way that provides business value, and, on the integration side, in preparing corporate data for predictive modeling (that’s a doozy!) and in integrating predict scores into existing systems and processes. These things take time.

John >
Sometimes people working in the academic world don’t have a good understanding of what the real problems are. Do you have a sense of which areas of research are underemphasized in the academic world?

Eric >
To reach commercial success in deploying predictive analytics for the business applications I listed above, the main challenges are on the process and non-analytical integration side, rather than core machine learning technology; its good enough. So, I don’t consider there to be glaring ommissions in the capabilities of core machine learning (I taught the machine learning graduate course at Columbia University and still consider Tom Mitchell’s textbook to be my bible).

On the other hand, there are always places where “real-world” data is going to bring researchers’ attention to as-yet-unsolved problems. A perfect example is the Netflix Prize, the current leader of which (and winner of the recent Progress Prize) will be speaking at PAW-09 – see here.

« Newer PostsOlder Posts »

Powered by WordPress