Machine Learning (Theory)

12/27/2008

Adversarial Academia

One viewpoint on academia is that it is inherently adversarial: there are finite research dollars, positions, and students to work with, implying a zero-sum game between different participants. This is not a viewpoint that I want to promote, as I consider it flawed. However, I know several people believe strongly in this viewpoint, and I have found it to have substantial explanatory power.

For example:

  1. It explains why your paper was rejected based on poor logic. The reviewer wasn’t concerned with research quality, but rather with rejecting a competitor.
  2. It explains why professors rarely work together. The goal of a non-tenured professor (at least) is to get tenure, and a case for tenure comes from a portfolio of work that is undisputably yours.
  3. It explains why new research programs are not quickly adopted. Adopting a competitor’s program is impossible, if your career is based on the competitor being wrong.

Different academic groups subscribe to the adversarial viewpoint in different degrees. In my experience, NIPS is the worst. It is bad enough that the probability of a paper being accepted at NIPS is monotonically decreasing in it’s quality. This is more than just my personal experience over a number of years, as it’s corroborated by others who have told me the same. ICML (run by IMLS) used to have less of a problem, but since it has become more like NIPS over time, it has inherited this problem. COLT has not suffered from this problem as much in my experience, although it had other problems related to the focus being defined too narrowly. I do not have enough experience with UAI or KDD to comment there.

There are substantial flaws in the adversarial viewpoint.

  1. The adversarial viewpoint makes you stupid. When viewed adversarially, any idea has crippling disadvantages and no advantages. Contorting your viewpoint enough to make this true damages your ability to conduct research. In short, it promotes poor mental hygiene.
  2. Many activities become impossible. Doing research is in general extremely hard, so there are many instances where working with other people can allow you to do things which are otherwise impossible.
  3. The previous two disadvantages apply even more strongly for a community—good ideas are more likely to be missed, change comes slowly, and often with steps backward.
  4. At it’s most basic level, the assumption that research is zero-sum is flawed, because the process of research is not done in a closed system. If the rest of society at large discovers that research is valuable, then the budget increases.

Despite these disadvantages, there is a substantial advantage as well: you can materially protect and aid your career by rejecting papers, preventing grants, and generally discriminating against key people doing interesting but competitive work.

The adversarial viewpoint has a validity in proportion to the number of people subscribing to it. For those of us who would like to deemphasize the adversarial viewpoint, what’s unclear is: how?

One concrete thing is: use Arxiv. For a long time, physicists have adopted an Arxiv-first philosophy, which I’ve come to respect. Arxiv functions as a universal timestamp which decreases the power of an adversarial reviewer. Essentially, you avoid giving away the power to muddy the track of invention. I’m expecting to use Arxiv for essentially all my past-but-unpublished and future papers.

It is plausible that limiting the scope of bidding, as Andrew McCallum suggested at the last ICML, and as is effectively implemented at this ICML, will help. The system of review at journals might also help for the same reason. In my experience as an author, if an anonymous reviewer wants to kill a paper they usually succeed. Most area chairs or program chairs are more interested in avoiding conflict with the reviewer (who they picked and may consider a friend) than reading the paper to determine the illogic of the review (which is a difficult task that simply cannot be done for all papers). NIPS experimented with a reputation system for reviewers last year, but I’m unclear on how well it worked, as an author’s score for a review and a reviewer’s score for the paper may be deeply correlated, revealing little additional information.

Public discussion of research can help with this, because very poor logic simply doesn’t stand up under public scrutiny. While I hope to nudge people in this direction, it’s clear that most people aren’t yet comfortable with public discussion.

12/23/2008

Use of Learning Theory

Filed under: Machine Learning, TheoryJohn Langford @ 11:55 am

I’ve had serious conversations with several people who believe that the theory in machine learning is “only useful for getting papers published”. That’s a compelling statement, as I’ve seen many papers where the algorithm clearly came first, and the theoretical justification for it came second, purely as a perceived means to improve the chance of publication.

Naturally, I disagree and believe that learning theory has much more substantial applications.

Even in core learning algorithm design, I’ve found learning theory to be useful, although it’s application is more subtle than many realize. The most straightforward applications can fail, because (as expectation suggests) worst case bounds tend to be loose in practice (*). In my experience, considering learning theory when designing an algorithm has two important effects in practice:

  1. It can help make your algorithm behave right at a crude level of analysis, leaving finer details to tuning or common sense. The best example I have of this is the Isomap, where the algorithm was informed by the analysis yielding substantial improvements in sample complexity over earlier algorithmic ideas.
  2. An algorithm with learning theory considered in it’s design can be more automatic. I’ve gained more respect for Rifkin’s claim: that the one-against-all reduction, when tuned well, can often perform as well as other approaches. The “when tuned well” caveat is however substantial, because learning algorithms may be applied by nonexperts or by other algorithms which are computationally constrained. A reasonable and worthwhile hope for other methods of addressing multiclass problems is that they are more automatic and computationally faster. The subtle issue here is: How do you measure “more automatic”?

In my experience, learning theory is most useful in it’s crudest forms. A good example comes in the architecting problem: how do you go about solving a learning problem? I mean this in the broadest sense imaginable:

  1. Is it a learning problem or not? Many problems are most easily solved via other means such as engineering, because that’s easier, because there is a severe data gathering problem, or because there is so much data that memorization works fine. Learning theory such as statistical bounds and online learning with experts helps substantially here because it provides guidelines about what is possible to learn and what not.
  2. What type of learning problem is it? Is it a problem where exploration is required or not? Is it a structured learning problem? A multitask learning problem? A cost sensitive learning problem? Are you interested in the median or the mean? Is active learning useable or not? Online or not? Answering these questions correctly can easily make a difference between a succesful application and not. Answering these questions is partly definition checking, and since the answer is often “all of the above”, figuring out which aspect of the problem to address first or next is helpful.
  3. What is the right learning algorithm to use? Here the relative capacity of a learning algorithm and it’s computational efficiency are most important. If you have few features and many examples, a nonlinear algorithm with more representational capacity is a good idea. If you have many features and little data, linear representations or even exponentiated gradient style algorithms are important. If you have very large amounts of data, the most scalable algorithms (so far) use a linear representation. If you have little data and few features, a Bayesian approach may be your only option. Learning theory can help in all of the above by quantifying “many”, “little”, “most”, and “few”. How do you deal with the overfitting problem? One thing I realized recently is that the overfitting problem can be a concern even with very large natural datasets, because some examples are naturally more important than others.

As might be clear, I think of learning theory as somewhat broader than might be traditional. Some of this is simply education. Many people have only been exposed to one piece of learning theory, often VC theory or it’s cousins. After seeing this, they come to think of it as learning theory. VC theory is a good theory, but it is not complete, and other elements of learning theory seem at least as important and useful. Another aspect is publishability. Simply sampling from the learning theory in existing papers does not necessarily give a good distribution of subjects for teaching, because the goal of impressing reviewers does not necessarily coincide with the clean simple analysis that is teachable.

(*) There is significant investigation into improving the tightness of bounds to the point of usefulness, and maybe it will pay off.

12/12/2008

Summer Conferences

Filed under: Conferences, Machine LearningJohn Langford @ 6:35 pm

Here’s a handy table for the summer conferences.

Conference Deadline Reviewer Targeting Double Blind Author Feedback Location Date
ICML (wrong ICML) January 26 Yes Yes Yes Montreal, Canada June 14-17
COLT February 13 No No Yes Montreal June 19-21
UAI March 13 No Yes No Montreal June 19-21
KDD February 2/6 No No No Paris, France June 28-July 1

Reviewer targeting is new this year. The idea is that many poor decisions happen because the papers go to reviewers who are unqualified, and the hope is that allowing authors to point out who is qualified results in better decisions. In my experience, this is a reasonable idea to test.

Both UAI and COLT are experimenting this year as well with double blind and author feedback, respectively. Of the two, I believe author feedback is more important, as I’ve seen it make a difference. However, I still consider double blind reviewing a net win, as it’s a substantial public commitment to fairness.

12/7/2008

A NIPS paper

Filed under: Bayesian, Empirical, Machine Learning, PapersJohn Langford @ 7:46 pm

I’m skipping NIPS this year in favor of Ada, but I wanted to point out this paper by Andriy Mnih and Geoff Hinton. The basic claim of the paper is that by carefully but automatically constructing a binary tree over words, it’s possible to predict words well with huge computational resource savings over unstructured approaches.

I’m interested in this beyond the application to word prediction because it is relevant to the general normalization problem: If you want to predict the probability of one of a large number of events, often you must compute a predicted score for all the events and then normalize, a computationally inefficient operation. The problem comes up in many places using probabilistic models, but I’ve run into it with high-dimensional regression.

There are a couple workarounds for this computational bug:

  1. Approximate. There are many ways. Often the approximations are uncontrolled (i.e. can be arbitrarily bad), and hence finicky in application.
  2. Avoid. You don’t really want a probability, you want the most probable choice which can be found more directly. Energy based model update rules are an example of that approach and there are many other direct methods from supervised learning. This is great when it applies, but sometimes a probability is actually needed.

This paper points out that a third approach can be viable empirically: use a self-normalizing structure. It seems highly likely that this is true in other applications as well.

11/28/2008

A Bumper Crop of Machine Learning Graduates

Filed under: Machine LearningJohn Langford @ 7:26 pm

My impression is that this is a particularly strong year for machine learning graduates. Here’s my short list of the strong graduates I know. Analpha (for perversity’s sake) by last name:

  1. Jenn Wortmann. When Jenn visited us for the summer, she had one, two, three, four papers. That is typical—she’s smart, capable, and follows up many directions of research. I believe approximately all of her many papers are on different subjects.
  2. Ruslan Salakhutdinov. A Science paper on bijective dimensionality reduction, mastered and improved on deep belief nets which seems like an important flavor of nonlinear learning, and in my experience he’s very fast, capable and creative at problem solving.
  3. Marc’Aurelio Ranzato. I haven’t spoken with Marc very much, but he had a great visit at Yahoo! this summer, and has an impressive portfolio of applications and improvements on convolutional neural networks and other deep learning algorithms.
  4. Lihong Li. Lihong developed the KWIK (”Knows what it Knows”) learning framework, for analyzing and creating uncertainty-aware learning algorithms. New mathematical models of learning are rare, and the topic is of substantial interest, so this is pretty cool. He’s also worked on a wide variety of other subjects and in my experience is broadly capable.
  5. Steve Hanneke: When the chapter on active learning is written in a machine learning textbook, I expect the disagreement coefficient to be in it. Steve’s work is strongly distinguished from his adviser’s, so he is guaranteed capable of independent research.

There are a couple others such as Daniel and Jake for whom I’m unsure of their graduation plans, although they have already done good work. In addition, I’m sure there are several others that I don’t know—feel free to mention others I don’t know in comments.

It’s traditional to imagine that one is best overall for hiring purposes, but I have substantial difficulty with that—the field of ML is simply to broad. Instead, if you are interested in hiring, each should be considered in your context.

11/26/2008

Efficient Reinforcement Learning in MDPs

Filed under: Reinforcement, TheoryJohn Langford @ 7:29 am

Claude Sammut is attempting to put together an Encyclopedia of Machine Learning. I volunteered to write one article on Efficient RL in MDPs, which I would like to invite comment on. Is something critical missing?

11/16/2008

Observations on Linearity for Reductions to Regression

Filed under: Machine Learning, ReductionsJohn Langford @ 6:54 pm

Dean Foster and Daniel Hsu had a couple observations about reductions to regression that I wanted to share. This will make the most sense for people familiar with error correcting output codes (see the tutorial, page 11).

Many people are comfortable using linear regression in a one-against-all style, where you try to predict the probability of choice i vs other classes, yet they are not comfortable with more complex error correcting codes because they fear that they create harder problems. This fear turns out to be mathematically incoherent under a linear representation: comfort in the linear case should imply comfort with more complex codes.

In particular, If there exists a set of weight vectors wi such that P(i|x)= <wi,x>, then for any invertible error correcting output code C, there exists weight vectors wc which decode to perfectly predict the probability of each class. The proof is simple and constructive: the weight vector wc can be constructed according to the linear superposition of wi implied by the code, and invertibility implies that a correct encoding implies a correct decoding.

This observation extends to all-pairs like codes which compare subsets of choices to subsets of choices using “don’t cares”.

Using this observation, Daniel created a very short proof of the PECOC regret transform theorem (here, and Daniel’s updated version).

One further observation is that under ridge regression (a special case of linear regression), for any code, there exists a setting of parameters such that you might as well use one-against-all instead, because you get the same answer numerically. The implication is that the advantages of codes more complex than one-against-all is confined to other prediction methods.

11/11/2008

COLT CFP

Filed under: Announcements, ConferencesJohn Langford @ 5:13 pm

Adam Klivans, points out the COLT call for papers. The important points are:

  1. Due Feb 13.
  2. Montreal, June 18-21.
  3. This year, there is author feedback.

11/10/2008

ICML Reviewing Criteria

Filed under: ConferencesJohn Langford @ 7:13 pm

Michael Littman and Leon Bottou have decided to use a franchise program chair approach to reviewing at ICML this year. I’ll be one of the area chairs, so I wanted to mention a few things if you are thinking about naming me.

  1. I take reviewing seriously. That means papers to be reviewed are read, the implications are considered, and decisions are only made after that. I do my best to be fair, and there are zero subjects that I consider categorical rejects. I don’t consider several arguments for rejection-not-on-the-merits reasonable.
  2. I am generally interested in papers that (a) analyze new models of machine learning, (b) provide new algorithms, and (c) show that they work empirically on plausibly real problems. If a paper has the trifecta, I’m particularly interested. With 2 out of 3, I might be interested. I often find papers with only one element harder to accept, including papers with just (a).
  3. I’m a bit tough. I rarely jump-up-and-down about a paper, because I believe that great progress is rarely made. I’m not very interested in new algorithms with the same theorems as older algorithms. I’m also cautious about new analysis for older algorithms, since I like to see analysis driving algorithm rather than vice-versa. I prioritize a proof-of-possibility over a quantitative improvement. I consider quantitative improvements of small constant factors in sample complexity significant. For computationaly complexity, I generally want to see at least an order of magnitude improvement. I generally disregard any experiments on toy data, because I’ve found that toy data and real data can too-easily differ in their behavior.
  4. My personal interests are pretty well covered by existing papers, but this is perhaps not too important a criteria, compared to the above, as I easily believe other subjects are interesting.

11/9/2008

A Healthy COLT

Filed under: Conferences, Machine LearningJohn Langford @ 10:49 am

A while ago, we discussed the health of COLT. COLT 2008 substantially addressed my concerns. The papers were diverse and several were interesting. Attendance was up, which is particularly notable in Europe. In my opinion, the colocation with UAI and ICML was the best colocation since 1998.

And, perhaps best of all, registration ended up being free for all students due to various grants from the Academy of Finland, Google, IBM, and Yahoo.

A basic question is: what went right? There seem to be several answers.

  1. Cost-wise, COLT had sufficient grants to alleviate the high cost of the Euro and location at a university substantially reduces the cost compared to a hotel.
  2. Organization-wise, the Finns were great with hordes of volunteers helping set everything up. Having too many volunteers is a good failure mode.
  3. Organization-wise, it was clear that all 3 program chairs were cooperating in designing the program.
  4. Facilities-wise, proximity in time and space made the colocation much more real than many others have been in the past.
  5. Program-wise, COLT notably had two younger program chairs, Tong and Rocco, which seemed to work well.

11/4/2008

Rise of the Machines

Filed under: AILawrence Saul @ 12:42 pm

On the enduring topic of how people deal with intelligent machines, we have this important election bulletin.

10/20/2008

New York’s ML Day

Filed under: Machine Learning, WorkshopJohn Langford @ 1:54 pm

I’m not as naturally exuberant as Muthu 2 or David about CS/Econ day, but I believe it and ML day were certainly successful.

At the CS/Econ day, I particularly enjoyed Toumas Sandholm’s talk which showed a commanding depth of understanding and application in automated auctions.

For the machine learning day, I enjoyed several talks and posters (I better, I helped pick them.). What stood out to me was number of people attending: 158 registered, a level qualifying as “scramble to find seats”. My rule of thumb for workshops/conferences is that the number of attendees is often something like the number of submissions. That isn’t the case here, where there were just 4 invited speakers and 30-or-so posters. Presumably, the difference is due to a critical mass of Machine Learning interested people in the area and the ease of their attendance.

Are there other areas where a local Machine Learning day would fly? It’s easy to imagine something working out in the San Francisco bay area and possibly Germany or England.

The basic formula for the ML day is a committee picks a few people to give talks, and posters are invited, with some of them providing short presentations. The CS/Econ day was similar, except they managed to let every submitter do a presentation. Are there tweaks to the format which would improve things?

10/19/2008

NIPS 2008 workshop on Kernel Learning

Filed under: Announcements, Conferences, Machine Learning @ 10:52 am

We’d like to invite hunch.net readers to participate in the NIPS 2008 workshop on kernel learning. While the main focus is on automatically learning kernels from data, we are also also looking at the broader questions of feature selection, multi-task learning and multi-view learning. There are no restrictions on the learning problem being addressed (regression, classification, etc), and both theoretical and applied work will be considered. The deadline for submissions is October 24.

More detail can be found here.

Corinna Cortes, Arthur Gretton, Gert Lanckriet, Mehryar Mohri, Afshin Rostamizadeh

10/14/2008

Who is Responsible for a Bad Review?

Although I’m greatly interested in machine learning, I think it must be admitted that there is a large amount of low quality logic being used in reviews. The problem is bad enough that sometimes I wonder if the Byzantine generals limit has been exceeded. For example, I’ve seen recent reviews where the given reasons for rejecting are:

  1. [NIPS] Theorem A is uninteresting because Theorem B is uninteresting.
  2. [UAI] When you learn by memorization, the problem addressed is trivial.
  3. [NIPS] The proof is in the appendix.
  4. [NIPS] This has been done before. (… but not giving any relevant citations)

Just for the record I want to point out what’s wrong with these reviews. A future world in which such reasons never come up again would be great, but I’m sure these errors will be committed many times more in the future.

  1. This is nonsense. A theorem should be evaluated based on it’s merits, rather than the merits of another theorem.
  2. Learning by memorization requires an exponentially larger sample complexity than many other common approaches that often work well. Consequently, what is possible under memorization does not have any substantial bearing on common practice or what might be useful in the future.
  3. Huh? Other people, thank you for putting the proof in the appendix, so the paper reads better. It seems absurd to base a decision on the placement of the content rather than the content.
  4. This is a red flag for a bogus review. Every time I’ve seen a review (as an author or a fellow reviewer) where such claims are made without a concrete citation, they are false. Often they are false even when concrete citations are given.

A softer version of (4) is when someone is cranky because their own paper wasn’t cited. This is understandable, but a more appropriate response seems to be pointing things out, and reviewing anyways. This avoids creating the extra work (for authors and reviewers) of yet another paper resubmission, and reasonable authors do take such suggestions into account.

NIPS figures fairly prominently here. While these are all instances in the last year, my experience after interacting with NIPS for almost a decade is that the average quality of reviews is particularly low there—in many instances reviewers clearly don’t read the papers before writing the review. Furthermore, such low quality reviews are often the deciding factor for the paper decision. Blaming the reviewer seems to be the easy solution for a bad review, but a bit more thought suggests other possibilities:

  1. Area Chair In some conferences an “area chair” or “senior PC” makes or effectively makes the decision on a paper. In general, I’m not a fan of activist area chairs, but when a reviewer isn’t thinking well, I think it is appropriate to step in. This rarely happens, because the easy choice is to simply accept the negative review. In my experience, many Area Chairs are eager to avoid any substantial controversy, and there is a general tendency to believe that something must be wrong with a paper that has a negative review, even if it isn’t what was actually pointed out.
  2. Program Chair In smaller conferences, Program Chairs play the same role as the area chair, so all of the above applies, except now you know the persons name explicitly making them easier to blame. This is a little bit too tempting, I think. For example, I know David McAllester understands that learning by memorization is a bogus reference point, and probably he was just too busy to really digest the reviews. However, a Program Chair is responsible for finding appropriate reviewers for papers, and doing so (or not) has a huge impact on whether a paper is accepted. Not surprisingly, if a paper about the sample complexity of learning is routed to people who have never seen a proof involving sample complexity before, the reviews tend to be spuriously negative (and the paper unread).
  3. Author A reviewer might blame an author, if it turns out later that the reasons given in the review for rejection were bogus. This isn’t absurd—writing a paper well is hard and it’s easy for small mistakes to be drastically misleading in technical content.
  4. Culture A conference has a culture associated with it that is driven by the people who keep coming back. If in this culture it is considered ok to do all the reviews on the last day, it’s unsurprising to see reviews lacking critical thought that could be written without reading the paper. Similarly, it’s unsurprising to see little critical thought at the area chair level, or in the routing of papers to reviewers. This answer is pretty convincing: it explains why low quality reviews keep happening year after year at a conference.

If you believe the Culture reason, then what’s needed is a change in the culture. The good news is that this is both possible and effective. There are other conferences where reviewers expect to spend several hours reviewing a paper. In my experience this year, it was true of COLT and for my corner of SODA. Effecting the change is simply a matter of community standards, and that is just a matter of leaders in the community leading.

10/1/2008

NIPS 2008 workshop on ‘Learning over Empirical Hypothesis Spaces’

This workshop asks for insights how far we may/can push the theoretical boundary of using data in the design of learning machines. Can we express our classification rule in terms of the sample, or do we have to stick to a core assumption of classical statistical learning theory, namely that the hypothesis space is to be defined independent from the sample? This workshop is particularly interested in - but not restricted to - the ‘luckiness framework’ and the recently introduced notion of ‘compatibility functions’ in a semi-supervised learning context (more information can be found at http://www.kuleuven.be/wehys).

9/26/2008

The SODA Program Committee

Filed under: Conferences, Reviewing , TheoryJohn Langford @ 7:10 am

Claire asked me to be on the SODA program committee this year, which was quite a bit of work.

I had a relatively light load—merely 49 theory papers. Many of these papers were not on subjects that I was expert about, so (as is common for theory conferences) I found various reviewers that I trusted to help review the papers. I ended up reviewing about 1/3 personally. There were a couple instances where I ended up overruling a subreviewer whose logic seemed off, but otherwise I generally let their reviews stand.

There are some differences in standards for paper reviews between the machine learning and theory communities. In machine learning it is expected that a review be detailed, while in the theory community this is often not the case. Every paper given to me ended up with a review varying between somewhat and very detailed.

I’m sure not every author was happy with the outcome. While we did our best to make good decisions, they were difficult decisions to make. For example, if there is a well written paper on an interesting topic which analyzes a flawed abstraction of the topic, should it get in? I would rate this a ‘weak accept’.

Here are some observations/thoughts about the process (Several also appear in Claire’s report).

  1. Better feedback isn’t too hard. The real time sink in reviewing a theory paper is reading it. Leaving a few comments, even if just “I don’t like the model analyzed because it misses important feature X.” is relatively easy. My impression of the last COLT was that COLT had entirely switched from minimal author feedback to substantial author feedback. This year’s SODA was somewhere inbetween, depending on the PC member involved, which is a definite trend towards stronger comments for SODA.
  2. Normalization There were very substantial differences amongst the PC members in what fraction of papers they wanted to accept, and this leaked into the final decisions. Normalizing reviewer ratings is standard operating procedure at some machine learning conferences, so I helped with that. Even with that help, further efforts at normalization in the future seem like they could help, for example in getting the decision on the paper above right.
  3. Ordering There were various areas where we tried to order all the reasonable papers and make a decision based on the ordering. Where the papers are sufficiently related, I think this is very helpful, and the act even changed my opinion on some papers a bit by putting them in better context. Not everyone imposed the same ordering, because there are somewhat different tastes: Do you care about the techniques used? (A traditional theory concern) or about the quality of the result? (I’m more focused here.) Nevertheless, it helped reduce the noise. Incidentally, there is substantial theoretical evidence that decisions by ordering are more robust than decisions by absolute score producing an ordering.
  4. Writing quality I was surprised by the poor writing quality of some SODA papers—several were basically not readable without a thorough understanding of referenced papers, and a substantial ability to infer what was meant rather than what was said. Some of these papers were accepted, which would have been impossible in a conference with double-blind reviewing.
  5. PC size The tradition in theory conferences is to have a relatively small program committee. I don’t see much advantage to this for SODA. The program committe is small enough and SODA is broad enough that it seems dubious to claim that every PC member is an expert on the subject of all of their papers. Also, (frankly) the highest quality reviews from my batch of papers weren’t written by me, but rather by reviewers that I picked who had the time to really grind through all the nitty-gritty of the paper. It’s easy to imagine that a larger PC would improve reviewing quality by avoiding overload.

9/12/2008

How do we get weak action dependence for learning with partial observations?

Filed under: Machine Learning, Papers, ProblemsJohn Langford @ 9:53 am

This post is about contextual bandit problems where, repeatedly:

  1. The world chooses features x and rewards for each action r1,…,rk then announces the features x (but not the rewards).
  2. A policy chooses an action a.
  3. The world announces the reward ra

The goal in these situations is to learn a policy which maximizes ra in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction.

One common drawback of all algorithms for solving this setting, is that they have a poor dependence on the number of actions. For example if k is the number of actions, EXP4 (page 66) has a dependence on k0.5, epoch-greedy (and the simpler epsilon greedy) have a dependence on k1/3, and the offset tree has a dependence on k-1. These results aren’t directly comparable because different things are being analyzed. The fact that all analyses have poor dependence on k is troublesome. The lower bounds in the EXP4 paper and the Offset Tree paper demonstrate that this isn’t a matter of lazy proof writing or a poor choice of algorithms: it’s essential to the nature of the problem.

In supervised learning, it’s typical to get no dependence or very weak dependence on the number of actions/choices/labels. For example, if we do empirical risk minimization over a finite hypothesis space H, the dependence is at most ln |H| using an Occam’s Razor bound. Similarly, the PECOC algorithm (page 12) has dependence bounded by a constant. This kind of dependence is great for the feasibility of machine learning: it means that we can hope to tackle seemingly difficult problems.

Why is there such a large contrast between these settings? At the level of this discussion, they differ only in step 3, where for supervised learning, all of the rewards are revealed instead of just one.

One of the intuitions you develop after working with supervised learning is that holistic information is often better. As an example, given a choice between labeling the same point multiple times (perhaps revealing and correcting noise) or labeling other points once, an algorithm with labels other points typically exists and typically yields as good or better performance in theory and in practice. This appears untrue when we have only partial observations.

For example, consider the following problem(*): “Find an action with average reward greater than 0.5 with probability at least 0.99″ and consider two algorithms:

  1. Sample actions at random until we can prove (via Hoeffding bounds) that one of them has large reward.
  2. Pick an action at random, sample it 100 times, and if we can prove (via a Hoeffding bound) that it has large average reward return it, otherwise pick another action randomly and repeat.

When there are 1010 actions and 109 of them have average reward 0.6, it’s easy to prove that algorithm 2 is much better than algorithm 1.

Lower bounds for the partial observation settings imply that more tractable algorithms only exist under additional assumptions. Two papers which do this without context features are:

  1. Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandit problems in metric spaces, STOC 2008. Here the idea is that you have access to a covering oracle on the actions where actions with similar average rewards cover each other.
  2. Deepak Agarwal, , and Deepayan Chakrabati, Multi-armed Bandit Problems with Dependent Arms, ICML 2007. Here the idea is that the values of actions are generated recursively, preserving structure through the recursion.

Basic questions: Are there other kinds of natural structure which allows a good dependence on the total number of actions? Can these kinds of structures be extended to the setting with features? (Which seems essential for real applications.)

(*) Developed in discussion with Yisong Yue and Bobby Kleinberg.

9/4/2008

Fall ML Conferences

If you are in the New York area and interested in machine learning, consider submitting a 2 page abstract to the ML symposium by tomorrow (Sept 5th) midnight. It’s a fun one day affair on October 10 in an awesome location overlooking the world trade center site.

A bit further off (but a real conference) is the AI and Stats deadline on November 5, to be held in Florida April 16-19.

9/3/2008

Bidding Problems

One way that many conferences in machine learning assign reviewers to papers is via bidding, which has steps something like:

  1. Invite people to review
  2. Accept papers
  3. Reviewers look at title and abstract and state the papers they are interested in reviewing.
  4. Some massaging happens, but reviewers often get approximately the papers they bid for.

At the ICML business meeting, Andrew McCallum suggested getting rid of bidding for papers. A couple reasons were given:

  1. Privacy The title and abstract of the entire set of papers is visible to every participating reviewer. Some authors might be uncomfortable about this for submitted papers. I’m not sympathetic to this reason: the point of submitting a paper to review is to publish it, so the value (if any) of not publishing a part of it a little bit earlier seems limited.
  2. Cliques A bidding system is gameable. If you have 3 buddies and you inform each other of your submissions, you can each bid for your friend’s papers and express a disinterest in others. There are reasonable odds that at least two of your friends (out of 3 reviewers) will get your papers, and with 2 adamantly positive reviews, your paper has good odds of acceptance.

The clique issue is real, but it doesn’t seem like a showstopper to me. If a group of friends succeeds at this game for awhile, but their work is not fundamentally that interesting, then there will be no long term success. The net effect is an unfocused displacement of other perhaps-better papers and ideas.

It’s important to recall that there are good aspects of a bidding system. If reviewers are nonstrategic (like I am), they simply pick the papers that seem the most interesting. Having reviewers review the papers that most interest them isn’t terrible—it means they pay close attention and generally write better reviews than a disinterested reviewer might. In many situations, simply finding reviewers who are willing to do an attentive thorough review is challenging.

However, since ICML I’ve come to believe there is a more serious flaw than any of the above: torpedo reviewing. If a research direction is controversial in the sense that just 2-or-3 out of hundreds of reviewers object to it, those 2 or 3 people can bid for the paper, give it terrible reviews, and prevent publication. Repeated indefinitely, this gives the power to kill off new lines of research to the 2 or 3 most close-minded members of a community, potentially substantially retarding progress for the community as a whole.

A basic question is: “Does torpedo reviewing actually happen?” The evidence I have is only anecdotal, but perhaps the answer is “yes”. As an author, I’ve seen several reviews poor enough that a torpedo reviewer is a plausible explanation. In talking to other people, I know that some folks do a lesser form: they intentionally bid for papers that they want to reject on the theory that rejections are less work than possible acceptances. Even without more substantial evidence (it is hard to gather, after all), it’s clear that the potential for torpedo reviewing is real in a bidding system, and if done well by the reviewers, perhaps even undectectable.

The fundamental issue is: “How do you chose who reviews a paper?” We’ve discussed bidding above, but other approaches have their own advantages and drawbacks. The simplest approach I have right now is “choose diversely”: perhaps a reviewer from bidding, a reviewer from assignment by a PC/SPC/area chair, and another reviewer from assignment by a different PC/SPC/area chair.

8/24/2008

Mass Customized Medicine in the Future?

Filed under: Machine LearningJohn Langford @ 7:00 pm

This post is about a technology which could develop in the future.

Right now, a new drug might be tested by finding patients with some diagnosis and giving or not giving them a drug according to a secret randomization. The outcome is observed, and if the average outcome for those treated is measurably better than the average outcome for those not treated, the drug might become a standard treatment.

Generalizing this, a filter F sorts people into two groups: those for treatment A and those not for treatment B based upon observations x. To measure the outcome, you randomize between treatment and nontreatment of group A and measure the relative performance of the treatment.

A problem often arises: in many cases the treated group does not do better than the nontreated group. A basic question is: does this mean the treatment is bad? With respect to the filter F it may mean that, but with respect to another filter F’, the treatment might be very effective. For example, a drug might work great for people which have one blood type, but not so well for others.

Finding F’ is a situation where machine learning can help. The setting is essentially isomorphic to this one. The basic import is that we can learn a rule F’ for filters which are more strict than the original F. This can be done on past recorded data, and if done properly we can even statistically prove that F’ works, without another randomized trial. All of the technology exists to do this now—the rest is a matter of education, inertia, and desire.

Here’s what this future might look like:

  1. Doctors lose a bit of control. Right now, the filters F are typically a diagnosis of one sort or another. If machine learning is applied, the resulting learned F’ may not be easily described as a particular well-known diagnosis. Instead, a doctor might record many observations, and have many learned filters F’ applied to suggest treatments.
  2. The “not understanding the details” problem is sometimes severe, so we can expect a renewed push for understandable machine learning rules. Some tradeoff between understandability and predictive power seems to exist creating a tension: do you want a good treatment or do you want an understandable treatment?
  3. The more information fed into a learning algorithm, the greater it’s performance can be. If we manage to reach a pointer in the future where Gattaca style near instantaneous genomic sequencing is available, feeding this into a learning algorithm is potentially very effective. In general a constant pressure to measure more should be expected. Given that we can learn from past data, going back and measuring additional characteristics of past patients may even be desirable.
  4. Since many treatments are commercial in the US, there will be a great deal of pressure to find a filter F’ which appears good, and a company investing millions into the question is quite capable of overfitting so that F’ is better than it appears. Safe and sane ways to deal with this exist, as showcased by various machine learning challenges, such as the Netflix challenge. To gain trust in such approaches, a trustable and trusted third party capable of this sort of testing must exist. Or, more likely, it won’t exist, and so we’ll need a new trial to test any new F’.
Next Page »