Machine Learning (Theory) – Page 47 – Machine learning and learning theory research

9/12/2008

How do we get weak action dependence for learning with partial observations?

This post is about contextual bandit problems where, repeatedly:

The world chooses features x and rewards for each action r₁,…,r_k then announces the features x (but not the rewards).
A policy chooses an action a.
The world announces the reward r_a

The goal in these situations is to learn a policy which maximizes r_a in expectation efficiently. I’m thinking about all situations which fit the above setting, whether they are drawn IID or adversarially from round to round and whether they involve past logged data or rapidly learning via interaction.

One common drawback of all algorithms for solving this setting, is that they have a poor dependence on the number of actions. For example if k is the number of actions, EXP4 (page 66) has a dependence on k^0.5, epoch-greedy (and the simpler epsilon greedy) have a dependence on k^1/3, and the offset tree has a dependence on k-1. These results aren’t directly comparable because different things are being analyzed. The fact that all analyses have poor dependence on k is troublesome. The lower bounds in the EXP4 paper and the Offset Tree paper demonstrate that this isn’t a matter of lazy proof writing or a poor choice of algorithms: it’s essential to the nature of the problem.

In supervised learning, it’s typical to get no dependence or very weak dependence on the number of actions/choices/labels. For example, if we do empirical risk minimization over a finite hypothesis space H, the dependence is at most ln |H| using an Occam’s Razor bound. Similarly, the PECOC algorithm (page 12) has dependence bounded by a constant. This kind of dependence is great for the feasibility of machine learning: it means that we can hope to tackle seemingly difficult problems.

Why is there such a large contrast between these settings? At the level of this discussion, they differ only in step 3, where for supervised learning, all of the rewards are revealed instead of just one.

One of the intuitions you develop after working with supervised learning is that holistic information is often better. As an example, given a choice between labeling the same point multiple times (perhaps revealing and correcting noise) or labeling other points once, an algorithm with labels other points typically exists and typically yields as good or better performance in theory and in practice. This appears untrue when we have only partial observations.

For example, consider the following problem(*): “Find an action with average reward greater than 0.5 with probability at least 0.99” and consider two algorithms:

Sample actions at random until we can prove (via Hoeffding bounds) that one of them has large reward.
Pick an action at random, sample it 100 times, and if we can prove (via a Hoeffding bound) that it has large average reward return it, otherwise pick another action randomly and repeat.

When there are 10¹⁰ actions and 10⁹ of them have average reward 0.6, it’s easy to prove that algorithm 2 is much better than algorithm 1.

Lower bounds for the partial observation settings imply that more tractable algorithms only exist under additional assumptions. Two papers which do this without context features are:

Robert Kleinberg, Aleksandrs Slivkins, and Eli Upfal. Multi-armed bandit problems in metric spaces, STOC 2008. Here the idea is that you have access to a covering oracle on the actions where actions with similar average rewards cover each other.
Deepak Agarwal, , and Deepayan Chakrabati, Multi-armed Bandit Problems with Dependent Arms, ICML 2007. Here the idea is that the values of actions are generated recursively, preserving structure through the recursion.

Basic questions: Are there other kinds of natural structure which allows a good dependence on the total number of actions? Can these kinds of structures be extended to the setting with features? (Which seems essential for real applications.)

(*) Developed in discussion with Yisong Yue and Bobby Kleinberg.

9/4/2008

Fall ML Conferences

If you are in the New York area and interested in machine learning, consider submitting a 2 page abstract to the ML symposium by tomorrow (Sept 5th) midnight. It’s a fun one day affair on October 10 in an awesome location overlooking the world trade center site.

A bit further off (but a real conference) is the AI and Stats deadline on November 5, to be held in Florida April 16-19.

9/3/200810/1/2008

Bidding Problems

One way that many conferences in machine learning assign reviewers to papers is via bidding, which has steps something like:

Invite people to review
Accept papers
Reviewers look at title and abstract and state the papers they are interested in reviewing.
Some massaging happens, but reviewers often get approximately the papers they bid for.

At the ICML business meeting, Andrew McCallum suggested getting rid of bidding for papers. A couple reasons were given:

Privacy The title and abstract of the entire set of papers is visible to every participating reviewer. Some authors might be uncomfortable about this for submitted papers. I’m not sympathetic to this reason: the point of submitting a paper to review is to publish it, so the value (if any) of not publishing a part of it a little bit earlier seems limited.
Cliques A bidding system is gameable. If you have 3 buddies and you inform each other of your submissions, you can each bid for your friend’s papers and express a disinterest in others. There are reasonable odds that at least two of your friends (out of 3 reviewers) will get your papers, and with 2 adamantly positive reviews, your paper has good odds of acceptance.

The clique issue is real, but it doesn’t seem like a showstopper to me. If a group of friends succeeds at this game for awhile, but their work is not fundamentally that interesting, then there will be no long term success. The net effect is an unfocused displacement of other perhaps-better papers and ideas.

It’s important to recall that there are good aspects of a bidding system. If reviewers are nonstrategic (like I am), they simply pick the papers that seem the most interesting. Having reviewers review the papers that most interest them isn’t terrible—it means they pay close attention and generally write better reviews than a disinterested reviewer might. In many situations, simply finding reviewers who are willing to do an attentive thorough review is challenging.

However, since ICML I’ve come to believe there is a more serious flaw than any of the above: torpedo reviewing. If a research direction is controversial in the sense that just 2-or-3 out of hundreds of reviewers object to it, those 2 or 3 people can bid for the paper, give it terrible reviews, and prevent publication. Repeated indefinitely, this gives the power to kill off new lines of research to the 2 or 3 most close-minded members of a community, potentially substantially retarding progress for the community as a whole.

A basic question is: “Does torpedo reviewing actually happen?” The evidence I have is only anecdotal, but perhaps the answer is “yes”. As an author, I’ve seen several reviews poor enough that a torpedo reviewer is a plausible explanation. In talking to other people, I know that some folks do a lesser form: they intentionally bid for papers that they want to reject on the theory that rejections are less work than possible acceptances. Even without more substantial evidence (it is hard to gather, after all), it’s clear that the potential for torpedo reviewing is real in a bidding system, and if done well by the reviewers, perhaps even undectectable.

The fundamental issue is: “How do you chose who reviews a paper?” We’ve discussed bidding above, but other approaches have their own advantages and drawbacks. The simplest approach I have right now is “choose diversely”: perhaps a reviewer from bidding, a reviewer from assignment by a PC/SPC/area chair, and another reviewer from assignment by a different PC/SPC/area chair.

8/24/2008

Mass Customized Medicine in the Future?

This post is about a technology which could develop in the future.

Right now, a new drug might be tested by finding patients with some diagnosis and giving or not giving them a drug according to a secret randomization. The outcome is observed, and if the average outcome for those treated is measurably better than the average outcome for those not treated, the drug might become a standard treatment.

Generalizing this, a filter F sorts people into two groups: those for treatment A and those not for treatment B based upon observations x. To measure the outcome, you randomize between treatment and nontreatment of group A and measure the relative performance of the treatment.

A problem often arises: in many cases the treated group does not do better than the nontreated group. A basic question is: does this mean the treatment is bad? With respect to the filter F it may mean that, but with respect to another filter F’, the treatment might be very effective. For example, a drug might work great for people which have one blood type, but not so well for others.

Finding F’ is a situation where machine learning can help. The setting is essentially isomorphic to this one. The basic import is that we can learn a rule F’ for filters which are more strict than the original F. This can be done on past recorded data, and if done properly we can even statistically prove that F’ works, without another randomized trial. All of the technology exists to do this now—the rest is a matter of education, inertia, and desire.

Here’s what this future might look like:

Doctors lose a bit of control. Right now, the filters F are typically a diagnosis of one sort or another. If machine learning is applied, the resulting learned F’ may not be easily described as a particular well-known diagnosis. Instead, a doctor might record many observations, and have many learned filters F’ applied to suggest treatments.
The “not understanding the details” problem is sometimes severe, so we can expect a renewed push for understandable machine learning rules. Some tradeoff between understandability and predictive power seems to exist creating a tension: do you want a good treatment or do you want an understandable treatment?
The more information fed into a learning algorithm, the greater it’s performance can be. If we manage to reach a pointer in the future where Gattaca style near instantaneous genomic sequencing is available, feeding this into a learning algorithm is potentially very effective. In general a constant pressure to measure more should be expected. Given that we can learn from past data, going back and measuring additional characteristics of past patients may even be desirable.
Since many treatments are commercial in the US, there will be a great deal of pressure to find a filter F’ which appears good, and a company investing millions into the question is quite capable of overfitting so that F’ is better than it appears. Safe and sane ways to deal with this exist, as showcased by various machine learning challenges, such as the Netflix challenge. To gain trust in such approaches, a trustable and trusted third party capable of this sort of testing must exist. Or, more likely, it won’t exist, and so we’ll need a new trial to test any new F’.

8/18/200810/1/2008

Radford Neal starts a blog

here on statistics, ML, CS, and other things he knows well.