Machine Learning – Page 38 – Machine Learning (Theory)

9/3/200810/1/2008

Bidding Problems

One way that many conferences in machine learning assign reviewers to papers is via bidding, which has steps something like:

Invite people to review
Accept papers
Reviewers look at title and abstract and state the papers they are interested in reviewing.
Some massaging happens, but reviewers often get approximately the papers they bid for.

At the ICML business meeting, Andrew McCallum suggested getting rid of bidding for papers. A couple reasons were given:

Privacy The title and abstract of the entire set of papers is visible to every participating reviewer. Some authors might be uncomfortable about this for submitted papers. I’m not sympathetic to this reason: the point of submitting a paper to review is to publish it, so the value (if any) of not publishing a part of it a little bit earlier seems limited.
Cliques A bidding system is gameable. If you have 3 buddies and you inform each other of your submissions, you can each bid for your friend’s papers and express a disinterest in others. There are reasonable odds that at least two of your friends (out of 3 reviewers) will get your papers, and with 2 adamantly positive reviews, your paper has good odds of acceptance.

The clique issue is real, but it doesn’t seem like a showstopper to me. If a group of friends succeeds at this game for awhile, but their work is not fundamentally that interesting, then there will be no long term success. The net effect is an unfocused displacement of other perhaps-better papers and ideas.

It’s important to recall that there are good aspects of a bidding system. If reviewers are nonstrategic (like I am), they simply pick the papers that seem the most interesting. Having reviewers review the papers that most interest them isn’t terrible—it means they pay close attention and generally write better reviews than a disinterested reviewer might. In many situations, simply finding reviewers who are willing to do an attentive thorough review is challenging.

However, since ICML I’ve come to believe there is a more serious flaw than any of the above: torpedo reviewing. If a research direction is controversial in the sense that just 2-or-3 out of hundreds of reviewers object to it, those 2 or 3 people can bid for the paper, give it terrible reviews, and prevent publication. Repeated indefinitely, this gives the power to kill off new lines of research to the 2 or 3 most close-minded members of a community, potentially substantially retarding progress for the community as a whole.

A basic question is: “Does torpedo reviewing actually happen?” The evidence I have is only anecdotal, but perhaps the answer is “yes”. As an author, I’ve seen several reviews poor enough that a torpedo reviewer is a plausible explanation. In talking to other people, I know that some folks do a lesser form: they intentionally bid for papers that they want to reject on the theory that rejections are less work than possible acceptances. Even without more substantial evidence (it is hard to gather, after all), it’s clear that the potential for torpedo reviewing is real in a bidding system, and if done well by the reviewers, perhaps even undectectable.

The fundamental issue is: “How do you chose who reviews a paper?” We’ve discussed bidding above, but other approaches have their own advantages and drawbacks. The simplest approach I have right now is “choose diversely”: perhaps a reviewer from bidding, a reviewer from assignment by a PC/SPC/area chair, and another reviewer from assignment by a different PC/SPC/area chair.

8/24/2008

Mass Customized Medicine in the Future?

This post is about a technology which could develop in the future.

Right now, a new drug might be tested by finding patients with some diagnosis and giving or not giving them a drug according to a secret randomization. The outcome is observed, and if the average outcome for those treated is measurably better than the average outcome for those not treated, the drug might become a standard treatment.

Generalizing this, a filter F sorts people into two groups: those for treatment A and those not for treatment B based upon observations x. To measure the outcome, you randomize between treatment and nontreatment of group A and measure the relative performance of the treatment.

A problem often arises: in many cases the treated group does not do better than the nontreated group. A basic question is: does this mean the treatment is bad? With respect to the filter F it may mean that, but with respect to another filter F’, the treatment might be very effective. For example, a drug might work great for people which have one blood type, but not so well for others.

Finding F’ is a situation where machine learning can help. The setting is essentially isomorphic to this one. The basic import is that we can learn a rule F’ for filters which are more strict than the original F. This can be done on past recorded data, and if done properly we can even statistically prove that F’ works, without another randomized trial. All of the technology exists to do this now—the rest is a matter of education, inertia, and desire.

Here’s what this future might look like:

Doctors lose a bit of control. Right now, the filters F are typically a diagnosis of one sort or another. If machine learning is applied, the resulting learned F’ may not be easily described as a particular well-known diagnosis. Instead, a doctor might record many observations, and have many learned filters F’ applied to suggest treatments.
The “not understanding the details” problem is sometimes severe, so we can expect a renewed push for understandable machine learning rules. Some tradeoff between understandability and predictive power seems to exist creating a tension: do you want a good treatment or do you want an understandable treatment?
The more information fed into a learning algorithm, the greater it’s performance can be. If we manage to reach a pointer in the future where Gattaca style near instantaneous genomic sequencing is available, feeding this into a learning algorithm is potentially very effective. In general a constant pressure to measure more should be expected. Given that we can learn from past data, going back and measuring additional characteristics of past patients may even be desirable.
Since many treatments are commercial in the US, there will be a great deal of pressure to find a filter F’ which appears good, and a company investing millions into the question is quite capable of overfitting so that F’ is better than it appears. Safe and sane ways to deal with this exist, as showcased by various machine learning challenges, such as the Netflix challenge. To gain trust in such approaches, a trustable and trusted third party capable of this sort of testing must exist. Or, more likely, it won’t exist, and so we’ll need a new trial to test any new F’.

8/18/200810/1/2008

Radford Neal starts a blog

here on statistics, ML, CS, and other things he knows well.

7/26/20087/26/2008

Compositional Machine Learning Algorithm Design

There were two papers at ICML presenting learning algorithms for a contextual bandit-style setting, where the loss for all labels is not known, but the loss for one label is known. (The first might require a exploration scavenging viewpoint to understand if the experimental assignment was nonrandom.) I strongly approve of these papers and further work in this setting and its variants, because I expect it to become more important than supervised learning. As a quick review, we are thinking about situations where repeatedly:

The world reveals feature values (aka context information).
A policy chooses an action.
The world provides a reward.

Sometimes this is done in an online fashion where the policy can change based on immediate feedback and sometimes it’s done in a batch setting where many samples are collected before the policy can change. If you haven’t spent time thinking about the setting, you might want to because there are many natural applications.

I’m going to pick on the Banditron paper (second one), which attacks the special case of the contextual bandit setting where exactly one of the rewards is 1 and all other actions result in reward 0, and show that (a) similar performance is achievable via a simple combination of existing modular technologies and (b) superior performance is achievable by optimizing some existing modular technologies for the realizable case. This algorithm is the hardest of the two to compete with, because it explicitly deals with the explore/exploit tradeoff. Note that I’m definitely not trying to minimize the paper—there is analysis in that paper which remains interesting to me and isn’t covered by what follows. I am happy that it was published. The point of this post is showing that a modular approach to building learning algorithms is a strong contender when we encounter new learning problems.

Given the problem statement, my approach to solving the problem would be to compose some modular technologies I know.

Perceptron learning algorithm We chose a Binary Perceptron as a classification algorithm. This choice is intrinsically motivated by the great computational performance of a Perceptron. It’s also the closest binary supervised learning algorithm to the Banditron, which eliminates a source of variation in comparison. We could have easily chosen a different base learning algorithm, and in many applications this is highly desirable.
Offset Tree Reduction The offset tree is a newer machine learning reduction from the contextual bandit setting to the standard supervised learning setting. It more robustly transforms a supervised learner’s performance into good policy performance than any other reduction. The offset tree also has good computational properties, since it produces at most log₂ k binary examples per train or test event, where k is the number of actions. In some sense it’s unfair to include the offset tree because it hasn’t yet been formally published. In another sense, that’s what this post is about.
Epoch-Greedy exploration. The Epoch-greedy approach shows how to handle the explore/exploit tradeoff for learning in a contextual bandit setting as a function of a sample complexity bound. For common sample complexity bounds, we get an O(T^2/3) online regret where T is the total number of timesteps.
Occam’s Razor Bound The Occam’s Razor bound limits the regret of an empirical error minimizing perceptron as function of the number of examples. The bound (and it’s many cousins) are often loose, so the only thing we’ll really use is the denominator which says that regret scales as 1/sqrt(number of training examples) in the worst case. Applying Epoch-Greedy to the Occam’s Razor bound gives you an exploration probability of about C/t^1/3 where t is the round number.

Each component above has been analyzed in isolation and is at least a reasonable approach (some are the best possible). Each of these components is also composable. Fitting these pieces together, we get an online learning algorithm (agnostic offset-tree perceptron) that chooses to explore uniformly at random amongst the actions with probability about 1/t^1/3. How well does it perform? On the 4 class reuters based dataset used in the Banditron paper, we get the following accumulated average error rates with some code.:

The right plot is from the Banditron paper. The Perceptron line in both plots is for an algorithm which learns knowing the full loss function of each example, so it represents an ideal we don’t expect to achieve here. There are three results in the left plot:

The blue line is a version of the component set where the Occam’s Razor bound and the Offset Tree reduction have been optimized for the realizable case. This was the first thing we tested (and it’s the result I mentioned at Shai‘s ICML talk). It turns out this approach works substantially better than the Banditron, achieving an error rate about halfway between the Banditron and the Multiclass Perceptron. The two components that we tweaked are:
1. Realizable case Bound It’s well known that in the realizable case the regret of a chosen classifier should scale as 1/t rather than 1/t^0.5. Plugging this into epoch-greedy, we get that the probability of exploration should be about 1/t^0.5.
2. Realizable case Offset Tree A basic observation is that in the realizable setting, every observation should create an example to tune the learning algorithm. In the context of the perceptron, this implies every error creates an example. The offset tree reduction can be altered to take this into account by eliminating the importance weight from all updates, and updating even for exploitation examples which are not drawn uniform randomly.
The red line is what you get with exactly the component set stated above. We were curious about the degree to which a general purpose algorithm can perform well on this application as the realizable case algorithm is definitely broken when the problem is inherently noisy. The performance is somewhat worse than Banditron. I believe this is because it explores only about 1% of the time while the Banditron plot comes from exploring about 5% of the time.
The green line is from a component set where epoch greedy and the offset tree have been tweaked to keep track of and use the distribution over actions at every timestep. This tweaks allows the amount of exploration as measured by the sum of importance weights of training examples to almost double. As we see, this approach improves performance, almost as if we doubled the number of examples, giving it similar performance to the Banditron. The tweaks used for the component set are:
1. Stochastic Epoch-Greedy Instead of deterministically exploring every 1/(bound_gap) times, choose to explore with probability 1/(bound_gap), and pass this probability to the offset tree reduction.
2. Nonuniform Offset Tree Tweak the Offset Tree in the obvious way to take into account nonuniform exploration. In particular, 1/2 is replaced with K/p(a) where p(a) is the probability of the action taken conditioned on one of two actions being taken. We set K so that this value is 1 when a nonexploit action is taken, which implies the importance weight is p(a)/(2-p(a)) when the exploit action is taken.
3. Importance Weighted Perceptron We dealt with importance weights generated by the offset tree reduction by scaling any update by the importance weight.

People may be dissatisfied with the component assembly approach to learning algorithm design, because all of the pieces are not analyzed together. For example, the Banditron paper essentially proves that certain conditions are sufficient for the Banditron algorithm to perform well. This is a more complete guarantee than “all the pieces we stuck together are known to work well when analyzed in isolation”, but this style of guarantee has limitations which are both obvious and often overlooked. These guarantees do not show necessity of these conditions, and hence characterize only a subset of settings where the Banditron works. Unless you are lucky enough to know that your application satisfies the sufficient conditions, you’ll basically have to try the Banditron and see if it works for any particular application. Another drawback is that the complexity of analyzing all the different pieces simultaneously means that it’s difficult to use the best elements together. This last point is what leaves me dissatisfied with the complete analysis approach—it produces algorithms which are simple to analyze rather than optimized for performance.

If you are thinking “I have a better algorithm for solving one problem”, then you’ve missed the point of this post. The point of this post is that compositional design is good for solving many learning problems. This post is about one example of that approach in action. We start by assembling a set of components which we know work well from individual component analysis. Then, we try to optimize the performance of the assembly by swapping components or improving individual components to address known properties of the problem or observed deficiencies. In this particular case, we end up with a better performing algorithm and better components (such as stochastic epoch greedy) which are directly reusable in other settings. The essence of this approach is understanding that there is a real vocabulary of interchangeable components and actively using it in designing learning algorithms.

I would like to thank Alina who helped substantially with this post. I would also like to thank Shai for providing data and helping setup a clean comparison and Sham for helpful discussion.

7/15/20087/20/2008

Interesting papers at COLT (and a bit of UAI & workshops)

Here are a few papers from COLT 2008 that I found interesting.

Maria-Florina Balcan, Steve Hanneke, and Jenn Wortman, The True Sample Complexity of Active Learning. This paper shows that in an asymptotic setting, active learning is always better than supervised learning (although the gap may be small). This is evidence that the only thing in the way of universal active learning is us knowing how to do it properly.
Nir Ailon and Mehryar Mohri, An Efficient Reduction of Ranking to Classification. This paper shows how to robustly rank n objects with n log(n) classifications using a quicksort based algorithm. The result is applicable to many ranking loss functions and has implications for others.
Michael Kearns and Jennifer Wortman. Learning from Collective Behavior. This is about learning in a new model, where the goal is to predict how a collection of interacting agents behave. One claim is that learning in this setting can be reduced to IID learning.

Due to the relation with Metric-E³, I was particularly interested in a couple other papers on reinforcement learning in navigation-like spaces.
I also particularly enjoyed Dan Klein‘s talk, which was the most impressive application of graphical model technology I’ve seen.

I also attended the large scale learning challenge workshop and enjoyed Antoine Bordes talk about a fast primal space algorithm that won by a hair over other methods in the wild track. Ronan Collobert‘s talk was also notable in that they are doing relatively featuritis-free NLP.