Machine Learning – Page 53 – Machine Learning (Theory)

12/4/2006

Structural Problems in NIPS Decision Making

This is a very difficult post to write, because it is about a perennially touchy subject. Nevertheless, it is an important one which needs to be thought about carefully.

There are a few things which should be understood:

The system is changing and responsive. We-the-authors are we-the-reviewers, we-the-PC, and even we-the-NIPS-board. NIPS has implemented ‘secondary program chairs’, ‘author response’, and ‘double blind reviewing’ in the last few years to help with the decision process, and more changes may happen in the future.
Agreement creates a perception of correctness. When any PC meets and makes a group decision about a paper, there is a strong tendency for the reinforcement inherent in a group decision to create the perception of correctness. For the many people who have been on the NIPS PC it’s reasonable to entertain a healthy skepticism in the face of this reinforcing certainty.
This post is about structural problems. What problems arise because of the structure of the process? The post is not about individual people, because this is unlikely to be fruitful.

Although the subject is nominally about NIPS (which I have experience with as an author, reviewer, and PC member), the points may apply elsewhere.

For those that don’t know, it’s worth reviewing how the NIPS process currently works. Temporally, it looks like the following:

PC chair is appointed.
PC chair picks PC committee to cover many different areas. NIPS is notably diverse.
PC committee members pick reviewers for their areas.
Authors submit blinded papers.
Papers are assigned to two PC committee members, the “primary” and the “secondary”.
Reviewers bid for papers within their areas which they want and don’t want to review.
Reviewers are assigned papers based on bid plus coverage.
Reviewers review papers.
Authors respond to blinded reviews.
Reviewers discuss and rate papers.
PC members digest author/reviewer interaction (and sometimes the paper) into an impression.
PC members meet physically at the PC meeting.
PC members present all papers that they believe are worth considering to other PC members and a decision is made.

Naturally, there are many details left out of this long list.

Here is my attempt to describe the problems I’ve seen:

Attention deficit disorder. The attention paid to individual accept/reject decisions is (and structurally must be) small. There are several effects which drive this:
1. The people on the NIPS PC are typically busy and time constrained.
2. The number of papers assigned to individual PC members is large—perhaps 40 to 80, plus a similar number assigned as a secondary.
3. Many of the people have traveled a very long way to reach the PC meeting. Jetlag is common, and often significantly affects your ability to think carefully.
4. The meeting itself is 2 days long. The average time spent on any decision must be less than 5 minutes, and everyone knows this. The implicit encouragement to digest a paper down to its most simple description is significant. No one on the PC has seen the paper except for the primary and the secondary (if you are lucky) PC members, so decisions are made quickly based upon relatively little information. (This is better than it sounds in most cases because effectively the decision was made by the primary PC member beforehand.)
Artificial scarcity. NIPS is a single track conference with 3 levels of acceptance “Accept for an oral presentation”, “Accept for a poster with a spotlight”, and “Accept as a poster only”. It’s fairly difficult to justify a paper as “of broad interest”, which is ideal for an oral presentation. Will a neuroscientist really pay attention to this learning theory paper? Is this dimensionality reduction algorithm going to interest someone in learning theory? It’s substantially easier to justify a paper as “possibly of interest to a number of people”, which is about right for poster spotlight. Since the number of spotlights and the number of orals is similar, two effects occur: papers which are about right for spotlights become orals, and many reasonable spotlights aren’t spotlights because they don’t fit.
The Veto Effect. If someone on the PC has a strong dislike for your paper, there is a very good chance of rejection. This is true even when attention is explicitly paid by the PC chair to avoiding the veto problem. It’s even true when your paper has the strongest reviews in the area (no joke!). There are several fundamental problems here:
1. People, especially in person, do not generally want to be confrontational. Consequently, if someone who is rarely confrontational speaks strongly against a paper, it’s rare for an alternate voice to be heard.
2. It is easy to instill “fear, uncertainty, and doubt” in people. Was this paper covering the same material as some other paper no one knows? Are the assumptions criticizable? This problem is greatly exacerbated by attention deficit disorder.

It is easy to complain about these problems and substantially harder to fix them. (There is previous discussion on this.) Here is my best attempt to imagine fixes.

Attention Deficit Disorder. The fundamental problem here is that papers aren’t getting the attention that they deserve by the final decision maker. Several changes might help, but nothing is going to be a silver bullet here.
1. Author responsibility. Unfortunately, some authors abuse the system by submitting papers which should not be submitted. Much of this has to do with inexperience—many authors are first time paper writers. For these authors, some better effort educating people about what is an appropriate paper is good. This year, an effort was made to do this, and followups may be helpful. For a small fraction of papers, authors intentionally skate the edge of what is reasonable. Should an ICML paper with 30% different content be submitted to NIPS? This small fraction takes more time than their fraction indicates and (frankly) isn’t always caught. Some form of “shame list” may be an appropriate way to deal with this, although much caution would have to be exercised.
2. Many of the problems here are unremovable artifacts of a physically present PC meeting. Going to a virtualized process would eliminate these problems (and introduce others). Any such decision would have to be carefully considered, but it is not impossible: there are plenty of successful conference committees which never meet physically.
3. The PC meeting can be run a bit differently.
  1. Bob Williamson and I managed to go through our secondary assignments and make independent decisions, then reconcile. In contrast, for most papers, the secondary PC member was inoperative at the PC meeting. This made some difference, and it’s easy to imagine that systematically having this reconciliation be a part of the PC meeting is helpful. The reconciliation step does not take very long and is parallelizable.
  2. Not making a decision at the PC meeting could be a real option for a small number of troublesome papers. There is perhaps a week-long timegap between the PC meeting and the release of the decisions during which decisions could be double checked. This option must only be used rarely, and never as a means for excluding interested PC members from the decision.
  3. Information can be more widely shared. I don’t see any real advantage to limiting the knowledge of papers not in your area to “title+authors”. At the PC meeting itself, it would be helpful to have all of the papers available to all of the members.
Artificial Scarcity. My understanding is that the makers of NIPS purposefully preferred a single track conference, and it’s hard to argue with the success NIPS has enjoyed. Nevertheless, it seems notable that the NIPS workshops (which are excessively multitracked) are more successful than the NIPS conference by some measures. Going to a two-track or partially two-track format would ease some of the decision making.
Even working within the single track format, it’s not clear that the ratio between orals and spotlights is right. Spotlights take about 1/10th the time that an oral presentation takes, and yet only 1/10th or so of the overall time is allocated to spotlight presentations. Losing one oral presentation (out of about 20) would yield a significant increase in the number of spotlights, and it’s easy to imagine this would be beneficial to attendees while easing decision making.
The Veto Effect. The veto effect is hard to deal with, and it’s only relevant to a small number of decisions. Nevertheless it’s important because some of the best papers are controversial at the time they are published. There are two ways I can imagine for dealing with the veto effect: (1) allowing author feedback and (2) devolving power from the PC to the reviewers. Allowing author feedback would have to be coupled with delayed decision making. Eliminating the power of the PC to reject very highly rated papers is also controversial, but may be worth considering.
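As a rough check on the oral/spotlight arithmetic in the Artificial Scarcity discussion above, here is a back-of-the-envelope sketch. It treats the post's approximate figures as exact and adds one assumption that is not in the post: that presentation time is split only between orals and spotlights. All variable names are made up for the illustration.

```python
from fractions import Fraction

ORALS = 20                          # "about 20" orals, treated as exact here
SPOTLIGHT_COST = Fraction(1, 10)    # one spotlight takes about 1/10 of an oral slot
SPOTLIGHT_SHARE = Fraction(1, 10)   # about 1/10 of presentation time goes to spotlights

# Measure time in oral slots. Assumption: orals fill the remaining 9/10 of the time.
total_time = ORALS / (1 - SPOTLIGHT_SHARE)
spotlights_now = total_time * SPOTLIGHT_SHARE / SPOTLIGHT_COST

# Convert a single oral slot into spotlights.
spotlights_gained = 1 / SPOTLIGHT_COST
relative_increase = spotlights_gained / spotlights_now

print(f"current spotlights: about {float(spotlights_now):.1f}")
print(f"gained by dropping one oral: {int(spotlights_gained)}")
print(f"relative increase: about {float(relative_increase):.0%}")
```

Under these assumptions there are roughly 22 spotlights, which agrees with the earlier remark that orals and spotlights are similar in number, and giving up one oral adds 10 more, an increase of about 45%.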

11/27/2006

Continuizing Solutions

This post is about a general technique for problem solving which I’ve never seen taught (in full generality), but which I’ve found very useful.

Many problems in computer science turn out to be discretely difficult. The best known examples of such problems are NP-hard problems, but I mean ‘discretely difficult’ in a much more general way, which I only know how to capture by examples.

ERM In empirical risk minimization, you choose a minimum error rate classifier from a set of classifiers. This is NP-hard for common sets, but it can be much harder, depending on the set.
Experts In the online learning with experts setting, you try to predict well so as to compete with a set of (adversarial) experts. Here the alternating quantifiers of you and an adversary playing out a game can yield a dynamic programming problem that grows exponentially.
Policy Iteration The problem with policy iteration is that you learn a new policy with respect to an old policy, which implies that simply adopting the new policy can go very wrong.

For each of these problems, there are “continuized” solutions which can yield smaller computation, more elegant mathematics, or both.

ERM By shifting from choosing a single classifier to choosing a stochastic classifier we can prove a new style of bound which is significantly tighter, easier to state, and easier to understand than traditional bounds in the traditional setting. This is the PAC-Bayes bound idea.
Experts By giving the adversary slightly more power—the ability to split experts and have them fractionally predict one way vs. another, the optimal policy becomes much easier to compute (quadratic in the horizon, or maybe less). This is the continuous experts idea.
Policy Iteration For policy iteration, by stochastically mixing the old and the new policy, we can find a new policy better than the old policy. This is the conservative policy iteration idea.
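The stochastic mixing step in the policy iteration example can be sketched as follows. This only illustrates the mixing idea, not the full conservative policy iteration algorithm: `mix_policies`, `alpha`, and the toy policies are hypothetical names, and the question of how to choose `alpha` is left out entirely.

```python
import random

def mix_policies(old_policy, new_policy, alpha, rng=random):
    """Return a stochastic policy that, at each decision, follows
    new_policy with probability alpha and old_policy otherwise."""
    def mixed(state):
        chosen = new_policy if rng.random() < alpha else old_policy
        return chosen(state)
    return mixed

# Toy deterministic policies over an arbitrary state (hypothetical).
def always_left(state):
    return "left"

def always_right(state):
    return "right"

# A small step away from the old policy: mostly "left", occasionally "right".
policy = mix_policies(always_left, always_right, alpha=0.1, rng=random.Random(0))
actions = [policy(state=None) for _ in range(1000)]
print("fraction of 'right' actions:", actions.count("right") / len(actions))
```

Setting `alpha` to 0 recovers the old policy and setting it to 1 adopts the new policy outright, so the mixture interpolates continuously between the two discrete choices.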

There is some danger to continuizing. The first and second examples both involve a setting shift, which may not be valid—in general your setting should reflect your real problem rather than the thing which is easy to solve. However, even with the setting shift, the solutions appear so compellingly more elegant that it is hard to not hope to use them in a solution to the original setting.

I have not seen a good formulation of the general approach of continuizing. Nevertheless, I expect to see continuizing in more places and to use it in the future. By making it explicit, perhaps this can be made easier.

11/22/2006

Explicit Randomization in Learning algorithms

There are a number of learning algorithms which explicitly incorporate randomness into their execution. These include, amongst others:

Neural Networks. Neural networks use randomization to assign initial weights.
Boltzmann Machines/Deep Belief Networks. Boltzmann machines are something like a stochastic version of multinode logistic regression. The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness.
Bagging. Bagging is a process where a learning algorithm is run several different times on several different datasets, creating a final predictor which makes a majority vote.
Policy descent. Several algorithms in reinforcement learning such as Conservative Policy Iteration use random bits to create stochastic policies.
Experts algorithms. Randomized weighted majority uses random bits as a part of the prediction process to achieve better theoretical guarantees.
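To make the bagging entry in the list above concrete, here is a minimal sketch. It assumes the "several different datasets" are bootstrap resamples of one training set, as in the usual formulation of bagging; `bagging`, `train`, and `train_majority` are hypothetical names, and the base learner is a deliberately trivial stand-in.

```python
import random
from collections import Counter

def bagging(train, dataset, n_models=11, seed=0):
    """Train n_models predictors on bootstrap resamples of dataset
    and return a predictor that takes a majority vote among them."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap resample: draw len(dataset) examples with replacement.
        resample = [rng.choice(dataset) for _ in dataset]
        models.append(train(resample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

def train_majority(examples):
    """Toy base learner (hypothetical): always predict the most common label seen."""
    label = Counter(y for _, y in examples).most_common(1)[0][0]
    return lambda x: label

# Usage with (features, label) pairs.
data = [((0,), 1), ((1,), 1), ((2,), 1), ((3,), 0)]
predictor = bagging(train_majority, data)
print(predictor((5,)))
```

The only randomness is in which examples each resample contains; the base learner and the vote are deterministic.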

A basic question is: “Should there be explicit randomization in learning algorithms?” It seems perverse to feed extra random bits into your prediction process since they don’t contain any information about the problem itself. Can we avoid using random numbers? This question is not just philosophy: we might hope that deterministic versions of learning algorithms are both more accurate and faster.

There seem to be several distinct uses for randomization.

Symmetry breaking. In the case of a neural network, if every weight started as 0, the gradient of the loss with respect to every weight would be the same, implying that after updating, all weights remain the same. Using random numbers to initialize weights breaks this symmetry. It is easy to believe that there are good deterministic methods for symmetry breaking.
Overfit avoidance. A basic observation is that deterministic learning algorithms tend to overfit. Bagging avoids this by randomizing the input of these learning algorithms in the hope that directions of overfit for individual predictions cancel out. Similarly, using random bits internally as in a deep belief network avoids overfitting by forcing the algorithm to learn a robust-to-noise set of internal weights, which are then robust-to-overfit. Large margin learning algorithms and maximum entropy learning algorithms can be understood as deterministic operations attempting to achieve the same goal. A significant gap remains between randomized and deterministic learning algorithms: the deterministic versions just deal with linear predictions while the randomized techniques seem to yield improvements in general.
Continuizing. In reinforcement learning, it’s hard to optimize a policy over multiple timesteps because the optimal decision at timestep 2 is dependent on the decision at timestep 1 and vice versa. Randomized interpolation of policies offers a method to remove this cyclic dependency. PSDP can be understood as a derandomization of CPI at the cost of increased computation (learning a new predictor for each timestep individually). Whether or not we can avoid this tradeoff in general is unclear.
Adversary defeating. Some algorithms, such as randomized weighted majority, are designed to work against adversaries who know your algorithm, except for random bits. The randomization here is provably essential, but the setting is often far more adversarial than the real world.
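The symmetry breaking point in the list above can be checked on a toy network. The sketch below is an illustration under assumptions of my own choosing, not a standard benchmark: a scalar input, two tanh hidden units, a linear output, squared loss, no biases, and made-up data. The function name `train` and all constants are hypothetical.

```python
import math
import random

def train(w1, w2, data, lr=0.1, steps=100):
    """Gradient descent on squared loss for out = sum_j w2[j] * tanh(w1[j] * x)."""
    w1, w2 = list(w1), list(w2)
    for _ in range(steps):
        for x, y in data:
            h = [math.tanh(w * x) for w in w1]
            err = sum(v * hj for v, hj in zip(w2, h)) - y
            # Compute both gradients before updating either layer.
            g2 = [err * hj for hj in h]
            g1 = [err * v * (1 - hj * hj) * x for v, hj in zip(w2, h)]
            w1 = [w - lr * g for w, g in zip(w1, g1)]
            w2 = [v - lr * g for v, g in zip(w2, g2)]
    return w1, w2

data = [(-1.0, -0.5), (0.5, 0.8), (1.0, 0.2)]

# Identical starting weights: both hidden units get identical gradients at every step.
same_w1, same_w2 = train([0.3, 0.3], [0.3, 0.3], data)

# Random starting weights break the tie.
rng = random.Random(0)
rand_w1, rand_w2 = train([rng.uniform(-1, 1) for _ in range(2)],
                         [rng.uniform(-1, 1) for _ in range(2)], data)

print("identical init:", same_w1, same_w2)
print("random init:   ", rand_w1, rand_w2)
```

With identical initial weights the two hidden units remain copies of each other, so the network behaves as if it had a single hidden unit. In this particular architecture, starting every weight at exactly 0 is even worse: all gradients are 0 and nothing moves.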

The current state-of-the-art is that random bits provide performance (computational and predictive) which we don’t know (or at least can’t prove we know) how to achieve without randomization. Can randomization be removed or is it essential to good learning algorithms?

11/20/2006

Context and the calculation misperception

This post is really for people not in machine learning (or related fields). It is about a common misperception which affects people who have not thought about the process of trying to predict something. Hopefully, by precisely stating it, we can remove it.

Suppose we have a set of events, each described by a vector of features.

0	1	0	1	1
1	0	1	0	1
1	1	0	1	0
0	0	1	1	1
1	1	0	0	1
1	0	0	0	1
0	1	1	1	0

Suppose we want to predict the value of the first feature given the others. One approach is to bin the data by one feature. For the above example, we might partition the data according to feature 2, then observe that when feature 2 is 0 the label (feature 1) is mostly 1. On the other hand, when feature 2 is 1, the label (feature 1) is evenly split between 0 and 1, so we predict 0. Using this simple rule we get an observed error rate of 3/7.

There are two issues here. The first is that this is really a training error rate, and (hence) may be an overoptimistic prediction. This is not a very serious issue as long as there are a reasonable number of representative examples.

The second issue is more serious. A simple rule (number of 1’s among the other four features less than 3 implies 1, else 0) achieves error rate 0. By binning the data according to only one feature, the potential of achieving error rate 0 is removed.
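Both error rates can be checked directly against the seven events in the table. The sketch below assumes that the counting rule counts 1's among features 2 through 5 only (that is, excluding the label); this is my reading of the rule, and it is the reading under which the rule makes no mistakes on this data. The function names are made up for the illustration.

```python
# The seven events from the table; column 0 is the label (feature 1).
events = [
    (0, 1, 0, 1, 1),
    (1, 0, 1, 0, 1),
    (1, 1, 0, 1, 0),
    (0, 0, 1, 1, 1),
    (1, 1, 0, 0, 1),
    (1, 0, 0, 0, 1),
    (0, 1, 1, 1, 0),
]

def bin_rule(event):
    """Binning on feature 2 alone: predict 1 when it is 0, and 0 when it is 1."""
    return 1 if event[1] == 0 else 0

def count_rule(event):
    """Predict 1 when fewer than three of features 2-5 are 1, else 0."""
    return 1 if sum(event[1:]) < 3 else 0

def errors(rule):
    """Number of events whose label the rule gets wrong."""
    return sum(rule(e) != e[0] for e in events)

print(f"binning on feature 2: {errors(bin_rule)}/{len(events)} errors")
print(f"counting 1's:         {errors(count_rule)}/{len(events)} errors")
```

The binning rule gets 3 of 7 events wrong, while the counting rule, which uses the full context of each event, gets none wrong.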

The reason for binning is often definitional. Many people think of probability as an observed (or observable) rate. For these people, the probabilities of events can only be learned by finding a large number of identical events and then calculating the observed rate. Constructing “identical events” always involves throwing away the unique context of the event. This disposal of information eliminates the possibility of good prediction performance.

The solution to this problem is education. There are other definitions of probability which are more appropriate when every event is unique. One thing which makes people uncomfortable about probabilities over unique events is that probabilities are no longer observable—they are only estimatable. This loss of grounding is a price which must be paid for improved performance. Luckily, we can tell if our prediction performance improves on labeled examples.

11/6/2006

Data Linkage Problems

Data linkage is a problem which seems to come up in various applied machine learning problems. I have heard it mentioned in various data mining contexts, but it seems relatively less studied for systemic reasons.

A very simple version of the data linkage problem is a cross hospital patient record merge. Suppose a patient (John Doe) is admitted to a hospital (General Health), treated, and released. Later, John Doe is admitted to a second hospital (Health General), treated, and released. Given a large number of records of this sort, it becomes very tempting to try and predict the outcomes of treatments. This is reasonably straightforward as a machine learning problem if there is a shared unique identifier for John Doe used by General Health and Health General along with time stamps. We can merge the records and create examples of the form “Given symptoms and treatment, did the patient come back to a hospital within the next year?” These examples could be fed into a learning algorithm, and we could attempt to predict whether a return occurs.
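When a shared identifier does exist, building the examples described above is mostly bookkeeping. Here is a minimal sketch; the record layout, the field values, and the name `build_examples` are all hypothetical, and it uses admission dates as the time stamps for simplicity.

```python
from datetime import date, timedelta

# Hypothetical merged records: (patient_id, hospital, admitted, symptoms, treatment).
records = [
    ("p1", "General Health", date(2005, 1, 10), "chest pain", "aspirin"),
    ("p1", "Health General", date(2005, 6, 2), "chest pain", "stent"),
    ("p2", "General Health", date(2005, 3, 5), "fracture", "cast"),
]

def build_examples(records, window=timedelta(days=365)):
    """Turn time-stamped visits into ((symptoms, treatment), returned) pairs,
    where returned says whether the same patient has a later visit within window."""
    by_patient = {}
    for rec in records:
        by_patient.setdefault(rec[0], []).append(rec)

    examples = []
    for visits in by_patient.values():
        visits.sort(key=lambda r: r[2])  # order each patient's visits by date
        for i, (_, _, admitted, symptoms, treatment) in enumerate(visits):
            returned = any(later[2] - admitted <= window for later in visits[i + 1:])
            examples.append(((symptoms, treatment), returned))
    return examples

for features, returned in build_examples(records):
    print(features, returned)
```

One caveat of this sketch: a visit less than a year before the end of the data is labeled as "no return" even though the outcome is not yet known, so a real version would need to drop or otherwise handle those visits.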

The problem is that General Health and Health General don’t have any shared unique identifier for John Doe. Information is often misspelled (name misspellings are very common), mistyped, changed (people move), and simply not unique (how many people were born on your birthday?).
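To see why exact matching breaks down, and what a softer comparison might look like, here is a small sketch using `difflib.SequenceMatcher` from the Python standard library. The record fields, the weights 0.7 and 0.3, and the function names are arbitrary choices for illustration, and the resulting score is not a calibrated probability of a link.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def link_score(rec_a, rec_b):
    """Crude hand-weighted score; the weights are arbitrary and would be learned in practice."""
    name = name_similarity(rec_a["name"], rec_b["name"])
    same_birthday = 1.0 if rec_a["born"] == rec_b["born"] else 0.0
    return 0.7 * name + 0.3 * same_birthday

# Hypothetical records from the two hospitals.
general_health = {"name": "John Doe", "born": "1970-01-01"}
health_general = {"name": "Jon Doe ", "born": "1970-01-01"}
someone_else = {"name": "Jane Roe", "born": "1970-01-01"}

print("exact name match:", general_health["name"] == health_general["name"])
print("score, misspelled John Doe:", round(link_score(general_health, health_general), 3))
print("score, different person:   ", round(link_score(general_health, someone_else), 3))
```

The misspelled record fails an exact comparison but scores higher than the record for a different person who happens to share the birthday, which is the kind of graded evidence a linkage predictor could use.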

Although this is just one example, data linkage problems seem to be endemic to learning applications. There seem to be several solutions:

Improved recording. Sometimes minor changes to what information is recorded can strongly disambiguate. For example, there is a big difference between recording the pages visited at a website versus tracking the sequence of pages visited. The essential thing to think about when designing the information to record is: How will I track the consequences of decisions?
Two-stage learning. First predict which records should be linked, based upon a smaller dataset that is hand checked. Then, use your learned predictor to do the linkage, and then solve your real prediction problem. There are several pitfalls here.
1. Rarity problems. Links are typically much rarer than nonlinks. The training process needs to take this into account by properly representing the scarcity of links.
2. Information interfaces. A prediction of “link” or “no link” is too scarce an information source in an inherently noisy environment. Instead, a probability of link may need to be estimated.
3. Two stage estimation. A common approach to improving performance is turning a double approximation (given x predict y, given y predict z) into a single approximation (given x predict z). A method for achieving single approximation here is tricky because we have ancillary information about the intermediate prediction.
Customized algorithms. The Bayesian approach of “specify a prior, then use Bayes law to get a posterior, then predict with the posterior” is attractive here because we often have strong prior beliefs about at least the linkage portion of the problem.
Others?

The data linkage problem also makes very clear the tension between privacy and machine learning. For example, being able to cross index hospital cases might yield a large jump in our ability to predict outcomes, which might suggest improved treatments (it is only a weak suggestion that must be verified—we must be very careful about applying a predictor to an input distribution it did not learn with respect to). And yet, linking records can result in unexpectedly large pools of information on individuals. Furthermore explicitly sensitive information (like credit card numbers) might easily be the most useful bit of information for linkage.