Machine Learning (Theory)


The Machine Learning Award goes to …

Tags: Machine Learning,Research jl@ 10:33 am

Perhaps the biggest CS prize for research is the Turing Award, which has a $0.25M cash prize associated with it. It appears none of the prizes so far have been for anything like machine learning (the closest are perhaps database awards).

In CS theory, there is the Gödel Prize which is smaller and newer, offering a $5K prize along and perhaps (more importantly) recognition. One such award has been given for Machine Learning, to Robert Schapire and Yoav Freund for Adaboost.

In Machine Learning, there seems to be no equivalent of these sorts of prizes. There are several plausible reasons for this:

There is no coherent community. People drift in and out of the central conferences all the time. Most of the author names from 10 years ago do not occur in the conferences of today. In addition, the entire subject area is fairly new. There are at least a core group of people who have stayed around.
Machine Learning work doesn’t last Almost every paper is forgotten, because {the goals change, there isn’t any real progress, there are no teachable foundations}. It’s true of all fields of research.
There is no consensus (or clarity) on what’s important. The field is fractured between many very different viewpoints—statistical, empirical, AI, and theoretical. The prioritization of results across these very different viewpoints is hard. We can hope that people who have been around for awhile can learn to appreciate these other viewpoints. The Turing Award is over a much more diverse set of interests.

I think there are some sound reasons why we-as-a-field-of-research should want a field-wide award.

  1. Aspiration Perhaps the most valuable aspect of an award is that it gives people an incentive to aim for something in the long term. The closest approximation that we have right now is “best papers” awards at individual conferences. Best paper awards have a role, but it’s not the same. 10 years from now, when we look back 10 years, which papers will seem most significant? Across many different conferences?
  2. Representation One function of an award is that tells other people what we consider good work. In an academic reference frame, it gives information of the form “this person deserves tenure”. From a less cynical point of view, having a short list of “what’s important” that outsiders can peruse might be quite helpful as a starting point in understanding what’s going on.
  3. Crystallization Research is a process of discovering information and placing it carefully in context. An award has some role in furthering that process.

The worst part of any award is administering it. How do you avoid wasting time and playing favorites while keeping the higher level vision of what might be useful in the long term? I don’t have any great insight, except that it would need to be done.


The Privacy Problem

Machine Learning is rising in importance because data is being collected for all sorts of tasks where it either wasn’t previously collected, or for tasks that did not previously exist. While this is great for Machine Learning, it has a downside—the massive data collection which is so useful can also lead to substantial privacy problems.

It’s important to understand that this is a much harder problem than many people appreciate. The AOL data release is a good example. To those doing machine learning, the following strategies might be obvious:

  1. Just delete any names or other obviously personally identifiable information. The logic here seems to be “if I can’t easily find the person then no one can”. That doesn’t work as demonstrated by the people who were found circumstantially from the AOL data.
  2. … then just hash all the search terms! The logic here is “if I can’t read it, then no one can”. It’s also trivially broken by a dictionary attack—just hash all the strings that might be in the data and check to see if they are in the data.
  3. … then encrypt all the search terms and throw away the key! This prevents a dictionary analysis, but it is still entirely possible to do a frequency analysis. If 10 terms appear with known relative frequencies in public data, then finding 10 terms encrypted terms with the same relative frequencies might give you very good evidence for what these terms are.

All of these strategies turn out to be broken. For those not familiar with Machine Learning, other obvious strategies turn out to not work that well.

  1. Just don’t collect the data. We are not too far off from a world where setting the “please don’t collect my information” flag in your browser implies that you volunteer to have your searches return less relevant results, to not find friends, to not filter spam, etc… If everyone effectively has that flag set by legislation the effect would be very substantial. Many internet companies run off of advertising so eliminating the ability to do targeted advertising will eliminate the ability of these companies to exist.
  2. …Then just keep aggregations of the data! Aggregating data is very bad for machine learning in general. When we are figuring out how to do machine learning it’s even worse because we don’t know in advance which aggregations would be most useful.
  3. …Then keep just enough data around and throw out everything else! Unfortunately, there is no such thing as “enough data”. More data is always better.

This is a particularly relevant topic right now, because it’s news and because CMU and NSF are organizing a workshop on the topic next month, which I’m planning to attend. However, this is not simply an interest burst—the long term trend of increasing data collection implies this problem will repeatedly come up over the indefinite future.

The privacy problem breaks into at least two parts.

  1. Cultural Norms. Historically, almost no monetary transactions were recorded and there was a reasonable expectation that people would forget a visitor. This is rapidly changing with the rise of credit cards and cameras. This change in what can be expected is profoundly uncomfortable.
  2. Power Balance. Data is power. The ability to collect and analyze large quantities of data which many large organizations now have or are constructing increases their power relative to ordinary people. This power can be used for good (to improve services) or for bad (to maximize monopoly status or for spying).

The cultural norm privacy problem is sometimes solvable by creating an opt-in or opt-out protocol. This is particularly helpful on the internet because a user could simply request “please don’t record my search” or “please don’t record which news articles I read”. Needing to do this for every search or every news article would be annoying. However, this is easily fixed by having a system wide setting—perhaps a special browser cookie which says “please don’t record me” that any site could check. None of this is helpful for cameras (where no interface exists) or monetary transactions (where the transaction itself determines whether or not some item is shipped).

The power balance privacy problem is much more difficult. Some solutions that people attempt are:

  1. Accept the change in power balance. This is the default action. There are plenty of historical examples where large organizations have abused their power, so providing them more power to abuse may be unwise.
  2. Legislate a halt. Forbid cameras in public places. Forbid the collection or retention of data by any organization. The problem with this method is that technology simply isn’t moving in this direction. At some point, we may end up with cameras and storage devices so small, cheap, and portable that forbidding their use is essentially absurd. The other difficulty with this solution is that it keeps good things from happening. For example, a reasonable argument can be made that the British were effective at tracking bomb planters because the cameras of London helped them source attacks.
  3. Legislate an acceleration. Instead of halting the collection of data, open it up to more general use. One example of this is cameras in police cars in the US. Recordings from these cameras can often settle disputes very definitively. As technology improves, it’s reasonable to expect cameras just about anywhere people are in public. Some legislation and good engineering could make these cameras available to anyone. This would involve a substantial shift in cultural norms—essentially people would always be in potential public view when not at home. This directly collides with the “privacy as a cultural norm” privacy problem.

The hardness of the privacy problem mentioned at the post beginning implies difficult tradeoffs.

  1. If you have cultural norm privacy concerns, then you really don’t appreciate method (3) for power balance privacy concerns.
  2. If you value privacy greatly and the default action is taken, then you prefer monopolistic marketplaces. The advantages of a large amount of private data are prohibitive to new market entrance.
  3. If you want the internet to work better, then there are limits on how little data can be collected.

All of the above is even murkier because what can be done with data is not fully known, nor is what can be done in a privacy sensitive way.


Asking questions

Tags: Research jl@ 6:02 am

There are very substantial differences in how question asking is viewed culturally. For example, all of the following are common:

  1. If no one asks a question, then no one is paying attention.
  2. To ask a question is disrespectful of the speaker.
  3. Asking a question is admitting your own ignorance.

The first view seems to be the right one for research, for several reasons.

  1. Research is quite hard—it’s difficult to guess how people won’t understand something in advance while preparing a presentation. Consequently, it’s very common to lose people. No worthwhile presenter wants that.
  2. Real understanding is precious. By asking a question, you are really declaring “I want to understand”, and everyone should respect that.
  3. Asking a question wakes you up. I don’t mean from “asleep” to “awake” but from “awake” to “really awake”. It’s easy to drift through something sort-of-understanding. When you ask a question, especially because you are on the spot, you will do much better.

Some of these effects might seem minor, but they accumulate over time, and their accumulation can have a big effect in terms of knowledge and understanding by the questioner as well as how well ideas are communicated. A final piece of evidence comes from checking cultural backgrounds. People with a cultural background that accepts question asking simply tend to do better in research. If this isn’t you, it’s ok. By being conscious of the need to ask questions and working up the courage to do it, you can do fine.

A reasonable default is that the time to not ask a question is when the speaker (or the environment) explicitly say “let me make progress in the talk”.


The View From China

Tags: Machine Learning,Research jl@ 7:44 am

I’m visiting Beijing for the Pao-Lu Hsu Statistics Conference on Machine Learning.

I had several discussions about the state of Chinese research. Given the large population and economy, you might expect substantial research—more than has been observed at international conferences. The fundamental problem seems to be the Cultural Revolution which lobotimized higher education, and the research associated with it. There has been a process of slow recovery since then, which has begun to be felt in the research world via increased participation in international conferences and (now) conferences in China.

The amount of effort going into construction in Beijing is very impressive—people are literally building a skyscraper at night outside the window of the hotel I’m staying at (and this is not unusual). If a small fraction of this effort is later focused onto supporting research, the effect could be very substantial. General growth in China’s research portfolio should be expected.


Presentation Preparation

Tags: Research jl@ 11:02 pm

A big part of doing research is presenting it at a conference. Since many people start out shy of public presentations, this can be a substantial challenge. Here are a few notes which might be helpful when thinking about preparing a presentation on research.

  1. Motivate. Talks which don’t start by describing the problem to solve cause many people to zone out.
  2. Prioritize. It is typical that you have more things to say than time to say them, and many presenters fall into the failure mode of trying to say too much. This is an easy-to-understand failure mode as it’s very natural to want to include everything. A basic fact is: you can’t. Example of this are:
    1. Your slides are so densely full of equations and words that you can’t cover them.
    2. Your talk runs over and a moderator prioritizes for you by cutting you off.
    3. You motor-mouth through the presentation, and the information absorption rate of the audience prioritizes in some uncontrolled fashion.
    4. The rate of flow of concepts simply exceeds the information capacity of the audience. Even with nondense slides and an easy succinct delivery, this can often happen.

    One way to prioritize is figure out: “What would I present in 1 minute?” or “What would I present in 5 minutes?”, and then let this guide your overall presentation.

  3. Unassume. When you are working in an area, it’s typical to buildup an internal shorthand for concepts. This needs to be peeled away when preparing a presentation. Decide what the minimal set of concepts are, and then be sure to define them as they are introduced. For people familiar with the basic concepts, this gives them a way to reconcile choices of language, and others at least have a prayer of following.
  4. Practice Well. Some people try to get a talk right by practicing it relentlessly until it is memorized, and then deliver it as a memorized monologue. This is terrible, because people in the audience know it is a memorized monologue and zone out. A good talk is delivered like a conversation, where it happens to be your turn to speak for awhile, and practicing that is more difficult. Some practice by yourself can be helpful—but not too much. A much better method is to practice on your friends by delivering to them before delivering it to the wider world.

The points above avoid the common failure modes which seem to come up with first-time presenters. There is much more advice to give (and for me to learn) about giving better presentations.


The Coming Patent Apocalypse

Tags: Funding,Research jl@ 5:46 pm

Many people in computer science believe that patents are problematic. The truth is even worse—the patent system in the US is fundamentally broken in ways that will require much more significant reform than is being considered now.

The myth of the patent is the following: Patents are a mechanism for inventors to be compensated according to the value of their inventions while making the invention available to all. This myth sounds pretty desirable, but the reality is a strange distortion slowly leading towards collapse.

There are many problems associated with patents, but I would like to focus on just two of them:

  1. Patent Trolls The way that patents have generally worked over the last several decades is that they were a tool of large companies. Large companies would amass a large number of patents and then cross-license each other’s patents—in effect saying “we agree to owe each other nothing”. Smaller companies would sometimes lose in this game, essentially because they didn’t have enough patents to convince the larger companies that cross-licensing was a good idea. However, they didn’t necessarily lose, because small companies are also doing fewer things which makes their patent violation profile smaller.

    The patent trolls arrived recently. They are a new breed of company which does nothing but produce patents and collect money from them. The thing which distinguishes patent troll companies is that they have no patent violation profile. A company with a large number of patents can not credibly threaten to sue them unless they cross-license their patents, because they don’t do anything which violates a patent.

    The best example (and proof that this method works) is NTP, which extracted $612.5M from RIM. Although this is the big case with lots of publicity, the process of extracting money goes on constantly all around your in backroom negotiations with the companies that actually do things. In effect, patent trolls impose an invisible tax on companies that do things by companies that don’t. Restated in another way, patent trolls are akin to exploiting tax loopholes—except they exploit the law to make money rather than simply to avoid losing it. Smaller companies are particularly prone to lose, because they simply can not afford the extreme legal fees associated with fighting even a winning battle, but even large companies are also vulnerable to a patent troll.

    The other side of this argument is that patent trolls are simply performing a useful business function: employing researchers to come up with ideas or (at least) putting a floor on the value of ideas which they buy up through patents. Unfortunately, this is simply not true in my experience, due to the next problem.

  2. Combinatorial Ideas Patents are too easy. In fact, the process of coming up with patentable ideas is simply a matter of combinatorial application of existing ideas. This is a simple game that any reasonably intelligent person can play: you take idea 1 and idea 2, glue them together in any reasonable way, and get a patent.

    There are several reasons why the combinatorial application of existing ideas has become standard for patents.

    1. One of these is regulatory capture. It should surprise no one that the patent office, which gets paid for every patent application, has found a way to increase the number of patent applications.
    2. Another reason has to do with the way that patent law developed. Initially, patents were for processes doing things, rather than ideas. The scope of patents has steadily extended over time but the basic idea of patenting a process has been preserved. The fundamental problem is that processes can be created by the combinatorial application of ideas.

    The ease of patents is fundamentally valuable to patent troll type companies because they can acquire a large number of patents on processes which other companies accidentally violate.

The patent apocalypse happens when we project forward these two trends. Patents become ever easier to acquire and patent troll companies become ever more prevalent. In the end, every company which does something uses some obvious process that violates someone’s patent, and they have to pay at rates the patent owner chooses. There is no inherent bound on the number of patent troll type companies which can exist—they can multiply unchecked and drain money from every other company which does things until the system collapses.

I would like to make some positive suggestions here about how to reform the patent system, but it’s a hard mechanism design problem. Some obvious points are:

  1. Patents should have higher standards than research papers rather than substantially lower standards.
  2. The patent office should not make money from patents (this is not equivalent to saying that the patent applications should not be charged).
  3. The decision in whether or not to grant a patent should weigh the cost of granting. Many private property afficionados think “patents are great, because they compensate inventors”, but there is a real cost to society in granting obvious patents. You block people from doing things in the straightforward way or create hidden liabilities.
  4. Patents should be far more readable if the “available for all” part of the myth is to be upheld. Published academic papers are substantially more readable (which isn’t all that high of a bar).

Patent troll companies have found a clever way to exhibit the flaws in the current patent system. Substantial patent reform to eliminate this style of company would benefit just about everyone, except for these companies.


Davor has been working to setup which is the new site for the many lectures mentioned here. (Tragically, they seem to only be available in windows media format.) I went through my own projects and added a few links to the videos. The day when every result is a set of {paper, slides, video} isn’t quite here yet, but it’s within sight. (For many papers, of course, code is a 4th component.)


The Forgetting

How many papers do you remember from 2006? 2005? 2002? 1997? 1987? 1967? One way to judge this would be to look at the citations of the papers you write—how many came from which year? For myself, the answers on recent papers are:

year 2006 2005 2002 1997 1987 1967
count 4 10 5 1 0 0

This spectrum is fairly typical of papers in general. There are many reasons that citations are focused on recent papers.

  1. The number of papers being published continues to grow. This is not a very significant effect, because the rate of publication has not grown nearly as fast.
  2. Dead men don’t reject your papers for not citing them. This reason seems lame, because it’s a distortion from the ideal of science. Nevertheless, it must be stated because the effect can be significant.
  3. In 1997, I started as a PhD student. Naturally, papers after 1997 are better remembered because they were absorbed in real time. A large fraction of people writing papers and attending conferences haven’t been doing it for 10 years.
  4. Old papers aren’t on the internet. This is huge effect for any papers prior to 1995 (or so). The ease of examining a paper greatly influences the ability of an author to read and understand it. There are a number of journals which essentially have “internet access for the privileged elite who are willing to pay”. In my experience, this is only marginally better than having them stuck in the library.
  5. The recent past is more relevant to the present than the far past. There is a lot of truth in this—people discover and promote various problems or techniques which take off for awhile, until their turn to be forgotten arrives.

Should we be disturbed by this forgetting? There are a few good effects. For example, when people forget, they reinvent, and sometimes they reinvent better. Nevertheless, it seems like the effect of forgetting is bad overall, because it causes wasted effort. There are two implications:

  1. For paper writers, it is very common to overestimate the value of a paper, even though we know that the impact of most papers is bounded in time. Perhaps by looking at those older papers, we can get an idea of what is important in the long term. For example, looking at my own older citations, simplicity is it. If you want a paper to have a long term impact, it needs to have a simple algorithm, analysis method, or setting. Fundamentally, only those things which are teachable survive. Was your last paper simple? Could you teach it in a class? Are other people going to start doing so? Are the review criteria promoting the papers which a hope of survival?
  2. For conference organizers, it’s important to understand the way science has changed. Originally, you had to be a giant to succeed at science. Then, you merely had to stand on the shoulders of giants to succeed. Now, it seems that even the ability to peer over the shoulders of people standing on the shoulders of giants might be helpful. This is generally a good thing, because it means more people can help on a very hard task. Nevertheless, it seems that much of this effort is getting wasted in forgetting, because we do not have the right mechanisms to remember the information. Which is going to be the first conference to switch away from an ordered list of papers to something with structure? Wouldn’t it be great if all the content at a conference was organized in a wikipedia-like easy-for-outsiders-to-understand style?


Best Practices for Collaboration

Tags: Papers,Research jl@ 1:51 pm

Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote people can be tricky. Here are some observations of what has worked for me on collaborations involving a few people.

  1. Travel and Discuss Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet.
  2. Enable your collaborator. A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think.
    1. Avoid Duplication. Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this to be explicit about who has write permission to what. Most of the time, a write lock is held for the entire document, just to be sure.
    2. Don’t keep the write lock unnecessarily. Some people are perfectionists so they have a real problem giving up the write lock on a draft until it is perfect. This prevents other collaborators from doing things. Releasing write lock (at least) when you sleep, is a good idea.
    3. Send all necessary materials. Some people try to save space or bandwidth by not passing ‘.bib’ files or other auxiliary components. Forcing your collaborator to deal with the missing subdocument problem is disabling. Space and bandwidth are cheap while your collaborators time is precious. (Sending may be pass-by-reference rather than attach-to-message in most cases.)
    4. Version Control. This doesn’t mean “use version control software”, although that’s fine. Instead, it means: have a version number for drafts passed back and forth. This means you can talk about “draft 3″ rather than “the draft that was passed last tuesday”. Coupled with “send all necessary materials”, this implies that you naturally backup previous work.
  3. Be Generous. It’s common for people to feel insecure about what they have done or how much “credit” they should get.
    1. Coauthor standing. When deciding who should have a chance to be a coauthor, the rule should be “anyone who has helped produce a result conditioned on previous work”. “Helped produce” is often interpreted too narrowly—a theoretician should be generous about crediting experimental results and vice-versa. Potential coauthors may decline (and senior ones often do so). Control over who is a coauthor is best (and most naturally) exercised by the choice of who you talk to.
    2. Author ordering. Author ordering is the wrong thing to worry about, so don’t. The CS theory community has a substantial advantage here because they default to alpha-by-author ordering, as is understood by everyone.
    3. Who presents. A good default for presentations at a conference is “student presents” (or suitable generalizations). This gives young people a real chance to get involved and learn how things are done. Senior collaborators already have plentiful alternative methods to present research at workshops or invited talks.
  4. Communicate by default Not cc’ing a collaborator is a bad idea. Even if you have a very specific question for one collaborator and not another, it’s a good idea to cc everyone. In the worst case, this is a few-second annoyance for the other collaborator. In the best case, the exchange answers unasked questions. This also prevents “conversation shifts into subjects interesting to everyone, but oops! you weren’t cced” problem.

These practices are imperfectly followed even by me, but they are a good ideal to strive for.


Recruitment Conferences

One of the subsidiary roles of conferences is recruitment. NIPS is optimally placed in time for this because it falls right before the major recruitment season.

I personally found job hunting embarrassing, and was relatively inept at it. I expect this is true of many people, because it is not something done often.

The basic rule is: make the plausible hirers aware of your interest. Any corporate sponsor is a “plausible”, regardless of whether or not there is a booth. CRA and the acm job center are other reasonable sources.

There are substantial differences between the different possibilities. Putting some effort into understanding the distinctions is a good idea, although you should always remember where the other person is coming from.


Structural Problems in NIPS Decision Making

This is a very difficult post to write, because it is about a perenially touchy subject. Nevertheless, it is an important one which needs to be thought about carefully.

There are a few things which should be understood:

  1. The system is changing and responsive. We-the-authors are we-the-reviewers, we-the-PC, and even we-the-NIPS-board. NIPS has implemented ‘secondary program chairs’, ‘author response’, and ‘double blind reviewing’ in the last few years to help with the decision process, and more changes may happen in the future.
  2. Agreement creates a perception of correctness. When any PC meets and makes a group decision about a paper, there is a strong tendency for the reinforcement inherent in a group decision to create the perception of correctness. For the many people who have been on the NIPS PC it’s reasonable to entertain a healthy skepticism in the face of this reinforcing certainty.
  3. This post is about structural problems. What problems arise because of the structure of the process? The post is not about individual people, because this is unlikely to be fruitful.

Although the subject is nominally about NIPS (which I have experience with as an author, reviewer, and PC member), the points may apply elsewhere.

For those that don’t know, it’s worth reviewing how the NIPS process currently works. Temporally, it looks like the following:

  1. PC chair is appointed.
  2. PC chair picks PC committee to cover many different areas. NIPS is notably diverse.
  3. PC committee members pick reviewers for their areas.
  4. Authors submit blinded papers.
  5. Papers are assigned to two PC committee members, the “primary” and the “secondary”.
  6. Reviewers bid for papers within their areas which they want and don’t want to review.
  7. Reviewers are assigned papers based on bid plus coverage.
  8. Reviewers review papers.
  9. Authors respond to blinded reviews.
  10. Reviewers discuss and rate papers.
  11. PC members digest author/reviewer interaction (and sometimes the paper) into an impression.
  12. PC members meet physically at the PC meeting.
  13. PC members present all papers that they believe are worth considering to other PC members and a decision is made.

Naturally, there are many details left out of this long list.

Here is my attempt to describe the problems I’ve seen:

  1. Attention deficit disorder. The attention paid to individual accept/reject decisions is (and structurally must be) small. There are several effects which drive this:
    1. The people on the NIPS PC are typically busy and time constrained.
    2. The number of papers assigned to individual PC members is large—perhaps 40 to 80, plus a similar number assigned as a secondary.
    3. Many of the people have traveled a very long ways to reach the PC meeting. Jetlag is common, and often significantly effects your ability to think carefully.
    4. The meeting itself is 2 days long. The average time spent on any decision must be less than 5 minutes, and everyone knows this. The implicit encouragement to digest a paper down to its most simple description is significant. No one on the PC has seen the paper except for the primary and the secondary (if you are lucky) PC members, so decisions are made quickly based upon relatively little information. (This is better than it sounds in most cases because effectively the decision was made by the primary PC member beforehand.)
  2. Artificial scarcity. NIPS is a single track conference with 3 levels of acceptance “Accept for an oral presentation”, “Accept for a poster with a spotlight”, and “Accept as a poster only”. It’s fairly difficult to justify a paper as “of broad interest”, which is ideal for an oral presentation. Will a neuroscientist really pay attention to this learning theory paper? Is this dimensionality reduction algorithm going to interest someone in learning theory? It’s substantially easier to justify a paper as “possibly of interest to a number of people”, which is about right for poster spotlight. Since the number of spotlights and the number of orals is similar, two effects occur: papers which are about right for spotlights become orals, and many reasonable spotlights aren’t spotlights because they don’t fit.
  3. The Veto Effect. If someone on the PC has a strong dislike for your paper, there is a very good chance for reject. This is true even when attention is explicitly payed by the PC chair to avoiding the veto problem. It’s even true when your paper has the strongest reviews in the area (no joke!). There are several fundamental problems here:
    1. People, especially in person, do not generally want to be confrontational. Consequently, if someone who is rarely confrontational speaks strongly against a paper, it’s rare2 for an alternate voice to be heard.
    2. It is easy to instill “fear, uncertainty, and doubt” in people. Was this paper covering the same material as some other paper no one knows? Are the assumptions criticizable? This problem is greatly exaggerated by attention deficit disorder.

It is easy to complain about these problems and substantially harder to fix them. (There is previous discussion on this.) Here is my best attempt to imagine fixes.

  1. Attention Deficit Disorder. The fundamental problem here is that papers aren’t getting the attention that they deserve by the final decision maker. Several changes might help, but nothing is going to be a silver bullet here.
    1. Author responsibility. Unfortunately, some authors abuse the system by submitting papers which should not be submitted. Much of this has to do with inexperience—many authors are first time paper writers. For these authors, some better effort educating people about what is an appropriate paper is good. This year, an effort was made to do this, and followups may be helpful. For a small fraction of papers, authors intentionally skate the edge of what is reasonable. Should an ICML paper with 30% different content be submitted to NIPS? This small fraction takes more time than their fraction indicates and (frankly) isn’t always caught. Some form of “shame list” may be an appropriate way to deal with this, although much caution would have to be exercised.
    2. Many of the problems here are unremovable artifacts of a physically present PC meeting. Going to a virtualized process would eliminate these problems (and introduce others). Any such decision would have to be carefully considered, but it is not impossible—there are plenty of succesful conference committees which never meet physically.
    3. The PC meeting can be run a bit differently.
      1. Bob Williamson and I managed to go through our secondary assignments and make independent decisions, then reconcile. In contrast, for most papers, the secondary PC member was inoperative at the PC meeting. This made some difference, and it’s easy to imagine that systematically having this reconciliation be a part of the PC meeting is helpful. The reconciliation step does not take very long and is parallelizable.
      2. Not making a decision at the PC meeting could be a real option for a small number of troublesome papers. There is perhaps a week-long timegap between the PC meeting and the release of the decisions during which decisions could be double checked. This option must only be used rarely, and never as a means for excluding interested PC members from the decision.
      3. Information can be more widely shared. I don’t see any real advantage to limiting the knowledge of papers not in your area to “title+authors”. At the PC meeting itself, it would be helpful to have all of the papers available to all of the members.
  2. Artificial Scarcity. My understanding is that the makers of NIPS purposefully preferred a single track conference, and it’s hard to argue with the success NIPS has enjoyed. Nevertheless, it seems notable that the NIPS workshops (which are excessively multitracked) are more succesful than the NIPS conference by some measures. Going to a two-track or partially two-track format would ease some of the decision making.

    Even working within the single track format, it’s not clear that the ratio between orals and spotlights is right. Spotlights take about 1/10th the time that an oral presentation takes, and yet only 1/10th or so of the overall time is allocated to spotlight presentations. Losing one oral presentation (out of about 20) would yield a
    significant increase in the number of spotlights, and it’s easy to imagine this would be beneficial to attendees while easing decision making.

  3. The Veto Effect. The veto effect is hard to deal with, and it’s only relevant to a small number of decisions. Nevertheless it’s important because some of the best papers are controversial at the time they are published. The are two ways I can imagine for dealing with the veto effect: (1) allowing author feedback (2) devolving power from the PC to the reviewers. Allowing author feedback would have to be coupled with delayed decision making. Eliminating the power of the PC to reject very highly rated papers is also controversial, but may be worth considering.


Continuizing Solutions

This post is about a general technique for problem solving which I’ve never seen taught (in full generality), but which I’ve found very useful.

Many problems in computer science turn out to be discretely difficult. The best known version of such problems are NP-hard problems, but I mean ‘discretely difficult’ in a much more general way, which I only know how to capture by examples.

  1. ERM In empirical risk minimization, you choose a minimum error rate classifier from a set of classifiers. This is NP hard for common sets, but it can be much harder, depending on the set.
  2. Experts In the online learning with experts setting, you try to predict well so as to compete with a set of (adversarial) experts. Here the alternating quantifiers of you and an adversary playing out a game can yield a dynamic programming problem that grows exponentially.
  3. Policy Iteration The problem with policy iteration is that you learn a new policy with respect to an old policy, which implies that simply adopting the new policy can go very wrong.

For each of these problems, there are “continuized” solutions which can yield smaller computation, more elegant mathematics, or both.

  1. ERM By shifting from choosing a single classifier to choosing a stochastic classifier we can prove a new style of bound which is significantly tighter, easier to state, and easier to understand than traditional bounds in the traditional setting. This is the PAC-Bayes bound idea.
  2. Experts By giving the adversary slightly more power—the ability to split experts and have them fractionally predict one way vs. another, the optimal policy becomes much easier to compute (quadratic in the horizon, or maybe less). This is the continuous experts idea.
  3. Policy Iteration For policy iteration, by stochastically mixing the old and the new policy, we can find a new policy better than the old policy. This is the conservative policy iteration idea.

There is some danger to continuizing. The first and second examples both involve a setting shift, which may not be valid—in general your setting should reflect your real problem rather than the thing which is easy to solve. However, even with the setting shift, the solutions appear so compellingly more elegant that it is hard to not hope to use them in a solution to the original setting.

I have not seen a good formulation of the general approach of continuizing. Nevertheless, I expect to see continuizing in more places and to use it in the future. By making it explicit, perhaps this can be made eaesier.


Health of Conferences Wiki

Aaron Hertzmann points out the health of conferences wiki, which has a great deal of information about how many different conferences function.


Luis von Ahn is awarded a MacArthur fellowship.

For his work on the subject of human computation including ESPGame, Peekaboom, and Phetch. The new MacArthur fellows.


Incentive Compatible Reviewing

Tags: Papers,Research,Reviewing jl@ 10:13 pm

Reviewing is a fairly formal process which is integral to the way academia is run. Given this integral nature, the quality of reviewing is often frustrating. I’ve seen plenty of examples of false statements, misbeliefs, reading what isn’t written, etc…, and I’m sure many other people have as well.

Recently, mechanisms like double blind review and author feedback have been introduced to try to make the process more fair and accurate in many machine learning (and related) conferences. My personal experience is that these mechanisms help, especially the author feedback. Nevertheless, some problems remain.

The game theory take on reviewing is that the incentive for truthful reviewing isn’t there. Since reviewers are also authors, there are sometimes perverse incentives created and acted upon. (Incidentially, these incentives can be both positive and negative.)

Setting up a truthful reviewing system is tricky because their is no final reference truth available in any acceptable (say: subyear) timespan. There are several ways we could try to get around this.

  1. We could try to engineer new mechanisms for finding a reference truth into a conference and then use a ‘proper scoring rule’ which is incentive compatible. For example, we could have a survey where conference participants short list the papers which interested them. There are significant problems here:
    1. Conference presentations mostly function as announcements of results. Consequently, the understanding of the paper at the conference is not nearly as deep as, say, after reading through it carefully in a reading group.
    2. This is inherently useless for judging reviews of rejected papers and it is highly biased for judging reviews of papers presented in two different formats (say, a poster versus an oral presentation).
  2. We could ignore the time issue and try to measure reviewer performance based upon (say) long term citation count. Aside from the bias problems above, there is also a huge problem associated with turnover. Who the reviewers are and how an individual reviewer reviews may change drastically in just a 5 year timespan. A system which can provide track records for only a small subset of current reviewers isn’t very capable.
  3. We could try to manufacture an incentive compatible system even when the truth is never known. This paper by Nolan Miller, Paul Resnick, and Richard Zeckhauser discusses the feasibility of this approach. Essentially, the scheme works by rewarding reviewer i according to a proper scoring rule applied to P(reviewer j’s score | reviewer i’s score). (A simple example of a proper scoring rule is log[P()].) This is approach is pretty fresh, so there are lots of problems, some of which may or may not be fundamental difficulties for application in practice. The significant problem I see is that this mechanism may reward joint agreement instead of a good contribution towards good joint decision making.

None of these mechanisms are perfect, but they may each yield a little bit of information about what was or was not a good decision over time. Combining these sources of information to create some reviewer judgement system may yield another small improvement in the reviewing process.

The important thing to remember is that we are the reviewers as well as the authors. Are we interested in tracking our reviewing performance over time in order to make better judgements? Such tracking often happens on an anecdotal or personal basis, but shifting to an automated incentive compatible system would be a big change in scope.


How to solve an NP hard problem in quadratic time

This title is a lie, but it is a special lie which has a bit of truth.

If n players each play each other, you have a tournament. How do you order the players from weakest to strongest?

The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arcset” problem in the CS theory literature and it is a well known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees k times (due to noise for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can be better).

One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith, Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is a 5-approximation to the NP-hard problem.

An even better approach is to realize that the NP hard problem may not be the real problem. The real problem may be finding a good approximation to the “true” intrinsic ordering given noisy tournament information.

In a learning setting, the simplest form of ranking problem is “bipartite ranking” where every element has a value of either 0 or 1 and we want to order 0s before 1s. A common way to measure the performance of bipartite ranking is according to “area under the ROC curve” (AUC) = 1 – the fraction of out-of-order pairs. Nina, Alina, Greg and I proved that if we learn a comparison function which errs on k dissimilar pairs, then ordering according to the number of wins yields an order within 4k edge reversals of the original ordering. As a reduction statement(*), this shows that an error rate of e for a learned pairwise binary classifier produces an ordering with an expected AUC of 1 – 4e. The same inequality even holds for a (stronger) regret transform. If r = e – emin is the regret of the binary pairwise classifier, then the AUC regret is bounded by 4r. (Here emin is the error rate of the best possible classifier which predicts knowing the most likely outcome.) The regret result extends to more general measures of ordering than simply AUC.

We were unable to find any examples where ordering according to the degree produced more than a 2r AUC regret. Nikhil Bansal, Don, and Greg have worked out a tightened proof which gives exactly this upper bound. At the end of the day, we have an algorithm with satisfies precisely the same guarantee as the NP hard solution.

There are two lessons here. The learning lesson is that a good pairwise comparator implies the ability to rank well according to AUC. The general research lesson is that an NP hard problem for an approximate solution is not an intrinsic obstacle. Sometimes there exist simple tractable algorithms which satisfy the same guarantees as the NP hard solution.

(*) To prove the reduction, you must make sure that your form pairwise examples in the right way. Your source of pairwise ordering examples must be uniform over the dissimilar pairs containing one example with label 1 and one example with label 0.


Precision is not accuracy

Tags: Machine Learning,Research jl@ 9:23 pm

In my experience, there are two different groups of people who believe the same thing: the mathematics encountered in typical machine learning conference papers is often of questionable value.
The two groups who agree on this are applied machine learning people who have given up on math, and mature theoreticians who understand the limits of theory.

Partly, this is just a statement about where we are with respect to machine learning. In particular, we have no mechanism capable of generating a prescription for how to solve all learning problems. In the absence of such certainty, people try to come up with formalisms that partially describe and motivate how and why they do things. This is natural and healthy—we might hope that it will eventually lead to just such a mechanism.

But, part of this is simply an emphasis on complexity over clarity. A very natural and simple theoretical statement is often obscured by complexifications. Common sources of complexification include:

  1. Generalization By trying to make a statement that applies in the most general possible setting, your theorem becomes excessively hard to read.
  2. Specialization Your theorem relies upon so many assumptions that it is hard for a simple reader to hold them all in their head.
  3. Obscuration Your theorem relies upon cumbersome notation full of subsubsuperscripts, badly named variables, etc…

There are several reasons why complexification occurs.

  1. Excessive generalization often happens when authors have an idea and want to completely exploit it. So, various bells and whistles are added until the core idea is obscured.
  2. Excessive specialization often happens when authors have some algorithm they really want to prove works. So, they start making one assumption after another until the proof goes through.
  3. Obscuration is far more subtle than it sounds. Some of the worst obscurations come from using an old standard notation which has simply been pushed to far.

After doing research for awhile, you realize that these complexifications are counterproductive. Type (1) complexifications make it double hard for others to do follow-on work: your paper is hard to read and you have eliminated the possibility. Type (2) complexifications look like “the tail wags the dog”—the math isn’t really working until it guides the algorithm design. Figuring out how to remove the assumptions often results in the better algoritihm. Type (3) complexifications are an error. Fooling around to find a better notation is one of the ways that we sometimes make new discoveries.

The worst reason, I’ve saved for last: it’s that the reviewing process emphasizes precision over accuracy. Imagine shooting a math gun at a machine learning target. A high precision math gun will very carefully guide the bullets to strike a fixed location—even though the location may have little to do with the target. An accurate math gun will point at the correct target. A precision/accuracy tradeoff is often encountered: we don’t know how to think about the actual machine learning problem, so instead we very precisely think about another not-quite-right problem. A reviewer almost invariably prefers the more precise (but less accurate) paper because precision is the easy thing to check and think about.

There seems to be no easy fix for this—people naturally prefer to value the things they can measure. The hard fix for this is more time spent by everyone thinking about what the real machine learning problems are.


Who is having visa problems reaching US conferences?

Tags: Machine Learning,Research jl@ 8:56 pm

Many of the large machine learning conferences were in the US this summer. A common problem which students from abroad encounter is visa issues.
Just getting a visa to visit can be pretty rough: you stand around in lines, sometimes for days. Even worse is the timing with respect to ticket buying. Airplane tickets typically need to be bought well in advance on nonrefundable terms to secure a reasonable rate for air travel. When a visa is denied, as happens reasonably often, a very expensive ticket is burnt.

A serious effort is under way to raise this as in issue in need of fixing. Over the long term, effectively driving research conferences to locate outside of the US seems an unwise policy. Robert Schapire is planning to talk to a congressman. Sally Goldman suggested putting together a list of problem cases, and Phil Long setup an email address to collect them.

If you (or someone you know) has had insurmountable difficulties reaching a conference in the US, please send an email with:

Email address:
Details: (be brief please)

We expect most of the problem cases are students, so don’t be shy.


New Models

Tags: Machine Learning,Research jl@ 6:10 pm

How should we, as researchers in machine learning, organize ourselves?

The most immediate measurable objective of computer science research is publishing a paper. The most difficult aspect of publishing a paper is having reviewers accept and recommend it for publication. The simplest mechanism for doing this is to show theoretical progress on some standard, well-known easily understood problem.

In doing this, we often fall into a local minima of the research process. The basic problem in machine learning is that it is very unclear that the mathematical model is the right one for the (or some) real problem. A good mathematical model in machine learning should have one fundamental trait: it should aid the design of effective learning algorithms. To date, our ability to solve interesting learning problems (speech recognition, machine translation, object recognition, etc…) remains limited (although improving), so the “rightness” of our models is in doubt.

If our mathematical models are bad, the simple mechanism of research above can not yield the end goal. (This should be agreed on even by people who disagree about what the end goal of machine learning is!) Instead, research which proposes and investigates new mathematical models for machine learning might yield the end goal. Doing this is hard.

  1. Coming up with a new mathematical model is just plain not easy. Some sources of inspiration include:
    1. Watching carefully: what happens succesfully in practice, can often be abstracted into a mathematical model.
    2. Swapping fields: In other fields (for example crypto), other methods of analysis have been developed. Sometimes, these methods can be transferred.
    3. Model repair: Existing mathematical models often have easily comprehendable failure modes. By thinking about how to avoid such failure modes, we can sometimes produce a new mathematical model.
  2. Speaking about a new model is hard. The difficulty starts with you in explaining it. Often, when trying to converge on a new model, we think of it in terms of the difference with respect to an older model, leading to a tangled explanation. The difficulty continues with other people (in particular: reviewers) reading it. For a reviewer with limited time, it is very tempting to assume that any particular paper is operating in some familiar model and fail out. The best approach here seems to be super explicitness. You can’t be too blunt about saying “this isn’t the model you are thinking about”.
  3. Succeeding with new models is also hard. When people don’t have a reference frame to understand the new model, they are unlikely to follow up, as is necessary for success in academia.

The good news here is that a succesful new model can be a big win. I wish it was an easier win: the barriers to success are formidably high, and it seems we should do everything possible to lower the barriers to success for the sake of improving research.


Presentation of Proofs is Hard.

Tags: Research,Teaching jl@ 7:38 pm

When presenting part of the Reinforcement Learning theory tutorial at ICML 2006, I was forcibly reminded of this.

There are several difficulties.

  1. When creating the presentation, the correct level of detail is tricky. With too much detail, the proof takes too much time and people may be lost to boredom. With too little detail, the steps of the proof involve too-great a jump. This is very difficult to judge.

    1. What may be an easy step in the careful thought of a quiet room is not so easy when you are occupied by the process of presentation.
    2. What may be easy after having gone over this (and other) proofs is not so easy to follow in the first pass by a viewer.

    These problems seem only correctable by process of repeated test-and-revise.

  2. When presenting the proof, simply speaking with sufficient precision is substantially harder than in normal conversation (where precision is not so critical). Practice can help here.
  3. When presenting the proof, going at the right pace for understanding is difficult. When we use a blackboard/whiteboard, a natural reasonable pace is imposed by the process of writing. Unfortunately, writing doesn’t scale well to large audiences for vision reasons, losing this natural pacing mechanism.
  4. It is difficult to entertain with a proof—there is nothing particularly funny about it. This particularly matters for a large audience which tends to naturally develop an expectation of being entertained.

Given all these difficulties, it is very tempting to avoid presenting proofs. Avoiding the proof in any serious detail is fairly reasonable in a conference presentation—the time is too short and the people viewing are too heavily overloaded to follow the logic well. The “right” level of detail is often the theorem statement.

Nevertheless, avoidance is not always possible because the proof is one of the more powerful mechanisms we have for doing research.

« Newer PostsOlder Posts »

Powered by WordPress