Machine Learning (Theory)


Complexity Illness

One of the enduring stereotypes of academia is that people spend a great deal of intelligence, time, and effort finding complexity rather than simplicity. This is at least anecdotally true in my experience.

  1. Math++ Several people have found that adding useless math makes their paper more publishable as evidenced by a reject-add-accept sequence.
  2. 8 page minimum Who submitted a paper to ICML violating the 8 page minimum? Every author fears that the reviewers won’t take their work seriously unless the allowed length is fully used. The best minimum violation I know is Adam‘s paper at SODA on generating random factored numbers, but this is deeply exceptional. It’s a fair bet that 90% of papers submitted are exactly at the page limit. We could imagine that this is because papers naturally take more space, but few people seem to be clamoring for more space.
  3. Journalong Has anyone been asked to review a 100 page journal paper? I have. Journal papers can be nice, because they give an author the opportunity to write without sharp deadlines or page limit constraints, but this can and does go awry.

Complexity illness is a burden on the community. It means authors spend more time filling out papers, reviewers spend more time reviewing, and (most importantly) effort is misplaced on complex solutions over simple solutions, ultimately slowing (sometimes crippling) the long term impact of an academic community.

It’s difficult to imagine an author-driven solution to complexity illness, because the incentives are simply wrong. Reviewing based on solution value rather than complexity is a good way for individual people to reduce the problem. More generally, it would be great to have a system which explicitly encourages research without excessive complexity. The best example of this seems to be education—it’s the great decomplexifier. The process of teaching something greatly encourages teaching the simple solution, because that is what can be understood. This seems to be true both of traditional education and less conventional means such as wikipedia articles. I’m not sure exactly how to use this observation—Is there some way we can shift conference formats towards the process of creating teachable material?


Research Political Issues

Tags: Funding,General,Research jl@ 5:02 pm

I’ve avoided discussing politics here, although not for lack of interest. The problem with discussing politics is that it’s customary for people to say much based upon little information. Nevertheless, politics can have a substantial impact on science (and we might hope for the vice-versa). It’s primary election time in the United States, so the topic is timely, although the issues are not.

There are several policy decisions which substantially effect development of science and technology in the US.

  1. Education The US has great contrasts in education. The top universities are very good places, yet the grade school education system produces mediocre results. For me, the contrast between a public education and Caltech was bracing. For many others attending Caltech, it clearly was not. Upgrading the k-12 education system in the US is a long-standing chronic problem which I know relatively little about. My own experience is that a basic attitude of “no child unrealized” is better than “no child left behind”. A fair claim can also be made that the US just doesn’t invest enough.
  2. Respect Lack of respect for science and technology is routinely expressed in many ways in the US.
    1. The most bald form of lack of respect is scientific censorship. This may be easily understood as a generality: you choose to spend a large fraction of your life learning to interpret some part of the world. After years, you come to some conclusion about the nature of the world. Then, someone with no particular experience or expertise tells you to alter it.
    2. A more refined form of lack of respect is simply lack of presence in decision making. This isn’t necessarily intentional: many people simply make decisions from the gut, and then come up with reasons to justify their decision. This style explicitly cuts out the deep thinking of science. Many policies could have been better informed by a serious consideration of even basic science:
      1. The oil of Iraq is fundamentally less valuable if we are going to tackle global warming.
      2. Swapping gasoline for hydrogen-based transportable energy source is dubious because it introduces another energy storage conversion to lose energy on. The same goes for swapping bioethanol for gasoline. In contrast, hybrid and electric vehicles actually recover substantial energy from regenerative braking, and a plug-in hybrid could run off electricity in typical commuter usage.
      3. The Space Shuttle is a boondoggle design. The rocket equation implies that the ratio of initial to final mass for vehicles reaching earth orbit must be at least a factor of e2.5 (it’s actually e2.93 for the Space Shuttle). Making the system reusable implies that most of this mass returns to earth so the payload deliverable into space is only 1.2% of the liftoff mass. A better designed system might deliver payloads a factor of 4 larger or be much smaller.
      4. Passenger Inspections at airports is another poor policy from the perspective of science. It isn’t effective, and there is no cost-efficient way to make it effective against a motivated opponent. Solid evidence for this is the continued use of mules to smuggle drugs. The basic problem from a chemistry point of view is that too much can be done with a small amount of mass. Deterrence and limitation (armored cockpits and active resistance for example) are fine policies.
    3. Lack of support. The simplest form of lack of respect is simply lack of support. The case for federal vs corporate funding of basic science and technology development is very simple: the benefit to society of conducting such work dramatically exceeds the benefit any one agent within society (such as a company) could gain from it. Of late, investment in core science has been an anemic 0.0005 GDP and visa issues hamstring broader technology development.
  3. Confidence This is primarily related to the technology side of science and technology. Many policy decisions are made without confidence in the ability of technologists to adapt. This comes in at least two flavors.
    1. The foreordained solution. Policy often comes in the form “we use approach X to solve problem Y” (some examples are above). This demonstrates an overconfidence by policy makers in there ability to pick the winner, and a lack of confidence in the ability of technologists to solve problems. It also represents an opportunity for large established industries to get huge payoffs at taxpayer expense. The X-prize represents the opposite of this approach, and it has been radically more effective by any reasonable standard.
    2. Confusion about the meaning of wealth. Some people believe that wealth is about what you have. However, for a society it seems much better to measure wealth in terms of what the society can do. Policy makers often forget that science and technology is a capability when it comes time to think of a solution. For example, someone with no confidence in the ability to create and make affordable plugin electric hybrids might think it necessary to conquest for oil.
  4. Stability People can’t program, do science, or invent new things when they are worried about more immediate events. There are several destabilizing trends going on in the US right now which either now or in the future may make it hard to focus away from immediate concerns.
    1. Debt and money supply. The federal debt for the US government is about 3.5 times the federal budget. This is bad for the simple reason that investors buying US treasury bonds aren’t investing in new technology. However, the destabilizing concern is more subtle. Since world war II, the US dollar has become the standard currency for exchange around the world. Since debt by the government creates a temptation by the government to (effectively) print money, the number of dollars in circulation has been rapidly growing. But, a growing number of dollars means that the currency is devaluing, which makes owning dollars undesirable. I don’t know an example of a previous world currency that has ceased to be such, but basic economics says that bad things happen to dollar-based savings if all the dollars flow back into the US. So far, the decline of the dollar has been relatively gradual, but a very disruptive cliff might exist out there somewhere. Policies which increase debt (like cutting taxes and increasing spending) exacerbate this problem. There is no fix once the dollar loses world currency status because confidence can be lost quickly, but not regained.
    2. Health Care. The US is running an experiment to determine how large a fraction of GDP can be devoted to health care. Currently it’s over 15%, in first place, and growing. This is even worse than it sounds, because many comparable countries in Europe (or Japan) have older populations which should generally be more expensive to take care of. In the present situation, because health care is incredibly expensive, losing health insurance (which is typically tied to a job) is potentially catastrophic for any individual.
    3. Wealth Asymmetry. The US has shifted towards a substantially more asymmetric division of wealth since the 1970s. An asymmetric division of wealth is not fundamentally bad—there needs to be room for great success to imply great rewards. However, a casual correlation of science and technology development with the gini coefficient map reveals that a large gini coefficient and substantial science and technology development do not coincide. The problem is that wealth becomes inheritable, and it’s very unlikely that the wealth is inherited by a someone interested in science and technology. Wealth is now scheduled to become perfectly inheritable in 2010 in the US.

I’m sure some of these issues are endemic to many other parts of the world as well, because there are fundamental conceptual difficulties with investing in the unknown instead of the known.


The Machine Learning Award goes to …

Tags: Machine Learning,Research jl@ 10:33 am

Perhaps the biggest CS prize for research is the Turing Award, which has a $0.25M cash prize associated with it. It appears none of the prizes so far have been for anything like machine learning (the closest are perhaps database awards).

In CS theory, there is the Gödel Prize which is smaller and newer, offering a $5K prize along and perhaps (more importantly) recognition. One such award has been given for Machine Learning, to Robert Schapire and Yoav Freund for Adaboost.

In Machine Learning, there seems to be no equivalent of these sorts of prizes. There are several plausible reasons for this:

There is no coherent community. People drift in and out of the central conferences all the time. Most of the author names from 10 years ago do not occur in the conferences of today. In addition, the entire subject area is fairly new. There are at least a core group of people who have stayed around.
Machine Learning work doesn’t last Almost every paper is forgotten, because {the goals change, there isn’t any real progress, there are no teachable foundations}. It’s true of all fields of research.
There is no consensus (or clarity) on what’s important. The field is fractured between many very different viewpoints—statistical, empirical, AI, and theoretical. The prioritization of results across these very different viewpoints is hard. We can hope that people who have been around for awhile can learn to appreciate these other viewpoints. The Turing Award is over a much more diverse set of interests.

I think there are some sound reasons why we-as-a-field-of-research should want a field-wide award.

  1. Aspiration Perhaps the most valuable aspect of an award is that it gives people an incentive to aim for something in the long term. The closest approximation that we have right now is “best papers” awards at individual conferences. Best paper awards have a role, but it’s not the same. 10 years from now, when we look back 10 years, which papers will seem most significant? Across many different conferences?
  2. Representation One function of an award is that tells other people what we consider good work. In an academic reference frame, it gives information of the form “this person deserves tenure”. From a less cynical point of view, having a short list of “what’s important” that outsiders can peruse might be quite helpful as a starting point in understanding what’s going on.
  3. Crystallization Research is a process of discovering information and placing it carefully in context. An award has some role in furthering that process.

The worst part of any award is administering it. How do you avoid wasting time and playing favorites while keeping the higher level vision of what might be useful in the long term? I don’t have any great insight, except that it would need to be done.


The Privacy Problem

Machine Learning is rising in importance because data is being collected for all sorts of tasks where it either wasn’t previously collected, or for tasks that did not previously exist. While this is great for Machine Learning, it has a downside—the massive data collection which is so useful can also lead to substantial privacy problems.

It’s important to understand that this is a much harder problem than many people appreciate. The AOL data release is a good example. To those doing machine learning, the following strategies might be obvious:

  1. Just delete any names or other obviously personally identifiable information. The logic here seems to be “if I can’t easily find the person then no one can”. That doesn’t work as demonstrated by the people who were found circumstantially from the AOL data.
  2. … then just hash all the search terms! The logic here is “if I can’t read it, then no one can”. It’s also trivially broken by a dictionary attack—just hash all the strings that might be in the data and check to see if they are in the data.
  3. … then encrypt all the search terms and throw away the key! This prevents a dictionary analysis, but it is still entirely possible to do a frequency analysis. If 10 terms appear with known relative frequencies in public data, then finding 10 terms encrypted terms with the same relative frequencies might give you very good evidence for what these terms are.

All of these strategies turn out to be broken. For those not familiar with Machine Learning, other obvious strategies turn out to not work that well.

  1. Just don’t collect the data. We are not too far off from a world where setting the “please don’t collect my information” flag in your browser implies that you volunteer to have your searches return less relevant results, to not find friends, to not filter spam, etc… If everyone effectively has that flag set by legislation the effect would be very substantial. Many internet companies run off of advertising so eliminating the ability to do targeted advertising will eliminate the ability of these companies to exist.
  2. …Then just keep aggregations of the data! Aggregating data is very bad for machine learning in general. When we are figuring out how to do machine learning it’s even worse because we don’t know in advance which aggregations would be most useful.
  3. …Then keep just enough data around and throw out everything else! Unfortunately, there is no such thing as “enough data”. More data is always better.

This is a particularly relevant topic right now, because it’s news and because CMU and NSF are organizing a workshop on the topic next month, which I’m planning to attend. However, this is not simply an interest burst—the long term trend of increasing data collection implies this problem will repeatedly come up over the indefinite future.

The privacy problem breaks into at least two parts.

  1. Cultural Norms. Historically, almost no monetary transactions were recorded and there was a reasonable expectation that people would forget a visitor. This is rapidly changing with the rise of credit cards and cameras. This change in what can be expected is profoundly uncomfortable.
  2. Power Balance. Data is power. The ability to collect and analyze large quantities of data which many large organizations now have or are constructing increases their power relative to ordinary people. This power can be used for good (to improve services) or for bad (to maximize monopoly status or for spying).

The cultural norm privacy problem is sometimes solvable by creating an opt-in or opt-out protocol. This is particularly helpful on the internet because a user could simply request “please don’t record my search” or “please don’t record which news articles I read”. Needing to do this for every search or every news article would be annoying. However, this is easily fixed by having a system wide setting—perhaps a special browser cookie which says “please don’t record me” that any site could check. None of this is helpful for cameras (where no interface exists) or monetary transactions (where the transaction itself determines whether or not some item is shipped).

The power balance privacy problem is much more difficult. Some solutions that people attempt are:

  1. Accept the change in power balance. This is the default action. There are plenty of historical examples where large organizations have abused their power, so providing them more power to abuse may be unwise.
  2. Legislate a halt. Forbid cameras in public places. Forbid the collection or retention of data by any organization. The problem with this method is that technology simply isn’t moving in this direction. At some point, we may end up with cameras and storage devices so small, cheap, and portable that forbidding their use is essentially absurd. The other difficulty with this solution is that it keeps good things from happening. For example, a reasonable argument can be made that the British were effective at tracking bomb planters because the cameras of London helped them source attacks.
  3. Legislate an acceleration. Instead of halting the collection of data, open it up to more general use. One example of this is cameras in police cars in the US. Recordings from these cameras can often settle disputes very definitively. As technology improves, it’s reasonable to expect cameras just about anywhere people are in public. Some legislation and good engineering could make these cameras available to anyone. This would involve a substantial shift in cultural norms—essentially people would always be in potential public view when not at home. This directly collides with the “privacy as a cultural norm” privacy problem.

The hardness of the privacy problem mentioned at the post beginning implies difficult tradeoffs.

  1. If you have cultural norm privacy concerns, then you really don’t appreciate method (3) for power balance privacy concerns.
  2. If you value privacy greatly and the default action is taken, then you prefer monopolistic marketplaces. The advantages of a large amount of private data are prohibitive to new market entrance.
  3. If you want the internet to work better, then there are limits on how little data can be collected.

All of the above is even murkier because what can be done with data is not fully known, nor is what can be done in a privacy sensitive way.


Asking questions

Tags: Research jl@ 6:02 am

There are very substantial differences in how question asking is viewed culturally. For example, all of the following are common:

  1. If no one asks a question, then no one is paying attention.
  2. To ask a question is disrespectful of the speaker.
  3. Asking a question is admitting your own ignorance.

The first view seems to be the right one for research, for several reasons.

  1. Research is quite hard—it’s difficult to guess how people won’t understand something in advance while preparing a presentation. Consequently, it’s very common to lose people. No worthwhile presenter wants that.
  2. Real understanding is precious. By asking a question, you are really declaring “I want to understand”, and everyone should respect that.
  3. Asking a question wakes you up. I don’t mean from “asleep” to “awake” but from “awake” to “really awake”. It’s easy to drift through something sort-of-understanding. When you ask a question, especially because you are on the spot, you will do much better.

Some of these effects might seem minor, but they accumulate over time, and their accumulation can have a big effect in terms of knowledge and understanding by the questioner as well as how well ideas are communicated. A final piece of evidence comes from checking cultural backgrounds. People with a cultural background that accepts question asking simply tend to do better in research. If this isn’t you, it’s ok. By being conscious of the need to ask questions and working up the courage to do it, you can do fine.

A reasonable default is that the time to not ask a question is when the speaker (or the environment) explicitly say “let me make progress in the talk”.


The View From China

Tags: Machine Learning,Research jl@ 7:44 am

I’m visiting Beijing for the Pao-Lu Hsu Statistics Conference on Machine Learning.

I had several discussions about the state of Chinese research. Given the large population and economy, you might expect substantial research—more than has been observed at international conferences. The fundamental problem seems to be the Cultural Revolution which lobotimized higher education, and the research associated with it. There has been a process of slow recovery since then, which has begun to be felt in the research world via increased participation in international conferences and (now) conferences in China.

The amount of effort going into construction in Beijing is very impressive—people are literally building a skyscraper at night outside the window of the hotel I’m staying at (and this is not unusual). If a small fraction of this effort is later focused onto supporting research, the effect could be very substantial. General growth in China’s research portfolio should be expected.


Presentation Preparation

Tags: Research jl@ 11:02 pm

A big part of doing research is presenting it at a conference. Since many people start out shy of public presentations, this can be a substantial challenge. Here are a few notes which might be helpful when thinking about preparing a presentation on research.

  1. Motivate. Talks which don’t start by describing the problem to solve cause many people to zone out.
  2. Prioritize. It is typical that you have more things to say than time to say them, and many presenters fall into the failure mode of trying to say too much. This is an easy-to-understand failure mode as it’s very natural to want to include everything. A basic fact is: you can’t. Example of this are:
    1. Your slides are so densely full of equations and words that you can’t cover them.
    2. Your talk runs over and a moderator prioritizes for you by cutting you off.
    3. You motor-mouth through the presentation, and the information absorption rate of the audience prioritizes in some uncontrolled fashion.
    4. The rate of flow of concepts simply exceeds the information capacity of the audience. Even with nondense slides and an easy succinct delivery, this can often happen.

    One way to prioritize is figure out: “What would I present in 1 minute?” or “What would I present in 5 minutes?”, and then let this guide your overall presentation.

  3. Unassume. When you are working in an area, it’s typical to buildup an internal shorthand for concepts. This needs to be peeled away when preparing a presentation. Decide what the minimal set of concepts are, and then be sure to define them as they are introduced. For people familiar with the basic concepts, this gives them a way to reconcile choices of language, and others at least have a prayer of following.
  4. Practice Well. Some people try to get a talk right by practicing it relentlessly until it is memorized, and then deliver it as a memorized monologue. This is terrible, because people in the audience know it is a memorized monologue and zone out. A good talk is delivered like a conversation, where it happens to be your turn to speak for awhile, and practicing that is more difficult. Some practice by yourself can be helpful—but not too much. A much better method is to practice on your friends by delivering to them before delivering it to the wider world.

The points above avoid the common failure modes which seem to come up with first-time presenters. There is much more advice to give (and for me to learn) about giving better presentations.


The Coming Patent Apocalypse

Tags: Funding,Research jl@ 5:46 pm

Many people in computer science believe that patents are problematic. The truth is even worse—the patent system in the US is fundamentally broken in ways that will require much more significant reform than is being considered now.

The myth of the patent is the following: Patents are a mechanism for inventors to be compensated according to the value of their inventions while making the invention available to all. This myth sounds pretty desirable, but the reality is a strange distortion slowly leading towards collapse.

There are many problems associated with patents, but I would like to focus on just two of them:

  1. Patent Trolls The way that patents have generally worked over the last several decades is that they were a tool of large companies. Large companies would amass a large number of patents and then cross-license each other’s patents—in effect saying “we agree to owe each other nothing”. Smaller companies would sometimes lose in this game, essentially because they didn’t have enough patents to convince the larger companies that cross-licensing was a good idea. However, they didn’t necessarily lose, because small companies are also doing fewer things which makes their patent violation profile smaller.

    The patent trolls arrived recently. They are a new breed of company which does nothing but produce patents and collect money from them. The thing which distinguishes patent troll companies is that they have no patent violation profile. A company with a large number of patents can not credibly threaten to sue them unless they cross-license their patents, because they don’t do anything which violates a patent.

    The best example (and proof that this method works) is NTP, which extracted $612.5M from RIM. Although this is the big case with lots of publicity, the process of extracting money goes on constantly all around your in backroom negotiations with the companies that actually do things. In effect, patent trolls impose an invisible tax on companies that do things by companies that don’t. Restated in another way, patent trolls are akin to exploiting tax loopholes—except they exploit the law to make money rather than simply to avoid losing it. Smaller companies are particularly prone to lose, because they simply can not afford the extreme legal fees associated with fighting even a winning battle, but even large companies are also vulnerable to a patent troll.

    The other side of this argument is that patent trolls are simply performing a useful business function: employing researchers to come up with ideas or (at least) putting a floor on the value of ideas which they buy up through patents. Unfortunately, this is simply not true in my experience, due to the next problem.

  2. Combinatorial Ideas Patents are too easy. In fact, the process of coming up with patentable ideas is simply a matter of combinatorial application of existing ideas. This is a simple game that any reasonably intelligent person can play: you take idea 1 and idea 2, glue them together in any reasonable way, and get a patent.

    There are several reasons why the combinatorial application of existing ideas has become standard for patents.

    1. One of these is regulatory capture. It should surprise no one that the patent office, which gets paid for every patent application, has found a way to increase the number of patent applications.
    2. Another reason has to do with the way that patent law developed. Initially, patents were for processes doing things, rather than ideas. The scope of patents has steadily extended over time but the basic idea of patenting a process has been preserved. The fundamental problem is that processes can be created by the combinatorial application of ideas.

    The ease of patents is fundamentally valuable to patent troll type companies because they can acquire a large number of patents on processes which other companies accidentally violate.

The patent apocalypse happens when we project forward these two trends. Patents become ever easier to acquire and patent troll companies become ever more prevalent. In the end, every company which does something uses some obvious process that violates someone’s patent, and they have to pay at rates the patent owner chooses. There is no inherent bound on the number of patent troll type companies which can exist—they can multiply unchecked and drain money from every other company which does things until the system collapses.

I would like to make some positive suggestions here about how to reform the patent system, but it’s a hard mechanism design problem. Some obvious points are:

  1. Patents should have higher standards than research papers rather than substantially lower standards.
  2. The patent office should not make money from patents (this is not equivalent to saying that the patent applications should not be charged).
  3. The decision in whether or not to grant a patent should weigh the cost of granting. Many private property afficionados think “patents are great, because they compensate inventors”, but there is a real cost to society in granting obvious patents. You block people from doing things in the straightforward way or create hidden liabilities.
  4. Patents should be far more readable if the “available for all” part of the myth is to be upheld. Published academic papers are substantially more readable (which isn’t all that high of a bar).

Patent troll companies have found a clever way to exhibit the flaws in the current patent system. Substantial patent reform to eliminate this style of company would benefit just about everyone, except for these companies.


Davor has been working to setup which is the new site for the many lectures mentioned here. (Tragically, they seem to only be available in windows media format.) I went through my own projects and added a few links to the videos. The day when every result is a set of {paper, slides, video} isn’t quite here yet, but it’s within sight. (For many papers, of course, code is a 4th component.)


The Forgetting

How many papers do you remember from 2006? 2005? 2002? 1997? 1987? 1967? One way to judge this would be to look at the citations of the papers you write—how many came from which year? For myself, the answers on recent papers are:

year 2006 2005 2002 1997 1987 1967
count 4 10 5 1 0 0

This spectrum is fairly typical of papers in general. There are many reasons that citations are focused on recent papers.

  1. The number of papers being published continues to grow. This is not a very significant effect, because the rate of publication has not grown nearly as fast.
  2. Dead men don’t reject your papers for not citing them. This reason seems lame, because it’s a distortion from the ideal of science. Nevertheless, it must be stated because the effect can be significant.
  3. In 1997, I started as a PhD student. Naturally, papers after 1997 are better remembered because they were absorbed in real time. A large fraction of people writing papers and attending conferences haven’t been doing it for 10 years.
  4. Old papers aren’t on the internet. This is huge effect for any papers prior to 1995 (or so). The ease of examining a paper greatly influences the ability of an author to read and understand it. There are a number of journals which essentially have “internet access for the privileged elite who are willing to pay”. In my experience, this is only marginally better than having them stuck in the library.
  5. The recent past is more relevant to the present than the far past. There is a lot of truth in this—people discover and promote various problems or techniques which take off for awhile, until their turn to be forgotten arrives.

Should we be disturbed by this forgetting? There are a few good effects. For example, when people forget, they reinvent, and sometimes they reinvent better. Nevertheless, it seems like the effect of forgetting is bad overall, because it causes wasted effort. There are two implications:

  1. For paper writers, it is very common to overestimate the value of a paper, even though we know that the impact of most papers is bounded in time. Perhaps by looking at those older papers, we can get an idea of what is important in the long term. For example, looking at my own older citations, simplicity is it. If you want a paper to have a long term impact, it needs to have a simple algorithm, analysis method, or setting. Fundamentally, only those things which are teachable survive. Was your last paper simple? Could you teach it in a class? Are other people going to start doing so? Are the review criteria promoting the papers which a hope of survival?
  2. For conference organizers, it’s important to understand the way science has changed. Originally, you had to be a giant to succeed at science. Then, you merely had to stand on the shoulders of giants to succeed. Now, it seems that even the ability to peer over the shoulders of people standing on the shoulders of giants might be helpful. This is generally a good thing, because it means more people can help on a very hard task. Nevertheless, it seems that much of this effort is getting wasted in forgetting, because we do not have the right mechanisms to remember the information. Which is going to be the first conference to switch away from an ordered list of papers to something with structure? Wouldn’t it be great if all the content at a conference was organized in a wikipedia-like easy-for-outsiders-to-understand style?


Best Practices for Collaboration

Tags: Papers,Research jl@ 1:51 pm

Many people, especially students, haven’t had an opportunity to collaborate with other researchers. Collaboration, especially with remote people can be tricky. Here are some observations of what has worked for me on collaborations involving a few people.

  1. Travel and Discuss Almost all collaborations start with in-person discussion. This implies that travel is often necessary. We can hope that in the future we’ll have better systems for starting collaborations remotely (such as blogs), but we aren’t quite there yet.
  2. Enable your collaborator. A collaboration can fall apart because one collaborator disables another. This sounds stupid (and it is), but it’s far easier than you might think.
    1. Avoid Duplication. Discovering that you and a collaborator have been editing the same thing and now need to waste time reconciling changes is annoying. The best way to avoid this to be explicit about who has write permission to what. Most of the time, a write lock is held for the entire document, just to be sure.
    2. Don’t keep the write lock unnecessarily. Some people are perfectionists so they have a real problem giving up the write lock on a draft until it is perfect. This prevents other collaborators from doing things. Releasing write lock (at least) when you sleep, is a good idea.
    3. Send all necessary materials. Some people try to save space or bandwidth by not passing ‘.bib’ files or other auxiliary components. Forcing your collaborator to deal with the missing subdocument problem is disabling. Space and bandwidth are cheap while your collaborators time is precious. (Sending may be pass-by-reference rather than attach-to-message in most cases.)
    4. Version Control. This doesn’t mean “use version control software”, although that’s fine. Instead, it means: have a version number for drafts passed back and forth. This means you can talk about “draft 3″ rather than “the draft that was passed last tuesday”. Coupled with “send all necessary materials”, this implies that you naturally backup previous work.
  3. Be Generous. It’s common for people to feel insecure about what they have done or how much “credit” they should get.
    1. Coauthor standing. When deciding who should have a chance to be a coauthor, the rule should be “anyone who has helped produce a result conditioned on previous work”. “Helped produce” is often interpreted too narrowly—a theoretician should be generous about crediting experimental results and vice-versa. Potential coauthors may decline (and senior ones often do so). Control over who is a coauthor is best (and most naturally) exercised by the choice of who you talk to.
    2. Author ordering. Author ordering is the wrong thing to worry about, so don’t. The CS theory community has a substantial advantage here because they default to alpha-by-author ordering, as is understood by everyone.
    3. Who presents. A good default for presentations at a conference is “student presents” (or suitable generalizations). This gives young people a real chance to get involved and learn how things are done. Senior collaborators already have plentiful alternative methods to present research at workshops or invited talks.
  4. Communicate by default Not cc’ing a collaborator is a bad idea. Even if you have a very specific question for one collaborator and not another, it’s a good idea to cc everyone. In the worst case, this is a few-second annoyance for the other collaborator. In the best case, the exchange answers unasked questions. This also prevents “conversation shifts into subjects interesting to everyone, but oops! you weren’t cced” problem.

These practices are imperfectly followed even by me, but they are a good ideal to strive for.


Recruitment Conferences

One of the subsidiary roles of conferences is recruitment. NIPS is optimally placed in time for this because it falls right before the major recruitment season.

I personally found job hunting embarrassing, and was relatively inept at it. I expect this is true of many people, because it is not something done often.

The basic rule is: make the plausible hirers aware of your interest. Any corporate sponsor is a “plausible”, regardless of whether or not there is a booth. CRA and the acm job center are other reasonable sources.

There are substantial differences between the different possibilities. Putting some effort into understanding the distinctions is a good idea, although you should always remember where the other person is coming from.


Structural Problems in NIPS Decision Making

This is a very difficult post to write, because it is about a perenially touchy subject. Nevertheless, it is an important one which needs to be thought about carefully.

There are a few things which should be understood:

  1. The system is changing and responsive. We-the-authors are we-the-reviewers, we-the-PC, and even we-the-NIPS-board. NIPS has implemented ‘secondary program chairs’, ‘author response’, and ‘double blind reviewing’ in the last few years to help with the decision process, and more changes may happen in the future.
  2. Agreement creates a perception of correctness. When any PC meets and makes a group decision about a paper, there is a strong tendency for the reinforcement inherent in a group decision to create the perception of correctness. For the many people who have been on the NIPS PC it’s reasonable to entertain a healthy skepticism in the face of this reinforcing certainty.
  3. This post is about structural problems. What problems arise because of the structure of the process? The post is not about individual people, because this is unlikely to be fruitful.

Although the subject is nominally about NIPS (which I have experience with as an author, reviewer, and PC member), the points may apply elsewhere.

For those that don’t know, it’s worth reviewing how the NIPS process currently works. Temporally, it looks like the following:

  1. PC chair is appointed.
  2. PC chair picks PC committee to cover many different areas. NIPS is notably diverse.
  3. PC committee members pick reviewers for their areas.
  4. Authors submit blinded papers.
  5. Papers are assigned to two PC committee members, the “primary” and the “secondary”.
  6. Reviewers bid for papers within their areas which they want and don’t want to review.
  7. Reviewers are assigned papers based on bid plus coverage.
  8. Reviewers review papers.
  9. Authors respond to blinded reviews.
  10. Reviewers discuss and rate papers.
  11. PC members digest author/reviewer interaction (and sometimes the paper) into an impression.
  12. PC members meet physically at the PC meeting.
  13. PC members present all papers that they believe are worth considering to other PC members and a decision is made.

Naturally, there are many details left out of this long list.

Here is my attempt to describe the problems I’ve seen:

  1. Attention deficit disorder. The attention paid to individual accept/reject decisions is (and structurally must be) small. There are several effects which drive this:
    1. The people on the NIPS PC are typically busy and time constrained.
    2. The number of papers assigned to individual PC members is large—perhaps 40 to 80, plus a similar number assigned as a secondary.
    3. Many of the people have traveled a very long ways to reach the PC meeting. Jetlag is common, and often significantly effects your ability to think carefully.
    4. The meeting itself is 2 days long. The average time spent on any decision must be less than 5 minutes, and everyone knows this. The implicit encouragement to digest a paper down to its most simple description is significant. No one on the PC has seen the paper except for the primary and the secondary (if you are lucky) PC members, so decisions are made quickly based upon relatively little information. (This is better than it sounds in most cases because effectively the decision was made by the primary PC member beforehand.)
  2. Artificial scarcity. NIPS is a single track conference with 3 levels of acceptance “Accept for an oral presentation”, “Accept for a poster with a spotlight”, and “Accept as a poster only”. It’s fairly difficult to justify a paper as “of broad interest”, which is ideal for an oral presentation. Will a neuroscientist really pay attention to this learning theory paper? Is this dimensionality reduction algorithm going to interest someone in learning theory? It’s substantially easier to justify a paper as “possibly of interest to a number of people”, which is about right for poster spotlight. Since the number of spotlights and the number of orals is similar, two effects occur: papers which are about right for spotlights become orals, and many reasonable spotlights aren’t spotlights because they don’t fit.
  3. The Veto Effect. If someone on the PC has a strong dislike for your paper, there is a very good chance for reject. This is true even when attention is explicitly payed by the PC chair to avoiding the veto problem. It’s even true when your paper has the strongest reviews in the area (no joke!). There are several fundamental problems here:
    1. People, especially in person, do not generally want to be confrontational. Consequently, if someone who is rarely confrontational speaks strongly against a paper, it’s rare2 for an alternate voice to be heard.
    2. It is easy to instill “fear, uncertainty, and doubt” in people. Was this paper covering the same material as some other paper no one knows? Are the assumptions criticizable? This problem is greatly exaggerated by attention deficit disorder.

It is easy to complain about these problems and substantially harder to fix them. (There is previous discussion on this.) Here is my best attempt to imagine fixes.

  1. Attention Deficit Disorder. The fundamental problem here is that papers aren’t getting the attention that they deserve by the final decision maker. Several changes might help, but nothing is going to be a silver bullet here.
    1. Author responsibility. Unfortunately, some authors abuse the system by submitting papers which should not be submitted. Much of this has to do with inexperience—many authors are first time paper writers. For these authors, some better effort educating people about what is an appropriate paper is good. This year, an effort was made to do this, and followups may be helpful. For a small fraction of papers, authors intentionally skate the edge of what is reasonable. Should an ICML paper with 30% different content be submitted to NIPS? This small fraction takes more time than their fraction indicates and (frankly) isn’t always caught. Some form of “shame list” may be an appropriate way to deal with this, although much caution would have to be exercised.
    2. Many of the problems here are unremovable artifacts of a physically present PC meeting. Going to a virtualized process would eliminate these problems (and introduce others). Any such decision would have to be carefully considered, but it is not impossible—there are plenty of succesful conference committees which never meet physically.
    3. The PC meeting can be run a bit differently.
      1. Bob Williamson and I managed to go through our secondary assignments and make independent decisions, then reconcile. In contrast, for most papers, the secondary PC member was inoperative at the PC meeting. This made some difference, and it’s easy to imagine that systematically having this reconciliation be a part of the PC meeting is helpful. The reconciliation step does not take very long and is parallelizable.
      2. Not making a decision at the PC meeting could be a real option for a small number of troublesome papers. There is perhaps a week-long timegap between the PC meeting and the release of the decisions during which decisions could be double checked. This option must only be used rarely, and never as a means for excluding interested PC members from the decision.
      3. Information can be more widely shared. I don’t see any real advantage to limiting the knowledge of papers not in your area to “title+authors”. At the PC meeting itself, it would be helpful to have all of the papers available to all of the members.
  2. Artificial Scarcity. My understanding is that the makers of NIPS purposefully preferred a single track conference, and it’s hard to argue with the success NIPS has enjoyed. Nevertheless, it seems notable that the NIPS workshops (which are excessively multitracked) are more succesful than the NIPS conference by some measures. Going to a two-track or partially two-track format would ease some of the decision making.

    Even working within the single track format, it’s not clear that the ratio between orals and spotlights is right. Spotlights take about 1/10th the time that an oral presentation takes, and yet only 1/10th or so of the overall time is allocated to spotlight presentations. Losing one oral presentation (out of about 20) would yield a
    significant increase in the number of spotlights, and it’s easy to imagine this would be beneficial to attendees while easing decision making.

  3. The Veto Effect. The veto effect is hard to deal with, and it’s only relevant to a small number of decisions. Nevertheless it’s important because some of the best papers are controversial at the time they are published. The are two ways I can imagine for dealing with the veto effect: (1) allowing author feedback (2) devolving power from the PC to the reviewers. Allowing author feedback would have to be coupled with delayed decision making. Eliminating the power of the PC to reject very highly rated papers is also controversial, but may be worth considering.


Continuizing Solutions

This post is about a general technique for problem solving which I’ve never seen taught (in full generality), but which I’ve found very useful.

Many problems in computer science turn out to be discretely difficult. The best known version of such problems are NP-hard problems, but I mean ‘discretely difficult’ in a much more general way, which I only know how to capture by examples.

  1. ERM In empirical risk minimization, you choose a minimum error rate classifier from a set of classifiers. This is NP hard for common sets, but it can be much harder, depending on the set.
  2. Experts In the online learning with experts setting, you try to predict well so as to compete with a set of (adversarial) experts. Here the alternating quantifiers of you and an adversary playing out a game can yield a dynamic programming problem that grows exponentially.
  3. Policy Iteration The problem with policy iteration is that you learn a new policy with respect to an old policy, which implies that simply adopting the new policy can go very wrong.

For each of these problems, there are “continuized” solutions which can yield smaller computation, more elegant mathematics, or both.

  1. ERM By shifting from choosing a single classifier to choosing a stochastic classifier we can prove a new style of bound which is significantly tighter, easier to state, and easier to understand than traditional bounds in the traditional setting. This is the PAC-Bayes bound idea.
  2. Experts By giving the adversary slightly more power—the ability to split experts and have them fractionally predict one way vs. another, the optimal policy becomes much easier to compute (quadratic in the horizon, or maybe less). This is the continuous experts idea.
  3. Policy Iteration For policy iteration, by stochastically mixing the old and the new policy, we can find a new policy better than the old policy. This is the conservative policy iteration idea.

There is some danger to continuizing. The first and second examples both involve a setting shift, which may not be valid—in general your setting should reflect your real problem rather than the thing which is easy to solve. However, even with the setting shift, the solutions appear so compellingly more elegant that it is hard to not hope to use them in a solution to the original setting.

I have not seen a good formulation of the general approach of continuizing. Nevertheless, I expect to see continuizing in more places and to use it in the future. By making it explicit, perhaps this can be made eaesier.


Health of Conferences Wiki

Aaron Hertzmann points out the health of conferences wiki, which has a great deal of information about how many different conferences function.


Luis von Ahn is awarded a MacArthur fellowship.

For his work on the subject of human computation including ESPGame, Peekaboom, and Phetch. The new MacArthur fellows.


Incentive Compatible Reviewing

Tags: Papers,Research,Reviewing jl@ 10:13 pm

Reviewing is a fairly formal process which is integral to the way academia is run. Given this integral nature, the quality of reviewing is often frustrating. I’ve seen plenty of examples of false statements, misbeliefs, reading what isn’t written, etc…, and I’m sure many other people have as well.

Recently, mechanisms like double blind review and author feedback have been introduced to try to make the process more fair and accurate in many machine learning (and related) conferences. My personal experience is that these mechanisms help, especially the author feedback. Nevertheless, some problems remain.

The game theory take on reviewing is that the incentive for truthful reviewing isn’t there. Since reviewers are also authors, there are sometimes perverse incentives created and acted upon. (Incidentially, these incentives can be both positive and negative.)

Setting up a truthful reviewing system is tricky because their is no final reference truth available in any acceptable (say: subyear) timespan. There are several ways we could try to get around this.

  1. We could try to engineer new mechanisms for finding a reference truth into a conference and then use a ‘proper scoring rule’ which is incentive compatible. For example, we could have a survey where conference participants short list the papers which interested them. There are significant problems here:
    1. Conference presentations mostly function as announcements of results. Consequently, the understanding of the paper at the conference is not nearly as deep as, say, after reading through it carefully in a reading group.
    2. This is inherently useless for judging reviews of rejected papers and it is highly biased for judging reviews of papers presented in two different formats (say, a poster versus an oral presentation).
  2. We could ignore the time issue and try to measure reviewer performance based upon (say) long term citation count. Aside from the bias problems above, there is also a huge problem associated with turnover. Who the reviewers are and how an individual reviewer reviews may change drastically in just a 5 year timespan. A system which can provide track records for only a small subset of current reviewers isn’t very capable.
  3. We could try to manufacture an incentive compatible system even when the truth is never known. This paper by Nolan Miller, Paul Resnick, and Richard Zeckhauser discusses the feasibility of this approach. Essentially, the scheme works by rewarding reviewer i according to a proper scoring rule applied to P(reviewer j’s score | reviewer i’s score). (A simple example of a proper scoring rule is log[P()].) This is approach is pretty fresh, so there are lots of problems, some of which may or may not be fundamental difficulties for application in practice. The significant problem I see is that this mechanism may reward joint agreement instead of a good contribution towards good joint decision making.

None of these mechanisms are perfect, but they may each yield a little bit of information about what was or was not a good decision over time. Combining these sources of information to create some reviewer judgement system may yield another small improvement in the reviewing process.

The important thing to remember is that we are the reviewers as well as the authors. Are we interested in tracking our reviewing performance over time in order to make better judgements? Such tracking often happens on an anecdotal or personal basis, but shifting to an automated incentive compatible system would be a big change in scope.


How to solve an NP hard problem in quadratic time

This title is a lie, but it is a special lie which has a bit of truth.

If n players each play each other, you have a tournament. How do you order the players from weakest to strongest?

The standard first attempt is “find the ordering which agrees with the tournament on as many player pairs as possible”. This is called the “minimum feedback arcset” problem in the CS theory literature and it is a well known NP-hard problem. A basic guarantee holds for the solution to this problem: if there is some “true” intrinsic ordering, and the outcome of the tournament disagrees k times (due to noise for instance), then the output ordering will disagree with the original ordering on at most 2k edges (and no solution can be better).

One standard approach to tractably solving an NP-hard problem is to find another algorithm with an approximation guarantee. For example, Don Coppersmith, Lisa Fleischer and Atri Rudra proved that ordering players according to the number of wins is a 5-approximation to the NP-hard problem.

An even better approach is to realize that the NP hard problem may not be the real problem. The real problem may be finding a good approximation to the “true” intrinsic ordering given noisy tournament information.

In a learning setting, the simplest form of ranking problem is “bipartite ranking” where every element has a value of either 0 or 1 and we want to order 0s before 1s. A common way to measure the performance of bipartite ranking is according to “area under the ROC curve” (AUC) = 1 – the fraction of out-of-order pairs. Nina, Alina, Greg and I proved that if we learn a comparison function which errs on k dissimilar pairs, then ordering according to the number of wins yields an order within 4k edge reversals of the original ordering. As a reduction statement(*), this shows that an error rate of e for a learned pairwise binary classifier produces an ordering with an expected AUC of 1 – 4e. The same inequality even holds for a (stronger) regret transform. If r = e – emin is the regret of the binary pairwise classifier, then the AUC regret is bounded by 4r. (Here emin is the error rate of the best possible classifier which predicts knowing the most likely outcome.) The regret result extends to more general measures of ordering than simply AUC.

We were unable to find any examples where ordering according to the degree produced more than a 2r AUC regret. Nikhil Bansal, Don, and Greg have worked out a tightened proof which gives exactly this upper bound. At the end of the day, we have an algorithm with satisfies precisely the same guarantee as the NP hard solution.

There are two lessons here. The learning lesson is that a good pairwise comparator implies the ability to rank well according to AUC. The general research lesson is that an NP hard problem for an approximate solution is not an intrinsic obstacle. Sometimes there exist simple tractable algorithms which satisfy the same guarantees as the NP hard solution.

(*) To prove the reduction, you must make sure that your form pairwise examples in the right way. Your source of pairwise ordering examples must be uniform over the dissimilar pairs containing one example with label 1 and one example with label 0.


Precision is not accuracy

Tags: Machine Learning,Research jl@ 9:23 pm

In my experience, there are two different groups of people who believe the same thing: the mathematics encountered in typical machine learning conference papers is often of questionable value.
The two groups who agree on this are applied machine learning people who have given up on math, and mature theoreticians who understand the limits of theory.

Partly, this is just a statement about where we are with respect to machine learning. In particular, we have no mechanism capable of generating a prescription for how to solve all learning problems. In the absence of such certainty, people try to come up with formalisms that partially describe and motivate how and why they do things. This is natural and healthy—we might hope that it will eventually lead to just such a mechanism.

But, part of this is simply an emphasis on complexity over clarity. A very natural and simple theoretical statement is often obscured by complexifications. Common sources of complexification include:

  1. Generalization By trying to make a statement that applies in the most general possible setting, your theorem becomes excessively hard to read.
  2. Specialization Your theorem relies upon so many assumptions that it is hard for a simple reader to hold them all in their head.
  3. Obscuration Your theorem relies upon cumbersome notation full of subsubsuperscripts, badly named variables, etc…

There are several reasons why complexification occurs.

  1. Excessive generalization often happens when authors have an idea and want to completely exploit it. So, various bells and whistles are added until the core idea is obscured.
  2. Excessive specialization often happens when authors have some algorithm they really want to prove works. So, they start making one assumption after another until the proof goes through.
  3. Obscuration is far more subtle than it sounds. Some of the worst obscurations come from using an old standard notation which has simply been pushed to far.

After doing research for awhile, you realize that these complexifications are counterproductive. Type (1) complexifications make it double hard for others to do follow-on work: your paper is hard to read and you have eliminated the possibility. Type (2) complexifications look like “the tail wags the dog”—the math isn’t really working until it guides the algorithm design. Figuring out how to remove the assumptions often results in the better algoritihm. Type (3) complexifications are an error. Fooling around to find a better notation is one of the ways that we sometimes make new discoveries.

The worst reason, I’ve saved for last: it’s that the reviewing process emphasizes precision over accuracy. Imagine shooting a math gun at a machine learning target. A high precision math gun will very carefully guide the bullets to strike a fixed location—even though the location may have little to do with the target. An accurate math gun will point at the correct target. A precision/accuracy tradeoff is often encountered: we don’t know how to think about the actual machine learning problem, so instead we very precisely think about another not-quite-right problem. A reviewer almost invariably prefers the more precise (but less accurate) paper because precision is the easy thing to check and think about.

There seems to be no easy fix for this—people naturally prefer to value the things they can measure. The hard fix for this is more time spent by everyone thinking about what the real machine learning problems are.


Who is having visa problems reaching US conferences?

Tags: Machine Learning,Research jl@ 8:56 pm

Many of the large machine learning conferences were in the US this summer. A common problem which students from abroad encounter is visa issues.
Just getting a visa to visit can be pretty rough: you stand around in lines, sometimes for days. Even worse is the timing with respect to ticket buying. Airplane tickets typically need to be bought well in advance on nonrefundable terms to secure a reasonable rate for air travel. When a visa is denied, as happens reasonably often, a very expensive ticket is burnt.

A serious effort is under way to raise this as in issue in need of fixing. Over the long term, effectively driving research conferences to locate outside of the US seems an unwise policy. Robert Schapire is planning to talk to a congressman. Sally Goldman suggested putting together a list of problem cases, and Phil Long setup an email address to collect them.

If you (or someone you know) has had insurmountable difficulties reaching a conference in the US, please send an email with:

Email address:
Details: (be brief please)

We expect most of the problem cases are students, so don’t be shy.

« Newer PostsOlder Posts »

Powered by WordPress