Machine Learning (Theory)

2/16/2007

The Forgetting

How many papers do you remember from 2006? 2005? 2002? 1997? 1987? 1967? One way to judge this would be to look at the citations of the papers you write—how many came from which year? For myself, the answers on recent papers are:

year    2006  2005  2002  1997  1987  1967
count      4    10     5     1     0     0
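
For anyone who wants to run the same tally on their own bibliography, here is a minimal sketch. It assumes each reference is a plain string that contains its publication year, and it treats the last four-digit number starting with 19 or 20 as that year, which is only a heuristic (a page range such as 1997-2003 would fool it). The function name `citation_years` and the placeholder references are made up for illustration.

```python
import re
from collections import Counter

# Matches four-digit numbers that look like years (1900-2099).
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def citation_years(references):
    """Count how many references fall in each publication year.

    Assumption: the last year-like number in a reference string is its
    publication year. References with no year-like number are skipped.
    """
    counts = Counter()
    for ref in references:
        years = [int(match.group(0)) for match in YEAR_PATTERN.finditer(ref)]
        if years:
            counts[years[-1]] += 1
    return counts


if __name__ == "__main__":
    # Placeholder strings, not real citations.
    refs = [
        "Author A. Placeholder title one. Some Venue, 2006.",
        "Author B. Placeholder title two. Some Venue, 2005.",
        "Author C. Placeholder title three. Some Venue, 2005.",
        "Author D. Placeholder title with no year.",
    ]
    for year, count in sorted(citation_years(refs).items(), reverse=True):
        print(year, count)
```

Printing the counts in descending year order gives a row per year, in the same spirit as the table above.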

This spectrum is fairly typical of papers in general. There are many reasons that citations are focused on recent papers.

  1. The number of papers being published continues to grow. This is not a very significant effect, because the rate of publication has not grown nearly as fast as citations fall off with age.
  2. Dead men don’t reject your papers for not citing them. This reason seems lame, because it’s a distortion from the ideal of science. Nevertheless, it must be stated because the effect can be significant.
  3. In 1997, I started as a PhD student. Naturally, papers after 1997 are better remembered because they were absorbed in real time. A large fraction of people writing papers and attending conferences haven’t been doing it for 10 years.
  4. Old papers aren’t on the internet. This is a huge effect for any papers prior to 1995 (or so). The ease of examining a paper greatly influences the ability of an author to read and understand it. There are a number of journals which essentially have “internet access for the privileged elite who are willing to pay”. In my experience, this is only marginally better than having them stuck in the library.
  5. The recent past is more relevant to the present than the far past. There is a lot of truth in this: people discover and promote various problems or techniques which take off for a while, until their turn to be forgotten arrives.

Should we be disturbed by this forgetting? There are a few good effects. For example, when people forget, they reinvent, and sometimes they reinvent better. Nevertheless, it seems like the effect of forgetting is bad overall, because it causes wasted effort. There are two implications:

  1. For paper writers, it is very common to overestimate the value of a paper, even though we know that the impact of most papers is bounded in time. Perhaps by looking at those older papers, we can get an idea of what is important in the long term. For example, looking at my own older citations, simplicity is it. If you want a paper to have a long term impact, it needs to have a simple algorithm, analysis method, or setting. Fundamentally, only those things which are teachable survive. Was your last paper simple? Could you teach it in a class? Are other people going to start doing so? Are the review criteria promoting the papers which have a hope of survival?
  2. For conference organizers, it’s important to understand the way science has changed. Originally, you had to be a giant to succeed at science. Then, you merely had to stand on the shoulders of giants to succeed. Now, it seems that even the ability to peer over the shoulders of people standing on the shoulders of giants might be helpful. This is generally a good thing, because it means more people can help on a very hard task. Nevertheless, it seems that much of this effort is getting wasted in forgetting, because we do not have the right mechanisms to remember the information. Which is going to be the first conference to switch away from an ordered list of papers to something with structure? Wouldn’t it be great if all the content at a conference was organized in a Wikipedia-like easy-for-outsiders-to-understand style?

11 Comments to “The Forgetting”
  1. John C. says:

    Perhaps this forgetting is by design. It was almost always my understanding that papers were proposals for what should be included in books, which were proposals for what should be taught, which in turn is a proposal for what should be included in common knowledge. Those who use such a fragile process might desire to have their ideas fight for memorability (or to have the prestige of having won such a fight).

  2. Kilian W. says:

    John, I agree with your last point. The way papers are stored right now is clearly sub-optimal. It is too easy to miss important work because it was published at a conference that you are less familiar with, that was before your time, or because the title threw you off. Ideally, you could imagine a centralized, searchable, hierarchical database where people and conferences upload their papers. If it is organized in a fine-grained hierarchy, you could subscribe to your topic/sub-tree of interest and receive weekly or daily emails with the latest additions. Going further, you could imagine allowing users to leave reviews or ratings for the papers (Amazon or Digg style).

    Isn’t there somebody at Google still searching for a useful 20% project?

  3. Anonymous says:

    Kilian – fyi such a paper organization and peer-comment system does exist in biology and medicine, see: http://www.f1000biology.com/start.asp
    the flexibility of their topic hierarchy, and the breadth of commenting peers (right now only faculty members, surprising how many people have time to write!) and others may not be perfect, … who in Google wants to improve this, or in other words, implement a “PaperPedia”?

  4. furr says:

    if you try attacking bigger problems you’ll start citing older papers

  5. Scientific Papers in the Internet Age

    In a recent discussion at the Machine Learning (Theory) blog, the websites Faculty of 1000 (Biology) and Faculty of 1000 (Medicine) came up. It works as follows: users submit papers they like, and there is space for supporting and dissenting…

  6. On the comment 4. “Old papers aren’t on the internet.”

    This is field-dependent, and circumstances make some fields more fortunate than others. In astronomy, due to NASA money, the major journals (Astronomical Journal, Astrophysical Journal, Monthly Notices of the Royal Astronomical Society, Astronomy and Astrophysics, to mention a few) are online from their beginning (19th century in some cases).

    This is possible because the size of these journals was small enough, even with 100+ years’ accumulation, to be scanned in a reasonable amount of time.

    It may well be that government funds could be used similarly in other fields to advantage, and the cost might not be so large as to be prohibitive.

    Type “ADS Abstracts” into your browser to get a sense of what is available in this field.

  7. hal says:

    I think there are two compounding issues. (1) Old stuff often enters the common vocabulary and escapes citation. Decision trees often go uncited, for instance, especially in the context of boosting. (2) Recent tutorials/collections/books supersede old papers; e.g., many people cite the Cristianini and Shawe-Taylor book, or even the Vapnik book, for SVMs rather than the original paper. Both are signs that an old technique is important, but they reduce the recency of citation lists.

  8. Joe Kondel says:

    There have been some efforts towards this problem from the semantic web / large scale heterogeneous database systems area. One relatively recent one I remember is the Piazza system from U of Washington. Here’s one of the better papers.

  9. A Vezhnevets says:

    I totally agree with the point about the relation between a paper’s simplicity and its survivability. Most of the papers that had significant impact in the fields I’m familiar with (machine learning and vision) were those that proposed simple solutions to complicated tasks. It seems that simpler methods also achieve better results than complex ones (Occam’s razor?). The first thing that comes to my mind as an example is Boosting – a student can understand and implement it in a few hours.

  10. Charles says:

    I really like the comment that only teachable ideas survive. That said, it’s not an uncommon research contribution to make an unteachable idea become teachable. For example, simpler proofs are often found to replace important but complex ones; or complex algorithms are subsumed into a general framework that makes understanding them much simpler. The forward-backward algorithm is probably a good example of the latter phenomenon.

  11. chunyu says:

    Fresh researchers usually read the most recent work, and try to keep pace with the research trend in their area. Though there may be many fundamental problems to be solved, the interest of the mainstream research community may be biased toward only those that are easy to publish. If we choose different research areas, it may be dangerous.
