The Webscience Future

The internet has significantly affected the way we do research, but its capabilities have not yet been fully realized.

First, let’s acknowledge some known effects.

  1. Self-publishing By default, all researchers in machine learning (and more generally computer science and physics) place their papers online for anyone to download. The exact mechanism differs: physicists tend to use a central repository (Arxiv) while computer scientists tend to place the papers on their webpages. Arxiv has been slowly growing in subject breadth, so it is now sometimes used by computer scientists.
  2. Collaboration Email has enabled working remotely with coauthors. This has allowed collaborations which would not otherwise have been possible and generally speeds research.

Now, let’s look at attempts to go further.

  1. Blogs (like this one) allow public discussion about topics which are not easily categorized as “a new idea in machine learning” (like this topic).
  2. Organization of some subfield of research. This includes Satinder Singh’s Reinforcement Learning pages and, more generally, books that have been placed online such as this one.
  3. Discussion Groups The kernel machines discussions provide a good example of some common format allowing discussion.
  4. Class notes have been placed online such as Avrim’s learning theory lecture notes.
  5. Wikipedia has an article on Machine Learning. The article gives a reasonable quick overview and is surely read by a very large number of people.
  6. Online Proceedings are now being used by several conferences such as NIPS.

Now, let’s consider some futures.

  1. Wikifuture Wikipedia becomes better to the point where it is a completely comprehensive listing of current research in machine learning. At some point, we-the-community realize this and begin to emphasize (and credit) information placed in wikipedia. This process reinforces itself to the point where “if it’s not in wikipedia, it doesn’t exist”.

    This future is significantly more probable than most people understand. As evidence, compare the machine learning page three years ago (yep, it didn’t exist), two years ago, one year ago, and today. That progression strongly suggests that wikipedia:machine learning will continue to grow into a significant information resource.

    There are fundamental obstacles to the success of the wikipedia future.

    1. credit Wikipedia has only very weak mechanisms for crediting editors. A list of the changes done by one user account is about as much credit as is available. This is not enough to base career-deciding judgments on. We could hope for a stronger link between identity and editor along with tools to track the value of particular edits (think of counting hyperlinks as an analogue of counting citations).
    2. controversy Wikipedia has grown up in a nonmanipulative environment. When it was little known, the incentive to fabricate entries was not great. Now that it is becoming well known, that incentive is growing. Character assassination by false article exists. In science, the thing to worry about is misplaced ideas of the importance of your topic of research, since it is very difficult to be sufficiently interested in a research topic and simultaneously view it objectively. Research is about creating new ideas, and the location of these ideas in some general organization is in dispute by default.
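
The hyperlink-counting idea in the credit item can be made concrete with a small sketch. It assumes we already have a list of (source page, target page) links and a map from each page to the editor credited with it. Both inputs, and the function names, are hypothetical illustrations and not part of any real Wikipedia interface.

```python
from collections import Counter


def inbound_link_counts(links):
    """Count how many distinct pages link to each target page.

    `links` is an iterable of (source_page, target_page) pairs.
    """
    # De-duplicate so a page that links twice to the same target counts once.
    unique_links = set(links)
    return Counter(target for _source, target in unique_links)


def credit_by_editor(links, authorship):
    """Sum inbound-link counts over the pages attributed to each editor.

    `authorship` maps page name -> editor name (a simplifying assumption:
    real pages have many editors).
    """
    counts = inbound_link_counts(links)
    credit = Counter()
    for page, editor in authorship.items():
        credit[editor] += counts[page]  # missing pages count as zero
    return credit


links = [
    ("Boosting", "PAC learning"),
    ("Boosting", "PAC learning"),      # duplicate link, counted once
    ("VC dimension", "PAC learning"),
    ("PAC learning", "VC dimension"),
]
authorship = {"PAC learning": "alice", "VC dimension": "bob"}
credit = credit_by_editor(links, authorship)
```

A raw count like this is easy to manipulate, which is exactly the concern raised in the controversy item, so it is a starting point rather than a usable measure.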
  2. Evolutionary Progression Consider the following sequence of steps.
    1. Conference Organization We realize that having a list of online papers isn’t nearly as useful as having an organized list of online papers so the conferences which already have online proceedings create an explorable topic hierarchy.
    2. Time Organization We realize that the organization at one particular year’s conference is sketchy, since research is a multiyear endeavor. Consequently, we start adding to last year’s topic hierarchy rather than creating a new one from scratch each year.
    3. Transformation We realize that it is better if papers are done in the language of the web. For example, it’s very handy to be able to hyperlink inside of a paper. A good solution to the math on the web problem would greatly help here.
    4. Consolidation We realize that there is a lot of redundancy in two papers on the same or a similar topic. They share an introduction, motivation, and (often) definitions. By joining the shared pieces, the contents of both papers can be made clearer.

    Each of these individual steps clearly yields something better. At the end of these steps, creating a paper is simply the process of creating a webpage or altering an existing webpage. We can imagine doing all of this while keeping the peer-review mechanisms of science intact, so the resulting process is simply better in all ways. It’s easier to author because for most papers much of the “filler” introduction/motivation/definition can be reused from previous papers. It’s easier to review, because reviewers can consider the result in context. Much of the difficulty of reviewing is simply due to the author and reviewer not being “on the same page” in how they understand things. An organized topic hierarchy greatly aids this.
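
As a rough illustration of the first two steps (Conference Organization and Time Organization), here is a minimal sketch of a topic hierarchy that is extended each year rather than rebuilt. The `Topic` class, its methods, and the sample paper titles are all hypothetical; no conference actually uses this structure.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    papers: list = field(default_factory=list)    # (year, title) pairs
    children: dict = field(default_factory=dict)  # subtopic name -> Topic

    def add_paper(self, path, year, title):
        """File a paper under a topic path, creating subtopics as needed."""
        node = self
        for part in path:
            node = node.children.setdefault(part, Topic(part))
        node.papers.append((year, title))

    def all_papers(self):
        """Return every paper at or below this topic, oldest first."""
        found = list(self.papers)
        for child in self.children.values():
            found.extend(child.all_papers())
        return sorted(found)


# One hierarchy reused across years, instead of a fresh one per conference.
root = Topic("machine learning")
root.add_paper(["reinforcement learning"], 2004, "Paper A")
root.add_paper(["reinforcement learning", "exploration"], 2005, "Paper B")
root.add_paper(["learning theory"], 2005, "Paper C")

rl_papers = root.children["reinforcement learning"].all_papers()
```

Because papers from different years hang off the same nodes, a reviewer browsing a subtopic sees a new submission next to the earlier work it builds on.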

  3. The unknown It is difficult to anticipate everything. What other futures might exist?

Which future comes about depends on many things (the decisions of community leaders, enabling ‘math-on-the-web’ technologies, etc…), so it is difficult to predict which future will come about and when. Nevertheless, a potential exists and there are several paths leading towards reaching that potential.

18 Replies to “The Webscience Future”

  1. Here’s an idea. What I’d like to see is a database of online papers that will help you find new and interesting papers. Papers would be added both by web crawling (like Google Scholar) and by people posting papers (like arXiv). In addition to normal searches (such as looking for a specific paper or all papers by an author), there would be a system for finding interesting papers that you should look at. It might be a collaborative filtering system (perhaps like, although I haven’t actually tried that points you to references similar to papers you’re already interested in.

    This is inspired by a comment Sam Roweis made that we ought to put our papers online and get rid of the hassle and expense of conferences (I might be paraphrasing incorrectly). I’m not ready to get rid of conferences and paper reviewing, but I still wonder: how would you identify the interesting papers without a conference system? Is there an online system (perhaps using machine learning) that could help identify interesting papers to look at that you haven’t already seen? For example, I’d be interested in seeing all new papers that fit specific categories, e.g., papers by researcher X on topic Y, or all papers on a specific subtopic that I’m working on.

  2. Aaron: this is something I’ve been interested in (from a research perspective) for a while. Andrew McCallum has recently been working on REXA as a Google Scholar/CiteSeer replacement that does full inference over papers, authors, conferences/journals, grants, etc. I think that by combining IR/IE style technology with some of the stuff that Simone Teufel, Marti Hearst and others have been working on, we might be able to move toward making this a reality.

    In addition to the problem you mention, I would love to be able to go to something like Rexa (which, importantly, has user accounts) and say “give me a one page summary of everything that’s happened in reinforcement learning in the past 5 years.” Since it knows who I am, it presumably knows what papers I’ve read, what topics I’m interested in, etc. There’s a lot of new work that has to go into building something like this, including:

    One big difference between this problem and Aaron’s problem is that in my problem, I (as the user) am saying what is interesting. In Aaron’s problem, the machine must decide. Deciding interestingness and unexpectedness is studied (among other places) in the summarization literature, but it’s a well-known hard problem and I don’t think that without some sort of user modeling, it’s going to get anywhere.

  3. Sounds interesting. I’m certainly not dead-set on the machine figuring out everything that’s interesting. Perhaps it would be a mix — sometimes you specify topics you like or what you think of specific papers and researchers, and the rest of the time it tries to guess based on what it knows about you.

    What is Rexa? The main page didn’t give any information. There’s some time investment in trying out places like Rexa and CiteULike, so I and others will be reluctant to jump in without recommendations from other people. However, like other social networks, it might only be really useful if a lot of people use it. (On the other hand, Google Scholar was useful immediately as a search engine).

    I wonder if there would be concerns about the server-owner datamining your research interests. When I worked at Microsoft a few years ago, MS employees were prohibited from using IBM’s patent server database (that was before the USPTO database was usable). The fear was that IBM could see what MS employees were looking at and guess their business plans. Or that they could use server logs to prove that employees had looked at patents in patent-infringement lawsuits.

  4. Rexa is basically CiteSeer, but where Authors, Papers, Affiliations, Conferences, etc. are all first class objects. I.e., full coreference/entity matching is done between Authors, so you can track one guy’s full publication record. The quality of the IE (in a demo Andrew showed me recently) also appears to be significantly higher than CiteSeer and perhaps also Google Scholar. If you’ll be at NIPS, I believe he’s going to show a demo there.

    The whole privacy issue becomes important once you start talking about user modeling. This is unfortunate IMO because a lot of things users want (both types of users: users like my mom as well as users like me) can pretty much only be accomplished through personalization. I don’t have a good solution to this problem. I think it’s perhaps smaller in the academic world (minus Microsoft) than in the general world, which makes this seem to me to be a good place to test such techniques. The biomedical domain is another source of such problems: there, many more papers are published than can possibly be read, and the current methods available to bio researchers are pretty crummy. Again, perhaps they wouldn’t care too much about privacy issues in personalization if they get a lot of bang for their bucks (and they have a lot of the latter).

  5. I’ve used it to scan what other people are looking at (under a particular tag). I’ve uploaded my bibliography file so that it would be useful for others. So it’s very much like – but there’s too much hassle for me personally to use it as my default bib database – I’m still sticking with bibtex. Something like this should be a part of a system like citeseer, not a separate effort.

  6. I’m not sure that we want to put all of our eggs in the CiteSeer, Google Scholar, or REXA baskets. I’m happy to have those systems do what they are good at: crawl, extract, disambiguate, index, link. For recommendations, I would prefer overlays created by entities I trust (people, editorial or review boards) that link to existing repositories and indexing systems. A conference, or a journal, could reduce its workflow costs by becoming an overlay. For example, a NIPS submission would be just a link to an online resource at an accepted place. Reviewing or reading anonymity could be ensured by anonymized browsing systems. I’d love to get reading lists from people I trust as overlays, too. Those might be public (John Langford’s top 10 of 2005 posted as RSS on his site, say), or private (John’s private bottom 10 list of 2005 that he would make accessible through a secure channel to his close friends).

  7. I have an imperfect understanding of what ‘overlay’ means. Is it equivalent to “shortlist”? Or is there more structure to it?

    Do you have in mind a particular mechanism for doing this? (Or maybe “something like blah, but different in ways x, y, and z”?)

    Enabling anonymous browsing is easy. There are hundreds of anonymizing proxies and it’s easy to set up a personal one like, say, using existing software.
    Public shortlist systems are pretty easy—a blog seems an effective mechanism.

  8. I’m not sure that there’s a technical definition of “overlay”, but the term has been used in the context of “virtual” electronic journals that link to publicly visible papers. An essential feature is that the links are persistent, and the underlying document is guaranteed not to change (that’s the case with arXiv, but not with someone’s papers on their own site). Another valuable feature is some means of commenting on the papers, ideally with references to specific locations in the papers. I don’t know of any tool that allows that for PDF documents.

  9. I see. It would be easy for repositories (citeseer,, rexa, etc…) to enable overlays by having guaranteed permanent paper locations. doesn’t do that. Citeseer seems to cache papers, but there probably is no permanency semantics if it really is a cache. I have not explored Rexa—perhaps we should pester Andrew to make it so an account is not required by default. The wayback machine seems to support permalinks for the whole web (for example, here), but there seem to be some issues with snapshot rate and bugginess.

    A virtual electronic journal which is a blog pointing into a permalink-supporting repository would satisfy the commenting desire.

    I have never seen convincing use of hyperlinks into PDF documents, so I’m not hopeful that will materialize. A good solution to the math-on-the-web problem would avoid this need.

  10. This is cool, but it confuses me. It seems that wikipedia machine learning has more content than MLpedia. Why is MLpedia preferred? Are the editing policies somehow different from wikipedia?

    The default for random people interested in machine learning is probably wikipedia rather than MLpedia. Given that, isn’t effort on the wikipedia machine learning resources preferred? (Does MLpedia somehow backend on wikipedia? Or could it be made to?)

  11. MLpedia does have different editing policies to Wikipedia – it allows documentation of active research (which is against wikipedia policies) and its discussion pages are used to discuss the application of the techniques rather than the articles themselves. In addition, it allows for articles like ‘Interesting papers at CVPR 2006’ which would not be appropriate for Wikipedia. It also allows the inclusion of tutorials, source code etc. which are again inappropriate for Wikipedia.
