Machine Learning (Theory)

9/18/2006

What is missing for online collaborative research?

Tags: Machine Learning jakester@ 6:25 am

The internet has recently made the research process much smoother: papers are easy to obtain, citations are easy to follow, and unpublished “tutorials” are often available. Yet new research fields can look very complicated to outsiders or newcomers. Every paper is like a small piece of an unfinished jigsaw puzzle: to understand just one publication, a researcher without experience in the field will typically have to follow several layers of citations, and many of the papers encountered along the way repeat a great deal of information. Furthermore, notation and terminology may not be consistent from one publication to the next, which can further confuse the reader.

But the internet is now proving to be an extremely useful medium for collaboration and knowledge aggregation. Online forums allow users to ask and answer questions and to share ideas. The recent phenomenon of Wikipedia provides a proof-of-concept for the “anyone can edit” system. Can such models be used to facilitate research and collaboration? This could potentially be extremely useful for newcomers and experts alike. On the other hand, entities of this sort already exist to some extent: Wikipedia::Machine Learning, MLpedia, the discussion boards on kernel-machines.org, Rexa, and the gradual online-ification of paper proceedings, to name a few.

None of these have yet achieved takeoff velocity. You’ll know that takeoff velocity has been achieved when these become a necessary part of daily life rather than a frill.

Each of these efforts seems to be missing critical pieces, such as:

  1. A framework for organizing and summarizing information. Wikipedia and MLpedia are good examples, yet this is not as well solved as you might hope, because mathematics on the web is still more awkward than it should be.
  2. A framework for discussion. Kernel-machines.org handles this, but is too area-specific. There does exist a discussion framework on Wikipedia/MLpedia, but the presentation format marginalizes discussion, which is placed on a separate page and not viewed by most observers. The discussion, in fact, should be an integral part of the presentation.
  3. Incentives for researchers to contribute. Wikipedia intentionally anonymizes contributors in the presentation, because recognizing them might invite the wrong sort of contributor. Incentives done well, however, are one of the things creating (6). One of the existing constraints within academia is that the basic unit of credit is coauthorship on a peer-reviewed paper. Given this constraint, it would be very handy if a system could automatically translate a subset of an online site into a paper, with authorship automatically summarized. The site itself might also track and display who has contributed how much and who has contributed recently.
  4. Explicit mechanisms for handling disagreements. If you get 3 good researchers on a topic in a room, you might have about 5 distinct opinions. Much of research has to do with thinking carefully about what is important and why, which are the sorts of topics likely to provoke disagreement. Given that disagreement is a part of the process of research, a healthy online research mechanism needs a way to facilitate, and even spotlight, disagreements. One crude system for handling disagreements is illustrated by the Linux kernel: “anyone can download and start their own kernel tree”. A more fine-grained version of this may be effective: “anyone can clone a webpage and start their own version of it”. Perhaps this can be coupled with a version voting system, although that is tricky. A fundamental point is: a majority vote does not determine the correctness of a theorem. Integrating a peer review system may work well. None of the existing systems handle this problem effectively.
  5. Low entry costs. Many systems handle this well, but it must be emphasized because small changes in the barrier to entry can have a large effect on (6).
  6. Community buy-in. Wikipedia is the big success story here, but Wikipedia::Machine Learning has had more limited success. There are many techniques which might aid community buy-in, but they may not be enough.
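
The contribution tracking suggested in (3) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Edit` record, the "characters changed" measure of contribution, and the function names are all hypothetical, and no existing site is known to work this way. It shows how an edit log could yield an author ordering for a chosen subset of pages, plus a list of recent contributors.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Edit:
    """One recorded change to a page (hypothetical record format)."""
    contributor: str
    page: str
    chars_changed: int
    day: date


def author_list(edits, pages):
    """Order contributors to the given pages by total characters changed."""
    totals = Counter()
    for e in edits:
        if e.page in pages:
            totals[e.contributor] += e.chars_changed
    # Largest contribution first; ties are broken alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked]


def recent_contributors(edits, since):
    """Names of everyone with at least one edit on or after `since`."""
    return sorted({e.contributor for e in edits if e.day >= since})


# Made-up edit log, purely for illustration.
edits = [
    Edit("alice", "boosting", 1200, date(2006, 9, 1)),
    Edit("bob", "boosting", 300, date(2006, 9, 10)),
    Edit("carol", "kernels", 900, date(2006, 9, 12)),
    Edit("bob", "margins", 1500, date(2006, 8, 2)),
]

print(author_list(edits, {"boosting", "margins"}))
print(recent_contributors(edits, date(2006, 9, 5)))
```

Characters changed is a crude proxy for intellectual contribution, so a real system would probably need something closer to the peer review mentioned in (4) to weight edits sensibly.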

Can a site be created that simultaneously handles all of the necessary pieces for online research?

17 Comments to “What is missing for online collaborative research?”
  1. Darius Bacon says:

    There’s a pre-Web but very forward-looking paper, “Hypertext Publishing and the Evolution of Knowledge”, on this sort of thing. Its scenario of what a really useful scholarly web might be like even used an example related to machine learning.

  2. Anonymous says:

    Hmmm, the four examples given are quite depressing … Wikipedia::Machine Learning is a bunch of abandoned drafts of articles written mostly by students (it appears). MLpedia is a collection of stubs. kernel-machines.org – a nice collection of three-year-old news. Rexa – apparently the most useless publication search engine – the only one which manages to get the authors’ names wrong. So I agree with the post – there’s a long way to go till we have online research tools which can make an impact.

  3. Spurred by the posting on 9/12, a few colleagues (Maneesh Agarwala, Sameer Agarwal) and I recently discussed a system that would be pretty easy to put into practice and to bootstrap; it’s tempting to go ahead and try it out.

    The goal is to enable online discussion of existing technical papers. (This is slightly different from the goal stated above, but it could evolve into a general research discussion board.) With more recent papers, we want to have discussion in order to understand the work and its significance, and to hear what other people think of the paper. Often, the most interesting aspects come out later in discussions with colleagues, and it’s a matter of luck as to whether these discussions happen (and whether you happen to be present). With older papers, it’s really useful to find out the history and significance of the paper (the oral tradition associated with the paper), which is otherwise not written down. The authors (or other implementers) often have hindsight about the work which is not written in the paper.

    The discussion board we propose would be based on discussion threads, most of which would revolve around an individual paper, or two related papers, etc. It would resemble a blog in some ways, except that old papers and posts would presumably remain interesting long after the original posting. It would be easily searchable by paper (so that you can find a specific paper in an area), or by tags (so that you can browse areas). Postings would be rated for usefulness (as on Slashdot and Amazon), which helps you pick out the useful ones and provides additional motivation for people to write good things.

    It’s easy to bootstrap: we’d simply have our paper-reading research meetings use the board, and perhaps even require participation for classes. If we do this at multiple universities at the same time, where we are reading the same papers in our seminars, then there could be discussion across the universities.

    In the future, I would envision such a board being merged with paper indices (like Rexa) and collaborative filtering (like del.icio.us), but this is somewhat orthogonal.

  4. By the way, if anyone can recommend some existing web software that could be modified to suit this purpose, I’d be interested to hear it. It should probably be more like a discussion board or blog than a wiki.

    Another crucial feature I didn’t mention is the ability to track discussions, e.g., to have one webpage where you go to see all the latest discussion on papers/topics that interest you, or to have it all sent to you daily by email, etc.

  5. KaraNagai says:

    It is probably possible to use the system of notes they have in citeulike.org for discussion of papers. In any case, I would expect them to introduce more comprehensive functionality of that kind.

  6. You might take a look at this blog. As you can see, it allows LaTeX to be used in the comments. You can set up a comments feed, or set up a page listing latest comments, such as here.

  7. jl says:

    I definitely encourage you to try this out. Experimenting with such new approaches is pretty worthwhile.

    The format you imagine seems most like kernel-machines.org. Can you outline what the differences are? It seems like:
    - More focused discussion (centered on papers).
    - More general topics.

    What else? Good support for math would be great, but it’s a tricky problem.

  8. The main thing is paper-focused discussion, although I think it can expand beyond that. Also, there should be a good mechanism for indexing papers and topics, by appropriate tagging of discussions. kernel-machines.org seems to be too unstructured; it has pages and pages of discussions on miscellaneous topics, with the ones that fall off the bottom of the page being forgotten. I’d like to be able to easily find the discussion relating to a certain paper, or find all discussions in which a certain paper is mentioned (and this isn’t just for vanity searching :), and to search by tags or combinations of tags. Ideally, there’d be links from citation indices: find the paper you want on Google Scholar or Rexa, and then go directly to the discussion on that paper. There should also be a mechanism for “watching” discussions that you’re interested in.

    I think paper-focused discussion is a related, but somewhat distinct goal from general-purpose discussion, one for which specific tools (and longevity of discussion) would be useful.

    The other thing which, I think, is different is the idea of bootstrapping the discussion by incorporating it into group meetings/seminar classes.

  9. That does look relevant, and the source code is available.

  10. Allan Erskine says:

    I think this is a very interesting topic. For my part I can recommend Plone as a simple yet full-featured web publishing system. I have set it up as an intranet wiki/CMS at two different companies, and have found it to require little or no maintenance once up.

    Lots of interesting projects in the Java space too, particularly those backing the JSR 170 standard. Magnolia seems to be worth special mention for its usability. The reason to consider a site backed by a standard is that it protects the investment of the community.

    As for smaller systems, e.g., the MoveableType(?)/itex-module combo that powers the n-Category cafe, my guess is that they will not be extensible enough to provide a solid foundation for your worthy goals.

    To really get something up and running which incorporates all the features you mentioned will require a fair old effort. It could be a great project for Google’s 2007 Summer of Code. (Or Yahoo’s equivalent!)

  11. KaraNagai says:

    I did not know that it is an open source project. If it is, that could be an interesting point to start from.

  12. jl says:

    Even using Firefox, I run into difficulties here. The problem is that the appropriate MathML fonts are missing by default (and presumably missing for many others).

  13. Greg Wilson says:

    Readers may be interested in the report Jon Udell prepared for Los Alamos National Laboratory titled “Internet Groupware for Scientific Collaboration”. The URL is http://207.22.26.166/GroupwareReport.html (or you can Google for the title terms).

  14. Phil Cowans says:

    I’d be very interested in something along these lines. I think an important aspect would be to persuade people to use it as an everyday part of doing research, rather than an ‘external’ project. Making it easy to present maths is definitely important, but not the end of the story. I guess what I have in mind is a kind of wiki where the original page is based on the paper itself, and where notes/comments/discussion can easily be tied to particular sections. It would be great to be able to visually cross-reference particular sections to other papers, so that people can keep the links up to date as new papers are published or people find old papers which the original authors didn’t know about. I’d imagine that people would want various degrees of sharing – personal notes, shared information within research groups and fully public annotations, but probably with an encouragement to discuss things publicly.

  15. Yes, but downloading the MathML fonts is the work of a few moments.

    More important than the software are the personnel and their methods of engendering debate. It’s critical to quickly delete comments which are irrelevant, crackpot, etc. Keeping the tone respectful is also vital: agonism, not antagonism. But perhaps the most important lesson I’ve learned is that the best threads occur when a handful of people, and perhaps only 2 or 3, have a reasonably clear conception of what they’re looking for, and one person keeps the conversation on track, summing up where you’ve got to so far, and setting the course for the next phase.

  16. hyip says:

    Hmmm, the four examples given are quite depressing … Wikipedia::Machine Learning is a bunch of abandoned drafts of articles written mostly by students (it appears). MLpedia is a collection of stubs. kernel-machines.org – a nice collection of three-year-old news. Rexa – apparently the most useless publication search engine – the only one which manages to get the authors’ names wrong. So I agree with the post – there’s a long way to go till we have online research tools which can make an impact.

  17. zslevi says:

    What about Scholarpedia?
