December 2005 – Page 3 – Machine Learning (Theory)

In everyday use, a model is a system that explains the behavior of some other system, ideally at a level where an alteration of the model predicts the corresponding alteration of the real-world system. In machine learning, “model” has several variant definitions.

Everyday. The common definition is sometimes used.
Parameterized. Sometimes “model” is shorthand for “parameterized model”. Here, it refers to a model with unspecified free parameters. In the Bayesian learning approach, you typically have a prior over (everyday) models.
Predictive. Even further from everyday use is the predictive model. Examples of this are “my model is a decision tree” or “my model is a support vector machine”. Here, there is no real sense in which an SVM explains the underlying process. For example, an SVM tells us nothing in particular about how alterations to the real-world system would create a change.

Which definition is being used at any particular time is important information. For example, if it’s a parameterized or predictive model, this implies some learning is required. If it’s a predictive model, then the set of operations which can be done to the model is restricted with respect to everyday usage. I don’t have any particular advice here other than “watch out”: be aware of the distinctions, watch for this source of ambiguity, and clarify when necessary.
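To make the three senses concrete, here is a minimal Python sketch. The falling-object setting and every name in it (`fall_distance`, `fit_g`, `nearest_neighbor`) are assumptions chosen purely for illustration, and the nearest-neighbor predictor merely stands in for things like decision trees or SVMs.

```python
from typing import Callable, List, Tuple

Data = List[Tuple[float, float]]  # (time in seconds, distance in meters)

# "Everyday" model: an explanatory formula with every quantity specified.
# Altering g (say, moving to the Moon) predicts how the real system changes.
def fall_distance(t: float, g: float = 9.81) -> float:
    return 0.5 * g * t * t

# "Parameterized" model: the same form with g left free, so learning is needed.
def fit_g(data: Data) -> float:
    # Least squares for d = g * x, where x = 0.5 * t^2.
    num = sum(0.5 * t * t * d for t, d in data)
    den = sum((0.5 * t * t) ** 2 for t, _ in data)
    return num / den

# "Predictive" model: 1-nearest-neighbor over observed pairs. It predicts,
# but offers no handle for asking "what if gravity were different?".
def nearest_neighbor(data: Data) -> Callable[[float], float]:
    def predict(t: float) -> float:
        return min(data, key=lambda pair: abs(pair[0] - t))[1]
    return predict

observations = [(t, fall_distance(t)) for t in (0.5, 1.0, 1.5, 2.0)]
g_hat = fit_g(observations)            # recovers g from the data
predictor = nearest_neighbor(observations)
```

Calling `fall_distance(1.0, g=1.62)` answers a question about an altered world; `predictor` has no argument through which such a question could even be asked, which is the restriction on operations mentioned above.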

The internet has significantly affected the way we do research, but its capabilities have not yet been fully realized.

First, let’s acknowledge some known effects.

Self-publishing. By default, all researchers in machine learning (and more generally computer science and physics) place their papers online for anyone to download. The exact mechanism differs: physicists tend to use a central repository (arXiv), while computer scientists tend to place papers on their own webpages. arXiv has been slowly growing in subject breadth, so it is now sometimes used by computer scientists.
Collaboration. Email has enabled working remotely with coauthors. This has allowed collaborations which would not otherwise have been possible, and it generally speeds research.

Now, let’s look at attempts to go further.

Blogs (like this one) allow public discussion about topics which are not easily categorized as “a new idea in machine learning” (like this topic).
Organization of some subfield of research. This includes Satinder Singh’s Reinforcement Learning pages and, more generally, books that have been placed online such as this one.
Discussion Groups. The kernel machines discussions provide a good example of a common format allowing discussion.
Class notes have been placed online such as Avrim’s learning theory lecture notes.
Wikipedia has an article on Machine Learning. The article gives a reasonable quick overview and is surely read by a very large number of people.
Online Proceedings are now being used by several conferences such as NIPS.

Now, let’s consider some futures.

Wikifuture. Wikipedia improves to the point where it is a completely comprehensive listing of current research in machine learning. At some point, we-the-community realize this and begin to emphasize (and credit) information placed in Wikipedia. This process reinforces itself to the point where “if it’s not in Wikipedia, it doesn’t exist”.
This future is significantly more probable than most people understand. As evidence, compare the machine learning page three years ago (yep, it didn’t exist), two years ago, one year ago, and today. That progression strongly suggests that wikipedia:machine learning will continue to grow into a significant information resource.

There are fundamental obstacles to the success of the wikipedia future.
1. Credit. Wikipedia has only very weak mechanisms for crediting editors. A list of the changes made by one user account is about as much credit as is available. This is not enough to base career-deciding decisions on. We could hope for a stronger link between identity and editor, along with tools to track the value of particular edits (think of counting hyperlinks as an analogue for counting citations).
2. Controversy. Wikipedia has grown up in a nonmanipulative environment. When it was little known, the incentive to fabricate entries was not great. Now that it is becoming well known, that incentive is growing. Character assassination by false article exists. In science, the thing to worry about is a misplaced idea of the importance of your own research topic, since it is very difficult to be sufficiently interested in a research topic and simultaneously view it objectively. Research is about creating new ideas, and the location of these ideas in some general organization is in dispute by default.
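The “counting hyperlinks” idea under the credit obstacle can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the page names, the `authorship` mapping from page to a single editor, and both function names are hypothetical, and real wiki edits are far finer-grained than whole pages.

```python
from collections import Counter
from typing import Dict, List

def inbound_link_counts(links: Dict[str, List[str]]) -> Counter:
    """Count how many distinct pages link to each target page."""
    counts: Counter = Counter()
    for source, targets in links.items():
        for target in set(targets):   # a source counts once per target
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts

def credit_by_editor(counts: Counter, authorship: Dict[str, str]) -> Counter:
    """Sum inbound-link counts over the pages attributed to each editor."""
    credit: Counter = Counter()
    for target, n in counts.items():
        editor = authorship.get(target)
        if editor is not None:
            credit[editor] += n
    return credit

# Hypothetical link graph: page -> pages it links to.
links = {
    "boosting": ["pac_learning", "margins"],
    "svm": ["margins", "margins", "kernels"],
    "margins": ["margins", "pac_learning"],
}
# Hypothetical attribution of each page to one editor.
authorship = {"pac_learning": "alice", "margins": "bob", "kernels": "alice"}

counts = inbound_link_counts(links)
credit = credit_by_editor(counts, authorship)
```

Like raw citation counts, a tally of this kind is easy to game, which ties directly into the controversy obstacle.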
Evolutionary Progression. Consider the following sequence of steps.
1. Conference Organization. We realize that having a list of online papers isn’t nearly as useful as having an organized list of online papers, so the conferences which already have online proceedings create an explorable topic hierarchy.
2. Time Organization. We realize that the organization at one particular year’s conference is sketchy, since research is a multiyear endeavor. Consequently, we start adding to last year’s topic hierarchy rather than creating a new one from scratch each year.
3. Transformation. We realize that it is better if papers are done in the language of the web. For example, it’s very handy to be able to hyperlink inside of a paper. A good solution to the math-on-the-web problem would greatly help here.
4. Consolidation. We realize that there is a lot of redundancy in two papers on the same or a similar topic. They share an introduction, motivation, and (often) definitions. By joining the shared pieces, the contents of both papers can be made clearer.
Each of these individual steps clearly yields something better. At the end of these steps, creating a paper is simply the process of creating a webpage or altering an existing webpage. We can imagine doing all of this while keeping the peer-review mechanisms of science intact, so the resulting process is simply better in all ways. It’s easier to author because for most papers much of the “filler” introduction/motivation/definition can be reused from previous papers. It’s easier to review, because reviewers can consider the result in context. Much of the difficulty of reviewing is simply due to the author and reviewer not being “on the same page” in how they understand things. An organized topic hierarchy greatly aids this.
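As a rough illustration of the first two steps, a topic hierarchy that is extended year after year could look something like the following Python sketch. The `Topic` class, its methods, and the topic and paper names are all hypothetical; an actual conference system would need much more (review state, hyperlinks, shared definitions).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Topic:
    name: str
    papers: List[Tuple[int, str]] = field(default_factory=list)  # (year, title)
    children: Dict[str, "Topic"] = field(default_factory=dict)

    def add_paper(self, path: List[str], year: int, title: str) -> None:
        """File a paper under a topic path, creating subtopics as needed."""
        node = self
        for part in path:
            node = node.children.setdefault(part, Topic(part))
        node.papers.append((year, title))

    def count(self) -> int:
        """Number of papers at or below this topic."""
        return len(self.papers) + sum(c.count() for c in self.children.values())

# One hierarchy is reused across years instead of being rebuilt each year.
root = Topic("machine learning")
root.add_paper(["learning theory", "boosting"], 2004, "hypothetical paper A")
root.add_paper(["learning theory", "boosting"], 2005, "hypothetical paper B")
root.add_paper(["reinforcement learning"], 2005, "hypothetical paper C")
```

Because the 2005 papers land in the same nodes as the 2004 ones, a reviewer browsing `boosting` sees a new result next to its predecessors, which is the “in context” benefit described above.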
The unknown. It is difficult to anticipate everything. What other futures might exist?

Which future comes about depends on many things, including the decisions of community leaders and enabling ‘math-on-the-web’ technologies, so it is difficult to predict which future will come about and when. Nevertheless, a potential exists and there are several paths leading towards that potential.

Watchword: model

The Webscience Future