A conversation between Theo and Pat

Pat (the practitioner): I need to do multiclass classification and I only have a decision tree.

Theo (the theoretician): Use an error-correcting output code.

Pat: Oh, that’s cool. But the created binary problems seem unintuitive. I’m not sure the decision tree can solve them.

Theo: Oh? Is your problem a decision list?

Pat: No, I don’t think so.

Theo: Hmm. Are the classes well separated by axis-aligned splits?

Pat: Err, maybe. I’m not sure.

Theo: Well, if they are, under the IID assumption I can tell you how many samples you need.

Pat: IID? The data is definitely not IID.

Theo: Oh dear.

Pat: Can we get back to the choice of ECOC? I suspect we need to build it dynamically in response to which subsets of the labels are empirically separable from each other.

Theo: Ok. What do you know about your problem?

Pat: Not much. My friend just gave me the dataset.

Theo: Then no one can help you.

Pat: (What a fuzzy thinker. Theo keeps jumping to assumptions that just aren’t true.)

Theo: (What a fuzzy thinker. Pat’s problem is unsolvable without making extra assumptions.)

I’ve heard variants of this conversation several times. The fundamental difference in viewpoint is the following:

  1. Theo lives in a world where he chooses the problem to solve based upon the learning model (and assumptions) used.
  2. Pat lives in a world where the problem is imposed on him.

I’d love for these confusions to go away, but there is no magic wand. The best advice seems to be: listen carefully and avoid assuming too much in what you hear.
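As an aside, Theo’s ECOC suggestion is concrete enough to sketch in code. The snippet below is only an illustration of the idea, not part of the conversation: it builds a random {0,1} code matrix, trains one decision tree per column using scikit-learn’s DecisionTreeClassifier, and decodes predictions by Hamming distance. The helper names fit_ecoc and predict_ecoc are invented for this sketch.

```python
# Sketch of error-correcting output codes (ECOC) with decision trees.
# Assumes numpy and scikit-learn; helper names are made up for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_ecoc(X, y, n_bits=15, seed=0):
    """Train one decision tree per column of a random {0,1} code matrix."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)  # sorted distinct labels
    # Each class gets a random codeword; each column defines one binary problem.
    # A real implementation would reject degenerate columns (all 0s or all 1s).
    code = rng.integers(0, 2, size=(len(classes), n_bits))
    trees = []
    for b in range(n_bits):
        bits = code[np.searchsorted(classes, y), b]  # relabel y as 0/1 for this column
        trees.append(DecisionTreeClassifier().fit(X, bits))
    return classes, code, trees

def predict_ecoc(X, classes, code, trees):
    """Predict each bit with its tree, then decode to the nearest class codeword."""
    bits = np.column_stack([t.predict(X) for t in trees])
    dists = np.abs(bits[:, None, :] - code[None, :, :]).sum(axis=2)  # Hamming distance
    return classes[dists.argmin(axis=1)]
```

scikit-learn also ships an OutputCodeClassifier that packages essentially this recipe; Pat’s wish for a code matrix chosen dynamically from empirically separable label subsets would still require a custom construction.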

John Langford → Yahoo Research, NY

I will join Yahoo Research (in New York) after my contract ends at TTI-Chicago.

The deciding reasons are:

  1. Yahoo is running into many hard learning problems. This is precisely the situation where basic research might hope to have the greatest impact.
  2. Yahoo Research understands research, including publishing, conferences, etc.
  3. Yahoo Research is growing, so there is a chance I can help it grow well.
  4. Yahoo understands the internet, including (but not at all limited to) experimenting with research blogs.

In the end, Yahoo Research seems like the place where I might have a chance to make the greatest difference.

Yahoo (as a company) has made a strong bet on Yahoo Research. We-the-researchers all hope that bet will pay off, and this seems plausible. I’ll certainly have fun trying.

Conferences, Workshops, and Tutorials

This is a reminder that many deadlines for summer conference registration are coming up, and attendance is a very good idea.

  1. It’s entirely reasonable for anyone to visit a conference once, even when they don’t have a paper. For students, visiting a conference is almost a ‘must’: there is nowhere else that a broad cross-section of research is on display.
  2. Workshops are also a very good idea. ICML has 11, KDD has 9, and AAAI has 19. Workshops provide an opportunity to get a good understanding of some current area of research. They are probably the forum most conducive to starting new lines of research because they are so interactive.
  3. Tutorials are a good way to gain some understanding of a long-standing direction of research. They are generally more coherent than workshops. ICML has 7 and AAAI has 15.

Rexa is live

Rexa is now publicly available. Anyone can create an account and log in.

Rexa is similar to Citeseer and Google Scholar in functionality, with more emphasis on the use of machine learning for intelligent information extraction. For example, Rexa can automatically find and display a picture from an author’s homepage when the author is searched for.

JMLR is a success

In 2001, the “Journal of Machine Learning Research” was created in reaction to inflexible publisher policies at MLJ. Essentially, with the creation of the internet, the bottleneck in publishing research shifted from publishing to research. The declaration of independence accompanying this move expresses the reasons why in greater detail.

MLJ has strongly changed its policy in reaction to this. In particular, there is no longer an assignment of copyright to the publisher (*), and MLJ regularly sponsors many student “best paper awards” across several conferences with cash prizes. This is an advantage of MLJ over JMLR: MLJ can afford to sponsor cash prizes for the machine learning community. The remaining disadvantage is that reading papers in MLJ sometimes requires searching for the author’s website where the free version is available. In contrast, JMLR articles are freely available to everyone off the JMLR website. Whether or not this disadvantage cancels the advantage is debatable, but essentially no one working on machine learning argues with the following: the changes brought by the creation of JMLR have been positive for the general machine learning community.

This model can and should be emulated in other areas of research where publishers are not behaving in a sufficiently constructive manner. Doing so requires two vital ingredients: a consensus of leaders to support a new journal and the willingness to spend the time and effort setting it up. Presumably, some lessons on how to do this have been learned by the editors of JMLR, and they are willing to share them.

(*) Back in the day, it was typical to be forced to sign over all rights to your journal paper, then ignore this and place it on your homepage. The natural act of placing your paper on your webpage is no longer illegal.