Not ICML

Alex Smola showed me this ICML 2006 webpage. This is NOT the ICML we know, but rather some people at “Enformatika”. A little investigation shows that the domain was registered with an anonymous Yahoo email account through dotregistrar.com, the “Home of the $6.79 wholesale domain!”, and that the nameservers are run by Turkticaret, a Turkish internet company.

It appears the website has since been altered to “ICNL” (the above link uses the Google cache).

They say that imitation is the sincerest form of flattery, so the organizers of the real ICML 2006 must feel quite flattered.

Research in conferences

Conferences exist as part of the process of doing research. They serve many roles, including “announcing research”, “meeting people”, and “point of reference”. Not all conferences are alike, so a basic question is: “to what extent do individual conferences attempt to aid research?” This question is very difficult to answer in any satisfying way. What we can do is compare details of the process across multiple conferences.

  1. Comments The average quality of comments across conferences can vary dramatically. At one extreme, the tradition in CS theory conferences is to provide essentially zero feedback. At the other extreme, some conferences have a strong tradition of providing detailed constructive feedback. Detailed feedback can give authors significant guidance about how to improve research. This is the most subjective entry.
  2. Blind Virtually all conferences offer single blind review where authors do not know reviewers. Some also provide double blind review where reviewers do not know authors. The intention with double blind reviewing is to make the conference more approachable to first-time authors.
  3. Author Feedback Author feedback is a mechanism where authors can provide feedback to reviewers (and, to some extent, complain). Providing an author feedback mechanism provides an opportunity for the worst reviewing errors to be corrected.
  4. Conditional Accepts A conditional accept is some form of “we will accept this paper if conditions X,Y, and Z are met”. A conditional accept allows reviewers to demand different experiments or other details they need in order to make a decision. This might speed up research significantly because otherwise good papers need not wait another year.
  5. Papers/PC member How many papers can one person actually review well? When there is a heavy load of papers to review, it becomes very tempting to make snap decisions without a thorough attempt at understanding, and snap decisions are often wrong. These numbers are based on the number of submissions, with a computer science standard of 3 reviews per paper (a small worked sketch of the calculation follows this list).
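
A minimal sketch of how a “Reviews/PC member” figure like those in the table below can be derived; the submission and PC counts used here are hypothetical, not the actual numbers for any conference.

```python
def reviews_per_pc_member(num_submissions, num_pc_members, reviews_per_paper=3):
    """Total reviews required, divided across the program committee."""
    return num_submissions * reviews_per_paper / num_pc_members

# Hypothetical example: 600 submissions, 3 reviews each, a 75-person PC.
print(reviews_per_pc_member(600, 75))  # -> 24.0 reviews per PC member
```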

Each of these “options” makes reviewing more difficult by requiring more reviewer work. There is a basic trade-off between the amount of time spent reviewing vs. working on new research, and the speed of the review process itself. It is unclear where the optimal trade-off point lies, but the easy default is “not enough time spent reviewing” because reviewing is generally an unrewarding job.

It seems reasonable to cross-reference these options with some measures of ‘conference impact’. For each of these, it’s important to realize they are not goal metrics, so their meaning is unclear. The best that can be said is that it is not bad to do well. Also keep in mind that measurements of “impact” are inherently “trailing indicators” which are not necessarily relevant to the way the conference is currently run.

  1. average citations Citeseer has been used here to estimate the average impact of a conference’s papers via the average number of citations per paper. The table below reports log(average citations per paper + 1); a small sketch of that computation follows the table.
  2. max citations A number of people believe that the maximum number of citations given to any one paper is a strong indicator of the success of the conference. This can be measured by going to scholar.google.com and using ‘advanced search’ for the conference name.
| Conference | Comments | Blindness | Author feedback | Conditional accepts | Reviews/PC member | log(average citations per paper + 1) | Max citations |
|---|---|---|---|---|---|---|---|
| ICML | Sometimes Helpful | Double | Yes | Yes | 8 | 2.12 | 1079 |
| AAAI | Sometimes Helpful | Double | Yes | No | 8 | 1.87 | 650 |
| COLT | Sometimes Helpful | Single | No | No | 15? | 1.49 | 710 |
| NIPS | Sometimes Helpful / Sometimes False | Single | Yes | No | 113(*) | 1.06 | 891 |
| CCC | Sometimes Helpful | Single | No | No | 24 | 1.25 | 142 |
| STOC | Not Helpful | Single | No | No | 41 | 1.69 | 611 |
| SODA | Not Helpful | Single | No | No | 56 | 1.51 | 175 |

(*) To some extent this is a labeling problem. NIPS has an organized process of finding reviewers very similar to ICML. They are simply not called PC members.
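
For concreteness, a minimal sketch of the citation column above. The per-paper citation counts here are invented for illustration (the table’s values came from Citeseer), and the logarithm base used in the table is not stated, so natural log is assumed.

```python
import math

# Hypothetical per-paper citation counts for one conference's proceedings.
citations = [0, 2, 5, 1, 30, 4]

avg = sum(citations) / len(citations)  # average citations per paper
print(math.log(avg + 1))               # log(average citations per paper + 1), natural log assumed
```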

Keep in mind that the above is a very incomplete list (it only includes the conferences that I interacted with) and feel free to add details in the comments.

Prediction Bounds as the Mathematics of Science

“Science” has many meanings, but one common meaning is “the scientific method” which is a principled method for investigating the world using the following steps:

  1. Form a hypothesis about the world.
  2. Use the hypothesis to make predictions.
  3. Run experiments to confirm or disprove the predictions.

The ordering of these steps is very important to the scientific method. In particular, predictions must be made before experiments are run.

Given that we all believe in the scientific method of investigation, it may be surprising to learn that cheating is very common. This happens for many reasons, some innocent and some not.

  1. Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine their effect. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here: including “reprobleming”, “data set selection”, and probably “overfitting by review”. It isn’t too surprising to observe this: when the testers of a drug have $10^9 or more riding on the outcome, the temptation to make the outcome “right” is extreme.
  2. Wrong experiments. When conducting experiments on some new phenomenon, it is common for the experimental apparatus to simply not work right. In that setting, throwing out the “bad data” can make the results much cleaner… or it can simply be cheating. Millikan did this in the ‘oil drop’ experiment which measured the electron charge.

Done right, allowing some kinds of “cheating” may be helpful to the progress of science since we can more quickly find the truth about the world. Done wrong, it results in modern nightmares like painkillers that cause heart attacks. (Of course, the more common outcome is that the drug’s effectiveness is simply overstated.)

A basic question is “How do you do it right?” And a basic answer is “With prediction theory bounds”. These prediction bounds have a number of things in common:

  1. They assume that the data is drawn independently and identically. This is well suited to experimental situations where experimenters work very hard to make different experiments independent. In fact, this is a better fit than typical machine learning applications, where independence of the data is more questionable or simply false.
  2. They make no assumption about the distribution that the data is drawn from. This is important for experimental testing of predictions because the distribution that observations are expected to come from is a part of the theory under test.

These two properties form an ‘equivalence class’ of mathematical bounds, where each bound can be trusted to an equivalent degree. Inside this equivalence class there are several that may be helpful in determining whether deviations from the scientific method are reasonable or not.

  1. The most basic test set bound corresponds to the scientific method above (a sketch of its standard form appears after this list).
  2. The Occam’s Razor bound allows a careful reordering of steps (1), (2), and (3), and is also sketched below. More “interesting” bounds like the VC-bound and the PAC-Bayes bound allow more radical alterations of these steps. Several are discussed here.
  3. The Sample Compression bound allows careful disposal of some datapoints.
  4. Progressive Validation bounds (such as here, here or here) allow hypotheses to be safely reformulated in arbitrary ways as experiments progress.
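
For reference, here is a hedged sketch of the two simplest bounds mentioned above, in their standard Hoeffding-style form; the precise statements and constants in the linked papers may differ.

```latex
% A sketch of the two simplest bounds, in Hoeffding-style form.
% e(h): true error rate; \hat{e}(h): error rate on m i.i.d. test examples.

% Test set bound: h is fixed before the test data is drawn
% (prediction before experiment).
\[ \Pr\!\left[ e(h) \le \hat{e}(h) + \sqrt{\tfrac{\ln(1/\delta)}{2m}} \right] \ge 1 - \delta \]

% Occam's Razor bound: a prior p over hypotheses is fixed before any data
% is seen; the bound then holds simultaneously for all h.
\[ \Pr\!\left[ \forall h:\ e(h) \le \hat{e}(h) + \sqrt{\tfrac{\ln(1/p(h)) + \ln(1/\delta)}{2m}} \right] \ge 1 - \delta \]
```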

Scientific experimenters looking for a little extra flexibility in the scientific method may find these approaches useful. (And if they don’t, maybe there is another bound in this equivalence class that needs to be worked out.)

Workshop Proposal: Atomic Learning

This is a proposal for a workshop. It may or may not happen depending on the level of interest. If you are interested, feel free to indicate so (by email or comments).

Description:
Assume(*) that any system for solving large difficult learning problems must decompose into repeated use of basic elements (i.e. atoms). There are many basic questions which remain:

  1. What are the viable basic elements?
  2. What makes a basic element viable?
  3. What are the viable principles for the composition of these basic elements?
  4. What are the viable principles for learning in such systems?
  5. What problems can this approach handle?

Hal Daume adds:

  1. Can compositions of atoms be (semi-)automatically constructed?
  2. When atoms are constructed through reductions, is there some notion of the “naturalness” of the created learning problems?
  3. Other than Markov fields/graphical models/Bayes nets, is there a good language for representing atoms and their compositions?

The answers to these and related questions remain unclear to me. A workshop gives us a chance to pool what we have learned from some very different approaches to tackling this same basic goal.

(*) As a general principle, it’s very difficult to conceive of any system for solving any large problem which does not decompose.

Plan Sketch:

  1. A two day workshop with unhurried presentations and discussion seems appropriate, especially given the diversity of approaches.
  2. TTI-Chicago may be able to help with costs.

The above two points suggest having a workshop on a {Friday, Saturday} or {Saturday, Sunday} at TTI-Chicago.

NIPS Workshops

Attendance at the NIPS workshops is highly recommended for both research and learning. Unfortunately, there does not yet appear to be a public list of workshops. However, I found the following workshop webpages of interest:

  1. Machine Learning in Finance
  2. Learning to Rank
  3. Foundations of Active Learning
  4. Machine Learning Based Robotics in Unstructured Environments

There are many more workshops. In fact, there are so many that it is not plausible anyone can attend every workshop they are interested in. Maybe in future years the organizers can spread them out over more days to reduce overlap.

Many of these workshops are accepting presentation proposals (due mid-October).