September 2005 – Machine Learning (Theory)

9/30/2005

Research in conferences

Conferences exist as part of the process of doing research. They serve many roles, including “announcing research”, “meeting people”, and “point of reference”. Not all conferences are alike, so a basic question is: “to what extent do individual conferences attempt to aid research?” This question is very difficult to answer in any satisfying way. What we can do is compare details of the process across multiple conferences.

Comments. The average quality of comments across conferences can vary dramatically. At one extreme, the tradition in CS theory conferences is to provide essentially zero feedback. At the other extreme, some conferences have a strong tradition of providing detailed constructive feedback. Detailed feedback can give authors significant guidance about how to improve research. This is the most subjective entry.
Blind. Virtually all conferences offer single blind review, where authors do not know who the reviewers are. Some also provide double blind review, where reviewers do not know who the authors are. The intention with double blind reviewing is to make the conference more approachable to first-time authors.
Author Feedback. Author feedback is a mechanism by which authors can respond to reviewers (and, to some extent, complain). It gives an opportunity for the worst reviewing errors to be corrected.
Conditional Accepts. A conditional accept is some form of “we will accept this paper if conditions X, Y, and Z are met”. A conditional accept allows reviewers to demand different experiments or other details they need in order to make a decision. This might speed up research significantly because otherwise-good papers need not wait another year.
Reviews/PC member. How many papers can one person actually review well? When there is an incredible load of papers to review, it becomes very tempting to make snap decisions without a thorough attempt at understanding. Snap decisions are often wrong. The numbers below are estimated from the number of submissions, assuming the computer science standard of 3 reviews per paper.
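The load figure described above is simple arithmetic. Here is a minimal sketch, assuming the load is the total number of reviews needed divided by the number of PC members, which is one reading of the description. The function name and the example numbers are hypothetical and are not taken from any real conference.

```python
def reviews_per_pc_member(submissions: int, pc_members: int,
                          reviews_per_paper: int = 3) -> float:
    """Estimated review load: total reviews needed divided by the people writing them."""
    if pc_members <= 0:
        raise ValueError("pc_members must be positive")
    return reviews_per_paper * submissions / pc_members


# Hypothetical numbers, for illustration only.
load = reviews_per_pc_member(submissions=400, pc_members=50)
print(f"{load:.1f} reviews per PC member")
```

Under this reading, the load can be lowered either by recruiting more reviewers or by assigning fewer reviews per paper. The second option trades away review quality.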

Each of these “options” makes reviewing more difficult by requiring more reviewer work. There is a basic trade-off between the amount of time spent reviewing, the time spent working on new research, and the speed of the review process itself. It is unclear where the optimal trade-off point lies, but the easy default is “not enough time spent reviewing” because reviewing is generally an unrewarding job.

It seems reasonable to cross reference these options with some measures of ‘conference impact’. For each of these, it’s important to realize these are not goal metrics and so their meaning is unclear. The best that can be said is that it is not bad to do well. Also keep in mind that measurements of “impact” are inherently “trailing indicators” which are not necessarily relevant to the way the conference is currently run.

Average citations. Citeseer has been used to estimate the average impact of a conference’s papers here, using the average number of citations per paper.
Max citations. A number of people believe that the maximum number of citations given to any one paper is a strong indicator of the success of the conference. This can be measured by going to scholar.google.com and using ‘advanced search’ for the conference name.

Conference	Comments	blindness	author feedback	conditional accepts	Reviews/PC member	log(average citations per paper+1)	max citations
ICML	Sometimes Helpful	Double	Yes	Yes	8	2.12	1079
AAAI	Sometimes Helpful	Double	Yes	No	8	1.87	650
COLT	Sometimes Helpful	Single	No	No	15?	1.49	710
NIPS	Sometimes Helpful/Sometimes False	Single	Yes	No	113(*)	1.06	891
CCC	Sometimes Helpful	Single	No	No	24	1.25	142
STOC	Not Helpful	Single	No	No	41	1.69	611
SODA	Not Helpful	Single	No	No	56	1.51	175

(*) To some extent this is a labeling problem. NIPS has an organized process of finding reviewers very similar to ICML. They are simply not called PC members.

Keep in mind that the above is a very incomplete list (it only includes the conferences that I interacted with) and feel free to add details in the comments.

9/26/2005, 9/27/2005

Prediction Bounds as the Mathematics of Science

“Science” has many meanings, but one common meaning is “the scientific method” which is a principled method for investigating the world using the following steps:

1. Form a hypothesis about the world.
2. Use the hypothesis to make predictions.
3. Run experiments to confirm or disprove the predictions.

The ordering of these steps is very important to the scientific method. In particular, predictions must be made before experiments are run.

Given that we all believe in the scientific method of investigation, it may be surprising to learn that cheating is very common. This happens for many reasons, some innocent and some not.

Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine their effect. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here, including “reprobleming”, “data set selection”, and probably “overfitting by review”. It isn’t too surprising to observe this: when the testers of a drug have $10⁹ or more riding on the outcome, the temptation to make the outcome “right” is extreme.
Wrong experiments. When conducting experiments on some new phenomenon, it is common for the experimental apparatus to simply not work right. In that setting, throwing out the “bad data” can make the results much cleaner… or it can simply be cheating. Millikan did this in the ‘oil drop’ experiment which measured the electron charge.

Done right, allowing some kinds of “cheating” may be helpful to the progress of science since we can more quickly find the truth about the world. Done wrong, it results in modern nightmares like painkillers that cause heart attacks. (Of course, the more common outcome is that the drug’s effectiveness is just overstated.)

A basic question is “How do you do it right?” And a basic answer is “With prediction theory bounds”. These prediction bounds have a number of things in common:

They assume that the data is drawn independently and identically (IID). This is well suited to experimental situations where experimenters work very hard to make different experiments independent. In fact, this is a better fit than typical machine learning applications, where independence of the data is typically more questionable or simply false.
They make no assumption about the distribution that the data is drawn from. This is important for experimental testing of predictions because the distribution that observations are expected to come from is a part of the theory under test.

The two properties above form an ‘equivalence class’ over different mathematical bounds, where each bound can be trusted to an equivalent degree. Inside this equivalence class there are several bounds that may be helpful in determining whether deviations from the scientific method are reasonable or not.

The most basic test set bound corresponds to the scientific method above.
The Occam’s Razor bound allows a careful reordering of steps (1), (2), and (3). More “interesting” bounds like the VC-bound and the PAC-Bayes bound allow more radical alterations of these steps. Several are discussed here.
The Sample Compression bound allows careful disposal of some datapoints.
Progressive Validation bounds (such as here, here or here) allow hypotheses to be safely reformulated in arbitrary ways as experiments progress.
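As an illustration of the first item in the list above, here is a minimal sketch of a test set bound. It assumes the bound is taken in its binomial tail inversion form: with probability at least 1 − delta over m IID test examples, the true error rate is no larger than the largest p for which observing that few errors still has probability at least delta. The function names and the example numbers are hypothetical, and the bisection tolerance is an arbitrary choice.

```python
from math import exp, lgamma, log


def binomial_cdf(k: int, m: int, p: float) -> float:
    """Pr[Bin(m, p) <= k], summed in log space so that large m does not overflow."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= m else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (lgamma(m + 1) - lgamma(i + 1) - lgamma(m - i + 1)
                    + i * log(p) + (m - i) * log(1.0 - p))
        total += exp(log_term)
    return min(1.0, total)


def upper_bound_from_test_set(errors: int, m: int, delta: float,
                              tol: float = 1e-9) -> float:
    """Largest true error rate p under which seeing at most `errors` mistakes
    in m IID test examples still has probability >= delta."""
    if not (0 <= errors <= m) or not (0.0 < delta < 1.0):
        raise ValueError("need 0 <= errors <= m and 0 < delta < 1")
    if errors == m:
        return 1.0
    lo, hi = 0.0, 1.0  # the CDF is 1 at p = 0, 0 at p = 1, and decreasing in p
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binomial_cdf(errors, m, mid) >= delta:
            lo = mid
        else:
            hi = mid
    return hi  # return the upper end so the bound errs on the safe side


# Hypothetical experiment: 5 wrong predictions out of 100, at 95% confidence.
print(upper_bound_from_test_set(errors=5, m=100, delta=0.05))
```

The calculation uses only the number of trials and the number of failed predictions. It makes no assumption about the distribution of the observations, which matches the second shared property described earlier.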

Scientific experimenters looking for a little extra flexibility in the scientific method may find these approaches useful. (And if they don’t, maybe there is another bound in this equivalence class that needs to be worked out.)
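To make the extra flexibility concrete, here is a minimal sketch of the Occam’s Razor bound in its commonly stated Hoeffding-style relaxation (an assumption; the exact form uses a binomial tail instead). It assumes a “prior” weight P(h) is committed to for every candidate hypothesis before any data is seen. After that, the bound holds for all hypotheses simultaneously, so the hypothesis can be picked after looking at the data. The hypothesis names, prior weights, and error rates are hypothetical.

```python
from math import log, sqrt


def occam_razor_bound(empirical_error: float, m: int,
                      prior: float, delta: float) -> float:
    """Relaxed Occam's Razor bound on the true error rate of one hypothesis.

    With probability >= 1 - delta over m IID examples, for every hypothesis h:
        true_error(h) <= empirical_error(h)
                         + sqrt((ln(1/P(h)) + ln(1/delta)) / (2m))
    """
    if m <= 0 or not (0.0 < prior <= 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("need m > 0, 0 < prior <= 1 and 0 < delta < 1")
    slack = sqrt((log(1.0 / prior) + log(1.0 / delta)) / (2.0 * m))
    return min(1.0, empirical_error + slack)


# Hypothetical prior over three candidate hypotheses, written down BEFORE the experiment.
prior = {"simple": 0.5, "medium": 0.3, "complex": 0.2}
# Hypothetical error rates measured afterwards on m = 1000 IID examples.
observed = {"simple": 0.12, "medium": 0.10, "complex": 0.09}

for name in prior:
    bound = occam_razor_bound(observed[name], 1000, prior[name], 0.05)
    print(name, round(bound, 4))
```

The price of choosing after the fact is the ln(1/P(h)) term: hypotheses given little prior weight need more data to reach the same guarantee.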

9/20/2005

Workshop Proposal: Atomic Learning

This is a proposal for a workshop. It may or may not happen depending on the level of interest. If you are interested, feel free to indicate so (by email or comments).

Description:
Assume(*) that any system for solving large difficult learning problems must decompose into repeated use of basic elements (i.e. atoms). There are many basic questions which remain:

What are the viable basic elements?
What makes a basic element viable?
What are the viable principles for the composition of these basic elements?
What are the viable principles for learning in such systems?
What problems can this approach handle?

Hal Daume adds:

Can compositions of atoms be (semi-)automatically constructed?
When atoms are constructed through reductions, is there some notion of the “naturalness” of the created learning problems?
Other than Markov fields/graphical models/Bayes nets, is there a good language for representing atoms and their compositions?

The answers to these and related questions remain unclear to me. A workshop gives us a chance to pool what we have learned from some very different approaches to tackling this same basic goal.

(*) As a general principle, it’s very difficult to conceive of any system for solving any large problem which does not decompose.

Plan Sketch:

A two-day workshop with unhurried presentations and discussion seems appropriate, especially given the diversity of approaches.
TTI-Chicago may be able to help with costs.

The above two points suggest having a workshop on a {Friday, Saturday} or {Saturday, Sunday} at TTI-Chicago.

9/19/2005

NIPS Workshops

Attendance at the NIPS workshops is highly recommended for both research and learning. Unfortunately, there does not yet appear to be a public list of workshops. However, I found the following workshop webpages of interest:

There are many more workshops. In fact, there are so many that it is not plausible anyone can attend every workshop they are interested in. Maybe in future years the organizers can spread them out over more days to reduce overlap.

Many of these workshops are accepting presentation proposals (due mid-October).

9/14/2005, 9/19/2005

The Predictionist Viewpoint

Virtually every discipline of significant human endeavor has a way of explaining itself as fundamental and important. In all the cases I know of, they are both right (they are vital) and wrong (they are not solely vital).

Politics. This is the one that everyone is familiar with at the moment. “What could be more important than the process of making decisions?”
Science and Technology. This is the one that we-the-academics are familiar with. “The loss of modern science and technology would be catastrophic.”
Military. “Without the military, a nation will be invaded and destroyed.”
(insert your favorite here)

Within science and technology, the same thing happens again.

Mathematics. “What could be more important than a precise language for establishing truths?”
Physics. “Nothing is more fundamental than the laws which govern the universe. Understanding them is the key to understanding everything else.”
Biology. “Without life, we wouldn’t be here, so clearly the study of life is fundamental.”
Computer Science. “Everything is a computer. Controlling computation is fundamental to controlling the world.”

This post is a “me too” for machine learning. The basic claim is that all problems can be rephrased as prediction problems. In particular, for any agent (human or machine), there are things which are sensed and the goal is to make good predictions about which actions to take. Here are some examples:

Soccer. Playing soccer with Peter Stone is interesting because he sometimes reacts to a pass before it is made. The ability to predict what will happen in the future is a huge edge in games.
Defensive Driving is misnamed. It’s really predictive driving. You, as a driver, attempt to predict how the other cars around you can mess up, and take that into account in your own driving style.
Predicting well can make you very wealthy by playing the stock market. Some companies have been formed around the idea of automated stock picking, with partial success. More generally, the idea of prediction as the essential ingredient is very common when gambling with stocks.
Information markets generalize the notion of stock picking to make predictions about arbitrary facts.

Prediction problems are prevalent throughout our lives, so studying these problems and their solutions, which is a core goal of machine learning, is essential. From the predictionist viewpoint, it is not about what you know, what you can prove or infer, who your friends are, or how much wealth you have. Instead, it’s about how well you can predict (and act on predictions of) the future.