The Role of Workshops

A good workshop is often far more interesting than the papers at the conference it accompanies, because a workshop has a much tighter focus. Since you choose the workshops that fit your interests, the increased relevance greatly enhances your attention. Roughly speaking, a workshop program consists of elements related to a subject of your interest, while the main conference program consists of elements related to someone’s interest (which is rarely your own). Workshops are more about doing research, while conferences are more about presenting research.

Several conferences have associated workshop programs, some with deadlines due shortly.

ICML workshops: due April 1
IJCAI workshops: deadlines vary
KDD workshops: not yet finalized

Anyone going to these conferences should examine the workshops and see if any are of interest. (If none are, then maybe you should organize one next year.)

Research Styles in Machine Learning

Machine Learning is a field with an impressively diverse set of research styles. Understanding this may be important in appreciating what you see at a conference.

  1. Engineering. How can I solve this problem? People in the engineering research style try to solve hard problems directly by any means available and then describe how they did it. This is typical of problem-specific conferences and communities.
  2. Scientific. What are the principles for solving learning problems? People in this research style test techniques on many different problems. This is fairly common at ICML and NIPS.
  3. Mathematical. How can the learning problem be mathematically understood? People in this research style prove theorems with implications for learning, but often do not implement (or test) algorithms. COLT is a typical conference for this style.

Many people manage to cross these styles, and that is often beneficial.

Whenever we list a set of alternatives, it becomes natural to ask “which is best?” In the case of learning, each of these styles is useful and can lead to new and useful discoveries. I sometimes see failures to appreciate the other approaches, which is a shame.

Binomial Weighting

Suppose we have a set of classifiers c making binary predictions from an input x, and we see examples in an online fashion. In particular, we repeatedly see an unlabeled example x, make a prediction y’ (possibly based on the classifiers c), and then see the correct label y.

When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with every previous example. This is called the Halving algorithm. It makes at most log2 |c| mistakes, since on any mistake at least half of the remaining classifiers are eliminated.
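To make this concrete, here is a minimal Python sketch of the Halving algorithm under the online protocol above (representing classifiers as callables returning 0/1 predictions is an assumption of the sketch, not part of any standard API):

```python
def halving(classifiers, stream):
    """Minimal sketch of the Halving algorithm.

    classifiers: a list of callables mapping an input x to a 0/1 prediction,
        assumed to contain at least one perfect classifier.
    stream: an iterable of (x, y) pairs revealed one example at a time.
    Yields the algorithm's prediction for each x.
    """
    consistent = list(classifiers)  # classifiers consistent with all examples so far
    for x, y in stream:
        votes = [c(x) for c in consistent]
        yield int(2 * sum(votes) >= len(votes))  # majority vote
        # Drop every classifier that got this example wrong. On a mistake
        # the majority was wrong, so the consistent set at least halves,
        # giving at most log2(len(classifiers)) mistakes overall.
        consistent = [c for c, v in zip(consistent, votes) if v == y]
```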

Obviously, we can’t generally hope that there exists a classifier which never errs. The Binomial Weighting algorithm is an elegant technique allowing a variant of the Halving algorithm to cope with errors: for every classifier, it creates a set of virtual classifiers which occasionally disagree with the original. The Halving algorithm on this set of virtual classifiers satisfies a theorem of the form:

errors of the Binomial Weighting algorithm ≤ min_c f(number of errors of c, number of classifiers)
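To make the virtual-classifier construction concrete, here is a minimal sketch, assuming a known horizon T and a mistake budget k (the fixed horizon and the function name are simplifications for illustration, not the algorithm as published):

```python
from itertools import combinations

def virtual_variants(base_predictions, k):
    """All prediction sequences disagreeing with a classifier on at most k rounds.

    base_predictions: one classifier's 0/1 predictions over a known horizon T.
    """
    T = len(base_predictions)
    variants = []
    for flips in range(k + 1):
        for rounds in combinations(range(T), flips):
            v = list(base_predictions)
            for t in rounds:
                v[t] = 1 - v[t]  # disagree with the original on round t
            variants.append(tuple(v))
    return variants
```

Each classifier contributes sum_{i ≤ k} C(T, i) virtual classifiers, the binomial coefficients from which the weighting takes its name; running Halving over this expanded set tolerates up to k errors per original classifier.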

The Binomial Weighting algorithm takes as a parameter an upper bound on the minimal number of mistakes made by any classifier. By introducing a “prior” over the number of mistakes, it can be made parameter free. Similarly, introducing a “prior” over the set of classifiers is easy and makes the algorithm flexible enough for common use.

However, there is a problem: the minimal value of f() is 2 times the number of errors of the best classifier, regardless of the number of classifiers. This is frustrating, because a parameter-free learning algorithm which takes an arbitrary “prior” and achieves good performance on an arbitrary (not even IID) sequence of examples would be compelling to implement and use, if only we had a good technique for removing the factor of 2. How can we do that?

See the Weighted Majority algorithm for an example of a similar algorithm which can remove the factor of 2 using randomization, at the expense of introducing a parameter. There are known techniques for eliminating this parameter, but they appear to be less tight (and therefore less practically useful) than introducing a “prior” over the number of errors.
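For comparison, here is a minimal sketch of the randomized Weighted Majority algorithm (the function name is mine; beta is the parameter mentioned above): every classifier that errs has its weight multiplied by beta, and the prediction comes from a classifier sampled in proportion to its current weight.

```python
import random

def randomized_weighted_majority(classifiers, stream, beta=0.5):
    """Minimal sketch of the randomized Weighted Majority algorithm.

    beta in (0, 1): the weight of every classifier that errs is
    multiplied by beta after each round.
    """
    weights = [1.0] * len(classifiers)
    for x, y in stream:
        # Sample a classifier in proportion to its current weight.
        chosen = random.choices(classifiers, weights=weights)[0]
        yield chosen(x)
        # Multiplicatively penalize every classifier that erred.
        weights = [w * beta if c(x) != y else w
                   for c, w in zip(classifiers, weights)]
```

The standard guarantee is about (k ln(1/beta) + ln N)/(1 − beta) expected mistakes when the best of N classifiers makes k mistakes; the coefficient on k approaches 1 as beta approaches 1, which is how randomization removes the factor of 2 at the price of tuning the parameter.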

Going all the Way, Sometimes

At many points in research, you face a choice: should I keep on improving some old piece of technology or should I do something new? For example:

  1. Should I refine bounds to make them tighter?
  2. Should I take some learning theory and turn it into a learning algorithm?
  3. Should I implement the learning algorithm?
  4. Should I test the learning algorithm widely?
  5. Should I release the algorithm as source code?
  6. Should I go see what problems people actually need to solve?

The universal temptation of people attracted to research is doing something new. That is sometimes the right decision, but often it is not. I’d like to discuss some reasons why not.

  1. Expertise Once expertise is developed on some subject, you are the right person to refine it.
  2. What is the real problem? Continually improving a piece of technology is a mechanism forcing you to confront this question. In many cases, this confrontation is uncomfortable because you discover that your method has fundamental flaws with respect to solving the real problem. Not going all the way means you never discover this, except the hard way—people lose interest in your work.
  3. Virtues of breadth When you go all the way, you gain breadth, with a deeper understanding about which problems are important and why. This can be invaluable in focusing your future research.
  4. More Tangible Accomplishment Going all the way means that you can point to your work and say to your peers, “I solved it”.

Going all the way is sometimes problematic in research. For example, a paper with a theory, an algorithm, and experimental results invites defeat-in-detail: a reviewer can disagree with any one of these components and eliminate it from consideration. Another issue is that academia doesn’t directly reward implementing and releasing algorithms. A third issue is that you will almost certainly discover topics of interest which don’t fit your home conference(s). It is also very difficult to publish a paper with the title “an incremental improvement on X (which makes it work great in practice)”.

Along with this advice, it is important to remember to fail fast, where appropriate. When you discover that an idea is not workable, quickly quitting it and moving on is a real virtue.