The Role of Workshops

A good workshop is often far more interesting than the papers at a conference, because a workshop has a much tighter focus. Since you choose the workshops that fit your interests, the increased relevance can greatly enhance your interest and attention. Roughly speaking, a workshop program consists of elements related to a subject of your interest, while the main conference program consists of elements related to someone’s interest (which is rarely your own). Workshops are more about doing research, while conferences are more about presenting research.

Several conferences have associated workshop programs, some with submission deadlines coming up shortly.

ICML workshops Due April 1
IJCAI workshops Deadlines Vary
KDD workshops Not yet finalized

Anyone going to these conferences should examine the workshops and see if any are of interest. (If none are, then maybe you should organize one next year.)

Active learning

Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework.

Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {h_w}:

h_w(x) = 1 if x > w, and 0 otherwise.

VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (called the realizable case), then in order to get a classifier with error rate at most e, it is enough to draw m = O(1/e) random labeled examples from P, and to return any classifier consistent with them. But suppose we instead draw m unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of 0’s followed by a sequence of 1’s, and the goal is to discover the point w at which the transition occurs. This can be accomplished with a simple binary search which asks for just log m labels. Thus active learning gives us an exponential improvement in the number of labels needed: by adaptively querying log m labels, we can automatically infer the rest of them.
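
To make the binary search concrete, here is a minimal sketch in Python, assuming a synthetic pool of points drawn uniformly from [0, 1] and a hidden threshold; the label-counting helper and the data generator are purely illustrative, not from any particular paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic pool: m unlabeled points on the line, with a hidden threshold w_star.
    m = 10_000
    w_star = 0.37
    xs = np.sort(rng.uniform(0.0, 1.0, size=m))   # unlabeled sample, sorted once
    hidden = (xs > w_star).astype(int)            # labels, revealed only on request

    queries = 0
    def label(i):
        """Reveal the label of the i-th point; each call costs one label."""
        global queries
        queries += 1
        return hidden[i]

    # Binary search for the 0 -> 1 transition, using about log2(m) labels.
    # (Assumes both labels occur in the pool, which holds here with high probability.)
    lo, hi = 0, m - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if label(mid) == 0:
            lo = mid
        else:
            hi = mid

    w_hat = (xs[lo] + xs[hi]) / 2                 # any threshold in this gap is consistent
    print(f"used {queries} of {m} labels; estimated threshold {w_hat:.3f}")

The point is simply that the number of label() calls grows like log m rather than m.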

Unfortunately, not all that much is known beyond this toy example. To date, the single main theoretical result in the field is [FSST97]’s analysis of the query-by-committee (QBC) learning algorithm. In their model, the learner observes a stream of unlabeled data and makes spot decisions about whether or not to ask for a point’s label. They show that if the data is drawn uniformly from the surface of the d-dimensional unit sphere, and the hidden labels correspond perfectly to a homogeneous (i.e., through the origin) linear separator from this same distribution, then it is possible to achieve generalization error e after seeing O(d/e) points and requesting just O(d log 1/e) labels: an exponential improvement over the usual O(d/e) sample complexity of learning linear separators in a supervised setting. This remarkable result is tempered somewhat by the complexity of the QBC algorithm, which involves computing volumes of intermediate version spaces.

Some recent progress on active learning: [DKM05] show how a simple variant of the perceptron update can be used to achieve these same bounds, in the same model. [D04] shows a variety of upper and lower bounds for active learning; for instance, if you allow non-homogeneous linear separators, then in the above model the label complexity necessarily shoots up to 1/e.
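
To give a rough feel for this line of work, here is a sketch in the same spirit as (but not identical to) the margin-based querying of [DKM05]: stream points from the unit sphere, ask for a label only when the point falls within a margin of the current hypothesis, and apply the modified perceptron update on mistakes. The margin schedule and all constants below are ad hoc choices for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    d = 10

    def sphere(n, d):
        """n points drawn uniformly from the surface of the d-dimensional unit sphere."""
        v = rng.standard_normal((n, d))
        return v / np.linalg.norm(v, axis=1, keepdims=True)

    w_star = sphere(1, d)[0]        # hidden homogeneous separator
    w = sphere(1, d)[0]             # current hypothesis
    margin, labels_used = 0.5, 0    # ad hoc initial querying threshold

    for x in sphere(20_000, d):
        if abs(w @ x) > margin:
            continue                                  # confident about x: skip it, pay nothing
        y = np.sign(w_star @ x)                       # query the label (costs one label)
        labels_used += 1
        if np.sign(w @ x) != y:
            w = w - 2 * (w @ x) * x                   # modified perceptron update (a reflection)
        margin = max(0.02, 0.995 * margin)            # slowly shrink the query region

    test = sphere(5_000, d)
    error = np.mean(np.sign(test @ w) != np.sign(test @ w_star))
    print(f"labels used: {labels_used}, test error: {error:.3f}")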

The theoretical terrain of active learning remains something of an unexplored wilderness. There has, however, been a lot of beautiful theory work (see [A02] for a roundup) on a related model in which the learner is allowed to synthesize query points, rather than simply choosing them from the pool of unlabeled data. This ran into some practical problems: [BL92] found that the resulting synthetic instances were often very difficult for a human to classify!

[A02] D. Angluin. Queries revisited.
[BL92] E. Baum and K. Lang. Query learning can work poorly when a human oracle is used.
[D04] S. Dasgupta. Analysis of a greedy active learning strategy.
[DKM05] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning.
[FSST97] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query-by-committee algorithm.

Research Styles in Machine Learning

Machine Learning is a field with an impressively diverse set of research styles. Understanding this may be important in appreciating what you see at a conference.

  1. Engineering. How can I solve this problem? People in the engineering research style try to solve hard problems directly by any means available and then describe how they did it. This is typical of problem-specific conferences and communities.
  2. Scientific. What are the principles for solving learning problems? People in this research style test techniques on many different problems. This is fairly common at ICML and NIPS.
  3. Mathematical. How can the learning problem be mathematically understood? People in this research style prove theorems with implications for learning, but often do not implement (or test) algorithms. COLT is a typical conference for this style.

Many people manage to cross these styles, and that is often beneficial.

Whenever we list a set of alternatives, it becomes natural to ask “which is best?” In the case of learning, it seems that each of these styles is useful and can lead to new, useful discoveries. I sometimes see failures to appreciate the other approaches, which is a shame.

Binomial Weighting

Suppose we have a set of classifiers c making binary predictions from an input x and we see examples in an online fashion. In particular, we repeatedly see an unlabeled example x, make a prediction y’ (possibly based on the classifiers c), and then see the correct label y.

When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with every previous example. This is called the Halving algorithm. It makes at most log2 |c| mistakes since on any mistake, at least half of the classifiers are eliminated.
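
As a concrete illustration, here is a minimal sketch of the Halving algorithm, using a made-up finite class of threshold classifiers and synthetic data; nothing about this particular class or data is essential.

    import numpy as np

    rng = np.random.default_rng(2)

    # A finite class of classifiers: thresholds on [0, 1].
    thresholds = np.linspace(0.0, 1.0, 1025)
    def predict(w, x):
        return int(x > w)                       # h_w(x) = 1 if x > w, else 0

    true_w = thresholds[322]                    # a perfect classifier is in the class (realizable case)
    alive = list(thresholds)                    # classifiers consistent with all examples so far
    mistakes = 0

    for _ in range(2_000):
        x = rng.uniform(0.0, 1.0)
        votes = sum(predict(w, x) for w in alive)
        y_hat = int(2 * votes > len(alive))     # majority vote over surviving classifiers
        y = predict(true_w, x)                  # correct label, revealed after predicting
        if y_hat != y:
            mistakes += 1                       # on any mistake, at least half of `alive` is eliminated
        alive = [w for w in alive if predict(w, x) == y]

    print(f"mistakes: {mistakes} (bound: log2|c| = {np.log2(len(thresholds)):.1f}), survivors: {len(alive)}")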

Obviously, we can’t generally hope that there exists a classifier which never errs. The Binomial Weighting algorithm is an elegant technique allowing a variant of the Halving algorithm to cope with errors by creating, for every classifier, a set of virtual classifiers which occasionally disagree with the original classifier. The Halving algorithm on this set of virtual classifiers satisfies a theorem of the form:

(errors of the Binomial Weighting algorithm) ≤ min_c f(number of errors of c, |c|)

The Binomial Weighting algorithm takes as a parameter an upper bound on the number of mistakes made by the best classifier. By introducing a “prior” over the number of mistakes, it can be made parameter free. Similarly, introducing a “prior” over the set of classifiers is easy and makes the algorithm sufficiently flexible for common use.

However, there is a problem: the minimal value of f() is 2 times the number of errors of the best classifier, regardless of the number of classifiers. This is frustrating, because a parameter-free learning algorithm which takes an arbitrary “prior” and achieves good performance on an arbitrary (not even IID) sequence of examples would be compelling to implement and use, if only we had a good technique for removing the factor of 2. How can we do that?

See the weighted majority algorithm for an example of a similar algorithm which can remove the factor of 2 using randomization, at the expense of introducing a parameter. There are known techniques for eliminating this parameter, but they appear to be less tight (and therefore less practically useful) than introducing a “prior” over the number of errors.
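
For reference, here is a rough sketch of the randomized weighted majority idea over a finite set of experts (the classifiers above), on synthetic data; the multiplier beta is exactly the kind of parameter being discussed, and the experts’ error rates are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(3)

    n, T = 50, 5000
    beta = 0.7                                  # the parameter in question: weight multiplier in (0, 1)
    weights = np.ones(n)

    # Synthetic experts: each errs independently at its own rate; the best errs 5% of the time.
    err_rates = rng.uniform(0.05, 0.5, size=n)
    err_rates[0] = 0.05

    alg_mistakes, expert_mistakes = 0, np.zeros(n)
    for t in range(T):
        y = int(rng.random() < 0.5)                                     # true label
        preds = np.where(rng.random(n) < err_rates, 1 - y, y)           # experts' predictions
        chosen = rng.choice(n, p=weights / weights.sum())               # follow a randomly drawn expert
        if preds[chosen] != y:
            alg_mistakes += 1
        wrong = preds != y
        expert_mistakes += wrong
        weights[wrong] *= beta                                          # multiplicatively penalize errors

    print(f"algorithm mistakes: {alg_mistakes}, best expert mistakes: {int(expert_mistakes.min())}")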

Going all the Way, Sometimes

At many points in research, you face a choice: should I keep on improving some old piece of technology or should I do something new? For example:

  1. Should I refine bounds to make them tighter?
  2. Should I take some learning theory and turn it into a learning algorithm?
  3. Should I implement the learning algorithm?
  4. Should I test the learning algorithm widely?
  5. Should I release the algorithm as source code?
  6. Should I go see what problems people actually need to solve?

The universal temptation of people attracted to research is doing something new. That is sometimes the right decision, but often it is not. I’d like to discuss some reasons why not.

  1. Expertise Once expertise is developed on some subject, you are the right person to refine it.
  2. What is the real problem? Continually improving a piece of technology is a mechanism forcing you to confront this question. In many cases, this confrontation is uncomfortable because you discover that your method has fundamental flaws with respect to solving the real problem. Not going all the way means you never discover this, except the hard way—people lose interest in your work.
  3. Virtues of breadth When you go all the way, you gain breadth, with a deeper understanding about which problems are important and why. This can be invaluable in focusing your future research.
  4. More Tangible Accomplishment Going all the way means that you can point to your work and say to your peers, “I solved it”.

Going all the way is sometimes problematic in research. For example, a paper with a theory, an algorithm, and experimental results invites defeat-in-detail: a reviewer can disagree with any one of these components and eliminate it from consideration. Another issue is that academia doesn’t directly reward implementing and releasing algorithms. A third issue is that you will almost certainly discover topics of interest which don’t fit your home conference(s). It is also very difficult to publish a paper with the title “an incremental improvement on X (which makes it work great in practice)”.

Along with this advice, it is important to remember to fail fast, where appropriate. When you discover that an idea is not workable, quickly quitting it and moving on is a real virtue.