A question of quantification

This is about how to phrase and think about the scope of some theorems in learning theory. The basic claim is that there are several different ways of quantifying the scope which sound different yet are essentially the same (schematic forms are sketched after the list below).

  1. For all sequences of examples. This is the standard quantification in online learning analysis. Standard theorems would say something like “for all sequences of predictions by experts, the algorithm A will perform almost as well as the best expert.”
  2. For all training sets. This is the standard quantification for boosting analyses such as AdaBoost or multiclass boosting.
    Standard theorems have the form “for all training sets the error rate inequalities … hold”.
  3. For all distributions over examples. This is the one that we have been using for reductions analysis. Standard theorem statements have the form “For all distributions over examples, the error rate inequalities … hold”.
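
To make the contrast concrete, here are schematic LaTeX versions of the three forms. The right-hand sides are illustrative placeholders (an experts-style regret bound, the AdaBoost training error bound, and a generic reduction inequality with constant c), not quotations of any specific theorem.

    % Schematic theorem forms; the bounds are illustrative placeholders.
    \begin{align*}
    \text{(1)}\quad & \forall\, (x_1,y_1),\dots,(x_T,y_T):\;
      L_T(A) \le \min_i L_T(\mathrm{expert}_i) + O\!\bigl(\sqrt{T \log N}\bigr) \\
    \text{(2)}\quad & \forall\, \text{training sets } S:\;
      \widehat{\mathrm{err}}_S(H) \le \prod_t 2\sqrt{\epsilon_t (1-\epsilon_t)} \\
    \text{(3)}\quad & \forall\, \text{distributions } D:\;
      \mathrm{err}_D(\text{reduction}) \le c \cdot \mathrm{err}_{D'}(\text{base predictor})
    \end{align*}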

It is not quite true that each of these is equivalent. For example, in the online learning setting, quantifying “for all sequences of examples” implies “for all distributions over examples”, but not vice-versa.

However, in the context of either boosting or reductions these are equivalent because the algorithms operate in an element-wise fashion. To see the equivalence, note that:

  1. “For any training set” is equivalent to “For any sequence of examples” because a training set is a sequence and vice versa.
  2. “For any sequence of examples” is equivalent to “For any distribution over examples” when the theorems are about unconditional example transformations (see the sketch after this list) because:
    1. The uniform distribution over a sufficiently long sequence of examples can approximate any distribution we care about arbitrarily well.
    2. If the theorem holds “for all distributions”, it holds for the uniform distribution over the elements in any sequence of examples.
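
A minimal sketch of the two directions in LaTeX, using the empirical (uniform) distribution of a sequence; the notation \hat{D}_n is mine, not taken from any particular paper. A “for all distributions” theorem applied to \hat{D}_n is a statement about the sequence itself, while a suitably chosen, sufficiently long sequence makes \hat{D}_n approximate any target distribution D arbitrarily well, so a “for all sequences” theorem transfers to D up to an arbitrarily small approximation.

    % Empirical distribution of a sequence, and the approximation claim.
    \[
    \hat{D}_n(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{x_i = x\},
    \qquad
    \sup_x \bigl|\hat{D}_n(x) - D(x)\bigr| \;\le\; \epsilon
    \;\text{ for a suitably chosen, sufficiently long sequence.}
    \]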

The natural debate here is “how should the theorems be quantified?” It is difficult to settle this debate on mathematical grounds because we just showed an equivalence. It is nevertheless important because it strongly influences how we think about algorithms and how easy it is to integrate knowledge across different theories. Here are the arguments I know.

  1. For all sequences of examples.
    1. Learning theory people (at least) are used to thinking about “For all sequences of examples”.
    2. (Applied) Machine learning people are not so familiar with this form of quantification.
    3. When the algorithm is example-conditional such as in online learning, the quantification is more general than “for all distributions”.
  2. For all training sets.
    1. This is very simple.
    2. It is misleadingly simple. For example, a version of the AdaBoost theorem also applies to test sets using the test error rates of the base classifiers. It is fairly common for this to be misunderstood.
  3. For all distributions over examples.
    1. Distributions over examples is simply how most people think about learning problems.
    2. “For all distributions over examples” is easily and often confused with “For all distributions over examples accessed by IID draws”. It seems most common to encounter this confusion amongst learning theory folks.

What quantification should be used and why?
(My thanks to Yishay Mansour for clarifying the debate.)

The Design of an Optimal Research Environment

How do you create an optimal environment for research? Here are some essential ingredients that I see.

  1. Stability. University-based research is relatively good at this. On any particular day, researchers face choices in what they will work on. A very common tradeoff is between:
    1. easy small
    2. difficult big

    For researchers without stability, the ‘easy small’ option wins. This is often “ok”—a series of incremental improvements on the state of the art can add up to something very beneficial. However, it misses one of the big potentials of research: finding entirely new and better ways of doing things.

    Stability comes in many forms. The prototypical example is tenure at a university—a tenured professor is almost impossible to fire, which means that the professor has the freedom to consider far-horizon activities. An iron-clad guarantee of a paycheck is not necessary—industrial research labs have succeeded well with research positions of indefinite duration. AT&T Research was a great example of this before they lost their monopoly and had to fire the researchers.

    Fixed-duration appointments probably provide an inadequate amount of stability. The appropriate timescale to consider here is one year because there is a strong annual cycle for applications. A one-year appointment is probably too short, while a 10-year appointment may be effectively the same as arbitrary duration.

    One significant part of stability is the financial stability of the parent organization. The nature of research implies that what makes ‘funding sense’ depends on the size of the set of people who can benefit from it. Since ‘governments’ are the largest organizations around, they are good candidates, with the caveat that political will is not known for great stability.

  2. Free time. University-based research is relatively terrible about this while industrial labs vary widely.

    For professors at a university, teaching a subject well requires very significant time and energy. Running the university requires very significant time and energy. Writing grant proposals for funding requires very significant time and energy (success in grant writing is typically a necessity to achieve tenure at US research universities). Advising students well requires significant time and energy. Some of these activities are partially synergistic with research, but it seems unarguable that the multiplicity of time demands can greatly slow research.

    In industrial labs (think of “IBM Research” as an example), there is a tendency to shift research towards short-term rewards. This is partly because the research labs have not proved very capable of using the big successes and partly because even large businesses have ups and downs. During a ‘down’ time, it is very tempting to use the reserve of very capable people in a research lab for short-duration projects.

    One thing to understand about research is that a person with half as much free time does not produce half as much work. The problems encountered might have sufficient intricacy that they require a day or more of overhead to begin making progress.

  3. Problem exposure. The set of research problems which people could work on is much larger than the set of research problems which are plausibly useful to the rest of the world (*). The way to narrow down to the relevant problems is to give researchers a chance to encounter them.

    University-based research can have problems here because no really effective mechanism for doing this exists by default. In industrial labs, researchers often act as “superconsultants” helping to solve problems. This method is effective but perhaps relatively crude because, as often instantiated, it conflicts with “free time”.

    (*) There is always someone who says “but at the time number theory was researched, nobody had any practical use for it, and now it is very useful for crypto”. This is true, but there are several important caveats:

    1. The “hit rate” (in terms of big impact on everyone’s lives) for unmotivated research is much lower. I don’t have a good measurement of how much lower, but “significantly” seems both plausible and anecdotally correct.
    2. Number theory was not entirely unmotivated at the time. After all, it was clear that numbers were a very important abstraction, and so a better understanding of numbers was plausibly useful.
    3. Would the development of factoring algorithms, and our understanding of their hardness, have been much delayed without number theory? And would that delay have been worth an earlier development of calculus in exchange? These are at least hard questions to answer.

    There is a role for research of the “let’s do it because I am curious” nature, especially because interested people do good research. The size of that role must, necessarily, be relatively small.

  4. Benefit exposure. Almost all researchers are not well-exposed to the benefits of research. University-based research may do this best by giving researchers the ability to form their own companies with partial ownership. However, the overhead associated with “form your own company” is very substantial, creating a formidable barrier (some good news is that the barrier may be decreasing). This fails for many research advances which are obviously useful yet have no reasonable mechanism for capturing the profit. In the context of machine learning, spam filtering is of obvious and significant common use, yet a researcher with a better spam filter cannot easily create a company around the method. There are several reasons for this.
    1. Machine learning is a victim of its common success. It is hard to develop a learning algorithm which is substantially better than others. This means that anyone wanting to implement spam filtering can do so. Patents are useless here—you can’t patent an entire field (and even if you could, it wouldn’t work).
    2. Spam filtering is not easily modularizable. Most people want their email clients to filter spam, but they don’t want to switch email clients. Combined with the first observation, this makes it clear that a “spam filtering company” is going to run into trouble.

    The benefit structure in industrial research labs is (perhaps) often even worse. The primary mechanism is via bonuses, where 10% of base salary might be considered large. Obviously, keeping the job is of much greater importance. A system where the base salary is bumped by a small (variable) amount each year is common. The implication is that rewards are delayed (reducing incentives) compared to a bonus, and age discrimination becomes a problem.

    A relative bright spot here is Google. Much of what they work on is not research-as-we-know-it, but some of it is, and there seem to be significant mechanisms for rewarding success, even after the IPO.

  5. Devolution of power. There is a traditional notion of how things are done which says “everyone must have a manager to tell them what to do”. This works badly in research for several reasons.
    1. People are simply not very good at doing research on a topic which does not interest them. The problem is hard enough that doing it in a disinterested or even not-fully-interested way greatly reduces the chance of success.
    2. Another difficulty here is judgement. People make mistakes, including even managers. If a manager has a good idea about what is an interesting topic of research, then the power to say “work on this” is mildly beneficial over simple persuasion. On the other hand, a mistake is greatly harmful.

    The alternative to a command-and-control notion of research is a persuasion-based system. In a persuasion-based system, people are free to either work with someone or not, as they judge. In a persuasion-based system, no compelling idea means that researchers try different problems, which seems appropriate. And, when something promising comes along, more people can be added as demand and persuasiveness allow. University-style research partially achieves this (a notable complaint of students is that they do not have enough freedom). In industrial labs, this varies widely.

    It is worth noting that devolution of power has occurred in all of the following surprisingly effective organizations/endeavors:

    1. Linux kernel development (and more generally open source software). Only a decade ago, most economists would have considered the present success of open source software development implausible. Anyone who can code can work on the Linux kernel if they want.
    2. Wikipedia. Traditional encyclopedias complain that Wikipedia articles can be false since anyone can submit, but this argument vastly underestimates the value of having a comprehensive and up-to-date information resource.
    3. Google made it so anyone could advertise in a simple and easy way. This simultaneously lowered the barrier to advertising and sanitized the process. We sometimes forget that Google came long after other internet search engines, so its success was far from assured.
  6. Student exposure. By “student” here, I mean “research student”. Research students are of course much more common on university campuses, although government-funded research labs such as MPI in Germany and NICTA in Australia sometimes have them as well.

    Any reasonable system involving people working together must take into account how new people are brought into the system. The appropriate point seems to be where training in research starts (typically, ‘phd students’). Aside from issues of a self-sustaining system, students can add something significant to the process of research. The effort of explaining a new topic of research often aids research via simplification and consideration in new contexts. Students of course can also contribute substantially to the basic mission of research because they inherently bring in new ideas.

  7. Concentration. There are several kinds of concentration, and each of them is important. The first kind is the ‘quality’ kind: there are better and worse researchers. Unfortunately, this is extraordinarily difficult to judge, except after the fact. Another kind is concentration of interests. Having 5 people interested in the same subject working on their own is often substantially less useful than having the 5 work together. The internet has greatly aided concentration because some significant benefit can be derived even from people who are not physically near each other. Universities have traditionally had difficulty with this because the demands of teaching undergraduates imply a necessity for breadth of subjects, so it is common for a university not to have two people working on a similar subject.

It is interesting that no place manages to capture them all. If all of these elements were to come together, the result might be very impressive.

MLSS 2006

There will be two machine learning summer schools in 2006.

One is in Canberra, Australia from February 6 to February 17 (Aussie summer). The webpage is fully ‘live’ so you should actively consider it now.

The other is in Taipei, Taiwan from July 24 to August 4. This one is still in the planning phase, but that should be settled soon.

Attending an MLSS is probably the quickest and easiest way to bootstrap yourself into a reasonable initial understanding of the field of machine learning.

The Everything Ensemble Edge

Rich Caruana, Alexandru Niculescu, Geoff Crew, and Alex Ksikes have done a lot of empirical testing which shows that using all methods to make a prediction is more powerful than using any single method. This is in rough agreement with the Bayesian way of solving problems, but based upon a different (essentially empirical) motivation. A rough summary is:

  1. Take all of {decision trees, boosted decision trees, bagged decision trees, boosted decision stumps, K nearest neighbors, neural networks, SVM} with all reasonable parameter settings.
  2. Run the methods on each of 8 problems with a large test set, calibrating margins using either sigmoid fitting or isotonic regression (a calibration sketch follows this list).
  3. For each loss in {accuracy, area under the ROC curve, cross entropy, squared error, etc.} evaluate the average performance of each method.
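
As a hedged illustration of the calibration step, here is a minimal Python sketch. The synthetic data, the GradientBoostingClassifier standing in for boosted trees, and the split sizes are placeholders of mine rather than details from the paper; only the two calibration maps (Platt-style sigmoid fitting and isotonic regression) correspond to what is described above.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.isotonic import IsotonicRegression
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for one of the benchmark problems.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 20))
    y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

    # Hold out a calibration split separate from training.
    X_tr, X_cal, y_tr, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)

    # Stand-in for boosted decision trees; any margin-producing classifier works.
    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    margins = model.decision_function(X_cal)  # raw, uncalibrated scores

    # Platt-style sigmoid fitting: a 1-D logistic regression from margin to probability.
    platt = LogisticRegression().fit(margins.reshape(-1, 1), y_cal)
    p_sigmoid = platt.predict_proba(margins.reshape(-1, 1))[:, 1]

    # Isotonic regression: a monotone, piecewise-constant map from margin to probability.
    iso = IsotonicRegression(out_of_bounds="clip").fit(margins, y_cal)
    p_isotonic = iso.predict(margins)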

A series of conclusions can be drawn from the observations.

  1. (Calibrated) boosted decision trees appear to perform best in general, although support vector machines and neural networks give credible near-best performance.
  2. The metalearning algorithm which simply chooses the best method (based upon a small validation set) performs much better than committing to any single method in advance.
  3. A metalearning algorithm which combines the predictors in an ensemble using stepwise refinement of validation set performance appears to perform even better (a sketch of such a selection procedure follows this list).
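
Here is a rough Python sketch of what such a stepwise procedure can look like: greedy forward selection that repeatedly adds (with replacement) whichever model's validation predictions most improve the averaged ensemble. This is my reading of the general idea rather than the paper's exact procedure; val_preds (a dictionary from model name to validation-set predictions), the AUC metric, and the number of rounds are all assumptions.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def ensemble_selection(val_preds, val_labels, rounds=50):
        """Greedily add (with replacement) the model whose predictions most
        improve the averaged ensemble's validation AUC."""
        chosen = []
        running_sum = np.zeros(len(val_labels), dtype=float)
        for _ in range(rounds):
            best_name, best_score = None, -np.inf
            for name, preds in val_preds.items():
                candidate = (running_sum + preds) / (len(chosen) + 1)
                score = roc_auc_score(val_labels, candidate)
                if score > best_score:
                    best_name, best_score = name, score
            chosen.append(best_name)
            running_sum += val_preds[best_name]
        return chosen  # the final ensemble averages the chosen models' predictions

    # Hypothetical usage, with made-up validation predictions p1, p2, p3:
    # selected = ensemble_selection({"boosted_trees": p1, "svm": p2, "knn": p3}, y_val)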

There are a number of caveats to this work: it was only applied to large datasets, there is no guarantee that the datasets are representative of your problem (although efforts were made to be representative in general), and the size of the training set was fixed rather than using the natural size given by the problem. Despite all these caveats, the story told above seems compelling: if you want maximum performance, you must try many methods and somehow combine them.

The most significant drawback of this method is computational complexity. Techniques for reducing the computational complexity are therefore of significant interest. It seems plausible that there exists some learning algorithm which typically performs well whenever any of the above algorithms can perform well, at a computational cost which is significantly less than “run all algorithms on all settings and test”.

A fundamental unanswered question here is “why?”, in several forms. Why have the best efforts of many machine learning algorithm designers failed to capture all the potential predictive strength in a single coherent learning algorithm? Why do ensembles give such a significant, consistent edge in practice? A great many papers follow the scheme: invent a new way to create ensembles, test, observe that it improves prediction performance at the cost of more computation, and publish. There are several pieces of theory that explain individual ensemble methods, but we seem to have no convincing theoretical statement explaining why they almost always work.

Prediction Competitions

There are two prediction competitions currently in the air.

  1. The Performance Prediction Challenge by Isabelle Guyon. Good entries minimize a weighted 0/1 loss + the difference between a prediction of this loss and the observed truth on 5 datasets. Isabelle tells me all of the problems are “real world” and the test datasets are large enough (17K minimum) that the winner should be well determined by ability rather than luck. This is due March 1.
  2. The Predictive Uncertainty Challenge by Gavin Cawley. Good entries minimize log loss on real-valued output variables for one synthetic and 3 “real” datasets related to atmospheric prediction. The use of log loss (which can be infinite, and hence estimates of it need not converge) and smaller test sets of 1K to 7K examples makes the winner of this contest more luck-dependent (a small illustration follows this list). Nevertheless, the contest may be of some interest, particularly to the branch of learning (typically Bayes learning) which prefers to optimize log loss.
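
A small Python illustration of the log loss point above: one fully confident wrong prediction makes the average log loss infinite, regardless of how large the test set is. The numbers here are made up.

    import numpy as np

    def average_log_loss(y_true, p):
        # Average binary log loss; np.log(0) yields -inf, so a confident mistake
        # makes the average infinite (a runtime warning is expected in that case).
        y_true = np.asarray(y_true, dtype=float)
        p = np.asarray(p, dtype=float)
        return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

    y = np.array([1, 1, 0, 1])
    print(average_log_loss(y, [0.9, 0.8, 0.2, 0.7]))  # modest, finite loss
    print(average_log_loss(y, [0.9, 0.8, 0.2, 0.0]))  # one confident mistake -> inf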

May the best predictor win.