Machine Learning (Theory)

3/30/2005

What can Type Theory teach us about Machine Learning?

Tags: General, Language DrewBagnell@ 8:51 am

This post is some combination of belaboring the obvious and speculating wildly about the future. The basic issue to be addressed is how to think about machine learning in terms given to us from Programming Language theory.

Types and Reductions

John’s research programme (I feel this should be in British spelling to reflect the grandiousness of the idea…) of machine learning reductions StateOfReduction is at some essential level type-theoretic in nature. The fundamental elements are the classifier, a function f: alpha -> beta, and the corresponding classifier trainer g: List of (alpha,beta) -> (alpha -> beta). The research goal is to create *combinators* that produce new f’s and g’s given existing ones. John (probably quite rightly) seems unwilling at the moment to commit to any notion stronger than these combinators are correctly typed. One way to see the result of a reduction is something typed like: (For those denied the joy of the Hindly-Milner type system, “simple” is probably wildly wrong.)

g’: List of (gamma,delta) -> (List of (alpha,beta) -> (alpha -> beta)) -> (gamma -> delta)

Perhaps another is to think of the reduction itself type-theoretically as a combinator that takes something typed like g above and spits out a new classifier trainer. (I.e. a reduction “lifts” a lower-level function up to operate on some type it wasn’t originally designed for.)

Many of things John considers reductions have particularly error and sample complexity preserving properties. For instance, a reduction may have the property that small regret of a low level classifier implies small regret for the result of the reduction. There are a number of interesting issues here:

* Can we “type” these properties so a compiler might check them?
* Is this enough? Some reductions (including, I would argue, some that have been discussed seriously) are “trivial” reductions in the sense that the antecedent never holds. That is, the are combinators that setup learning problems impossible to solve. Can we somehow type this kind of thing so we can separate good from bad reductions, where we try to define bad as meaning something like “creates impossible subproblems where it was possible to do otherwise”?

RLBench hints at the power that formalizing interfaces for things like reductions can be. You can chain together powerful comands that leverage existing classifier trainers to learn in a trace-generating model which is adapted (again essentially a combinator library) from a generative model. Unfortunately, RLBench operates at the command line which is poorly designed for this kind of composition.

Types and Probabilistic Models

Arguably the graphical models/Bayesian community has just as grandiose plans as John, but here the reductions are to learning probabilities in the sense “beliefs”. Graphical models form a language that allows us to compose together little subgraphs that express our beliefs about some subsystem. Much research has gone into allowing the modeling language to drive inference as well. It’s interesting to explore what type theory can tell us about this as well.

Existing work

Lloyd Allison has taken thoughts like this quite seriously in his relatively recent work, “Types and Classes in Machine Learning and Data Mining” . Allison uses Haskell to type a number of ideas including
MML and some classifiers, and can check many things statically. Unfortunately, ML is one of those parts of information science where we demand every computational advantage we can get. If we learn to solve some problem with acceptable speed, in no time it is the inner loop of some more sophisticated machine learning algorithm. Despite my apppreciation (well, mostly) of Haskell, it simply isn’t practical to write ML algorithms in Haskell. (Although
presumambly both Allison and …. would disagree with me.)

It may still make a huge amount of sense to think about things in something like the Haskell type system and then translate them to the capable (but gross) type system of, say C++. Understanding polymorphism and type classes and there relation with Machine Learning may be a real fundamental breakthrough to making ML widely useful.

Contributions flowing towards PL theory

Right now, when push comes to shove, all good interfaces between systems basically amount to invoking functions or closures. When
you get over “object oriented” and other such hype this makes sense for structuring work. What’s interesting is that recent machine learning and AI work *can’t* be expressed that way. I think of graphical models as tools for expressing future interfaces because they preserve uncertainty across boundaries. This seems to me where ML people can challenge PL people– help us design a new conception of “interface” that preserves uncertainty between layers. (Say, that
passes probability or “belief messages” back and forth.) Perhaps the probabalistic machinery already exist: we can always define “sampling interfaces” between systems. My guess is that these interfaces are basically multi-directional (unlike functional interfaces). Why? Say I want a program to understand speech and I build a layered system that consists of “signaling processing”, “phoneme recognition”, “word recognition”, “language modeling”, and some form of “semantic understanding”. I can resolve something ambigious about, say a phoneme, by understanding the higher level language model. To get the phoneme parsing right I have to feedback the results from language layer. In this sense, interfaces need to preserve uncertainty and probably pass information both ways.

How to start (small)

I think a serious effort should be made to explain things like reductions in terms of PL theory– even if it means that, like Allison, you have to do some explaining type systems first. I’d love to spend some time hashing some of this out with PL people. (If only there were more hours in the day!)

We should write our libraries to have super-clean functional interfaces and make them as (parametrically) polymorphic as reasonable.

Any other thoughts on ways to proceed on this in the short term?

3/29/2005

Academic Mechanism Design

Tags: General jl@ 11:19 pm

From game theory, there is a notion of “mechanism design”: setting up the structure of the world so that participants have some incentive to do sane things (rather than obviously counterproductive things). Application of this principle to academic research may be fruitful.

What is misdesigned about academic research?

  1. The JMLG guides give many hints.
  2. The common nature of bad reviewing also suggests the system isn’t working optimally.
  3. There are many ways to experimentally “cheat” in machine learning.
  4. Funding Prisoner’s Delimma. Good researchers often write grant proposals for funding rather than doing research. Since the pool of grant money is finite, this means that grant proposals are often rejected, implying that more must be written. This is essentially a “prisoner’s delimma”: anyone not writing grant proposals loses, but the entire process of doing research is slowed by distraction. If everyone wrote 1/2 as many grant proposals, roughly the same distribution of funding would occur, and time would be freed for more research.

Mechanism design is not that easy—many counterintuitive effects can occur. Academic mechanism design is particularly difficult problem because there are many details. Nevertheless, it may be worthwhile because it’s hard to underestimate the value of an improvement in the rate of useful research.

The good news is that not everything needs to be solved at once. For example, on the empirical side, if we setup an easy system allowing anyone to create challenges like KDDCup, we might achieve a better (i.e. less cheat-prone) understanding of what works and what does not.

3/28/2005

Open Problems for Colt

Tags: Announcements jl@ 11:12 am

Adam Klivans and Rocco Servedio are looking for open (learning theory) problems for COLT. This is a good idea in the same way that the KDDcup challenge is a good idea: crisp problem definitions that anyone can attack yield solutions that advance science.

3/24/2005

The Role of Workshops

Tags: Announcements jl@ 10:37 am

A good workshop is often far more interesting than the papers at a conference. This happens because a workshop has a much tighter focus than a conference. Since you choose the workshops fitting your interest, the increased relevance can greatly enhance the level of your interest and attention. Roughly speaking, a workshop program consists of elements related to a subject of your interest. The main conference program consists of elements related to someone’s interest (which is rarely your own). Workshops are more about doing research while conferences are more about presenting research.

Several conferences have associated workshop programs, some with deadlines due shortly.

ICML workshops Due April 1
IJCAI workshops Deadlines Vary
KDD workshops Not yet finalized

Anyone going to these conferences should examine the workshops and see if any are of interest. (If none are, then maybe you should organize one next year.)

3/22/2005

Active learning

Tags: Active, Prediction Theory sanjoy@ 9:29 pm

Often, unlabeled data is easy to come by but labels are expensive. For instance, if you’re building a speech recognizer, it’s easy enough to get raw speech samples — just walk around with a microphone — but labeling even one of these samples is a tedious process in which a human must examine the speech signal and carefully segment it into phonemes. In the field of active learning, the goal is as usual to construct an accurate classifier, but the labels of the data points are initially hidden and there is a charge for each label you want revealed. The hope is that by intelligent adaptive querying, you can get away with significantly fewer labels than you would need in a regular supervised learning framework.

Here’s an example. Suppose the data lie on the real line, and the classifiers are simple thresholding functions, H = {hw}:

hw(x) = 1 if x > w, and 0 otherwise.

VC theory tells us that if the underlying distribution P can be classified perfectly by some hypothesis in H (called the realizable case), then in order to get a classifier with error rate at most e, it is enough to draw m = O(1/e) random labeled examples from P, and to return any classifier consistent with them. But suppose we instead draw m unlabeled samples from P. If we lay these points down on the line, their hidden labels are a sequence of 0’s followed by a sequence of 1’s, and the goal is to discover the point w at which the transition occurs. This can be accomplished with a simple binary search which asks for just log m labels. Thus active learning gives us an exponential improvement in the number of labels needed: by adaptively querying log m labels, we can automatically infer the rest of them.

Unfortunately, not all that much is known beyond this toy example. To date, the single main theoretical result in the field is [FSST97]’s analysis of the query-by-committee (QBC) learning algorithm. In their model, the learner observes a stream of unlabeled data and makes spot decisions about whether or not to ask for a point’s label. They show that if the data is drawn uniformly from the surface of the d-dimensional unit sphere, and the hidden labels correspond perfectly to a homogeneous (i.e., through the origin) linear separator from this same distribution, then it is possible to achieve generalization error e after seeing O(d/e) points and requesting just O(d log 1/e) labels: an exponential improvement over the usual O(d/e) sample complexity of learning linear separators in a supervised setting. This remarkable result is tempered somewhat by the complexity of the QBC algorithm, which involves computing volumes of intermediate version spaces.

Some recent progress on active learning: [DKM05] show how a simple variant of the perceptron update can be used to achieve these same sample complexity bounds, in the same model. [D04] shows a variety of upper and lower bounds for active learning — for instance, if you allow linear separators which are non-homogeneous then in the above model the sample complexity necessarily shoots up to 1/e.

The theoretical terrain of active learning remains something of an unexplored wilderness. There has, however, been a lot of beautiful theory work (see [A02] for a roundup) on a related model in which the learner is allowed to synthesize query points, rather than simply choosing them from the pool of unlabeled data. This ran into some practical problems: [BL92] found that the resulting synthetic instances were often very difficult for a human to classify!

[A02] D. Angluin. Queries revisited.
[BL92] E. Baum and K. Lang. Query learning can work poorly when a human oracle is used.
[D04] S. Dasgupta. Analysis of a greedy active learning strategy.
[DKM05] S. Dasgupta, A. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning.
[FSST97] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query-by-committee algorithm.

3/21/2005

Research Styles in Machine Learning

Tags: Organization jl@ 4:50 pm

Machine Learning is a field with an impressively diverse set of reseearch styles. Understanding this may be important in appreciating what you see at a conference.

  1. Engineering. How can I solve this problem? People in the engineering research style try to solve hard problems directly by any means available and then describe how they did it. This is typical of problem-specific conferences and communities.
  2. Scientific. What are the principles for solving learning problems? People in this research style test techniques on many different problems. This is fairly common at ICML and NIPS.
  3. Mathematical. How can the learning problem be mathematically understood? People in this research style prove theorems with implications for learning but often do not implement (or test algorithms). COLT is a typical conference for this style.

Many people manage to cross these styles, and that is often beneficial.

Whenver we list a set of alternative, it becomes natural to think “which is best?” In this case of learning it seems that each of these styles is useful, and can lead to new useful discoveries. I sometimes see failures to appreciate the other approaches, which is a shame.

3/18/2005

Binomial Weighting

Tags: Online, Papers, Problems jl@ 8:16 pm

Suppose we have a set of classifiers c making binary predictions from an input x and we see examples in an online fashion. In particular, we repeatedly see an unlabeled example x, make a prediction y’(possibly based on the classifiers c), and then see the correct label y.

When one of these classifiers is perfect, there is a great algorithm available: predict according to the majority vote over every classifier consistent with every previous example. This is called the Halving algorithm. It makes at most log2 |c| mistakes since on any mistake, at least half of the classifiers are eliminated.

Obviously, we can’t generally hope that the there exists a classifier which never errs. The Binomial Weighting algorithm is an elegant technique allowing a variant Halving algorithm to cope with errors by creating a set of virtual classifiers for every classifier which occasionally disagree with the original classifier. The Halving algorithm on this set of virtual classifiers satisfies a theorem of the form:

errors of binomial weighting algorithm less than minc f(number of errors of c, number of experts)

The Binomial weighting algorithm takes as a parameter the maximal minimal number of mistakes of a classifier. By introducing a “prior” over the number of mistakes, it can be made parameter free. Similarly, introducing a “prior” over the set of classifiers is easy and makes the algorithm sufficiently flexible for common use.

However, there is a problem. The minimal value of f() is 2 times the number of errors of any classifier, regardless of the number of classifiers. This is frustrating because a parameter-free learning algorithm taking an arbitrary “prior” and achieving good performance on an arbitrary (not even IID) set of examples is compelling for implementation and use, if we had a good technique for removing the factor of 2. How can we do that?

See the weighted majority algorithm for an example of a similar algorithm which can remove a factor of 2 using randomization and at the expense of introducing a parameter. There are known techniques for eliminating this parameter, but they appear not as tight (and therefore practically useful) as introducing a “prior” over the number of errors.

3/17/2005

Going all the Way, Sometimes

Tags: General jl@ 8:01 pm

At many points in research, you face a choice: should I keep on improving some old piece of technology or should I do something new? For example:

  1. Should I refine bounds to make them tighter?
  2. Should I take some learning theory and turn it into a learning algorithm?
  3. Should I implement the learning algorithm?
  4. Should I test the learning algorithm widely?
  5. Should I release the algorithm as source code?
  6. Should I go see what problems people actually need to solve?

The universal temptation of people attracted to research is doing something new. That is sometimes the right decision, but is also often not. I’d like to discuss some reasons why not.

  1. Expertise Once expertise are developed on some subject, you are the right person to refine them.
  2. What is the real problem? Continually improving a piece of technology is a mechanism forcing you to confront this question. In many cases, this confrontation is uncomfortable because you discover that your method has fundamental flaws with respect to solving the real problem. Not going all the way means you never discover this, except the hard way—people lose interest in your work.
  3. Virtues of breadth When you go all the way, you gain breadth, with a deeper understanding about which problems are important and why. This can be invaluable in focusing your future research.
  4. More Tangible Accomplishment Going all the way means that you can point to your peers and say “I solved it”.

Going all the way is sometimes problematic in research. For example, a paper with a theory, an algorithm, and experimental results invites defeat-in-detail: a reviewer can disagree with any one of these components and eliminate it from consideration. Another issue is that academia doesn’t directly reward implementing and releasing algorithms. A third issue is that you will almost certainly discover topics of interest which don’t fit your home conference(s). It is also very difficult to publish a paper with the title “an incremental improvement on X (which makes it work great in practice)”.

Along with this advice, it is important to remember to fail fast, where appropriate. When you discover that an idea is not workable, quickly quitting it and moving on is a real virtue.

3/15/2005

The State of Tight Bounds

Tags: Prediction Theory jl@ 11:49 am

What? Bounds are mathematical formulas relating observations to future error rates assuming that data is drawn independently. In classical statistics, they are calld confidence intervals.
Why?

  1. Good Judgement. In many applications of learning, it is desirable to know how well the learned predictor works in the future. This helps you decide if the problem is solved or not.
  2. Learning Essence. The form of some of these bounds helps you understand what the essence of learning is.
  3. Algorithm Design. Some of these bounds suggest, motivate, or even directly imply learning algorithms.

What We Know Now

There are several families of bounds, based on how information is used.

  1. Testing Bounds. These are methods which use labeled data not used in training to estimate the future error rate. Examples include the test set bound, progressive validation also here and here, train and test bounds, and cross-validation (but see the big open problem). These techniques are the best available for goal (1) above, but provide little information towards goals (2) or (3). Some of these techniques are computationally efficient while others are not.
  2. Unlabeled test set bounds Instead of using labeled data to construct a tight bound, it is sometimes possible to use unlabeled data.
  3. Training Bounds These are methods which use labeled data to for both training and testing. These bounds provide insight into goals (2) and (3), but are not very effective for goal (1) above. These bounds teach us about how many independent examples are required to guarantee learning success on different on different representations. Any bound of this sort implies a learning algorithm: “minimize the bound”. The set covering machine is an application of this approach to a (variant) sample compression bound. Most other potential applications aren’t “quite there” yet, although they are close. Here is a list of learning algorithms and bounds that “almost work” for these algorithms.
    1. SVM PAC-Bayes margin bound.
    2. Perceptron PAC-Bayes margin bound, Sample Compression Bound
    3. Neural Network PAC-Bayes bound.
    4. Decision Tree Occam’s razor bound.
    5. Decision Tree Pruning Rademacher Complexity Bounds
    6. Semisupervised Clustering PAC-MDL or transductive PAC-Bayes bound
    7. Nearest neighbor ??(Note that the cross validation approaches work well here.)

Limitations
The independence assumption is a significant limitation in the applicability of this approach since we often can not believe that independence is satisfied on natural learning problems. Some work has gone into weakening this assumption.

3/13/2005

Avoiding Bad Reviewing

Tags: Reviewing jl@ 11:08 am

If we accept that bad reviewing often occurs and want to fix it, the question is “how”?

Reviewing is done by paper writers just like yourself, so a good proxy for this question is asking “How can I be a better reviewer?” Here are a few things I’ve learned by trial (and error), as a paper writer, and as a reviewer.

  1. The secret ingredient is careful thought. There is no good substitution for a deep and careful understanding.
  2. Avoid reviewing papers that you feel competitive about. You almost certainly will be asked to review papers that feel competitive if you work on subjects of common interest. But, the feeling of competition can easily lead to bad judgement.
  3. If you feel biased for some other reason, then you should avoid reviewing. For example…
  4. Feeling angry or threatened by a paper is a form of bias. See above.
  5. Double blind yourself (avoid looking at the name even in a single-blind situation). The significant effect of a name you recognize is making you pay close attention to a paper. Since not paying enough attention is a standard failure mode of reviewers, a name you recognize is inevitably unfair to authors you do not recognize.
  6. Don’t review fast. For conferences there is a tendency to review papers right at the deadline. This tendency can easily result in misjudgements because you do not have the opportunity to really understand what a paper is saying.
  7. Don’t review too much. “Too much” is defined on a per-person basis. If you don’t have time to really understand the papers that you review, then you should say “no” to review requests.
  8. Overconfidence is the enemy of truth. If you are not confident about your review, you should not state that you are. Bad reviews are often overconfident reviews.
  9. Always try to make review comments nonpersonal and constructive, especially in a rejection.

Given the above, observations, a few suggestions for improved review organization can be derived.

  1. Double blind. A common argument against double blind reviewing is that it is often defeatable. This is correct and misses the point. The reason why double blind reviewing is helpful is that a typical reviewer who wants to review well is aided by the elimination of side information which should not effect the acceptance of a paper. (ICML and AAAI are double blind this year.) Another reason why double blind reviewing is “right”, is that it simply appears fairer. This makes it easier on average for authors to take rejections in a more constructive manner.
  2. Staggered deadlines. Many people can’t prioritize reviews well, so the prioritization defaults to deadline proximity. Consequently, instead of having many paper reviews due on one day, having them due at the rate of one-per-day (or an even slower rate) may be helpful. These should be real deadlines in the sense that “you get it in by this date or you are excluded from conversation and decision making about the paper”.
  3. Large PCs. There is a tendency to value (and romanticize) the great researcher. But a great researcher with many papers to review can only be a mediocre reviewer due to lack of available attention and time. Consequently, increasing the size of the PC may be helpful for small PC conferences.
  4. Communication channels. A typical issue in reviewing a paper is that some detail is unintentionally (and accidentally) unclear. In this case, being able to communicate with the authors is helpful. This communication can be easily setup to respect the double blind guarantee by routing through the conference site. This communication does not change the meaning of a reviewers job. ICML and AAAI are allowing author feedback. I mean something more spontaneous, but this is a step in that direction.
  5. Refusal. In many cases, it is not possible to tell that you have a conflict of interest in a paper until after seeing it. A mechanism for saying “I have a conflict of interest, please reassign the paper” should exist, and it’s use should be respected.
  6. Independence. Access to other reviews should not be available until after completing your own review. The point of having multiple reviews is reducing noise. Allowing early access to other reviews increases noise by decreasing independence amongst reviewers. Many conferences (but not all) follow this pattern.

If you have more ideas, please add them.

3/10/2005

Breaking Abstractions

Tags: General jl@ 12:29 pm

Sam Roweis’s comment reminds me of a more general issue that comes up in doing research: abstractions always break.

  1. Real number’s aren’t. Most real numbers can not be represented with any machine. One implication of this is that many real-number based algorithms have difficulties when implemented with floating point numbers.
  2. The box on your desk is not a turing machine. A turing machine can compute anything computable, given sufficient time. A typical computer fails terribly when the state required for the computation exceeds some limit.
  3. Nash equilibria aren’t equilibria. This comes up when trying to predict human behavior based on the result of the equilibria computation. Often, it doesn’t work.
  4. The probability isn’t. Probability is an abstraction expressing either our lack of knowledge (the Bayesian viewpoint) or fundamental randomization (the frequentist viewpoint). From the frequentist viewpoint the lack of knowledge typically precludes actually knowing the fundamental randomization. From the Bayesian viewpoint, precisely specifying our lack of knowledge is extremely difficult and typically not done.

So, what should we do when we learn that our basic tools can break? The answer, of course is to keep using them until something better comes along. However, the uncomfortable knowledge our tools break is necessary in a few ways:

  1. When considering a new abstraction, the existence of a break does not imply that it is a useless abstraction. (Just as the existence of the breaks above does not imply that they are useless abstractions.)
  2. When using an abstraction in some new way, we must generally consider “is this a reasonable use?”
  3. Some tools break more often than other tools. We should actively consider the “rate of breakage” when deciding amongst tools.

3/9/2005

Bad Reviewing

Tags: Reviewing jl@ 1:20 pm

This is a difficult subject to talk about for many reasons, but a discussion may be helpful.

Bad reviewing is a problem in academia. The first step in understanding this is admitting to the problem, so here is a short list of examples of bad reviewing.

  1. Reviewer disbelieves theorem proof (ICML), or disbelieve theorem with a trivially false counterexample. (COLT)
  2. Reviewer internally swaps quantifiers in a theorem, concludes it has been done before and is trivial. (NIPS)
  3. Reviewer believes a technique will not work despite experimental validation. (COLT)
  4. Reviewers fail to notice flaw in theorem statement (CRYPTO).
  5. Reviewer erroneously claims that it has been done before (NIPS, SODA, JMLR)—(complete with references!)
  6. Reviewer inverts the message of a paper and concludes it says nothing important. (NIPS*2)
  7. Reviewer fails to distinguish between a DAG and a tree (SODA).
  8. Reviewer is enthusiastic about paper but clearly does not understand (ICML).
  9. Reviewer erroneously believe that the “birthday paradox” is relevant (CCS).

The above is only for cases where there was sufficient reviewer comments to actually understand reviewer failure modes. Many reviewers fail to leave sufficient comments and it’s easy to imagine they commit similar mistakes.

Bad reviewing should be clearly distinguished from rejections—note that some of the above examples are actually accepts.

The standard psychological reaction to any rejected paper is trying to find fault with the reviewers. You, as a paper writer, have invested significant work (weeks? months? years?) in the process of creating a paper, so it is extremely difficult to step back and read the reviews objectively. One distinguishing characteristic of a bad review from a rejection is that it bothers you years later.

If we accept that bad reviewing happens and want to address the issue, we are left with a very difficult problem. Many smart people have thought about improving this process, yielding the system we observe now. There are many subtle issues here and several solutions that (naively) appear obvious don’t work.

3/8/2005

Fast Physics for Learning

Tags: Bayesian, General jl@ 2:54 pm

While everyone is silently working on ICML submissions, I found this discussion about a fast physics simulator chip interesting from a learning viewpoint. In many cases, learning attempts to predict the outcome of physical processes. Access to a fast simulator for these processes might be quite helpful in predicting the outcome. Bayesian learning in particular may directly benefit while many other algorithms (like support vector machines) might have their speed greatly increased.

The biggest drawback is that writing software for these odd architectures is always difficult and time consuming, but a several-orders-of-magnitude speedup might make that worthwhile.

3/5/2005

Funding Research

Tags: General jl@ 8:26 pm

The funding of research (and machine learning research) is an issue which seems to have become more significant in the United States over the last decade. The word “research” is applied broadly here to science, mathematics, and engineering.

There are two essential difficulties with funding research:

  1. Longshot Paying a researcher is often a big gamble. Most research projects don’t pan out, but a few big payoffs can make it all worthwhile.
  2. Information Only Much of research is about finding the right way to think about or do something.

The Longshot difficulty means that there is high variance in payoffs. This can be compensated for by funding many different research projects, reducing variance.

The Information-Only difficulty means that it’s hard to extract a profit directly from many types of research, so companies have difficulty justifying basic research. (Patents are a mechanism for doing this. They are often extraordinarily clumsy or simply not applicable.)

These two difficulties together imply that research is often chronically underfunded compared to what would be optimal for any particular nation. They also imply that funding for research makes more sense for larger nations and makes sense for government (rather than private) investment.

The United States has a history of significant research, and significant benefits from research, but that seems to be under attack.

  1. Historically, the old phone monopoly ran Bell Labs which employed thousands doing science and engineering research. It made great sense for them because research was a place to stash money (and evade regulators) that might have some return. They invented the transistor, the laser, and unix. With the breakup of the phone monopoly, it no longer made sense, and so it has been broken apart and has lost orders of magnitude of staff.
  2. On a smaller scale, Xerox Parc (inventors of mice, ethernet, and other basic bits of computer technology) has been radically scaled back.
  3. IBM and HP, who have been historically strong funders of computer-related research have been forced to shift towards more direct research. (Some great research still happens at these places, but the overall trend seems clear.)
  4. The NSF has had funding cut.

What’s Left
The new monopoly on the block is Microsoft, which has been a significant funder of new research, some of which is basic. IBM is still managing to do some basic research. Many companies are funding directed research (with short term expected payoffs). The NSF still has a reasonable budget, even if it is a bit reduced. Many other branches of the government fund directed research of one sort or another. From the perspective of a researcher, this isn’t as good as NSF because it is “money with strings attached”, including specific topics, demos, etc…

Much of the funding available falls into two or three categories: directed into academia, very directed, or both. These have difficulties from a research viewpoint.

  1. Into Academia The difficulty with funding directed into academia is that the professors who it is directed at are incredibly busy with nonresearch. Teaching and running a university are full time jobs. It takes an extraordinary individual to manage all of this and get research done. (Amazingly, many people do manage, but the workload can be crushing.) From the perspective of funding research, this is problematic, because the people being funded are forced to devote much time to nonresearch. As an example from machine learning, AT&T inherited the machine learning group from Bell Labs consisting of Leon Bottou, Michael Collins, Sanjoy Dasgupta, Yoave Freund, Michael Kearns, Yann Lecun, David McAllester, Robert Schapire, Satinder Singh, Peter Stone, Rich Sutton, Vladimir Vapnik (and maybe a couple more I’m forgetting). The environment there was almost pure research without other responsibilities. It would be extremely difficult to argue that a similar-sized group drawn randomly from academia has had as significant an impact on machine learning. This group is gone now, scattered to many different locations.
  2. Very directed It’s a basic fact of research that it is difficult to devote careful and deep attention to something that does not interest you. Attempting to do so simply means that many researchers aren’t efficient. (I’m not arguing against any direction here. It makes sense for any nation to invest in directions which seem important.)

The Good News (for researchers, anyways)
The good news at the moment is outside of the US. NICTA, in Australia, seems to be a well made attempt to do research right. India is starting to fund research more. China is starting to fund research more. Japan is starting to fund basic research more. With the rise of the EU more funding for research makes sense because the benefit applies to a much larger pool of people. In machine learning, this is being realized with the PASCAL project. On the engineering side, centers like the Mozilla Foundation and OSDL (which are funded by corporate contributions) provide some funding for open source programmers.

We can hope for improvements in the US—there is certainly room for it. For example, the NSF budget is roughly 0.3% of the Federal government budget so the impact of more funding for basic research is relatively trivial in the big picture. However, it’s never easy to tradeoff immediate needs against the silent loss of the future.

3/4/2005

The Big O and Constants in Learning

Tags: Prediction Theory jl@ 12:23 am

The notation g(n) = O(f(n)) means that in the limit as n approaches infinity there exists a constant C such that the g(n) is less than Cf(n).

In learning theory, there are many statements about learning algorithms of the form “under assumptions x,y, and z, the classifier learned has an error rate of at most O(f(m))“.

There is one very good reason to use O(): it helps you understand the big picture and neglect the minor details which are not important in the big picture. However, there are some important reasons not to do this as well.

  1. Unspeedup In algorithm analysis, the use of O() for time complexity is pervasive and well-justified. Determining the exact value of C is inherently computer architecture dependent. (The “C” for x86 processors might differ from the “C” on PowerPC processors.) Since many learning theorists come from a CS theory background, the O() notation is applied to generalization error. The O() abstraction breaks here—you can not generally change learning algorithm and decrease your error rate by some constant factor.
  2. You’re fired When you go and run a learning algorithm to acquire some predictor, the performance it achieves is a key quantity of significant interest. Using an algorithm with merely twice the error rate of a better algorithm can easily make the difference between “it works” and “it doesn’t”. This is not often true when programming, as evidenced by the large number of people who use computationally inefficient languages to solve problems.

We can’t say “never use O()“, because sometimes abstracting away details is the right thing to do. However, any use of O() should be analyzed for “reasonableness” more thoroughly than for computational complexity. Similarly, more interest should be allocated to improving constants in such analysis than is done in algorithms.

3/2/2005

Prior, “Prior” and Bias

Many different ways of reasoning about learning exist, and many of these suggest that some method of saying “I prefer this predictor to that predictor” is useful and necessary. Examples include Bayesian reasoning, prediction bounds, and online learning. One difficulty which arises is that the manner and meaning of saying “I prefer this predictor to that predictor” differs.

  1. Prior (Bayesian) A prior is a probability distribution over a set of distributions which expresses a belief in the probability that some distribution is the distribution generating the data.
  2. “Prior” (Prediction bounds & online learning) The “prior” is a measure over a set of classifiers which expresses the degree to which you hope the classifier will predict well.
  3. Bias (Regularization, Early termination of neural network training, etc…) The bias is some (often implicitly specified by an algorithm) way of preferring one predictor to another.

This only scratches the surface—there are yet more subtleties. For example the (as mentioned in meaning of probability) shifts from one viewpoint to another.

Powered by WordPress