# Machine Learning (Theory)

## 1/6/2008

### Research Political Issues

Tags: Funding,General,Research jl@ 5:02 pm

I’ve avoided discussing politics here, although not for lack of interest. The problem with discussing politics is that it’s customary for people to say much based upon little information. Nevertheless, politics can have a substantial impact on science (and we might hope for the vice-versa). It’s primary election time in the United States, so the topic is timely, although the issues are not.

There are several policy decisions which substantially effect development of science and technology in the US.

1. Education The US has great contrasts in education. The top universities are very good places, yet the grade school education system produces mediocre results. For me, the contrast between a public education and Caltech was bracing. For many others attending Caltech, it clearly was not. Upgrading the k-12 education system in the US is a long-standing chronic problem which I know relatively little about. My own experience is that a basic attitude of “no child unrealized” is better than “no child left behind”. A fair claim can also be made that the US just doesn’t invest enough.
2. Respect Lack of respect for science and technology is routinely expressed in many ways in the US.
1. The most bald form of lack of respect is scientific censorship. This may be easily understood as a generality: you choose to spend a large fraction of your life learning to interpret some part of the world. After years, you come to some conclusion about the nature of the world. Then, someone with no particular experience or expertise tells you to alter it.
2. A more refined form of lack of respect is simply lack of presence in decision making. This isn’t necessarily intentional: many people simply make decisions from the gut, and then come up with reasons to justify their decision. This style explicitly cuts out the deep thinking of science. Many policies could have been better informed by a serious consideration of even basic science:
1. The oil of Iraq is fundamentally less valuable if we are going to tackle global warming.
2. Swapping gasoline for hydrogen-based transportable energy source is dubious because it introduces another energy storage conversion to lose energy on. The same goes for swapping bioethanol for gasoline. In contrast, hybrid and electric vehicles actually recover substantial energy from regenerative braking, and a plug-in hybrid could run off electricity in typical commuter usage.
3. The Space Shuttle is a boondoggle design. The rocket equation implies that the ratio of initial to final mass for vehicles reaching earth orbit must be at least a factor of e2.5 (it’s actually e2.93 for the Space Shuttle). Making the system reusable implies that most of this mass returns to earth so the payload deliverable into space is only 1.2% of the liftoff mass. A better designed system might deliver payloads a factor of 4 larger or be much smaller.
4. Passenger Inspections at airports is another poor policy from the perspective of science. It isn’t effective, and there is no cost-efficient way to make it effective against a motivated opponent. Solid evidence for this is the continued use of mules to smuggle drugs. The basic problem from a chemistry point of view is that too much can be done with a small amount of mass. Deterrence and limitation (armored cockpits and active resistance for example) are fine policies.
3. Lack of support. The simplest form of lack of respect is simply lack of support. The case for federal vs corporate funding of basic science and technology development is very simple: the benefit to society of conducting such work dramatically exceeds the benefit any one agent within society (such as a company) could gain from it. Of late, investment in core science has been an anemic 0.0005 GDP and visa issues hamstring broader technology development.
3. Confidence This is primarily related to the technology side of science and technology. Many policy decisions are made without confidence in the ability of technologists to adapt. This comes in at least two flavors.
1. The foreordained solution. Policy often comes in the form “we use approach X to solve problem Y” (some examples are above). This demonstrates an overconfidence by policy makers in there ability to pick the winner, and a lack of confidence in the ability of technologists to solve problems. It also represents an opportunity for large established industries to get huge payoffs at taxpayer expense. The X-prize represents the opposite of this approach, and it has been radically more effective by any reasonable standard.
2. Confusion about the meaning of wealth. Some people believe that wealth is about what you have. However, for a society it seems much better to measure wealth in terms of what the society can do. Policy makers often forget that science and technology is a capability when it comes time to think of a solution. For example, someone with no confidence in the ability to create and make affordable plugin electric hybrids might think it necessary to conquest for oil.
4. Stability People can’t program, do science, or invent new things when they are worried about more immediate events. There are several destabilizing trends going on in the US right now which either now or in the future may make it hard to focus away from immediate concerns.
1. Debt and money supply. The federal debt for the US government is about 3.5 times the federal budget. This is bad for the simple reason that investors buying US treasury bonds aren’t investing in new technology. However, the destabilizing concern is more subtle. Since world war II, the US dollar has become the standard currency for exchange around the world. Since debt by the government creates a temptation by the government to (effectively) print money, the number of dollars in circulation has been rapidly growing. But, a growing number of dollars means that the currency is devaluing, which makes owning dollars undesirable. I don’t know an example of a previous world currency that has ceased to be such, but basic economics says that bad things happen to dollar-based savings if all the dollars flow back into the US. So far, the decline of the dollar has been relatively gradual, but a very disruptive cliff might exist out there somewhere. Policies which increase debt (like cutting taxes and increasing spending) exacerbate this problem. There is no fix once the dollar loses world currency status because confidence can be lost quickly, but not regained.
2. Health Care. The US is running an experiment to determine how large a fraction of GDP can be devoted to health care. Currently it’s over 15%, in first place, and growing. This is even worse than it sounds, because many comparable countries in Europe (or Japan) have older populations which should generally be more expensive to take care of. In the present situation, because health care is incredibly expensive, losing health insurance (which is typically tied to a job) is potentially catastrophic for any individual.
3. Wealth Asymmetry. The US has shifted towards a substantially more asymmetric division of wealth since the 1970s. An asymmetric division of wealth is not fundamentally bad—there needs to be room for great success to imply great rewards. However, a casual correlation of science and technology development with the gini coefficient map reveals that a large gini coefficient and substantial science and technology development do not coincide. The problem is that wealth becomes inheritable, and it’s very unlikely that the wealth is inherited by a someone interested in science and technology. Wealth is now scheduled to become perfectly inheritable in 2010 in the US.

I’m sure some of these issues are endemic to many other parts of the world as well, because there are fundamental conceptual difficulties with investing in the unknown instead of the known.

## 6/16/2006

### Regularization = Robustness

The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math,

argmax_p H(p)
s.t. E_p[f_i] = c_i

is given by e^{\sum \lambda_i f_i}/Z. (Z here is the necessary normalization constraint, and the lambdas are free parameters we set to meet the expectation constraints).

A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (Motivating work in models in text learning and Conditional Random Fields, for instance. ) The result has been demonstrated a number of ways. One of the most elegant is the Ã¢â‚¬Å“geometricÃ¢â‚¬Â version here.

In the case when the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family. ItÃ¢â‚¬â„¢s a surprising connection and the duality it flows from appears in a wide variety of work. (For instance, Martin WainwrightÃ¢â‚¬â„¢s approximate inference techniques rely (in essence) on this result.)

In practice, we know that Maximum Likelihood with a lot of features is bound to overfit. The traditional trick is to pull a sleight of hand in the derivation. We start with the primal entropy problem, move to the dual, and in the dual add a Ã¢â‚¬Å“priorÃ¢â‚¬Â that penalizes the lambdas. (Typically an l_1 or l_2 penalty or constraint.) This game is played in a variety of papers, and itÃ¢â‚¬â„¢s a sleight of hand because the penalties donÃ¢â‚¬â„¢t come from the motivating problem (the primal) but rather get tacked on at the end. In short: itÃ¢â‚¬â„¢s a hack.

So I realized a few months back, that the primal (entropy) problem that regularization relates to is remarkably natural. Basically, it tells us that regularization in the dual corresponds directly to uncertainty (mini-max) about the constraints in the primal. What we end up with is a distribution p that is robust in the sense that it maximizes the entropy subject to a large set of potential constraints. More recently, I realized that IÃ¢â‚¬â„¢m not even close to having been the first to figure that out. Miroslav DudÃƒÂ­k, Steven J. Phillips and Robert E. Schapire, have a paper that derives this relation and then goes a step further to show what performance guarantees the method provides. ItÃ¢â‚¬â„¢s a great paper and I hope you get a chance to check it out:

Performance guarantees for regularized maximum entropy density estimation.

(Even better: if youÃ¢â‚¬â„¢re attending ICML this year, I believe you will see Rob Schapire talk about some of this and related material as an invited speaker.)

It turns out the idea generalizes quite a bit. In Robust design of biological experiments. P. Flaherty, M. I. Jordan and A. P. Arkin show a related result where regularization directly follows from a robustness or uncertainty guarantee. And if you want the whole, beautiful framework youÃ¢â‚¬â„¢re in luck. Yasemin Altun and Alex Smola have a paper (that I havenÃ¢â‚¬â„¢t yet finished, but at least begins very well) that generalizes the regularized maximum entropy duality to a whole class of statistical inference procedures. If youÃ¢â‚¬â„¢re at COLT, you can check this out as well.

Unifying Divergence Minimization and Statistical Inference via Convex Duality

The deep, unifying result seems to be what the title of the post says: robustness = regularization. This viewpoint makes regularization seem like much less of a hack, and goes further in suggesting just what range of constants might be reasonable. The work is very relevant to learning, but the general idea goes beyond to various problems where we only approximately know constraints.

## 6/5/2006

### Server Shift, Site Tweaks, Suggestions?

Tags: General jl@ 8:40 pm

Hunch.net has shifted to a new server, and wordpress has been updated to the latest version. If anyone notices difficulties associated with this, please comment. (Note that DNS updates can take awhile so the shift may not yet be complete.)
More generally, this is a good time to ask for suggestions. What would make this blog more useful?

## 4/30/2006

### John Langford –> Yahoo Research, NY

Tags: General jl@ 10:02 pm

I will join Yahoo Research (in New York) after my contract ends at TTI-Chicago.

The deciding reasons are:

1. Yahoo is running into many hard learning problems. This is precisely the situation where basic research might hope to have the greatest impact.
2. Yahoo Research understands research including publishing, conferences, etc…
3. Yahoo Research is growing, so there is a chance I can help it grow well.
4. Yahoo understands the internet, including (but not at all limited to) experimenting with research blogs.

In the end, Yahoo Research seems like the place where I might have a chance to make the greatest difference.

Yahoo (as a company) has made a strong bet on Yahoo Research. We-the-researchers all hope that bet will pay off, and this seems plausible. I’ll certainly have fun trying.

## 12/28/2005

### Yet more nips thoughts

Tags: General DrewBagnell@ 12:13 pm

I only managed to make it out to the NIPS workshops this year so
I’ll give my comments on what I saw there.

The Learing and Robotics workshops lives again. I hope it
continues and gets more high quality papers in the future. The
most interesting talk for me was Larry Jackel’s on the LAGR
program (see John’s previous post on said program). I got some
ideas as to what progress has been made. Larry really explained
make the goals achievable but challenging.

Hal Daume gave a very interesting talk about structured
prediction using RL techniques, something near and dear to my own
heart. He achieved rather impressive results using only a very
greedy search.

The non-parametric Bayes workshop was great. I enjoyed the entire
morning session I spent there, and particularly (the usually
desultory) discussion periods. One interesting topic was the
Gibbs/Variational inference divide. I won’t try to summarize
especially as no conclusion was reached. It was interesting to
note that samplers are competitive with the variational
approaches for many Dirichlet process problems. One open question
I left with was whether the fast variants of Gibbs sampling could
be made multi-processor as the naive variants can.

I also have a reading list of sorts from the main
conference. Most of the papers mentioned in previous posts on
NIPS are on that list as well as these: (in no particular order)

The Information-Form Data Association Filter
[ps.gz][pdf][bibtex]

Divergences, surrogate loss functions and experimental design
XuanLong Nguyen, Martin Wainwright, Michael Jordan [ps.gz][pdf][bibtex]

Generalization to Unseen Cases
Teemu Roos, Peter GrÃƒÂ¼nwald, Petri MyllymÃƒÂ¤ki, Henry Tirri [ps.gz][pdf][bibtex]

Gaussian Process Dynamical Models
David Fleet, Jack Wang, Aaron Hertzmann [ps.gz][pdf][bibtex]

Convex Neural Networks
Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau,
Patrice Marcotte [ps.gz][pdf][bibtex]

Describing Visual Scenes using Transformed Dirichlet Processes
Erik Sudderth, Antonio Torralba, William Freeman, Alan Willsky
[ps.gz][pdf][bibtex]

Learning vehicular dynamics, with application to modeling helicopters
Pieter Abbeel, Varun Ganapathi, Andrew Ng [ps.gz][pdf][bibtex]

Tensor Subspace Analysis
Xiaofei He, Deng Cai, Partha Niyogi [ps.gz][pdf][bibtex]

## 12/22/2005

### Yes , I am applying

Tags: General jl@ 6:43 pm

Every year about now hundreds of applicants apply for a research/teaching job with the timing governed by the university recruitment schedule. This time, it’s my turn—the hat’s in the ring, I am a contender, etc… What I have heard is that this year is good in both directions—both an increased supply and an increased demand for machine learning expertise.

I consider this post a bit of an abuse as it is neither about general research nor machine learning. Please forgive me this once.

My hope is that I will learn about new places interested in funding basic research—it’s easy to imagine that I have overlooked possibilities.

I am not dogmatic about where I end up in any particular way. Several earlier posts detail what I think of as a good research environment, so I will avoid a repeat. A few more details seem important:

1. Application. There is often a tension between basic research and immediate application. This tension is not as strong as might be expected in my case. As evidence, many of my coauthors of the last few years are trying to solve particular learning problems and I strongly care about whether and where a learning theory is useful in practice.
2. Duration. I would like my next move to be of indefinite duration.

Feel free to email me (jl@hunch.net) if there is a possibility you think I should consider.

## 12/9/2005

### Machine Learning Thoughts

Tags: General jl@ 12:04 am

I added a link to Olivier Bousquet’s machine learning thoughts blog. Several of the posts may be of interest.

## 12/7/2005

### Is the Google way the way for machine learning?

Tags: General jl@ 5:36 pm

Urs Hoelzle from Google gave an invited presentation at NIPS. In the presentation, he strongly advocates interacting with data in a particular scalable manner which is something like the following:

1. Make a cluster of machines.
2. Build a unified filesystem. (Google uses GFS, but NFS or other approaches work reasonably well for smaller clusters.)
3. Interact with data via MapReduce.

Creating a cluster of machines is, by this point, relatively straightforward.

Unified filesystems are a little bit tricky—GFS is capable by design of essentially unlimited speed throughput to disk. NFS can bottleneck because all of the data has to move through one machine. Nevertheless, this may not be a limiting factor for smaller clusters.

MapReduce is a programming paradigm. Essentially, it is a combination of a data element transform (map) and an agreggator/selector (reduce). These operations are highly parallelizable and the claim is that they support the forms of data interaction which are necessary.
Apparently, the Nutch project has an open source implementation of mapreduce (but this is clearly the most nonstandard element).

Shifting towards this paradigm has several effects:

1. It makes “big data” applications more viable.
2. It makes some learning algorithms more viable than others. One way to think about this is in terms of statistical query learning algorithms. The (generalized) notion of statistical query algorithms is algorithms that rely upon only the results of expections of a (relatively small) number of functions. Any such algorithm can be implemented via mapreduce. The “naive bayes” algorithm and most decision tree algorithms can be easily phrased as statistical query algorithms. Support vector machines can (technically) be phrased as statistical query algorithms, but the number of queries scales with the number of datapoints. Gradient descent algorithms can also be phrased as statistical query algorithms. Learning algorithms which work on one example at a time are not generally statistical query algorithms.

Another way to think about this is in terms of the complexity of the computation. Roughly speaking, as the amount of data scales, only O(n) or (perhaps) O(n log(n)) algorithms are tractable. This strongly favors online learning algorithms. Decision trees and naive bayes are (again) relatively reasonable. Support vector machines (or gaussian processes) encounter difficulties related to scaling.

There is a reasonable argument that the “low hanging fruit” of machine learning research is in the big data with enabling tools paradigm. This is because (a) the amount of data available has been growing far faster than the amount of computation and (b) we just haven’t had the tools to scale here, until recently.

I expect Urs is right: we should look in this direction.

## 12/1/2005

### The Webscience Future

Tags: General jl@ 10:47 am

The internet has significantly effected the way we do research but it’s capabilities have not yet been fully realized.

First, let’s acknowledge some known effects.

1. Self-publishing By default, all researchers in machine learning (and more generally computer science and physics) place their papers online for anyone to download. The exact mechanism differs—physicists tend to use a central repository (Arxiv) while computer scientists tend to place the papers on their webpage. Arxiv has been slowly growing in subject breadth so it now sometimes used by computer scientists.
2. Collaboration Email has enabled working remotely with coauthors. This has allowed collaborationis which would not otherwise have been possible and generally speeds research.

Now, let’s look at attempts to go further.

1. Blogs (like this one) allow public discussion about topics which are not easily categorized as “a new idea in machine learning” (like this topic).
2. Organization of some subfield of research. This includes Satinder Singh’s Reinforcement Learning pages, and, more generally books that have been placed online such as this one.
3. Discussion Groups The kernel machines discussions provide a good example of some common format allowing discussion.
4. Class notes have been placed online such as Avrim’s learning theory lecture notes.
5. Wikipedia has an article on Machine Learning. The article gives a reasonable quick overview and is surely read by a very large number of people.
6. Online Proceedings are now being used by several conferences such as NIPS.

Now, let’s consider some futures.

1. Wikifuture Wikipedia becomes better to the point where it is a completely comprehensive listing of current research in machine learning. At some point, we-the-community realize this and begin to emphasize (and credit) information placed in wikipedia. This process reinforces itself to the point where “if it’s not in wikipedia, it doesn’t exist”.

This future is significantly more probable than most people understand. As evidence compare the machine learning page three years ago (yep, it didn’t exist), two years ago, one year ago, and today. That progression strongly suggests that wikipedia:machine learning will continue to grow into a significant information resource.

There are fundamental obstacles to the success of the wikipedia future.

1. credit Wikipedia has only very weak mechanisms for crediting editors. A list of the changes done by one user account is about as much credit as is available. This is not enough to make career-deciding questions on. We could hope for a stronger link between identity and editor along with tools to track the value of particular edits (Think of counting hyperlinks as an analogue for counting citations).
2. controversy Wikipedia has grown up in a nonmanipulative environment. When it was little known, the incentive to fabricate entries was not great. Now that it is becoming well known that incentive is growing. Character assasination by false article exists. In science, the thing to worry about is misplaced ideas of the importance of your topic of research since it is very difficult to be sufficiently interested in a research topic and simultaneously view it objectively. Research is about creating new ideas, and the location of these ideas in some general organization is in dispute by default.
2. Evolutionary Progression Consider the following sequence of steps.
1. Conference Organization We realize that having a list of online papers isn’t nearly as useful as having an organized list of online papers so the conferences which already have online proceedings create an explorable topic hierarchy.
2. Time Organization We realize that the organization at one particular year’s conference is sketchy—research is a multiyear endeavor. Consequently, we start adding to last years topic hierarchy rather than creating a new one from scratch each year.
3. Transformation We realize that it is better if papers are done in the language of the web. For example, it’s very handy to be able to hyperlink inside of a paper. A good solution to the math on the web problem would greatly help here.
4. Consolidation We realize that there is a lot of redundancy in two papers on the same or a similar topic. They share an introduction, motivation, and (often) definitions. By joining the shared pieces, the contents of both papers can be made clearer.

Each of these individual steps clearly yields something better. At the end of these steps, creating a paper is simply the process of creating a webpage or altering an existing webpage. We can imagine doing all of this while keeping the peer-review mechanisms of science intact, so the resulting process is simply better in all ways. It’s easier to author because for most papers much of the “filler” introduction/motivation/definition can be reused from previous papers. It’s easier to review, because reviewers can consider the result in context. Much of the difficulty of reviewing is simply due to the author and reviewer not being “on the same page” in how they understand things. An organized topic hierarchy greatly aids this.

3. The unknown It is difficult to anticipate everything. What other futures might exist?

Which future comes about is dependent on many things—the decisions of community leaders, enabling ‘math-on-the-web’ technologies, etc…, so it is difficult to predict which future and when it will come about. Nevertheless, a potential exists and there are several paths leading towards reaching that potential.

## 11/26/2005

### The Design of an Optimal Research Environment

Tags: General jl@ 10:04 am

How do you create an optimal environment for research? Here are some essential ingredients that I see.

1. Stability. University-based research is relatively good at this. On any particular day, researchers face choices in what they will work on. A very common tradeoff is between:
1. easy small
2. difficult big

For researchers without stability, the ‘easy small’ option wins. This is often “ok”—a series of incremental improvements on the state of the art can add up to something very beneficial. However, it misses one of the big potentials of research: finding entirely new and better ways of doing things.

Stability comes in many forms. The prototypical example is tenure at a university—a tenured professor is almost imposssible to fire which means that the professor has the freedom to consider far horizon activities. An iron-clad guarantee of a paycheck is not necessary—industrial research labs have succeeded well with research positions of indefinite duration. Atnt research was a great example of this before they lost their monopoly and had to fire the researchers.

An inadequate amount of stability is probably given by fixed duration appointments. The appropriate timescale to consider here is one year because there is a strong annual cycle for application. A one year appointment is probably too short while a 10 year appointment may be effectively the same as arbitrary duration.

One significant part of stability is financial stability of the parent organization. The nature of research implies that what makes ‘funding sense’ depends on the size of the set of people who can benefit from it. Since ‘governments’ are the largest organizations around, they are good candidates, with the caveat that political will has not been known as greatly stable.

For professors at a university, teaching a subject well requires very significant time and energy. Running the university requires very significant time and energy. Writing grant proposals for funding requires very significant time and energy (success in grant writing is typically a necessity to achieve tenure at US research universities). Advising students well requires significant time and energy. Some of these activities are partially synergistic with research, but it seems unarguable that the multiplicity of time demands can greatly slow research.

In industrial labs (think of “IBM research” as an example), there is an tendency to shift research towards short term rewards. This is partly becaue the research labs have not proved very capable of using the big successes and partly because even large businesses have ups and downs. During a ‘down’ time, it is very tempting to use the reserve of very capable people in a research lab for short duration projects.

One thing to understand about research is that it is not the case that a person with half as much free time can produce half as much work. The problems encountered might have sufficient intricacy that they require a day or more of overhead to begin making progress.

3. Problem exposure The set of research problems which people could work on is much larger than the set of research problems which are plausibly useful to the rest of the world (*). The way narrow down on the relevant problems is to give researchers a chance to encounter them.

Univeristy-based research can have problems here, because no really effective mechanism for doing this exists by default. In industrial labs, researchers often act as “superconsultants” helping to solve problems. This method is effective but perhaps relatively crude because as often instantiated, it conflicts with “free time”.

(*) There is always someone who says “but at the time number theory was researched, nobody had any practical use, and now it is very useful for crypto”. This is true, but there are several important caveats:

1. The “hit rate” (in terms of big impact on everyones lives) for unmotivated research is much lower. I don’t have a good measurement of how much lower, but “significantly” seems both plausibly and anecdotally correct.
2. Number theory was not entirely unmotivated at the time. Afterall, it was clear that numbers were a very important abstraction and so a better understanding of numbers was plausibly useful.
3. Would development of factoring algorithms and their hardness understanding have been much delayed without number theory? And would that delay have been worth, in exchange, an earlier development of calculus? These are at least hard questions to answer.

There is a role for research of the “let’s do it because I am curious nature”, especially because interested people do good research. The size of that role must, necessarily, be relatively small.

4. Benefit exposure Almost all researchers are not well-exposed to the benefits of research. University-based research is may do this best via giving the researchers the ability to form their own companies with partial ownership. However, the overhead associated with “form your own company” is very substantial creating a formidable barrier (some good news, is that the barrier may be decreasing). This fails for many research advances which are obviously useful yet have no reasonable mechanism of capturing it’s profit. In the context of machine learning, spam filtering is of obvious and significant common use, yet a researcher with a better spam filter can not easily create a company around the method. There are several reasons for this.
1. Machine learning is a victim of it’s common success. It’s hard to develop a learning algorithm which is substantially better than others. This means that anyone wanting to implement spam filtering can do so. Patents are useless here—you can’t patent an entire field (and even if you could it wouldn’t work).
2. Spam filtering is not easily modularizable. Most people want their email clients to filter spam, but they don’t want to switch email clients. Combined with the first observation, this makes it clear that a “spam filtering company” is going to run into trouble.

The benefit structure in industrial research labs is (perhaps) often even worse. The primary mechanism is via bonuses, where 10% of base salary might be large. Obviously, keeping the job is of much greater importance. A system where the base salary is bumped by a small (variable) amount each year is common. The implication of this is that rewards are delayed (reducing incentives) compared to a bonus and age discrimination becomes a problem.

A relative bright spot here is google. Much of what they work on is not research-as-we-know-it, but some of it is, and there seem to be significant mechanisms for rewarding success, even after the IPO.

5. Devolution of power There is a traditional notion of how things are done which says “everyone must have a manager to tell them what to do”. This works badly in research for several reasons.
1. People are simply not very good at doing research on a topic which does not interest them. The problem is hard enough that doing it in a disinterested or even not-fully-interested way greatly reduces the chance of success.
2. Another difficulty here is judgement. People make mistakes, including even managers. If a manager has a good idea about what is an interesting topic of research, then the power to say “work on this” is mildly beneficial over simple persuasion. On the other hand, a mistake is greatly harmful.

The alternative to a command-and-control notion of research is a persuasion-based system. In a persuasion-based system, people are free to either work with someone or not, as they judge. In a persuasion based system, no compelling idea means that researchers try different problems while, which seems appropriate. And, when something promising comes along, more people can be added as demand and persuasiveness allows. University style research partially achieves this (a notable complaint of students is that they do not have enough freedom). In industrial labs, this varies widely.

It is worth noting that devolution of power has occurred in all of the following surprisingly effective organizations/endeavors:

1. Linux kernel development (and more generally open source software). Only a decade ago, most economists would have considered the present success of open source software development as implausible. Anyone who can code can work on the linux kernel if they want.
2. Wikipedia Traditionaly encyclopedias complain that wikipedia articles can be false since anyone can submit, but this argument vastly underestimates the value of having a comprehensive and uptodate information resource.
3. Google made it so anyone could advertise in a simple and easy way. This simultaneously lowered the barrier to advertising and sanitized the process. We sometimes forget that google came long after other internet search engines so their success was far from assured.
6. Student exposure By “student” here, I mean “research student”. Research students are of course much more common on university campuses, although there are sometimes government funded research labs such as MPI in Germany and NICTA in Australia.

Any reasonable system involving people working together must take into account how new people are brought into the system. The appropriate point seems to be where training in research starts (typically, ‘phd students’). Aside from issues of a self-sustaining system, students can add something significant to the process of research. The effort of explaining a new topic of research often aids research via simplification and consideration in new contexts. Students of course can also contribute substantially to the basic mission of research because they inherently bring in new ideas.

7. Concentration There are several kinds of concentration, and each of them is important. The first kind is the ‘quality’ kind: there are better and worse researchers. Unfortunately, this is extraordinarily difficult to judge, except after the fact. Another kind is concentration of interests. Having 5 people each interested in the subject working on their own is often substantially less useful than having the 5 working together. The internet has greatly aided concentration because some significant benefit can be derived even from people who are not physically near to each other. Universities have traditionally had difficulty with this because the demands of teaching undergraduates imply a necessity for breadth of subjects. This implies that it is common for universities to not have two people working on a similar subject.

It is interesting that no place manages to capture them all. If all of these elements were to come together, the result might be very impressive.

## 11/7/2005

### Prediction Competitions

Tags: General jl@ 11:27 pm

There are two prediction competitions currently in the air.

1. The Performance Prediction Challenge by Isabelle Guyon. Good entries minimize a weighted 0/1 loss + the difference between a prediction of this loss and the observed truth on 5 datasets. Isabelle tells me all of the problems are “real world” and the test datasets are large enough (17K minimum) that the winner should be well determined by ability rather than luck. This is due March 1.
2. The Predictive Uncertainty Challenge by Gavin Cawley. Good entries minimize log loss on real valued output variables for one synthetic and 3 “real” datasets related to atmospheric prediction. The use of log loss (which can be infinite and hence is never convergent) and smaller test sets of size 1K to 7K examples makes the winner of this contest more luck dependent. Nevertheless, the contest may be of some interest particularly to the branch of learning (typically Bayes learning) which prefers to optimize log loss.

May the best predictor win.

## 11/5/2005

### The design of a computing cluster

Tags: General jl@ 4:26 pm

This is about the design of a computing cluster from the viewpoint of applied machine learning using current technology. We just built a small one at TTI so this is some evidence of what is feasible and thoughts about the design choices.

1. Architecture There are several architectural choices.
1. AMD Athlon64 based system. This seems to have the cheapest bang/buck. Maximum RAM is typically 2-3GB.
2. AMD Opteron based system. Opterons provide the additional capability to buy an SMP motherboard with two chips, and the motherboards often support 16GB of RAM. The RAM is also the more expensive error correcting type.
3. Intel PIV or Xeon based system. The PIV and Xeon based systems are the intel analog of the above 2. Due to architectural design reasons, these chips tend to run a bit hotter and be a bit more expensive.
4. Dual core chips. Both Intel and AMD have chips that actually have 2 processors embedded in them.
5. In the end, we decided to go with option (2). Roughly speaking, the AMD system seemed like a better value than Intel. The opteron systems were desirable over the Athlon64 systems because halving the number of nodes aides system setup and maintenance while preserving about the same (or slightly more) cost/cpu. Another concern driving us towards the Opteron system was the ability to expand the RAM at a later date. In the last year or so, CPU speeds have not increased signficantly, instead dual core chips have come out. This trend seems likely to continue which implies the time to obselescence of a machine is driven by the maximum RAM capacity.

2. Network Gigabit ethernet is cheap, easy, and even built into the motherboard.
3. Operating System The options are
1. Windows
2. Linux

We chose Linux (and in particular the Fedora Core 3 variant) because Linux means cheaper, less licensing hassles, and you get to work with a system that has been used in clusters for much longer. An additional issue is the version of Linux:

1. 32bit linux: The advantage here is everything “just works”. The disadvantage is that using more than 4GB of RAM is awkward and you lose out on some minor architectural speedups of 64bit mode.
2. 64bit linux
3. We ended up choosing 32bit linux simply for stability and ease-of-setup reasons. The ease-of-setup was particularly compelling with respect to the openMosix choice (below). The exact variant of linux we used was a matter of some initial exploration determined by Don Coleman (TTIs master of machines). It is very plausible that we will want to switch to 64bit linux at some point in the future.

4. Programming paradigm There are several paradigms for how to use a parallel machine. These are not exclusive.
1. Use a standard language with a specialized parallel programming library such as mpich. This choice can result in a significant slowdown in programming time, but is necessary to eak every bit of performance out of the system.
2. Use a batch control system to match jobs with nodes. There are several custom systems around for doing this, and it’s not hard to make up your own script. There is some learning curve here although it is necessarily smaller. With this approach, you can achieve near maximal performance as long as the individual processes do not need to communicate.
3. Turn the cluster into a large virtual machine via openMosix and then simply launch several processes. This is the worst option performance-wise and the best option convenience-wise. To use it, you simply start processes and the system takes care of distributing them across the cluster. Processes must, however, be designed to minimize disk IO (as well as program IO) in order to achieve high efficiency computation.

At TTI, we focused on the openMosix approach because this fits well with standard machine learning programs. In addition, starting at the ease-of-use end and then graduating to more difficult paradigms as necessary seems reasonable.

There are a couple things which did not quite work out. Ideally, each of the nodes would be rackmounted (for ease of maintenance) and, except for the “master node”, use ethernet boot on startup. The rackmounting was relatively easy, but the combination of ethernet boot, openmosix, and linux was frustrating. Instead Don ordered some very small hard drives for each node and simply installed linux on them. Another minor surprise is that the opteron motherboard required a video card in order to boot.

In the end, the total cost was about $1000 per CPU and it took (perhaps) a man-week to setup. There are many caveats here—prices change rapidly, software improves, and how you want to use the cluster is important to consider. Nevertheless, this design point is hopefully of some help in calibrating your own designs. (Feel free to add in any of your own experience below.) ## 10/20/2005 ### Machine Learning in the News Tags: General jl@ 10:55 am The New York Times had a short interview about machine learning in datamining being used pervasively by the IRS and large corporations to predict who to audit and who to target for various marketing campaigns. This is a big application area of machine learning. It can be harmful (learning + databases = another way to invade privacy) or beneficial (as google demonstrates, better targeting of marketing campaigns is far less annoying). This is yet more evidence that we can not rely upon “I’m just another fish in the school” logic for our expectations about treatment by government and large corporations. ## 10/19/2005 ### Workshop: Atomic Learning Tags: General jl@ 12:23 pm We are planning to have a workshop on atomic learning Jan 7 & 8 at TTI-Chicago. The earlier request for interest is here. The primary deadline is abstracts due Nov. 20 to jl@tti-c.org. ## 10/13/2005 ### Site tweak Tags: General jl@ 10:43 am Several people have had difficulty with comments which seem to have an allowed language significantly poorer than posts. The set of allowed html tags has been increased and the markdown filter has been put in place to try to make commenting easier. I’ll put some examples into the comments of this post. ## 10/12/2005 ### The unrealized potential of the research lab Tags: General jl@ 10:35 am I attended the IBM research 60th anniversary. IBM research is, by any reasonable account, the industrial research lab which has managed to bring the most value to it’s parent company over the long term. This can be seen by simply counting the survivors: IBM research is the only older research lab which has not gone through a period of massive firing. (Note that there are also new research labs.) Despite this impressive record, IBM research has failed, by far, to achieve it’s potential. Examples which came up in this meeting include: 1. It took about a decade to produce DRAM after it was invented in the lab. (In fact, Intel produced it first.) 2. Relational databases and SQL were invented and then languished. It was only under external competition that IBM released it’s own relational database. Why didn’t IBM grow an Oracle division? 3. An early lead in IP networking hardware did not result in IBM growing a Cisco division. Why not? And remember … IBM research is a stark success story compared to it’s competitors. Why is there such a pervasive failure to recognize the really big things when they come along in a research lab? This is a huge by capitolization standards: several ideas created in IBM labs have resulted in companies with a stock market valuation greater than IBM. This is also of fundamental concern to researchers everywhere, because that failure is much of the reason why research is chronically underfunded. A reasonable argument is “it’s much harder to predict the big ones in advance than in hindsight”. This is certainly true, but I don’t believe that is a full explanation. Too many ideas have succeeded because someone else outside of the orignal company recognized the value of the idea and exploited it. There is no fundamental reason why VCs should have an inherent advantage over the company running a research lab at recognizing good ideas within it’s own research lab. Muthu Muthukrishnan says “it’s the incentives”. In particular, people who invent something within a research lab have little personal incentive in seeing it’s potential realized so they fail to pursue it as vigorously as they might in a startup setting. This is a very reasonable argument: incentives for success at a research lab are typically a low percentage of base salary while startup founders have been known to become billionaires. A third argument is that researchers at a research lab are likely to find new and better ways of doing things that the company already does. This creates a problem because a large fraction of the company is specifically invested in the current-to-become-old way of doing things. When a debate happens, it is always easy to find the faults and drawbacks of the new method while ignoring the accepted faults of the old (this is common to new directions in research as well). None of these three reasons are fundamentally unremovable, and it seems plausible that the first industrial research lab to remove these obstacles will reap huge benefits. My current candidates for ‘most likely to remove barriers’ are google and NICTA. ## 10/10/2005 ### Predictive Search is Coming Tags: General jl@ 10:08 am “Search” is the other branch of AI research which has been succesful. Concrete examples include Deep Blue which beat the world chess champion and Chinook the champion checkers program. A set of core search techniques exist including A*, alpha-beta pruning, and others that can be applied to any of many different search problems. Given this, it may be surprising to learn that there has been relatively little succesful work on combining prediction and search. Given also that humans typically solve search problems using a number of predictive heuristics to narrow in on a solution, we might be surprised again. However, the big successful search-based systems have typically not used “smart” search algorithms. Insteady they have optimized for very fast search. This is not for lack of trying… many people have tried to synthesize search and prediction to various degrees of success. For example, Knightcap achieves good-but-not-stellar chess playing performance, and TD-gammon has achieved near-optimal Backgammon play. The basic claim of this post is that we will see more and stronger successes of the sort exemplified by Knightcap and TD-gammon and less of those exemplified by the “pure” search techniques. Here are two reasons: 1. The strengths of computers are changing. Computers have long held a huge advantage over humans in two areas: 1. Short term memory. A typical human can remember 7ish random numbers presented in a sequence. A computer can easily remember 107 random numbers. This strength is continuing to grow. 2. Sequential execution speed. What I mean by ‘sequential execution speed’ is simply time ordered instruction execution with arbitrary data dependencies. Neurons in a human brain might be able to fire at a rate of 102 or 103 cycles/second (depending on who you ask), while computers can run at 109 cycles/second. Sequential execution speed growth appears stalled for the moment due to heat issues. These two points of excellence are precisely what search algorithms typically require for good performance which partially explains why the human-preferred method of search is so different from the computer-preferred method. The advantages of humans have traditionally been: 1. Significant long term memory For example humans might have a significant advantage using a smarter search technique becaue they can use previous search-based successes and failures to bias search on new problems. Computers are gaining this advantage as well due to massive hard drives and networking. 2. Massively parallel hardware While the speed that brains work at may not be very significant, the total amount of computation done by a brain is enormous. For example, estimates of 1015 neural connections are credible. Chip designers, who are stymied in the production of sequential execution speed are parallelizing via speculative execution, special instruction sets, “hyperthreading”, dual core processors, SMP machines, and clusters. It will be awhile before computers can execute as many parallel instructions as computer processors, but not as long as you might expect—the cell processor claims 0.25*1012 floating point operations per second. Putting these observations together, we see that computers are becoming more ’rounded’ in their strengths which suggests highly parallelizable algorithms using large quantities of memory may become more viable solutions for search problems. These qualities of ‘highly parallelizable’ and ‘incorporating long term memory’ characterize many of our prediction algorithms. 2. We have gained a finer understanding of what learning is and how it can take place. In the beginning, on the applied side, learning was simply understood at a ‘type’ level with a relatively sparse set of types such as ‘binary classification’ and ‘multiclass classification’. Evidence of this can be seen in the UCI machine learning repository where most learning problems are ‘projected’ (via throwing away extra information) into binary or multiclass prediction even when the original problem was significantly different. What we have learned since then is that it is critical to consider: 1. The natural distribution over instances. What is meant here is that instances come from some process and it is only necessary to perform well with respect to the samples produced by that process. This observation guides us to focus attention in learning machines. At the theoretical level, it is difficult to even think about learning problems without considering some natural distribution over the instances we need to predict. At the practical level, there are many cases where people have enjoyed grealy improved success simply by using the natural distribution. This occurs repeatedly in the learning reductions analysis (an exemplar is here), but this core observation is common elsewhere as well. 2. Richer classification problems. It turns out that many of the details lost in translation of a problem to multiclass or binary classification are very important. For example, if one failure mode is much more severe than another exposing that to the learning machine can make a huge difference in performance. We now have a number of techniques for exposing this information to learning machines and can use much richer side information. The basic claim here is that understanding these two concepts is key to mixing search and prediction. Considering the ‘natural distribution’ is important because search systems create their own distribution over instances. This property of ‘self-creating distribution’ implies that it is easy to misdesign a combined learning/search system. For example, it is easy to train the learning algorithm on samples drawn from the wrong process. The value of searching in some direction vs. another direction is not easily described as a simple classification problem. Consequently, projecting away this information, as might have been done a decade ago, can result in a large performance loss. ## 10/8/2005 ### We have a winner Tags: General jl@ 6:55 pm The DARPA grandchallenge is a big contest for autonomous robot vehicle driving. It was run once in 2004 for the first time and all teams did badly. This year was notably different with the Stanford and CMU teams succesfully completing the course. A number of details are here and wikipedia has continuing coverage. A formal winner hasn’t been declared yet although Stanford completed the course quickest. The Stanford and CMU teams deserve a large round of applause as they have strongly demonstrated the feasibility of autonomous vehicles. The good news for machine learning is that the Stanford team (at least) is using some machine learning techniques. ## 10/3/2005 ### Not ICML Tags: General jl@ 4:00 am Alex Smola showed me this ICML 2006 webpage. This is NOT the ICML we know, but rather some people at “Enformatika”. Investigation shows that they registered with an anonymous yahoo email account from dotregistrar.com the “Home of the$6.79 wholesale domain!” and their nameservers are by Turkticaret, a Turkish internet company.

It appears the website has since been altered to “ICNL” (the above link uses the google cache).

They say that imitation is the sincerest form of flattery, so the organizers of the real ICML 2006 must feel quite flattered.

## 9/30/2005

### Research in conferences

Tags: General jl@ 12:52 pm

Conferences exist as part of the process of doing research. They provide many roles including “announcing research”, “meeting people”, and “point of reference”. Not all conferences are alike so a basic question is: “to what extent do individual conferences attempt to aid research?” This question is very difficult to answer in any satisfying way. What we can do is compare details of the process across multiple conferences.

1. Comments The average quality of comments across conferences can vary dramatically. At one extreme, the tradition in CS theory conferences is to provide essentially zero feedback. At the other extreme, some conferences have a strong tradition of providing detailed constructive feedback. Detailed feedback can give authors significant guidance about how to improve research. This is the most subjective entry.
2. Blind Virtually all conferences offer single blind review where authors do not know reviewers. Some also provide double blind review where reviewers do not know authors. The intention with double blind reviewing is to make the conference more approachable to first-time authors.
3. Author Feedback Author feedback is a mechanism where authors can provide feedback to reviewers (and, to some extent, complain). Providing an author feedback mechanism provides an opportunity for the worst reviewing errors to be corrected.
4. Conditional Accepts A conditional accept is some form of “we will accept this paper if conditions X,Y, and Z are met”. A conditional accept allows reviewers to demand different experiments or other details they need in order to make a decision. This might speed up research significantly because otherwise good papers need not wait another year.
5. Papers/PC member How many papers can one person actually review well? When there is an incredible load of papers to review, it becomes very tempting to make snap decisions without a thorough attempt at understanding. Snap decisions are often wrong. These numbers are based on the number of submissions with a computer science standard of 3 reviews per paper.

Each of these “options” make reviewing more difficult by requiring more reviewer work. There is a basic trade-off between the amount of time spent reviewing vs. working on new research and the speed of the review process itself. It is unclear where this optimal trade-off point lies, but the easy default is “not enough time spent reviewing” because reviewing is generally an unrewarding job.

It seems reasonable to cross reference these options with some measures of ‘conference impact’. For each of these, it’s important to realize these are not goal metrics and so their meaning is unclear. The best that can be said is that it is not bad to do well. Also keep in mind that measurements of “impact” are inherently “trailing indicators” which are not necessarily relevant to the way the conference is currently run.

1. average citations Citeseer has been used to estimate the average impact of a conference’s papers here using the average number of citations per paper.
2. max citations A number of people believe that the maximum number of citations given to any one paper is a strong indicator of the success of the conference. This can be measured by going to scholar.google.com and using ‘advanced search’ for the conference name.
 Conference Comments blindness author feedback conditional accepts Reviews/PC member log(average citations per paper+1) max citations ICML Sometimes Helpful Double Yes Yes 8 2.12 1079 AAAI Sometimes Helpful Double Yes No 8 1.87 650 COLT Sometimes Helpful Single No No 15? 1.49 710 NIPS Sometimes Helpful/Sometimes False Single Yes No 113(*) 1.06 891 CCC Sometimes Helpful Single No No 24 1.25 142 STOC Not Helpful Single No No 41 1.69 611 SODA Not Helpful Single No No 56 1.51 175

(*) To some extent this is a labeling problem. NIPS has an organized process of finding reviewers very similar to ICML. They are simply not called PC members.

Keep in mind that the above is a very incomplete list (it only includes the conferences that I interacted with) and feel free to add details in the comments.

Older Posts »