John Langford – Page 49 – Machine Learning (Theory)

1/6/2008 (updated 4/12/2008)

Research Political Issues

I’ve avoided discussing politics here, although not for lack of interest. The problem with discussing politics is that it’s customary for people to say much based upon little information. Nevertheless, politics can have a substantial impact on science (and we might hope for the converse as well). It’s primary election time in the United States, so the topic is timely, although the issues themselves are not new.

There are several policy decisions which substantially affect the development of science and technology in the US.

Education The US has great contrasts in education. The top universities are very good places, yet the grade school education system produces mediocre results. For me, the contrast between a public education and Caltech was bracing. For many others attending Caltech, it clearly was not. Upgrading the K-12 education system in the US is a long-standing chronic problem which I know relatively little about. My own experience is that a basic attitude of “no child unrealized” is better than “no child left behind”. A fair claim can also be made that the US just doesn’t invest enough.
Respect Lack of respect for science and technology is routinely expressed in many ways in the US.
1. The most bald form of lack of respect is scientific censorship. This may be easily understood as a generality: you choose to spend a large fraction of your life learning to interpret some part of the world. After years, you come to some conclusion about the nature of the world. Then, someone with no particular experience or expertise tells you to alter it.
2. A more refined form of lack of respect is simply lack of presence in decision making. This isn’t necessarily intentional: many people simply make decisions from the gut, and then come up with reasons to justify their decision. This style explicitly cuts out the deep thinking of science. Many policies could have been better informed by a serious consideration of even basic science:
  1. The oil of Iraq is fundamentally less valuable if we are going to tackle global warming.
  2. Swapping gasoline for a hydrogen-based transportable energy source is dubious because it introduces another energy storage conversion to lose energy on. The same goes for swapping gasoline for bioethanol. In contrast, hybrid and electric vehicles actually recover substantial energy from regenerative braking, and a plug-in hybrid could run off electricity in typical commuter usage.
  3. The Space Shuttle is a boondoggle design. The rocket equation implies that the ratio of initial to final mass for vehicles reaching earth orbit must be at least a factor of e^2.5 (it’s actually e^2.93 for the Space Shuttle). Making the system reusable implies that most of this mass returns to earth so the payload deliverable into space is only 1.2% of the liftoff mass. A better designed system might deliver payloads a factor of 4 larger or be much smaller.
  4. Passenger inspection at airports is another poor policy from the perspective of science. It isn’t effective, and there is no cost-efficient way to make it effective against a motivated opponent. Solid evidence for this is the continued use of mules to smuggle drugs. The basic problem from a chemistry point of view is that too much can be done with a small amount of mass. Deterrence and limitation (armored cockpits and active resistance for example) are fine policies.
3. Lack of support. The simplest form of lack of respect is simply lack of support. The case for federal vs corporate funding of basic science and technology development is very simple: the benefit to society of conducting such work dramatically exceeds the benefit any one agent within society (such as a company) could gain from it. Of late, investment in core science has been an anemic 0.0005 of GDP, and visa issues hamstring broader technology development.
Confidence This is primarily related to the technology side of science and technology. Many policy decisions are made without confidence in the ability of technologists to adapt. This comes in at least two flavors.
1. The foreordained solution. Policy often comes in the form “we use approach X to solve problem Y” (some examples are above). This demonstrates an overconfidence by policy makers in their ability to pick the winner, and a lack of confidence in the ability of technologists to solve problems. It also represents an opportunity for large established industries to get huge payoffs at taxpayer expense. The X-prize represents the opposite of this approach, and it has been radically more effective by any reasonable standard.
2. Confusion about the meaning of wealth. Some people believe that wealth is about what you have. However, for a society it seems much better to measure wealth in terms of what the society can do. Policy makers often forget that science and technology is a capability when it comes time to think of a solution. For example, someone with no confidence in the ability to create and make affordable plug-in electric hybrids might think it necessary to resort to conquest for oil.
Stability People can’t program, do science, or invent new things when they are worried about more immediate events. There are several destabilizing trends going on in the US right now which either now or in the future may make it hard to focus away from immediate concerns.
1. Debt and money supply. The federal debt for the US government is about 3.5 times the federal budget. This is bad for the simple reason that investors buying US treasury bonds aren’t investing in new technology. However, the destabilizing concern is more subtle. Since World War II, the US dollar has become the standard currency for exchange around the world. Since debt by the government creates a temptation by the government to (effectively) print money, the number of dollars in circulation has been rapidly growing. But a growing number of dollars means that the currency is devaluing, which makes owning dollars undesirable. I don’t know an example of a previous world currency that has ceased to be such, but basic economics says that bad things happen to dollar-based savings if all the dollars flow back into the US. So far, the decline of the dollar has been relatively gradual, but a very disruptive cliff might exist out there somewhere. Policies which increase debt (like cutting taxes and increasing spending) exacerbate this problem. There is no fix once the dollar loses world currency status because confidence can be lost quickly, but not regained.
2. Health Care. The US is running an experiment to determine how large a fraction of GDP can be devoted to health care. Currently it’s over 15%, in first place, and growing. This is even worse than it sounds, because many comparable countries in Europe (or Japan) have older populations which should generally be more expensive to take care of. In the present situation, because health care is incredibly expensive, losing health insurance (which is typically tied to a job) is potentially catastrophic for any individual.
3. Wealth Asymmetry. The US has shifted towards a substantially more asymmetric division of wealth since the 1970s. An asymmetric division of wealth is not fundamentally bad: there needs to be room for great success to imply great rewards. However, a casual correlation of science and technology development with the gini coefficient map reveals that a large gini coefficient and substantial science and technology development do not coincide. The problem is that wealth becomes inheritable, and it’s very unlikely that the wealth is inherited by someone interested in science and technology. Wealth is now scheduled to become perfectly inheritable in 2010 in the US.
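
The arithmetic behind the Space Shuttle item above can be checked in a few lines. This is only a sketch: the exponents (2.5 and 2.93) and the 1.2% payload figure are taken directly from the text rather than derived independently, and the function and variable names are illustrative.

```python
import math

# Rocket equation: delta_v = v_e * ln(m0 / mf), so m0 / mf = exp(delta_v / v_e).
# The exponents below are the ones quoted in the text, not independent estimates.
MIN_EXPONENT = 2.5
SHUTTLE_EXPONENT = 2.93
SHUTTLE_PAYLOAD_FRACTION = 0.012  # payload as a fraction of liftoff mass, from the text


def mass_ratio(exponent):
    """Ratio of initial mass to final mass implied by the rocket equation."""
    return math.exp(exponent)


def final_mass_fraction(exponent):
    """Fraction of the liftoff mass that can reach orbit at all."""
    return 1.0 / mass_ratio(exponent)


min_ratio = mass_ratio(MIN_EXPONENT)          # about 12.2
shuttle_ratio = mass_ratio(SHUTTLE_EXPONENT)  # about 18.7
shuttle_final = final_mass_fraction(SHUTTLE_EXPONENT)

# Share of the mass arriving in orbit that is actually payload.
payload_share = SHUTTLE_PAYLOAD_FRACTION / shuttle_final

print(f"minimum mass ratio:        {min_ratio:.1f}")
print(f"shuttle mass ratio:        {shuttle_ratio:.1f}")
print(f"mass fraction in orbit:    {shuttle_final:.3f}")
print(f"payload share of that:     {payload_share:.2f}")
```

With these numbers, only about 5% of the liftoff mass reaches orbit and less than a quarter of that is payload, which is why a payload four times larger (about 4.8% of liftoff mass) is still within the rocket-equation limit.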

I’m sure some of these issues are endemic to many other parts of the world as well, because there are fundamental conceptual difficulties with investing in the unknown instead of the known.

12/21/2007

Vowpal Wabbit Code Release

We are releasing the Vowpal Wabbit (Fast Online Learning) code as open source under a BSD (revised) license. This is a project at Yahoo! Research to build a useful large scale learning algorithm which Lihong Li, Alex Strehl, and I have been working on.

To appreciate the meaning of “large”, it’s useful to define “small” and “medium”. A “small” supervised learning problem is one where a human could use a labeled dataset and come up with a reasonable predictor. A “medium” supervised learning problem is one whose dataset fits into the RAM of a modern desktop computer. A “large” supervised learning problem is one whose dataset does not fit into the RAM of a normal machine. VW tackles large scale learning problems by this definition of large. I’m not aware of any other open source Machine Learning tools which can handle this scale (although they may exist). A few close ones are:

IBM’s Parallel Machine Learning Toolbox isn’t quite open source. The approach used by this toolbox is essentially map-reduce style computation, which doesn’t seem amenable to online learning approaches. This is significant, because the fastest learning algorithms without parallelization tend to be online learning algorithms.
Leon Bottou’s sgd implementation first loads data into RAM, then learns. Leon’s code is a great demonstrator of how fast and effective online learning approaches (specifically stochastic gradient descent) can be. VW is about a factor of 3 faster on my desktop, and yields a lower error rate solution.

There are several other features such as feature pairing, sparse features, and namespacing that are often handy in practice.
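
As a rough illustration of what feature pairing over namespaces can look like, here is a minimal sketch. It assumes that pairing means crossing every feature in one namespace with every feature in another, and it represents sparse features as plain dictionaries. The function name, the namespace names, and the `*` naming of crossed features are all hypothetical and are not taken from the VW code or its input format.

```python
def pair_features(namespace_a, namespace_b):
    """Cross every feature in one namespace with every feature in another.

    Each namespace is a sparse feature set: a dict from feature name to value.
    Features that are absent (implicitly zero) never produce a crossed feature.
    """
    return {
        f"{name_a}*{name_b}": value_a * value_b
        for name_a, value_a in namespace_a.items()
        for name_b, value_b in namespace_b.items()
    }


# One example with two namespaces (hypothetical names).
example = {
    "user": {"age": 0.3, "is_member": 1.0},
    "doc": {"word_sports": 1.0, "word_news": 2.0},
}

paired = pair_features(example["user"], example["doc"])
for name, value in sorted(paired.items()):
    print(name, value)
```

Pairing lets a linear learner pick up interactions (for instance, between a user property and a document word) without the interaction features being written out in the data.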

At present, VW optimizes squared loss via gradient descent or exponentiated gradient descent over a linear representation.
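
For readers who want to see the shape of such an update, here is a minimal sketch of online gradient descent on squared loss over a sparse linear representation. It is not the VW implementation: the function name, the dictionary-based example format, and the fixed learning rate are assumptions made for illustration, and only plain gradient descent (not the exponentiated variant) is shown.

```python
from collections import defaultdict


def sgd_squared_loss(examples, learning_rate=0.1):
    """Online gradient descent for squared loss with a linear predictor.

    `examples` is any iterable of (label, {feature_name: value}) pairs, so it
    can be a generator streaming from disk rather than a list held in RAM.
    """
    weights = defaultdict(float)
    for label, features in examples:
        prediction = sum(weights[name] * value for name, value in features.items())
        error = prediction - label
        # The gradient of 0.5 * (prediction - label)**2 with respect to
        # weights[name] is error * value, and it is zero for absent features.
        for name, value in features.items():
            weights[name] -= learning_rate * error * value
    return weights


# Toy data generated by label = 2*a - 1*b, repeated so the weights can settle.
toy_examples = [
    (2.0, {"a": 1.0}),
    (-1.0, {"b": 1.0}),
    (1.0, {"a": 1.0, "b": 1.0}),
] * 200

weights = sgd_squared_loss(iter(toy_examples))
print(round(weights["a"], 3), round(weights["b"], 3))
```

Each example is touched once and then discarded, and the work per example is proportional to the number of nonzero features, which is what makes this style of learner a natural fit for datasets that do not fit in RAM.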

This code is free to use, incorporate, and modify as per the BSD (revised) license. The project is ongoing inside of Yahoo. We will gladly incorporate significant improvements from other people, and I believe any significant improvements are of substantial research interest.

12/19/2007

Cool and interesting things seen at NIPS

I learned a number of things at NIPS.

The financial people were there in greater force than previously. Two Sigma sponsored NIPS while DRW Trading had a booth.
The adversarial machine learning workshop had a number of talks about interesting applications where an adversary really is out to try and mess up your learning algorithm. This is very different from the situation we often think of where the world is oblivious to our learning. This may present new and convincing applications for the learning-against-an-adversary work common at COLT.
There were several interesting papers.
1. Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni had a paper on General Agnostic Active Learning. The basic idea is that active learning can be done via reduction to a form of supervised learning problem. This is great, because we have many supervised learning algorithms from which the benefits of active learning may be derived.
2. Joseph Bradley and Robert Schapire had a paper on Filterboost. Filterboost is an online boosting algorithm which I think of as the boost-by-filtration approaches in the first boosting paper updated for an adaboost-like structure. These kinds of approaches are doubtless helpful for large scale learning problems which are becoming more common.
3. Peter Bartlett, Elad Hazan, and Sasha Rakhlin had a paper on Adaptive Online Learning. This paper refines earlier results for online learning against an adversary via gradient descent, which is plausibly of great use in practice.
MLOSS was giving out free T-shirts which were cool. I missed the workshop starting this effort at last year’s NIPS due to workshop overload, but open source machine learning is definitely of great and sound interest to the community.
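
To give a feel for the boost-by-filtration idea mentioned in the Filterboost item above, here is a minimal sketch of the filtering step alone. It assumes labels in {-1, +1} and a logistic acceptance rule; the function names and the toy scoring function are hypothetical, and this is not the algorithm from the paper, which should be consulted for the actual method and its guarantees.

```python
import math
import random


def accept_probability(label, score):
    """Chance of keeping an example, given the current predictor's score.

    With label in {-1, +1}, a confidently correct prediction (label * score
    large and positive) is rarely kept, while a mistake is usually kept.
    """
    return 1.0 / (1.0 + math.exp(label * score))


def filter_example(stream, score_fn, rng):
    """Draw from the stream until one example passes the filter."""
    for x, label in stream:
        if rng.random() < accept_probability(label, score_fn(x)):
            return x, label
    return None  # the stream ran out before anything was accepted


# Toy predictor: confident that the label has the same sign as x.
def toy_score(x):
    return 3.0 * x


rng = random.Random(0)
mistakes = [(1.0, -1)] * 1000  # examples the toy predictor gets wrong
picked = filter_example(iter(mistakes), toy_score, rng)
print(picked)
```

Because the filter reads examples one at a time from a stream, the next weak learner sees a sample concentrated on hard examples without the full dataset ever being reweighted in memory.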

12/17/2007

New Machine Learning mailing list

IMLS (which is the nonprofit running ICML) has set up a new mailing list for Machine Learning News. The list address is ML-news@googlegroups.com, and signup requires a Google account (which you can create). Only members can send messages.

12/12/2007

Workshop Summary—Principles of Learning Problem Design

This is a summary of the workshop on Learning Problem Design which Alina and I ran at NIPS this year.

The first question many people have is “What is learning problem design?” This workshop is about admitting that solving learning problems does not start with labeled data, but rather somewhere before. When humans are hired to produce labels, this is usually not a serious problem because you can tell them precisely what semantics you want the labels to have, and we can fix some set of features in advance. However, when other methods are used this becomes more problematic. This focus is important for Machine Learning because there are very large quantities of data which are not labeled by a hired human.

The title of the workshop was a bit ambitious, because a workshop is not long enough to synthesize a diversity of approaches into a coherent set of principles. For me, the posters at the end of the workshop were quite helpful in getting approaches to gel.

Here are some answers to “where do the labels come from?”:

Simulation Use a simulator (which need not be that good) to predict the cost of various choices and turn that into label information. Ashutosh had some cool demos showing the power of this approach. Gregory also presented a poster which might be viewed this way.
Agreement A label is a point of agreement. Luis often used an agreement mechanism to induce labels with games. Sham discussed the power of agreement to constrain learning algorithms. Huzefa’s work on bioprediction can be thought of as partly using agreement with previous structures to simulate the label of a new structure.
Compilation Labels can be found by compiling one learning problem into another. Mark and I both talked about reductions a bit, which come with some nice formal guarantees.
Backprop Labels are the signals in generalized backpropagation (David Bradley’s talk).
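
As one concrete illustration of the “Compilation” answer above, here is a minimal sketch of a one-against-all reduction, which compiles a multiclass problem into several binary problems and so manufactures the binary labels. The choice of this particular reduction is an assumption made for illustration (the talks may have covered others), and the toy binary learner and all names below are hypothetical.

```python
def one_against_all_train(examples, num_classes, train_binary):
    """Compile one multiclass problem into num_classes binary problems."""
    predictors = []
    for target in range(num_classes):
        # The binary label is created by the reduction: +1 for the target class.
        binary_examples = [(x, 1 if y == target else -1) for x, y in examples]
        predictors.append(train_binary(binary_examples))
    return predictors


def one_against_all_predict(predictors, x):
    """Predict the class whose binary predictor gives the highest score."""
    scores = [predict(x) for predict in predictors]
    return max(range(len(scores)), key=scores.__getitem__)


def train_mean_difference(binary_examples):
    """Toy binary learner: score a point by how much closer it is to the positive mean."""
    def mean(points):
        return tuple(sum(coords) / len(points) for coords in zip(*points))

    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    positive_mean = mean([x for x, y in binary_examples if y == 1])
    negative_mean = mean([x for x, y in binary_examples if y == -1])
    return lambda x: squared_distance(x, negative_mean) - squared_distance(x, positive_mean)


training_set = [
    ((1.0, 0.0), 0), ((0.9, 0.1), 0),
    ((0.0, 1.0), 1), ((0.1, 0.9), 1),
    ((-1.0, -1.0), 2), ((-0.9, -1.1), 2),
]
predictors = one_against_all_train(training_set, 3, train_mean_difference)
print([one_against_all_predict(predictors, x) for x in [(0.8, 0.0), (0.0, 0.7), (-1.0, -0.8)]])
```

Any binary learner can be plugged in as `train_binary`, which is the appeal of the approach: the reduction is written once, and improvements to the underlying binary learner carry over.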

Some answers to “where do the data come from” are:

Everywhere The essential idea is to integrate as many data sources as possible. Rakesh had several algorithms which (in combination) allowed him to use a large number of diverse data sources in a text domain.
Sparsity A representation is formed by finding a sparse set of basis functions on otherwise totally unlabeled data. Rajat discussed self-taught learning algorithms which achieve this.
Self-prediction A representation is formed by learning to self-predict a set of raw features. Hal’s talk covered this idea.

A workshop like this is successful if it informs the questions we ask (and answer) in the future. Some natural questions (some of which were discussed) are:

What is a natural, sufficient language for adding prior information into a learning system? Which languages are insufficient? Shai described a sense in which kernels are insufficient as a language for prior information. Bayesian analysis emphasizes reasoning about the parameters of the model, but the language of examples or maybe label expectations may be more natural.
What is missing from the above lists? And are the elements of the lists actually distinct?
How do we modularize? Many of the approaches use problem-specific tricks. That’s to be expected for a direction of research which is just starting, but it’s important to modularize these techniques so they can be repeatedly and easily applied. Achieving modularity in a manner which supports prior information properly seems tricky.
How do we formalize and analyze? Of the items listed above, I feel like we only have some reasonable understanding of the compilation approach. The other approaches and questions are essentially unexplored territory where some serious thinking may be helpful.