Machine Learning (Theory)

1/6/2008

Research Political Issues

Tags: Funding, General, Research jl@ 5:02 pm

I’ve avoided discussing politics here, although not for lack of interest. The problem with discussing politics is that it’s customary for people to say much based upon little information. Nevertheless, politics can have a substantial impact on science (and we might hope for the reverse). It’s primary election time in the United States, so the topic is timely, although the issues are not.

There are several policy decisions which substantially affect the development of science and technology in the US.

  1. Education The US has great contrasts in education. The top universities are very good places, yet the grade school education system produces mediocre results. For me, the contrast between a public education and Caltech was bracing. For many others attending Caltech, it clearly was not. Upgrading the k-12 education system in the US is a long-standing chronic problem which I know relatively little about. My own experience is that a basic attitude of “no child unrealized” is better than “no child left behind”. A fair claim can also be made that the US just doesn’t invest enough.
  2. Respect Lack of respect for science and technology is routinely expressed in many ways in the US.
    1. The baldest form of lack of respect is scientific censorship. This may be easily understood as a generality: you choose to spend a large fraction of your life learning to interpret some part of the world. After years, you come to some conclusion about the nature of the world. Then, someone with no particular experience or expertise tells you to alter that conclusion.
    2. A more refined form of lack of respect is simply lack of presence in decision making. This isn’t necessarily intentional: many people simply make decisions from the gut, and then come up with reasons to justify their decision. This style explicitly cuts out the deep thinking of science. Many policies could have been better informed by a serious consideration of even basic science:
      1. The oil of Iraq is fundamentally less valuable if we are going to tackle global warming.
      2. Swapping gasoline for a hydrogen-based transportable energy source is dubious because it introduces another energy storage conversion in which energy is lost. The same goes for swapping bioethanol for gasoline. In contrast, hybrid and electric vehicles actually recover substantial energy from regenerative braking, and a plug-in hybrid could run off electricity in typical commuter usage.
      3. The Space Shuttle is a boondoggle design. The rocket equation implies that the ratio of initial to final mass for vehicles reaching earth orbit must be at least a factor of e^2.5 (it’s actually e^2.93 for the Space Shuttle). Making the system reusable implies that most of this mass returns to earth, so the payload deliverable into space is only 1.2% of the liftoff mass (a back-of-the-envelope check of this arithmetic appears just after this list). A better-designed system might deliver payloads a factor of 4 larger or be much smaller.
      4. Passenger inspection at airports is another poor policy from the perspective of science. It isn’t effective, and there is no cost-efficient way to make it effective against a motivated opponent. Solid evidence for this is the continued use of mules to smuggle drugs. The basic problem from a chemistry point of view is that too much can be done with a small amount of mass. Deterrence and limitation (armored cockpits and active resistance, for example) are fine policies.
    3. Lack of support. The simplest form of lack of respect is simply lack of support. The case for federal vs. corporate funding of basic science and technology development is very simple: the benefit to society of conducting such work dramatically exceeds the benefit any one agent within society (such as a company) could gain from it. Of late, investment in core science has been an anemic 0.0005 of GDP (about 0.05%), and visa issues hamstring broader technology development.
  3. Confidence This is primarily related to the technology side of science and technology. Many policy decisions are made without confidence in the ability of technologists to adapt. This comes in at least two flavors.
    1. The foreordained solution. Policy often comes in the form “we use approach X to solve problem Y” (some examples are above). This demonstrates an overconfidence by policy makers in their ability to pick the winner, and a lack of confidence in the ability of technologists to solve problems. It also represents an opportunity for large established industries to get huge payoffs at taxpayer expense. The X-Prize represents the opposite of this approach, and it has been radically more effective by any reasonable standard.
    2. Confusion about the meaning of wealth. Some people believe that wealth is about what you have. However, for a society it seems much better to measure wealth in terms of what the society can do. Policy makers often forget that science and technology are a capability when it comes time to think of a solution. For example, someone with no confidence in the ability to create affordable plug-in electric hybrids might think it necessary to engage in conquest for oil.
  4. Stability People can’t program, do science, or invent new things when they are worried about more immediate events. There are several destabilizing trends going on in the US right now which, either now or in the future, may make it hard to focus on anything beyond immediate concerns.
    1. Debt and money supply. The federal debt for the US government is about 3.5 times the federal budget. This is bad for the simple reason that investors buying US treasury bonds aren’t investing in new technology. However, the destabilizing concern is more subtle. Since World War II, the US dollar has become the standard currency for exchange around the world. Since government debt creates a temptation to (effectively) print money, the number of dollars in circulation has been rapidly growing. But a growing number of dollars means that the currency is devaluing, which makes owning dollars undesirable. I don’t know an example of a previous world currency that has ceased to be such, but basic economics says that bad things happen to dollar-based savings if all the dollars flow back into the US. So far, the decline of the dollar has been relatively gradual, but a very disruptive cliff might exist out there somewhere. Policies which increase debt (like cutting taxes and increasing spending) exacerbate this problem. There is no fix once the dollar loses world currency status, because confidence can be lost quickly but not regained.
    2. Health Care. The US is running an experiment to determine how large a fraction of GDP can be devoted to health care. Currently it’s over 15%, in first place, and growing. This is even worse than it sounds, because many comparable countries in Europe (or Japan) have older populations which should generally be more expensive to take care of. In the present situation, because health care is incredibly expensive, losing health insurance (which is typically tied to a job) is potentially catastrophic for any individual.
    3. Wealth Asymmetry. The US has shifted towards a substantially more asymmetric division of wealth since the 1970s. An asymmetric division of wealth is not fundamentally bad: there needs to be room for great success to imply great rewards. However, a casual correlation of science and technology development with the Gini coefficient map suggests that a large Gini coefficient and substantial science and technology development do not coincide. The problem is that wealth becomes inheritable, and it’s very unlikely that the wealth is inherited by someone interested in science and technology. Wealth is set to become perfectly inheritable in the US in 2010, when the estate tax is scheduled to lapse.
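
For the curious, here is a back-of-the-envelope check of the Space Shuttle arithmetic above. It is only a sketch in Python using the figures quoted in that item (the e^2.93 mass ratio and the 1.2% payload fraction); none of the numbers are new.

import math

# Rocket equation figures quoted above (not independently sourced here).
mass_ratio = math.exp(2.93)              # initial mass / mass reaching orbit
orbit_fraction = 1.0 / mass_ratio        # fraction of liftoff mass reaching orbit
payload_fraction = 0.012                 # quoted deliverable payload fraction

print(round(orbit_fraction, 3))          # ~0.053: about 5.3% of liftoff mass reaches orbit
print(round(payload_fraction / orbit_fraction, 2))
# ~0.23: only about a quarter of the orbit-reaching mass is payload;
# the rest is the reusable vehicle that must return to earth.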

I’m sure some of these issues are endemic to many other parts of the world as well, because there are fundamental conceptual difficulties with investing in the unknown instead of the known.

6/16/2006

Regularization = Robustness

The Gibbs-Jaynes theorem is a classical result that tells us that the highest entropy distribution (most uncertain, least committed, etc.) subject to expectation constraints on a set of features is an exponential family distribution with the features as sufficient statistics. In math,

argmax_p H(p)
s.t. E_p[f_i] = c_i for all i

is given by p(x) = e^{\sum_i \lambda_i f_i(x)}/Z. (Z here is the necessary normalization constant, and the lambdas are free parameters we set to meet the expectation constraints.)
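
For concreteness, here is a tiny numerical sketch of this result (a toy example of my own, not drawn from any of the work discussed below): a six-sided die with a single feature f(x) = x and the constraint that the mean face value is 4.5. Gradient descent on the dual recovers the exponential tilting.

import numpy as np

# Maximum entropy over outcomes 1..6 subject to E_p[f] = 4.5, with f(x) = x.
f = np.arange(1, 7, dtype=float)   # the single feature: the face value
c = 4.5                            # target expectation

lam = 0.0                          # the dual variable (lambda)
for _ in range(5000):
    w = np.exp(lam * f)
    p = w / w.sum()                # p_lambda(x) = e^{lambda f(x)} / Z
    grad = p @ f - c               # d/dlambda [log Z(lambda) - lambda*c] = E_p[f] - c
    lam -= 0.1 * grad              # small-step gradient descent on the dual

print(np.round(p, 3), p @ f)       # exponentially tilted toward high faces, mean ~4.5

The resulting p is the maximum entropy distribution meeting the constraint, and it has exactly the e^{\lambda f}/Z form above.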

A great deal of statistical mechanics flows from this result, and it has proven very fruitful in learning as well. (It motivated work on maximum entropy models in text learning and on Conditional Random Fields, for instance.) The result has been demonstrated a number of ways. One of the most elegant is the “geometric” proof.

When the expectation constraints come from data, this tells us that the maximum entropy distribution is exactly the maximum likelihood distribution in the exponential family. It’s a surprising connection, and the duality it flows from appears in a wide variety of work. (For instance, Martin Wainwright’s approximate inference techniques rely, in essence, on this result.)
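
To spell the duality out in the notation above: with empirical constraints c_i = (1/n) \sum_j f_i(x_j), the Lagrangian dual of the entropy problem is

min_\lambda log Z(\lambda) - \sum_i \lambda_i c_i

and this objective is exactly the negative average log-likelihood of the data under p_\lambda = e^{\sum_i \lambda_i f_i}/Z. Solving the dual is therefore maximum likelihood over the exponential family, which is the connection referred to here.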

In practice, we know that Maximum Likelihood with a lot of features is bound to overfit. The traditional trick is to pull a sleight of hand in the derivation. We start with the primal entropy problem, move to the dual, and in the dual add a “prior” that penalizes the lambdas. (Typically an l_1 or l_2 penalty or constraint.) This game is played in a variety of papers, and it’s a sleight of hand because the penalties don’t come from the motivating problem (the primal) but rather get tacked on at the end. In short: it’s a hack.
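
In code, the hack amounts to one extra term in the dual gradient. Here is the toy die sketch from above with an l_2 penalty tacked on (alpha is just a made-up regularization weight for illustration):

import numpy as np

f = np.arange(1, 7, dtype=float)       # same toy die feature as above
c = 4.5                                # same expectation constraint
alpha = 0.5                            # hypothetical penalty weight
lam = 0.0
for _ in range(5000):
    w = np.exp(lam * f)
    p = w / w.sum()
    grad = p @ f - c + alpha * lam     # penalty gradient bolted onto the dual
    lam -= 0.1 * grad                  # minimizes log Z(lambda) - lambda*c + (alpha/2)*lambda^2

Nothing in the primal entropy problem asked for the alpha * lam term; it appears only in the dual, which is exactly the complaint above.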

So I realized a few months back that the primal (entropy) problem that regularization relates to is remarkably natural. Basically, it tells us that regularization in the dual corresponds directly to uncertainty (minimax) about the constraints in the primal. What we end up with is a distribution p that is robust in the sense that it maximizes the entropy subject to a large set of potential constraints. More recently, I realized that I’m not even close to having been the first to figure that out. Miroslav Dudík, Steven J. Phillips and Robert E. Schapire have a paper that derives this relation and then goes a step further to show what performance guarantees the method provides. It’s a great paper and I hope you get a chance to check it out:

Performance guarantees for regularized maximum entropy density estimation.

(Even better: if you’re attending ICML this year, I believe you will see Rob Schapire talk about some of this and related material as an invited speaker.)
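
To make the correspondence concrete in the notation above (stated loosely; the Dudík, Phillips and Schapire paper has the precise version and the guarantees):

max_p H(p)
s.t. |E_p[f_i] - c_i| <= \beta_i for all i

has as its dual

min_\lambda log Z(\lambda) - \sum_i \lambda_i c_i + \sum_i \beta_i |\lambda_i|

which is l_1-regularized maximum likelihood, with each penalty weight \beta_i given by the width of the corresponding uncertainty interval. Regularization in the dual is exactly box uncertainty about the constraints in the primal.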

It turns out the idea generalizes quite a bit. In Robust design of biological experiments, P. Flaherty, M. I. Jordan and A. P. Arkin show a related result where regularization directly follows from a robustness or uncertainty guarantee. And if you want the whole, beautiful framework, you’re in luck. Yasemin Altun and Alex Smola have a paper (that I haven’t yet finished reading, but which at least begins very well) that generalizes the regularized maximum entropy duality to a whole class of statistical inference procedures. If you’re at COLT, you can check this out as well.

Unifying Divergence Minimization and Statistical Inference via Convex Duality

The deep, unifying result seems to be what the title of the post says: robustness = regularization. This viewpoint makes regularization seem like much less of a hack, and goes further in suggesting just what range of constants might be reasonable. The work is very relevant to learning, but the general idea goes beyond to various problems where we only approximately know constraints.

6/5/2006

Server Shift, Site Tweaks, Suggestions?

Tags: General jl@ 8:40 pm

Hunch.net has shifted to a new server, and WordPress has been updated to the latest version. If anyone notices difficulties associated with this, please comment. (Note that DNS updates can take a while, so the shift may not yet be complete.)
More generally, this is a good time to ask for suggestions. What would make this blog more useful?

4/30/2006

John Langford -> Yahoo Research, NY

Tags: General jl@ 10:02 pm

I will join Yahoo Research (in New York) after my contract ends at TTI-Chicago.

The deciding reasons are:

  1. Yahoo is running into many hard learning problems. This is precisely the situation where basic research might hope to have the greatest impact.
  2. Yahoo Research understands research, including publishing, conferences, etc.
  3. Yahoo Research is growing, so there is a chance I can help it grow well.
  4. Yahoo understands the internet, including (but not at all limited to) experimenting with research blogs.

In the end, Yahoo Research seems like the place where I might have a chance to make the greatest difference.

Yahoo (as a company) has made a strong bet on Yahoo Research. We-the-researchers all hope that bet will pay off, and this seems plausible. I’ll certainly have fun trying.

12/28/2005

Yet more NIPS thoughts

Tags: General DrewBagnell@ 12:13 pm

I only managed to make it out to the NIPS workshops this year, so I’ll give my comments on what I saw there.

The Learning and Robotics workshop lives again. I hope it continues and gets more high quality papers in the future. The most interesting talk for me was Larry Jackel’s on the LAGR program (see John’s previous post on said program). I got some ideas as to what progress has been made. Larry really explained the types of benchmarks and the tradeoffs that had to be made to make the goals achievable but challenging.

Hal Daumé gave a very interesting talk about structured prediction using RL techniques, something near and dear to my own heart. He achieved rather impressive results using only a very greedy search.

The non-parametric Bayes workshop was great. I enjoyed the entire morning session I spent there, and particularly (the usually desultory) discussion periods. One interesting topic was the Gibbs/Variational inference divide. I won’t try to summarize it, especially as no conclusion was reached. It was interesting to note that samplers are competitive with the variational approaches for many Dirichlet process problems. One open question I left with was whether the fast variants of Gibbs sampling could be made multi-processor, as the naive variants can.

I also have a reading list of sorts from the main conference. Most of the papers mentioned in previous posts on NIPS are on that list, as well as these (in no particular order):

The Information-Form Data Association Filter
Sebastian Thrun, Brad Schumitsch, Gary Bradski, Kunle Olukotun

Divergences, surrogate loss functions and experimental design
XuanLong Nguyen, Martin Wainwright, Michael Jordan

Generalization to Unseen Cases
Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri

Gaussian Process Dynamical Models
David Fleet, Jack Wang, Aaron Hertzmann

Convex Neural Networks
Yoshua Bengio, Nicolas Le Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte

Describing Visual Scenes using Transformed Dirichlet Processes
Erik Sudderth, Antonio Torralba, William Freeman, Alan Willsky

Learning vehicular dynamics, with application to modeling helicopters
Pieter Abbeel, Varun Ganapathi, Andrew Ng

Tensor Subspace Analysis
Xiaofei He, Deng Cai, Partha Niyogi
