Science in the Government

I found the article on “Political Science” at the New York Times interesting. Essentially the article is about allegations that the US government has been systematically distorting scientific views. With a petition by some 7000+ scientists alleging such behavior this is clearly a significant concern.

One thing not mentioned explicitly in this discussion is that there are fundamental cultural differences between academic research and the rest of the world. In academic research, careful, clear thought is valued. This value is achieved by both formal and informal mechanisms. One example of a formal mechanism is peer review.

In contrast, in the land of politics, the basic value is agreement. It is only with some amount of agreement that a new law can be passed or other actions can be taken. Since Science (with a capitol ‘S’) has accomplished many things, it can be a significant tool in persuading people. This makes it compelling for a politician to use science as a mechanism for pushing agreement on their viewpoint.

Most scientists would not mind if their research is used in a public debate. The difficulty arises when the use of science is not representative of the beliefs of scientists. This can happen in many ways. For example, agreement is uncommon in research which implies that it is almost always possible, by carefully picking and choosing, to find one scientist who supports almost any viewpoint.

Such misrepresentations of scientific beliefs about the world violate the fundamental value of “careful, clear thought”, so they are regarded as fundamentally dangerous to the process of research. Naturally, fundamentally dangerous things are sensitive issues which can easily lead to large petitions.

This combination of mismatched values is what appears to be happening. It is less clear what should be done about it.

One response has been (as the article title suggests) politicization of science and scientists. For example the Union of Concerned Scientists (which organized the petition) has a viewpoint and is pushing it. As another example, anecdotal evidence suggests a strong majority of scientists in the US voted against Bush in the last presidential election.

I would prefer a different approach, which is essentially a separation of responsibilities. Given a sufficient separation of powers, scientists should be the most reliable source for describing and predicting the outcomes of some courses of action and the impact of new technologies. What is done with such information is up to the rest of the world. This style of “sharply defined well-separated powers” has worked fairly well elsewhere. Supreme court judges (who specialize in interpretation of law) are, by design, relatively unaffectable by the rest of politics. A newer example is the federal reserve board who have been relatively unaffected by changes in politics, even though it is easy to imagine their powers could dramatically effect election outcomes. This last example is a matter of custom rather than constitutional law.

Neither of the above examples are perfect—the separation of powers has failed on multiple occasions. Nevertheless, it seems to be a useful ideal.

(Dis)similarities between academia and open source programmers

Martin Pool and I recently discussed the similarities and differences between academia and open source programming.

Similarities:

  1. Cost profile Research and programming share approximately the same cost profile: A large upfront effort is required to produce something useful, and then “anyone” can use it. (The “anyone” is not quite right for either group because only sufficiently technical people could use it.)
  2. Wealth profile A “wealthy” academic or open source programmer is someone who has contributed a lot to other people in research or programs. Much of academia is a “gift culture”: whoever gives the most is most respected.
  3. Problems Both academia and open source programming suffer from similar problems.
    1. Whether or not (and which) open source program is used are perhaps too-often personality driven rather than driven by capability or usefulness. Similar phenomena can happen in academia with respect to directions of research.
    2. Funding is often a problem for both groups. Academics often invest many hours in writing grants while open source programmers simply often are not paid.
  4. Both groups of people work in a mixed competitive/collaborative environment.
  5. Both groups use conferences as a significant mechanism of communication.

Given the similarities, it is not too surprising that there is significant cooperation between academia and open source programming, and it is relatively common to crossover from one to the other.

The differences are perhaps more interesting to examine because they may point out where one group can learn from the other.

  1. A few open source projects have achieved significantly larger scales than academia as far as coordination amongst many people over a long time. Big project examples include linux, apache, and mozilla. Groups of people of this scale in academia are typically things like “the ICML community”, or “people working on Bayesian learning”, which are significantly less tightly coupled than any of the above projects. This suggests it may be possible to achieve significantly larger close collaborations in academia.
  2. Academia has managed to secure significantly more funding than open source programmers. Funding typically comes from a mixture of student tuition and government grants. Part of the reason for better funding in academia is that it has been around longer and so been able to accomplish more. Perhaps governments will start funding open source programming more seriously if they produce an equivalent (with respect to societal impact) of the atom bomb.
  3. Academia has a relatively standard career path: grade school education, undergraduate education, graduate education, then apply for a job as a professor at a university. In contrast the closest thing to a career path for open source programmers is something like “do a bunch of open source projects and become so wildly succesful that some company hires you to do the same thing”. This is a difficult path but perhaps it is slowly becoming easier and there is still much room for improvement.
  4. Open source programmers take significantly more advantage of modern tools for communication. As an example of this, Martin mentioned that perhaps half the people working on Ubuntu have blogs. In academia, they are still a rarity.
  5. Open source programmers have considerably more freedom of location. Academic research is almost always tied to a particular university or lab, while many people who work on open source projects can choose to live esssentially anywhere with reasonable internet access.

Do you believe in induction?

Foster Provost gave a talk at the ICML metalearning workshop on “metalearning” and the “no free lunch theorem” which seems worth summarizing.

As a review: the no free lunch theorem is the most complicated way we know of to say that a bias is required in order to learn. The simplest way to see this is in a nonprobabilistic setting. If you are given examples of the form (x,y) and you wish to predict y from x then any prediction mechanism errs half the time in expectation over all sequences of examples. The proof of this is very simple: on every example a predictor must make some prediction and by symmetry over the set of sequences it will be wrong half the time and right half the time. The basic idea of this proof has been applied to many other settings.

The simplistic interpretation of this theorem which many people jump to is “machine learning is dead” since there can be no single learning algorithm which can solve all learning problems. This is the wrong way to think about it. In the real world, we do not care about the expectation over all possible sequences, but perhaps instead about some (weighted) expectation over the set of problems we actually encounter. It is enitrely possible that we can form a prediction algorithm with good performance over this set of problems.

This is one of the fundamental reasons why experiments are done in machine learning. If we want to access the set of problems we actually encounter, we must do this empirically. Although we must work with the world to understand what a good general-purpose learning algorithm is, quantifying how good the algorithm is may be difficult. In particular, performing well on the last 100 encountered learning problems may say nothing about performing well on the next encountered learning problem.

This is where induction comes in. It has been noted by Hume that there is no mathematical proof that the sun will rise tomorrow which does not rely on unverifiable assumptions about the world. Nevertheless, the belief in sunrise tomorrow is essentially universal. A good general purpose learning algorithm is similar to ‘sunrise’: we can’t prove that we will succeed on the next learning problem encountered, but nevertheless we might believe it for inductive reasons. And we might be right.

SVM Adaptability

Several recent papers have shown that SVM-like optimizations can be used to handle several large family loss functions.

This is a good thing because it is implausible that the loss function imposed by the world can not be taken into account in the process of solving a prediction problem. Even people used to the hard-core Bayesian approach to learning often note that some approximations are almost inevitable in specifying a prior and/or integrating to achieve a posterior. Taking into account how the system will be evaluated can allow both computational effort and design effort to be focused so as to improve performance.

A current laundry list of capabilities includes:

  1. 2002 multiclass SVM including arbitrary cost matrices
  2. ICML 2003 Hidden Markov Models
  3. NIPS 2003 Markov Networks (see some discussion)
  4. EMNLP 2004 Context free grammars
  5. ICML 2004 Any loss (with much computation)
  6. ICML 2005 Any constrained linear prediction model (that’s my own name).
  7. ICML 2005 Any loss dependent on a contingency table

I am personally interested in how this relates to the learning reductions work which has similar goals, but works at a different abstraction level (the learning problem rather than algorithmic mechanism). The difference in abstraction implies that anything solvable by reduction should be solvable by a direct algorithmic mechanism. However, comparing and constrasting the results I know of it seems that what is solvable via reduction to classification versus what is solvable via direct SVM-like methods is currently incomparable.

  1. Can SVMs be tuned to directly solve (example dependent) cost sensitive classification? Obviously, they can be tuned indirectly via reduction, but it is easy to imagine more tractable direct optimizations.
  2. How efficiently can learning reductions be used to solve structured prediction problems? Structured prediction problems are instances of cost sensitive classification, but the regret transform efficiency which occurs when this embedding is done is too weak to be of interest.
  3. Are there any problems efficiently solvable by SVM-like algorithms which are not efficiently solvable via learning reductions?

Why Manifold-Based Dimension Reduction Techniques?

Manifold based dimension-reduction algorithms share the following general outline.

Given: a metric d() and a set of points S

  1. Construct a graph with a point in every node and every edge connecting to the node of one of the k-nearest neighbors. Associate with the edge a weight which is the distance between the points in the connected nodes.
  2. Digest the graph. This might include computing the shortest path between all points or figuring out how to linearly interpolate the point from it’s neighbors.
  3. Find a set of points in a low dimensional space which preserve the digested properties.

Examples include LLE, Isomap (which I worked on), Hessian-LLE, SDE, and many others. The hope with these algorithms is that they can recover the low dimensional structure of point sets in high dimensional spaces. Many of them can be shown to work in interesting ways producing various compelling pictures.

Despite doing some early work in this direction, I suffer from a motivational problem: Why do we want to recover the low dimensional structure? One answer is “for better data visualization”. This is compelling if you have data visualization problems. However, I don’t — I want to make machines that can better predict the future, which generally appears to be a sound goal of learning. Reducing the dimensionality of a dataset is not obviously helpful in accomplishing this. In fact, doing so violates one of the basic intuitions of applied learning algorithms “avoid double approximation”. (One approximation = the projection into the low dimensional space, another approximation = the classifier learned on that space.)

Another answer is “for robots”. Several people have experimented with using a vision sensor and a dimension reduction technique in an attempt to extract the manifold of pose space. These attempts have not generally worked well, basically because the euclidean distance on pixels is not particularly good at predicting which things are “nearby”. However, we might be able to do considerably better if we learn the distance. At the 1-bit level, we might learn a predictor from image pairs to “nearby” or “far”. Any stream S of images i1, i2, i3, …, in can be transformed into a binary problem according to:
{((ij,ik),1 – I(j = k+1 or k = j+1): ij,ik in S}. In unmath “the binary problem formed by predicting whether images are adjacent in the chain of experience”. (*) A good solution to this binary problem would give us an
interesting 1-bit metric. Using regression and counting numbers of transitions might provide a more conventional multibit metric.

This metric, if well solved, has a concrete meaning: the minimum distance in terms of actuator transitions between positions. A shortest path in this space is a sequence of actuator movements leading from a position A to a position B. A projection of this space into low dimensions provides some common format which both the human and the robot can understand. Commanding the robot to go to some location is just a matter of pointing out that location in the low dimensional projection.

This is a possible use for manifold based dimension reduction techniques which I find compelling, if it works out. (Anyone interested in playing with this should talk to Dana Wilkinson who is considering experimenting with this approach.)

(*) We probably would want to tweak the positive/negative ratio to reflect the pattern encountered in usage.
(**) Post tweaked to fix an oversight.