Machine Learning (Theory)

4/28/2005

Science Fiction and Research

Tags: General jl@ 9:48 am

A big part of doing research is imagining how things could be different, and then trying to figure out how to get there.

A big part of science fiction is imagining how things could be different, and then working through the implications.

Because of the similarity here, reading science fiction can sometimes be helpful in understanding and doing research. (And, hey, it’s fun.) Here’s some list of science fiction books I enjoyed which seem particularly relevant to computer science and (sometimes) learning systems:

  1. Vernor Vinge, “True Names”, “A Fire Upon the Deep”
  2. Marc Stiegler, “David’s Sling”, “Earthweb”
  3. Charles Stross, “Singularity Sky”
  4. Greg Egan, “Diaspora”
  5. Joe Haldeman, “Forever Peace”

(There are surely many others.)

Incidentally, the nature of science fiction itself has changed. Decades ago, science fiction projected great increases in the power humans control (example: E.E. Smith Lensman series). That didn’t really happen in the last 50 years. Instead, we gradually refined the degree to which we can control various kinds of power. Science fiction has changed to reflect this. This can be understood as a shift from physics-based progress to engineering or computer science based progress.

4/27/2005

DARPA project: LAGR

Tags: Funding, Problems, Robots jl@ 9:41 am

Larry Jackal has set up the LAGR (”Learning Applied to Ground Robotics”) project (and competition) which seems to be quite well designed. Features include:

  1. Many participants (8 going on 12?)
  2. Standardized hardware. In the DARPA grand challenge contestants entering with motorcycles are at a severe disadvantage to those entering with a Hummer. Similarly, contestants using more powerful sensors can gain huge advantages.
  3. Monthly contests, with full feedback (but since the hardware is standardized, only code is shipped). One of the premises of the program is that robust systems are desired. Monthly evaluations at different locations can help measure this and provide data.
  4. Attacks a known hard problem. (cross country driving)

4/26/2005

To calibrate or not?

Tags: General jl@ 9:20 am

A calibrated predictor is one which predicts the probability of a binary event with the property: For all predictions p, the proportion of the time that 1 is observed is p.

Since there are infinitely many p, this definition must be “softened” to make sense for any finite number of samples. The standard method for “softening” is to consider all predictions in a small neighborhood about each possible p.

A great deal of effort has been devoted to strategies for achieving calibrated (such as here) prediction. With statements like: (under minimal conditions) you can always make calibrated predictions.

Given the strength of these statements, we might conclude we are done, but that would be a “confusion of ends”. A confusion of ends arises in the following way:

  1. We want good probabilistic predictions.
  2. Good probabilistic predictions are calibrated.
  3. Therefore, we want calibrated predictions.

The “Therefore” step misses the fact that calibration is a necessary but not a sufficient characterization of good probabilities. For example on the sequence “010101010…”, always predicting p=0.5 is calibrated.

This leads to the question: What is a sufficient characterization of good probabilities? There are several candidates:

  1. From Vohra: Calibrated on all simple subsequences.
  2. Small squared error: sumx (x-px)2.
  3. Small log probability: sumx log (1/px)

I don’t yet understand which of these candidates is preferrable.

There is a sense in which none of them can be preferred. In any complete prediction system, the probabilities are used in some manner, and there is some loss (or utility) associated with it’s use. The “real” goal is minimizing that loss. Depending on the sanity of the method using the probabilities, this may even imply that lieing about the probabilities is preferred. Nevertheless, we can hope for a sane use of probabilities and a sufficient mechanism for predicting good probabilities might eventually result in good performance for any sane use.

4/25/2005

Embeddings: what are they good for?

Tags: General sanjoy@ 1:22 am

I’ve been looking at some recent embeddings work, and am struck by how beautiful the theory and algorithms are. It also makes me wonder, what are embeddings good for?

A few things immediately come to mind:

(1) For visualization of high-dimensional data sets.

In this case, one would like good algorithms for embedding specifically into 2- and 3-dimensional Euclidean spaces.

(2) For nonparametric modeling.

The usual nonparametric models (histograms, nearest neighbor) often require resources which are exponential in the dimension. So if the data actually lie close to some low-dimensional
surface, it might be a good idea to first identify this surface and embed the data before applying the model.

Incidentally, for applications like these, it’s important to have a functional mapping from high to low dimension, which some techniques do not yield up.

(3) As a prelude to classifier learning.

The hope here is presumably that learning will be easier in the low-dimensional space, because of (i) better generalization and (ii) a more “natural” layout of the data.

I’d be curious to know of other uses for embeddings.

4/23/2005

Advantages and Disadvantages of Bayesian Learning

Tags: Bayesian jl@ 9:29 am

I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of:

  1. Specify a prior over world models.
  2. Integrate using Bayes law with respect to all observed information to compute a posterior over world models.
  3. Predict according to the posterior.

Bayesian learning has many advantages over other learning programs:

  1. Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe of “think harder” (about specifying a prior over world models) and “compute harder” (to calculate a posterior) will eventually succeed. Many other machine learning approaches don’t have this guarantee.
  2. Language Bayesian and near-Bayesian methods have an associated language for specifying priors and posteriors. This is significantly helpful when working on the “think harder” part of a solution.
  3. Intuitions Bayesian learning involves specifying a prior and integration, two activities which seem to be universally useful. (see intuitions).

With all of these advantages, Bayesian learning is a strong program. However, there are also some very significant disadvantages.

  1. Information theoretically infeasible It turns out that specifying a prior is extremely difficult. Roughly speaking, we must specify a real number for every setting of the world model parameters. Many people well-versed in Bayesian learning don’t notice this difficulty for two reasons:
    1. They know languages allowing more compact specification of priors. Acquiring this knowledge takes some signficant effort.
    2. They lie. They don’t specify their actual prior, but rather one which is convenient. (This shouldn’t be taken too badly, because it often works.)
  2. Computationally infeasible Let’s suppose I could accurately specify a prior over every air molecule in a room. Even then, computing a posterior may be extremely difficult. This difficulty implies that computational approximation is required.
  3. Unautomatic The “think harder” part of the Bayesian research program is (in some sense) a “Bayesian employment” act. It guarantees that as long as new learning problems exist, there will be a need for Bayesian engineers to solve them. (Zoubin likes to counter that a superprior over all priors can be employed for automation, but this seems to add to the other disadvantages.)

Overall, if a learning problem must be solved a Bayesian should probably be working on it and has a good chance of solving it.
I wish I knew whether or not the drawbacks can be convincingly addressed. My impression so far is “not always”.

4/22/2005

New Blog: [Lowerbounds,Upperbounds]

Tags: General jl@ 10:05 am

Maverick Woo and the Aladdin group at CMU have started a CS theory-related blog here.

4/21/2005

Dynamic Programming Generalizations and Their Use

Tags: General, Papers jl@ 8:38 pm

David Mcallester gave a talk about this paper (with Pedro Felzenszwalb). I’ll try to give a high level summary of why it’s interesting.

Dynamic programming is most familiar as instantiated by Viterbi decoding in a hidden markov model. It is a general paradigm for problem solving where subproblems are solved and used to solve larger problems. In the Viterbi decoding example, the subproblem is “What is the most probable path ending at each state at timestep t?”, and the larger problem is the same except at timestep t+1. There are a few optimizations you can do here:

  1. Dynamic Programming -> queued Dynamic Programming. Keep track of the “cost so far” (or “most probable path”) and (carefully) only look at extensions to paths likely to yield the shortest path. “Carefully” here is defined by Dijkstra’s shortest path algorithm.
  2. queued Dynamic programming -> A*Add a lower bound on the cost to complete a path (or an upper bound on the probability of a completion) for the priority queue of Dijkstra’s shortest path. This can yield computational speedups varying between negligible and outstanding.
  3. A* -> Hierarchical A* The efficiency of A* search is dependent on the tightness of it’s lower bound, which brings up the question: “Where do you get the lower bound?” One appealing answer is from A* applied to a simplified problem equivalent to the original problem, but with states aliased (many states in original problem = 1 state in new problem). This technique can be applied recursively until the problem is trivial.

Each of these steps has been noted previously (although perhaps not in the generality of this paper). What seems new and interesting is that the entire hierarchy of A* searches can be done simultaneously on one priority queue.

The resulting algorithm can use low level information to optimize high level search as well as high level information to optimize low level search in a holistic process. It’s not clear yet how far this approach can be pushed, but this quality is quite appealing. Naturally, there are many plausible learning-related applications.

4/16/2005

Which Assumptions are Reasonable?

Tags: General jl@ 12:50 pm

One of the most confusing things about understanding learning theory is the vast array of differing assumptions. Some critical thought about which of these assumptions are reasonable for real-world problems may be useful.

Before we even start thinking about assumptions, it’s important to realize that the word has multiple meanings. The meaning used here is “assumption = axiom” (i.e. something you can not verify).

Assumption Reasonable? Which analysis? Example/notes
Independent and Identically Distributed Data Sometimes PAC,ERM,Prediction bounds,statistics The KDD cup 2004 physics dataset is plausibly IID data. There are a number of situations which are “almost IID” in the sense that IID analysis results in correct intuitions. Unreasonable in adversarial situations (stock market, war, etc…)
Independently Distributed Data More than IID, but still only sometimes online->batch conversion Losing “identical” can be helpful in situations where you have a cyclic process generating data.
Finite exchangeability (FEX) Sometimes reasonable as for IID There are a good number of situations where there is a population we wish to classify, pay someone to classify a random subset, and then try to learn.
Input space uniform on a sphere No. PAC, active learning I’ve never observed this in practice.
Functional form: “or” of variables, decision list, “and” of variables Sometimes reasonable PAC analysis There are often at least OK functions of this form that make good predictions
No Noise Rarely reasonable PAC, ERM Most learning problems appear to be of the form where the correct prediction given the inputs is fundamentally ambiguous.
Functional form: Monotonic on variables Often PAC-style Many natural problems seem to have behavior monotonic in their input variables.
Functional form: xor Occasionally PAC I was suprised to observe this.
Fast Mixing Sometimes RL Interactive processes often fail to mix, ever, because entropy always increases.
Known optimal state distribution Sometimes RL Sometimes humans know what is going on, and sometimes not.
Small approximation error everywhere Rarely RL Approximate policy iteration is known for sometimes behaving oddly.

If anyone particularly agrees or disagrees with the reasonableness of these assumptions, I’m quite interested.

4/14/2005

Families of Learning Theory Statements

Tags: Organization jl@ 4:41 pm

The diagram above shows a very broad viewpoint of learning theory.

arrow Typical statement Examples
Past->Past Some prediction algorithm A does almost as well as any of a set of algorithms. Weighted Majority
Past->Future Assuming independent samples, past performance predicts future performance. PAC analysis, ERM analysis
Future->Future Future prediction performance on subproblems implies future prediction performance using algorithm A. ECOC, Probing

A basic question is: Are there other varieties of statements of this type? Avrim noted that there are also “arrows between arrows”: generic methods for transforming between Past->Past statements and Past->Future statements. Are there others?

4/10/2005

Is the Goal Understanding or Prediction?

Tags: Organization jl@ 6:28 pm

Steve Smale and I have a debate about goals of learning theory.

Steve likes theorems with a dependence on unobservable quantities. For example, if D is a distribution over a space X x [0,1], you can state a theorem about the error rate dependent on the variance, E(x,y)~D (y-Ey’~D|x[y'])2.

I dislike this, because I want to use the theorems to produce code solving learning problems. Since I don’t know (and can’t measure) the variance, a theorem depending on the variance does not help me—I would not know what variance to plug into the learning algorithm.

Recast more broadly, this is a debate between “declarative” and “operative” mathematics. A strong example of “declarative” mathematics is “a new kind of science”. Roughly speaking, the goal of this kind of approach seems to be finding a way to explain the observations we make. Examples include “some things are unpredictable”, “a phase transition exists”, etc…

“Operative” mathematics helps you make predictions about the world. A strong example of operative mathematics is Newtonian mechanics in physics: it’s a great tool to help you predict what is going to happen in the world.

In addition to the “I want to do things” motivation for operative mathematics, I find it less arbitrary. In particular, two reasonable people can each be convinced they understand a topic in ways so different that they do not understand the viewpoint. If these understandings are operative, the rest of us on the sidelines can better appreciate which understanding is “best”.

4/8/2005

Fast SVMs

Tags: Papers, Supervised jl@ 9:57 am

There was a presentation at snowbird about parallelized support vector machines. In many cases, people parallelize by ignoring serial operations, but that is not what happened here—they parallelize with optimizations. Consequently, this seems to be the fastest SVM in existence.

There is a related paper here.

4/6/2005

Structured Regret Minimization

Tags: Online, Papers jl@ 3:50 pm

Geoff Gordon made an interesting presentation at the snowbird learning workshop discussing the use of no-regret algorithms for the use of several robot-related learning problems. There seems to be a draft here. This seems interesting in two ways:

  1. Drawback Removal One of the significant problems with these online algorithms is that they can’t cope with structure very easily. This drawback is addressed for certain structures.
  2. Experiments One criticism of such algorithms is that they are too “worst case”. Several experiments suggest that protecting yourself against this worst case does not necessarily incur a great loss.

4/4/2005

Grounds for Rejection

Tags: Reviewing jl@ 6:31 pm

It’s reviewing season right now, so I thought I would list (at a high level) the sorts of problems which I see in papers. Hopefully, this will help us all write better papers.

The following flaws are fatal to any paper:

  1. Incorrect theorem or lemma statements A typo might be “ok”, if it can be understood. Any theorem or lemma which indicates an incorrect understanding of reality must be rejected. Not doing so would severely harm the integrity of the conference. A paper rejected for this reason must be fixed.
  2. Lack of Understanding If a paper is understood by none of the (typically 3) reviewers then it must be rejected for the same reason. This is more controversial than it sounds because there are some people who maximize paper complexity in the hope of impressing the reviewer. The tactic sometimes succeeds with some reviewers (but not with me).

    As a reviewer, I sometimes get lost for stupid reasons. This is why an anonymized communication channel with the author can be very helpful.

  3. Bad idea Rarely, a paper comes along with an obviously bad idea. These also must be rejected for the integrity of science

The following flaws have a strong negative impact on my opinion of the paper.

  1. Kneecapping the Giants. “Kneecapping the giants” papers take a previously published idea, cripple it, and then come up with an improvement on the crippled version. This often looks great experimentally, but is unconvincing because it does not improve on the state of the art.
  2. Only Toys. The paper emphasizes experimental evidence on datasets specially created to show the good performance of their algorithm. Unfortunately, because learning is worst-case-impossible, I have little trust that performing well on a toy dataset implies good performance on real-world datasets.

My actual standard for reviewing is quite low, and I’m happy to approve of incremental improvements. Unfortunately, even that standard is such that I suggest rejection on most reviewed papers.

4/1/2005

Basic computer science research takes a hit

Tags: Funding jl@ 11:46 pm

The New York Times has an interesting article about how DARPA has dropped funding for computer science to universities by about a factor of 2 over the last 5 years and become less directed towards basic research. Partially in response, the number of grant submissions to NSF has grown by a factor of 3 (with the NSF budget staying approximately constant in the interim).

This is the sort of policy decision which may make sense for the defense department, but which means a large hit for basic research on information technology development in the US. For example “darpa funded the invention of the internet” is reasonably correct. This policy decision is particularly painful in the context of NSF budget cuts and the end of extensive phone monopoly funded research at Bell labs.

The good news from a learning perspective is that (based on anecdotal evidence) much of the remaining funding is aimed at learning and learning-related fields. Methods of making good automated predictions obviously have a lot of applications that DARPA cares about and the technology often isn’t there yet.

The Producer-Consumer Model of Research

Tags: Reviewing jl@ 1:08 pm

In the quest to understand what good reviewing is, perhaps it’s worthwhile to think about what good research is. One way to think about good research is in terms of a producer/consumer model.

In the producer/consumer model of research, for any element of research there are producers (authors and coauthors of papers, for example) and consumers (people who use the papers to make new papers or code solving problems). An produced bit of research is judged as “good” if it is used by many consumers. There are two basic questions which immediately arise:

  1. Is this a good model of research?
  2. Are there alternatives?

The producer/consumer model has some difficulties which can be (partially) addressed.

  1. Disconnect. A group of people doing research on some subject may become disconnected from the rest of the world. Each person uses the research of other people in the group so it appears good research is being done, but the group has no impact on the rest of the world. One way to detect this is by looking at the consumers2 (the consumers of the consumers) and higher order powers. If the set doesn’t expand much with higher order powers, then there is a disconnect.
  2. Latency. It is extraordinarily difficult to determine in advance whether a piece of research will have many consumers. A particular piece of work may be useful only after a very long period of time. This difficulty is particularly severe for theoretical research.
  3. Self-fulfillment To some extent, interesting research by this definition is simply research presented to the largest possible audience. The odds that someone will build on the research are simply larger when it is presented to a larger audience. Some portion of this effect is “ok”—certainly attempting to educate other people is a good idea. But in judging the value of a piece of research, discounting by the vigor with which it is presented may be healthy for the system. (In effect, this as a bias against spamming.)

If we accept the difficulties of the producer consumer model, then good reviewing becomes a problem of predicting what research will have a large impact in terms of the numbers of consumers (and consumers^2, etc…) Citations can act (to some extent) as a proxy for consumption implying that it may be possible to (retroactively) score a reviewer’s judgement. There are many difficulties here. For example a citation of the form “[joe blow 93] is wrong and here’s why” isn’t an example of the sort of use we want to encourage. Another important effect is that a reviewer who rejects a paper biases the number of citations a paper later recieves. Another is that a rejected paper that has been resubmitted to another place may change so that it is simply a better paper. It isn’t obvious what a good method is for taking all of these effects into account.

Clearly, there are problems with this model for judging research (and at the second order, judgements of reviews of research). However, I am not aware of any other abstract model for “good research” which is even this good. If you know one, please comment.

Powered by WordPress