Foster Provost gave a talk at the ICML metalearning workshop on “metalearning” and the “no free lunch theorem”, which seems worth summarizing.
As a review: the no free lunch theorem is the most complicated way we know of to say that a bias is required in order to learn. The simplest way to see this is in a nonprobabilistic setting. If you are given examples of the form (x,y) and you wish to predict y from x, then any prediction mechanism errs half the time in expectation over all sequences of examples. The proof is very simple: on every example a predictor must make some prediction, and by symmetry over the set of possible sequences it will be wrong half the time and right half the time. The basic idea of this proof has been applied to many other settings.
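To make the symmetry argument concrete, here is a minimal sketch (my own, not from the talk), assuming binary labels and a handful of inputs: averaging any fixed prediction rule’s error over every possible labeling gives exactly 1/2. The same symmetry applies to predictors that adapt to past examples, since each sequence on which a given prediction is right is matched by one on which it is wrong.

```python
# Minimal sketch of the no-free-lunch symmetry argument (binary labels):
# average any fixed prediction rule's error over all 2^n possible labelings
# of a set of inputs, and the result is exactly 1/2.
from itertools import product

def average_error(predict, xs):
    """Mean error rate of `predict`, averaged over every labeling of `xs`."""
    n = len(xs)
    total = 0.0
    for labels in product([0, 1], repeat=n):      # all 2^n label sequences
        mistakes = sum(predict(x) != y for x, y in zip(xs, labels))
        total += mistakes / n
    return total / 2 ** n

xs = list(range(6))                               # six arbitrary inputs
print(average_error(lambda x: 0, xs))             # 0.5
print(average_error(lambda x: x % 2, xs))         # 0.5 -- any fixed rule ties
```

Whatever rule you plug in, the average is the same, which is the “no single algorithm wins over all sequences” statement in miniature.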
The simplistic interpretation of this theorem, which many people jump to, is “machine learning is dead”, since there can be no single learning algorithm which solves all learning problems. This is the wrong way to think about it. In the real world, we do not care about the expectation over all possible sequences, but perhaps instead about some (weighted) expectation over the set of problems we actually encounter. It is entirely possible to form a prediction algorithm with good performance over this set of problems.
This is one of the fundamental reasons why experiments are done in machine learning. If we want to access the set of problems we actually encounter, we must do this empirically. Although we must work with the world to understand what a good general-purpose learning algorithm is, quantifying how good the algorithm is may be difficult. In particular, performing well on the last 100 encountered learning problems may say nothing about performing well on the next encountered learning problem.
This is where induction comes in. It has been noted by Hume that there is no mathematical proof that the sun will rise tomorrow which does not rely on unverifiable assumptions about the world. Nevertheless, the belief in sunrise tomorrow is essentially universal. A good general purpose learning algorithm is similar to ‘sunrise’: we can’t prove that we will succeed on the next learning problem encountered, but nevertheless we might believe it for inductive reasons. And we might be right.
The philosopher of science, Karl Pope, proposed a solution to the problem of induction. No evidence can prove that a theory is true, but a theory can be corroborated by testing it repeatedly. The theory that fits the observations is the most rational one, but that still offers no guarantee that the theory will hold in the future. If some test makes the theory fail, then a new theory can be suggested that corrects the old one and embodies it only as an approximation. Falsification is thus a valuable tool for choosing among theories. So we have a model of our solar system which says that the sun rises at regular intervals, and it fits our past observations. It says nothing about the future, but it’s the best theory we’ve got.
I’m reading a book that has me hooked, “Scientific Reasoning: The Bayesian Approach”. I’m only at the beginning, but it talks about induction in Pope’s philosophy and why it’s better to take a Bayesian approach to the problem of inference. It’s very relevant to the discussion above.
Does anyone know this book? Am I biased in my inference about the quality of this book before reading it? 😉
Can I just ask something that may be naive? Even though there is no mathematical proof that the sun will rise tomorrow, isn’t there a physical proof? Astronomers must have calculated how long the sun will last. I realise that this again is based on a theory that may be tentative, but why does mathematics alone have the privilege of defining axioms that must be taken for granted?
Silent Bob:
“why does mathematics alone have the privilege of defining axioms that must be taken for granted?”
Mainly because mathematicians deal with the realm of the imaginary. They get to say “if this happens to be true, and this as well, then that leads to this being true”. Scientists use the same sort of philosophy: “If the laws of the universe are constant, and barring any catastrophic event, then the sun will rise tomorrow”.
This still includes some very fundamental axioms, but they *are* taken for granted without proof. Reality, on the other hand, provides for the (slim) possibility of the laws of physics changing instantaneously, preventing the sun from rising tomorrow. This is why induction can never be the final word in any system.
This whole discussion seems to me to be closely related to Godel’s incompleteness theorem, which states that there is no such thing as a system of logic that does not contain at least one non-provable self-referential statement.
Damn good posting. I read some of your articles and they are really nice.
You guys are talking about Karl Popper, not Karl Pope. And Hume’s point is just that all our knowledge is based on information from time t < Present, and any prediction about the future is making the fundamental assumption that the world at time T > Present is somehow similar to what we’ve observed. So the “physical proof” that the sun will come up tomorrow is exactly false because of Hume’s argument. However, the argument for a “mathematical proof” comes from the fact that mathematics is a formal logical system which many claim to be independent of the observed universe (whether or not this is true comes down to a philosophical choice, at least in my philosophy…). It’s unclear how we can obtain a “mathematical proof” that the sun will come up tomorrow because we must have some relation between our formal mathematical system and observed physical reality. By dirtying our mathematics with observations, we’re again prey to Hume’s argument.
Also, I don’t see how this relates to Godel’s work, except for the fact that both arguments are frequently discussed in popular science books and provide ready analogies for unsubstantiated pseudoscientific arguments made by those who want to appear informed or erudite.
Can someone comment on a thought I have, please? Can abstracting from the physical world provide us with a mathematical model in which a proof that the sun will rise can be formulated? I suppose the problem in this case is the ‘fidelity’ of our model to reality?
Hi Jeff, thanks for the correction. It’s Popper, not Pope! I just want to note that we should not be so strict as to use the word “false” for a theory, just as we can never claim that a theory is “true”. If a theory were “false”, as in entirely “false”, science and technology would never advance, as they would be based on “false” foundations and never make any progress. I suppose that since mankind has made so much progress, perhaps we can say that induction works?
I’m new to learning theory, but it seems one of its main contributions is to provide various candidate formalizations of “bias” and determine to what degree induction is possible given them. Here’s what I think is true so far, in rough order of increasing strength of bias (though I suppose really there’s only a partial order..). I’d appreciate any corrections or comments:
– Assume nothing, and No Free Lunch tells you you lose as much as you win.
– Add “the future is like the past”, ala Hume, weakly formalized as “examples are drawn from some fixed, unknown distribution”.
– If you believe “unsupervised learning is compression”, asymptotically optimal compressors like Lempel-Ziv work here. Alternately, some asymptotically “consistent” Bayesian compression schemes (e.g. predict via Bayesian inference, then encode via e.g. an arithmetic code) also work here, although I don’t know how to bound to what degree a mismatch between the prior and the true distribution screws you.
– If you’re doing supervised learning, and you’re willing to assume further that the real concept comes from your hypothesis space, then you can use PAC machinery to get a notion of “inductability”; if you’re also Bayesian, then PAC Bayesian methods can conceivably help bound how badly off you’ll be given prior mismatch. If you won’t assume the real concept comes from your hypothesis space (“agnostic learning”), I’m unclear on the definition and utility of the resulting generalization bounds.
– If you believe “learning is density estimation” (either unsupervised, or conditional density estimation/regression for supervised), then asymptotic consistency results for estimators help here, with classic notions of bias and variance. Also, sometimes Bayesian inference with a mismatched prior will do okay, as the prior washes out in the limit (see the small sketch after this comment). I can’t seem to find sharp results on convergence rate, though (e.g. tracking KL-divergence between the estimated/inferred distribution and the true distribution).
– Assume more than just the fixed distribution, and perform Bayesian inference; if your assumptions are wrong in some way you can’t characterize, all bets are off, but if they’re correct, you’ll learn quickly. This is of course inapplicable if you can’t write down useful assumptions about the relevant distribution.
Is this a fair characterization? Are there major strength-levels of biases that I haven’t included or others I’ve majorly misunderstood?
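To make one of these levels concrete, here is a toy sketch (my own, not from the comment; the priors and numbers are chosen purely for illustration) of the “prior washes out” point under the fixed, unknown distribution assumption: two Beta-Bernoulli learners that start with opposite priors end up making nearly the same prediction once enough i.i.d. data arrives.

```python
# Toy sketch: i.i.d. Bernoulli data with conjugate Beta priors.  Two learners
# with strongly opposed priors converge to nearly identical posterior
# predictive probabilities as data accumulates ("the prior washes out").
import random

random.seed(0)
true_p = 0.7                                            # unknown to both learners
data = [int(random.random() < true_p) for _ in range(5000)]

def posterior_predictive(alpha, beta, xs):
    """P(next observation = 1) under a Beta(alpha, beta) prior after seeing xs."""
    heads = sum(xs)
    return (alpha + heads) / (alpha + beta + len(xs))

for n in (0, 10, 100, 5000):
    optimist = posterior_predictive(20, 1, data[:n])    # prior strongly favors 1s
    pessimist = posterior_predictive(1, 20, data[:n])   # prior strongly favors 0s
    print(f"n={n:5d}  optimist={optimist:.3f}  pessimist={pessimist:.3f}")
```

Drop the fixed-distribution assumption and this guarantee disappears, which is exactly where the no-free-lunch symmetry at the top of the post takes over.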
“Also, I don’t see how this relates to Godel’s work”
I’d always thought of Hume’s argument as a specific case of the Incompleteness Theorem – the self-referential statement in this case being “The past indicates the future because that’s the way it’s always happened.” Maybe it’s just in the way I’m interpreting it.