“Science” has many meanings, but one common meaning is “the scientific method”, a principled way of investigating the world using the following steps:
1. Form a hypothesis about the world.
2. Use the hypothesis to make predictions.
3. Run experiments to confirm or disprove the predictions.
The ordering of these steps is very important to the scientific method. In particular, predictions must be made before experiments are run.
Given that we all believe in the scientific method of investigation, it may be surprising to learn that cheating is very common. This happens for many reasons, some innocent and some not.
- Drug studies. Pharmaceutical companies make predictions about the effects of their drugs and then conduct blind clinical studies to determine those effects. Unfortunately, they have also been caught using some of the more advanced techniques for cheating here, including “reprobleming”, “data set selection”, and probably “overfitting by review”. It isn’t too surprising to observe this: when the testers of a drug have $10^9 or more riding on the outcome, the temptation to make the outcome “right” is extreme.
- Wrong experiments. When conducting experiments on some new phenomenon, it is common for the experimental apparatus to simply not work right. In that setting, throwing out the “bad data” can make the results much cleaner… or it can simply be cheating. Millikan did this in the ‘oil drop’ experiment which measured the charge of the electron.
Done right, allowing some kinds of “cheating” may be helpful to the progress of science, since we can more quickly find the truth about the world. Done wrong, it results in modern nightmares like painkillers that cause heart attacks. (Of course, the more common outcome is that the drug’s effectiveness is simply overstated.)
A basic question is “How do you do it right?” And a basic answer is “With prediction theory bounds.” These prediction bounds have a number of things in common:
- They assume that the data is drawn independently and identically. This is well suited to experimental situations, where experimenters work very hard to make different experiments independent. In fact, it is a better fit there than in typical machine learning applications, where independence of the data is more questionable or simply false.
- They make no assumption about the distribution that the data is drawn from. This is important for experimental testing of predictions because the distribution that observations are expected to come from is a part of the theory under test.
These two properties form an ‘equivalence class’ over different mathematical bounds, where each bound can be trusted to an equivalent degree. Inside this equivalence class there are several bounds that may be helpful in determining whether deviations from the scientific method are reasonable or not.
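As a concrete illustration of a bound with both properties, the simplest such statement is a standard Hoeffding-style test set bound: for a fixed hypothesis h evaluated on m i.i.d. test examples, with probability at least 1 − δ over the draw of the test set,

```latex
e(h) \;\le\; \hat{e}(h) + \sqrt{\frac{\ln(1/\delta)}{2m}}
```

where e(h) is the true error rate and ê(h) is the observed test error. Nothing is assumed about the underlying distribution beyond the independent draws.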
- The most basic test set bound corresponds to the scientific method above. (A small numerical sketch of computing it appears after this list.)
- The Occam’s Razor bound allows a careful reordering of steps (1), (2), and (3). More “interesting” bounds like the VC-bound and the PAC-Bayes bound allow more radical alterations of these steps. Several are discussed here.
- The Sample Compression bound allows careful disposal of some datapoints.
- Progressive Validation bounds (such as here, here or here) allow hypotheses to be safely reformulated in arbitrary ways as experiments progress.
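For the basic test set bound mentioned above, a minimal sketch of the computation looks like the following (Python, using the exact binomial tail; the function name and the numbers in the example are only illustrative):

```python
from scipy.stats import binom

def test_set_bound(errors, m, delta=0.05, tol=1e-8):
    """Upper bound on the true error rate, holding with probability >= 1 - delta,
    after observing `errors` mistakes on `m` i.i.d. test examples.

    The bound is found by inverting the binomial tail: the largest error rate p
    for which seeing `errors` or fewer mistakes still has probability >= delta.
    """
    lo, hi = errors / m, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binom.cdf(errors, m, mid) >= delta:
            lo = mid  # the observed count is still plausible at rate mid, so the bound is larger
        else:
            hi = mid
    return hi

# Example: 12 errors on 1000 test examples gives an empirical error of 0.012,
# but the 95%-confidence upper bound on the true error is roughly 0.019.
print(test_set_bound(12, 1000, delta=0.05))
```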
Scientific experimenters looking for a little extra flexibility in the scientific method may find these approaches useful. (And if they don’t, maybe there is another bound in this equivalence class that needs to be worked out.)
If you haven’t seen it, you might be interested in this article, which has been making the rounds lately. Although it totally oversells the result, it’s quite interesting. For example, if several groups test for a result, and only one of them gets a result that is “statistically significant,” then often that result will be hyped and the others ignored, whereas merging all the data may not show a significant result.
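A rough calculation shows how easily this happens: if k independent groups each test a nonexistent effect at significance level α, the chance that at least one of them sees something “statistically significant” is

```latex
\Pr[\text{at least one ``significant'' result}] \;=\; 1 - (1-\alpha)^k,
\qquad \text{e.g. } 1 - 0.95^{10} \approx 0.40 .
```

So with ten groups testing at the usual 0.05 level, a hyped “discovery” shows up about 40% of the time even when there is nothing to find.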
Of course, Bayesians have another take on how to interpret scientific results (which I’d like to know more about). It’s been around for quite a while (e.g., Jaynes’ book is entitled Probability Theory: The Logic of Science). It makes sense to me that you should choose the hypothesis that is best supported by the data, without restricting it to hypotheses stated prior to the experiment.
I liked the article—thanks. The example you give is another version of ‘overfitting by review’.
I believe Bayesians must state at least a prior before doing an experiment if they want to get a sane posterior. It’s always bothered me that two reasonable people can disagree about what a good prior is. No agreement about the prior implies no agreement about the posterior, and that can imply many practical difficulties. Maybe it’s ok to have explicit disagreements of this sort, but it would be a significant difference from what we have now.
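As a small hypothetical sketch of how prior disagreement propagates (the priors and data below are invented purely for illustration), consider two analysts updating on identical coin-flip-style data:

```python
from scipy.stats import beta

# Both analysts see the same data: 7 successes in 10 i.i.d. Bernoulli trials.
successes, failures = 7, 3

# Analyst A starts from a uniform Beta(1, 1) prior; Analyst B from a
# sceptical Beta(2, 18) prior (both choices are merely illustrative).
posterior_a = beta(1 + successes, 1 + failures)    # Beta(8, 4)
posterior_b = beta(2 + successes, 18 + failures)   # Beta(9, 21)

# Posterior probability each assigns to "the success rate exceeds 1/2":
print(1 - posterior_a.cdf(0.5))   # about 0.89
print(1 - posterior_b.cdf(0.5))   # about 0.01
```

Same data, reasonable-looking priors, and conclusions on opposite sides of the question.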