Proprietary Data in Academic Research?

Should results of experiments on proprietary datasets be in the academic research literature?

The arguments I can imagine in the “against” column are:

  1. Experiments are not repeatable. Repeatability in experiments is essential to science because it allows others to compare new methods with old and discover which is better.
  2. It’s unfair. Academics who don’t have insider access to proprietary data are at a substantial disadvantage when competing with others who do.

I’m unsympathetic to argument (2). To me, it looks like there are simply some resource constraints, and these should not prevent research progress. For example, we wouldn’t prevent publishing about particle accelerator experiments by physicists at CERN because physicists at CMU couldn’t run their own experiments.

Argument (1) seems like a real issue.

The arguments for are:

  1. Yes, they are another form of evidence that an algorithm is good. The degree to which they are evidence is less than for publicly repeatable experiments, but greater than nothing.
  2. What if research can only be done in a proprietary setting? It has to be good for society at large to know what works.
  3. Consider the game theory perspective. For example, suppose ICML decides to reject all papers with experiments on proprietary datasets, and suppose KDD decides to consider them as weak evidence. The long-term result may be that early research on new topics that is only really doable inside companies starts, and then grows, at KDD.

I consider the arguments for to be stronger than the arguments against, but I’m aware that others believe otherwise. I think it would be good for machine learning conferences to include a policy statement in their calls for papers, as trends suggest this will become a more serious problem in the medium term.