Machine Learning (Theory)

7/2/2008

Proprietary Data in Academic Research?

Tags: Machine Learning,Research jl@ 10:36 am

Should results of experiments on proprietary datasets be in the academic research literature?

The arguments I can imagine in the “against” column are:

  1. Experiments are not repeatable. Repeatability in experiments is essential to science because it allows others to compare new methods with old and discover which is better.
  2. It’s unfair. Academics who don’t have insider access to proprietary data are at a substantial disadvantage when competing with others who do.

I’m unsympathetic to argument (2). To me, it looks like there are simply some resource constraints, and these should not prevent research progress. For example, we wouldn’t prevent publishing about particle accelerator experiments by physicists at CERN because physicists at CMU couldn’t run their own experiments.

Argument (1) seems like a real issue.

The arguments for are:

  1. Yes, they are another form of evidence that an algorithm is good. The degree to which they are evidence is less than for publicly repeatable experiments, but greater than nothing.
  2. What if research can only be done in a proprietary setting? It has to be good for society at large to know what works.
  3. Consider the game theory perspective. For example, suppose ICML decides to reject all papers with experiments on proprietary datasets, and suppose KDD decides to consider them as weak evidence. The long-term result may be that research on new topics which is only really doable in companies starts and then grows at KDD.

I consider the arguments for to be stronger than the arguments against, but I’m aware others have other beliefs. I think it would be good to have a policy statement from machine learning conferences in their call for papers, as trends suggest this is becoming a more serious problem in the mid-term future.

7 Comments to “Proprietary Data in Academic Research?”
  1. Ted Dunning says:

    This problem also manifests with data sets that are simply too large to be portable, or real-time environments that are too difficult to replicate exactly. There is really no difference between a data-set that cannot be ported for contractual reasons and one that cannot be ported for pragmatic reasons.

    The fact is, these results are weak evidence but they are also often the most cutting edge results. The choice of the academic community is either to learn from non-replicable results or to not hear about the results at all. I think the former is preferable.

    At Veoh, we have an example of a (non-machine-learning) problem that is pragmatically not portable. We have about 10^9 files that make up our web-site comprising nearly a PB of data. Storing these inexpensively and efficiently is difficult and we have a solution that works for us. It is not particularly feasible to move these files and replicate the traffic patterns, but people designing file systems could plausibly learn from our experience and we would be the better for their learning.

    For many years, the intelligence community had a similar problem. They had masses of data from various kinds of signal intercepts and openly available information, but they couldn’t allow access to this data. Their solution was to produce sample data that could be, more or less, freely distributed and to invent analogous problems to go with the data. Thus we have (or had) the TREC and MUC conferences. That was an excellent solution which is not particularly available as an option for many businesses.

  2. JoSeK says:

    I think Open Access to data should be preferable, but sometimes (for example when working in/for a big company) there is no option other than publishing without granting access to the data or not publishing at all, even when the results shed light on a certain topic. But other times the reasons for using proprietary data are less clear. Some people don’t like to share their datasets, or are even too lazy to upload them to the web (even though very interesting repositories like UCI exist). It could be an interesting policy for ML/DM conferences to require, as a condition of publication, a good reason for not giving access to the data used in the experiments.

  3. I’m running into this now as I review papers that rely on proprietary web search logs. I’m not thrilled about it. But I don’t see the major web search companies releasing this data any time soon–especially not after the AOL fiasco. And no one else has data like this. So I think accepting such research is the best alternative we have, at least for the foreseeable future.

  4. In Criminology data is never shared between researchers because e.g. all studies about recidivism require access to criminal records and other personal information about the inmates/parolees from some state corrections agency…

  5. R says:

    This is not really a new issue in the grand scheme of things. A lot of experimental science is done in such a way that repeating experiments requires significant effort. The obstacle could range from specialized equipment (airframes, earthquake simulators, high-speed impact tests) to experiments on scarce resources such as rare biological samples. In principle, many of these experiments are open, although in practice that is not entirely true. For instance, in a TV interview, one of the Nobel-prize winning scientists working on Bose-Einstein Condensates mentioned that it would take 2 years for a new entrant (not just a layman, but a suitably qualified physicist from a slightly different background) to replicate all of their experimental know-how.
    So, what makes it possible to do science in these settings is the formulation of a scientific theory and hypotheses that transcend the data. If the theory is clear enough and rich enough, then someone without that specific data set can come up with other clever ways to work and contribute in the same space. This makes it possible for a meaningful scientific conversation to take place – and it shouldn’t matter one way or the other what sort of data is being used.

  6. anon says:

    This year KDD required that datasets should be publicly available, or the paper would be downgraded when decisions for acceptance are taken. I think (and the comments above seem to support this), that this requirement is wrong. In too many cases it is impossible to share data, especially when it contains confidential personal information or trade secrets.
    On the other hand, we should discourage people in academia who refuse to share data so as to keep an edge for themselves. Data gathered using public resources should be publicly available, if the person who wants it is willing to make an effort to take it, for example, by sending enough hard disks to the owner of the data.
    Finally, tongue in cheek, I would not worry too much about papers from industry that contain data which cannot be shared: If you go over the papers from NIPS from the last few years you’ll find that they rarely get accepted anyway.

  7. Balaji Krishnapuram says:

    The idea of stopping or downgrading papers from industry just because they can’t share their proprietary data seems counter-productive. I fear that the move to downgrade the papers that use proprietary data is a step in the wrong direction.

    1) While reproducibility is a laudable goal, in academic research publications we have an a-priori expectation of honesty and integrity — without this the whole system would break down. In my opinion, we do not need to insist that every result be entirely reproducible.

    2) I am really concerned about an unfortunate trend in a large number of academic ML papers: a large fraction of the papers at ICML/NIPS/… are traditionally devoted to finding a slightly different solution for a well studied question, showing small improvements on relatively small benchmark datasets (eg UCI). If they were never published, they would not be sorely missed even 5 years down the line. Just think what fraction of the papers published from 2003 are still routinely cited/used! I agree that such papers are useful, and we should continue to encourage the submission of such papers as well. However, I think the field itself really needs a lot more work on formulating new problem abstractions, new methods for application-domain specific issues, etc.

    ICML/NIPS/UAI Papers would benefit substantially if they focused on real life challenges such as those found in several medium to large scale problems that are addressed everyday in commercial research groups at Yahoo, Microsoft, IBM, Google, Siemens etc. Indeed, instead of writing algorithms with the principal aim of writing publications, we should make an effort to turn the system on its head; we should try to solve real problems, and publish when we are successful at it. Society & the ML research community would benefit if such really high-impact papers (which solve real life problems) are given high review scores even if the data is not publicly shared.

    Because of broken reward systems (eg long pub record=>tenure), the community has spent far too much time working on slight rehashes of old problems, and fixating on 0.1% improvements on tiny UCI datasets. Isn’t it time we changed the situation?
