“Files don’t just disappear.”
“They do if you drop them down an elevator shaft.”
Data Entry Guy and George in “Dead Like Me”
I recently watched a presentation and read a paper about archives for scientific data. There are many arguments for data archives, irrespective of the discipline. For example, Crystal (2004) gives a nice overview of the advantages for science: archiving allows tests of new hypotheses, the use of new methods, and new explorations of the data; it serves the public; it makes the paper more ‘complete’; and it increases impact. Also, many disciplines see sharing data as an ethical obligation, so that other scientists can check the analyses.
So, normally, data-sharing should be a no-brainer.
Challenges of getting researchers to use data archives
However, while data sharing would be ethical and beneficial to science, there is still a lot of resistance. What I found interesting were the challenges of getting researchers to use data archives, both those that came up in the talk and those that occurred to me afterwards. Despite all the advantages and ethical obligations, few scientists use data archives or are willing to share data if asked. And yes, sometimes there are reasons, for example:
- science is incredibly competitive: Scientists might be afraid that other scientists working on the same questions (= direct competition for jobs and funding) will use the data to advance their scientific careers. Not necessarily by plagiarizing, but by using the data to inspire/fuel their next research projects. Thus data sharing might hamper their own careers.
- accidental mistakes could be discovered: Data analysis is highly complex. The correct behavior when mistakes are spotted is to correct the paper or, in cases where the results now lead to different conclusions, to retract it. Given the ‘publish or perish’ nature of science today, some scientists might not want to have their data checked after publication. It’s an “I don’t know, I don’t wanna know, it’s already outta there” attitude that is beneficial for the scientist in the short run (papers +1), but damaging to science in the long run, esp. for all those (including the original scientist) who base their work on faulty data/analyses.
- additional work: To share the data, you have to annotate it and use certain standards. You might know what the data means, but that does not mean that anyone else understands it. Given the … idiosyncratic workflows of many scientists, this takes additional effort or a change in workflow. Not something many researchers working under high pressure are that willing to do.
The presenter was arguing for the institute he was working for as the place to archive the data. It was an effort to make that institute the data repository for psychological data in Germany. In principle, that’s a good idea, esp. for the institute itself, as it ensures its survival. However, I am very skeptical that this will work.
Personally, I think journals are the way to go.
Data should be shared with the journals, who can then share it with other researchers
The only way I see to ensure data sharing and documentation is to make it a part of the publication process. Scientists want (and need) to publish, so journals have some leverage. Journals can treat access to the data as part of their subscription service and as part of good scientific practice. Data sharing should be part of the peer review, and there should be a written statement that all data of the project are shared, not only the variables that led to significant results. Sure, you have to anonymize the data in psychology, given that you usually work with humans, but with the exception of rare cases (e.g., the only 56-year-old male psychology student in a small city) that should not be a problem.
The advantage would be that data sharing does not first come up months or years after publication, when another scientist asks for the data. Journals would handle access to the data, which would ensure easy availability. Data manipulation or fabrication might also become easier to spot. Long-term availability would be ensured as well: as long as the journal exists, so does the data. And if a journal folds, a professional organization could keep the issues and data as an archive. No need to trust individual institutes. Libraries could host (mirror) local copies including the data.
Thus, for long-term availability of data, I really hope that journals will soon make it a default requirement for authors to provide the actual data. Journals have this leverage, and science will benefit from it in the long run.
Crystal, J. D. (2004). Data archiving in animal experimentation: Merits, challenges, and a case study. Behavior Research Methods, Instruments, & Computers, 36(4), 656–660.