10 aspects of highly effective research data
Good research data management makes data reusable
By Anita de Waard, Helena Cousijn, PhD, and IJsbrand Jan Aalbersberg, PhD Posted on 11 December 2015
Research articles have traditionally been seen as the most important output of scientific research; they have been around for 350 years, starting with the Royal Society’s journal in 1665. However, with the increased digitization of research along with new possibilities to store and preserve research data, there is a growing awareness of the importance of research data and in particular the importance of sharing research data to allow reuse.
Why data sharing?
Funding bodies are actively taking steps to encourage data sharing. As part of Horizon 2020, the European Commission has launched its Open Access to Data Pilot, where researchers are asked to share data in several core areas, unless they have a reason to opt out. America’s major funding agencies are pursuing similar efforts: the National Institutes of Health has announced that “NIH intends to make public access to digital scientific data the standard for all NIH-funded research,” and the National Science Foundation expects researchers “to share … the primary data, samples, physical collections and other supporting materials created or gathered in the course of work.”
There are several reasons sharing data is very important for the enhancement of science.
First, it makes research more verifiable and replicable. Over recent years there have been examples where research was falsified or was simply not replicable. If research data were shared, these problems would come to light much earlier and would therefore have a less damaging impact. Second, researchers often acquire the same data, which would not be necessary if data were publicly available. A lot of money could be saved and used for conducting novel research, which is an important motivation for funders, institutions and researchers alike. Third, in cases where researchers have acquired similar datasets, it can be very valuable to combine these datasets. This increases the statistical power of the analyses and thereby the chances of detecting genuine effects. Finally, sharing data allows other researchers from both the same and other fields to apply their expertise and carry out new analyses, thereby fostering multidisciplinary research.
10 aspects of highly effective data
The main goal of data sharing is that other researchers should be able to reuse the data. Therefore, reusability should constantly be taken into account when designing systems that store and create data. All parties working with research data should think about handling the data in a way that makes it optimally usable downstream. We believe that data reuse could be optimized by aligning the 10 aspects of data listed below. This pyramid – loosely modeled on Maslow’s hierarchy of human needs – can be seen as an extension of the FAIR Data Principles (data should be Findable, Accessible, Interoperable and Reusable) and can function as a roadmap for the development of better data management processes and systems throughout the data lifecycle. We will discuss each aspect in turn.
The first step in the hierarchy of research data needs is that data that have been acquired need to be stored. At this moment, many research groups do not have clearly defined ways of making sure their data are stored somewhere, making it difficult for researchers within and outside the group to reuse the data for purposes other than the initial experiment. This problem is increasingly recognized by research institutes and funders, who have introduced data management plans to ensure that research groups define the ways to store their datasets before their experiments. New technology such as electronic lab notebooks is a viable option for storing the observations and results of experiments. Both domain-specific and general data repositories sometimes allow researchers to store their data without making these public, which provides a good way for researchers to store their data for the duration of the research project.
A closely related point is that data need to be preserved for the long term. Once research data is stored, it then needs to be preserved in a format-independent manner or risk data obsolescence. Information can only be valuable when it is in a format we can use, and few of us have the time to dig through old archives to recover, reprocess and digitize data. Making sure research data is archived correctly and will be saved for a long period of time is very important. Fortunately, there are organizations, such as Data Archiving and Networked Services (DANS) in the Netherlands, that provide information about data preservation and an infrastructure for it. Also, data repositories can play an important role here, especially when they have solid dark archives in place, which guarantee that data will not be lost even if the data repository ceases to exist.
Even when data is stored and preserved, this does not necessarily mean it is automatically accessible. Both researchers and machines may want to access the data, for example, for meta-analyses or other kinds of re-use. Researchers are increasingly being required by their institution or funder to make their data accessible, which has caused researchers to start thinking about solutions. Luckily, there are a number of different ways researchers can make their data accessible. They can do this either by depositing their data in a public repository, or by using a data sharing system such as Mendeley Data, where researchers create private data sharing spaces that can be opened to larger communities or the wider public. At Elsevier, we recently launched an Open Data Pilot, where we make raw research data (as submitted with an article) openly accessible alongside the article for any web user. With this feature, storage, preservation, accessibility and discoverability (see below) are all covered. Researchers can submit their raw research data as a supplementary file, and this file will then be made available under a CC-BY license. This requires little extra work from authors and is therefore an easy way to make data accessible.
Even if data are stored, preserved and in principle accessible, this is not very worthwhile if the data cannot be discovered by others. Whereas finding scientific papers is now a very straightforward process, this is not yet the case for research data. The discoverability of data can be enhanced via the research article but also independently. Regarding the former, an important way to make data more discoverable is to link articles to the data sets these articles are based on. Both Elsevier and other publishers support various mechanisms to set up such links, for instance, through inclusion of data DOIs or data accession numbers, which automatically link to associated data in public databases. When the data location is not yet known at the time of publication, Elsevier collaborates with external data repositories to automatically add the logo of the database next to the article post-publication, which functions as a deep link to the dataset (deposited by the author of the article or a data curator). In addition, recent funding proposals encourage the development of data search engines to make data independently searchable; initiatives such as the National Data Service and the Data Discovery Index aim to provide a data “discovery layer” over research data. In a project co-funded by a National Science Foundation EAGER Grant, Elsevier is working on a data search pilot with the Carnegie Mellon School of Computer Science to develop superior ways to access and query tabular content extracted from articles and imported from research databases.
Data citations are very important for two reasons: they provide a way to track, record and report on data submissions and reuse, and they ensure that researchers get credit for their work. One of the barriers to data sharing has been that it requires extra work from researchers for little reward. Data citations have the potential to change that because they can be easily incorporated in the current reward system based on article citations. Therefore, researchers should think about providing their data with a unique, persistent and resolvable ID, for which in some cases accession numbers can be used. However, the best example of a unique persistent identifier is the Digital Object Identifier, by which both articles and data can be identified. In addition, FORCE11 has developed a set of principles to describe how data should be cited.
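As a minimal sketch of what "resolvable" means in practice, the Python snippet below turns a DOI into a link through the public doi.org resolver and assembles a simple data citation. The function names, the order of the citation fields and the sample DOI are assumptions made for illustration; they are not a format prescribed by the citation principles mentioned above.

```python
import re

# A DOI starts with "10.", a numeric registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ link for a DOI (hypothetical helper)."""
    cleaned = doi.strip()
    # Strip prefixes that authors commonly paste along with the DOI.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a DOI: {doi!r}")
    return f"https://doi.org/{cleaned}"


def format_data_citation(creators, year, title, repository, doi):
    """Build a one-line data citation; the field order is an assumption."""
    return f"{creators} ({year}). {title}. {repository}. {doi_to_url(doi)}"


# The DOI below is a made-up placeholder, not a real dataset.
citation = format_data_citation(
    "Doe, J.", 2015, "Example dataset", "Example Repository",
    "doi:10.1234/example.dataset",
)
print(citation)
```

Because the identifier is part of the citation itself, a reader or a machine can follow it to the dataset, and citation counts for the data can be collected in the same way as for articles.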
To enable data to be reused, it needs to be clear which units of measurement were used, how the data was collected and which abbreviations and parameters are used. Data provenance is crucial for comprehension. Preferably, proper metadata are added right at the point of storing the data. Which metadata need to be added will differ between disciplines, but the more elaborate the metadata, the greater the comprehensibility will be. Publishers can help here, and several publishers now publish dedicated data journals, such as Elsevier’s Data in Brief. In these data journals, scientists can provide a thorough description of their datasets, which makes it easier for other researchers to understand the data, the process used to capture the data, and any anomalies in the data (or in the capturing process) that a re-user of the data should be aware of, supporting proper data reuse. For data published within the article, we have developed a suite of tools to improve data comprehension such as in-article data visualizations, like interactive plots. Here we take author-submitted data and present it as a plot that readers can hover over to see the value of a data point right from the plot, or switch from a graphical view to a tabular view to inspect the data in greater detail.
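To make the idea of adding metadata at the point of storage concrete, here is a small Python sketch of a metadata record and a check that flags gaps before a dataset is stored. The class names, field names and sample values are hypothetical and chosen for illustration only; real disciplines and repositories define their own metadata schemas.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    unit: str          # unit of measurement, e.g. "K"; empty if unknown
    description: str


@dataclass
class DatasetMetadata:
    title: str
    collection_method: str                              # how the data was collected
    abbreviations: dict = field(default_factory=dict)   # abbreviation -> meaning
    columns: list = field(default_factory=list)


def missing_metadata(meta: DatasetMetadata) -> list:
    """List the gaps that would make the dataset hard to understand."""
    problems = []
    if not meta.collection_method.strip():
        problems.append("collection method not described")
    for col in meta.columns:
        if not col.unit.strip():
            problems.append(f"column '{col.name}' has no unit")
    return problems


# Hypothetical record: the second column lacks a unit.
meta = DatasetMetadata(
    title="Example measurements",
    collection_method="Measured with a calibrated sensor, one reading per minute",
    abbreviations={"T": "temperature"},
    columns=[
        Column("T", "K", "sample temperature"),
        Column("R", "", "sample resistance"),
    ],
)
print(missing_metadata(meta))
```

Running such a check when the data is first stored is cheaper than reconstructing units and collection details later, when the person who ran the experiment may no longer be available.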
While it is very common for research articles to be peer reviewed, this is still quite uncommon for research data. However, it is an important step when it comes to quality control and trustworthiness of data. Publishers can also play a role here because they have the procedures in place to carry out the review process. Peer review can make the difference between data that is just posted and data that is published (and thus can be trusted). In many cases, datasets are shared by posting them through the web, but data that have gone through the peer review process can be published. When looking at current practices, there are different degrees of peer review. In some cases, a dataset might be manually checked for proper formatting according to discipline-specific standards before being included in the data repository. In other cases, image data might be automatically checked for manipulation before inclusion in an article. In still other cases, the data might be validated for having a proper description attached as metadata – with which the data can be fully understood and re-used. In Elsevier’s Open Data Pilot, reviewers are asked to check that the submitted files are raw data that can be parsed and are commonly used within the relevant domain; for data journals, data are more thoroughly checked.
Reproducibility of research results is a big concern for science. To increase the credibility of research results, a Reproducibility Initiative was introduced to validate (for a fee) key experimental results via independent replication. Irreproducibility often originates from elements missing from the research data that are needed to achieve the same research results. For example, resources (e.g., antibodies, model organisms, and software) reported in the biomedical literature often lack sufficient detail to enable reproducibility or reuse. The industry is taking this very seriously, with various activities helping to address this need. Elsevier has contributed to the FORCE11 Resource Identification Initiative, which aims to enable resource identification within the biomedical literature through a pilot study, promoting the use of unique Research Resource Identifiers (RRIDs). The Research Data Alliance (RDA) also has an interest group to address reproducibility.
The key benefit for the wider research community of having research data shared is the ability to reuse this data. Only when research data is sufficiently trustworthy and reproducible will other researchers reuse the data. This may be to enlarge a sample or to use information in ways it may not originally have been intended for. It is therefore recommended to allow a user license to be attached to datasets at the very first step of data sharing: at the time of storage and preservation. This will enable any user to clearly understand what they can and cannot do with the data, and can also help ensure they give researchers and data creators the appropriate credit. A variety of user licenses are available, the most common being Creative Commons licenses.
All the steps and initiatives described here should ultimately lead to this goal: facilitating reuse to make research more reproducible and efficient.
We believe that it is important to integrate these nine aspects of “highly effective research data.” For instance, data should be preserved so that it can be reused. To be citable, it needs to be accessible. But also, in building systems for data reuse or data citation, the practices of current systems for storing and sharing data need to be taken into account. These nine layers and 10th integration step are intended as a guiding principle by which research data management practices can be ordered and checked, rather than as a prescription for perfect performance.
This hierarchy of aspects for effective research data is a suggested practical outline for handling research data and proposes a way of thinking about aspects such as data storage and annotation as a connected set of stages which are highly interdependent. Our suggested pyramid complements the pyramid presented by Opportunities for Data Exchange (ODE), as the latter focuses on the different manifestation forms that research data can have.
Creating an efficient, effective and sustainable data ecosystem requires collaboration among all parties involved in the creation, storage, retrieval and use of this data: researchers, institutions, government offices and funders as well as publishers and software developers. Cross-stakeholder groups (such as FORCE11, the RDA, the NDS and other national and international groups) that bring all of these parties together are essential to setting the pace of change towards better sharing of data and methods, more transparency, and a higher-value, more effective way of scientific communication. We are actively supporting and endorsing such groups, and we look forward to participating in them and contributing to future discussions to enable an ecosystem of effective data management to support important new discoveries and insights in sciences and humanities.
Elsevier Connect Contributors
Anita de Waard (@anitadewaard) has a degree in low-temperature physics from Leiden University in The Netherlands and worked in Moscow before joining Elsevier as a physics publisher in 1988. Since 1997, she has worked on bridging the gap between science publishing and computational and information technologies, collaborating with academic groups in Europe and the US. Her past accomplishments include working on a semantic model for the research paper and cofounding the interdisciplinary member organization FORCE11: The Future of Research Communications and E-Science. Since 2006, de Waard has been conducting research through the University of Utrecht on discourse analysis of biological text, with an emphasis on finding key rhetorical components, offering possible applications in the fields of hypothesis detection and automated copy editing tools. For her current role as VP of Research Data Collaborations at Elsevier, Anita is developing cross-disciplinary frameworks for sharing data and tools to store, share and search experimental outputs.
Dr. Helena Cousijn obtained a PhD in neuroscience from the University of Oxford, where she developed a strong interest in research data. Having worked with various kinds of data and on several data-related challenges, she is now the Product Manager for Research Data at Elsevier. In this role, she is responsible for finding solutions to help researchers store, share, discover and use data. Helena is based in Amsterdam.
Dr. IJsbrand Jan Aalbersberg is Senior VP Research Integrity for Elsevier. After joining the company in 1997, he served as VP of Technology at Elsevier Engineering Information from 1999 to 2002. As Technology Director in Elsevier Science & Technology from 2002 to 2005, he was one of the initiators of Scopus, responsible for its publishing-technology connection. In 2009 he started to focus on new publishing formats, led the Content Innovation and Article of the Future activities at Elsevier, and launched a number of initiatives related to research data. His current position, which he has held since 2015, focuses on the integrity of both the content and the products that Elsevier offers to the researcher.
Dr. Aalbersberg holds a PhD in theoretical computer science from Leiden University in the Netherlands. He is based in Amsterdam.