DataLink tool helps researchers search, cite and write about data
Elsevier’s new DataLink platform will help scientists search for data, write about it – and get credit for it
By Paige Shaklee, PhD; Kaia Motter; and Elena Zudilova-Seinstra, PhD. Posted on 3 March 2015
Update: DataLink tool now supports data visualization
It is now possible to preview genomic datasets discovered via search. Read about this visualization feature toward the end of this story.
Since the advent of next-generation DNA sequencing, there has been a boom in the amount of genetic data and resources that are publicly available. In addition, there is increased pressure on scientists to make their research more transparent and reproducible by sharing their data. The NIH Genomic Data Sharing Policy states that “sharing genomic data provides opportunities to accelerate research through the power of combining large and information-rich datasets.” This push for sharing has been echoed by funding agencies, as noted in a 2009 Nature article and most recently with the requirement that all NIH-funded researchers deposit their genomics datasets in a public repository.
While researchers do comply with sharing mandates, there is often great reluctance in the genomics community to share data that is not directly associated with a published research article. Their reasons are sound:
- The data may never be found if it is not deposited in the standard database for a particular niche research community.
- Researchers rarely receive any formal credit, such as a citation, for sharing their data.
- There is rarely enough information accompanying the deposited data to make it of true value to other researchers.
We developed the DataLink platform to lower the barriers that make it difficult for genomics researchers to promote and discover data. DataLink consists of:
- A database search engine
- An automatic data citation generator
- A data article writing tool for the Genomics Data journal
- A data visualization tool
DataLink not only helps researchers search for genomic data; it also ensures that they get credit and recognition for sharing their data by providing a means to generate data citations and write data articles about their own data. It is an open resource designed to facilitate data discovery, transparency and reproducibility, and to encourage the broader dissemination of data-driven knowledge through scientific journal publication. It was developed in collaboration with Kitware SAS.
Our goal in creating DataLink is to link the genomics community with the datasets they share, to support standardized data citation principles, and to simplify the writing process, especially for non-standard scientific article publication.
Database search engine
Data that is deposited in public databases is freely available, but many researchers tend to only search the databases they already know about from word of mouth or based on where their nearest colleagues search, missing out on relevant datasets stored elsewhere. Meanwhile, researchers searching for data on the Internet may come across blog posts, discussion boards and content that is not properly curated, and hence may be unreliable and irrelevant.
DataLink narrows the search results, providing access only to datasets stored in several major public databases and to articles indexed by major publishing and archival platforms. Based on the keywords a researcher enters, DataLink searches GenBank, Gene Expression Omnibus and ArrayExpress for genomic and genetic datasets. Alongside these, it returns results for the same keywords from articles published on ScienceDirect and archived by PubMed. In the future, we hope to expand our search engine to include a more comprehensive index of databases relevant to the genomics and genetics community.
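To make the idea of a keyword search restricted to a fixed set of curated repositories concrete, here is a minimal Python sketch. It is not DataLink's implementation: the `Record` type, the `INDEX` list, the `search` function and the accession strings are all hypothetical, and a small in-memory list stands in for the real repositories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str      # repository name, e.g. "GenBank"
    accession: str   # illustrative identifier, not a real accession number
    title: str


# Hypothetical in-memory stand-in for the curated repositories.
INDEX = [
    Record("GenBank", "ACC0001", "Breast cancer BRCA1 variant sequences"),
    Record("Gene Expression Omnibus", "ACC0002",
           "Expression profiling of breast cancer cell lines"),
    Record("ArrayExpress", "ACC0003", "Yeast heat shock response microarray"),
]


def search(keywords, index=INDEX):
    """Return records whose title contains every keyword, grouped by source."""
    terms = [k.lower() for k in keywords.split()]
    grouped = {}
    for rec in index:
        title = rec.title.lower()
        if all(term in title for term in terms):
            grouped.setdefault(rec.source, []).append(rec)
    return grouped


if __name__ == "__main__":
    for source, records in search("breast cancer").items():
        print(source, [r.accession for r in records])
```

Grouping the hits by repository mirrors the way results from each source are presented side by side, so a researcher can see datasets from databases they might not have thought to check.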
Elsevier Genetics and Genomics articles provide the DataLink search engine directly alongside the article on ScienceDirect. Readers can find data relevant to the article by copying key words from the article into the search field.
Automatic data citation generator
Though there are standards for data citation, uptake in the scientific community has been slow. Using DataLink, scientists can create a properly formatted data citation for any dataset of interest with the click of a button, where required information is populated via an API of the corresponding database. Scientists may then edit the automatically generated citation, save it to a running reference list and make notes related to the references on a virtual scratch pad.
Dr. Amy Tang, bioinformatician and bioinformatics trainer in the Functional Genomics group at the EMBL European Bioinformatics Institute, explained the challenge and how automatic generation of citations could help:
It can be very difficult for a busy wet-lab scientist to know what constitutes a useful data citation, especially when there are so many types of identifiers, such as accession numbers, for both old and new data types. In publications, often scientists don't quote accessions (perhaps not knowing that they are essential) or mistype the accession by accident. Having the citation generated automatically will help a lot with these issues.
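As a rough illustration of automatic citation generation, the Python sketch below assembles a citation string from dataset metadata and refuses to produce one when a required element, such as the accession number, is missing. The field names, the citation layout and the `format_data_citation` function are assumptions made for this example; they are not DataLink's actual format or the API of any repository.

```python
# Hypothetical set of elements a data citation needs in this sketch.
REQUIRED = ("authors", "year", "title", "repository", "accession")


def format_data_citation(meta):
    """Build a citation string from a metadata dict, or raise if incomplete."""
    missing = [key for key in REQUIRED if not meta.get(key)]
    if missing:
        raise ValueError("missing citation fields: " + ", ".join(missing))
    authors = "; ".join(meta["authors"])
    return (
        f"{authors} ({meta['year']}). {meta['title']}. "
        f"{meta['repository']}. Accession: {meta['accession']}."
    )


if __name__ == "__main__":
    # Illustrative metadata only; the accession is a placeholder.
    example = {
        "authors": ["Doe J", "Roe R"],
        "year": 2015,
        "title": "Example expression dataset",
        "repository": "Gene Expression Omnibus",
        "accession": "EXAMPLE0001",
    }
    print(format_data_citation(example))
```

Because the accession comes from the metadata record rather than being typed by hand, the two problems described in the quote, omitted and mistyped accessions, are avoided by construction in this sketch.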
Data article writing tool
In various domains, research data is normally shared as supplementary files provided with full-length articles. These files are usually not properly documented, which makes it difficult for other scientists to interpret the data and reuse it. Though there is consensus in the genomics community that data should be shared, many researchers remain reluctant to share theirs, and with reason. By the time they are ready to publish their first research study, many researchers have not yet gleaned all the insights they would like from a dataset, and they fear getting scooped, missing a great discovery, or having their data misunderstood if they share it right away.
Giving researchers the chance to place a stamp on their data and provide context by writing a data article creates a greater incentive to share data. While data search and citation are valuable, data without context is not useful to other researchers. To make the data transparent and the research findings as relevant and reproducible as possible, the data should be accompanied by metadata and details about how the data was acquired, filtered and/or reformatted, how samples were treated, the settings and type of sequencing instrument, and base-level analysis.
To address these problems and simplify the process, DataLink’s data article writing tool enables researchers to fill in a standardized template that requests essential metadata, details on the materials and methods, and any computer code needed to understand a given genomic dataset. Authors are required to have uploaded their data to a public data repository and to include a data citation in the Data in Brief article. Once authors have filled in the required fields, they may either save a draft for themselves or click the submit button. Their Data in Brief article will then be submitted directly for peer review to Genomics Data, an Elsevier open access journal dedicated to facilitating transparency and reusability of genomic datasets.
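The save-or-submit behavior of a template with required fields can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names in `REQUIRED_FIELDS` and the `missing_fields` and `next_action` functions are hypothetical and do not reflect the actual template used by Genomics Data.

```python
# Hypothetical required fields for a data article draft.
REQUIRED_FIELDS = (
    "title",
    "repository",
    "data_citation",
    "materials_and_methods",
)


def missing_fields(draft):
    """List required fields that are absent or blank in the draft dict."""
    return [f for f in REQUIRED_FIELDS if not (draft.get(f) or "").strip()]


def next_action(draft):
    """A draft can always be saved; it can be submitted only when complete."""
    return "save draft" if missing_fields(draft) else "submit for peer review"


if __name__ == "__main__":
    draft = {"title": "Example dataset description", "repository": "GenBank"}
    print(missing_fields(draft))
    print(next_action(draft))
```

Checking completeness before submission keeps the requirement that every article cites data already deposited in a public repository from depending on the author's memory.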
Genomics Data’s Data in Brief articles are peer reviewed by members of the editorial board for thoroughness, clarity and adequate description of the genomic dataset discussed in the article. By writing a Data in Brief article, authors not only have the opportunity to share their data and essential metadata in a thorough way; they are also assured that a description of their data is citable and archived through the publication of a peer-reviewed scientific article.
Dr. Benjamin Haibe-Kains, Assistant Professor in the Department of Medical Biophysics, University of Toronto and an editorial board member and author for Genomics Data, wrote about his experience: “Writing and publishing my Data in Brief paper was quick as I finally could use all the code and the previous results I generated to ensure the relevance and quality of the dataset we used in our collaborative work in breast cancer. As this dataset has already been used by several other research groups, I had a strong incentive to publish more details about the data themselves and the normalization and quality controls I have performed. I believe that my Data in Brief publication will improve the visibility of our dataset and address most of the common questions I received from independent researchers.”
The DataLink writing tool serves an important purpose of educating scientists on how to write data articles and making this new type of research publication recognizable in the scientific community and hence valuable for a researcher’s career. This tool is being piloted for Genomics Data. If successful, it can be expanded to other research domains.
DataLink tool now supports data visualization
It is now possible to preview genomic datasets discovered via search. In addition, authors can now prepare and embed data visualizations directly in their data articles. For instance, they can export an image of a portion of the genome and add the data citation for the genome directly to their data article. Or they can upload relevant phylogenetic tree data, preview how it will be visualized inside the data article, and explore exactly the same interactivity readers will have once the article appears in the Genomics Data journal on ScienceDirect.
Ongoing data initiatives at Elsevier
Recently, Elsevier released its Research Data Policy, encouraging data openness and sharing. Related initiatives include:
- Elsevier database linking: articles that reference datasets in established public repositories link directly to those datasets.
- Open Data: Raw data files are made open access under a CC-BY license for a series of pilot journals at Elsevier.
- Data in Brief journal: Authors write data articles to thoroughly describe their supplementary data files, or data that may never have been otherwise published. All data articles are published Open Access under a CC-BY license and associated data must be publicly available.
- Open data initiative for materials science: Elsevier’s new data-sharing initiative for materials science provides new ways of storing, sharing and accessing research data for this community. It involves 13 materials science journals and was recently featured on the White House Office of Science & Technology blog.
Elsevier Connect Contributors
Dr. Paige Shaklee (@p_shaklee) made her way from studying physics at Colorado School of Mines to nanoscience at TU Delft to biophysics at Leiden University, where she received her PhD. After doing postdoctoral research in Biochemistry at Stanford University, she joined Cell Press in 2011 as the Editor of Trends in Biotechnology. Last year, she joined Elsevier's biochemistry publishing team as a Publisher for the Genomics portfolio. She is based in Cambridge, Massachusetts.
Kaia Motter (@KaiaMotter) is an Executive Publisher on the neuroscience publishing team at Elsevier. She became interested in data as the publisher for the Genetics portfolio. In addition to the DataLink project, Kaia has led numerous initiatives to help researchers make their data more visible and more interactive, including the development of an application to showcase visual data networks. She is based in Cambridge, Massachusetts.
Dr. Elena Zudilova-Seinstra is Senior Content Innovation Manager for Journal & Data Solutions for Elsevier’s Research Applications and Platform group. She joined Elsevier in 2010 as a Senior User Experience Specialist for the User Centered Design group. She holds a PhD in Computer Science and an MSc degree in Technical Engineering from the St. Petersburg State Technical University. Before joining Elsevier, she worked at the University of Amsterdam, SARA Computing and Networking Services and Corning Inc.