Recently, we went to the national meeting of the American Chemical Society (ACS) in Philadelphia and talked to people about data management. Here’s what we heard.
Too much choice vs. don't box me in
Data management in chemistry is a relatively new topic fraught with questions and disagreement about how it can best be accomplished. It was addressed recently at an all-day session of the Chemical Informatics group at the ACS meeting.
In a program covering all aspects of the management of chemistry data, we heard different sides. Some researchers are frustrated by the plethora of recommendations, requests, mandates and policies around data from institutions, funders and publishers. Faced with a supersized menu of options, they become paralyzed and just want simple, clear guidance on what to do. On the other hand, some researchers are concerned about a lack of flexibility. They worry that if they're given a set of instructions on what to do, development of new techniques will be hampered because they'll have to try to force their data to fit a certain format. Prof. Jonathan Sweedler of the University of Illinois at Urbana-Champaign provided an example of this perspective. The Sweedler Lab does mass spectrometry imaging, where each pixel of an image represents a mass spec run, leading to terabytes of data that no repository wants or knows what to do with. This matches what we heard from other fields during the development of Mendeley Data. A flexible system that can accommodate all types of research is needed, and it's up to the service providers to solve the findability issues.
Quality vs. innovation
A key focus of the day was a discussion of the challenges of curating data, checking it for errors, and associating it with descriptive metadata so it can be found and reused. With some kinds of chemical data, notably crystallographic data, there are fairly robust standards in place that allow for automated checks to confirm file integrity. But in many cases, there isn't even basic agreement on what to name the properties being described. Dr. Ian Bruno, Scientific Software Engineer and Director of Strategic Partnerships at the Cambridge Crystallographic Data Centre, and Dr. Martin Hicks, member of the Board of Directors of the Beilstein-Institut, reported on the importance of standardizing materials properties, which allows databases to be organized, new submissions to be validated, and users to make use of the data.
Other speakers brought up specific requirements to keep pace with innovation. An interesting approach was demonstrated by Mestrelab Research, which creates scientific software. They have built a tool which creates submission-ready validated CIF files from experimental data. This means that the standard format can change as rapidly as it needs to in order to keep pace with innovation in techniques, and the individual researcher doesn't have to keep up-to-date on how to format data for submission (only the data validation and packaging tool needs to be updated). This approach could be scaled up to handle many more types of data, making data more easily findable and reusable without extra work from the researcher, while easing the job of editors.
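To make the idea concrete, here is a minimal sketch in Python of a validation and packaging step that keeps the format rules in one place. It is not Mestrelab's tool and does not produce CIF files; the field names, the `FORMAT_VERSION` value and the `validate` and `package` functions are all hypothetical, chosen only to illustrate the pattern.

```python
# Hypothetical sketch: the format rules live in the tool, not in the
# researcher's head. When the standard changes, only this table and the
# version string need to be updated.
FORMAT_VERSION = "1.0"  # assumed version label, for illustration only

REQUIRED_FIELDS = {
    "compound_name": str,
    "temperature_kelvin": float,
    "instrument": str,
}


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    return problems


def package(record):
    """Validate a record and return a copy tagged with the format version."""
    problems = validate(record)
    if problems:
        raise ValueError("; ".join(problems))
    return {"format_version": FORMAT_VERSION, **record}
```

A researcher would hand the tool a record such as `{"compound_name": "caffeine", "temperature_kelvin": 298.15, "instrument": "example-spectrometer"}` and get back either a packaged record or a list of what needs fixing, without needing to know the current rules.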
Why don’t you share your data?
Of course, these aren't the only issues around data sharing. For example, a survey by Wiley indicated that the main reason researchers don't share data is uncertainty around intellectual property and confidentiality. This result is self-reported, however, and some respondents may well have cited IP issues as a way to avoid saying they just can't be bothered. But it clearly points to a lack of understanding of the value proposition of sharing data, as well as a very real lack of guidance in some areas about IP and confidentiality issues.
In our talk, we presented the “Pyramid of data needs” (see the figure above). We explained that to support research data that is reusable and reproducible, we need to ensure an environment where data is first of all preserved, and next accessible and retrievable. At Elsevier, our tools follow the data lifecycle, from Hivebench to Mendeley Data (for data creation and storage) to data journals and Datasearch (for data publication and data discovery).
Lessons from early web design
The discussion was reminiscent of early debates about how to make it easy to find things on the web. Some people argued for descriptive tags in the metadata of web pages that would allow pages to be categorized and enable machine-readable statements of the relationship of pages to one another. While this would have allowed for advanced queries of the web, in the end, people settled on simple-to-create pages, and Google came along to solve the discovery problem through its PageRank algorithm (which was actually inspired by the practice of academic citation)!
The early design of the web has lessons for the discussion on where to put data, too. The architects of the World Wide Web knew there could be no consensus decision on where to put pages about a given topic. They chose a simple format, HTML, and a simple communications protocol (HTTP) and let the problem of finding pages be solved one layer up. Likewise, no consensus decision on where to store data needs to be made. If data is on the web somewhere in a certain minimal format and retrievable over HTTP, then user choice isn't constrained and the discovery of data can be solved by services such as Elsevier Datasearch.
— William Gunn
Main image: A visual hierarchy of desirable qualities of research data. (Source: “10 aspects of highly effective research data,” Elsevier Connect, December 2015)