How to create a good data management plan

Questions raised during a recent webinar on the topic ranged from requests for practical tips to concerns about confidentiality and copyright. Presenter Dr. Rob Hooft answers them in this article.


Data management plans are growing in importance in the world of research.

Funders increasingly require these plans when researchers submit grant applications, and the plans are also a useful tool for saving researchers time and effort when running experiments. They add value to the wider scientific community too, because well-organized data provides a great starting point for other researchers to carry out further analysis.

But what are they? Data management plans simply describe how data will be acquired, treated and preserved both during and after a research project.

Recently, Elsevier’s Publishing Campus hosted the webinar “Creating a good research data management plan”, produced in collaboration with the Dutch TechCentre for Life Sciences (DTL), with which Elsevier partners. The webinar addressed questions around data management plans and the FAIR principles for managing data, which DTL was instrumental in developing. In this article, we explore the answers to some of the most popular queries from attendees.

Practical implementation

Q. What's the best way to start making a good plan?

A. Several tools for creating and maintaining data management plans are available online. Two well-known ones are DMPonline and DMPTool. Both are also available as open source if you would like to run them at your own institute. They include templates for the questions that many funders require a plan to answer.

Q. I want to manage my own experiments. I have more than 100 samples and several measurements connected to them. Is there good software to find my own data and, for example, compare some of the collected data?

A. This is very dependent on the research field. In Life Sciences, such tools exist for many of the separate types of measurements that can be done. And in some cases there are special tools that can integrate results from different techniques with each other and with reference data, enabling complex queries and analyses. But the full strength of such integration tools is only realized when different users have very similar goals. In many cases, special adaptations are still necessary for each project.

In the recent past, people from different fields would be using Microsoft Excel or similar software for very varied sets of data containing up to hundreds of samples. Since this does not scale very well or encourage working methods in which all steps taken in filtering and analyzing data are properly recorded, I advise against using such spreadsheets unless they are accompanied by plug-ins facilitating the data acquisition and processing in accordance with the FAIR principles.
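One way to keep a record of every filtering and analysis step, which a spreadsheet does not do on its own, is to script the steps and log each one as it runs. The sketch below is illustrative only: the function name `run_pipeline`, the sample fields and the threshold of 5.0 are hypothetical and not taken from any particular tool.

```python
from typing import Callable, Iterable

Row = dict
Step = tuple[str, Callable[[list[Row]], list[Row]]]


def run_pipeline(rows: Iterable[Row], steps: list[Step]) -> tuple[list[Row], list[dict]]:
    """Apply each named step in order and record what it did to the data."""
    current = list(rows)
    log = []
    for name, func in steps:
        rows_in = len(current)
        current = func(current)
        log.append({"step": name, "rows_in": rows_in, "rows_out": len(current)})
    return current, log


# Hypothetical measurements; a real project would read these from a file.
samples = [
    {"sample_id": "S1", "value": 4.2},
    {"sample_id": "S2", "value": None},
    {"sample_id": "S3", "value": 9.7},
]

# Each step has a human-readable name so the log documents the analysis.
steps = [
    ("drop missing values", lambda rs: [r for r in rs if r["value"] is not None]),
    ("keep values above 5.0", lambda rs: [r for r in rs if r["value"] > 5.0]),
]

result, log = run_pipeline(samples, steps)
for entry in log:
    print(entry)
```

Storing the log next to the output file gives a later reader the order of the steps and how many samples each one removed.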

Q. Why do you want people to use as little metadata as possible? Isn’t metadata good?

A. This is a misunderstanding. Metadata is indeed good, and reusable data requires a lot of it. The confusion comes from the name of the relevant standards: they are called “minimal metadata standards” because they specify the absolute minimum information needed to make a data set reusable.
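The idea of a minimum that every data set must meet can be sketched as a simple completeness check. The field names below are assumptions chosen for illustration; real minimal metadata standards are specific to a research field and define their own required items.

```python
# Hypothetical minimum field set, not taken from any actual standard.
REQUIRED_FIELDS = ("title", "creator", "date_collected", "license", "description")


def missing_metadata(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a metadata record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Example record with one field absent and one field left blank.
record = {
    "title": "Leaf temperature measurements",
    "creator": "A. Researcher",
    "license": "CC0",
    "description": "  ",
}
print(missing_metadata(record))
```

A record that passes such a check meets only the minimum; adding further metadata beyond the required fields remains worthwhile.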

Q. Isn’t this a complicated approach? 

A. Yes, it is, and there is no way around it. Data management has many complexities in itself, and not doing proper data management can be a serious risk for a project and can lower the effectiveness of the (public) funding provided to it. Data management planning reduces those risks, and the FAIR principles can help you with this planning. We are still in the early phases of the transition towards long-term data stewardship, so many aspects of this work are new to all of us. I’m sure it will become easier over time.

Legal issues

Q. How do you manage copyright issues in reusing data? 

A. This is a very difficult question. Data in itself is not subject to copyright. Copyright applies only to works that have been created by a “creative act”, and data is considered to be just facts, which are exempt from copyright. Creating a specific data collection could constitute a creative act, in which case copyright could apply to the collection. However, this is subject to a lot of controversy. In any case, the best way to deal with copyright issues when reusing data is to check the license. As I say in the webinar, I consider CC0 the best license to apply to research data sets that do not have other legal restrictions on distribution.

Q. How do we deal with data that has license restrictions? If it has a non-commercial-use clause is it safer not to use this data?

A. If you are looking to commercialize the results of your research, or explicitly want others to be able to do so, you have to pay special attention when basing your research on other people's data that is restricted to non-commercial use. If you do use such data, your own results may require similar license restrictions as a consequence. Contact the legal specialists at your institute!

Q. What can you do when your research subjects are very vulnerable and have been guaranteed anonymity and confidentiality?

A. For any data about humans, privacy is a concern. In some cases even the lives of the people involved may be at risk. Access to such data sets will be heavily restricted. Making data FAIR explicitly does not require data to be open: the “A” for “Accessible” requires that you describe the conditions under which access can be obtained. If there are good reasons to keep the data completely closed, that is acceptable.

In most cases, aggregated data, with a level of aggregation suitable to cover the risk of leaking any identities, could still be made open.
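As a rough illustration of aggregation, the sketch below counts records per group and withholds the count for any group too small to publish safely. The function name `aggregate_counts`, the `region` field and the threshold of 5 are assumptions; the appropriate level of aggregation depends on the data set, and suppressing small groups alone does not guarantee that no identity can leak.

```python
from collections import Counter


def aggregate_counts(records, key, min_count=5):
    """Count records per group; groups smaller than min_count are reported as None."""
    counts = Counter(record[key] for record in records)
    return {group: (n if n >= min_count else None) for group, n in counts.items()}


# Hypothetical participant records: six in one region, two in another.
participants = [{"region": "north"}] * 6 + [{"region": "south"}] * 2
summary = aggregate_counts(participants, "region")
print(summary)
```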

The relationship between data and papers

Q. Please clarify how to make conclusions from a paper also available as digital data, in addition to the narrative form.

A. At DTL and Elsevier we are working with Linked Data technology. Linked Data can be used to express properties and relationships in ways that are almost as expressive as human text, but much easier for computers to interpret. An example from the Life Sciences is “malaria is transmitted by mosquitoes”, which could be the conclusion from a paper. If you want to know more about how this can be done, read about the concept of “nanopublications” at Nanopub.org.
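In Linked Data, a statement such as the malaria example is commonly written as a subject, a predicate and an object, each identified by an IRI. The sketch below shows that shape and prints it as one N-Triples line. The `http://example.org/` identifiers are placeholders rather than terms from a real vocabulary, and the sketch covers only the bare assertion, not the provenance and publication information that accompany it in a nanopublication.

```python
from typing import NamedTuple

# Hypothetical namespace: these identifiers are placeholders for illustration.
EX = "http://example.org/"


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialize one triple of IRIs as a single N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# "malaria is transmitted by mosquitoes" as subject, predicate, object.
assertion = Triple(EX + "malaria", EX + "isTransmittedBy", EX + "mosquito")
line = to_ntriples(assertion)
print(line)
```

In practice the placeholder identifiers would be replaced by terms from shared vocabularies, so that different data sets refer to the same concept in the same way.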

Q. What do you think about publishing the data as supplementary material to a paper?

A. Publishing data as supplementary material is only a partial solution. A paper makes data accessible under well-defined conditions, but only if the paper is accepted. As I discuss in the webinar, the data you collected can also be valuable if your hypothesis turns out to be untrue, or even if the experiment fails and no paper results from it. Supplementary data to a paper is also not easy to find, especially for computers. Of course, it is important to be able to find the related data set from the paper, and to find the paper from the data in any repository. To make this possible, publishers are working hard together on a system to associate papers with the data sets they reference.

Q. How can I make a data management plan for qualitative data?

A. All the FAIR principles apply to qualitative data as well. The most important difference from quantitative data lies in the form in which it can be made understandable to computers. If there is no standard exchange format for the kind of data that you are collecting (e.g. until now you have been describing it in text in your papers), it is a good idea to explore what you can do with Linked Data technology to describe qualitative properties of your subjects and the associations between them. Again, you can find more information at Nanopub.org.

International research

Q. Should research in Life Sciences be conducted in worldwide collaborations or nationally?

A. All research is becoming more and more international. I do not think that national barriers should play a significant role in the access to data, except where this is mandated by legal restrictions. Especially when researching cause and remedy for rare diseases, international data sharing is essential, so we must find solutions to privacy issues. Many of the reference data sources are also maintained internationally. The cost of maintaining these data resources is the subject of international discussion. It is clear that the international community should pay for this together, and research infrastructures and funders are indeed discussing different models for funding.

Q. Is there an all-in-one platform (cloud-based, perhaps) to help researchers in developing countries, where institutional technical support is weak, with backup repositories, data sharing and other FAIR-related issues?

A. There is no such single integrated platform at this moment. There are efforts in both the commercial and academic worlds to integrate solutions more and more. Cloud solutions are indeed the current form in which this takes shape. None of this is cheap, however, and remote access to large volumes of data requires fast connections which can be exceedingly expensive. Dealing with large data sets is increasingly common for research projects. I do think that more and more access to compute power and data storage will be commoditized, and using it will be as easy as using electricity. I do not know how fast such developments will take place, and certainly not how this will differentially affect the developing world.
