Researchers are now working with more data than ever before, especially in the cognitive computing field. Constant advances in technology mean that the enormous datasets involved just keep getting bigger, potentially costing more to generate and store. This can be particularly challenging for researchers with limited resources.
The answer lies in open access datasets. The sharing of data is absolutely essential for the research ecosystem for two key reasons, according to Viviane Clay, a PhD student in Computational Cognition at Osnabrück University in Germany:
The first is for collaboration and moving the field forward together in a more efficient way. And the second is all about responsibility and accountability — to be able to show what you did and enable other people to check back on it so that not everything is behind closed curtains and you're not simply showing the most significant results at the end in your paper.
Another key benefit of data sharing is the ability to conserve resources. Generating large datasets takes up a lot of resources in the form of time and computing power, but if data is shared among researchers, no one else has to dedicate these resources and use up all that energy again. This is especially important for research institutions that don't have huge computational resources.
Viviane, whose research centers on making machine learning and computer vision algorithms learn in ways that more closely resemble human learning, emphasizes the need for accountability and why it's so important for the research community:
Because I use a lot of open access code and data myself in various projects, I just feel like sharing my work is a good way to give back to the community. Sharing data and code feels like part of my responsibility, as I'm accountable for what I publish.
I just want people to be able to go and look at my data themselves, and if they have doubts about the results, they can check on it. If the data can convince them, that's great, but if they find something where I made a mistake, that's even better. I just want people to be able to trust me and my results. Making everything accessible helps me to do that.
The Mendeley Data FAIRest Datasets Award is presented to researchers who publish their datasets on Mendeley Data and make them available to others in a way that echoes the FAIR Data Principles: Findable, Accessible, Interoperable and Reusable.
The majority of Viviane's award-winning dataset is made up of trained models of artificial neural networks. This training takes a lot of computational effort, with some of the models trained for almost a month. The dataset also features labeled data in the form of hand-labeled images. The labels reflect what objects are shown in the images, which can then be used to test what’s encoded in the neural networks and carry out different experiments with them. Viviane explains:
I use neural networks, and then I look at what they learn under different conditions. And what I specifically look at is how this is influenced if the network learns through interaction. A lot of artificial neural networks are just trained with a static dataset, but I'm trying to see if it helps to train a network in an actual environment, interacting with the world. I look at what they learn and how that's different to a network trained in the conventional way.
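Testing what's encoded in a trained network with hand-labeled images, as described above, is often done with a simple "probe": a lightweight classifier fitted on the network's internal representations. The sketch below is a hypothetical illustration of that idea using synthetic stand-in data, not Viviane's actual code or dataset:

```python
import numpy as np

# Hypothetical illustration: probe what a trained network encodes by
# fitting a simple linear classifier on its embeddings of labeled images.
rng = np.random.default_rng(0)

# Stand-in for embeddings produced by a trained network (100 images, 16-dim).
embeddings = rng.normal(size=(100, 16))
# Stand-in for hand-assigned object labels (here, correlated with one
# embedding dimension so the label is decodable).
labels = (embeddings[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

# Least-squares linear probe: if the label information is present in the
# embeddings, the probe's accuracy will be well above chance (0.5).
X = np.hstack([embeddings, np.ones((100, 1))])     # add a bias column
w, *_ = np.linalg.lstsq(X, 2 * labels - 1, rcond=None)
predictions = (X @ w > 0).astype(int)
accuracy = (predictions == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

The same probe can be fitted on embeddings from networks trained under different conditions (static dataset vs. interactive environment) and the accuracies compared, which is one common way to quantify differences in what the networks have learned.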
This research could help change how we train neural networks, making them more reliable. There are various real-world applications for this, including self-driving cars, Viviane says:
If you have image recognition trained in the conventional way, it can be tricked pretty easily. It can be tricked to just not see a stop sign, for example, and obviously that can be very dangerous. If we can train the AI in a more stable and robust way so it doesn't make those mistakes anymore, then that would be a really good application.
Viviane is very excited to have won the Mendeley Data FAIRest Datasets Award for her efforts. Despite having won awards in the past, she says this one is particularly special and could help her to win funding for future research:
I feel like the award didn't build on stuff I did before or other advantages that I had. I felt like it just honored that dataset and my work itself.
The field is really pushing in the direction of open access and making data and code freely available. A lot of funding applications actually want you to have a plan about that, and I think it will be very advantageous to be able to show that I've already published the data and received this award for it.
Research data management
When it comes to research data management, Viviane says the research community needs the ability to seamlessly upload structured data so that it is easily discoverable and shareable – qualities she finds in Mendeley Data.
With Mendeley Data, I like being able to describe your data, and putting it up in that kind of folder structure. I also like the feature that indexes your shared data so that it can be cited. This is very, very helpful.
The only limitation for Viviane is the file size limit. "The trained models are pretty big, so I only uploaded the main models, and some of the extra experiments I couldn't upload," she says.
In the machine learning world, the sharing of data and code is already quite standard. "Almost everything that I produce can be reproduced just by running the code again," says Viviane, who is thankful that so many others in her research field have also made their code and data available:
Often when I read a paper, I like to first look at the code or the data that the paper is based on, just to get a general idea. I really like when I can find this online and I don't have to ask the authors for it.
How can we broaden data sharing to other fields?
Beyond cognitive computing, Viviane firmly believes that the widespread adoption of data sharing would benefit other fields as well. This involves promoting a culture of data sharing at research institutions, from the top down. Clearer guidance on what can and can't be shared would also encourage more researchers to share their data, she adds:
It would be helpful to have clear guidelines on what data is allowed to be published. For example, I've also worked with subject recordings before and have not published those because I'm not sure what the legal guidelines are with that.
The sharing of large datasets is clearly the way forward for the research ecosystem. This kind of widespread collaboration can support researchers in keeping their data transparent while working more efficiently, potentially helping them make key breakthroughs faster.