With advanced imaging technology, it is possible to record the activity of 100,000 individual neurons in a live zebrafish larva; just 30 minutes of footage produces 1 terabyte of data. Biology – like all scientific disciplines – is becoming increasingly reliant on data, which presents researchers with a challenge: How do they find the data they need for their experiments?
For a chemist who needs to find a particular molecule, an astrophysicist who is looking for a star or a biologist trying to identify a peptide, broad or subject-specific databases can help. The volumes of information they come across in these databases are impressive: Elsevier’s Reaxys, for example, contains more than 40 million chemical reactions and 75 million compounds.
While open data ensures that researchers have access to the information, there’s still the task of sorting through it all to uncover what’s relevant – while leaving room for serendipitous discovery. Making data user-friendly requires collaboration between data experts and technology providers to set up the right infrastructure.
A recent partnership between the collaboration network Amsterdam Data Science (ADS) and Elsevier involves ranking datasets in search results. Disruptive Technology Director Dr. Paul Groth of Elsevier Labs explained:
“The really challenging thing is to correctly rank which datasets should come up first when you do a search. You don’t have as much information with data as you do with things like documents – imagine a spreadsheet with named columns and rows and rows of numbers, how do you see what the data is about clearly enough to rank it in a search? We’re working with Amsterdam Data Science to try to figure out how to do this much better.”
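To make the ranking problem concrete, here is a minimal sketch in Python. It is not the method used by Elsevier or ADS, which the article does not describe; it only assumes that a dataset exposes a title and a list of column names, and it scores each dataset by how often the query terms appear in them. The record layout, the function names and the double weight on column names are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, dataset):
    """Count query-term matches in a dataset's title and column names.

    Column-name matches count double; the weight is an arbitrary,
    illustrative assumption, not a tuned value.
    """
    title_counts = Counter(tokens(dataset["title"]))
    column_counts = Counter(t for c in dataset["columns"] for t in tokens(c))
    return sum(title_counts[q] + 2 * column_counts[q] for q in set(tokens(query)))

def rank(query, datasets):
    """Return the datasets ordered from highest to lowest score."""
    return sorted(datasets, key=lambda d: score(query, d), reverse=True)

# Hypothetical toy records, invented for this example.
DATASETS = [
    {"title": "Field survey 2015", "columns": ["site", "date", "count"]},
    {"title": "Zebrafish imaging", "columns": ["neuron_id", "time_s", "fluorescence"]},
    {"title": "Neuron notes", "columns": ["id", "comment"]},
]

if __name__ == "__main__":
    for d in rank("zebrafish neuron fluorescence", DATASETS):
        print(d["title"])
```

The sketch also shows why the problem is hard: the numbers inside the spreadsheet contribute nothing to the score, so a dataset with uninformative column names would be nearly invisible to this kind of ranking.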
“From my point of view, academic search is a fascinating problem – one that has not been studied extensively at scale. Elsevier has very interesting data, not just on publications but revealing how people search, including in big databases like Mendeley Data. As academics we have access to that ecosystem from the outside, but having access from the inside is super interesting.”
The algorithms they can develop with that data can “learn” from people’s behavior, preferences and changes in their preferences over time. Part of this development requires theory, which they can develop at their desks, but part is experimentation – they need to test the algorithms under different conditions.
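One simple way to picture an algorithm that “learns” from behavior is a ranker that keeps track of how often each result is clicked when it is shown. The Python sketch below is an illustration under stated assumptions, not the researchers' approach: the class name `ClickRanker`, the smoothing priors and the item labels are all hypothetical. It estimates a smoothed click rate per item, so that items with little or no history are neither buried nor over-promoted.

```python
from collections import defaultdict

class ClickRanker:
    """Rank items by a smoothed estimate of their click-through rate."""

    def __init__(self, prior_clicks=1.0, prior_views=2.0):
        # Priors act like imaginary past observations (assumed values),
        # so an unseen item starts at prior_clicks / prior_views.
        self.clicks = defaultdict(float)
        self.views = defaultdict(float)
        self.prior_clicks = prior_clicks
        self.prior_views = prior_views

    def record(self, item, clicked):
        """Log that an item was shown, and whether the user clicked it."""
        self.views[item] += 1
        if clicked:
            self.clicks[item] += 1

    def estimate(self, item):
        """Return the smoothed click rate for an item."""
        return (self.clicks[item] + self.prior_clicks) / (
            self.views[item] + self.prior_views
        )

    def rank(self, items):
        """Order items from highest to lowest estimated click rate."""
        return sorted(items, key=self.estimate, reverse=True)

if __name__ == "__main__":
    ranker = ClickRanker()
    for clicked in (False, False, False, False):
        ranker.record("dataset_a", clicked)
    for clicked in (True, True, True, False):
        ranker.record("dataset_b", clicked)
    print(ranker.rank(["dataset_a", "dataset_b", "dataset_c"]))
```

A ranker like this can only be evaluated with real users seeing real results, which is the experimental side of the work described above.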
“This is why you need partners like Elsevier with systems and users in place,” Dr. De Rijke said. “There’s a clear benefit for the advancement of science; Elsevier becomes part of the experimental environment.”
Generating serendipitous discoveries
One of the things Dr. De Rijke is interested in is how academics search – and how their behavior differs from web searchers. This knowledge will help him establish how algorithms can automatically learn to improve ranking and recommendations, which is information Dr. Groth can apply to develop technology that supports researchers.
Algorithms designed to understand researchers' reading habits are already changing the way scientists find information. Elsevier's recommender service on ScienceDirect learns from 12 million users a month, giving them relevant suggestions that can help them on their search journeys. Dr. Groth aims to provide a similar tool for data – one that can suggest data, methods and other information based on search and reading behavior.
4 things that make data discoverable
At the European Data Forum in June, Dr. Paul Groth talked about data science at Elsevier, highlighting the four things that make data more discoverable:
- Archiving it – Mendeley Data and other databases can give data a permanent home and a DOI, so it’s citable
- Making it accessible – using APIs and links in articles referencing the data
- Making it searchable – the ADS partnership aims to improve search and ranking tools
- Making it reusable – ensure the data is clear, valid and understandable
With a background in philosophy and mathematics, Dr. De Rijke is interested in the representation and retrieval of information; he moved into computer science to make his work more practical and see it applied inside and outside science to make a real difference to research:
“The aim is to guide scientists as they search but at the same time facilitate serendipity – chance encounters with information they might not be aware they were looking for or even that it’s something they could ask for. It’s an interesting challenge to facilitate the process of people getting lost in the information and having a chance encounter with something. Serendipity is important for scientific search; it’s not something we’ve solved but we would like to make good progress in this area.”
Machine learning for the search of the future
This is just the beginning. Through the partnership, Elsevier and ADS are redefining what it means to do computer science and product development, and on the horizon they see a more advanced role for tools like Elsevier DataSearch.
Dr. Groth, a former academic computer science researcher, moved to Elsevier to apply his expertise to developing technology that helps people do scientific research.
“We already do this, but there’s just so much more we can do; it’s a really exciting time to be in this transition period as we focus more on technology. By collaborating with some of the top computer science research groups in the world, we can take advantage of all of the expertise out there to make research tools better and apply our knowledge back to actually doing science.”
Elsevier and the computer scientists of Amsterdam Data Science are partnering to understand how researchers look for data, with the potential to develop smart technology to facilitate serendipity. Through industry-academia partnerships, we can create powerful tools that help researchers discover new paths of inquiry.
Dr. Groth is not alone in moving from academia to industry: partnerships like this encourage people to move between the two, further strengthening the bridges between institutions. In the case of the Elsevier-ADS partnership, PhD students will spend part of their time working on the project at the Elsevier office.
This is a vital aspect of collaboration, facilitating knowledge sharing and transfer through the movement of people. As Dr. De Rijke explained:
“In the long run, it’s a very good format for continuous innovation – it’s not just academics throwing a solution over the wall and someone at a company picking it up. It’s a continuous, two-way process, so having people moving back and forth is important.”
By working together in this way, it may be possible to take data search to a whole new level. For Dr. Groth this means creating tools that go through the research cycle with scientists, so whether they are thinking about a hypothesis or busy with an experiment, the system can suggest a dataset or method that will help them at that particular moment.
We’re now at the stage where a researcher can see relevant datasets in the side panel of the article they’re reading on ScienceDirect, and see recommendations for further reading. The next stage is for the technology to know, even before the scientist has started a search, what information, methods and data they might need.
And what if the technology could go even further and suggest research questions that may not have been asked? Dr. De Rijke believes the methods he is working on in computer science could lead to a system that can generate research hypotheses based on what has already been published. For complex topics like climate change, this technology could ensure all the important avenues are being explored.
There’s a Dutch phrase that describes this: toekomstmuziek (literally “future music”). It means we’re not there yet, but with insights from search interactions and the development of publications over time, we will be able to begin to ask questions automatically. This changes the role of scientists, shifting creativity to a different place. Now, a lot goes into asking good questions, but in the future, this kind of automation could mean creativity is applied more to finding solutions or analyzing results.
Amsterdam Data Science (ADS) is a collaboration network organization largely funded by the Netherlands Organisation for Scientific Research (NWO) that involves researchers from Amsterdam University of Applied Sciences (HvA), Centrum Wiskunde & Informatica (CWI), Universiteit van Amsterdam (UvA), and Vrije Universiteit Amsterdam (VU). Strengthening the bridges that have existed for years between Elsevier and these institutions, ADS and Elsevier have signed a long-term agreement to work together on projects like DataSearch, to boost data science and support researchers.
Elsevier Labs is an advanced technology research group at Elsevier that aims to do three things: invent new technologies, such as new model languages; support Elsevier’s technology strategy, assessing how new technologies impact the business; and accelerate development, which involves recruiting what Dr. Groth calls “uber hackers” to work through particularly challenging problems with the team.