Today’s data scientists are interested in a lot more than simply visualizing data. These scientists are using knowledge from linguistics to teach computers to understand language. They are using methods from library science to help decipher scientific articles. They are using natural language processing (NLP) to fact-check social media posts. They are bringing together knowledge from an array of disciplines in order to make sense of the vast amounts of information people contend with on a daily basis.
Data science is still a young discipline. Although researchers and statisticians have been using computers to analyze data since the 1970s, it is only within the last 20 years that data science has come into full bloom as an independent, interdisciplinary field. Significant advances have been made in machine learning, neural networks, and natural language processing during that time, and now a new generation of data scientists is emerging who are taking those tools and applying them in novel and unexpected ways, specifically by enhancing their work with knowledge from other domains.
Incorporating other disciplines into data science is the driving force behind Kathleen Gregory’s work. Gregory is a PhD Candidate at the Royal Netherlands Academy of Arts and Sciences working at Data Archiving and Network Services (DANS) on the Re-SEARCH project. DANS is part of a collaborative NWO-funded effort to optimize a search engine devoted to research data with the broader aim of stimulating the discovery and reuse of research data.
As a former high school chemistry teacher with a BS in neuroscience, an MA in education and an MS in library and information science, Gregory said her shift to data science seemed like a natural one. “Data science is a discipline that encompasses more than just the technical,” she explained. “Increasingly, questions of ethics, literacy, philosophy and policy are permeating the conversation as people seek to understand the use and effects of data and the new ways of doing science that open data enables.”
In describing her work, she said:
I take a more social-sciences-based perspective than the other team members. My aim is to characterize the behaviors involved in the retrieval of research data via interviews, case studies, and bibliometric methods. I am also interested in learning more about how these data retrieval practices, especially practices related to the evaluation of research data for reuse, are developed within particular communities. I am most excited by the challenge of combining STS (science and technology studies) and IR (information retrieval) perspectives to develop new frameworks for thinking about data search.
As a PhD candidate in an ever-changing field, Gregory said she relies on the experience of her mentors: “My doctoral supervisors, Andrea Scharnhorst, Coordinator of Research and Innovation at DANS, and Professor Dr. Sally Wyatt, currently at Maastricht University, are wonderful role models and help me to develop my knowledge, skills and perspective in scientometrics and STS, respectively. I feel very privileged to work closely with both of them.”
The collaboration between DANS and Elsevier is another source of direction. “They provide infrastructure, resources and advice for all of the involved parties,” she said. “[They] have experience with both information retrieval and research data management and have so far provided valuable advice regarding publications and methods.”
“One thing that I am learning is the importance of flexibility,” she added. “As the project develops and new perspectives are brought together, it is to be expected that our original research plans will also need to be modified. Maintaining an attitude of openness and flexibility allows the research to develop in interesting, organic, and unexpected ways.”
She sees the future of data science as an ever more collaborative one. “One of the trends that I see in many disciplines is an increase in interdisciplinary and cross-institutional projects,” she said. “The drivers for these collaborations are many, but I believe that one of the effects is the potential for an increase in creativity, as well as the opportunity to bring perspectives rooted in certain disciplines to other academic fields.”
Another next-generation researcher with an interdisciplinary approach is Dr. Isabelle Augenstein, Assistant Professor in the Department of Computer Science at the University of Copenhagen, who specializes in NLP and machine learning.
She came to data science via her love of both computer science and linguistics:
I was always very interested in computer science but I was also interested in language because language is ambiguous, and it seemed like a very hard problem to automatically understand it. I work in statistical natural language processing, which means we use statistical methods to try and understand how language works. One thing that has become popular in the last few years that we were also using is machine learning methods using neural networks. Those are very good at helping to understand how words relate to one another and how sentences relate to one another.
In her postdoctoral work at University College London, Dr. Augenstein worked with the UCL Machine Reading group, a collaboration between UCL and Elsevier. “Our role was to try to automatically understand scientific papers; to extract content from them automatically,” she said. Once extracted, this content can be easily searched by users in an intelligent fashion and leveraged by other systems.
The collaboration between UCL and Elsevier was useful to Dr. Augenstein both with respect to her project goals and also in terms of understanding the field of data science and the direction the field is taking. “They provided the data and the use case for the project, as well as in-person meetings. We had, for example, a full-day workshop where the UCL Machine Reading group would present their work and Elsevier would present theirs and we would talk about the future of the collaboration and the future of the field.”
Her primary source of inspiration, however, was her mentor, Dr. Sebastian Riedel, a Reader in the Computer Science department at UCL and an Allen Distinguished Investigator. “We had weekly meetings where we discussed methods and ongoing work, but also discussed trends in computer science and natural language processing and how some trends might be short-lived whereas others might be bigger trends,” Dr. Augenstein said. “There are things that people might focus on within a year, but then it’s important to see what might happen in the larger time frame, say in five years, and what would be a good thing to focus on now to still be relevant in five years.”
Now Dr. Augenstein is taking what she learned at University College London and using that knowledge and experience in her own research, which involves fact-checking social media posts. Dr. Augenstein attributes part of her unique perspective to her relative youth. “In natural language processing specifically, more established researchers started working in the field when datasets were mostly based on news articles, whereas I have worked with social media data a lot — I’ve even worked on studying representation learning for emojis — and this seems quite natural to me.”
“I’m really happy to see people try to incorporate linguistics back into NLP,” Dr. Augenstein added. “That is where it started – trying to understand how grammar works, how language works in general, and then building some theoretical models of language and trying to implement those models. For the last few years, people weren’t really trying to look at this; they were trying to use brute-force machine learning without this intuition of how language works. People are starting to push back against that, and I’m really glad to see that. I think this is where the field will likely develop over the next few years.”
Training the next generation
At Elsevier, we take our responsibility to this next generation seriously by supporting data science internships. Elsevier’s Content and Innovation (C&I) group invests heavily in building the internship program. Dr. Georgios Tsatsaronis, Principal NLP Scientist in the C&I group, explained: “The C&I group maintains a very well-connected network with academia for this purpose, from which the interns are found. This network expands worldwide, focusing on some of the most respected universities across the globe, such as the University of Melbourne and Cambridge University.”
The internship subjects are prepared by C&I on the basis of research questions related to research projects the group is leading. “In turn,” Dr. Tsatsaronis explained, “once the interns work on these questions, for which they are provided with all the necessary supervision, infrastructure and support, the outcomes are often prepared for submission to high-impact journals and conferences.
“A strong presence at such conferences, which attract the top people and communities in their respective research fields, enables Elsevier to place itself within the elite of the companies who are actually applying machine learning, natural language processing and artificial intelligence in production systems, and also enables us to communicate our mission: help scientists and health professionals get better outcomes, become more productive, make new discoveries, and have more impact on society.”