Nishant Mintri, a Natural Language Processing (NLP) Specialist at Elsevier, knows what you can achieve with the right data. He’s currently working on the Funding Body project, which extracts funding information from articles and showcases it to universities and funding institutions in Scopus, Elsevier’s abstract and citation database of peer-reviewed literature – the largest of its kind.
That information helps these institutions understand the impact of their research and where to direct funding in the future.
“When you join Elsevier, the amount of data you get to play with and apply to real-world problems is huge,” Nishant said.
He previously worked in the IT industry in India. He came to the Netherlands in 2017 to do his master’s degree in Information Studies, specializing in data science. When he tried to work on machine learning projects during his master’s program, he was frustrated to find that the data available to him fell far short of what was needed to train the models properly. That changed when he came to Elsevier two years ago.
Joining a big company like Elsevier, you’re immediately working with these huge datasets – Scopus itself has more than 70 million articles indexed. It feels like you can do whatever you want, and there’s a lot of freedom for experimentation.
Initially, that amount of data can be intimidating, Nishant said. One of his first experiments that analyzed Scopus records took 10 hours to process just 1 million of the 70 million articles in the database. With support from his co-workers, he started to figure out how to tackle the huge amount of data to extract meaningful insights:
It quickly becomes obvious that you need to be smart when it comes to analyzing datasets as big as the ones you have access to at Elsevier. When you start talking to other people within your team, they share their experience when dealing with ‘big data,’ from constricting your search spaces to parallel processing. These insights helped me a lot when I was a newbie in the field.
For the Funding Bodies project, the machine learning algorithms examine text in millions of research papers to determine which sentences have to do with funding information, and from there, which organizations provided financial support for the work. The machine learning models are trained on data that has been annotated by subject matter experts; these include examples of what a funding sentence looks like and what it does not. Output from the models is shared with authors and experts, and their feedback helps to continuously fine-tune and improve the quality of the models.
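To make the idea of learning from annotated sentences concrete, here is a minimal Python sketch of a funding-sentence classifier. It is not Elsevier's implementation: the training sentences, the function names (`tokenize`, `train`, `is_funding_sentence`) and the choice of a small naive Bayes model are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical annotated examples (sentence, is_funding), invented for illustration.
TRAINING = [
    ("This work was supported by the National Science Foundation under grant 12345.", True),
    ("The authors acknowledge funding from the European Research Council.", True),
    ("This research was funded by a grant from the Wellcome Trust.", True),
    ("We thank the reviewers for their helpful comments.", False),
    ("The samples were stored at room temperature for two days.", False),
    ("Results are shown in the table below.", False),
]


def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count word occurrences per label to build a naive Bayes model."""
    counts = {True: Counter(), False: Counter()}
    docs = Counter()
    for sentence, label in examples:
        counts[label].update(tokenize(sentence))
        docs[label] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab


def is_funding_sentence(sentence, model):
    """Return True if the funding label has the higher log-probability."""
    counts, docs, vocab = model
    total_docs = sum(docs.values())
    scores = {}
    for label in (True, False):
        score = math.log(docs[label] / total_docs)  # class prior
        total = sum(counts[label].values())
        for token in tokenize(sentence):
            if token not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((counts[label][token] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]


model = train(TRAINING)
print(is_funding_sentence("This study was funded by the Dutch Research Council.", model))
print(is_funding_sentence("We thank the reviewers for their comments.", model))
```

A production system would need far more annotated data and a stronger model, plus a second step to identify the organizations named in each funding sentence. The sketch only shows how positive and negative examples from experts can drive the first decision.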
Nishant compares it to cooking a gourmet meal:
Initially the data is dirty in that you can’t just use it as it first appears. You need to spend a bit of time prepping it, peeling off the layers. So it’s like chopping up your ingredients and getting them ready. In this use case, the actual text needs to be extracted from PDF documents – tokenized and segmented into sections.
Once that’s done, you get to cook, and that’s the part I find really satisfying – running it through a model, seeing whether the results are as good as you’d get with a human going through the information manually, tweaking what I’ve found, automating things, improving the algorithms.
That’s when it really kicks in, and you get that great feeling of making progress and doing something for the business and its customers.
Because of the work Nishant is involved in, research managers can quickly examine which funding programs their institutions have been involved in, allowing them to monitor how funds are being used. Funders can see the outputs of their work, and universities can get an overview of who their key partners are and make decisions based on that information. Making that difference is one of the things that matters most to Nishant.
In my previous role it was hard to see how my work affected the end user, which could be a bit frustrating. But here, I can easily see how the changes I make have a positive effect on people using the platform.