How we’re using AI to boost productivity for chemistry researchers
February 6, 2023 | 10 min read
By Eleonora Echegaray

A data enrichment expert takes you behind the scenes of Elsevier’s award-winning Reaxys Content Catalyst team
Caption: The Elsevier team is presented with the Data Science Excellence Award for the Reaxys Content Catalyst (left to right): Mark Sheehan (VP, Data Science, Life Sciences, Elsevier), Anitha Golla, PhD (Senior Data Enrichment Expert, Elsevier), Chetan Bhagat (award presenter, Indian author), and Abhinav Agnihotry (Data Scientist, Elsevier)
Chemistry researchers worldwide use Elsevier’s expert-curated chemical information platform, Reaxys, to find the information and compounds they need in a broad range of fields, from pharmaceutical drug discovery and chemical R&D to academic research and education. Recently, the team behind the Reaxys Content Catalyst was awarded a Data Science Excellence Award for innovation in analytics, data science and artificial intelligence.
I sat down with Dr Anitha Golla, a Senior Data Enrichment Expert at Elsevier, to talk about her team’s work and what they’re doing to continually expand and update the content available in Reaxys.
It quickly became obvious that her work is her own reward. But she was still thrilled her team won this award alongside heavyweights like Axis Bank Limited, IBM, Schneider Electric and Wells Fargo.
“These days everybody is doing something with AI and data science – there’s just so much work going on,” Anitha said. “So it’s fantastic to get this sort of validation from the greater AI community.”

Anitha Golla, PhD
100 million documents and counting
The award capped India’s biggest AI conference, Cypher22, when Analytics India Magazine hosted the fourth edition of the awards in September. The prize recognized the team’s work on the AI-powered content enrichment production pipeline Reaxys Content Catalyst (RCC), which radically expands the content available in Reaxys and, in turn, boosts R&D productivity for chemistry researchers.
The prize also coincided with the pipeline passing a key benchmark: processing over 100 million documents.
“Both of these achievements are really just a testimony of the power of cross-functional teams,” Anitha said.
Diversity of thought: collaborating across functions
Anitha developed a taste for working on a multidisciplinary team while working on her PhD in bioorganic chemistry at the Karlsruhe Institute of Technology (KIT) in Germany:
“My supervisor had a small startup, and his aim was to provide biologists with as many peptides as possible for their research. These needed to be both cheap and of high quality. And to help make this happen, I got to work with all these amazing people: physicists, biologists, engineers.”
“Previously, I was largely a lone researcher. But this experience helped me understand if you work with all these different people, amazing things can happen. And they can happen better and faster than if you did it alone.”
A high-impact niche
The complexity of her current work certainly requires a cross-functional team.
“There are millions of documents published in the scientific community that have the capacity to change the world on every level,” she says. “It could be about a life-saving drug or about changing the way we make decisions or approach a certain challenge. Our job is to make sure that this content is up to date so people can take it from there in the fastest and smartest way possible.”
While passionate about the relevance of her work, Anitha was still pleasantly surprised by the award. “We’re actually quite niche,” she said. “We’re collecting the chemical facts – from both texts and images – and giving them to the scientific community in a way to help drive their decisions and actually help them do their extraordinary work.”
“Our customers literally told us what they wanted …”
“Our project also stands out for being entirely born out of customer needs,” Anitha added. “Our customers literally told us what they wanted: to be able to find certain things – substances, biological targets – very quickly in patents published in the last 20-odd years. They wanted a sense of the competitive landscape so they could work within this landscape and not against it.
“Traditionally, there’s only been one way to get this sort of information: hire an army of chemists to read each of those millions of documents line by line. But of course, this is much too slow and costly. So we sought to automate the process – after all, Elsevier was already applying data science to almost everything else.”
No average day
The project involves a team of 40+ people, depending on what work needs to be done.
“On any given day, I work with people from three or four different domains – hardcore chemists, data scientists, data engineers, data architects, software people, etcetera,” Anitha explained. “I have to switch from thinking like a chemist checking to see if a structure is correct, or looking at it like a statistician for precision. So that keeps it exciting.”
It also keeps things challenging, she said: “You might come up with something that makes sense to chemists. But then when the people on the software side look at it, they say it’s too costly in terms of computational power or time. And later, while something might work on a small scale, it’s a whole different story when it’s productionized and applied to millions of documents. But the fantastic thing is that everyone wants to find that right balance where everyone’s happy.”
Onward and upward
The project was ambitious from its inception.
“It was never just about a pipeline that could process patents quickly and accurately,” Anitha explained. “It also needed to be updated and upgraded every time something new arrived – be it more documents or new technologies, approaches or products. It needed to be a fully modular pipeline – like plug-and-play – that could easily be adopted and just keep on running. So that involved a lot of planning.”
Now, as the pipeline has been extended to data from journals, all this planning is paying off. Further iterative development of the infrastructure is planned for 2023, including an extension to Elsevier’s biomedical literature database Embase.
And the ambitions continue to grow.
“At one point down the road, I see a pipeline where anything can go through, and it just branches out to different products,” Anitha said. “It will be able to classify everything on its own, thanks to Elsevier’s massive taxonomies.
“Once you realize there are so many things you can do from the data perspective in terms of getting actionable insights, the sky becomes the limit – not only for chemists and other life sciences [researchers] but beyond.”