How data scientists are uncovering chemical compounds hidden in patents

Scientists at the University of Melbourne and Elsevier are running a competition to develop models that can find chemical reactions in text


The open science movement is helping scientists around the world access data, methods, software and a range of scientific discoveries. But for chemists, part of the information cycle is obscured: information on chemical compounds can be difficult and time-consuming to access until it’s published in an academic journal, which can be years after discovery. Yet that same information is disclosed immediately – in part or in full – in an industrial patent.

Chemical patents are unwieldy, complex and often deliberately ambiguous documents, yet they are a key source of information for the chemical industry. It’s impractical for scientists to comb through patents manually, making it crucial to develop automatic approaches that can extract the information they need. Elsevier’s Reaxys database features this information, but it’s currently excerpted manually by experts in a labor-intensive process.

A collaborative team of data scientists and chemists at Elsevier, and computational linguists at the University of Melbourne in Australia, is working on a more automated solution. In a project called ChEMU (Cheminformatics Elsevier Melbourne University), they are developing a natural language processing-based model that will be able to scan millions of pages of patent documents to automatically identify chemical compounds and their positions in reactions.

The benefits reach far beyond the University of Melbourne building its knowledge and Elsevier refining its product. For chemists, the outcome of the project will help them find the information they need about a chemical compound or a reaction quickly and easily; for the global data science community, it will provide a wealth of biochemical content and the opportunity to develop groundbreaking models.

Dr. Karin Verspoor, Professor in the University of Melbourne's School of Computing and Information Systems, is leading the project from the academic side. She commented:

We need each other – it's absolutely a collaboration. Sometimes academia and companies are on different time scales, but this collaboration is going really well. I think we all feel like we're pulling in the same direction on this, and it's just a real pleasure to see that level of commitment from both sides.

Chemical compounds in patents: the needle-in-a-haystack problem

That shared commitment is vital because the problem they’re attempting to solve is a difficult one. When scientists develop a new chemical compound (which happens a lot: about 1 million new compounds are published every year), they often patent it. Chemical patents are written by lawyers, not chemists, and they are complex: they exist to protect the compound, not to expose it. They are long – they can be 900 pages each – and they are written ambiguously, often containing the names of 3,000 chemical compounds when the patent only really relates to one.

Then there’s the detail: chemical compounds are often named systematically based on their structure, with different prefixes and suffixes used to represent the elements in the compound. A single mistake in what can be a very long string of letters can invalidate the whole name, making the compound impossible to recreate. And those small mistakes are sometimes introduced into patent documents deliberately, to prevent the information from being used.

This makes it very hard to understand the information and find what you’re looking for. It’s a needle in a haystack, only that haystack has been dipped in glue and shrink-wrapped. Principal NLP/ML Scientist Dr. Saber Akhondi – a computer scientist turned bioinformatician turned cheminformatician who is in charge of data science for Reaxys – is leading the project for Elsevier. He explained:

Putting all this together – it's a hard task. We want to figure out how to identify the relevant chemical compounds in a patent document. What is the patent trying to cover? Then you can go a step further and investigate what reaction the compound is part of and where it appears in tables in the document, which are even more challenging to extract data from.

The team has already come up with one solution to the first step in the process: identifying chemical compounds in patent documents. For this, in addition to data from Reaxys, they used an external dataset created by academia called BioSemantics – and a huge corpus of 1 billion pages of patent documents from seven countries. “We wanted to build a model that not only gives good performance over our own Elsevier data, but that can beat every other model proposed for this data,” said Dr. Camilo Thorne, an NLP Scientist at Elsevier.

And that’s what they did: their results in the summer of 2019 showed that they had produced a model that outperformed all others proposed for the BioSemantics data by a large margin. In fact, their model could identify the chemical compounds correctly in 93 out of 100 cases.


University of Melbourne PhD student Zenan Zhai presented the chemical compound detection results at the BioNLP 2019 workshop at the Annual Meeting of the Association for Computational Linguistics (ACL) in Florence, Italy, in August. He is also the first author of the paper the team recently published, and the work is part of his PhD.

Shared task: opening up the problem

Dr. Camilo Thorne, NLP Scientist at Elsevier, and Melbourne PhD student Zenan Zhai presented the chemical compound detection results at the Association for Computational Linguistics (ACL) conference in Florence, Italy.

The team used state-of-the-art word embeddings to identify chemical compounds in text. Word embeddings represent each word numerically, making it possible to see each word in a document in context: the model reads a string of words and encodes each one according to its position, its relation to other words and its role in the sentence. These representations are then fed into a deep learning named entity recognition model. The result is a model that can tell you whether a word is part of a chemical compound name, and where it sits within that name. This provides a way of spotting the name of a relevant chemical compound in a 900-page document designed to hide it – the needle in the haystack.
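To make the model’s output concrete, here is a toy sketch of how a named entity recognition system typically marks up text, assuming the standard BIO (Begin/Inside/Outside) labeling scheme – an assumption for illustration; the tagger itself is mocked, whereas a real system would predict these labels with a neural model over word embeddings:

```python
def extract_spans(tokens, labels):
    """Collect contiguous B-/I- label runs into chemical name spans."""
    spans, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B-CHEM":                 # a new compound name starts here
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif lab == "I-CHEM" and current:   # continuation of the current name
            current.append(tok)
        else:                               # outside any compound name
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# Hand-written labels standing in for a real tagger's predictions:
tokens = ["Dissolve", "4-amino", "benzoic", "acid", "in", "ethanol", "."]
labels = ["O", "B-CHEM", "I-CHEM", "I-CHEM", "O", "B-CHEM", "O"]
print(extract_spans(tokens, labels))
# → ['4-amino benzoic acid', 'ethanol']
```

The BIO scheme is what lets the model say not just “this word is chemical” but where a multi-word compound name begins and ends.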

Now they’re taking the next step to reactions by setting up a shared task – a friendly competition that’s open to any participants who want to build a model to address the problem. A chemical reaction can involve several different compounds and have numerous steps, so the team wants to develop a way of identifying the position the chemical compound has within a reaction, and what kind of reaction it is. Saber likens a reaction to a recipe:

You mix A with B and it gives you C. And then you cook the mixture at a temperature of 180 degrees. Getting this type of information out of a patent is that first step we’ve taken – identifying entity types. It gives you the list of ingredients and details like the temperature. Then there's the process – adding A to B is step one, which gives C, then cooking it at 180 degrees. Those are the events – it’s that step-by-step reaction we’re also looking for in the shared task.
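Saber’s recipe analogy can be sketched as a data structure: a list of entities (the “ingredients” and conditions) plus an ordered list of events (the “steps”). The field names below are purely illustrative, not the shared task’s actual annotation schema:

```python
# Hypothetical target structure for reaction extraction from a patent snippet.
# Step 1 of the project finds the entities; the shared task also targets
# the ordered events that link them into a reaction.
reaction = {
    "entities": [
        {"text": "A", "role": "starting_material"},
        {"text": "B", "role": "reagent"},
        {"text": "C", "role": "product"},
        {"text": "180 degrees", "role": "temperature"},
    ],
    "events": [
        {"step": 1, "action": "mix", "inputs": ["A", "B"], "output": "C"},
        {"step": 2, "action": "heat", "inputs": ["C"], "condition": "180 degrees"},
    ],
}

# The entity list answers "what is in the reaction";
# the event list answers "what happens, in what order".
for ev in reaction["events"]:
    print(ev["step"], ev["action"], ev["inputs"])
```

Extracting the entity list is the step the team has already solved; recovering the ordered events is the harder problem the shared task opens up to the community.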

Here’s how it works: the ChEMU team has created a dataset called a gold set, which contains 1,500 reaction snippets from patents, totalling about 7,000 sentences. They provide this gold set about five months in advance of the competition deadline, along with the information that should be extracted by whatever model the participants come up with. Participants then get to work building a model that can solve the problem the team sets out in the shared task. About a week before the deadline, the team will provide participants with some data they can use to test their models. Everyone then gets together at a workshop to assess and discuss the models, and the results may be published.

Everyone benefits in this situation: the ChEMU team gets a new perspective and the community gets access to data they wouldn’t otherwise have. Karin commented:

The real power of this is to get a whole bunch of teams around the world building their own solutions, and what we hope will come out of it is that somebody does better than we can. We're obviously trying to build our own solutions, and we will provide those as benchmarks for the task, but if someone comes up with a better model, we can then see the problem differently and think about different solutions.

The ChEMU shared task

The shared task – called The ChEMU evaluation campaign: named entity recognition and event extraction of chemical reactions from patents – is now open and will run until the models are presented at CLEF 2020 on September 22-25, 2020, in Thessaloniki, Greece.

Participants in the shared task will develop models to identify chemical compounds and their roles in reactions.

Working together, we all benefit

The shared task is reflective of the project as a whole: each member of the team brings their own expertise, ideas and perspective to the work, and each benefits from the result. This has been a focus for the project since the start – its objectives included releasing datasets to engage the academic and R&D community.

Karin is a computational linguist, which means she works on getting computers to understand human language and on strategies for machine reading of text produced by humans. It’s a hot topic, with companies like Google using text mining technology for applications such as translation engines. While today’s technology is fascinating, the aspect of the ChEMU project that brings the most to Karin’s work is the access to Elsevier’s domain experts.

We don't typically have access to domain experts in chemistry. The whole foundation for this work is the Reaxys product, so the experience that Elsevier has with structuring information and their knowledge of chemical reactions is huge, because they know what information is important to them – they know what information the product targets and catalogs.

For the Melbourne team, being able to work with people who understand the language of chemistry has been really helpful and has enabled them to run with models they wouldn’t otherwise have been able to develop. Karin has worked with biomedical experts in the past, and she’s excited that there’s still so much to do.

What is science about? At a really high level, science is about contributing to knowledge, increasing our understanding. That's what it's about for me – this project contributes to improving our technology and improving our understanding of how language works. I call that job security because I think I'll probably be doing that for a lifetime – there are endless things we haven't solved yet.

The funding for the ChEMU project runs until 2021, and where the work will go after that remains to be seen. But with so many specialist domains and different languages to explore, the team won’t be short of challenges.

Meet the team

The ChEMU project is a collaboration between data scientists and chemists at Elsevier and computational linguists at the University of Melbourne.


University of Melbourne

  • Prof. Karin Verspoor (PI)
  • Prof. Trevor Cohn
  • Prof. Tim Baldwin
  • Dr. Dat Quoc Nguyen, postdoctoral researcher
  • Dr. Hiyori Yoshikawa
  • Zenan Zhai, PhD student
  • Biaoyan Fang, PhD student
  • Prof. Lawrence Cavedon, RMIT University


Written by

Lucy Goodchild van Hilten


After a few accidents, Lucy Goodchild van Hilten discovered that she’s a much better writer than a scientist. Following an MSc in the History of Science, Medicine and Technology at Imperial College London, she became Assistant Editor of Microbiology Today. A stint in the press office at Imperial saw her stories on the front pages, and she moved to Amsterdam to work at Elsevier as Senior Marketing Communications Manager for Life Sciences. She’s now a freelance writer at Tell Lucy. Tweet her @LucyGoodchild.

