Over the past four years, Elsevier Life Science has been enjoying a very productive research collaboration with the world-renowned NLP (natural language processing) specialist Prof Karin Verspoor and her international research team. Karin, who is also Dean of the School of Computing Technologies at RMIT University in Australia, led the academic end of our ChEMU (Cheminformatics Elsevier Melbourne Universities) project. Partly in recognition of this work, Karin was recently nominated for a prestigious Women in AI Award.
The ChEMU project is working to apply AI to automate the process of extracting information about chemical reactions in chemical patents, not just from text but also from tables. Specifically, the researchers are developing AI technologies to further activate the chemical information stored in Elsevier’s Reaxys database of 256 million chemical structures, reactions and properties derived from over 100 million patent documents and journal articles.
So far, the collaboration has resulted in numerous data sets, 20 research publications and one patent, with more to follow. These will further empower chemists and pharma companies to do their work more efficiently and effectively.
“What makes patents so interesting is that they are the place where information comes out first, since it usually takes four to six years for data in a patent to be published in a journal,” says Dr Saber Akhondi, Senior Director of Data Science for Life Science at Elsevier, who leads the team composed of Dr Camilo Thorne and Dr Christian Druckenbrodt. Making this information available will positively impact everything from scientific hypothesis generation to streamlining drug development.
We sat down with Karin to celebrate and review the project — and to look at how the project, along with the future of healthcare, might evolve.
How would you describe your job to someone outside your field?
Well, my job as Dean involves lots of operational management — so I wouldn’t go too deep into that [laughs]. I’d likely jump to the research side of my job, which I would describe as trying to get computers to be able to read and understand documents in the same way that you and I can read and understand documents. And that basically means we are trying to get computers to make sense of language — to not only make sense of the words but also the relationships between words.
And in the case of the ChEMU project, the documents you wanted machines to understand were chemistry patents …
Yes. And patents are rather esoteric and they present complex information. How do you synthesize a chemical? What are the reactions involved in synthesizing that chemical? It’s a kind of instructional text — like a manual or recipe. And in fact, one of our PhD students covered this: looking at recipes and the way we describe things as a process when, for example, we bake a cake. On an abstract level, a cake recipe and a chemical patent aren't so different: you have ingredients and you combine them following certain steps. Along the way, we do an incredible amount of interpretation — and it’s this constant interpreting we do as humans that is the trickiest part to reproduce.
Can you give an example of this interpreting we as humans do?
Let's say we’re baking a cake. You need to separate the eggs into whites and yolks and then do something with each of these. But what the heck are whites? Of course, we already know. But how’s a computer supposed to know that an egg is a thing with whites and yolks unless you tell it? Here you go, machine: here’s some world knowledge. And that’s what we're trying to do: to fill in the gaps for the computer that humans just take for granted.
You are also doing similar work with other domains — such as biology research papers and human clinical records. How do chemistry patents differ from these other domains? For instance, with patents, some of the information is consciously hidden away in the name of protecting IP.
At some level, it’s all very similar: you are sifting through all the words in documents to find the core entities and relationships that are expressed in those documents. And actually, this common practice in chemical patents of “intentional obfuscation,” where you try to balance proving you have something worth patenting against not giving away everything to the competition, is something I’ve also seen in other domains, such as biomedical research. In that case, funding is always related to the health of humans, not mice or snakes. So if you're a molecular biologist, the best possible thing you can do is emphasize that your research is relevant to humans. For instance, if you are talking about manipulating a certain mouse gene, you might not mention that the experiment only holds for that mouse gene. In fact, you might not even say whether it’s a mouse gene or a human gene.
So how do you deal with this trickiness from the scientific research world?
Natural language is tricky anyway, regardless of any intentional obfuscation. As I mentioned earlier, there's all this implicit kind of knowledge that we have to deal with. And so in our work with Elsevier, we're annotating as much data using human understanding and human interpretation as we can. And then we're trying to learn from that — the human is always the gold standard here.
We start from the assumption that humans understand the language and the descriptions in these texts that we're trying to analyze. We work together to define a schema of the kind of core entities and relationships — the core facts, if you will, that are captured in this language.
And that’s where the annotating of data becomes so important — human experts looping in their real-world knowledge …
Right. We are defining a parallel language consisting of which key information we want to capture from a document. And then using that, we have humans go through and say: ‘OK, this here’s an example of this kind of entity and this kind of relationship.’ And then we build a model using that data that we can then use on another text coming in — that is, we learn to mimic the annotation that the human has done over that text. This creates a nice virtuous cycle where we use humans to give us examples of what we’re after and how we should interpret these texts. And then we can build a model which we can then use to scale out and do this work automatically.
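To make the idea of an annotation schema concrete, here is a minimal Python sketch of how a human-annotated sentence might be stored and turned into training labels for a model. It is an illustration only, not the ChEMU schema or code: the class names, the labels (COMPOUND, ACTION, TEMPERATURE, TIME), the relation names, and the example sentence are all assumptions invented for this sketch. The conversion step uses the common BIO tagging convention, which marks the first token of each entity with B-, the following tokens with I-, and everything else with O.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    start: int  # index of the first token in the entity (inclusive)
    end: int    # index just past the last token (exclusive)
    label: str  # hypothetical entity type, not taken from the ChEMU schema


@dataclass(frozen=True)
class Relation:
    head: int   # position of the head entity in the entities list
    tail: int   # position of the tail entity in the entities list
    label: str  # hypothetical relation type


def to_bio(tokens, entities):
    """Convert entity spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for ent in entities:
        tags[ent.start] = f"B-{ent.label}"
        for i in range(ent.start + 1, ent.end):
            tags[i] = f"I-{ent.label}"
    return tags


# An invented sentence in the style of a synthesis description.
tokens = "The mixture was stirred at room temperature for 2 hours".split()

# What a human annotator might mark up (illustrative labels only).
entities = [
    Entity(0, 2, "COMPOUND"),     # "The mixture"
    Entity(3, 4, "ACTION"),       # "stirred"
    Entity(5, 7, "TEMPERATURE"),  # "room temperature"
    Entity(8, 10, "TIME"),        # "2 hours"
]
relations = [
    Relation(head=1, tail=0, label="ACTS_ON"),
    Relation(head=1, tail=2, label="AT_TEMPERATURE"),
    Relation(head=1, tail=3, label="FOR_TIME"),
]

if __name__ == "__main__":
    for token, tag in zip(tokens, to_bio(tokens, entities)):
        print(f"{token}\t{tag}")
```

Pairs of tokens and tags like these are the kind of examples a sequence-labeling model could learn from, so that it can then propose the same sort of markup on text no human has annotated.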
And once this groundwork is done, you can start generating value …
That’s been a big achievement in this project: all the work that we’ve done to annotate all of this data. It’s an amazing and tremendous value that Elsevier can bring all these domain experts to the table who can really give us that gold standard representation of what’s in these texts. We can't do any of this fancy machine learning AI stuff unless we have the human input about what's important about these texts. So we’ve been able to define the annotation schema and agree on what are the core pieces of information that we want to capture in this text — that really helps us to define the problem that we’re trying to solve.
What have been the other big successes of ChEMU?
Well, we’ve been able to break down into smaller subproblems this rather audacious goal of trying to understand through machine reading how to synthesize chemicals. And by solving these subproblems, we can now finally experiment with bringing the whole thing together.
Initially, we had all these different subparts of our big project, which worked on very specific problems. For instance, there was a student who worked on the problem of reference resolution, like connecting “yolks” to “eggs” or “the mixture” to the component parts of the mixture. Another was working on table processing and how we make sense of information in tables. There was a postdoc working on named entity recognition and event extraction. We even had a visiting researcher from Fujitsu in Japan, who spent a year on this project with us and is still joining us every other week in our team meetings; she worked on linking reaction statements together.
Now with many intermediate problems dealt with, we can put these together into an end-to-end solution that can take a patent document and pull out the key facts and relationships.
Another key element of ChEMU was running annual community challenges based on opening up datasets to stimulate further research related to, for example, entity recognition or event extraction.
It's a fantastic approach! Because this way, it’s not just us trying to solve these problems and putting our methods out there. It’s other researchers, other research teams, other experts that are contributing. And ultimately, we’re going to learn from what they attempt, even if our methods are still outperforming the methods of the other participants. We always learn something in the process. And in fact, in the first year, a team that was not our team won the challenge — they were able to outperform our best methods. We try to unpack why and improve our approach in the process. So we will definitely continue with these challenges.
How else would you like the project to move forward?
I'd really like to see us being able to bring everything together and have something that Elsevier can deploy as an end-to-end solution, to really be able to impact information management and knowledge management in the chemical domain. That would be huge.
How do you see AI in general transforming healthcare over the next few years?
That’s a tough one. I think a lot of people share a vision of the amazing power AI has: the fact that we can do things at a massive scale, and with very detailed data, to find patterns and make predictions. This gives us opportunities to really examine all sorts of questions about how to diagnose and treat disease and understand those diseases better — particularly rarer diseases with limited data on a more local level. Of course, there are a number of challenges such as the current availability of data and our comfort with sharing data. But I think these can eventually be overcome.