20+ years of digital innovation at Elsevier — and how the latest can accelerate R&D
January 8, 2024
By Ann-Marie Roche
The VP of Data Science for Elsevier’s Life Sciences team explains how pharma companies can accelerate their digital transformation using Elsevier’s AI-driven R&D innovations.
Mark Sheehan is VP of Data Science for Elsevier’s Life Sciences team. His 20-plus years at the company map closely onto Elsevier’s digital journey. And today, pharmaceutical companies can follow a similar journey — albeit highly accelerated — using Elsevier’s latest AI-driven R&D boosters.
We talked to Mark about his experiences along the way: from the joys of cracking open a newly printed book, to enabling R&D professionals to speedily crack new synthetic pathways at scale. “It’s true, innovation always involves new technology,” he says. “But it’s equally about human collaboration.”
When Mark joined Elsevier as a project manager in 2002, the company was in the first phase of transitioning away from being primarily a print publisher. With each technological transition that followed — from the move online, through content enrichment, to today’s predictive modeling and exploration of generative AI for drug discovery — Mark has been closely involved in Elsevier’s transformation into an information analytics innovator. Today, he leads the Data Science team for Life Sciences, exploring how AI and other technologies can streamline the research and development work of chemists and biologists.
In many ways, Mark’s career is not only a mirror of Elsevier’s evolution over the same time, but also of the current evolution of many companies and organizations that are embracing digital transformation to achieve their short- and long-term goals. We asked Mark to look back at some of his pivotal professional moments that may resonate for those on this journey.
First wave: The digitization of information
You were there at the internet’s big bang, weren’t you?
In some ways, I did get the full spectrum. When I started in publishing, immediately prior to Elsevier, I was working with a very small journals publisher where they copyedited, typeset and printed the journals themselves all in the same small building in North London. And when I arrived at Elsevier, it was a classic case of right place, right time. Luckily, I proved to be a terrible copy editor and proofreader — I just didn’t have the patience for it. But I did have a natural affinity for computers and some rudimentary programming skills, backed by a willingness to learn.
My first boss at Elsevier was very supportive of my growth, even though we came from very different worlds. He would love it when the first copies of a book came in from the printer — he’d pick it up, sniff the fresh smell of ink on paper, and vigorously shake it upside down to make sure the binding was strong. Meanwhile, I was this enthusiastic puppy evangelizing about the internet and jumping up and down to take on each and every task relating to the shift online. He largely left me and the other geeks to it and put a lot of trust in us to lay the foundations for the future.
By then, ScienceDirect was already up and running — a sign that the, um, shelf life of those printed books was becoming limited?
ScienceDirect was actually an amazing strategy for the time. As a subscription service offering all of Elsevier’s content in one place at one price, it invites comparison with what Spotify would do with music years later. So it was a bold shift for the whole company, first in journals and then in books.
Doomsayers were predicting the death of the book.
Indeed. With the rise of desktop publishing, the whole notion of a book as a container of information was fundamentally shifting. We began thinking about what unit of information the customers ultimately wanted. How could we best organize these articles and journals for them? Did they want the full book or just a chapter? Again, it’s similar to music when customers shifted from CD to purchasing individual tracks on iTunes. Forgive me, I do use a lot of music analogies, but both these industries were fundamentally disrupted in parallel.
But coming from a place as a legacy print publisher, Elsevier must have had some grumpy employees unwilling to embrace the shift to digitization. And Elsevier is a rather huge company.
Yes, it’s a common perception that larger companies are slow to change. And there were challenges, let's be clear on that. Some people really cared about the craft of print, which is great — and print does remain important in many markets. And yes, some colleagues were sometimes upset that we had to standardize our print designs so they would also work online, which I can sympathize with. And indeed, to push through change, you have to consider the human implications as much as the technical ones. But most were quick to see that the bigger story was not about paper but about the transmission of information — and being able to unlock what was written on the page for the largest scientific audience possible.
New webinar
Watch the first edition of the four-part webinar series AI in innovation: Unlocking R&D with data-driven AI. Mark Sheehan joins the expert panel to explore the perils, pitfalls and promise of generative AI for R&D.
Second wave: Leveraging the data
So the average Elsevier employee proved to be less obsessed with books than with information?
Once they saw that the internet wasn't a threat and wasn't taking anything away but rather just changing how we disseminated the information, the shift was clear for all of us. And as individual consumers, we were all evolving with the times: getting iPhones, Kindles, laptops, etcetera. Everyone was aware of where the world was going because we were part of that world in the middle of this amazing shift in society. And yes, some worried about whether their job would become obsolete. But they soon recognized that their skills were still valuable and/or adaptable in tandem with these changes.
But the true tipping point, the payback, only came later — when e-revenue eclipsed print revenue.
Yes, every year the digital revenue grew and grew, and suddenly there was this tipping point when we stopped focusing exclusively on when the book hit the warehouse for sale and started also tracking when it appeared on ScienceDirect or became available on the Amazon store. Basically, we were leveraging what we had already spent years building: a solid digitized foundation. So this second wave was much less bumpy; it was more about providing more and more different types and variations of deliverables for different consumers.
And as this reach continued to extend, the way people consumed our content fundamentally changed as well. It was no longer exclusively about browsing through the library and reading many physical copies to find what you needed. It was becoming more about how to optimize your search across many online sources and databases. So the challenge became more about streamlining the finding of the digital needle in the exponentially expanding haystack.
Third wave: From manual to automated
The next step was to leverage the digitized text even further: applying data science and AI to enrich this information and extract from it in whole new ways to help with digital search and discovery.
My department first came together at Elsevier seven or eight years ago to look into what we could do to move the needle on the adoption of data science in the life sciences. We discovered early on that we could do a lot of things to automate our traditional manual curating processes, which previously involved very smart people reading all these articles, literally page by page, and doing all sorts of clever annotations based on following these dense indexing “rule books.” But we also discovered that no matter how much technology you use, there’s a human limit to how much you can scale this approach.
So, to move forward, we began to wonder what would happen if we could teach the machine to read it all, do some initial enrichment that would update our customers quickly on new research, and also flag any interesting material to be read and indexed in detail by an expert. Luckily, Elsevier had some in-house tech pioneers who had already started building automated tools and paths that we could quickly extend for processing at scale — particularly for our chemistry database, Reaxys, and our biomedical literature database, Embase.
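As a purely illustrative sketch (not Elsevier's actual pipeline), the triage idea Mark describes could look roughly like this in Python: a toy classifier, trained on a handful of hypothetical expert-labeled abstracts, decides which new articles get flagged for detailed indexing by an expert and which receive automated enrichment only. All data, names and the threshold are assumptions made for the example.

```python
# Purely illustrative sketch, not Elsevier's actual pipeline: a toy classifier
# scores incoming abstracts and flags the likely-interesting ones for detailed
# indexing by a human expert; everything else gets automated enrichment only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical expert-labeled training examples (1 = flag for expert review).
abstracts = [
    "Palladium-catalyzed cross-coupling yields a novel biaryl scaffold.",
    "Editorial announcement regarding the journal's new submission portal.",
    "Enzymatic reduction of ketones proceeds under mild aqueous conditions.",
    "Conference report and administrative notes from the annual meeting.",
]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(abstracts), labels)

def triage(new_abstracts, threshold=0.5):
    """Split incoming abstracts into (auto_enrich_only, expert_review) queues."""
    probabilities = model.predict_proba(vectorizer.transform(new_abstracts))[:, 1]
    expert_review = [a for a, p in zip(new_abstracts, probabilities) if p >= threshold]
    auto_enrich_only = [a for a, p in zip(new_abstracts, probabilities) if p < threshold]
    return auto_enrich_only, expert_review
```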
Can you tell us more about how the humans-meet-machine-learning axis was applied in Reaxys?
Basically, we realized two things. First, that a single “silver bullet” technology doesn’t exist for our use cases. No single technology will give you the range of what a human being can do, nor what your customers are asking for. But if you stack different and complementary technologies together, they can work to cover different elements and you can get a much better view — “more sides of the elephant,” as it were.
Second: When this department first started, we had a team of about 20 PhD-level chemists and biologists who, in some cases, had been carefully curating these enrichment flows for 30-plus years. We then brought these “manual” domain experts together with our data scientists and analysts to “train the machine.” This led to amazing results: In that first year alone, we were able to deliver new automated capabilities for Reaxys that could enrich articles from 16,000 journal titles per year versus the 400-odd we processed previously. And today, our customers can search across hundreds of millions of documents.
Next-level magic
What do you see as the next huge leap forward?
I’m assuming you want me to talk about GenAI — large language models, LLMs? Well, let’s just say we’re exploring various exciting avenues with various partners. But we have to be very diligent and continue to put quality first, since our customers rely on us to meet regulatory expectations around transparency, reproducibility, explainability, etcetera — none of which are immediate strong suits for LLMs. That said, do stay tuned for future announcements!
But meanwhile, if I can backtrack a moment. We’ve been talking about the digital journey from content (such as the books and journals on ScienceDirect) to data (such as the facts and concepts indexed from those books and journals for easier search and discovery). But the power of machine learning can also be used for predictions — to, in effect, teach the machine chemistry by feeding it massive volumes of complex chemical reactions and facts.
For example, machine learning models can not only identify well-established paths to a given compound as well as a trained chemist can, but also suggest previously unknown paths to synthesize that compound — paths that can be cheaper, faster, and more environmentally friendly.
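As a rough conceptual toy only (the real capability relies on predictive models trained on curated reaction data, not a hand-written table), a template-based retrosynthesis search can be sketched as a recursive walk backwards from a target compound to purchasable building blocks. The reaction templates and building blocks below are entirely hypothetical.

```python
# Conceptual toy, not the Reaxys Predictive Retrosynthesis model: a tiny
# template-based search working backwards from a target compound until every
# branch ends in a purchasable building block.
TEMPLATES = {
    # product: list of possible precursor sets (hypothetical disconnections)
    "biaryl": [("aryl_halide", "aryl_boronic_acid")],
    "aryl_boronic_acid": [("aryl_halide",)],
}
BUILDING_BLOCKS = {"aryl_halide"}  # compounds assumed to be purchasable

def retrosynthesize(target, depth=3):
    """Return routes, each a list of (product, precursors) steps, ending in building blocks."""
    if target in BUILDING_BLOCKS:
        return [[]]                      # already purchasable: empty route
    if depth == 0 or target not in TEMPLATES:
        return []                        # dead end within the search horizon
    routes = []
    for precursors in TEMPLATES[target]:
        branch_routes = [retrosynthesize(p, depth - 1) for p in precursors]
        if all(branch_routes):           # every precursor must be reachable
            # For brevity, keep only the first route found for each precursor.
            tail = [step for branch in branch_routes for step in branch[0]]
            routes.append([(target, precursors)] + tail)
    return routes

for route in retrosynthesize("biaryl"):
    print(" <= ".join(f"{product} from {' + '.join(pre)}" for product, pre in route))
```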
Already, the Reaxys Predictive Retrosynthesis tool, which is now fully embedded in Reaxys, helps even very experienced chemists by suggesting new synthetic paths using a range of best-in-class, proven predictive models. Meanwhile, we are continuing to work with a number of leaders in the field of predictive retrosynthesis, including eminent researchers like Prof Mark Waller, whose widely cited Nature paper, “Planning chemical syntheses with deep neural networks and symbolic AI,” provided some of the foundations for our capabilities in Reaxys.
And this process of predictive modeling can continuously improve as we add more enriched data, and as our human experts validate the outputs of the machine learning models. There’s still so much more opportunity in this space, and research continues to move forward all the time.
What projects excite you most in terms of furthering this continuous improvement?
Well, without giving away too many company secrets, we have made some fantastic advances in recent years in mining the full text of chemistry-related journal articles. And since Elsevier acquired SciBite a couple of years ago, we have brought their powerful semantic technologies into our “data science toolkit” for further advances, particularly in the biomedical space. We’re also able to combine our data with this semantic tooling for the benefit of mutual customers — for example, helping them to enrich their content and then align it with Reaxys data to create a best-in-class dataset for model training. But in general, we will continue to expand our automation capabilities while also moving deeper into the predictive chemistry space I mentioned earlier.
We’re also continuing our very productive research collaboration with Prof Karin Verspoor and her doctoral team in Australia, building on the successes of our ChEMU (Cheminformatics Elsevier Melbourne University) collaboration, which explored ways to automate the extraction of information about chemical reactions from chemistry patents. As a result, we are making real headway on the many and varied challenges of training a machine to read tables and accurately extract information from them — which is very valuable for chemists and other researchers, as you can well imagine.
Does moving forward mean partnering?
Absolutely. With all these different directions and partnerships taking place in this predictive space, we are part of a large research community. And certainly, the life sciences require more collective engagement than many other sectors — in terms of involving academia, the business world, policymakers and regulatory bodies.
And we are very much a part of this community — whether it’s through commercial and academic research partnerships, supporting researchers at all stages of their careers, expanding our internship program, or inspiring the younger generations with initiatives such as Amsterdam Data Science. Only together can we really build a healthier future.