Skip to main content

Unfortunately we don't fully support your browser. If you have the option to, please upgrade to a newer version or use Mozilla Firefox, Microsoft Edge, Google Chrome, or Safari 14 or newer. If you are unable to, and need support, please send us your feedback.

We'd appreciate your feedback.Tell us what you think!

Publish with us

AI in R&D

Explore the building blocks of effective AI projects

Two men viewing a large screen of information

AI building blocks

Across industries, R&D leaders are exploring how artificial intelligence (AI) can reinvent scientific research. AI, including machine learning (ML) and generative AI (GAI), has the potential to shorten cost- and resource-intensive processes in many disciplines, including:

Drug discovery
Horizon scanning and competitive intelligence
Materials discovery
Pre-clinical and clinical R&D
Post-market surveillance

There is real business impact being discovered with AI, and companies at the leading edge will reap the benefits. To meet this AI future head-on, organizations need to combine high-quality data with the besttooling and data infrastructure. This solid foundation for AI comprises four essential building blocks:

The most relevant and comprehensive data
Domain expertise from humans in the loop
Robust data structuring and rich ontologies
Data management and enrichment

By applying these building blocks, R&D organizations will build a solid innovation base and counteract potential pitfalls in the rush to embrace AI.

Read the GenAI report

The right data for AI

The nuance of scientific questions in fields from drug discovery to materials development demands high-quality, verified training data. The right data provides greater confidence in AI outcomes.

Sourcing the best quality data

AI models require high-quality data from a variety of sources that are relevant to research questions. Sources can include third-party databases/datasets, open source/public access databases, published literature, and internal and proprietary data. For example, a predictive AI chemistry model requires a breadth of inputs that includes not only internal data, such as information on failed reactions, but also published literature. A model informed by incomplete data will produce inferior results whose shortcomings may not be immediately identified, leading to incorrect and expensive decisions.

AI without human oversight poses significant risk, and could lead to real losses for businesses that are not judicious with AI implementation. LLMs and GAI models are known to “hallucinate” to fill gaps in data. There are also potential safety and health risks of missing important information in areas such as drug discovery.

The importance of data provenance

Reusability issues have been a common stumbling block in R&D for many decades. Researchers operating in highly regulated industries such as life sciences must be confident in the provenance of the high-quality datasets they source for use in AI models. This helps ensure reusability and provides an auditable data trail of evidence-based decision-making to regulatory agencies. Organizations require detailed background on the source of datasets, which means creating policies and practices that codify responsible AI practices and clearly document the origin of datasets from third-party providers as well as internal data. This is essential for producing trustworthy, verifiable and reusable research.

Co-occurrences of concepts — On the left: The number of gene-disease-compound relationships in Alzheimer’s disease are shown. In abstracts only, there are 23 co-occurrences. With abstracts plus full text, that number increases to 117 co-occurrences.
On the right: Elsevier compared 23 million PubMed abstracts and 2.5 million full-text articles from Elsevier's biological publications. 57% of co-occurrences first published 2003-2012 (2.6 million) were found only in the full-text articles, not in the abstracts.

Avoiding the use of only abstracts

Scientific literature is an essential source of data for AI model building. However, models should be trained on the full text of articles and not only on abstracts. Often, a paper’s abstract does not represent all the findings contained in an article, and certain types of information — such as adverse events, mutations and cell processes — are less likely to be included in an abstract. Moreover, when a data pipeline pulls only from abstracts, co-occurrences that only appear in the full text and can take years to appear in other abstracts are missed.

Over-reliance on public and open access data

Repositories of publicly available data are often used for AI training. Similarly, open access literature and data can be employed for model training. These sources are valuable but limited; using only publicly available or open access data risks missing important information contained elsewhere. For example, one study found that 45% of relationships relevant to drug repurposing projects for rare diseases can be found only in controlled access sources.

A chart showing how if your data pipeline only pulls data from abstracts, you will miss co-occurrences that only appear in the full text. — Analysis of time taken for co-occurrences that appear in the body of an article to appear in the abstracts of other articles. (Source: Elsevier)

Domain expertise and knowledge from humans in the loop

Scientific research demands domain expertise. Similarly, AI success demands subject matter expertise not just in your research area, but in how AI can help solve business problems.

Identifying and defining problems that will benefit from AI

The decisions and predictions that researchers make — such as which protein site a molecule will bind to — involve precise variations and require a high degree of accuracy and specificity. Finding answers to these questions starts with investing in domain-specific knowledge to identify which use cases can benefit from the application of AI in the first place.

Determining required capabilities and research context

Data scientists who also have domain expertise understand the context of questions asked in relation to the data available. Their insights enable research organizations to better understand which AI approaches will be effective and which are likely to fail. (Holzinger et al.) By tapping into expert technical knowledge, companies avoid spending time and money building solutions that will not actually solve problems. Domain experts further ensure vocabularies and ontologies are constructed to structure datasets so that queries return relevant results without missing essential data.

Collaborating with researchers to access relevant datasets

Technologists with knowledge of a scientific domain can advise research organizations on where to source the best datasets to build a specialist model. They can then further refine and improve datasets to make them machine-readable because they have the chemistry, biology and materials understanding to know which facts are relevant. The other important area of domain expertise is a comprehensive understanding of data licensing, copyright and intellectual property legislation. This avoids legal or regulatory issues emerging — for example, companies may be unaware they lack text-mining rights on a third-party dataset.

Potential sources of datasets to power AI projects in R&D organizations include public datasets, CROs, service providers, software vendors, academic groups, commercial data providers, regulatory authorities and more. — Potential sources of datasets to power AI projects in R&D organizations (Source: SciBite)

Robust data structuring and rich ontologies

Complex datasets from multiple sources used in scientific R&D workflows require structuring and normalizing before insights can be revealed and applied. It is not a matter of simply taking the data and plugging it in.

The power of ontologies in AI

Much of the data that R&D organizations source are not AI-ready. Data are siloed and stored in myriad formats with insufficient metadata, making it difficult to retrieve, analyze and use in AI applications. Ontologies are human-generated, machine-readable descriptions of categories and their associated sub-types. Ontologies also define semantic relationships to other classes and capture synonyms, which is essential where there are multiple ways to describe the same entity in scientific literature and other datasets.

In the life sciences, for example, the same gene can be referred to in different ways. Consider PSEN1, which can also be PSNL1 or Presenilin-1. The controlled language and vocabulary delivered by rich ontologies harmonizes data to make it ready for AI model building.

Constructing domain-specific taxonomies and knowledge graphs

Whereas ontologies define multidimensional relationships, taxonomies define and group classes within a single specific domain. Taxonomies and ontologies are used in the creation of knowledge graphs, a powerful method of data science representation that connects data to visually represent a network of facts using entities and relationships. Knowledge graphs are a purpose-built solution that can handle domain-specific terminology and deliver results that go beyond the “flat” search of a relational database. There can be considerable interplay between knowledge graphs and large language models (LLMs) to the benefit of researchers.⁶ LLMs aid in the generation of a knowledge graph and lower the barrier to entry when it comes to the interrogation of graphs, enabling users of all experience levels to benefit.

Read more about knowledge graphs and their role in R&D.

The role of data science and technology experts

Structuring data using ontologies and taxonomies is highly specialized work. Few R&D organizations have employees with the right mix of skills needed for these tasks, and many organizations lack the technological maturity at the required scale.

For example, organizations may have the right dataset and knowledgeable chemists or biologists who can understand the inputs. These people are experts in their field but have little experience in data-specific tasks. External data scientists play a crucial role in aiding companies to structure data for successful AI, particularly for niche and specific use cases, such as drug repurposing or new materials development. Collaborating with technology experts also ensures that proprietary data and IP are well protected and remain within the organization’s “firewall,” preventing the inadvertent sharing of data in insecure public platforms.

Systems of description range from the weak semantics of a list to the strong semantics of an ontology, which is a formal description of a domain with classes, relationships and logical axioms. — Implementing taxonomies and ontologies improves data quality and usability. (Source: Copyright Clearance Center and SciBite)

Data management and enrichment

Continuous data management and enrichment ensures long-term AI success and reduces the time and resources needed to clean and prepare data for model building.

Mitigating challenges around data integration

Datasets come at different levels of AI readiness and in multiple formats and structures. For example, formats could include experimental data in an electronic lab notebook, real-world data from a clinical study, textual references from scientific literature, instrument readings from a machine sensor, or patent data. R&D teams must embed data management practices that normalize and integrate both internal and external data. Creating such a data lifecycle means investing in frameworks for data management, including ontologies and taxonomies.

Data and semantic enrichment to enhance results

Semantic enrichment empowers R&D organizations to release the full potential of data in structured and unstructured public and proprietary datasets. The process transforms text into clean, contextualized data, free from ambiguities, through annotation, tagging concepts and metadata. For example, semantic enrichment software can recognize and extract relevant terms or patterns in text and harmonize synonyms, such as “heart attack” and “myocardial infarction.” This approach eliminates “noise” and reduces AI hallucinations.

The role of custom services in data management

In the same way that building ontologies and taxonomies typically requires the help of outside experts, so does the process of managing and semantically enriching datasets. Domain experts can create training datasets and build context-aware custom vocabularies that implement a shared language across all research functions. Vocabularies can include an organization’s bespoke terms, such as product names, as well as recognized concepts and terms used in its scientific discipline and industry, including by regulatory bodies. This approach ensures that R&D organizations use new data in their AI applications and unlock the value from legacy data that may go back many years.

Agentic AI: Giving AI more autonomy

One of the hottest AI topics in 2025 is Agentic AI; models that have been given more agency to make decisions on behalf of the user. Obviously, this presents enormous potential, but it only increases the need for organizations to make careful decisions and stay in control of their AI strategy.

In life sciences in particular, agentic AI can significantly streamline target identification, lead identification and more. However, the consequences of an incorrect decision can be catastrophic. Organizations pursuing agentic AI need to consider the following:

Start with the problem: What are you actually trying to solve?
Transparency is paramount: Keep asking yourself questions to maintain clarity: Why were decisions made? What tools were used? What data was reviewed?
Accessibility is key: Necessary resources must be made accessible to the agent, with clear documentation.
Data rules: Outputs are only as good as the quality of data that they have access to.
Adding scientific meaning: By providing an LLM that understands language with scientific knowledge captured in ontologies, human explainability of outcomes can be provided as part of the transparency step.
LLMs or SMEs: Be careful how much autonomy you give your agent and consider that an expert’s input and review may be needed for some steps.
Start with the smallest blocks: By focusing on having a clear set of tools or functions available, one can build a multitude of different agents.
Guardrails: Make it clear what the agent can and cannot do.
Life science problems do not require purely technical solutions: The subject matter expert must remain a key part of the development of any solution.

Embracing AI research tools

In one Elsevier survey, 96% of corporate researchers believed that AI will accelerate knowledge discovery and allow them to focus on higher-value projects. Using AI for research can help address the challenges that R&D-intensive businesses face every day, while optimizing workflows through:

Time savings: AI-powered search and summarization reduce time spent sifting through literature.
Improved confidence: AI-driven insights and trusted content combine to help researchers and innovators make better-informed decisions.
Enhanced productivity: Time-intensive literature reading tasks are streamlined, freeing scientists to focus on innovation and critical projects.
Increased competition: AI-driven research can help companies innovate faster, reducing lost market opportunities.

The same survey found that 89% of researchers would use AI to generate a synthesis of articles, while 94% of clinicians said they would rely on AI to assess symptoms. The catch? These AI tools would need to be backed by trusted content, quality controls and responsible AI principles.

Discover ScienceDirect AI

ScienceDirect AI

ScienceDirect AI

Watch now

|

ScienceDirect AI

Discover, innovate and develop with confidence

For AI to deliver real value in professional settings, trust is essential. That means tools built on reliable data, rigorous governance and responsible AI practices you can depend on. Elsevier helps organizations navigate this complexity with a portfolio of solutions to support and optimize innovation outcomes. Enable faster, better decision-making across R&D initiatives with:

Trusted quality information, including peer-reviewed scientific literature, domain-specific data
Innovative technology that powers data transformation and analytical and predictive tools
Domain and data science expertise to solve complex problems with data solutions for R&D

Let’s shape progress together.

Subscribe to "Foundations for Effective AI"

Sign up to receive a quarterly email from Elsevier on AI in R&D, best practices, building blocks and more. Plus, stay up to date on upcoming webinars and events.

Critical insights for better outcomes. Learn more about datasets.

How can we help?

Lab technician reviewing data