
AI in R&D

Explore the building blocks of effective AI projects


AI building blocks

Across industry, R&D leaders are exploring how artificial intelligence (AI) can reinvent scientific research. AI, including machine learning (ML) and generative AI (GAI), has the potential to considerably shorten cost- and resource-intensive processes in many disciplines.

To meet this AI future head-on, organizations need to combine high-quality data with the best data tooling and data infrastructure. This solid foundation for AI comprises four essential building blocks:

  1. The most relevant and comprehensive data

  2. Domain expertise from humans in the loop

  3. Robust data structuring and rich ontologies

  4. Data management and enrichment

By applying these building blocks, R&D organizations will build a solid innovation base and counteract potential pitfalls in the rush to embrace AI.

The right data for AI

The nuance of scientific questions in fields from drug discovery to materials development demands high-quality, verified training data. The right data provides greater confidence in AI outcomes.

Sourcing the best quality data

AI models require high-quality data from a variety of sources that are relevant to research questions. Sources can include third-party databases/datasets, open source/public access databases, published literature, and internal and proprietary data. For example, a predictive AI chemistry model requires a breadth of inputs that includes not only internal data, such as information on failed reactions, but also published literature. A model informed by incomplete data will produce inferior results whose shortcomings may not be immediately identified, leading to incorrect and expensive decisions.

Some estimates suggest that asset damage stemming from decisions made by AI agents without human oversight could reach $100 billion by 2030.1 This is a significant concern for businesses, and a key reason many want AI to serve as a co-pilot rather than a “black box” autopilot. Moreover, large language models (LLMs) and GAI models are known to “hallucinate” to fill gaps in data. There are also potential safety and health risks of missing important information in areas such as drug discovery.

The importance of data provenance

Reusability issues have been a common stumbling block in R&D for many decades. Researchers operating in highly regulated industries such as life sciences must be confident in the provenance of the high-quality datasets they source for use in AI models. This helps ensure reusability and provides an auditable data trail of evidence-based decision making to regulatory agencies. Organizations require detailed background on the source of datasets, which means creating policies and practices that codify responsible AI practices and clearly document the origin of datasets from third-party providers as well as internal data.2 This is essential for producing trustworthy, verifiable and reusable research.3

Co-occurrences of concepts

On the left: The number of gene-disease-compound relationships in Alzheimer’s disease is shown. In abstracts only, there are 23 co-occurrences. With abstracts plus full text, that number increases to 117 co-occurrences.

On the right: Elsevier compared 23 million PubMed abstracts and 2.5 million full-text articles from Elsevier’s biological publications. 57% of co-occurrences first published 2003-2012 (2.6 million) were found only in the full-text articles, not in the abstracts.

Avoiding the use of only abstracts

Scientific literature is an essential source of data for AI model building. However, models should be trained on the full text of articles and not only on abstracts. Often, a paper’s abstract does not represent all the findings contained in an article, and certain types of information — such as adverse events, mutations and cell processes — are less likely to be included in an abstract. Moreover, when a data pipeline pulls only from abstracts, co-occurrences that only appear in the full text and can take years to appear in other abstracts are missed.4

Over-reliance on public and open access data

Repositories of publicly available data and open access literature are often used for AI model training. These sources are valuable but limited; relying only on publicly available or open access data risks missing important information contained elsewhere. For example, one study found that 45% of relationships relevant to drug repurposing projects for rare diseases can be found only in controlled access sources.

Analysis of the time taken for co-occurrences that appear in the body of an article to appear in the abstracts of other articles; a data pipeline that pulls only from abstracts misses these co-occurrences. (Source: Elsevier)

Checklist: The most relevant and comprehensive data

☑ The quantity and diversity of data ensure confidence in model training.
☑ The data sources are high quality, up to date and verified.
☑ There is no over-reliance on abstracts and open access data.

Domain expertise and knowledge from humans in the loop

Scientific research demands domain expertise. No off-the-shelf, general-purpose AI can solve specific research questions and problems. Similarly, AI for AI’s sake will only lead to time and money spent without relevant business outcomes.

Identifying and defining problems that will benefit from AI

The decisions and predictions that researchers make — such as which protein site a molecule will bind to — involve precise variations and require a high degree of accuracy and specificity. Finding answers to these questions starts with investing in domain-specific knowledge to identify which use cases can benefit from the application of AI in the first place.

For example, technology experts who understand complex metadata used in a field such as biology can construct relevant models. Metadata could include “…the solubility and stability of compounds, possible contaminants, variation in temperature and humidity during the experiments, sources of reagents and other materials, and expiration dates.”2

Determining required capabilities and research context

Data scientists who also have domain expertise understand the context of questions asked in relation to the data available. Their insights enable research organizations to better understand which AI approaches will be effective and which are likely to fail.3 By tapping into expert technical knowledge, companies avoid spending time and money building solutions that will not actually solve problems. Domain experts further ensure vocabularies and ontologies are constructed to structure datasets so that queries return relevant results without missing essential data.

Collaborating with researchers to access relevant datasets

Technologists with knowledge of a scientific domain can advise research organizations on where to source the best datasets to build a specialist model. They can then further refine and improve datasets to make them machine-readable because they have the chemistry, biology and materials understanding to know which facts are relevant. The other important area of domain expertise is a comprehensive understanding of data licensing, copyright and intellectual property legislation. This avoids legal or regulatory issues emerging — for example, companies may be unaware they lack text-mining rights on a third-party dataset.

Potential sources of datasets to power AI projects in R&D organizations include public datasets, contract research organizations (CROs), service providers, software vendors, academic groups, commercial data providers, regulatory authorities and more.

Potential sources of datasets to power AI projects in R&D organizations (Source: SciBite)

Checklist: Domain expertise and knowledge from humans in the loop

☑ R&D experts are working with AI as a co-pilot and feel augmented by technology.
☑ You have identified specific use cases and workflows that will benefit from AI.
☑ You have written and implemented responsible AI and robust data provenance policies.
☑ Domain experts and data scientists collaborate on data access and data skills as required.
☑ You have implemented metrics and KPIs to measure and quantify outcomes.

Robust data structuring and rich ontologies

Complex datasets from multiple sources used in scientific R&D workflows require structuring and normalizing before insights can be revealed and applied. It is not a matter of simply taking the data and plugging it in.

The power of ontologies in AI

Much of the data that R&D organizations source are not AI-ready. Data are siloed and stored in myriad formats with insufficient metadata, making it difficult to retrieve, analyze and use in AI applications. Ontologies are human-generated, machine-readable descriptions of categories and their associated sub-types.5 Ontologies also define semantic relationships to other classes and capture synonyms, which is essential where there are multiple ways to describe the same entity in scientific literature and other datasets.

In the life sciences, for example, the same gene can be referred to in different ways. Consider PSEN1, which can also appear as PSNL1 or Presenilin-1. The controlled language and vocabulary delivered by rich ontologies harmonizes data to make it ready for AI model building.
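As a minimal illustration of how such a controlled vocabulary harmonizes synonyms, the sketch below maps the gene names mentioned above onto a single canonical symbol before the data reaches a model. The hard-coded dictionary and the normalize_gene function are hypothetical stand-ins for a curated ontology service, not a real implementation.

```python
# Minimal sketch: harmonizing gene synonyms with a small, hand-built vocabulary.
# In practice the mapping would come from a curated ontology rather than a
# hard-coded dictionary; the names below are only illustrative.

SYNONYMS = {
    "psen1": "PSEN1",
    "psnl1": "PSEN1",
    "presenilin-1": "PSEN1",
    "presenilin 1": "PSEN1",
}

def normalize_gene(mention: str) -> str:
    """Map a raw gene mention to its canonical symbol, if it is known."""
    return SYNONYMS.get(mention.strip().lower(), mention)

mentions = ["PSNL1", "Presenilin-1", "PSEN1", "APOE"]
print([normalize_gene(m) for m in mentions])
# ['PSEN1', 'PSEN1', 'PSEN1', 'APOE'] -- one entity, one label
```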

Constructing domain-specific taxonomies and knowledge graphs

Whereas ontologies define multidimensional relationships, taxonomies define and group classes within a single specific domain. Taxonomies and ontologies are used in the creation of knowledge graphs, a powerful data science representation that connects entities and relationships into a network of facts. Knowledge graphs are a purpose-built solution that can handle domain-specific terminology and deliver results that go beyond the “flat” search of a relational database. There can be considerable interplay between knowledge graphs and LLMs to the benefit of researchers.6 LLMs can aid in the generation of a knowledge graph and lower the barrier to interrogating it, enabling users of all experience levels to benefit.
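At its simplest, a knowledge graph can be sketched as a set of subject-relation-object triples that can then be traversed for connected facts. The sketch below is only illustrative; the entities, relations and the facts_about helper are invented examples, not data or tooling from this article.

```python
# Minimal sketch of a knowledge graph as subject-relation-object triples.
# The entities and relations below are illustrative, not real project data.

triples = [
    ("PSEN1",      "associated_with", "Alzheimer's disease"),
    ("Compound X", "inhibits",        "PSEN1"),
    ("Compound X", "studied_in",      "Phase I trial"),
]

def facts_about(entity, relation=None):
    """Return every triple touching an entity, optionally filtered by relation."""
    return [
        (s, r, o) for (s, r, o) in triples
        if entity in (s, o) and (relation is None or r == relation)
    ]

# Traversing relationships goes beyond a "flat" keyword lookup:
for fact in facts_about("PSEN1"):
    print(fact)
```

In production settings, such graphs are typically stored in a dedicated graph database and populated from semantically enriched text rather than hand-written triples.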

Read more about knowledge graphs and their role in R&D.

The role of data science and technology experts

Structuring data using ontologies and taxonomies is highly specialized work. Few R&D organizations have employees with the right mix of skills needed for these tasks, and many organizations lack the technological maturity at the required scale.

For example, organizations may have the right dataset and knowledgeable chemists or biologists who can understand the inputs. These people are experts in their field but have little experience in data-specific tasks. External data scientists play a crucial role in aiding companies to structure data for successful AI, particularly for niche and specific use cases, such as drug repurposing or new materials development. Collaborating with technology experts also ensures that proprietary data and IP are well protected and remain within the organization’s “firewall,” preventing the inadvertent sharing of data in insecure public platforms.

Systems of description range from the weak semantics of a list to the strong semantics of an ontology, which is a formal description of a domain with classes, relationships and logical axioms.

Implementing taxonomies and ontologies improves data quality and usability. (Source: Copyright Clearance Center and SciBite)

Checklist: Robust data structuring and rich ontologies

☑ You have implemented a framework for data integration and normalization.
☑ You have used domain-specific ontologies to structure data.
☑ You have constructed and applied domain-specific taxonomies.

Data management and enrichment

Continuous data management and enrichment ensures long-term AI success and reduces the time and resources needed to clean and prepare data for model building.

Mitigating challenges around data integration

Datasets arrive at different levels of AI readiness and in multiple formats and structures. For example, formats could include experimental data in an electronic lab notebook, real-world data from a clinical study, textual references from scientific literature, instrument readings from a machine sensor, or patent data. R&D teams must embed data management practices that normalize and integrate both internal and external data. Creating such a data lifecycle means investing in frameworks for data management, including ontologies and taxonomies.
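As an illustration of what normalizing heterogeneous inputs can look like, the sketch below maps two differently shaped records, one from an electronic lab notebook and one from published literature, onto a single common schema. The field names, record contents and helper functions are hypothetical and stand in for a real data management framework.

```python
# Minimal sketch: mapping two differently shaped records onto one common schema.
# The field names and record contents are hypothetical.

from datetime import date

def from_eln(record: dict) -> dict:
    """Normalize an electronic-lab-notebook entry."""
    return {
        "source": "ELN",
        "compound": record["cmpd_id"],
        "result": record["outcome"],
        "recorded": date.fromisoformat(record["run_date"]),
    }

def from_literature(record: dict) -> dict:
    """Normalize a finding extracted from published literature."""
    return {
        "source": "literature",
        "compound": record["substance"],
        "result": record["finding"],
        "recorded": date(record["year"], 1, 1),  # only the year is known
    }

unified = [
    from_eln({"cmpd_id": "CMP-001", "outcome": "no reaction", "run_date": "2024-03-14"}),
    from_literature({"substance": "CMP-001", "finding": "binds target", "year": 2019}),
]
print(unified)
```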

Data and semantic enrichment to enhance results

Semantic enrichment empowers R&D organizations to release the full potential of data in structured and unstructured public and proprietary datasets. The process transforms text into clean, contextualized data, free from ambiguities, through annotation, the tagging of concepts and the addition of metadata. For example, semantic enrichment software can recognize and extract relevant terms or patterns in text and harmonize synonyms, such as “heart attack” and “myocardial infarction.” This approach eliminates “noise” and reduces AI hallucinations.
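A minimal sketch of this kind of enrichment is shown below: recognizing known terms in free text and tagging each with a preferred concept label, so that “heart attack” and “myocardial infarction” resolve to the same concept. The vocabulary and matching logic are deliberately simplified stand-ins for dedicated semantic enrichment software.

```python
import re

# Simplified vocabulary: surface form -> preferred concept label (illustrative only).
VOCAB = {
    "heart attack": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}

PATTERN = re.compile(r"\b(" + "|".join(re.escape(t) for t in VOCAB) + r")\b",
                     re.IGNORECASE)

def annotate(text: str):
    """Return (surface form, preferred concept) pairs found in the text."""
    return [(m.group(0), VOCAB[m.group(0).lower()]) for m in PATTERN.finditer(text)]

print(annotate("Patients with a prior heart attack were excluded; "
               "myocardial infarction was a secondary endpoint."))
# Both mentions resolve to the same concept: 'myocardial infarction'
```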

The role of custom services in data management

In the same way that building ontologies and taxonomies typically requires the help of outside experts, so does the process of managing and semantically enriching datasets. Domain experts can create training datasets and build context-aware custom vocabularies that implement a shared language across all research functions. Vocabularies can include an organization’s bespoke terms, such as product names, as well as recognized concepts and terms used in its scientific discipline and industry, including by regulatory bodies. This approach ensures that R&D organizations use new data in their AI applications and unlock the value from legacy data that may go back many years.7

Checklist: Data management and enrichment

☑ There is a clear strategy for continuous data life cycle management.
☑ Datasets are semantically enriched and contextualized.
☑ You are able to effectively use legacy and existing data in new applications.

  1. What Generative AI Means for Business, Gartner. https://www.gartner.com/en/insights/generative-ai-for-business

  2. Vladimir A. Makarov, Terry Stouch, Brandon Allgood, Chris D. Willis, Nick Lynch, Best Practices for artificial intelligence in life sciences research, Drug Discovery Today, Vol 25, Issue 5, 2021. https://www.sciencedirect.com/science/article/abs/pii/S1359644621000477

  3. Andreas Holzinger, Katharina Keiblinger, Petr Holub, Kurt Zatloukal, Heimo Müller, AI for life: Trends in artificial intelligence for biotechnology, New Biotechnology, Volume 74, 2023. https://www.sciencedirect.com/science/article/pii/S1871678423000031

  4. Full-text scientific literature data from Elsevier, Elsevier.com. https://www.elsevier.com/en-gb/solutions/datasets/full-text-journals-data

  5. Ann-Marie Roche, Harnessing ontologies for pharma: Dr Jane Lomax on the synergy of AI and scientific expertise, Elsevier Connect, February 2024. https://www.elsevier.com/en-gb/connect/harnessing-ontologies-for-pharma-dr-jane-lomax-on-the-synergy-of-ai-and-scientific-expertise

  6. Joe Mullen, How knowledge graphs can supercharge drug repurposing, Elsevier Connect, February 2024. https://www.elsevier.com/en-gb/connect/how-knowledge-graphs-can-supercharge-drug-repurposing

  7. Ann-Marie Roche, Worth its weight in gold: getting your legacy data in order, Elsevier Connect, March 2024. https://www.elsevier.com/connect/worth-its-weight-in-gold-getting-your-legacy-data-in-order