AI in small molecule drug discovery

What medicinal and synthetic chemists need to know.

The evolution of AI in small molecule drug discovery

Traditionally, it takes a lot of time and scientific expertise to discover and synthesize a small molecule that becomes a preclinical candidate. But advances in AI are reinventing the processes involved and accelerating drug discovery. Chemists must transform alongside this shift, acquiring the skills and knowledge to apply AI in their work while collaborating more closely than ever with computational chemistry and data science teams.

In a short space of time, artificial intelligence (AI) and associated techniques like machine learning (ML) and generative AI (GenAI) have had a considerable impact on chemistry and small molecule discovery. One study1 found that, “…biotech companies using an AI-first approach have more than 150 small molecule drugs in discovery and more than 15 already in clinical trials.”

Estimates2 suggest it takes 10–15 years and can cost up to US $2.8 billion to develop a new drug, and a huge proportion (80–90%) of candidates fail in the clinic. The early discovery stage alone takes an average3 of 3–6 years and accounts for 42% of total capitalized costs in the development of a new drug. Given the cost and time of creating a drug, it’s not surprising that interest in AI-enabled small molecule drug discovery is growing.

Schematic representation of the preclinical drug discovery process highlighting opportunities for AI to be applied across the drug discovery continuum.

Opportunities for the application of AI techniques in drug discovery. Source: “Enhancing preclinical drug discovery with artificial intelligence,” Drug Discovery Today, Volume 27, 2022.

In early phase medicinal and synthetic chemistry, small molecule discovery has evolved from the use of manual testing and assays to the adoption of high throughput screening. Subsequently, computational methods were introduced alongside virtual screening, followed by the increasingly sophisticated AI and ML techniques of today.

There are multiple opportunities for the application of AI in small molecule drug discovery to increase speed, lower cost, improve success rates and boost innovation. For instance, by applying AI models to databases of known compounds and reactions during lead identification and optimization stages, medicinal and synthetic chemists can quickly carry out the following tasks and notably shorten these phases:

  • Predict/understand structure-activity relationships

  • Make accurate ADME and toxicity predictions

  • Accelerate synthesis planning for novel compounds/de novo design

  • Improve route optimization for known compounds

Timelines and events for a recent AstraZeneca project where the candidate drug (CD) had already been synthesized during the lead generation phase. Source: “Accelerated drug discovery by rapid candidate drug identification,” Drug Discovery Today, Volume 24, 2019.

The application of AI, combined with a clear understanding of the preclinical candidate profile, could unlock gains of up to 30 months. Recognizing the power and potential benefits of the approach, many pharma companies are already employing AI or are beginning to explore the use of AI in their discovery projects. With the AI-fueled small molecule drug discovery pipeline estimated4 to be expanding by almost 40% yearly, medicinal and synthetic chemists must transform the way they work.

Upskilling medicinal and synthetic chemists in the AI era

AI can’t replace medicinal and synthetic chemists’ intuition, nuance and creativity, but chemists who are AI-enabled will ultimately supersede chemists who are not.

A “human in the loop” (HITL) approach eliminates manual, time-consuming tasks in compound screening or synthesis route planning, while maintaining critical human oversight and insight. Importantly, an HITL system does not create a “black box” of unexplainable decisions. It keeps human chemists in place to validate outputs and provide feedback to AI models using their expert knowledge and experience.

Skills and areas of exploration for the AI-enabled chemist of the future

Pharmaceutical companies are at different stages in their AI adoption journey. Larger organizations with more data scientists on staff are typically more mature in applying AI. In smaller organizations with fewer internal experts, adoption can be more challenging. For all types of organizations, upskilling is critical. Both companies and chemists need a clear strategy for AI adoption or risk falling behind.

Steps to successful AI adoption for companies:

  • Understand the investment areas that have the greatest impact

  • Develop multidisciplinary teams for effective use of different areas of expertise

  • Build a strategy for the training of chemists vs. acquisition of AI skills externally

  • Create an adoption plan to manage resistance to change or fear of “new” technology

Steps for succeeding in the AI era for chemists:

  • Be open to collaboration with peers and colleagues outside of your immediate area of expertise

  • Familiarize yourself with the impact that AI will have on medicinal and synthetic chemistry

  • Learn about the key AI methods and models identified in research

  • Embrace change and the “art of the possible” to be open to new technologies

AI in drug discovery webinars

Interested in hearing from chemists and data scientists who are applying AI in small molecule drug discovery? Explore the following webinars:

Drug discovery with AI at AstraZeneca - from generative models to reaction prediction

  • Eva Nittinger, associate principal scientist in computational chemistry in Respiratory and Immunology, AstraZeneca R&D

  • Samuel Genheden, leader of the Deep Chemistry team in Discovery Sciences, AstraZeneca R&D

AI advancing drug discovery research in the pharmaceutical industry and academia

  • Raquel Rodríguez-Pérez, principal scientist at Novartis Institutes for Biomedical Research

  • Jessica Lanini, data scientist at Novartis Institutes for Biomedical Research

Connectivity between documents, structures and bioactivity data

  • Christopher Southan, FBPharmS, FRSC, Medicines Discovery Catapult

  • Anindya Ghosh Roy, product manager, Reaxys

  • Aurora Costache, customer engagement manager, Elsevier Life Sciences Professional Services Group

AI in the chemistry DMTA cycle

Designing new molecules likely to interact with a target, synthesizing them, and then testing them to identify the most promising candidates is time- and resource-intensive. There is significant potential for AI to accelerate the design-make-test-analyze (DMTA) cycle and reduce the number of iterations. At each stage of DMTA, AI can be used to:

  • Design: Predict protein structures, design de novo libraries, run virtual screening, and predict synthetic accessibility and molecular properties.

  • Make: Plan the synthesis of new molecules, predict their yield and purity, and identify problems with the synthesis process.

  • Test: Screen new molecules for their ability to interact with target proteins, predict efficacy and toxicity, and identify the most promising drug candidates.

  • Analyze: Process large volumes of test data to identify correlations and trends, and design further experiments to test the most promising drug candidates.

Design-Make-Test-Analyze cycle with assigned tasks for the different stages

Applications of AI/ML in the DMTA cycle for medicinal chemistry. Source: “Chapter 4 - Approaches using AI in medicinal chemistry,” Computational and Data-Driven Chemistry Using Artificial Intelligence, 2022.

A study5 from McKinsey identified the following gains in time and efficiency in lead identification and optimization, the most expensive and time-consuming phases of preclinical drug discovery:

  • Hit identification: 30 to 50 percent acceleration in small molecule high-throughput screening, using approaches such as molecular property prediction in an iterative screening loop (versus the existing approach of randomized selection of compounds).

  • Lead optimization: a more than twofold improvement over baseline on the key metric of “efficacy observed,” over 100 times the number of in silico experiments possible compared with previous screening, and faster design of compounds for optimization of drug delivery efficacy in lead optimization.
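To make the contrast between an iterative screening loop and randomized compound selection concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the library is a set of made-up compound IDs with a single numeric descriptor, the `assay` function is a stand-in for a real experiment, and "similarity to the nearest known hit" is a deliberately crude substitute for a trained property-prediction model. It is not the method used in the McKinsey study.

```python
import random


def random_screen(library, assay, n_tests, seed=0):
    """Baseline: test n_tests compounds chosen at random."""
    rng = random.Random(seed)
    tested = rng.sample(sorted(library), n_tests)
    hits = [c for c in tested if assay(c)]
    return tested, hits


def iterative_screen(library, assay, batch_size, rounds, seed=0):
    """Iterative loop: test a batch, then pick the next batch from the
    untested compounds closest (in descriptor space) to any hit so far."""
    rng = random.Random(seed)
    untested = sorted(library)
    rng.shuffle(untested)  # the first batch is effectively random
    tested, hits = [], []
    for _ in range(rounds):
        if hits:
            # Crude stand-in for a property model: distance to nearest hit.
            untested.sort(
                key=lambda c: min(abs(library[c] - library[h]) for h in hits)
            )
        batch, untested = untested[:batch_size], untested[batch_size:]
        for compound in batch:
            tested.append(compound)
            if assay(compound):
                hits.append(compound)
    return tested, hits


# Hypothetical toy library: 500 compounds, one descriptor value each.
rng = random.Random(42)
library = {f"cpd{i:03d}": rng.random() for i in range(500)}


def assay(compound):
    """Hypothetical assay: 'active' if the descriptor falls in a narrow window."""
    return 0.60 <= library[compound] <= 0.70


_, random_hits = random_screen(library, assay, n_tests=100)
_, guided_hits = iterative_screen(library, assay, batch_size=20, rounds=5)
print("hits with random selection:", len(random_hits))
print("hits with iterative loop:  ", len(guided_hits))
```

Both strategies spend the same experimental budget (100 tests here); the only difference is that the iterative loop uses results from earlier batches to choose later ones.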

How can we accelerate DMTA in drug discovery?

Learn how AstraZeneca applies AI to accelerate the DMTA cycle, including synthesis planning, condition prediction and molecular ideation. View the webinar: Drug discovery with AI at AstraZeneca - from generative models to reaction prediction.

AI techniques and models in small molecule discovery

Broadly, two types of AI techniques are applied today in small molecule discovery:

  • Machine and deep learning: ML algorithms are trained on large datasets and allow computers to “learn” and make predictions and decisions without being explicitly programmed. Deep learning (DL) uses artificial neural networks inspired by the human brain to “learn” complex new patterns from data. For example, ML methods are applied for predictive retrosynthesis, accelerating synthesis planning of novel molecular entities.6

  • Generative AI: GenAI generates new knowledge or content that is similar to its training data. Algorithms are trained on large datasets of existing drug molecules and learn to identify patterns and relationships common to these molecules. For example, GenAI models are applied for the generation of libraries of new molecular entities based on drug-like molecules, and to assess their synthetic accessibility scores.

Based on these techniques, several key AI models have emerged.

  • Protein structure prediction: These models predict7 the three-dimensional structure of a protein from its amino acid sequence. Protein structure prediction helps chemists understand the active site and optimize compound design to modulate desired interactions.

  • De novo molecular design for virtual library screening: Generative models are used to create compound libraries, and the chemist prompts the model with expert questions. Compared to the traditional approach of searching a large database for a small number of relevant compounds, virtual library screening proposes novel chemical compounds to create a virtual compound library.8

  • Property prediction: Using DL, these models forecast the properties of a molecule based on its structure. This is an important element of the drug discovery process because a compound’s structure determines how it interacts with the molecular mechanisms at work in the body.9

  • Quantitative structure-activity relationship (QSAR): This statistical model is based on training data that pairs chemical structures and biological activities. QSAR is used to predict the biological activity of a chemical compound from its structure – including toxicity, drug efficacy and ADME properties (absorption, distribution, metabolism, excretion).10 There are two types of QSAR models:

    • Linear QSAR models use simple mathematical formulae that assume the biological activity of a molecule is linearly related to descriptors of its chemical structure.

    • Nonlinear QSAR models use more complex mathematical equations that allow for the relationship between biological activity and chemical structure to be nonlinear.

  • Quantitative structure-property relationship (QSPR): Using machine learning, this model relates molecular structures to compound properties and accelerates the DMTA cycle. ML algorithms are applied to find structural or chemical patterns that correlate with specific compound properties. For example, QSPR models are used to predict activity against the target of interest and properties such as reactivity, solubility and adsorption.11

  • Synthetic accessibility: Synthetic accessibility models, based on ML and DL, are key post-processing filters for de novo design campaigns. These models score the ease of synthesis of compounds, allowing chemists to narrow down libraries to sets of synthetically accessible compounds.12

  • Computer-aided synthesis prediction (CASP): CASP saves medicinal and synthetic chemists time, improves accuracy, and helps control costs by reducing synthesis failures and validating proof of concept at the earliest stage. Read more about CASP in the next section.
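The difference between linear and nonlinear QSAR models described in the list above can be illustrated with a small sketch. The data below are invented for illustration (a single descriptor labelled `logp` and an activity labelled `pic50` that peaks at an intermediate descriptor value), and the k-nearest-neighbour regressor is just one simple example of a nonlinear model, chosen here because it needs no external libraries. Real QSAR work uses many descriptors and more capable models.

```python
from math import sqrt


def fit_linear(xs, ys):
    """Ordinary least squares for activity = intercept + slope * descriptor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=2):
    """Nonlinear model: average the activity of the k most similar compounds."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def loo_rmse(fit, xs, ys):
    """Leave-one-out error: refit without each compound, then predict it."""
    squared_errors = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        squared_errors.append((model(xs[i]) - ys[i]) ** 2)
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data: activity is highest at an intermediate descriptor value.
logp = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
pic50 = [7.0 - 0.5 * (x - 2.5) ** 2 for x in logp]

print("linear QSAR, leave-one-out RMSE:   ", round(loo_rmse(fit_linear, logp, pic50), 3))
print("nonlinear QSAR, leave-one-out RMSE:", round(loo_rmse(fit_knn, logp, pic50), 3))
```

On this toy dataset the straight line cannot capture the activity optimum, so the nonlinear model has the lower held-out error. On data where activity really does vary linearly with the descriptor, the simpler linear model would be the better choice.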

Predictive retrosynthesis

Getting from a desired novel molecule to an optimal synthesis route using traditional methods takes considerable effort and expertise. Also known as predictive retrosynthesis, CASP augments the retrosynthetic analysis traditionally conducted by an experienced chemist using their knowledge of chemical reactions.13 CASP combines high-quality reaction data, references and experimental procedures with ML and DL to enable medicinal and synthetic chemists to:

  • Find routes to synthesize novel compounds and optimize routes for existing chemical compounds

  • Predict synthesis routes much faster than traditional approaches; routes can be generated for novel compounds in 10 minutes

  • Access the background literature that informed predicted synthesis routes and experimental procedures to decide what chemical compound to make and how to make it

  • In sophisticated CASP solutions, determine commercial availability and pricing of starting materials as synthesis routes terminate in purchasable starting materials

  • Combine external data with internal and proprietary reaction data for highly relevant results, as well as with starting materials and stockroom compounds for ease of synthesis
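The idea of routes that "terminate in purchasable starting materials" can be sketched as a recursive search. The example below is a toy under stated assumptions: the molecule names, the one-step disconnections in `RULES`, and the prices in `STOCK` are all hypothetical, and the disconnections are hard-coded rather than proposed by an ML model trained on reaction data as they would be in a real CASP tool.

```python
# Hypothetical one-step disconnections: product -> alternative precursor sets.
RULES = {
    "target": [("amide_acid", "amine_A"), ("intermediate_X",)],
    "amide_acid": [("ester_B",)],
    "intermediate_X": [("bromide_C", "boronate_D")],
}

# Hypothetical purchasable building blocks with a price per gram.
STOCK = {"amine_A": 12.0, "ester_B": 3.5, "bromide_C": 20.0, "boronate_D": 45.0}


def find_routes(molecule, rules, stock, max_depth=5):
    """Return every route whose leaves are all purchasable.

    A route is a list of (product, precursors) steps; an empty route means
    the molecule can simply be bought.
    """
    if molecule in stock:
        return [[]]
    if max_depth == 0 or molecule not in rules:
        return []  # dead end: cannot buy it and cannot disconnect it
    routes = []
    for precursors in rules[molecule]:
        partial = [[(molecule, precursors)]]
        for precursor in precursors:
            sub_routes = find_routes(precursor, rules, stock, max_depth - 1)
            # Every precursor must be reachable, so combine all sub-routes.
            partial = [route + sub for route in partial for sub in sub_routes]
        routes.extend(partial)
    return routes


def starting_materials(route, stock):
    """Purchasable compounds that a route ends in."""
    return sorted({p for _, precursors in route for p in precursors if p in stock})


def route_cost(route, stock):
    """Sum of list prices of the starting materials (ignores quantities)."""
    return sum(stock[m] for m in starting_materials(route, stock))


routes = find_routes("target", RULES, STOCK)
cheapest = min(routes, key=lambda r: route_cost(r, STOCK))
print("routes found:", len(routes))
print("cheapest starting materials:", starting_materials(cheapest, STOCK))
print("cost:", route_cost(cheapest, STOCK))
```

Adding proprietary reactions or stockroom compounds, as described in the last bullet above, corresponds to extending `RULES` and `STOCK` in this sketch.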

A tool for retrosynthetic analysis

Watch a short demo of a tool that offers the ability to edit retrosynthesis parameters and choose libraries of building blocks. See published and predicted routes for a target molecule in one view and easily identify the starting materials needed for synthesis.

Reaxys Predictive Retrosynthesis

Checklist for successful predictive retrosynthesis

The following checklist can help chemists apply predictive retrosynthesis successfully:

☑ Do you understand the types/variety of data you need access to?

☑ Are you able to access all of these data sources both internally and externally?

☑ Are you confident data is trustworthy and high quality?

☑ Can you define the problem and prediction task clearly?

☑ How will you validate and determine the predictive performance of a model?

☑ Do you understand the model’s interface parameters so you can build a search that leads to desired results?

☑ Do you fully understand the structure and functionality of the target molecule?

☑ Have you identified any potential challenges associated with its synthesis?

☑ Have you decided parameters for selecting a synthesis route, including cost, yield, environmental impact?

☑ Have you identified partners and experts to collaborate with?
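As one way to act on the checklist item about route-selection parameters, the sketch below ranks candidate routes with a simple weighted score over cost, yield and environmental impact. The route names, numbers, the `waste` measure and the weights are all hypothetical; how to weigh these criteria is a project decision, and this is only one possible scoring scheme.

```python
def normalise(values, higher_is_better):
    """Scale values to 0..1 so that 1 is always the most desirable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank_routes(routes, weights):
    """Return (route name, score) pairs, best first."""
    names = list(routes)
    cost = normalise([routes[n]["cost"] for n in names], higher_is_better=False)
    yld = normalise([routes[n]["yield"] for n in names], higher_is_better=True)
    waste = normalise([routes[n]["waste"] for n in names], higher_is_better=False)
    scores = {
        n: weights["cost"] * c + weights["yield"] * y + weights["waste"] * w
        for n, c, y, w in zip(names, cost, yld, waste)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical candidate routes: cost per gram of product, overall yield
# (fraction), and waste generated per kg of product as an impact proxy.
candidate_routes = {
    "route_A": {"cost": 120, "yield": 0.45, "waste": 30},
    "route_B": {"cost": 80, "yield": 0.30, "waste": 55},
    "route_C": {"cost": 200, "yield": 0.70, "waste": 20},
}

balanced = {"cost": 0.4, "yield": 0.4, "waste": 0.2}
for name, score in rank_routes(candidate_routes, balanced):
    print(name, round(score, 3))
```

Changing the weights changes the ranking, which is the point of agreeing on them before comparing routes: with all the weight on cost, the cheapest route comes first regardless of yield or waste.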


  1. Madura K.P. Jayatunga, Wen Xie, Ludwig Ruder, Ulrik Schulze, Christoph Meier, AI in small-molecule drug discovery: a coming wave?, Nature Reviews Drug Discovery, 2022.

  2. Olivier J. Wouters, Martin McKee, Jeroen Luyten, Research and Development Costs of New Drugs, JAMA, 2020.

  3. Marcela Vieira, Suerie Moon, Costs of Pharmaceutical R&D, Knowledge Portal on Innovation and Access to Medicines, Graduate Institute of Geneva, 2020.

  4. Madura K.P. Jayatunga, Wen Xie, Ludwig Ruder, Ulrik Schulze, Christoph Meier, AI in small-molecule drug discovery: a coming wave?, Nature Reviews Drug Discovery, 2022.

  5. Alex Devereson, Erwin Idoux, Matej Macak, Navraj Nagra, and Erika Stanzl, AI in biopharma research: A time to focus and scale, McKinsey, 2022.

  6. Andrea Volkamer, Sereina Riniker, Eva Nittinger, Jessica Lanini, Francesca Grisoni, Emma Evertsson, Raquel Rodríguez-Pérez, Nadine Schneider, Machine learning for small molecule drug discovery in academia and industry, Artificial Intelligence in the Life Sciences, Volume 3, 2023.

  7. Bin Huang, Lupeng Kong, Chao Wang, Fusong Ju, Qi Zhang, Jianwei Zhu, Tiansu Gong, Haicang Zhang, Chungong Yu, Wei-Mou Zheng, Dongbo Bu, Protein Structure Prediction: Challenges, Advances, and the Shift of Research Paradigms, Genomics, Proteomics & Bioinformatics, 2023.

  8. Megan Stanley, Marwin Segler, Fake it until you make it? Generative de novo design and virtual screening of synthesizable molecules, Current Opinion in Structural Biology, Volume 82, 2023.

  9. Geethu S., Vimina E.R., Protein Secondary Structure Prediction Using Cascaded Feature Learning Model, Applied Soft Computing, Volume 140, 2023.

  10. Samuel J. Belfield, James W. Firman, Steven J. Enoch, Judith C. Madden, Knut Erik Tollefsen, Mark T.D. Cronin, A review of quantitative structure-activity relationship modelling approaches to predict the toxicity of mixtures, Computational Toxicology, Volume 25, 2023.

  11. Jian-Feng Zhong, Abdul Rauf, Muhammad Naeem, Jafer Rahman, Adnan Aslam, Quantitative structure-property relationships (QSPR) of valency based topological indices with Covid-19 drugs and application, Arabian Journal of Chemistry, Volume 14, Issue 7, 2021.

  12. Maud Parrot, Hamza Tajmouati, Vinicius Barros Ribeiro da Silva, Brian Ross Atwood, Robin Fourcade, Yann Gaston-Mathé, Nicolas Do Huu, Quentin Perron, Integrating synthetic accessibility with AI-based generative drug design, Journal of Cheminformatics, Volume 15, Article 83, 2023.

  13. Simon Johansson, Amol Thakkar, Thierry Kogej, Esben Bjerrum, Samuel Genheden, Tomas Bastys, Christos Kannas, Alexander Schliep, Hongming Chen, Ola Engkvist, AI-assisted synthesis prediction, Drug Discovery Today: Technologies, Volumes 32–33, 2019.