QSAR modeling of ErbB1 inhibitors using genetic algorithm-based regression

Gaining efficiency in quantitative structure–activity relationships

ErbB1 kinase is the cell-surface receptor for epidermal growth factor. Its dysfunction has been implicated in diseases such as cancer.

QSAR relates molecular descriptors to biological activity for known compounds

Reaxys Medicinal Chemistry can provide computational chemists with large and diverse datasets for building predictive models. These models can be used for virtual screening or during lead optimization. In brief, QSAR (quantitative structure–activity relationship) modeling is based on a mathematical equation that relates molecular descriptors to biological activity for a known series of compounds, creating a model for evaluating the activity of new chemical entities (1, 2). This paper illustrates the use of the Reaxys Medicinal Chemistry knowledgebase to build efficient QSAR models, using ErbB1 kinase as an example.

Dataset selection

The first step in building a suitable dataset for QSAR modeling is to collect all pertinent chemical and biological information relating to the ErbB1 target. Searching by target name followed by selecting the specific isoform enables easy retrieval of all the relevant chemical and biological data for ErbB1 (Figure 1). Similar searches were conducted for the other ErbB kinases to illustrate the extent of chemical and biological data available in Reaxys Medicinal Chemistry (Table 1).

To facilitate comparisons of bioactivity data from different publications and assay types, all in vitro data points in Reaxys Medicinal Chemistry have pX values. pX values are calculated by transforming parameters such as EC50, IC50 and Ki into their negative base-10 logarithm (pEC50, pIC50, pKi), with the concentration expressed in mol/L. These normalized values make it easy to quantify compound–target affinity and to compare data from different sources.
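The pX transformation can be sketched in a few lines of Python. This is a minimal illustration, not the Reaxys implementation; the function names and the unit table are assumptions introduced here. It reproduces the thresholds quoted in this note: 100 nM corresponds to pX = 7, and 0.33 nM to roughly pX = 9.5.

```python
import math

# Conversion factors to mol/L (assumed unit labels, for illustration only).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

def to_px(value, unit="nM"):
    """Return pX = -log10(value in mol/L) for an EC50, IC50 or Ki value."""
    if value <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])

def from_px(px, unit="nM"):
    """Inverse transformation: concentration in the requested unit."""
    return 10 ** (-px) / UNIT_TO_MOLAR[unit]

print(round(to_px(100, "nM"), 2))   # pX for IC50 = 100 nM
print(round(to_px(0.33, "nM"), 2))  # pX for IC50 = 0.33 nM
```

Because pX is a logarithm, a one-unit increase corresponds to a tenfold increase in potency, which is why a single cutoff such as pX > 7 can be applied across EC50, IC50 and Ki data.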

Figure 1. EGFR (ErbB) kinase query (Note: VEGFR not included in this search)
Table 1. Statistics on the EGFR (ErbB) kinase data available in Reaxys Medicinal Chemistry. All columns are counts.
Target     Bioactivities  Substances  Citations  Bioactivities, pX > 7  Substances, pX > 7
EGFR all*        201,803      80,905      3,147                 23,049              10,099
ErbB1              4,255       3,776        271                    749                 659
ErbB2             57,687      34,251      1,190                  4,494               3,154
ErbB3              2,295       2,201        103                     42                  20
ErbB4             15,295      11,252        485                    344                 272
*VEGFR not included, but ErbB variants, isoforms and mutants are included.

An activity profile for the most potent inhibitors of all EGFR (ErbB) receptors with a pX value of greater than 7.0 (affinity <100 nM) can be generated and viewed as a Heatmap (Figure 2). The Heatmap visualizes the relationships between compounds and their targets in terms of key parameters, allowing rapid identification of relevant compound–target interactions. The highest pX values were selected for display in the Heatmap.

Figure 2. Heatmap for inhibitors of all EGFR kinases with pX activity above 7.0 (affinity <100 nM)

In the Heatmap, biological affinities or activities are quantified as a pX value and displayed on a color scale from 1 (low activity) in blue to 15 (high activity) in red. The color of each Heatmap cell represents the maximal pX retrieved for a given compound (row) against a given target (column). The thumbnail provides an overview of the entire Heatmap, with a panel highlighting the section of the map currently displayed on screen. The dataset can be analyzed using the data density display, which shows the number of compounds retrieved per target.

At the time of the query, a total of 3,776 compounds were recorded in Reaxys Medicinal Chemistry as inhibitors of the ErbB1 (EGFR-1) kinase, with 4,255 biological activities registered from 271 citations. When the query was limited to humans, Reaxys Medicinal Chemistry retrieved 880 compounds with 1,227 associated bioactivities (Figure 3).

Figure 3. A. Heatmap for inhibitors of ErbB1 kinase. The dataset can be filtered by target species as highlighted. B. The most potent ErbB1 inhibitors with pX activity above 7.0 (affinity < 100 nM) were selected and the Heatmap was sorted by activity against ErbB1 with the compound highlighted having a pX = 9.5 (IC50 = 0.33 nM).

Filtering the dataset further by parameter and keeping only records with IC50 values retrieved 746 compounds with 982 associated bioactivities. Finally, the most potent ErbB1 inhibitors were selected (those with a pX > 7; i.e., IC50 < 100 nM). Using the Substances tab, detailed information can be obtained for each compound (Figure 4).

Figure 4. Selection of ErbB1 kinase inhibitors as shown in the Substances tab in Reaxys Medicinal Chemistry
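For readers who export the data and want to reproduce the filtering cascade outside the interface, the steps described above (target, species, parameter, potency cutoff, then the highest pX per compound) might look like the following sketch. The record layout, field names and compound identifiers are hypothetical and do not reflect the actual Reaxys export format.

```python
# Hypothetical exported records: one dict per bioactivity data point.
records = [
    {"compound": "cpd-1", "target": "ErbB1", "species": "human", "parameter": "IC50", "pX": 9.5},
    {"compound": "cpd-1", "target": "ErbB1", "species": "human", "parameter": "IC50", "pX": 8.1},
    {"compound": "cpd-2", "target": "ErbB1", "species": "human", "parameter": "Ki", "pX": 8.0},
    {"compound": "cpd-3", "target": "ErbB1", "species": "mouse", "parameter": "IC50", "pX": 7.9},
    {"compound": "cpd-4", "target": "ErbB1", "species": "human", "parameter": "IC50", "pX": 6.2},
    {"compound": "cpd-5", "target": "ErbB2", "species": "human", "parameter": "IC50", "pX": 8.8},
    {"compound": "cpd-6", "target": "ErbB1", "species": "human", "parameter": "IC50", "pX": 7.4},
]

def filter_bioactivities(rows, target="ErbB1", species="human",
                         parameter="IC50", min_px=7.0):
    """Keep rows matching target, species and parameter with pX above the cutoff."""
    return [
        r for r in rows
        if r["target"] == target
        and r["species"] == species
        and r["parameter"] == parameter
        and r["pX"] > min_px
    ]

def best_px_per_compound(rows):
    """Collapse repeated measurements to the highest pX per compound."""
    best = {}
    for r in rows:
        best[r["compound"]] = max(r["pX"], best.get(r["compound"], float("-inf")))
    return best

potent = filter_bioactivities(records)
print(best_px_per_compound(potent))
```

Taking the maximum pX per compound mirrors the Heatmap convention described earlier; a median or mean over replicate measurements would be an equally defensible choice for modeling.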


Prior to building models, 56 2D molecular descriptors were computed for a set of 66 ErbB1 compounds. They included 32 P_VSA-based descriptors, 12 BCUT descriptors and 12 GCUT descriptors (3). These descriptors encode three main physicochemical properties: hydrophobicity (SlogP_VSA, BCUT_SlogP, GCUT_SlogP), polarizability (SMR_VSA, BCUT_SMR, GCUT_SMR) and electrostatic interactions (PEOE_VSA, BCUT_PEOE, GCUT_PEOE).

The regression models presented here were built with QuaSAR-Evolution, a genetic algorithm-based tool implemented in the MOE cheminformatics suite. QuaSAR-Evolution applies the genetic algorithm to the problem of descriptor selection in QSAR: descriptors selected at random are combined, and a population of regression models is generated. The default settings were used for descriptor selection and the initial length (the number of descriptors) was set to 4.
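The general idea of genetic-algorithm descriptor selection can be illustrated with a toy example. The sketch below is not QuaSAR-Evolution and makes no claim about its internals: the operators (union crossover, single-gene mutation, survival of the best half), the fitness function (training RMSE of an ordinary least-squares fit) and the synthetic data are all assumptions chosen for brevity.

```python
import random

def fit_ols(rows, y):
    """Least-squares coefficients [intercept, b1, ...] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def rmse_for(subset, X, y):
    """Training RMSE of an OLS model built on the chosen descriptor columns."""
    rows = [[x[j] for j in subset] for x in X]
    coef = fit_ols(rows, y)
    preds = [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in rows]
    return (sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)) ** 0.5

def select_descriptors(X, y, k=4, pop_size=20, generations=30, seed=0):
    """Toy GA: each chromosome is a sorted tuple of k distinct descriptor indices."""
    rng = random.Random(seed)
    m = len(X[0])

    def crossover(a, b):
        pool = sorted(set(a) | set(b))  # the union always holds at least k genes
        return tuple(sorted(rng.sample(pool, k)))

    def mutate(s):
        unused = [j for j in range(m) if j not in s]
        if not unused:
            return s
        genes = list(s)
        genes[rng.randrange(k)] = rng.choice(unused)
        return tuple(sorted(genes))

    population = [tuple(sorted(rng.sample(range(m), k))) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda s: rmse_for(s, X, y))
        parents = population[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            child = crossover(rng.choice(parents), rng.choice(parents))
            if rng.random() < 0.3:
                child = mutate(child)
            children.append(child)
        population = parents + children
    return min(population, key=lambda s: rmse_for(s, X, y))

# Synthetic data: 40 "compounds", 8 descriptors, activity depends on 4 of them.
_rng = random.Random(42)
X_demo = [[_rng.random() for _ in range(8)] for _ in range(40)]
y_demo = [1.0 + 2.0 * x[1] - 3.0 * x[3] + 0.5 * x[4] + 1.5 * x[6] for x in X_demo]
best = select_descriptors(X_demo, y_demo, k=4)
print(best, rmse_for(best, X_demo, y_demo))
```

Fixing the chromosome length at k plays the role of the "initial length" setting mentioned above. A production tool would typically penalize model size and use a cross-validated score rather than training RMSE as the fitness.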

Results and discussion

An initial model was built using the full set of 66 ErbB1 compounds as the training set. This training set was obtained by removing ErbB1 variants and mutants and keeping a subset of bioassays. Among the resulting QSAR equations, the best-performing one was:

pIC50 = 11.195 + 13.613 × GCUT_SlogP_1 – 0.015 × PEOE_VSA+0 – 0.177 × PEOE_VSA-3 + 0.037 × SMR_VSA4

N = 66; n = 4; R2 = 0.84; RMSE = 0.47.

N is the number of data points; n is the number of molecular descriptors (initial length). Despite the chemical diversity of the training set, four descriptors were sufficient to explain the biological activity, giving a satisfactory goodness of fit (R2 = 0.84; Figure 5).
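To make the equation concrete, the sketch below applies the model 1 coefficients to a set of descriptor values and defines R2 and RMSE in their usual form. The coefficients are copied from the equation above; the descriptor values in the example are hypothetical, and the RMSE denominator (N rather than N minus the number of parameters) is an assumption, since the note does not state which convention MOE uses.

```python
import math

# Coefficients copied from the model 1 equation in the text.
MODEL_1 = {
    "intercept": 11.195,
    "GCUT_SlogP_1": 13.613,
    "PEOE_VSA+0": -0.015,
    "PEOE_VSA-3": -0.177,
    "SMR_VSA4": 0.037,
}

def predict_pic50(descriptors, model=MODEL_1):
    """Evaluate the linear QSAR equation for one compound."""
    return model["intercept"] + sum(
        coef * descriptors[name]
        for name, coef in model.items()
        if name != "intercept"
    )

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root-mean-square error over N data points (assumed convention)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical descriptor values for a single compound (illustration only).
example = {"GCUT_SlogP_1": -0.3, "PEOE_VSA+0": 50.0, "PEOE_VSA-3": 5.0, "SMR_VSA4": 10.0}
print(round(predict_pic50(example), 4))
```

Because GCUT_SlogP_1 carries by far the largest coefficient, small changes in that descriptor shift the predicted pIC50 more than comparable changes in the surface-area terms, although a fair comparison of contributions would also need each descriptor's range in the dataset.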

Figure 5. Predicted pIC50 vs observed pIC50 in model 1 (left) and model 2 (right)

The selected descriptors show very little inter-correlation, as can be seen in their correlation matrix (Figure 6).

Figure 6. Correlation matrix of the four molecular descriptors selected by the GA
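A descriptor inter-correlation check of this kind can be reproduced with a pairwise Pearson correlation. The note does not say which correlation coefficient underlies Figure 6, so Pearson's r is an assumption here, and the descriptor values below are made up purely to exercise the code.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations, keyed by descriptor name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up values for five compounds (illustration only).
demo = {
    "GCUT_SlogP_1": [-0.31, -0.28, -0.35, -0.30, -0.27],
    "PEOE_VSA+0": [48.0, 61.5, 39.2, 55.0, 44.1],
    "SMR_VSA4": [10.2, 3.1, 7.7, 0.0, 12.4],
}
matrix = correlation_matrix(demo)
for name, row in matrix.items():
    print(name, [round(v, 2) for v in row.values()])
```

Low off-diagonal values matter for a regression model because strongly correlated descriptors make individual coefficients unstable and hard to interpret.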

Leave-one-out (LOO) cross-validation gave a cross-validated Q2 of 0.81, supporting the validity of the model. Examination of the selected descriptors indicates a strong contribution of hydrophobic interactions, expressed by the GCUT_SlogP_1 descriptor. As mentioned previously, the full set of 66 molecules was used in this initial model to include as much structural information as possible. However, it is worthwhile to validate the approach using an external dataset. The set of 66 ErbB1 inhibitors was therefore divided into two sets: a training set of 50 molecules for modeling and a test set of 16 molecules for prediction. The best-performing equation was:

pIC50 = 11.1061 + 13.4414 × GCUT_SlogP_1 – 0.0124712 × PEOE_VSA+0 – 0.222155 × PEOE_VSA-3 + 0.0352586 × SMR_VSA4

N = 66; n = 4; R2 = 0.86; RMSE = 0.46; RMSE_Test = 0.49.
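For reference, the validation statistics quoted for the two models are commonly defined as follows. These are the standard textbook forms; the note does not state the exact formulas used by MOE, so the choice of denominators is an assumption.

```latex
Q^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_{(i)} \right)^2}
               {\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2},
\qquad
\mathrm{RMSE}_{\mathrm{Test}} =
  \sqrt{\frac{1}{N_{\mathrm{test}}} \sum_{j=1}^{N_{\mathrm{test}}} \left( y_j - \hat{y}_j \right)^2}
```

Here y_i is the observed pIC50 of compound i, ȳ is the mean observed pIC50, ŷ_(i) is the prediction for compound i from a model refitted without that compound, and ŷ_j is the prediction for test-set compound j from the model fitted on the training set only. A Q2 close to R2 (0.81 versus 0.84 for model 1) and a test RMSE close to the training RMSE (0.49 versus 0.46 for model 2) both suggest limited overfitting.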

The equation uses the same set of molecular descriptors as model 1, with coefficients close to those obtained there. This may be because the training set comprised the 50 most dissimilar molecules, covering the chemical diversity of the entire dataset.
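The note does not say how the 50 most dissimilar molecules were chosen. One common way to build a diversity-based training set is greedy MaxMin selection, sketched below under the assumptions that compounds are represented as descriptor vectors and that dissimilarity is Euclidean distance; both are illustrative choices, not the authors' stated procedure.

```python
import math

def maxmin_split(vectors, n_train, start=0):
    """Greedy MaxMin: repeatedly pick the item farthest from those already picked.

    Returns (train_indices, test_indices).
    """
    picked = [start]
    # Distance from every item to its nearest already-picked item.
    min_dist = [math.dist(v, vectors[start]) for v in vectors]
    while len(picked) < n_train:
        candidates = [i for i in range(len(vectors)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        min_dist = [min(d, math.dist(v, vectors[nxt]))
                    for d, v in zip(min_dist, vectors)]
    test = [i for i in range(len(vectors)) if i not in picked]
    return picked, test

# Five made-up 2D "descriptor vectors"; pick the 3 most spread-out as training set.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (10.0, 0.0), (5.0, 4.9)]
train_idx, test_idx = maxmin_split(points, n_train=3)
print(train_idx, test_idx)
```

With this kind of split, each test compound tends to have a close neighbor in the training set, which is consistent with the similar training and test errors reported above but also means the test set is not a demanding check of extrapolation.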

Szántai-Kis et al. reported a 2D QSAR model using automatic Variable Subset Selection by Genetic Algorithm (3). However, these authors considered all the ligands tested on EGFR kinase without any further specification of the target (ErbB1, ErbB2, ErbB3 or ErbB4), target type or species. Such additional filters typically help to extract homogeneous datasets and minimize errors that may arise from biological data. Many other EGFR QSAR studies have been reported, but all have used 3D approaches with structurally close chemical cores. As for the chemical diversity covered by this analysis, the local or global character of the present model cannot be determined because the dataset is limited to 66 ErbB1 compounds. Nevertheless, the selected ErbB1 dataset shows reasonable chemical diversity (Figure 7).

Figure 7. ErbB1 dataset projected onto the first three principal components computed from the 2D molecular descriptors used

Why use Reaxys Medicinal Chemistry for QSAR modeling?

The database of this research solution is organized around compounds, targets and biological activities. Each element is described and organized into logical hierarchies according to experimental protocols appearing in the literature. Taken as a whole, the information compiled by Reaxys Medicinal Chemistry creates a global pharmacology space encompassing more than 447,000 literature sources, 12,700 targets and over 29 million biological activities.

The user interface enables scientists to navigate this pharmacology space and conduct a variety of searches, including substructure, chemical similarity and target-specific searches to explore the bioactivity profile for targets, cell lines or drugs/compounds. Retrieved data is displayed in the interactive results screen, where the user can filter data to exclude or limit certain data sets, focus on compounds or targets or view a complete list of citations.

Reaxys Medicinal Chemistry displays compound–target data as an interactive Heatmap so that researchers can rapidly visualize, navigate and filter results based on various parameters such as activity, species, bioassay protocols, publication type and standard target classification hierarchies.

Importantly for QSAR and other chemoinformatics modeling, Reaxys Medicinal Chemistry provides multiple options to use and integrate the content into existing tools and in-house systems. Data can easily be exported for use in a number of popular modeling packages such as MOE™ (Molecular Operating Environment, Chemical Computing Group), Schrödinger® Small-Molecule Drug Discovery Suite, Biovia® (Dassault Systèmes) and ChemAxon products or integrated into in-house software. The extensive database can be used to conduct pharmacophoric similarity searches, chemical space analyses, structural analog searching, virtual screening and quantitative structure–activity or structure–property relationship (QSAR/QSPR) modeling.

Essential drug discovery solution

Reaxys Medicinal Chemistry is an extensive database containing chemical information linked to in vitro and in vivo biological activities extracted from over 300,000 articles, 90,000 patents and 5,000 journals. More than 6 million chemical compounds are associated with their biological data (> 29 million bioactivity data points) and linked to information on 12,700 pharmacological targets, allowing scientists to reveal connections between compounds, effects and targets. The data is indexed and normalized for maximum searchability and consistency.

Note: This application note is for illustrative purposes only: the information in this report should not be referenced or relied upon as a basis for further research and development.


  1. Engel, T. (2006) Basic Overview of Cheminformatics. J. Chem. Inf. Model. 46: 2267–2277.
  2. Polanski, J., Bak, A., Gielciak, R. and Magdiarz, T. (2006) Modeling robust QSAR. J. Chem. Inf. Model. 46: 2310–2318.
  3. Szántai-Kis, C., Kövesdi, I., Eros, D., Bánhegyi, P., Ullrich, A. (2006) Prediction oriented QSAR modelling of EGFR inhibition. Current Medicinal Chemistry 13: 277–287.