Results and Discussion

Application note

QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression

An initial model was built using the full set of 66 ErbB1 compounds as the training set. This set was obtained by excluding ErbB1 variants and mutants and retaining only a subset of bioassays. Among the resulting QSAR equations, the best-performing one was:

pIC50 = 11.195 + 13.613 × GCUT_SLOGP_1 – 0.015 × PEOE_VSA+0 – 0.177 × PEOE_VSA-3 + 0.037 × SMR_VSA4

N = 66; n = 4; R² = 0.84; RMSE = 0.47


N is the number of data points and n is the number of molecular descriptors (initial length). Despite the chemical diversity of the training set, four descriptors were sufficient to explain the biological activity, giving a satisfactory goodness of fit (R² = 0.84; Figure 5).
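As a minimal sketch of how the equation above is applied, the following Python snippet evaluates model 1 for one molecule and defines the two fit statistics quoted with it. The coefficients are those reported in the text; the descriptor values in `example` are hypothetical and not taken from the dataset, and the RMSE here divides by the number of data points, which is an assumption about the convention used.

```python
import math

# Coefficients of model 1 as reported in the text.
INTERCEPT = 11.195
COEFFS = {
    "GCUT_SLOGP_1": 13.613,
    "PEOE_VSA+0": -0.015,
    "PEOE_VSA-3": -0.177,
    "SMR_VSA4": 0.037,
}

def predict_pic50(descriptors):
    """Apply the linear QSAR equation to one molecule's descriptor values."""
    return INTERCEPT + sum(COEFFS[name] * descriptors[name] for name in COEFFS)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rmse(observed, predicted):
    """Root-mean-square error, dividing by the number of data points (assumed convention)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical descriptor values, for illustration only (not from the dataset).
example = {"GCUT_SLOGP_1": -0.30, "PEOE_VSA+0": 60.0, "PEOE_VSA-3": 5.0, "SMR_VSA4": 20.0}
print(round(predict_pic50(example), 2))
```

Given the observed and predicted pIC50 values for all 66 molecules, `r_squared` and `rmse` would yield statistics of the kind reported above.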

Figure 5. Predicted pIC50 vs observed pIC50 in model 1 (left) and model 2 (right)

The selected descriptors show very little inter-correlation, as shown by their correlation matrix (Figure 6).

Figure 6. Correlation matrix of the four molecular descriptors selected by the GA
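A descriptor correlation matrix of this kind can be sketched as pairwise Pearson coefficients between descriptor columns. The snippet below is an illustration only: the descriptor values are hypothetical placeholders, and the choice of Pearson correlation is an assumption, since the text does not state which coefficient was used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of descriptors a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical descriptor values for six molecules (placeholders, not real data).
descriptors = {
    "GCUT_SLOGP_1": [-0.31, -0.28, -0.35, -0.30, -0.26, -0.33],
    "PEOE_VSA+0": [55.0, 72.0, 48.0, 90.0, 63.0, 70.0],
    "PEOE_VSA-3": [4.0, 9.0, 6.0, 3.0, 8.0, 5.0],
    "SMR_VSA4": [18.0, 25.0, 30.0, 12.0, 22.0, 15.0],
}

matrix = correlation_matrix(descriptors)

# Largest absolute off-diagonal value: a quick check for redundant descriptors.
largest = max(abs(matrix[a][b]) for a in matrix for b in matrix if a != b)
print(f"largest |r| between two different descriptors: {largest:.2f}")
```

A small largest off-diagonal value indicates that each descriptor carries largely independent information, which is the property the figure is meant to show.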

Leave-one-out (LOO) cross-validation gave a cross-validated Q² of 0.81, supporting the internal validity of the model. The selected descriptors indicate a strong contribution of hydrophobic interactions, expressed by the GCUT_SLOGP_1 descriptor. As mentioned previously, the full set of 66 molecules was used in this initial model to include as much structural information as possible. However, it is worthwhile to validate the approach using an external dataset. The 66 ErbB1 inhibitors were therefore divided into two sets: a training set of 50 molecules for modeling and a test set of 16 molecules for prediction. The best-performing equation was:


pIC50 = 11.1061 + 13.4414 × GCUT_SLOGP_1 – 0.0124712 × PEOE_VSA+0 – 0.222155 × PEOE_VSA-3 + 0.0352586 × SMR_VSA4

N = 66; n = 4; R² = 0.86; RMSE = 0.46; RMSE_Test = 0.49
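The two validation steps described here, leave-one-out cross-validation and prediction of a held-out test set, can be sketched as follows. This is an illustration under stated assumptions: it uses NumPy, ordinary least squares on a fixed set of four descriptors (the GA selection step is not repeated inside each fold), synthetic stand-in data rather than the ErbB1 dataset, and a simple index-based 50/16 split rather than the diversity-based split used in the study.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coefficients...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Predict activities from a fitted coefficient vector."""
    return coef[0] + X @ coef[1:]

def loo_q2(X, y):
    """Q2 = 1 - PRESS / SS_tot, refitting the model with each molecule left out once."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_mlr(X[keep], y[keep])
        press += (y[i] - predict_mlr(coef, X[i:i + 1])[0]) ** 2
    return float(1.0 - press / np.sum((y - y.mean()) ** 2))

def rmse(y_obs, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return float(np.sqrt(np.mean((y_obs - y_pred) ** 2)))

# Synthetic stand-in data (NOT the ErbB1 dataset): 66 molecules, 4 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(66, 4))
y = 7.0 + X @ np.array([1.0, -0.5, -0.8, 0.3]) + rng.normal(scale=0.4, size=66)

q2 = loo_q2(X, y)

# External validation: fit on 50 molecules, predict the remaining 16.
train, test = np.arange(50), np.arange(50, 66)
coef = fit_mlr(X[train], y[train])
rmse_test = rmse(y[test], predict_mlr(coef, X[test]))
print(f"Q2(LOO) = {q2:.2f}, RMSE_test = {rmse_test:.2f}")
```

For a least-squares model, Q² from LOO can never exceed the fitted R², so a small gap between the two (as with 0.84 and 0.81 for model 1) suggests the fit is not driven by a few influential molecules.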


This equation uses the same four molecular descriptors as model 1, with coefficients close to those of model 1. This may be because the training set comprised the 50 most dissimilar molecules, which cover the chemical diversity of the entire dataset.
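The text does not say how the most dissimilar molecules were chosen. One common way to pick a diverse subset is MaxMin selection, sketched below as an assumption rather than as the procedure actually used. Each molecule is represented by a hypothetical set of fingerprint bits, and dissimilarity is taken as the Tanimoto distance between those sets.

```python
def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of fingerprint bits."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)

def maxmin_select(fingerprints, k, seed_index=0):
    """Greedy MaxMin: repeatedly add the molecule farthest from those already picked."""
    selected = [seed_index]
    # Distance from every molecule to its nearest selected molecule.
    min_dist = [tanimoto_distance(fp, fingerprints[seed_index]) for fp in fingerprints]
    while len(selected) < min(k, len(fingerprints)):
        remaining = (i for i in range(len(fingerprints)) if i not in selected)
        candidate = max(remaining, key=lambda i: min_dist[i])
        selected.append(candidate)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[candidate]))
    return selected

# Hypothetical fingerprints for five molecules (placeholders, not real data).
fingerprints = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}, {7, 8}]

training = maxmin_select(fingerprints, k=3)
test = [i for i in range(len(fingerprints)) if i not in training]
print("training:", training, "test:", test)
```

With the 66-compound dataset, `k=50` would give the training set and the 16 unselected molecules would form the test set. Because the test molecules are by construction close to some training molecule, a split of this kind tends to produce a model similar to one fitted on the full set.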

Szántai-Kis et al. reported a 2D QSAR model using automatic variable subset selection by genetic algorithm [3: Szántai-Kis, C., Kövesdi, I., Eros, D., Bánhegyi, P., Ullrich, A. (2006) Prediction oriented QSAR modelling of EGFR inhibition. Current Medicinal Chemistry 13: 277–287]. However, these authors considered all ligands tested on EGFR kinase without further specifying the target (ErbB1, ErbB2, ErbB3 or ErbB4), target type, or species. Such filters typically help to extract homogeneous datasets and minimize errors arising from the biological data. Many other EGFR QSAR studies have been reported, but all have used 3D approaches with structurally close chemical cores. Because the dataset is limited to 66 ErbB1 compounds, the local or global character of the present model cannot be determined. Nevertheless, the selected ErbB1 dataset shows reasonable chemical diversity (Figure 7).

Figure 7. ErbB1 dataset projected onto the first three principal components computed from the 2D molecular descriptors used
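A projection of this kind can be sketched with a principal component analysis of the descriptor matrix. The snippet below is a minimal illustration, assuming NumPy, autoscaled descriptors (an assumption, as the scaling used is not stated) and a random stand-in matrix in place of the real 2D descriptors.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Autoscale the descriptor matrix and project it onto its leading principal components."""
    centered = X - X.mean(axis=0)
    std = centered.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant descriptors unscaled instead of dividing by zero
    Z = centered / std
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

# Random stand-in for a 66 x 10 descriptor matrix (NOT the real ErbB1 descriptors).
rng = np.random.default_rng(0)
X = rng.normal(size=(66, 10))

scores, explained = pca_project(X)
print(scores.shape, np.round(explained, 3))
```

Each row of `scores` gives one molecule's coordinates in the three-dimensional plot, and `explained` reports the fraction of descriptor variance captured by each axis, which indicates how faithfully the plot reflects the spread of the dataset.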