Results and Discussion
QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression
An initial model was built using the full set of 66 ErbB1 compounds as the training set. This set was obtained by removing ErbB1 variants and mutants and by retaining only a subset of the bioassays. Among the resulting QSAR equations, the best-performing one was:
pIC50 = 11.195 + 13.613 × GCUT_SlogP_1 – 0.015 × PEOE_VSA+0 – 0.177 × PEOE_VSA-3 + 0.037 × SMR_VSA4
N = 66; n = 4; R2 = 0.84; RMSE = 0.47
N is the number of data points and n is the number of molecular descriptors (initial length). Despite the chemical diversity of the training set, four descriptors were sufficient to explain the biological activity, giving a satisfactory goodness of fit (R2 = 0.84; Figure 5).
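As an illustration of how the equation above is applied, the following minimal sketch evaluates the first model for one compound and defines the two fit statistics quoted with it. The coefficients are copied from the equation; the descriptor values in the example are invented for illustration and do not correspond to any compound in the dataset. The function names are hypothetical, and RMSE is assumed here to be computed with a divisor of N, which the text does not specify.

```python
import math

# Intercept and coefficients copied from the first model equation.
INTERCEPT = 11.195
COEFFICIENTS = {
    "GCUT_SlogP_1": 13.613,
    "PEOE_VSA+0": -0.015,
    "PEOE_VSA-3": -0.177,
    "SMR_VSA4": 0.037,
}


def predict_pic50(descriptors):
    """Return the predicted pIC50 for one compound.

    `descriptors` maps each descriptor name to its numeric value.
    """
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )


def rmse(observed, predicted):
    """Root-mean-square error, assuming a divisor of N (not stated in the text)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical descriptor values, for illustration only.
example = {
    "GCUT_SlogP_1": -0.3,
    "PEOE_VSA+0": 100.0,
    "PEOE_VSA-3": 10.0,
    "SMR_VSA4": 20.0,
}
print(round(predict_pic50(example), 4))  # 4.5811
```

Because GCUT_SlogP_1 carries by far the largest coefficient, small changes in that descriptor shift the predicted pIC50 much more than comparable changes in the surface-area terms, although the descriptors are on different scales and the coefficients are therefore not directly comparable.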
The selected descriptors show very low inter-correlation, as shown by their correlation matrix (Figure 6).
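A descriptor correlation matrix of this kind can be computed as sketched below. This is a minimal illustration assuming Pearson correlation coefficients, which the text does not state explicitly; the column values are placeholders and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping descriptor name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Placeholder values, not taken from the dataset.
demo_columns = {
    "desc_a": [1.0, 2.0, 3.0, 4.0],
    "desc_b": [2.0, 4.0, 6.0, 8.0],
    "desc_c": [4.0, 3.0, 2.0, 1.0],
}
demo_matrix = correlation_matrix(demo_columns)
```

Off-diagonal entries close to zero indicate that each descriptor contributes largely independent information to the regression.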
The leave-one-out (LOO) cross-validation gave a cross-validated Q2 of 0.81, supporting the internal validity of the model. Examination of the selected descriptors indicates a strong contribution of hydrophobic interactions, expressed by the GCUT_SlogP_1 descriptor. As mentioned previously, the full set of 66 molecules was used in this initial model to include as much structural information as possible. However, it is worthwhile to validate the approach using an external dataset. The set of 66 ErbB1 inhibitors was therefore divided into two sets: a training set of 50 molecules for modeling and a test set of 16 molecules for prediction. The best-performing equation was:
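For reference, leave-one-out cross-validation of a multilinear model can be sketched as follows. This is a minimal illustration rather than the procedure actually used in this work. It assumes ordinary least squares refitted on each reduced set, and Q2 defined as 1 - PRESS / SS_tot with SS_tot taken about the mean of all observed values, which is one common convention. All function names and data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; descriptors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(X, y):
    """Least-squares fit of y = b0 + sum(b_j * x_j); returns [b0, b1, ...]."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coefs, x):
    """Apply fitted coefficients to one descriptor vector."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


def loo_q2(X, y):
    """Leave-one-out Q2: refit without compound i, then predict compound i."""
    press = 0.0
    for i in range(len(y)):
        coefs = fit_ols(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(coefs, X[i])) ** 2
    mean_y = sum(y) / len(y)
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - press / ss_tot


# Toy single-descriptor data, for illustration only.
X_demo = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y_exact = [2.0, 5.0, 8.0, 11.0, 14.0]   # exactly y = 2 + 3x
y_noisy = [2.0, 5.5, 7.5, 11.2, 14.0]   # same trend with small deviations
```

A Q2 close to R2, as reported here (0.81 versus 0.84), suggests that no single compound dominates the fit.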
pIC50 = 11.1061 + 13.4414 × GCUT_SlogP_1 – 0.0124712 × PEOE_VSA+0 – 0.222155 × PEOE_VSA-3 + 0.0352586 × SMR_VSA4
N = 66; n = 4; R2 = 0.86; RMSE = 0.46; RMSE_Test = 0.49
The equation retained the same set of molecular descriptors, with coefficients close to those obtained in the initial model. This may be because the training set comprised the 50 most dissimilar molecules, which cover the chemical diversity of the entire dataset.
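The text does not state how the most dissimilar molecules were chosen. One common way to obtain such a subset is a MaxMin diversity selection, sketched below as an assumption and not as the method used here. Fingerprints are represented as Python sets of "on" bit indices, dissimilarity is taken as the Tanimoto distance, and the function names, seed choice and toy fingerprints are all hypothetical.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def maxmin_select(fingerprints, k, seed_index=0):
    """Greedy MaxMin selection of k mutually dissimilar items.

    Starting from `seed_index`, repeatedly add the item whose distance to
    its nearest already-selected item is largest. Returns selected indices
    in the order they were picked.
    """
    selected = [seed_index]
    # Distance from every item to its nearest selected item so far.
    min_dist = [tanimoto_distance(fp, fingerprints[seed_index]) for fp in fingerprints]
    target = min(k, len(fingerprints))
    while len(selected) < target:
        remaining = [i for i in range(len(fingerprints)) if i not in selected]
        candidate = max(remaining, key=lambda i: min_dist[i])
        selected.append(candidate)
        for i, fp in enumerate(fingerprints):
            min_dist[i] = min(min_dist[i], tanimoto_distance(fp, fingerprints[candidate]))
    return selected


# Toy fingerprints, for illustration only.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]
print(maxmin_select(toy_fps, 3))  # [0, 2, 1]
```

Under a scheme of this kind, the compounds left out of the training set tend to lie close to at least one selected compound, which would be consistent with the similar training and test RMSE values reported above.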
Szántai-Kis et al. reported a 2D QSAR model using automatic variable subset selection by genetic algorithm (3). However, these authors considered all the ligands tested on EGFR kinase without any further specification regarding the target (ErbB1, ErbB2, ErbB3 or ErbB4), target type or species. Such additional filters typically help to extract homogeneous datasets and to minimize errors that may arise from the biological data. Many other EGFR QSAR studies have been reported, but all have used 3D approaches with structurally close chemical cores. Regarding the chemical diversity covered by this analysis, the local or global character of the present model cannot be determined because the dataset is limited to 66 ErbB1 compounds. Nevertheless, the selected ErbB1 dataset shows reasonable chemical diversity (Figure 7).
3. Szántai-Kis, C., Kövesdi, I., Eros, D., Bánhegyi, P., Ullrich, A. (2006) Prediction oriented QSAR modelling of EGFR inhibition. Current Medicinal Chemistry 13: 277–287.