Shape Matters: Improving Docking Results by Prior Analysis of Geometric Attributes of Binding Sites
- 1. Department of Biotechnology, Roxbury Community College, USA
- 2. Department of Chemistry, University of Massachusetts, USA
- 3. Department of Chemistry, Burnham Institute for Medical Research, USA
Abstract
A practical improvement in the selection of docking programs is reported. Matching the shape attributes of the ligand binding pocket to an appropriate program in the early stages of automated docking was found to increase the success of the procedure, which potentially constitutes an important improvement in the structure-based drug design process. A two-stage approach, (1) computing attributes of the binding site and (2) running the appropriate docking algorithm, was used to screen approximately one hundred structures from the Protein Data Bank (PDB). The attributes of the binding pockets used in this study were the ratio of the volume of the solvent accessible surface to the volume of the molecular surface (Vsa/ms) and the ratio of the area of the solvent accessible surface to the area of the molecular surface (Asa/ms). This study does not consider charges, H-bonding, or hydrophobic interactions; nevertheless, the shape attributes alone are useful in choosing the best available docking program. The results of the initial screening, within the bounds of optimally selected parameters, indicated that it is possible to select an algorithm that performs better than the others. For high numerical values of both ratios, all the docking programs produced poor results; for medium and medium-high values of those ratios, AutoDock and DOCK were the best choice. With small values of the ratios, however, all four programs (GOLD, Surflex, DOCK and AutoDock) agreed to within 10% in the RMSD of the docked versus the crystallographic ligand.
INTRODUCTION
Structure-based virtual screening has been thoroughly tested, with substantial success. The automated docking procedure is a vital part of virtual screening and is now a widely used technique in the early stages of drug discovery in most academic and commercial (Big Pharma) environments. Improvements in computer processing speeds and multiprocessing methods, as well as distributed computational methods using multiple workstations, permit en masse screening with storage of large quantities of data. This enables investigators to screen large numbers of compounds available in online libraries [1-3] by docking them into the binding pocket of the target enzyme. Concurrently, the emergence of ‘high-throughput’ automated X-ray crystallographic screening techniques has dramatically increased the rate at which researchers can progress from the over-expression of a target protein to an inhibitor-protein complex. The combination of high-throughput X-ray crystallography and molecular docking techniques allows investigators to work efficiently to design potent inhibitors of medically relevant enzyme targets. In automated docking procedures, success depends critically on the accuracy and precision of the in silico molecular docking. This, in turn, depends on the choice of software, the protein, how the binding site and search space are defined, and the ligands. The software choice may become the weak link in the process. Many other research groups [4-21] have compared docking programs and worked to improve the correlation of predicted docking results with experimental results (X-ray data). The disagreement between virtual and actual screening of small molecules for tight binding to proteins remains an active area of investigation. Reliance on software that is less than optimal in its predictions continues to present an obstacle during the research process.
This study investigated whether geometric attributes of binding site pockets, which can be defined quickly with online tools, can be used as one of several indicators to determine which program is optimal for large-scale docking screens of small molecules. Here, the shape of binding pockets is evaluated as an independent indicator of the docking performance of several molecular docking programs.
Citation
Keyes RM, Pejo E, Katagiri K, Huynh K, Rudnitskaya A, et al. (2016) Shape Matters: Improving Docking Results by Prior Analysis of Geometric Attributes of Binding Sites. JSM Chem 4(1): 1020.
METHODS
One hundred different, unrelated structures were used to achieve an accurate representation of the variety of proteins in the Protein Data Bank (PDB), as previously described in the Astex study [22]. The three-dimensional protein structures were selected according to the following criteria: (1) the structure had a resolution better than 2.5 Å; (2) the R factor was less than 0.20; (3) the difference between the R factor and R free was less than 3% (low model bias); (4) more than 95% of the residues were in allowed regions of the Ramachandran plot, indicating good stereochemistry of the model [23]; and (5) the ligand had full occupancy and a reasonable average B factor.
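The selection criteria above can be expressed as a simple filter. The following is a minimal sketch only: the `StructureRecord` class and its field names are hypothetical, and because the text gives no numeric cutoff for a "reasonable" average B factor, that check is assumed to be done by manual inspection and is not coded here.

```python
from dataclasses import dataclass


@dataclass
class StructureRecord:
    """Hypothetical summary of one PDB entry (all field names are assumptions)."""
    pdb_id: str
    resolution: float            # in Å; a smaller value means better resolution
    r_factor: float              # crystallographic R factor, e.g. 0.18
    r_free: float                # R free, e.g. 0.20
    ramachandran_allowed: float  # fraction of residues in allowed regions (0 to 1)
    ligand_occupancy: float      # 1.0 means full occupancy


def passes_quality_criteria(s: StructureRecord) -> bool:
    """Apply the numeric selection criteria listed in the text."""
    return (
        s.resolution < 2.5                      # (1) resolution better than 2.5 Å
        and s.r_factor < 0.20                   # (2) R factor below 0.20
        and (s.r_free - s.r_factor) < 0.03      # (3) R free minus R below 3%
        and s.ramachandran_allowed > 0.95       # (4) more than 95% in allowed regions
        and s.ligand_occupancy >= 1.0           # (5) ligand at full occupancy
    )
```

A list of candidate entries could then be reduced with `[s for s in candidates if passes_quality_criteria(s)]`, where `candidates` is a hypothetical list of such records.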
Defining Protein Binding Pockets
To identify pockets and calculate pocket volume and mouth area from the protein structure, the Computed Atlas of Surface Topography of proteins (CASTp) server, with structural and topographical mapping of functionally annotated residues [24], was used. The CASTp server uses weighted Delaunay triangulation and the alpha complex method to quantify the shape of surface accessible pockets and interior cavities. The pocket can be defined by Voronoi diagrams and characterized by Delaunay triangulation and alpha shape descriptors using Liang’s method [24]. The Delaunay triangulation method sums together adjacent triangles and is similar to a cubic or rectangular definition of the binding pocket cavity. In contrast, alpha shape measurement utilizes concentric circles; the alpha shape approximates a spherical description of the binding pocket in three dimensions. Voronoi and Delaunay shape descriptors utilize a method of calculation known as discrete flow to identify and measure the pockets. In CASTp, the molecular surface (ms) values are the total surface area and volume of the void (cavity), and the solvent accessible (sa) values are the surface area and volume of the pocket that are accessible to solvent [25]. Thus the ratio of the solvent accessible area or volume of the pocket to the total molecular surface area or volume (Asa/ms or Vsa/ms) yields the fraction, or percent, of solvent accessible area or volume. These two attributes, Asa/ms and Vsa/ms, accurately predict the shape of the protein binding pocket. When the solvent accessible area or volume is low relative to the total molecular surface, the protein binding site is highly encapsulated (Figure 1A). Conversely, when the calculated solvent accessible area or volume is high relative to the total molecular surface, the protein binding pocket is more open, with a wide mouth (Figure 1B).
The list of interacting pocket residues was matched against pocket data generated by the CASTp server to assign protein-ligand interactions to a pocket ID. The calculated characteristics, including the partitioning of volume and area of the pockets, are shown in the Supplemental Materials. Pockets identified as containing greater than 70% of the total interactions between the protein and a crystallographic ligand, or two or more pockets in tandem with greater than 35% of the total interactions, were used for this study. PDB entries that had fewer than 70% of the protein-ligand interactions within the pocket described by CASTp were reevaluated with a similar volume calculation method using the CCP4 (Collaborative Computational Project, 1994) [25] program before they were removed from the study.
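The two shape descriptors are plain ratios of the CASTp pocket measurements. The sketch below illustrates the arithmetic; the function and argument names are hypothetical, and the four input values are assumed to be read by hand from the CASTp output for a single pocket.

```python
from typing import NamedTuple


class ShapeRatios(NamedTuple):
    a_sa_ms: float  # Asa/ms: solvent accessible area / molecular surface area
    v_sa_ms: float  # Vsa/ms: solvent accessible volume / molecular surface volume


def shape_ratios(area_sa: float, area_ms: float,
                 volume_sa: float, volume_ms: float) -> ShapeRatios:
    """Compute Asa/ms and Vsa/ms for one pocket.

    Areas are in Å^2 and volumes in Å^3, so both ratios are dimensionless.
    """
    if area_ms <= 0 or volume_ms <= 0:
        raise ValueError("molecular surface area and volume must be positive")
    return ShapeRatios(area_sa / area_ms, volume_sa / volume_ms)
```

Low values of both ratios correspond to the encapsulated sites described in the text, and high values to open sites with a wide mouth.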
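The pocket selection rule can be sketched as follows. This is one possible reading, stated as an assumption: a single pocket is accepted when it holds more than 70% of the protein-ligand interactions, and otherwise two or more pockets are accepted in tandem when each holds more than 35%. The function name and data layout are hypothetical.

```python
from typing import Dict, List


def select_binding_pockets(contacts_per_pocket: Dict[int, int],
                           total_interactions: int,
                           single_cutoff: float = 0.70,
                           tandem_cutoff: float = 0.35) -> List[int]:
    """Return the CASTp pocket IDs accepted as the ligand binding site.

    contacts_per_pocket maps a pocket ID to the number of protein-ligand
    interactions whose residues fall in that pocket. total_interactions is
    the number of interactions for the whole complex, including any that
    fall outside every pocket. An empty list means no pocket qualifies.
    """
    if total_interactions <= 0:
        return []
    fractions = {pid: n / total_interactions
                 for pid, n in contacts_per_pocket.items()}
    # Case 1: one pocket accounts for most of the interactions.
    single = [pid for pid, f in fractions.items() if f > single_cutoff]
    if single:
        return [max(single, key=lambda pid: fractions[pid])]
    # Case 2 (assumed reading): two or more pockets, each above the tandem cutoff.
    tandem = sorted(pid for pid, f in fractions.items() if f > tandem_cutoff)
    return tandem if len(tandem) >= 2 else []
```

Entries for which this returns an empty list correspond to the structures that were reevaluated before being removed from the study.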
Preparation of Ligands
Manual evaluations of crystallographic ligands using the Ligand Explorer software were also performed to create text files of ligand-protein interactions for comparison with the pockets defined by CASTp. Upon further analysis, the failures in CASTp were attributed to three factors: (1) the binding pocket for the ligand is composed of two or more subunits of the protein, so an incomplete or fragmented pocket is defined; (2) the binding pocket for the ligand is enclosed on only three sides and has a large solvent-accessible entry to the binding site (no mouth); and (3) the critical residues in the active site are on mobile elements such as loops, undergo large conformational changes upon substrate or natural ligand binding, and are not properly positioned with substrate analogues to be included in the binding pocket. In summary, 4% of the proteins evaluated did not have assigned pockets for ligand binding sites, and another 11% had pockets but no detectable mouths. Therefore, 85 of the 100 PDB entries originally selected were included in this study.
Once protein and ligand structures were prepared for computational docking, they were processed in four readily available docking programs: AutoDock4.2 [26], DOCK6.1 [27], GOLD [28], and Surflex [29]. These four programs were chosen because they use different approximations for treating the ligands. AutoDock 4 docks the ligand to a set of grids describing the target protein. The DOCK algorithm (utilized by DOCK6.1) uses a geometric matching system to place a ligand in the “negative image” of the binding pocket. GOLD calculates docking modes of small molecules in protein binding sites and offers full ligand flexibility. In addition, GOLD has improved handling and control of bound water molecules and metal coordination geometries. The Surflex program works by fragmenting the ligand into stable functional groups to be fit into the binding pocket of the protein (protomol).
Ligand Evaluation
The docked ligands were compared with the actual crystallographic ligand. The root mean square deviation (RMSD) between the two molecules was the main criterion for comparing the docked and actual ligand conformations. The RMSD was calculated in the Quanta software and confirmed in Qmol [30]. Since the binding strength between the protein and the ligand is determined by electrostatic, hydrophilic and hydrophobic interactions, the X-Score function [31,32] was used as a common “yardstick” to compare the binding energies of the top-scoring ligands from all four docking programs. The X-Score values are reported in the Supplemental Materials.
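For reference, the RMSD between a docked pose and the crystallographic pose can be computed as in the sketch below. It assumes that both poses list the same atoms in the same order and that no superposition is applied, since the docked ligand is already in the coordinate frame of the protein. It is an illustration of the quantity only, not the Quanta or Qmol implementation.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def ligand_rmsd(docked: Sequence[Coord], crystal: Sequence[Coord]) -> float:
    """RMSD in Å between two poses of the same ligand, without realignment."""
    if len(docked) != len(crystal) or len(docked) == 0:
        raise ValueError("poses must contain the same non-zero number of atoms")
    # Sum of squared distances between matched atoms.
    squared = sum(
        (a - b) ** 2
        for p, q in zip(docked, crystal)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(docked))
```

In practice the coordinates would be restricted to a consistent atom set (for example, heavy atoms only) in both poses; that choice is an assumption not stated in the text.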
RESULTS AND DISCUSSION
Analysis shows that the binding sites are usually located at large indentations (pockets) on the surface of the proteins. Once the binding pocket was identified by the method described in the previous section, a detailed characterization of the pocket shape, including the volumes and areas of the molecular and solvent accessible surfaces, was obtained. We examined the CASTp parameters associated with such an arrangement of atoms on the surface of the protein and found Vsa/ms, the ratio of the solvent accessible volume to the total molecular surface volume, to be a useful descriptor of binding sites that is associated with the effectiveness of docking. We also identified the similar parameter Asa/ms, the ratio of the solvent accessible surface area to the total molecular surface area. The Vsa/ms and Asa/ms ratios of the pocket and mouth, respectively, were analyzed in RMSD clusters of 0.5 Å, as ligand docking results within this distance are considered equivalent. In the supporting material, the Vsa/ms and Asa/ms values are shown with the RMSDs for each of the four programs. A frequency count for RMSD in 0.5 Å increments is shown in Table 1. This organization was chosen because AutoDock output files group ligand conformations within 0.5 Å as equivalent [27]. Both AutoDock4 and DOCK6.1 showed their highest frequency of success in the RMSD range of 0.5-1.5 Å. More specifically, of the 85 PDB entries tested, AutoDock produced 33 structures with an RMSD of 1.5 Å or less from the crystallographic position of the ligand. Dock6 followed closely with 31 structures, then Surflex with 18 and GOLD with only 13 (Table 1).
The data presented in the Supplemental Materials show that many ligand-bound protein pockets tend to have a large molecular surface volume relative to the total solvent-accessible volume. This is consistent with Liang’s computational method of differentiating binding pockets from indentations on the protein surface. Neither GOLD nor Surflex showed a clear correlation of docking performance with the shape descriptor ratios, and overall both programs had a lower frequency of success than either AutoDock4 or DOCK6.1. Pockets that have a high Vsa/ms ratio are more similar to invaginations or indentations and thus perform poorly in predictive automated docking with all four programs. This may be because, unless specific water molecules are modeled, the low total number of interactions results in a lower predicted binding in the absence of any mathematical consideration of induced fit. In contrast, when the Vsa/ms ratio is low, the pocket is more like a channel for small ions than a pocket for ligand binding. As shown in Figure 1A, the performance of DOCK 6.1 was acceptable with a highly encapsulating docking site (low Vsa/ms), whereas in Figure 1B AutoDock 4.2 performed well in a more open docking site (high Vsa/ms). Figures 2A and 2B show the negative and positive correlations that AutoDock and Dock6 docking performance have with percent solvent accessible volume. Neither Surflex nor GOLD showed a positive or negative correlation (see Supplemental Material). Overall, Dock6 showed the highest correlation of docking with shape descriptors, as the R² of a linear regression of Figure 2B was ~0.72.
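The frequency count behind Table 1 can be reproduced with a short binning routine. The sketch below is illustrative: the function names are hypothetical, and the bin boundaries are assumed to be closed on the upper edge so that an RMSD of exactly 1.5 Å counts as "1.5 Å or less".

```python
import math
from typing import List, Sequence


def rmsd_histogram(rmsds: Sequence[float], width: float = 0.5,
                   upper: float = 5.0) -> List[int]:
    """Count RMSD values in bins of `width` Å; the last slot holds RMSD > upper."""
    n_bins = round(upper / width)
    counts = [0] * (n_bins + 1)
    for r in rmsds:
        if r < 0:
            raise ValueError("RMSD cannot be negative")
        if r > upper:
            counts[-1] += 1
        else:
            # Upper edge inclusive: 1.5 falls in the 1.0-1.5 bin; 0.0 in the first bin.
            counts[max(math.ceil(r / width) - 1, 0)] += 1
    return counts


def count_successes(counts: Sequence[int], width: float = 0.5,
                    cutoff: float = 1.5) -> int:
    """Number of structures docked within `cutoff` Å of the crystal pose."""
    return sum(counts[: round(cutoff / width)])


# First three rows of Table 1 (0-0.5, 0.5-1.0, 1.0-1.5 Å) for each program.
table1_first_bins = {
    "AutoDock": [1, 12, 20],
    "Dock6": [3, 11, 17],
    "GOLD": [0, 4, 9],
    "Surflex": [1, 6, 11],
}
successes = {name: count_successes(bins) for name, bins in table1_first_bins.items()}
```

Applied to the first three rows of Table 1, `successes` gives the counts quoted in the text: 33, 31, 13 and 18 for AutoDock, Dock6, GOLD and Surflex.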
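The correlation between docking RMSD and percent solvent accessible volume can be quantified with an ordinary least-squares fit. The following minimal sketch computes the slope, intercept and R² for paired values; the function name is hypothetical, and the actual per-structure data are those in the Supplemental Materials, which are not reproduced here.

```python
from typing import Sequence, Tuple


def linear_fit_r2(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares line y = slope * x + intercept and its R^2.

    x would hold the shape descriptor (e.g. Vsa/ms) and y the docking RMSD.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, 1.0 - ss_res / ss_tot
```

The sign of the returned slope distinguishes a positive from a negative correlation, and the third value is the R² reported for the regression.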
CONCLUSION
Drug development studies focus on inhibitor binding to protein targets. Many drug companies use molecular docking to find lead compounds with which to initiate the drug discovery process. An accurate starting point, whether for synthesized or purchased compounds, is therefore necessary. Existing databases of small molecules contain thousands to millions of commercially available compounds whose potency can be tested against a wide range of diseases. These databases are mirrored by in silico databases that hold 3D coordinates of ligands and inhibitors (the ZINC database is one, and includes the Sigma Aldrich catalogue of ~3 million compounds). Docking programs have limited capacity for careful testing of all available ligands and inhibitors. We have presented here a procedure to limit the number of docking experiments and increase their success by investigating the dependence of that success on the shape of the protein target. When large numbers of small organic molecules are to be screened, protocols such as the one described here need to be developed to ensure that the docking program selected performs optimally on the particular binding site. This study provides a recipe, a protocol for choosing the best-performing docking program based on the shape of the binding site.
In summary, detecting the shape of the docking site is a complex problem for which there is no comprehensive solution that provides a simple geometric shape descriptor. Both descriptors (area and volume) proved useful in discriminating between programs by their success rate over the entire descriptor space. This result holds even though the descriptors are sometimes significantly correlated. The use of a single descriptor offers significant advantages in selecting the program for maximum effectiveness, especially in a two-stage approach: first computing the geometric descriptor of the docking site and then selecting the most efficient docking algorithm for that range of the descriptor.
Table 1: Frequency of Docking Hits by RMSD Cluster.
RMSD Cluster (Å) | AutoDock | Dock6 | GOLD | Surflex |
0-0.5 | 1 | 3 | 0 | 1 |
0.5-1.0 | 12 | 11 | 4 | 6 |
1.0-1.5 | 20 | 17 | 9 | 11 |
1.5-2.0 | 11 | 11 | 10 | 9 |
2.0-2.5 | 7 | 10 | 8 | 5 |
2.5-3.0 | 9 | 8 | 11 | 8 |
3.0-3.5 | 14 | 13 | 12 | 18 |
3.5-4.0 | 10 | 8 | 15 | 10 |
4.0-4.5 | 1 | 3 | 5 | 6 |
4.5-5.0 | 0 | 1 | 4 | 2 |
RMSD > 5.0 | 0 | 0 | 7 | 9 |
ACKNOWLEDGMENT
The authors thank the developers of all the programs used during the preparation of this paper, especially for the trial version of Surflex-Dock™ 2.0.