Challenges in Docking: Mini Review
- 1. Department of Pharmaceutical Chemistry, Jamia Hamdard, India
Abstract
Docking is a computational technique that helps in understanding ligand-macromolecule interactions by placing the ligand in the binding site of a macromolecular target. Understanding the principles by which protein receptors recognize, interact with, and associate with molecular substrates and inhibitors is of paramount importance in drug discovery. Despite enormous advances in computational science and biology over the last few decades and the widespread application of docking methods, several pitfalls still exist. This review outlines the key concepts of protein-ligand docking methods, with emphasis on the challenges currently faced. In particular, ligand and protein flexibility, a critical aspect for a thorough understanding of the principles that guide ligand binding in proteins, is discussed. Developments related to receptor representation (flexibility, water molecules) and ligand representation (protonation, tautomerism, and stereoisomerism) are also touched upon. The review focuses on docking challenges in the context of drug design, specifically structure-based drug design.
Keywords
• Computational docking
• Receptor flexibility
• Ligand flexibility
• Protein-ligand complexes
Citation
Akhter M (2016) Challenges in Docking: Mini Review. JSM Chem 4(2): 1025.
ABBREVIATIONS
RNA: Ribonucleic Acid; DNA: Deoxyribonucleic Acid; GA: Genetic Algorithms; PSO: Particle Swarm Optimisation Algorithms; ACO: Ant Colony Optimisation Algorithms; TS: Tabu Search Algorithms; MC: Monte Carlo Algorithms; NMR: Nuclear Magnetic Resonance
INTRODUCTION
Molecular docking is a key tool in structural molecular biology and computer-assisted drug design. In the post-genomic era, docking has become an indispensable part of drug discovery and genome informatics. Docking is a computational method that attempts to find the "best" match between two molecules: a small molecule (ligand) and a macromolecule (protein, DNA, RNA), or two macromolecules of different size (Figure 1). From the very first docking program, DOCK, developed by Kuntz in 1982 [1] to dock rigid molecules one by one to rigid proteins, to state-of-the-art programs that can dock complete libraries of highly flexible small molecules to flexible proteins in the presence of water molecules and metals, a large amount of work has been reported and the complexity of the programs has gradually increased.
Broadly, two docking procedures are commonly considered: (1) rigid docking and (2) flexible docking. Rigid docking (or lock and key) considers essentially only the geometric complementarity between ligand and receptor; it does not take into account the flexibility of either molecule. This rigidity may limit the specificity and accuracy of the results, but the technique is capable of identifying ligand binding sites for several different proteins precisely. The approach is now more commonly used for protein-protein docking because of the complexity involved in modeling the flexibility of protein molecules. In flexible docking (or induced fit), the ligand, the protein, or both molecules at the same time are kept flexible, and the energy of different conformations of the ligand fitting into the protein is calculated. Compared to rigid docking, flexible docking is more specific but demands more computational power and CPU time.
A docking protocol combines a search algorithm with a scoring function. Execution of molecular docking requires:
1. Structural data
2. Protein target of interest
3. Procedure to estimate protein-ligand interaction poses and strengths
Ligand sampling algorithms are essential for generating acceptable ligand poses. The search algorithm should explore the degrees of freedom of the protein–ligand complex thoroughly enough to include the true binding poses. A good search algorithm must be fast and must effectively cover the relevant conformational space. Figure 2 gives the outline of a flexible docking search algorithm. Search algorithm (pose generation): the search algorithm is the process by which the possible conformations and orientations of the ligand-receptor complex within a given space, i.e. the binding pocket of the receptor, are searched [2]. These algorithms are of the following types:
Matching Algorithms
In a matching algorithm, the shape of the ligand is matched with that of the binding pocket by placing the ligand in the binding site of the protein [1]. Before the docking procedure itself is performed, different conformations of the ligand are generated using other programs (e.g., OMEGA [3,4], Corina [5]).
Incremental Construction Algorithm or Fragmentation Based
The incremental construction algorithm is based on reconstructing the ligand in the active site of the receptor, starting from an anchor fragment. Rotatable bonds are identified in the ligand, and the ligand is then dissected at these bonds to obtain rigid pieces. Docking is first performed with a terminal fragment; its best-scoring poses are kept and used as anchors onto which the remaining fragments are progressively added and minimized until the molecule has been fully reconstructed. Docking programs that use incremental construction algorithms include DOCK, FlexX, and FLOG. Glide also uses the same algorithm, and a number of articles using these programs can be found in the literature [6].
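The anchor-and-grow idea described above can be sketched on a toy problem. The following is a minimal illustration, not the procedure of any particular docking program: the ligand is reduced to a chain of points in 2D, each new fragment is placed in one of a few discrete directions, and only the best few partial poses are kept at each step. The names `grow_ligand` and `make_pocket_score`, the direction set, the bond length, and the score are all assumptions made for this example, and no minimization step is included.

```python
import math


def grow_ligand(anchor, n_fragments, score,
                directions_deg=(-120, -60, 0, 60, 120, 180),
                bond_length=1.5, beam_width=3):
    """Anchor-and-grow search on a 2D toy ligand.

    Starts from the anchor position, adds one fragment (one point) at a time
    in each allowed direction, and keeps only the `beam_width` best partial
    poses after every step. Lower scores are better.
    """
    partial_poses = [[anchor]]
    for _ in range(n_fragments):
        candidates = []
        for pose in partial_poses:
            x, y = pose[-1]
            for angle in directions_deg:
                rad = math.radians(angle)
                new_point = (x + bond_length * math.cos(rad),
                             y + bond_length * math.sin(rad))
                candidates.append(pose + [new_point])
        # Keep only the best-scoring partial poses as anchors for the next step.
        candidates.sort(key=score)
        partial_poses = candidates[:beam_width]
    return partial_poses[0]


def make_pocket_score(pocket_sites):
    """Hypothetical score: summed distance of atom i from pocket site i."""
    def score(pose):
        return sum(math.dist(p, s) for p, s in zip(pose, pocket_sites))
    return score
```

Keeping several partial poses rather than one reduces the risk that an early, locally good placement of the anchor rules out the correct final pose.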
Stochastic Algorithms
Stochastic search methods modify the conformation of the small molecule in the receptor site and assess it on the fly. This class of search algorithms includes Genetic Algorithms (GA), Particle Swarm Optimisation Algorithms (PSO), Ant Colony Optimisation Algorithms (ACO), Tabu Search Algorithms (TS) and Monte Carlo Algorithms (MC). Docking programs based on GA or other evolutionary techniques include AutoDock, FITTED and GOLD (Figure 3).
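As an illustration of the stochastic family, a Monte Carlo search with the Metropolis acceptance rule can be sketched as below. This is a generic sketch under stated assumptions, not the implementation of any program named above: the pose is an arbitrary list of numbers (for example translations and torsion angles), and `metropolis_search`, the step size, the temperature, and the score passed in are hypothetical.

```python
import math
import random


def metropolis_search(score, start, n_steps=2000, step=0.5,
                      temperature=1.0, seed=0):
    """Random-perturbation search with Metropolis acceptance.

    `score` maps a pose (list of floats) to a number; lower is better.
    Returns the best pose visited and its score.
    """
    rng = random.Random(seed)
    current = list(start)
    current_score = score(current)
    best, best_score = list(current), current_score
    for _ in range(n_steps):
        # Perturb every degree of freedom by a small random amount.
        trial = [v + rng.uniform(-step, step) for v in current]
        trial_score = score(trial)
        delta = trial_score - current_score
        # Always accept improvements; accept worse poses with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score
```

Occasionally accepting a worse pose is what lets the search escape local minima; lowering `temperature` over the run would turn this into simulated annealing.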
Docking Optimization (Scoring Function or Pose Selection)
Scoring functions estimate the binding affinity of ligand poses. A good scoring function should represent the thermodynamics of interaction of the protein–ligand system, so as to distinguish the true binding modes from all the other explored poses and rank them accordingly. The accuracy of docking depends on the quality of the scoring functions, which are approximate mathematical methods for estimating binding affinity, i.e. for finding the highest-affinity ligand against a target. In general, the score is represented by the following equation:
Score = S (target-ligand) + S (ligand)
The term S (target-ligand) is a sum over contributions from all heavy atom contacts between the ligand and the receptor involved in binding.
The binding between a ligand and its receptor is controlled by several factors: the interactions between the ligand and the receptor, the entropic changes that occur upon binding, and the desolvation and solvation energies associated with the interacting molecules. The final free energy of binding (ΔG) depends on the overall balance of these factors. The interaction forces between the ligand and the receptor are electrostatic, hydrogen bond, van der Waals, and hydrophobic interactions.
The docking score is generally calculated by the formula
ΔGbinding = ΔGvdW + ΔGelec + ΔGhbond + ΔGdesolv + ΔGtors
where ΔGvdW: 12-6 Lennard-Jones potential; ΔGelec: Coulombic with Solmajer dielectric; ΔGhbond: 12-10 potential with Goodford directionality; ΔGdesolv: Stouten pairwise atomic solvation parameters; ΔGtors: number of rotatable bonds.
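To make the additive structure of such a score concrete, the sketch below evaluates three of the five terms for a set of atoms. It is a simplified illustration, not the scoring function of any docking program: the hydrogen-bond and desolvation terms are omitted, a simple distance-dependent dielectric (ε = 4r) stands in for the Solmajer dielectric, and the well depth, equilibrium distance, and torsional weight are hypothetical values chosen only for the example. The constant 332.0 converts charge products in electron units and distances in Å to kcal/mol.

```python
import math


def lennard_jones_12_6(r, epsilon=0.15, r_min=4.0):
    """12-6 potential with well depth `epsilon` at separation `r_min`."""
    ratio = r_min / r
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)


def coulomb(r, q1, q2, dielectric_slope=4.0):
    """Coulomb energy with a distance-dependent dielectric eps(r) = slope * r."""
    return 332.0 * q1 * q2 / (dielectric_slope * r * r)


def toy_score(ligand_atoms, receptor_atoms, n_rotatable_bonds, w_tors=0.3):
    """Sum pairwise vdW and electrostatic terms plus a torsional penalty.

    Atoms are (x, y, z, charge) tuples; parameters are illustrative only.
    """
    total = 0.0
    for lx, ly, lz, lq in ligand_atoms:
        for rx, ry, rz, rq in receptor_atoms:
            r = math.dist((lx, ly, lz), (rx, ry, rz))
            total += lennard_jones_12_6(r) + coulomb(r, lq, rq)
    # The torsional term grows with the number of rotatable bonds.
    return total + w_tors * n_rotatable_bonds
```

Because the terms are summed independently, each one can be calibrated or replaced separately, which is also why errors in any single term propagate directly into the final score.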
Various Scoring functions include
1. Empirical Scoring
2. Shape and chemical complementarity scoring
3. Force field Scoring
4. Knowledge based Scoring
5. Clustering and entropy based Scoring
6. Consensus Scoring.
Challenges in Docking
Docking is currently in a mature stage of development, but it is still far from perfect. Most docking programs available now can normally reproduce known protein-bound poses to within about 1.5–2 Å, with reported success rates in the range of 70–80%. Selecting the docking program that will give the best result for a given target is not straightforward. The most stringent test of docking is the accurate prediction of the binding affinities of a series of related compounds.
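Pose-prediction accuracy of this kind is usually expressed as the root-mean-square deviation (RMSD) between docked and crystallographic ligand coordinates. A minimal sketch follows; it assumes that both poses list the same atoms in the same order and in the same receptor frame, and it ignores ligand symmetry. The function names and the 2.0 Å default cutoff are choices made for this example.

```python
import math


def rmsd(pose_a, pose_b):
    """Root-mean-square deviation between two poses given as lists of (x, y, z)."""
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and have the same number of atoms")
    squared = sum((a - b) ** 2
                  for atom_a, atom_b in zip(pose_a, pose_b)
                  for a, b in zip(atom_a, atom_b))
    return math.sqrt(squared / len(pose_a))


def success_rate(pose_pairs, cutoff=2.0):
    """Fraction of (docked, crystal) pairs whose RMSD is within `cutoff` angstroms."""
    hits = sum(1 for docked, crystal in pose_pairs if rmsd(docked, crystal) <= cutoff)
    return hits / len(pose_pairs)
```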
Ligand representations: Ligand representation and preparation can affect docking results, because recognition of a ligand by a protein depends on shape (i.e. 3D structure) and electrostatic complementarity. Ligand conformational sampling is as important as correct ligand preparation. The question of the appropriate representation of molecules in databases has been addressed recently [8]. In most docking programs, the tautomeric and protomeric states of the small molecules to be docked are defined by the user. Typically, the structure most likely to be dominant at neutral pH is generated. The structures can be further optimized by removing or adding hydrogens, provided approximate pKa values are known a priori.
The accuracy of atom typing is extremely important, as a wrong definition of the donor and acceptor properties of heteroatoms may lead to serious docking errors. When the stereochemistry of a synthesized compound is unknown, it is better to generate all possible diastereoisomers of the structure and dock them individually to the receptor. Commercial software programs that enumerate all possible diastereoisomers of a given compound include Stereoplex [9], Stergen [10], and Pipeline Pilot [11]. Accounting for the various tautomeric and protomeric states of the molecules can be challenging during docking. Many databases store molecules such as acids or amines in their neutral forms. Because these are considered ionized under physiological conditions, it is necessary to ionize them prior to docking. While standard ionization is easy to achieve, the generation of tautomers is much more challenging: which tautomer should one use? Or should one use all possible tautomers, or fewer, of a given molecule? Not only for tautomers but also for ionization states, balanced equilibria between the various forms pose real challenges in docking.
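Whether an acid or an amine should be docked in its ionized form can be estimated from its pKa with the Henderson-Hasselbalch relation. The sketch below is a simple illustration for a single ionizable group; the function name, the default pH of 7.4, and the pKa values in the usage note are assumptions for the example, not values taken from the works cited here.

```python
def fraction_ionized(pka, ph=7.4, kind="acid"):
    """Fraction of a monoprotic group that is ionized at the given pH.

    kind="acid": fraction deprotonated (e.g. carboxylate form).
    kind="base": fraction protonated (e.g. ammonium form).
    """
    if kind == "acid":
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")
```

For a hypothetical acid with pKa 4.4 or a hypothetical amine with pKa 10.4, the ionized fraction at pH 7.4 is about 99.9%, which supports docking the charged form; when pKa is close to the pH, both forms are populated and both may need to be docked.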
Baker and co-workers [12] have also discussed the problem of sampling tautomeric and protonation states, given that the free and bound states of a ligand may differ in these respects. They suggested enumeration of tautomeric and protonation states as a possible solution but warned about its potentially prohibitive computational cost. Another suggested alternative is incremental, segment-wise construction of the docked ligand, whereby the protonation and tautomerism "decisions" become independent and the problem size decreases.
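The cost of full enumeration can be seen from a simple count: every stereocenter, tautomer, and protonation state multiplies the number of ligand forms to dock. The sketch below is a naive illustration that treats all choices as independent; it ignores meso forms and any coupling between tautomeric and protonation states, and the function name and state labels are hypothetical.

```python
from itertools import product


def enumerate_states(n_stereocenters, tautomers, protomers):
    """List every (stereo label, tautomer, protomer) combination.

    Each undefined stereocenter is assigned R or S independently, so the
    result has 2**n_stereocenters * len(tautomers) * len(protomers) entries.
    """
    stereo_labels = ["".join(s) for s in product("RS", repeat=n_stereocenters)]
    return list(product(stereo_labels, tautomers, protomers))
```

Three undefined stereocenters, two tautomers, and two protonation states already give 32 forms of a single compound, each requiring its own docking run.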
Some docking programs, such as GOLD and the Protein-Ligand ANT System (PLANTS), have been found to have problems identifying the correct stereoisomer, which affects the final outcome. Ten Brink and Exner [13] have recommended that a preselection of plausible protomers and tautomers be performed routinely. They have developed the Structure Protonation and Recognition System (SPORES), a tool for preprocessing proteins and ligand-protein complexes and for setting up 3D ligand databases. SPORES performs rule-based assignment of atom types and, based on these assignments, generates tautomers, protonation states, and stereoisomers.
Receptor representations: The common sources of 3D receptor structures for docking are X-ray crystallography and NMR (for smaller proteins). The growing gap between sequence and structure availability has to some extent been filled by homology modeling, threading, and de novo methods. However, the quality of such models for docking in general, and for virtual screening in particular, has to be evaluated before use.
The quality of the receptor structure employed plays an important role in determining the success of a docking protocol [14-17]. In general, the higher the resolution of the 3D crystal structure employed, the better the observed docking results. The reproducibility of a program increases as the resolution of the co-crystal structures employed improves (numerically below 2.0 Å) [16]. A recent review of the accuracy, limitations and challenges of structure refinement protocols for protein-ligand complexes provided a critical assessment of the available structures [7]. Regardless of the possible ambiguities, success has been reported for a large number of high-throughput docking studies using X-ray receptor structures. Recent examples include kinesin [17], HIV protease [18], phosphoribosyl transferase [19], FKBP12 [20], farnesyl transferase [21], beta-lactamase [22], and PTP1B [23].
Ligand binding often results in conformational changes of the protein, so ignoring protein flexibility during molecular docking may give incorrect results [24].
One of the major challenges in docking is the handling of flexible protein receptors. A protein can adopt different conformations depending on the ligand to which it binds. Docking performed with a rigid receptor therefore corresponds to a single receptor conformation. Certain ligands, however, require a different receptor conformation in order to bind, in which case the receptor must be kept flexible. Proteins are in constant motion between different conformational states of similar energy. This is usually neglected in docking studies, although protein flexibility is known to allow increased affinity between a given drug and its target. The number of degrees of freedom included in the conformational search is an important factor in determining search efficiency.
A biological system usually consists of a ligand, the macromolecular receptor, and solvent molecules. A large number of degrees of freedom are associated with the solvent molecules; these are normally excluded from the problem, and solvent effects are sometimes considered implicitly in the scoring functions. The remaining degrees of freedom, those of the ligand and the receptor, can be reduced through different approximations, allowing the search space to be sampled more effectively.
Approaches that have been proposed to deal with a flexible receptor include:
• Letting the receptor or parts thereof move during docking
• Docking the compounds into several different conformations of the same receptor and aggregating the results
• Docking into averaged receptor representations.
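The second approach in the list above, docking into several receptor conformations and aggregating the results, can be sketched as follows. This is an illustrative aggregation step only: it assumes the docking scores have already been computed, that lower scores are better, and that every ligand was docked into every conformation. The function name, the two aggregation modes, and the data layout are assumptions made for the example.

```python
def aggregate_ensemble(scores_by_conformation, how="best"):
    """Combine per-conformation docking scores into one ranking.

    `scores_by_conformation` maps conformation name -> {ligand: score}.
    how="best" keeps each ligand's lowest score; how="mean" averages them.
    Returns (ligand, aggregated score) pairs, best first.
    """
    per_ligand = {}
    for scores in scores_by_conformation.values():
        for ligand, value in scores.items():
            per_ligand.setdefault(ligand, []).append(value)
    if how == "best":
        combined = {lig: min(vals) for lig, vals in per_ligand.items()}
    elif how == "mean":
        combined = {lig: sum(vals) / len(vals) for lig, vals in per_ligand.items()}
    else:
        raise ValueError("how must be 'best' or 'mean'")
    # Sort by score, then by name so that ties are ordered reproducibly.
    return sorted(combined.items(), key=lambda item: (item[1], item[0]))
```

Taking the best score per ligand rewards any conformation that accommodates the ligand, whereas averaging penalizes ligands that fit only one conformation; which is preferable depends on the target.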
Receptor flexibility has been treated extensively in some software, for example through Monte Carlo (MC) simulations and rotamer libraries. RosettaLigand offers one of the most extensive receptor flexibility treatments developed to date [12,25]: the binding-site side-chain rotamers are optimized using a simulated annealing procedure, and the backbone is minimized subject to restraints. AutoDock 4 also fully models the flexibility of selected portions of the protein [26]. Side chains of interest are separated from the protein and treated explicitly during the simulation, allowing rotation around torsional degrees of freedom. The Induced Fit Docking (IFD) workflow [27,28] in Schrödinger involves rigid receptor docking with Glide [29,30], combined with minimization of the protein-ligand complex using the homology modeling module Prime [28]. In the MADAMM procedure, the protein is made flexible using the side-chain rotamer libraries of Insight II [31]. IFD has been used successfully in studies of HIV-1 integrase [32], kinases [33,34], monoacylglycerol lipase [35], and heat shock protein 90 [36].
Receptor ensembles generated by molecular dynamics (MD) have also been widely used to handle the problem of receptor flexibility [37,38]. Abagyan and Totrov have developed a 4D-docking protocol for Internal Coordinate Mechanics (ICM) in which the fourth dimension is receptor conformation [39-41]. In this protocol, multiple receptor conformations are represented by multiple grids, and the choice of grid is treated as a variable in the global optimization. This approach achieved increased accuracy with no loss in effectiveness compared to single-grid methods.
Active site water molecules are another important aspect of target flexibility. Water molecules should be checked carefully to avoid using artifact waters (i.e. water molecules not essential to the protein structure) in the docking process. Using artifact active site water molecules can have a deleterious effect by providing false energetic stability to the protein-ligand complex.
Receptors bind their ligands in solution, and solvation is commonly treated implicitly, that is, through implicit solvent models, knowledge-based scoring functions, or the modification or calibration of other scoring functions. Cincilla et al. [42] modified the solvation treatment in the scoring function of AutoDock 3 [43] to improve predictions for weak complexes containing ligands with polar atoms in the binding site. Kuntz and co-workers [44] have used two implicit-solvent scoring functions, AMBER/GBSA and AMBER/PBSA, implemented in DOCK 6, for docking small molecules to RNA. Sodium ions were used to neutralize the backbone charge, and a double shell of explicit water was used to shield the charges. They found that the quality of pose prediction increased from 70% to 80% for moderately flexible ligands (fewer than 7 rotatable bonds) and from 26% to 50–60% for highly flexible ligands (7–13 rotatable bonds). Studies of the effect of structural water molecules on docking suggest that explicit water molecules improve docking outcomes in virtual screening, both in hit identification and in pose prediction [45]. Englebienne and Moitessier [46] have shown that considering displaceable water molecules, as implemented in FITTED, improves pose prediction but does not significantly affect scoring accuracy; they suggested that the latter may be because most scoring functions were developed for "dry" proteins.
Imperfect scoring functions are another challenge in docking. Even when the search algorithm is capable of generating the right conformations, the scoring function must still be able to distinguish the true binding modes from all the alternatives. A very rigorous scoring function would be computationally too expensive and therefore unfeasible for analyzing many binding modes. Scoring functions make a number of simplifications and assumptions in evaluating ligand affinity, at the cost of accuracy. Certain physical phenomena, such as entropy and electrostatic interactions, are disregarded in contemporary scoring schemes. The lack of a scoring function that is suitable in terms of both accuracy and speed is the major bottleneck in docking algorithms.
In 2009, several groups explored the potential of QM/MM scoring [47-53]. Fong et al. [47] tested three methods (AM1d, HF/6-31G*, and PM3) for ligand treatment in combination with GoldScore, AMBER, and ChemScore for predicting successful poses of six HIV protease inhibitors. Gleeson and Gleeson [48] used a combination of B3LYP/6-31G** and the Universal Force Field (UFF) for successful cross-docking and re-scoring of nine kinase ligands. Chung et al. [49] explored a combination of QPLD, a QM/MM docking program, with SiteMap [54], a binding-site classification module. Using 455 protein-ligand complexes, they demonstrated a scoring improvement over Glide for three binding-site types (hydrophobic, hydrophilic, and metalloprotein). Cho and co-workers have tested a QM/MM scoring function for different types of binding sites, namely those with hydrophobic groups, those with polar groups, and metalloproteins [50,53].
Many scoring functions perform very well for pose prediction, but the goal of predicting binding affinities with scoring functions remains unfulfilled. It is also clear from the literature and from reported advances that the currently available scoring functions can be used more efficiently in combination [55]. Commonly, consensus scoring involves rescoring a docked pose multiple times with different scoring functions, or with a combination thereof [56].
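One simple way to combine several scoring functions is rank-by-rank consensus, in which each function ranks the poses and the ranks are averaged. The sketch below illustrates only this one scheme and is not the protocol of the cited works; it assumes that every function has scored every pose and that lower scores are better for all functions (scores with the opposite convention would need their sign flipped first). The function and pose names are hypothetical.

```python
def rank_by_rank(score_tables):
    """Consensus ranking by mean rank across scoring functions.

    `score_tables` maps scoring-function name -> {pose: score}, lower is better.
    Returns (pose, mean rank) pairs, best consensus first.
    """
    rank_sums = {}
    for scores in score_tables.values():
        # Rank 1 goes to the best (lowest) score; ties are broken by pose name.
        ordered = sorted(scores, key=lambda pose: (scores[pose], pose))
        for rank, pose in enumerate(ordered, start=1):
            rank_sums[pose] = rank_sums.get(pose, 0) + rank
    n_functions = len(score_tables)
    mean_ranks = {pose: total / n_functions for pose, total in rank_sums.items()}
    return sorted(mean_ranks.items(), key=lambda item: (item[1], item[0]))
```

Working with ranks rather than raw scores avoids having to put scoring functions with different units and ranges on a common scale.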
Table 1: Non-exhaustive list of docking programs available and their basic characteristics.
Docking Software | Supported Platform | Search Algorithm | Scoring Algorithm | Protein flexibility | Remark
DOCK [1] | Windows, Mac, Sun, SGI, Unix, Linux, IBM AIX, OSX | Shape fitting | GB/SA Solvation scoring, Chemscore, Bump filters, Contact score, Grid based score, DOCK score | P | Improved prediction of binding poses by force-field scoring
AutoDock [57] | Mac, OSX, Unix, Linux, SGI | Lamarckian genetic, Genetic and Simulated annealing | AutoDock force-field methods | P | Improved prediction of binding poses by force-field scoring
FlexX [58] | Windows, Linux, Unix, SGI, Sun | Incremental construction | Screen Score, Drug Score, PLP, FlexX Score | NP | Scoring is on the basis of protein-ligand interaction
FRED [59] | Windows, Unix, Linux, Mac, SGI, OSX, IBM, AIX | Shape fitting (Gaussian) | Gaussian shape score, PL, Screen score, User defined | | Fastest docking tool, suitable for ultrahigh-throughput docking
GOLD [60] | Sun, SGI, Linux, IBM, Windows | Genetic algorithm | ChemScore, GoldScore, User defined score | P | GOLD has very high docking accuracy
Glide [61] | IBM, SGI, Linux, Unix, AIX | MC sampling | GlideScore, GlideComp | P | High docking accuracy but very slow, not good for large data sets
LigandFit [62] | SGI, Linux, IBM AIX | MC sampling | LigScore, PLP, PMF | P | Very fast, suitable for virtual high or ultra HTS
ICM [63] | Windows, Unix, Mac | Stochastic algorithms (MC techniques adapted in flexible docking algorithms) | ICM scoring function | P |
ProDock [64] | | Monte Carlo sampling | AMBER force-field | P | Bezier spline energy grid has been incorporated to speed up the optimization procedure
QXP [65] | | Monte Carlo sampling | QXP force-field | P | Uses consensus scoring and ranking protocol for highly flexible ligands
Surflex [66] | Linux, Windows and OSX | Incremental construction search | Hammerhead's empirical scoring function | P | Suitable for virtual HTS
Abbreviations: ICM: Internal Coordinate Mechanics; MC: Monte Carlo; SGI: Silicon Graphics Inc; AIX: Advanced Interactive eXecutive; P: Provided; NP: Not Provided; HTS: High-Throughput Screening
CONCLUSION
It is evident from the literature that docking has attained a good degree of maturity, but accounting for flexibility and achieving successful scoring remain significant challenges. Nevertheless, important advances are being made in all aspects of docking programs. The best docking technique should be selected only after thoroughly studying the target, the ligands, and the performance of the docking method. The issue of ligand flexibility seems more or less resolved and does not create much of a problem; protein flexibility, however, needs more attention and improvement. Careful analysis of active-site crystal water molecules is key to good docking. Inclusion of water molecules should be considered after studying their hydrogen bonding with non-water residues and after assessing the relative abundance of the water molecules through analysis of multiple crystal structures. In general, using a single input conformation and performing a single docking run is likely to decrease docking accuracy, although using multiple ligand input conformations does not guarantee accuracy either.
ACKNOWLEDGEMENTS
The author acknowledges research associate Dr. Munnazah Tasleem and research scholar Apeksha Shrivastava for their help in compiling the material.