Determination of ligand position in aspartic proteases by correlating tanimoto coefficient and binding affinity with root mean square deviation

Article history: Received on: 19/09/2015 Revised on: 02/11/2015 Accepted on: 04/12/2015 Available online: 26/01/2016 The objective of this study was to develop and validate of Structure-Based Virtual Screening (SBVS) protocol which was used to select the best pose of inhibitor-aspartic protease complex interaction in the active sites of HIV-1 protease, plasmepsin I, II, and IV. Retrospective validation was performed on enhanced dataset of ligands and decoys (DUD-E) for HIV-1 protease. The crystal structures 1XL2, 3QS1, 1SME, and 1LS5 were obtained from Protein Data Bank. The protocol was then challenged to re-dock the ligands to its origin places in the active sites by correlating Tanimoto coefficient (Tc) and binding affinity (Ei) with Root Mean Square Deviation (RMSD). Enrichment factor at 1% false positives (EF1%) values for Tc and Ei were 18.26 and 9.03, respectively, while the Area Under Curve (AUC) values for Tc and Ei were 76.84 and 60.95. The SBVS protocol was valid and showed better virtual screening qualities in ligand identification for HIV-1 protease compared to the original protocol accompanying the release of DUD-E and showed its ability to reproduce the co-crystal pose in the HIV1 protease, plasmepsin I, II, and IV to its origin places in the active sites.


INTRODUCTION
Aspartic proteases are a family member of protease enzymes that use two highly conserved aspartic acid residues in the active site for catalytic cleavage of their peptide substrates.The generally accepted mechanism of action is acid-base mechanism involving coordination of a water molecule between the two highly conserved aspartate residues.Both of these aspartic acid residues respectively act as proton donors and acceptors, as well as the catalytic hydrolysis of peptide bonds in proteins.The first aspartic acid residue responsible for the initial activation of a water molecule, producing carbon nucleophile then attacks the amide substrate.Tetrahedral intermediate generated would then accept a proton from the second aspartic acid residues and forming products (Davies, 1990).Perhaps the most extensively studies as drug discovery targets are HIV-1 protease for anti-HIV, and plasmepsins for the treatment of malaria.A bioinformatic analysis has demonstrated that P. falciparum plasmepsin II, which is similar to the secretory aspartic protease 2 of Candida albicans (the first nonretroviral microorganism proven to be susceptible to plasmepsins), is one of the eukaryotic proteases that most resemble the HIV-1 protease (Cassone et al., 2002).Critical information on this similarity comes from a search that was conducted in the National Center for Biotechnology Information (NCBI) database with the Vector Alignment Search Tool (VAST), of structural neighbors of the HIV-1 protease.This search revealed a highly significant (P = .00003,by VAST) structural similarity between the HIV-1 protease and plasmepsin II, as well as between the HIV-1 protease and plasmepsin IV, another member of the aspartic protease family of P. falciparum (Tacconelli et al., 2004).Computational studies in drug discovery for anti-HIV and antimalarial have been carried out using aspartic proteases as targets.The one popular method is by Structure-Based Virtual Screening (SBVS) or molecular docking.AutoDock Vina is one of the molecular docking programs.This program is a popular freeware that has been proven could increase the speed and accuracy of docking process (Trott and Olson, 2010), which results are ligand poses and binding affinity (Ei) of each pose as the docking score.SBVS originally only calculates docking score, a simple form of actual binding between ligand and its target.Recently, a novel method defined as interaction fingerprint (IFP), is used as alternative method to visualize the actual binding between ligand and its target.While docking score indicates the affinity of ligand-protein interaction, IFP shows the .specificity of the interaction (Radifar et al., 2013).IFP converts 3D interaction of ligand and protein into 1D bit strings that could be analyzed using PyPLIF, a Python-based open source program.In PyPLIF, IFP bit strings are used to compare the similarity of ligand-protein interaction with those predicted by docking program, calculated as Tanimoto coefficient (Tc) (Radifar et al., 2013).This PyPLIF supports Sybil mol2 format only (e.g. the output of PLANTS program), therefore it could not be directly applied on AutoDock Vina, which format is pdbqt.
In this work, SBVS protocol was developed to achieve simpler and valid automatic procedure.The availability of HIV-1 protease structure and its ligands and decoys which has been published in a Database of Useful Decoys: Enhanced (DUD-E), has led some attempts to construct a valid SBVS protocol to identify novel inhibitors for aspartic proteases.The article presenting DUD-E shows that employing HIV-1 protease as the molecular target in a SBVS campaign gave enrichment factor at 1% false positives (EF 1% ) value and the Area Under Curve (AUC) value of the Receiver Operator Characteristic (ROC) of 4.7 and 59.58%, respectively (Mysinger et al., 2012).The validated protocol was subsequently examined to see its ability to reproduce the pose of the co-crystal ligands in the aspartic proteases active site of HIV-1 protease and plasmepsin I, II, and IV which are responsible in the degradation of hemoglobin in the food vacuole of P. falciparum (Coombs et al., 2001).

Material
A dataset of inhibitors for HIV-1 protease (536 compounds) and their decoys (35,750 compounds) in file type of .mol2obtained from DUD-E (Mysinger et al., 2012).The aspartic proteases crystal structure and its co-crystal ligands downloaded from Protein Data Bank (PDB) with pdb id 1XL2 for HIV-1 protease and 3QS1, 1SME, 1LS5 for plasmepsin I, II and IV, respectively.

Hardware and Programs
Personal computer equipped with Linux Ubuntu 14.04 LTS, Intel Core i5 2.30 GHz processor DRAM 4 GB was used in this work.Programs were SPORES, Open Babel, PLANTS1.2, and MGLTools1.5.6 shell script for intial preparation of the ligands and receptors.AutoDock Vina 1.1.2was used to redock ligands to its origin places in the active sites of the macromolecules, continued with PLANTS1.2 and PyPLIF 0.1.1 for the re-scoring based on IFP.PyMOL and PoseView were used for RMSD calculation and docking visualization.R software version 3.2.2 was used for statistical analysis.

Protein target preparation
The x-ray crystallographic 3D structures of HIV-1 protease (PDB code: 1XL2), plasmepsin I (PDB code: 3QS1), plasmepsin II (PDB code: 1SME), and plasmepsin IV (PDB code: 1LS5) were downloaded from protein data bank (www.pdb.org) (Fig. 1) using wget linux command and gunzip to extract the .pdbfiles.The co-crystallized ligands in all enzymes were separated using Structure Protonation and Recognition System (SPORES) software.SPORES was employed to split the pdb file to protein and ligands using splitpdb module.The protein was protonated and stored as protein.mol2,while the reference ligand and water molecule were treated similarly as ligand_ref.mol2and water.mol2,respectively.

Ligand preparation for retrospective validation
Inhibitors for HIV-1 protease active ligands and their decoys were downloaded in their SMILES format from DUD-E.There were 536 ligands and 35,750 decoys downloaded and stored locally as actives_final.ism and decoys_final.ism.The files were subsequently concatenated into a file named all.smi.Each compound in the file was then subjected to Open Babel 2.2.3 conversion software to be converted in its three dimensional (3D) format at pH 7.4 as a .mol2file.The settypes module in SPORES was subsequently employed to properly check and assign the .mol2file.

Automated molecular docking, IFP re-scoring, and RMSD calculation
The binding site was automatically calculated using bind module of PLANTS 1.2 which was set at 5 Å distance from the coordinates of the native ligand (PLANTS1.2--mode bind ligand .mol2 5 protein.mol2).The coordinates of binding site and the amino acid residues information, which were obtained from the previous step, were then converted into configuration file to be used in AutoDock Vina and PyPLIF re-scoring with the assistance of PLANTS.Python scripts of AutoDock Tools and Open Babel 2.3.2 were also used to prepare both receptor and ligand before the docking process using AutoDock Vina by converting .mol2into .pdbqtfile extension.Docking was performed using AutoDock Vina.Each resulted pose was redocked using rigid docking method of PLANTS (PLANTS1.2--mode rescore plantsconfig_rescore.txt) to maintain the best pose previously obtained from AutoDock Vina and continued by re-scoring using PyPLIF and RMSD calculating using rms_cur in PyMOL.Iteration was automatically performed 100 times, while re-scoring value, calculated as Tanimoto coefficient (Tc) and scoring function obtained from the program, calculated as binding affinity (Ei), were each correlated with RMSD.

SBVS quality assessment
The docking pose with the best binding affinity (Ei) from Autodock Vina and the best Tanimoto coefficient (Tc) value from PyPLIF was selected for each virtually screened compound.Virtual screening accuracies were determined in terms of Area Under the Curve (AUC) of the Receiver-Operator Characteristic (ROC) plots computed with pROC package in R statistical computing software version 3.2.2 and the enrichment in True Positives rate (TP) reported at a False Positive rate (FP) of 1% (EF1%) value (Robin et al., 2011).The EF1% values were calculated as follows: EF1% = TP/FP1%.

RESULTS AND DISCUSSION
Internal validation is used to confirm whether SBVS protocol could be used to reproduce pose of co-crystallized ligand.The protocol is categorized as valid if RMSD is less than 2.0 (Marcou and Rognan, 2007).Of 100 docking iterations, 900 ligand poses were resulted for each protein.The best RMSD for HIV-1 protease, plasmepsin I, II, and IV were 1.143 Å, 1.238 Å, 1.684 Å and 0.801 Å, respectively.RMSD value of < 2 for HIV-1 protease, plasmepsin I, II, and IV were 91%, 90%, 78%, and 93%, respectively (Fig. 2).Correlation diagram between RMSD, Tc and Ei for each protein could be seen in Fig. 3.That shows there was a good correlation between RMSD with Tc (r approaches -1, which means that higher Tc indicates smaller RMSD); and waek correlation between RMSD and Ei.Furthermore, it could be concluded that correlation between RMSD with Tc is stronger than RMSD with Ei.Basically, PyPLIF calculates IFP by converting ligand-protein interaction into bit arrays that match amino acid residues and types of interactions.There are seven bit arrays which are: (i) apolar (Van Der Waals), (ii) face to face aromatics, (iii) edge to face aromatics, (iv) hydrogen bonds (protein as HBD), (v) hydrogen bonds (protein as HBA), (vi) electrostatic interaction (positive charge protein), and (vii) electrostatic interaction (negative charge protein) (Radifar et al., 2013).
Furthermore, bit array of the pose was compared to the reference and was determined its similarity using Tanimoto coefficient (Tc); Tanimoto coefficient (Tc) ranges between 0.000 to 1.000, which means that 0.000 indicates no similarity, while 1.000 indicates that IFP resulted from the docking is identical with the reference (Radifar et al., 2013).
The retrospective validations to identify inhibitors for HIV-1 protease have been conducted to 536 ligands and 35,750 decoys obtained from DUD-E.By employing either Tc from PyPLIF and Ei score from AutoDock Vina, the ROC curves of the %true positives (%TP) and %false positives (%FP) were plotted (Fig. 4).This method and strategy have also been conducted to validate SBVS protocols for another protein target (Istyastono and Setyaningsih, 2014;Setiawati et al., 2014;Yuniarti et al., 2011).The results showed that the developed protocols had better qualities compared the original SBVS (EF 1% = 4.7) with EF 1% values of 18.26 by using Tc from PyPLIF and of 9.03 by using Ei from Autodock Vina as the scoring functions.The EF 1% represents the early enrichment results from the protocol.The higher the EF 1% value, the better the protocol in the identification of known inhibitors for HIV-1 protease.It means that in the first 1% of the ranked database, the protocol can identify known ligands and put them in the higher rank compared to their decoys.The AUC values were calculated in 95 % level of confidence.The AUC values resulted in employing Tc and Ei as the scoring functions were 76.84 and 60.95, respectively.This value is also better than the AUC value of the original protocol (59.58%).The ideal value of the AUC is 100% which indicates that all known ligands are ranked higher than their decoys.In random sampling, the AUC value is 50%.The EF 1% value represents the early enrichment of the protocols, while the AUC value represents the global enrichment (Jain and Nicholls, 2008).
The developed protocol was intended to be employed also in the examination of the binding pose of known inhibitors for aspartic proteases.The protocol was then challenged to redock the co-crystal ligands in the HIV-1 protease, plasmepsin I, II, and IV binding pocket.After redocking simulations have been conducted, and the ligandprotein interactions were visualized and analized using PoseView (Fig. 5), the protocol showed its ability to reproduce the co-crystal pose very well.The Tc values for the best ligand pose in the HIV-1 protease, plasmepsin I, II, and IV binding pocket were 0.921, 0.909, 0.962, and 0.954, respectively.In HIV-1 inhibitor, the ligand interacts with Asp30.Hydrophobic interaction is formed with Gly49, Ile47, Ile50, Ile54, Ile84, Pro81, Val32 (Fig. 6).These interactions match with those showed in Protein Data Bank (Specker et al., 2005).

CONCLUSIONS
The developed SBVS protocol employing AutoDock Vina and PyPLIF to identify inhibitors for HIV-1 protease has been retrospectively validated using newly published database DUD-E.The re-scoring of ligand-protein interaction fingerprint (Tc) is more accurate in determining the ligand pose in the protein than the scoring function embedded in the program (Ei).This SBVS protocol has been proven valid to identify inhibitors for aspartic protease of HIV-1 protease, plasmepsin I, II, and IV.

Fig. 4 :
Fig. 4: ROC curves of retrospective validation of SBVS protocol.The results were ranked by Tc scores (black line).The results were ranked by Ei scores (gray line).The random sampling (dashed line).

Fig. 5 :
Fig. 5: Visualization of the best ligand poses in HIV-1 protease (a), plasmepsin I (b), plasmepsin II (c), and plasmepsin IV (d).Hidrogen interaction is showed as black dashed lines, while hydrophobic interaction is showed by green solid lines.