Help contents
1. What is shipped in the InterEvScore package?
The InterEvScore package contains the following:
- InterEvScore.py, the main Python script to run InterEvScore
- config_IvS.ini, the config file for InterEvScore
- 2body.mat and 3body.mat, the tables containing the propensities and non-additivity coefficients used in InterEvScore
- AlphaShape.cpp, the C++ program for the calculation of alphashapes
- VdW.dat, a file containing atomic Van der Waals radii which are used to prepare the alphashape calculation. This file is based on OPLS (Jorgensen and Tirado-Rives, JACS 1988). The parameters are indicative and can be modified to suit your needs.
- cluster_InterEvScore_results.py, a Python script to cluster the results from InterEvScore
- config_clustering.ini, the config file for the clustering
- license.txt, a copy of the GNU General Public License
- README_example.txt, a README file to explain the example input and output provided in the example/ directory
2. Running InterEvScore
2.1. Running options (command line)
InterEvScore should be run from the command line using:
[Python_executable] {InterEvScore_directory}/InterEvScore.py [running_options]
Example:
python InterEvScore.py -p my_PDB_id \
    -f my_complex_file.pdb \
    -a1 alig_for_first_chain.fasta \
    -a2 alig_for_second_chain.fasta \
    -o my_output_file.txt \
    -ch AB \
    -c config_IvS.ini
If no running options are provided, a help message will be displayed.
The following arguments should be provided:
- -p [PDB identifier]. This identifier will be used in patches mode to generate or locate the appropriate patches file.
- -f [PDB file] or -l [list of PDB files] or -d [directory containing PDB files]. See 2.3.1 for details.
- -a1 [alignment file for first chain] and -a2 [alignment file for second chain]. See 2.3.2 for details.
- -o location of output file. If the file already exists, the InterEvScore output will be appended to this file. If a list of PDB files is provided as input, the output file will contain a result line for each PDB file. See 2.4 for details.
- -c config_file. The config file should be built based on the config_IvS.ini file provided in the InterEvScore package.
- -ch AB. The two chains to analyze should be provided. Each PDB file provided as argument should contain these two chains. If the two chains are not in contact, all InterEvScore values will be zero.
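As an illustration only, the example call above can be assembled from a Python script as an argument list. The helper name build_interevscore_command and its parameter names are hypothetical; only the flags come from the list above. The sketch builds the list without launching anything; passing it to subprocess.call would start the run.

```python
def build_interevscore_command(script, pdb_id, pdb_file, align1, align2,
                               output, chains, config, python_exe="python"):
    """Return the argument list for one InterEvScore run (hypothetical helper)."""
    return [
        python_exe, script,   # Python executable, then the path to InterEvScore.py
        "-p", pdb_id,         # PDB identifier
        "-f", pdb_file,       # single PDB file (use -l or -d for several files)
        "-a1", align1,        # alignment for the first chain
        "-a2", align2,        # alignment for the second chain
        "-o", output,         # output file (results are appended if it exists)
        "-ch", chains,        # the two chains to analyze, e.g. "AB"
        "-c", config,         # config file based on config_IvS.ini
    ]


command = build_interevscore_command(
    "InterEvScore.py", "my_PDB_id", "my_complex_file.pdb",
    "alig_for_first_chain.fasta", "alig_for_second_chain.fasta",
    "my_output_file.txt", "AB", "config_IvS.ini")
print(" ".join(command))
```

Keeping the arguments in a list avoids shell quoting problems when file names contain spaces.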
2.2. Settings (config_IvS file)
2.2.1. [MATRIX] Propensity matrices
The propensity matrices for two- and three-body statistical potentials are provided in the InterEvScore package. Update file locations according to where you unpacked InterEvScore.
2.2.2. [PROGRAMS] Paths to the necessary programs
Update the locations of the necessary programs: Kalign, Naccess and the AlphaShape executable.
2.2.3. [MODULES] Paths to the necessary modules
Update the location of the networkx module. If this line is not provided, InterEvScore will assume that your paths are already set to include the networkx module (otherwise InterEvScore will exit without scoring).
2.2.4. [ALPHASHAPE_PARAMETERS] Parameters for alpha-shape calculations
The parameter file param.dat is provided with InterEvScore; its location should be updated depending on where you unpacked InterEvScore.
The rest of the parameters in this section can be changed but this will affect the calculation of alpha-shape-based contacts and might lead to unexpected results.
If the proteins of interest have special properties regarding the apolarity of the surface, the patch_expansion value may be changed to yield more appropriate apolar patch calculation. If the protein surfaces are very apolar, the value can be increased for a more stringent patch detection (typically to 1.25). If the protein surface is very polar, the value can be decreased to detect more patches (typically to 0.75). It is however recommended to keep the patch_expansion value between 0.5 and 1.5.
2.2.5. [MODES] Modes to run InterEvScore
Two modes are defined in this section.
- The main mode corresponds to the scoring scheme.
- If set to standard, the evolutionary information extracted from the alignments will be used in a straightforward manner to obtain InterEvScore values.
- If set to patches, only residues belonging to apolar patches will be scored using evolutionary information; other residues will be scored regardless of the provided alignments.
Standard mode is recommended for a first evaluation of the InterEvScore values. Patches mode generally improves the quality of the scoring.
- The contact mode corresponds to the mode used to calculate contacts.
- If set to alpha, the contacts are calculated using an alpha-shape representation. This mode is highly recommended as it leads to much better results and it is faster to execute. However, it requires the presence of a compiled AlphaShape executable.
- If set to dist, the contacts are calculated using a distance-based representation. This leads to less accurate results and is slower to execute. Moreover, the patches mode is NOT compatible with distance-based contact calculation. However, if the AlphaShape executable is not found, InterEvScore will be switched automatically to dist mode.
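The compatibility rule between the two modes can be checked before launching a long run. The sketch below is not part of InterEvScore; the function name check_modes is hypothetical and only the four mode values (standard, patches, alpha, dist) are taken from the description above.

```python
def check_modes(main_mode, contact_mode):
    """Validate a (main, contact) mode pair; raise ValueError if it cannot work."""
    if main_mode not in ("standard", "patches"):
        raise ValueError("main mode must be 'standard' or 'patches'")
    if contact_mode not in ("alpha", "dist"):
        raise ValueError("contact mode must be 'alpha' or 'dist'")
    # Patches mode relies on alpha-shape contacts, so it excludes dist mode.
    if main_mode == "patches" and contact_mode == "dist":
        raise ValueError("patches mode is not compatible with dist contacts")
    return main_mode, contact_mode


print(check_modes("patches", "alpha"))
```

Since InterEvScore falls back to dist mode when the AlphaShape executable is missing, a check of this kind is mostly useful to catch a patches run that could not use alpha-shape contacts.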
2.2.6. [STORAGE] Directory where to store patch files
A directory should be provided in order to store the patch file which is calculated only once in each run of InterEvScore and only once for a given PDB id (the PDB id is given by the -p argument). If no directory is provided or the provided directory does not exist, the default output directory will be ~/PDBid (where ~ refers to the home directory of the current user and PDBid to the PDB id provided as -p argument).
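The fallback rule for the storage directory can be written out as follows. This is a sketch of the behavior described above, not the code used by InterEvScore; the function name resolve_patch_dir is hypothetical.

```python
import os


def resolve_patch_dir(configured_dir, pdb_id):
    """Return the directory in which the patch file would be stored."""
    # Use the configured directory only if one was given and it exists.
    if configured_dir and os.path.isdir(configured_dir):
        return configured_dir
    # Otherwise fall back to ~/PDBid, as described in the text.
    return os.path.join(os.path.expanduser("~"), pdb_id)


print(resolve_patch_dir(None, "my_PDB_id"))
```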
2.3. Input
2.3.1. PDB file(s)
InterEvScore can be run using either a single PDB file (-f argument) or a list of PDB files (-l argument) or a directory containing PDB files (-d argument). If a list of PDB files is used, the list file should contain the full paths of all PDB files (one file per line). If a directory is provided, then all files with .pdb extension in this directory will be scored.
All PDB files for a given InterEvScore run should be coordinate files containing at least the two chains provided as argument (-ch argument). They should be properly formatted and include chain identifiers. InterEvScore can normally handle files containing non-ATOM lines and positions corresponding to non-amino acids (DNA, RNA) or non-canonical amino acids; however, this information will globally be ignored.
The two chains which will be considered by InterEvScore should have the same composition and internal set of coordinates (i.e. they should be exactly superimposable) in all PDB files. This means that the list provided as -l argument can be a list of decoys (candidate interfaces) generated for the same pair of proteins, but it cannot be a list of PDB files corresponding to interfaces between different proteins. The same applies to the .pdb files in the directory provided as -d argument.
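A list file for the -l argument can be generated with a few lines of Python. The sketch below assumes that all decoys sit in one directory and carry the .pdb extension; the function name write_pdb_list is hypothetical.

```python
import glob
import os


def write_pdb_list(decoy_dir, list_path):
    """Write the full path of every .pdb file in decoy_dir, one per line."""
    pattern = os.path.join(os.path.abspath(decoy_dir), "*.pdb")
    paths = sorted(glob.glob(pattern))
    with open(list_path, "w") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

For example, write_pdb_list("decoys/", "decoys.list") would produce a file that can be passed as -l decoys.list. The sketch does not check that the decoys share the same two chains; that remains the user's responsibility.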
2.3.2. Alignment files
Two alignment files should be provided as input to InterEvScore. The first alignment file (-a1 argument) should correspond to the first considered chain (first character of the -ch argument) and the second alignment file (-a2 argument) to the second considered chain (second character of the -ch argument).
In each alignment file, the first sequence should correspond to the sequence closest to the PDB sequence. A few differences between the first aligned sequence and the PDB sequence can be tolerated but it is best to have as little discrepancy as possible.
The two alignment files should contain the same number of sequences, corresponding to the sequences homologous to the PDB sequence, provided for the same species in the same order. If these conditions are not met, the evolutionary-based scores are likely to make no sense.
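A quick sanity check on the two alignment files is to compare their numbers of sequences. The sketch below assumes FASTA-formatted alignments in which every sequence starts with a ">" header line; the function names are hypothetical. It only counts sequences and cannot verify that the species appear in the same order.

```python
def fasta_headers(path):
    """Return the header of every sequence in a FASTA file, without the '>'."""
    with open(path) as handle:
        return [line[1:].strip() for line in handle if line.startswith(">")]


def same_number_of_sequences(path1, path2):
    """True if both alignment files contain the same number of sequences."""
    return len(fasta_headers(path1)) == len(fasta_headers(path2))
```

If the headers carry species names, printing fasta_headers for both files side by side is a simple way to inspect the ordering by eye.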
The best way to ensure compatibility of the provided alignments with InterEvScore is to generate these alignments using InterEvolAlign. Two input sequences should be provided in FASTA format (typically, the two sequences extracted from the PDB input file or the PDB sequences of the bound structure if the PDB input file is generated from the unbound conformations).
InterEvolAlign can be run with various options; the recommended settings include restriction to the relevant superkingdom and activation of the reciprocal blast procedure (with medium stringency); other options can be left as default (or modified if you feel other values are more appropriate).
If the result of the InterEvolAlign run gives too many sequences or alignments that are visibly unreliable, you may try to restrict the run to only one iteration on the OMA database and/or to activate the reciprocal blast procedure with high stringency.
If the result of the InterEvolAlign run gives too few sequences (or none at all), you may try to add a third iteration on the REF database.
2.4. Output
InterEvScore values will be written as one result line per input PDB file. Each line contains:
- The path of the input PDB file which was scored.
- The numbers of interface residues, 2-body inter-molecular contacts and 3-body inter-molecular contacts for this PDB file.
- The four 2-body scores: plain score, score for best contacts only (one best contact per residue), score integrating evolutionary information and score considering best contacts only and integrating evolutionary information.
- The four 3-body scores: plain score, score for best contacts only (one best contact per residue), score integrating evolutionary information and score considering best contacts only and integrating evolutionary information.
- The two combined (2- or 3-body) scores: score for best contacts only (one best contact per residue chosen among all 2-body and 3-body contacts) and score considering best contacts only and integrating evolutionary information.
2.5. Execution speed and parallelization
If you need to score a large number of decoys, each can be scored independently with InterEvScore which enables easy parallelization of the scoring. Scoring one decoy for a protein of standard size (200-300 residues per chain) with alphashape-based contact calculation (in standard or patches mode) takes approximately one second (indicative real running time on a standard processor). The distance-based mode takes much longer - around 4 to 5 seconds.
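One simple way to parallelize, sketched here under the assumption that each job is a separate InterEvScore run with its own -l list file and its own output file, is to split the decoys into one group per job. The function name split_into_chunks is hypothetical.

```python
def split_into_chunks(decoy_paths, n_jobs):
    """Distribute decoys round-robin over n_jobs lists of nearly equal size."""
    return [decoy_paths[i::n_jobs] for i in range(n_jobs)]


chunks = split_into_chunks(["decoy_%d.pdb" % i for i in range(10)], 3)
for job, chunk in enumerate(chunks):
    print(job, chunk)
```

Each chunk can then be written to its own list file and scored independently; the resulting output files are concatenated afterwards. Writing to separate output files avoids several processes appending to the same file at once.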
3. Analysis of results
3.1. Important information
There are several important elements that should be kept in mind when analyzing the results.
- First of all, InterEvScore gives a score to each interface and a higher score means that the interface is predicted to be “better” (more native-like); thus the top ranked interface in a set of decoys is the one with the highest score.
- Also, InterEvScore values cannot be compared between interfaces involving different proteins or between cases with different multiple sequence alignments. In particular, the absolute values of InterEvScore which integrate evolutionary information (all _evol scores) depend on the number of sequences in the alignment, as the scores are summed over the contacts derived for each pair of sequences from the alignment.
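Ranking a set of decoys therefore amounts to sorting by decreasing score. The sketch below works on scores already loaded into a dictionary (decoy name to score value) and makes no assumption about the layout of the output file; the function name rank_decoys and the example values are made up for illustration.

```python
def rank_decoys(scores, top_n=None):
    """Return decoy names sorted from highest to lowest score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked if top_n is None else ranked[:top_n]


example = {"decoy_a.pdb": 12.5, "decoy_b.pdb": 31.0, "decoy_c.pdb": -4.2}
print(rank_decoys(example))
print(rank_decoys(example, top_n=1))
```

Such a ranking is only meaningful within one set of decoys scored with the same pair of alignments.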
3.2. Clustering analysis
A clustering script is provided which selects the N top ranked decoys (using the 2or3body_best_evol score by default, or any other score value from InterEvScore), calculates the cross lRMSDs for each pair of decoys among these N top ranked decoys, and clusters the N decoys with a default cutoff of 7.5 Å.
The default value for N is 1,000, chosen as a compromise between the inclusion of as many hits as possible and the computing time of cross lRMSDs, which increases quadratically with N (one calculation per pair of decoys). The default clustering algorithm is the GROMOS algorithm from Daura et al, Angew Chem Int Ed 1999. Default values are based on the assumption that 54,000 decoys were previously generated with ZDOCK.
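The cost of the cross lRMSD step can be gauged by counting decoy pairs, assuming each unordered pair is computed once. The function name n_pairs is chosen here for illustration.

```python
def n_pairs(n_decoys):
    """Number of unordered decoy pairs, i.e. of cross lRMSD values."""
    return n_decoys * (n_decoys - 1) // 2


for n in (250, 500, 1000, 2000):
    print(n, n_pairs(n))
```

For the default N = 1,000 this gives 499,500 pairs; doubling N to 2,000 gives 1,999,000 pairs, about four times as many.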
If the decoys were generated using a program that leaves one of the two chains fixed (such as ZDOCK), then the alignment step is not necessary (set align to False in the config_clustering.ini file). However, this means that the ligand RMSD will be calculated using the second chain from the command line option -ch as the ligand (this should be the moving chain).
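Without the alignment step, the ligand RMSD between two decoys reduces to a plain RMSD over the ligand coordinates. The sketch below assumes that the coordinates of the moving chain have already been extracted in the same atom order for both decoys; which atoms the clustering script actually uses is not stated here, and the function name ligand_rmsd is hypothetical.

```python
import math


def ligand_rmsd(coords_a, coords_b):
    """RMSD between two equally ordered lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Two atoms, both displaced by the vector (3, 4, 0): the RMSD is 5.
print(ligand_rmsd([(0, 0, 0), (1, 0, 0)], [(3, 4, 0), (4, 4, 0)]))
```

This only makes sense when the fixed chain occupies the same position in both decoys, which is why the second chain given to -ch must be the moving one.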
3.2.1. Clustering algorithms
Two clustering algorithms are proposed. The default clustering algorithm is the GROMOS greedy algorithm from Daura et al, Angew Chem Int Ed 1999. It basically works as follows:
- First build the graph of decoys; two decoys are considered neighbors if their mutual lRMSD value is less than the cutoff (default 7.5 Å). Then count the number of neighbors for each decoy.
- Select the decoy with the largest number of neighbors and pull it out from the graph together with its neighbors; this set of decoys will be the largest cluster and the decoy with largest number of neighbors will be the center of this cluster.
- Repeat previous steps until there are no more decoys with mutual lRMSD value less than cutoff.
- If fewer than N decoys have been clustered, it means the remaining decoys are single elements; for consistency, those are provided at the end of the clustering results.
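The steps above can be sketched in a few lines of Python. This is a minimal illustration of the greedy procedure, not the code of the clustering script: the function name, the dictionary of pairwise lRMSD values keyed by (decoy, decoy) tuples and the tie-breaking rule (first decoy in input order) are assumptions made for the example.

```python
def gromos_clusters(names, lrmsd, cutoff=7.5):
    """Greedy clustering; each returned list starts with its cluster center."""
    # Build the neighbor graph: two decoys are linked if their lRMSD < cutoff.
    neighbors = {name: set() for name in names}
    for (a, b), value in lrmsd.items():
        if value < cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)
    remaining = list(names)
    clusters = []
    while remaining:
        # The decoy with most remaining neighbors becomes the next center.
        center = max(remaining, key=lambda name: len(neighbors[name]))
        members = [center] + [n for n in remaining if n in neighbors[center]]
        clusters.append(members)
        # Pull the cluster out of the graph before the next round.
        removed = set(members)
        remaining = [n for n in remaining if n not in removed]
        for name in remaining:
            neighbors[name] -= removed
    return clusters


names = ["d1", "d2", "d3", "d4", "d5"]
lrmsd = {("d1", "d2"): 2.0, ("d1", "d3"): 3.0, ("d2", "d3"): 9.0,
         ("d4", "d5"): 1.0}
print(gromos_clusters(names, lrmsd))
```

Pairs missing from the dictionary are treated as non-neighbors. Because neighbor counts can only decrease as clusters are removed, cluster sizes never increase from one round to the next and single decoys come out last.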
The alternative clustering algorithm is the full linkage algorithm: any two decoys with mutual lRMSD value less than cutoff will be clustered together. This means that the clusters in this algorithm correspond to the connected components of the graph. This algorithm will lead to much larger clusters (so you might want to decrease the cutoff); moreover two decoys in the same cluster can have large mutual lRMSD value (the fact that they belong to the same cluster simply means that they are connected through a (potentially long) chain of decoys with mutual lRMSD value less than cutoff).
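The connected-component view can be illustrated with a short breadth-first search. As before, this is a sketch under assumptions (function name, tuple-keyed lRMSD dictionary), written in plain Python rather than with networkx so that it stands alone.

```python
from collections import deque


def linkage_clusters(names, lrmsd, cutoff=7.5):
    """Clusters as connected components of the lRMSD < cutoff graph."""
    neighbors = {name: set() for name in names}
    for (a, b), value in lrmsd.items():
        if value < cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)
    seen = set()
    clusters = []
    for start in names:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            current = queue.popleft()
            component.append(current)
            for other in neighbors[current]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        clusters.append(sorted(component))
    clusters.sort(key=len, reverse=True)  # largest cluster first
    return clusters


names = ["d1", "d2", "d3", "d4"]
lrmsd = {("d1", "d2"): 5.0, ("d2", "d3"): 6.0, ("d1", "d3"): 11.0}
print(linkage_clusters(names, lrmsd))
```

In this example d1 and d3 end up in the same cluster although their mutual lRMSD (11.0) exceeds the cutoff, because both are within the cutoff of d2.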
3.2.2. Running options (command line)
The clustering script should be run from the command line using:
[Python_executable] {InterEvScore_directory}/cluster_InterEvScore_results.py [running_options]
Example:
python cluster_InterEvScore_results.py -s score_file.txt \
    -c config_clustering.ini \
    -ch AB \
    -d PDB_file_directory/ \
    -o clustering_output.txt
If no running options are provided, a help message will be displayed.
The following arguments should be provided:
- -s [scorefile]. The score file should be the output file from InterEvScore.
- -c [clustering_config_file]. The location of the config_clustering file containing the clustering settings.
- -ch AB. The two chains to consider for the clustering analysis. The second chain is considered as the ligand; it should be the moving chain if the decoy generation leaves one of the two chains fixed (especially if the align option is set to False in config_clustering.ini).
- -d [dir_pdb_files]. The directory containing the PDB files. The names of the PDB files in that directory should be consistent with the names used to run InterEvScore (i.e. the first column of the score file).
- -o [output_file]. The desired location for the clustering output file.
3.2.3. Settings (config_clustering file)
- [PROGRAMS] PROFIT: Location of the ProFit program (necessary to run the clustering if align is set to True, see below).
- [CLUSTERING_PARAMETERS] contains the parameters for the clustering:
- nbdecoys: the number of structures to cluster (N, default 1,000)
- cutoff: the ligand RMSD cutoff for clustering (default 7.5 Å)
- algorithm: the clustering algorithm to use (default = "Daura", uses the GROMOS algorithm from Daura et al, Angew Chem Int Ed 1999; otherwise "full_linkage", uses a full linkage algorithm)
- score_index: the index of the score to consider for the clustering (1 for 2body, 2 for 2body_evol, 3 for 2body_best, 4 for 2body_best_evol, 5 for 3body, 6 for 3body_evol, 7 for 3body_best, 8 for 3body_best_evol, 9 for 2or3body_best, 10 for 2or3body_best_evol (recommended))
- align: should be set to True only if the pairs of complexes need to be aligned (typically, ZDOCK keeps one of the two proteins fixed and generates rigid-body conformations of the complex by moving the other protein around, so the alignment step is not necessary, which saves some computing time). If align is set to False, then ProFit is not used and the lRMSD is calculated from the PDB coordinates using the second chain from the command line option -ch as the ligand (this should be the "moving" chain).
- store: should be set to True if you want to store the lRMSD files that are calculated during the clustering process. This is useful if for instance you want to test several cutoff values, because the lRMSD calculation step is the most time-consuming step of the clustering process. Simple reclustering if lRMSD files are present takes only a few seconds. The lRMSD files are stored in the same directory as the PDB files (directory provided with the -d command line option) and they have _vsall.lrms extension.
- [OPTIONS_PARALLEL] Options to run the clustering in parallelized mode
- option_parallel: should be set to True if you want to run the clustering in parallelized mode (uses the Message Passing Interface, requires the Python module mpi4py)
- nb_proc: number of available processors on which the clustering will be run
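To show how these settings fit together, the sketch below reads a clustering configuration with the standard configparser module (named ConfigParser in Python 2). The section and option names are those listed above; the exact syntax of the shipped config_clustering.ini (quoting, separators), the ProFit path and the chosen values are assumptions, so the provided file remains the reference.

```python
import configparser  # this module is called ConfigParser in Python 2

EXAMPLE_CONFIG = """
[PROGRAMS]
PROFIT = /path/to/profit

[CLUSTERING_PARAMETERS]
nbdecoys = 1000
cutoff = 7.5
algorithm = Daura
score_index = 10
align = False
store = True

[OPTIONS_PARALLEL]
option_parallel = False
nb_proc = 8
"""


def load_clustering_settings(text):
    """Parse the clustering settings into a dictionary of typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = "CLUSTERING_PARAMETERS"
    return {
        "nbdecoys": parser.getint(section, "nbdecoys"),
        "cutoff": parser.getfloat(section, "cutoff"),
        "algorithm": parser.get(section, "algorithm"),
        "score_index": parser.getint(section, "score_index"),
        "align": parser.getboolean(section, "align"),
        "store": parser.getboolean(section, "store"),
        "option_parallel": parser.getboolean("OPTIONS_PARALLEL", "option_parallel"),
        "nb_proc": parser.getint("OPTIONS_PARALLEL", "nb_proc"),
    }


settings = load_clustering_settings(EXAMPLE_CONFIG)
print(settings)
```

With align set to False, as in this example, the PROFIT entry is not needed; it is shown only for completeness.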
3.2.4. Input
The score file provided as input to the clustering script should be the output file from InterEvScore.
3.2.5. Output
The clustering output comes in the form of a file where each line corresponds to a different cluster.
The clusters are ordered by decreasing size.
On each line, cluster elements are separated by single spaces and ordered by decreasing score value (in the case of the Daura et al algorithm, the first element on each line is the cluster center i.e. the decoy with highest number of neighbors starting from which the cluster is pulled out; the remaining elements are ordered by decreasing score value). Each element is written in the form [structurename]_[scorevalue]. The structure name corresponds to the structure identifier (first column of score file). The score value corresponds to the score chosen in the config_clustering file.
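A line of the clustering output can be split back into structure names and score values as sketched below. The sketch assumes that the square brackets in the format description are placeholders (not literal characters) and that the score is everything after the last underscore, since structure names may themselves contain underscores. The function name and the example line are made up for illustration.

```python
def parse_cluster_line(line):
    """Return a list of (structure name, score) pairs for one cluster."""
    members = []
    for element in line.split():
        # Split on the last underscore only: names may contain underscores.
        name, score = element.rsplit("_", 1)
        members.append((name, float(score)))
    return members


line = "decoy_12.pdb_45.3 decoy_7.pdb_51.0 decoy_3.pdb_-2.5"
print(parse_cluster_line(line))
```

The first pair on a line is the cluster center when the Daura et al algorithm is used, so it should not be assumed to carry the highest score.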
3.2.6. Execution speed and parallelization
An option for MPI parallelization is provided in the config file (see above). Indeed, the cross-calculation of lRMSD values for 1,000 decoys is very time consuming and parallelization of the clustering process (actually of the lRMSD calculation process, which is the most time-consuming part of the process) is highly recommended. The clustering takes approximately 2 ½ hours for 1,000 decoys on a medium-sized complex; this duration is reduced to 20-30 min when parallelized on 8 processors.
If you plan to test the two clustering algorithms and/or test several cutoff values, it is recommended to set "store" to True in the config_clustering.ini file. This stores the lRMSD calculations in files (located in the same directory as the PDB files i.e. the directory provided with the -d command line option). The next time the clustering script is run, the lRMSD values will not need to be re-calculated. Simple reclustering if lRMSD files are present takes only a few seconds.
4. FAQ
This FAQ will be updated as the questions come along!
5. Reference and contact
If you find this program useful, please refer to:
J. Andreani, G. Faure, R. Guerois - InterEvScore: A novel coarse-grained interface scoring function using a multi-body statistical potential coupled to evolution (submitted)
Corresponding author:
guerois@cea.fr
6. Installation: Pre-required programs, modules and libraries
This section of the help file is intended as a guide to make the installation of the various modules and libraries easier. It is provided "as is" with no warranty.
The required modules and libraries should be obtained directly from the distribution sites. This section is intended for Linux users and has been written based on installation on a system using Linux version 2.6.18 with 64-bit RedHat 4.1.2 (CentOS). Installation was also tested for systems using Debian and Ubuntu server 64-bit distributions.
Some modules and programs are required to run the main program InterEvScore.py. Others are required only to run the clustering analysis script cluster_InterEvScore_results.py.
6.1. Python (Open Source, GPL compatible)
InterEvScore is written as a Python script. In order to run InterEvScore, Python must be installed together with some modules and packages.
6.1.1. Python version
InterEvScore can be executed with Python2.4, Python2.6 and Python2.7.
Compatibility with Python3.x is not guaranteed.
6.1.2. Required Python built-in modules
The following (quite standard) Python libraries should be installed and functional in order to run InterEvScore: sys, os, string, math, tempfile, copy, re, ConfigParser, shutil, time, glob.
6.1.3. networkx (Open source BSD license)
InterEvScore has been developed to run with networkx, a Python software package which allows simple manipulation of graphs. InterEvScore should run smoothly with networkx 1.0, 1.2 or any more recent version.
All necessary information about how to download and install networkx is available from the corresponding website:
http://networkx.lanl.gov/
In particular, networkx can be installed using the easy_install Python module.
6.1.4. mpi4py (for the clustering analysis only) (BSD license)
The clustering analysis can be run in a parallelized manner by setting the proper options in the config_clustering file. This requires the mpi4py module in order to run MPI jobs.
For Debian/Ubuntu users, the mpi4py module can be installed using apt-get. Alternatively, mpi4py can be installed using the easy_install Python module.
6.2. AlphaShape calculation (using CGAL)
In InterEvScore, the contacts as well as the apolar patches are calculated using alpha-shapes. In order to calculate alpha-shapes, the script requires the program AlphaShape to be compiled. This C++ program is built using the CGAL library.
The script was inspired by the examples presented on the CGAL website (3D Alpha Shapes, CGAL Documentation, by Tran Kai Frank Da, Sébastien Loriot and Mariette Yvinec).
If the AlphaShape program is not present, the InterEvScore script can run in degraded mode using contact calculations based on distances. However, it is highly recommended to compile the AlphaShape program as the degraded version is slower, less efficient and the "patches" mode is not available in this version.
6.2.1. Pre-required library: Installing CGAL (Open Source, GPL license)
This help section was written for CGAL 3.7 installation in CentOS/RedHat. However, more recent versions of CGAL can be used provided the versions of the pre-required modules and programs are adapted.
Note: for Ubuntu/Debian users, CGAL can be installed (together with the corresponding dependencies) using apt-get (e.g. apt-get install libcgal-dev).
CGAL should be downloaded from the distribution website and unpacked.
CGAL is available under an Open Source license. Parts of CGAL are under the LGPL license and other parts are under the GPL license. The Triangulation and AlphaShape packages which we use in the AlphaShape.cpp script are under the GPL license.
Installation of CGAL 3.7 and compilation of AlphaShape with CGAL 3.7 require:
- CMake (version >= 2.6.2)
- the Boost library (version >= 1.34)
- Qt4 (>= 4.3)
- GMP (>= 4.1.4)
- MPFR
The following environment variables should be defined/modified:
- The environment variable LD_LIBRARY_PATH should be updated to contain /lib64, /usr/lib64, /lib, /usr/lib or any other library directories containing the Qt4 and GMP libraries.
- For the Boost library, the environment variables BOOST_INCLUDEDIR and BOOST_LIBRARYDIR should be set to (respectively) the include and lib sub-directories in the Boost directory.
- For MPFR, the include and lib directories should be added to the PATH variable.
Then CGAL should be installed in the directory where it has been unpacked {CGAL_install_dir} using:
cmake -DCMAKE_INSTALL_PREFIX={CGAL_install_dir}/build ./
make
make install
6.2.2. Compiling AlphaShape
When CGAL is installed, {CGAL_install_dir}/build/lib should be added to the LD_LIBRARY_PATH environment variable and {CGAL_install_dir} to the PATH variable.
Then AlphaShape should be compiled in a directory containing the provided file AlphaShape.cpp.
The CMakeLists.txt should be generated using the script provided by CGAL:
{CGAL_install_dir}/scripts/cgal_create_cmake_script
Finally AlphaShape should be compiled by executing:
cmake ./
make
The path for the AlphaShape executable obtained should be set in the config file for InterEvScore (see section 2.2).
6.3. Other external programs
6.3.1. Kalign (GPL license)
Kalign is absolutely necessary to run InterEvScore, as it ensures proper use of the input alignments depending on the sequence of the input PDB files.
The program can be downloaded from the corresponding website:
http://msa.sbc.su.se/cgi-bin/msa.cgi
6.3.2. Naccess (free for academic users, confidentiality agreement)
Note: Naccess is necessary only for the patches mode (not for the standard mode).
Naccess is necessary to run InterEvScore in patches mode, i.e. to include evolutionary information in a specific manner for apolar patches (see section 2.1). Naccess is used to identify surface atoms in order to define surface apolar patches.
Downloading information and details about the program are available directly from the corresponding website:
http://www.bioinf.manchester.ac.uk/naccess/
Naccess is available for free for researchers at academic and non-profit-making institutions provided the confidentiality agreement is signed. Naccess requires csh and fort77.
6.3.3. ProFit (for the clustering analysis only) (registration required)
Note: ProFit is not necessary if the decoys were generated using a program that leaves one of the two chains fixed (e.g. ZDOCK).
ProFit is used to run the clustering analysis script cluster_InterEvScore_results.py when the decoys need to be aligned before ligand RMSD calculation. ProFit calculates the ligand RMSD values for each pair of decoys selected to be clustered.
If the decoys were generated using a program that leaves one of the two chains fixed, then the alignment step is not necessary (set align to False in the config_clustering.ini file). However, this means that the ligand RMSD will be calculated using the second chain from the command line option -ch as the ligand (this should be the moving chain).
Downloading information and details about the ProFit program are available directly from the corresponding website:
http://www.bioinf.org.uk/software/profit/index.html
Registration information is required upon download.
Warning: ProFit requires 32-bit libraries to run properly on 64-bit systems.
7. License
The InterEvScore program and the clustering script associated with InterEvScore are distributed under GNU General Public License (GPL), version 3.0.
This means that InterEvScore is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
Copyright 2013 CEA iBiTec-S/SB2SM/LBSR Jessica Andreani, Guilhem Faure and Raphael Guerois.
For further information please contact Raphael Guerois.
In particular sections 15 and 16 of the GNU GPL apply:
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.