FAQ about data content, code execution and output
1. What data is provided in the PPI4DOCK_data compressed folder?
There are 5 subdirectories and 1 file.
PPI4DOCK_list_752_at_least_one_FRODOCK_Acc_top10000_10plus_co_seqs_noAntibody.txt provides the list of the 752 cases taken from the PPI4DOCK dataset that were used in our benchmark.
PPI4DOCK_pdbs/[pdb]/ contains the native bound and the unbound structures of the query proteins as well as the threaded homologs for each case.
- native query structure
  [pdb]_st.pdb
- unbound query structures
  [chain_1]_model_st.pdb
  [chain_2]_model_st.pdb
- unbound homolog model structures
  [chain_1]_homolog_[sequence_number].pdb
  [chain_2]_homolog_[sequence_number].pdb
PPI4DOCK_MSA/[pdb]/ contains the complete coMSAs of each query partner for all cases. A subsampling of sequences was performed to obtain a reasonable yet representative number of sequences for threading purposes.
- full coMSAs for both query partners, sorted by decreasing identity (%) with the query (first sequence)
  [4_letter_code]_[chain_1]_coMSA.fasta
  [4_letter_code]_[chain_2]_coMSA.fasta
PPI4DOCK_docking/[pdb]/ contains FRODOCK's rotation/translation matrices for each case as well as the DockQ scores, CAPRI categories and ligand RMSD with the native complex for all cases.
- FRODOCK2.1 rotation/translation matrices from which decoys can be physically generated (ordered by FRODOCK score, from best to worst)
  [pdb]_frodock2_clustered_lrms4.00_output.dat
- CAPRI categories, DockQ scores and FRODOCK ligand RMSD with the native complex (superposition onto the native complex was performed with ProFit3.1)
  CAPRI_DOCKQ_LRMSD.txt
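As an illustration, the evaluation file can be scanned for near-native decoys with a few lines of Python. This is only a sketch: it assumes whitespace-separated columns in the order decoy name, CAPRI category (using the same one-letter codes as the consensus files), DockQ score and ligand RMSD, and it uses a hypothetical case name; check the actual file layout before relying on it.

# Minimal parsing sketch for CAPRI_DOCKQ_LRMSD.txt.
# Assumed columns: decoy name, CAPRI category, DockQ, ligand RMSD.
def near_native_decoys(path):
    hits = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 4:
                continue
            try:
                dockq, lrmsd = float(fields[2]), float(fields[3])
            except ValueError:
                continue  # skip a header or malformed line
            if fields[1] in ("A", "M", "H"):  # Acceptable or better
                hits.append((fields[0], fields[1], dockq, lrmsd))
    return hits

# hypothetical case name
print(len(near_native_decoys("PPI4DOCK_docking/1abc_AB/CAPRI_DOCKQ_LRMSD.txt")))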
PPI4DOCK_scores/[pdb]/ contains the scores of all query and homolog decoys for InterEvScore, SOAP-PP and Rosetta's Interface Score.
- InterEvScore in distance mode on full coMSAs, coMSA40 and coMSA10, and its explicit-homology variants on coMSA40 and coMSA10, on max 10,000 decoys (the higher, the better)
  scores_IES.txt
  scores_IES40.txt
  scores_IES10.txt
  scores_IESh40.txt
  scores_IESh10.txt
- SOAP-PP (re-implemented) on query decoys and its explicit-homology variants on coMSA40 and coMSA10, on max 10,000 decoys (the higher, the better)
  scores_SPP.txt
  scores_SPPh40.txt
  scores_SPPh10.txt
- Rosetta's Interface Score on query decoys and its explicit-homology variant on coMSA10, on max 1,000 decoys and on 150 decoys pre-selected using FRODOCK, IES-h40/10k and SPP-h40/10k (the lower, the better). Note that some complexes failed in Rosetta's scoring and were attributed the maximum (i.e. worst) score in the batch
  scores_ISC_1k.txt
  scores_ISCh10_1k.txt
  scores_ISC_150h.txt
  scores_ISCh10_150h.txt
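The score files can be parsed directly; the sketch below assumes one decoy per line with two whitespace-separated columns (decoy name, score) and uses a hypothetical case name.

# Minimal sketch: rank decoys from a scores_*.txt file.
# Assumes two whitespace-separated columns per line: decoy name, score.
def read_scores(path):
    scores = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue
            try:
                scores[fields[0]] = float(fields[1])
            except ValueError:
                continue  # skip a header or malformed line
    return scores

# hypothetical case name; for IES and SPP higher is better, for ISC lower is better
ies = read_scores("PPI4DOCK_scores/1abc_AB/scores_IESh40.txt")
top50 = sorted(ies, key=ies.get, reverse=True)[:50]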
PPI4DOCK_consensus/[pdb]/ contains all the top 10 consensus outputs for 3- to 5-way combinations of the scores above.
Abbreviations: SPP: SOAP-PP, IES: InterEvScore, FD: FRODOCK, ISC: Rosetta's Interface Score.
-h10: average over 10 threaded homologs, -h40: average over 40 threaded homologs, None: scores over query decoys only.
/10k: top 50 taken from the best in 10,000 decoys, /1k: top 50 taken from the best in 1,000 decoys, /150: top 50 taken from the top 50 of SPP-h40/10k, IES-h40/10k and FD.
Text files contain the top 10 consensus rank, the decoy name, its origin (which score it came from), its CAPRI category (I: Incorrect, A: Acceptable, M: Medium, H: High) and its decoy number as given by FRODOCK.
- SPP/10k, IES/10k and FD
  Cons3.txt
- SPP-h40/10k, IES-h40/10k and FD, i.e. enriched by threaded homologs
  Cons3_h.txt
- SPP-h40/10k, ISC/150h, IES-h40/10k and FD
  Cons4_h_150.txt
- SPP-h40/10k, IES-h40/10k, FD and ISC/1k
  Cons4_h_1k.txt
- ISC-h10/150h, SPP-h40/10k, ISC/150, IES-h40/10k and FD
  Cons5_h_150.txt
- ISC-h10/150h, SPP-h40/10k, IES-h40/10k, FD and ISC/1k
  Cons5_h_1k.txt
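A consensus file can be read as follows; this sketch assumes whitespace-separated columns in the order described above (rank, decoy name, origin, CAPRI category, FRODOCK decoy number) and uses a hypothetical case name.

# Minimal sketch: read a top 10 consensus file and keep the near-native hits.
def read_consensus(path):
    entries = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 5:
                continue
            entries.append({"rank": fields[0], "decoy": fields[1], "origin": fields[2],
                            "capri": fields[3], "frodock_number": fields[4]})
    return entries

cons = read_consensus("PPI4DOCK_consensus/1abc_AB/Cons3_h.txt")  # hypothetical case name
acceptable_or_better = [e for e in cons if e["capri"] in ("A", "M", "H")]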
2. How to read the PPI4DOCK_list_752_at_least_one_FRODOCK_Acc_top10000_10plus_co_seqs_noAntibody.txt file?
This text file contains information about all 752 docking targets. The header line gives the names of the items described below:
(1) Target (interface) name, in the format "xxxx_MN", where xxxx is the PDB code and M and N are the receptor and ligand chain IDs, respectively;
(2) X-ray resolution of the reference complex;
(3) Score "Biological" evaluated by the NOXclass program for the reference interface;
(4) Score "Obligate" evaluated by the NOXclass program for the reference interface;
(5) Reference interface area;
(6) Reference interface area after stripping corresponding residues (in each subunit model, ab initio modeled tails and separate sub-regions making few contacts are removed; the corresponding residues are also removed from the reference structures so that the sequences of the reference structure and the subunit models are identical);
(7) Number (rounded up) of residues in contact per chain;
(8) TMalign score (structural similarity) between the 2 chains of the reference complex;
(9) Interolog tag: 0 means the target is the only member of its interolog group and interfaces with the same non-zero tag belong to the same interolog group. See also items (28) and (29);
(10) Target 1 (receptor) name: xxxx_M;
(11) Number of residues in the receptor;
(12) Template used for building the homology model for the receptor;
(13) Sequence identity between the receptor (10) and its template (12);
(14) TMscore for the receptor model (taking complexed subunit as reference);
(15) GDT_TS score for the receptor model (taking complexed subunit as reference);
(16) RMSD for the receptor model (taking complexed subunit as reference);
(17-23) the same terms as (10-16) for the ligand instead of the receptor;
(24) Number of clashing residues per chain (rounded up);
(25) I-rms (interface RMSD) of the "superimposed decoy" (two subunit models superimposed onto the reference complex);
(26) CAPRI rank of the "superimposed decoy";
(27) Difficulty category;
(28) Interolog group status (redundancy at the superfamily level for the two chains of a complex compared with homologous chains in an interolog complex, using thresholds HHsearch probability 90% and Matras probability 80%): "Unique", "Repres" (representative) or "Redund" (redundant). "Unique" means a target with no interolog; for an interolog group containing multiple members, the representative is the one with the easiest difficulty level (27) (and the highest resolution (2) given the same difficulty category) and other members are labeled as "Redund";
(29) Interolog group representative: see explanation in (28);
(30) Interface group status (redundancy at the interface level, using full-linkage clustering within each Interolog group with iAlign p-value 10E-3): "Unique", "Repres" (representative) or "Redund" (redundant). "Unique" means a target from an interface group containing only one member; for an interface group containing multiple members, the representative is the one with the easiest difficulty level (27) (and the highest resolution (2) given the same difficulty category) and other members are labeled as "Redund";
(31) Interface group representative: see explanation in (30);
(32) Antibody-antigen information: False means the target is not an antibody-antigen complex; otherwise, 2 letters ('H' for heavy chain, 'L' for light chain, 'A' for antigen chain) label the two chains target1 (see 10) and target2 (see 17), respectively;
(33) Number of sequences in the coupled MSA (the first sequence in the fasta files in PPI4DOCK_MSA/ is the residue sequence of the PDB entry and is not counted).
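The table can be loaded conveniently with pandas; the sketch below assumes a whitespace-delimited table whose first line is the header described above.

# Minimal sketch: load the benchmark list with pandas.
import pandas as pd

list_file = ("PPI4DOCK_list_752_at_least_one_FRODOCK_Acc_top10000"
             "_10plus_co_seqs_noAntibody.txt")
targets = pd.read_csv(list_file, sep=r"\s+")

print(targets.shape)    # expected: 752 rows, one per docking target
print(targets.columns)  # header names of the items described above
print(targets.iloc[0])  # all annotations of the first target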
3. Installation help & troubleshooting
To download and install Rosetta, please visit the Rosetta website. Rosetta is free for academic users. To guarantee the best performance, please install the Rosetta package in your home directory or a sub-folder thereof, as we noticed an increase in computation time with Singularity when this was not the case.
Singularity should be easily installable with apt-get on Ubuntu 18.04 or older. This Singularity image should be compatible with Singularity versions 2.6.1 or later. The Singularity website has a detailed section with download and installation tips. In particular, you can easily install Singularity on Linux from the NeuroDebian platform as indicated here.
If you have trouble executing the pipeline, here are a few suggestions that might help:
- If the Singularity image cannot find your input files, try placing them in the current working directory or a sub-directory thereof, as Singularity only automatically mounts your home directory and the current working directory at execution. If you have version 2.6, you might need to add
-B $PWD
when executing the image so that Singularity mounts your current working directory.
- Make sure your 2 input protein structures are mono-chain and relatively clean in format (e.g. inserted residues, alternative positions, UNK residues or non-canonical residues might cause problems during execution as they are special cases that were not extensively tested). You can refer, for example, to the Bonvin lab's pdb-tools (http://www.bonvinlab.org/pdb-tools/) to manipulate your pdb files more easily.
- If the docking pipeline blocks at the generation of the homolog models or at the Rosetta scoring step, you might consider double-checking the paths to your installed Rosetta folder and making sure you execute the Singularity image with the
env LD_LIBRARY_PATH=/usr/local/rosetta_dir/path/to/rosetta/libraries/
and
-B /your/path/to/rosetta/folder:/usr/local/rosetta_dir
options. For example, if your Rosetta folder is installed in
/home/toto/installed_rosetta/
make sure that
/home/toto/installed_rosetta/path/to/rosetta/libraries/
really exists.
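Before launching the pipeline, you can quickly check from Python that the directories you intend to bind actually exist; in the sketch below, rosetta_root and lib_subpath reuse the example and placeholder paths from this FAQ and must be adapted to your own installation and Rosetta version.

# Minimal sketch: sanity-check the Rosetta paths used with the Singularity image.
import os

rosetta_root = "/home/toto/installed_rosetta"   # bound to /usr/local/rosetta_dir with -B
lib_subpath = "path/to/rosetta/libraries"       # placeholder sub-path from this FAQ, not a real path

for directory in (rosetta_root, os.path.join(rosetta_root, lib_subpath)):
    status = "OK" if os.path.isdir(directory) else "MISSING"
    print(f"{status}: {directory}")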
4. What output files to expect when running the Singularity Image
After running docking_pipeline.py
This script outputs all of its files in the specified working directory (option -d). Important output results are:
- The scoring output files
  scores_*.txt files contain the top 10,000 FRODOCK2.1 decoy names and corresponding scores (names are quite explicit)
  InterEvDock_Top10Consensus.txt contains the top 10 3-, 4- or 5-way consensus according to your execution options
  consensus_top5_residues.txt contains the top 5 predicted interface residues of each protein
- The top 50 decoys of each individual score
  *_models/ directories contain the top 50 decoys of each scoring function that was used in the docking and to calculate the consensus (see the sketch after this list)
- Files needed to re-generate the decoy structures
  dockclust.dat is the raw FRODOCK output containing all decoy rotation/translation coordinates
  rosetta_homologs_partner_{a,b}/ contain the homolog models of both unbound query proteins (if docking was performed with option --runExplicit) and a file listing them (homolog_partner_{a,b}_list.txt)
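To post-process these selections, the *_models/ directories can simply be globbed; the sketch below assumes the decoys are stored as .pdb files directly inside each directory and uses a hypothetical working directory name.

# Minimal sketch: count the top 50 decoys written by each scoring function.
import glob
import os

workdir = "my_docking_run"  # hypothetical working directory (option -d)
for models_dir in sorted(glob.glob(os.path.join(workdir, "*_models"))):
    decoys = glob.glob(os.path.join(models_dir, "*.pdb"))
    print(f"{os.path.basename(models_dir)}: {len(decoys)} decoys")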
After running generate_decoys.py
This script generates all decoys and stores them in the specified output directory (option -o). If lists are given as input in order to generate the homolog decoys, it will create sub-folders in the output directory.
After running score_decoys.py
This script scores your input decoys with SOAP-PP and InterEvScore and stores the corresponding scores in the specified output file.
5. What data is provided in the Weng_BM5_data compressed folder?
There are 5 subdirectories and 1 file. The structure is close to that of the PPI4DOCK_data compressed folder, with a few differences:
- among the 230 cases, some are antibody-antigen complexes or do not have enough evolutionary information (at least 10 sequences in the coMSAs are required); for those, we do not derive homology-enriched scores and consensuses;
- Rosetta interface score was run only on a subset of interface models, to limit computational cost;
- since several cases in this benchmark involve docking multimers, inputs were reformatted to involve docking of partner_a and partner_b.
info_BM5.txt provides a list of the 230 cases from the protein docking benchmark version 5, together with the following details:
- from the docking benchmark: benchmark version (bmv), difficulty (diff: RB = rigid-body, M = medium, D = difficult)
and complex category (cat: A = antibody/antigen, E = enzyme, O = other);
- the number of sequences in the coMSAs (seqs);
- the number of interface models (decoys) of Acceptable or better category according to CAPRI criteria among the top 10,000 FRODOCK2.1 models (amh).
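As for the PPI4DOCK list, info_BM5.txt can be loaded with pandas; the sketch below assumes a whitespace-delimited table whose header line includes the "seqs" column described above.

# Minimal sketch: load info_BM5.txt and count cases with enough evolutionary information.
import pandas as pd

bm5 = pd.read_csv("info_BM5.txt", sep=r"\s+")
print(bm5.shape)  # expected: 230 rows, one per case

if "seqs" in bm5.columns:
    enough_evo = bm5["seqs"] >= 10
    print(enough_evo.sum(), "cases with at least 10 coMSA sequences")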
bm5_pdbs/[pdb]/ contains the unbound structures of the query proteins as well as the threaded homologs for each case.
- unbound query structures
  protein_a_for_dock.pdb
  protein_b_for_dock.pdb
- unbound homolog model structures (in subdirectories for partner_a and partner_b)
  homolog_partner_a_[sequence_number]_SC.pdb
  homolog_partner_b_[sequence_number]_SC.pdb
bm5_MSA/[pdb]/ contains the coMSAs of each query partner for all cases.
- full coMSAs for both query partners, sorted by decreasing identity (%) with the query (first sequence)
  [4_letter_code]_a_coMSA.fasta
  [4_letter_code]_b_coMSA.fasta
bm5_docking/[pdb]/ contains FRODOCK's rotation/translation matrices for each case as well as the CAPRI categories and DockQ scores for all cases.
- FRODOCK2.1 rotation/translation matrices from which decoys can be physically generated (ordered by FRODOCK score, from best to worst)
  [pdb]_frodock2_clustered_lrms4.00_output.dat
- CAPRI categories and DockQ scores (superposition onto the native complex was performed with ProFit3.1)
  CAPRI_DOCKQ.txt
bm5_scores/[pdb]/ contains the scores of all query and homolog decoys for InterEvScore, SOAP-PP and Rosetta's Interface Score.
- InterEvScore in distance mode on full coMSAs and its explicit-homology variant on coMSA40, on max 10,000 decoys (the higher, the better)
  scores_IES.txt
  scores_IESh40.txt
- SOAP-PP (re-implemented) on query decoys and its explicit-homology variant on coMSA40 (the higher, the better)
  scores_SPP.txt
  scores_SPPh40.txt
- Rosetta's Interface Score on 150 query decoys (pre-selected using FRODOCK, IES and SPP); Rosetta's Interface Score and its explicit-homology variant on coMSA10 on max 150 decoys pre-selected using FRODOCK, IES-h40/10k and SPP-h40/10k (the lower, the better). Note that some complexes failed in Rosetta's scoring and were attributed the maximum (i.e. worst) score in the batch
  scores_ISC_150.txt
  scores_ISC_150h.txt
  scores_ISCh10_150h.txt
bm5_consensus/[pdb]/ contains all the top 10 consensus outputs for 3- to 5-way combinations of the scores above.
Abbreviations: SPP: SOAP-PP, IES: InterEvScore, FD: FRODOCK, ISC: Rosetta's Interface Score.
-h10: average over 10 threaded homologs, -h40: average over 40 threaded homologs, None: scores over query decoys only.
/10k: top 50 taken from the best in 10,000 decoys, /150h: top 50 taken from the top 50 of SPP-h40/10k, IES-h40/10k and FD.
Text files contain the top 10 consensus rank, the decoy name, its origin (which score it came from), its CAPRI category (I: Incorrect, A: Acceptable, M: Medium, H: High) and its decoy number as given by FRODOCK.
- SPP/10k, IES/10k and FD
  Cons3.txt
- SPP-h40/10k, IES-h40/10k and FD, i.e. enriched by threaded homologs
  Cons3_h.txt
- SPP-h40/10k, ISC/150h, IES-h40/10k and FD
  Cons4_h_150.txt
- ISC-h10/150h, SPP-h40/10k, ISC/150h, IES-h40/10k and FD
  Cons5_h_150.txt
How to cite us:
Chloé Quignot, Pierre Granger, Pablo Chacón, Raphaël Guerois and Jessica Andreani:
Atomic-level evolutionary information improves protein-protein interface scoring (bioRxiv, doi: 10.1101/2020.10.26.355073)