Background

The crucial steps of the pepMMsMIMIC architecture are detailed below:

1. Multi-conf MMsINC: conformer generation.

The five lowest-energy conformers of each MMsINC® entry (including all tautomers and ionic states) are generated with the Rotate ver. 1.0 software, yielding an ensemble of 18,461,878 conformers.
MMsINC® (http://mms.dsfarm.unipd.it/MMsINC/search) is a database of 4.5 million non-redundant, richly annotated, and biomedically relevant chemical structures. A primary goal of MMsINC® is to guarantee the highest quality and the uniqueness of each entry. MMsINC® then adds value to these entries by including in the biochemical profile of each structure an analysis of crucial chemical properties, such as ionization and tautomerization, and the in silico prediction of 24 important molecular properties. MMsINC® is consequently a natural input for different chemoinformatics and virtual screening applications.

2. Pharmacophore fingerprint generation.

Both L- and D-amino acids, as well as phospho-amino acids, are labelled according to the following pharmacophoric features: tryptophan side chain, tyrosine side chain, histidine side chain, H-bond acceptor, H-bond donor, positively ionisable, negatively ionisable, aromatic, and hydrophobic. The pharmacophoric features assigned to each amino acid follow the FuzCav method described by Rognan and collaborators, as summarized below:

Backbone NH	H-bond donor
Backbone CO	H-bond acceptor
ALA	Hydrophobic Centroid
ARG	Hydrophobic Centroid, Positively charged, H-bond donor
ASN	H-bond donor, H-bond acceptor
ASP	H-bond acceptor, Negatively charged
CYS	Hydrophobic Centroid
GLN	H-bond donor, H-bond acceptor
GLU	H-bond acceptor, Negatively charged
HIS	H-bond acceptor, H-bond donor, Aromatic Centroid, Histidine side chain
HID	H-bond donor, H-bond acceptor, Aromatic Centroid
HIE	H-bond donor, H-bond acceptor, Aromatic Centroid
ILE	Hydrophobic Centroid
LEU	Hydrophobic Centroid
LYS	H-bond donor, Positively charged
MET	Hydrophobic Centroid
PHE	Aromatic Centroid, Hydrophobic Centroid
PRO	Hydrophobic Centroid
SER	H-bond donor, H-bond acceptor
phospho-SER	Negatively charged
THR	Hydrophobic Centroid, H-bond acceptor, H-bond donor
phospho-THR	Hydrophobic Centroid, Negatively charged
TRP	Tryptophan side chain, H-bond donor, Aromatic Centroid
TYR	Tyrosine side chain, H-bond donor, H-bond acceptor, Aromatic Centroid
phospho-TYR	Aromatic Centroid, Negatively charged
VAL	Hydrophobic Centroid
D-amino acids	same as the L-form

Both the peptide and each conformer are described by an ensemble of annotation points in terms of 3-point pharmacophores, in which every possible pair of centroids is binned according to the distance between the two features. In the present work, we apply this method to measure the similarity between peptide and ligand pharmacophore fingerprints.

The first criterion used to encode triplets of pharmacophoric points into the pepMMsMIMIC bitstring is based on atom type recognition. All possible 3-point combinations of the 9 centroid types listed above (ARO aromatic, HYD hydrophobic, HAC H-bond acceptor, HDO H-bond donor, HIS histidine side chain, NCH negatively ionizable, PCH positively ionizable, TRP tryptophan side chain, TYR tyrosine side chain) are encoded in the pepMMsMIMIC bitstring.
A centroid is defined by a group of contiguous atoms labelled with the same pharmacophoric type (e.g., the six carbon atoms of a benzene ring define an aromatic centroid located at the center of the ring). More than one label can be assigned to each atom, so an atom can belong to more than one centroid. Labels were assigned with an in-house SMARTS mapping tool based on the Chemistry Development Kit (CDK) Java libraries.
The second criterion used to encode triplets is based on centroid distances. Following the FuzCav method (Weill & Rognan, 2010), we used a maximum distance cutoff of 14.3 Å and the following distance bins: [0, 4.8], [4.8, 7.2], [7.2, 9.5], [9.5, 11.9], [11.9, 14.3]. For each class of interaction (e.g. ARO-ARO-PCH, ..., HDO-HYD-ARO), the distance ranges are enumerated according to the scheme below:

		ARO-ARO-PCH [0, 4.8][0, 4.8][0, 4.8], ..., ARO-ARO-PCH [0, 4.8][0, 4.8][11.9, 14.3]
		ARO-ARO-PCH [0, 4.8][4.8, 7.2][0, 4.8], ..., ARO-ARO-PCH [0, 4.8][4.8, 7.2][11.9, 14.3]
		ARO-ARO-PCH [0, 4.8][7.2, 9.5][0, 4.8], ..., ARO-ARO-PCH [0, 4.8][7.2, 9.5][11.9, 14.3]
		ARO-ARO-PCH [0, 4.8][9.5, 11.9][0, 4.8], ..., ARO-ARO-PCH [0, 4.8][9.5, 11.9][11.9, 14.3]
		ARO-ARO-PCH [0, 4.8][11.9, 14.3][0, 4.8], ..., ARO-ARO-PCH [0, 4.8][11.9, 14.3][11.9, 14.3]
		ARO-ARO-PCH [4.8, 7.2][0, 4.8][0, 4.8], ..., ARO-ARO-PCH [4.8, 7.2][0, 4.8][11.9, 14.3]
		ARO-ARO-PCH [4.8, 7.2][4.8, 7.2][0, 4.8], ..., ARO-ARO-PCH [4.8, 7.2][4.8, 7.2][11.9, 14.3]

Using this scheme, each bin is associated with a specific triplet of interaction, defined by the types of the three vertices and the distances between each pair of centroids. By definition of the classes, the triplets ARO-ARO-PCH, ARO-PCH-ARO, and PCH-ARO-ARO all belong to the same class, ARO-ARO-PCH. Whenever a triplet is made of 2 aromatic centroids and 1 positively ionizable centroid, the first criterion associates it with the class ARO-ARO-PCH, and the second criterion (based on centroid pair distances) locates the triplet inside the ARO-ARO-PCH class within the pepMMsMIMIC bitstring. The same rule is applied to all 3-point pharmacophore combinations (ARO-ARO-ARO, ARO-NCH-PCH, ..., HYD-HYD-HYD).
The same classification scheme is adopted for both ligands and peptide residues, with the difference that, for residues, one or more pharmacophoric points are assigned to the geometric center of the side chain according to the labelling scheme listed above.
The fingerprint is a vector of 19,815 possible bits, of which only 12,448 were populated in our virtual library.
Finally, only those conformers with at least 3 spatially distinct features were retained. We stored both USR descriptors and fingerprints for a total of 17,713,005 conformers.
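
The two encoding criteria described above can be sketched in Python as follows. This is a minimal illustration and not the pepMMsMIMIC code: the function names, the alphabetical ordering of feature codes within a class, the inclusive upper bin edges, and the tie-breaking rule for triplets with repeated feature types are all assumptions. The mapping of each (class, bins) key to an actual bit position is not reproduced, so the fingerprint is returned as a set of keys.

```python
from itertools import combinations, permutations
from math import dist

# Upper edges (in angstroms) of the five distance bins given in the text.
BIN_EDGES = [4.8, 7.2, 9.5, 11.9, 14.3]

def distance_bin(d):
    """Return the bin index of distance d, or None beyond the 14.3 A cutoff.
    Assumption: a distance equal to an edge falls in the lower bin."""
    for i, upper in enumerate(BIN_EDGES):
        if d <= upper:
            return i
    return None

def triplet_key(points):
    """points: three (feature_code, (x, y, z)) tuples.
    Returns (class, bins) or None if any pair is beyond the cutoff."""
    best = None
    for p in permutations(points):
        types = tuple(code for code, _ in p)
        if list(types) != sorted(types):
            continue  # first criterion: keep only the canonical class order
        # Second criterion: bin the three pairwise centroid distances.
        bins = (distance_bin(dist(p[0][1], p[1][1])),
                distance_bin(dist(p[0][1], p[2][1])),
                distance_bin(dist(p[1][1], p[2][1])))
        if None in bins:
            return None
        # Assumed tie-break for repeated types: smallest bin tuple wins.
        if best is None or bins < best[1]:
            best = (types, bins)
    return best

def fingerprint(centroids):
    """Set of (class, bins) keys over every triplet of centroids."""
    keys = set()
    for trio in combinations(centroids, 3):
        key = triplet_key(trio)
        if key is not None:
            keys.add(key)
    return keys
```

For instance, two aromatic centroids 3.0 Å apart and a positively ionizable centroid 5.0 Å and about 5.8 Å from them yield the single key (("ARO", "ARO", "PCH"), (0, 1, 1)), regardless of the order in which the three centroids are supplied.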

3. Ultrafast Shape Recognition (USR) methodology.

Ultrafast Shape Recognition (USR), introduced by Ballester and Richards, is a fast 3D similarity search method based on the assumption that the shape of a molecule is uniquely determined by the relative positions of its atoms. The approach is based on moments of distance distributions, and it has been successfully applied to the fast identification of similarly shaped compounds within large molecular databases.
In the USR encoding, the shape of the atomic ensemble (a molecule or any generic set of atoms) is characterized by the distributions of atomic distances to four reference locations: the molecular centroid (ctd), the closest atom to ctd (cst), the farthest atom from ctd (fct), and the farthest atom from fct (ftf). In turn, each of these distributions is described through its first three moments. In this way, each molecule (or atomic ensemble) is associated with a vector of 12 shape descriptors.

Ballester and Richards proposed in their original paper the use of this method for fast similarity comparisons between macromolecules; here we extend the USR method to the comparison between a collection of ligands and a protein atomic ensemble. Our in-house Python implementation of the USR algorithm (available on request from the authors) was used to calculate the USR descriptors for the whole dataset, and it is also used for the on-the-fly calculation of the descriptors of the protein atom ensemble defined by the user's selection.
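
As an illustration of the encoding described above, the following is a minimal pure-Python sketch of the 12 USR descriptors. It is not the in-house implementation: the function names are hypothetical, and the choice of mean, variance, and third central moment (without further normalisation) as the "first three moments" is an assumption. The similarity function uses the inverse scaled Manhattan distance from the original USR publication; this page does not state how the shape score (ShS) is actually computed, so that choice is also an assumption.

```python
from math import dist

def _moments(values):
    """Mean, variance and third central moment of a list of distances."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return [mean, var, third]

def usr_descriptors(coords):
    """Return the 12 USR descriptors for a list of (x, y, z) atom positions."""
    n = len(coords)
    # ctd: molecular centroid (geometric center of all atoms)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))
    d_ctd = [dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]   # closest atom to ctd
    fct = coords[d_ctd.index(max(d_ctd))]   # farthest atom from ctd
    d_fct = [dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]   # farthest atom from fct
    descriptors = []
    for ref in (ctd, cst, fct, ftf):
        descriptors += _moments([dist(c, ref) for c in coords])
    return descriptors

def usr_similarity(a, b):
    """Assumed score: 1 / (1 + mean absolute descriptor difference)."""
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)) / len(a))
```

Because the descriptors depend only on interatomic distances, translating or rotating the atom set leaves them unchanged (up to floating-point error), which is why no structural alignment is needed before comparing two ensembles.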

4. Scoring metrics.

Fingerprint-based methods commonly use similarity coefficients, such as Tanimoto, to retrieve or classify compounds from a typically large chemical library. However, the results of similarity searching are often systematically affected by bit density differences between reference and database molecules. For example, in a previous study Tversky similarity calculations showed apparent asymmetric behaviour, because optimal search performance was achieved by assigning higher weights to the bit settings of reference compounds than to those of database compounds. For Tversky similarity calculations, such biasing effects could be corrected by introducing the weighted Tversky coefficient (wTv), which made it possible to set relative weights on "1" and "0" bits and thereby balance complexity differences between reference and database molecules. In the current version of pepMMsMIMIC, the fingerprint similarity measure is based on the weighted Tversky similarity index, calculated as follows:

SwTv(q, m) = c / (0.9*a + 0.1*b + c)

where c is the number of bits set in both the peptide query fingerprint "q" and the conformer fingerprint "m", a is the number of bits set in "q" but not in "m", and b is the number of bits set in "m" but not in "q".
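
The formula above translates directly into a few lines of Python. The sketch below assumes that each fingerprint is given as a collection of "on" bit indices; the function name, the parameter names alpha and beta, and the convention of returning 0.0 for two empty fingerprints are illustrative assumptions.

```python
def weighted_tversky(q_bits, m_bits, alpha=0.9, beta=0.1):
    """SwTv(q, m) = c / (alpha*a + beta*b + c) on sets of 'on' bit indices."""
    q, m = set(q_bits), set(m_bits)
    c = len(q & m)   # bits set in both query and conformer
    a = len(q - m)   # bits set only in the query
    b = len(m - q)   # bits set only in the conformer
    denom = alpha * a + beta * b + c
    # Assumed convention: two empty fingerprints score 0.0.
    return c / denom if denom else 0.0
```

As a worked example, with q = {1, 2, 3, 4} and m = {3, 4, 5, 6, 7, 8} we have c = 2, a = 2, b = 4, so SwTv(q, m) = 2 / (1.8 + 0.4 + 2) ≈ 0.476. Swapping the roles of the two fingerprints gives 2 / (3.6 + 0.2 + 2) ≈ 0.345: bits missing from the conformer are penalised more heavily than extra bits present in it, which is the asymmetry the weights are meant to introduce.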

Four different scoring methods are currently implemented in pepMMsMIMIC:
a. shape score (ShS) based on the Ultrafast Shape Recognition method.
b. pharmacophoric fingerprint similarity based on the weighted Tversky coefficient (wTv).
c. tandem shape score (ShS) and pharmacophoric fingerprint similarity filtering: conformers are first filtered using a ShS threshold (which must be >= 0.5) and then sorted by fingerprint similarity.
d. weighted shape score (ShS) and pharmacophoric fingerprint similarity: to reduce the number of false positives, we introduced a hybrid scoring function that combines the shape score (ShS) and the pharmacophoric fingerprint similarity based on the weighted Tversky coefficient (wTv) as a weighted sum, as reported below:

ShS&wTv = (0.4 * ShS) + (0.6 * wTv).

Weight terms were chosen based on a preliminary in-house validation.
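
Scoring methods c and d can be sketched as two small ranking functions. The snippet assumes that ShS and wTv have already been computed for each conformer and are supplied as (conformer_id, ShS, wTv) tuples; the function names, the tuple layout, and the default threshold of 0.5 are illustrative assumptions, not the server's actual interface.

```python
def hybrid_score(shs, wtv, w_shape=0.4, w_fp=0.6):
    """Method d: ShS&wTv = 0.4 * ShS + 0.6 * wTv."""
    return w_shape * shs + w_fp * wtv

def rank_tandem(hits, shs_threshold=0.5):
    """Method c: keep hits with ShS >= threshold, then sort by wTv."""
    kept = [h for h in hits if h[1] >= shs_threshold]
    return sorted(kept, key=lambda h: h[2], reverse=True)

def rank_hybrid(hits):
    """Method d: sort all hits by the weighted combination of ShS and wTv."""
    return sorted(hits, key=lambda h: hybrid_score(h[1], h[2]), reverse=True)

# Hypothetical hits: (conformer_id, ShS, wTv)
hits = [("c1", 0.8, 0.3), ("c2", 0.4, 0.9), ("c3", 0.6, 0.5)]
tandem_ids = [h[0] for h in rank_tandem(hits)]
hybrid_ids = [h[0] for h in rank_hybrid(hits)]
```

With these made-up values, the tandem filter discards c2 (ShS below 0.5) and ranks c3 before c1, whereas the hybrid score ranks c2 first (0.70), followed by c3 (0.54) and c1 (0.50). The two methods can therefore disagree noticeably on conformers whose shape match is weak but whose pharmacophore match is strong.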

References

Masciocchi, J., Frau, G., Fanton, M., Sturlese, M., Floris, M., Pireddu, L., Palla, P., Cedrati, F., Rodriguez-Tomé, P. & Moro, S. Nucleic Acids Res. 37, D284-290 (2009).
CORINA is available from Molecular Networks GmbH, Erlangen, Germany (http://www.mol-net.com).
Weill, N. & Rognan, D. J. Chem. Inf. Model. 50, 123-135 (2010).
Steinbeck, C., Hoppe, C., Kuhn, S., Floris, M., Guha, R. & Willighagen, E.L. Curr. Pharm. Des. 12, 2111-2120 (2006).

For more information, for a copy of the executables, or for any technical issue, please contact Stefano Moro (stefano.moro@unipd.it) and Matteo Floris (floris@crs4.it).