R: Calculate compound dissimilarities

compDis {chemodiv}

R Documentation

Calculate compound dissimilarities

Description

Function to quantify dissimilarities between phytochemical compounds.

Usage

compDis(
  compoundData,
  type = "PubChemFingerprint",
  npcTable = NULL,
  unknownCompoundsMean = FALSE
)

Arguments

`compoundData`	Data frame with the chemical compounds of interest, usually the compounds found in the sample dataset. Should have a column named "compound" with common names of the compounds, a column named "smiles" with SMILES IDs of the compounds, and a column named "inchikey" with the InChIKey IDs for the compounds.
`type`	Type of data that compound dissimilarity calculations will be based on: `NPClassifier`, `PubChemFingerprint` or `fMCS`. If more than one is chosen, a matrix with the mean values of the individual matrices will also be calculated.
`npcTable`	A data frame already generated by `NPCTable` can optionally be supplied, if compound dissimilarities are to be calculated using `type = "NPClassifier"`.
`unknownCompoundsMean`	Whether unknown compounds, i.e. ones without SMILES or InChIKey, should be given mean dissimilarity values. If `FALSE`, these will have dissimilarity 1 to all other compounds.

Details

This function calculates matrices with pairwise dissimilarities between the chemical compounds in compoundData, to quantify how different the molecules are from each other. It can do so in three different ways, based on the biosynthetic classification or the molecular structure of the molecules:

Using the classification from the NPClassifier tool, type = "NPClassifier". NPClassifier (Kim et al. 2021) is a deep-learning tool that automatically classifies natural products (i.e. phytochemical compounds) into a hierarchical classification of three levels: pathway, superclass and class. This classification largely corresponds to the biosynthetic groups/pathways the compounds are produced in. Classifications are downloaded from https://npclassifier.ucsd.edu/. NPClassifier does not always manage to classify every compound into all three hierarchical levels. In such cases, it might be beneficial to first run NPCTable, manually edit the resulting data frame with probable classifications if possible (with help from the Supporting Information in Kim et al. 2021), and then supply this classification to the compDis function with the npcTable argument. This will ensure that compound dissimilarities are computed optimally.
Using PubChem Fingerprints, type = "PubChemFingerprint". This is a binary substructure fingerprint with 881 binary variables describing the chemical structure of a compound. With this method, compounds are therefore compared based on how structurally dissimilar the molecules are. See https://pubchem.ncbi.nlm.nih.gov/docs/data-specification for more information. (There are many other types of fingerprints, and ways of calculating compound dissimilarities based on them; see e.g. the packages fingerprint and rcdk.) Fingerprint data for molecules is downloaded from PubChem. During the download, a warning message about closing unused connections may appear; it can be ignored.
Using fMCS, flexible Maximum Common Substructure, type = "fMCS". This is a pairwise graph matching concept. The fMCS of two compounds is the largest substructure that occurs in both compounds, allowing for atom and/or bond mismatches (Wang et al. 2013). As with the fingerprints, compounds are compared based on how structurally dissimilar the molecules are. While potentially a very accurate similarity measure, fMCS is much more computationally demanding than the other methods, and will take a significant amount of time for larger data sets. Data on molecules is downloaded from PubChem. During the download, a warning message about closing unused connections may appear; it can be ignored.

Dissimilarities using NPClassifier and PubChem Fingerprints are generated by calculating Jaccard (Tanimoto) dissimilarities from a 0/1 table with compounds as rows and groups (NPClassifier) or binary fingerprint variables (PubChem Fingerprints) as columns. fMCS generates dissimilarity values by calculating Jaccard dissimilarities based on the number of atoms in the maximum common substructure, allowing for one atom and one bond mismatch. Dissimilarities are returned as dissimilarity matrices.
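The Jaccard calculation described above can be sketched in base R. This is a minimal illustration, not the package's internal code: the table `binTable`, the helper `jaccardDis` and the atom counts are hypothetical values made up for the example.

```r
# Hypothetical 0/1 table: compounds as rows, groups or fingerprint
# variables as columns (values are made up for illustration)
binTable <- matrix(
  c(1, 1, 0, 0,
    1, 0, 1, 0,
    1, 1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("compA", "compB", "compC"), paste0("v", 1:4))
)

# Jaccard dissimilarity = 1 - (variables shared) / (variables in either)
jaccardDis <- function(x) {
  shared <- tcrossprod(x)                     # 1s present in both compounds
  total <- rowSums(x)                         # 1s per compound
  inEither <- outer(total, total, "+") - shared
  1 - shared / inEither
}

disMat <- jaccardDis(binTable)

# The same formula applied to atom counts, as in the fMCS description:
# hypothetical compounds with 10 and 12 atoms sharing an 8-atom substructure
atomsA <- 10
atomsB <- 12
atomsMCS <- 8
fmcsDis <- 1 - atomsMCS / (atomsA + atomsB - atomsMCS)
```

Here compA and compC have identical rows and therefore dissimilarity 0, while compA and compB share one of three variables and get 2/3.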

If dissimilarities are calculated with more than one method, the function will output additional dissimilarity matrices. This always includes a matrix with the mean dissimilarity values of the selected methods. If "NPClassifier" is included in type, a matrix of "mix" values is also calculated. The values in this matrix are the dissimilarities from NPClassifier when these are > 0. For pairs of compounds where the dissimilarity from NPClassifier equals 0 (i.e. when the compounds belong to the same pathway, superclass and class), values are equal to half of the (mean) value(s) of the structural dissimilarity/-ies from PubChem Fingerprints and/or fMCS. With this method, compound dissimilarities are primarily based on NPClassifier, but instead of having 0 dissimilarity, compounds with identical classification have a dissimilarity based on PubChem Fingerprints and/or fMCS, scaled to always be lower (< 0.5) than the dissimilarity between compounds in the same pathway and superclass but in different classes.
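As a rough sketch of the mean and "mix" rules described above, assuming two hypothetical matrices `npcDis` and `fingerDis` with made-up values (the object names are illustrative and not taken from the package):

```r
comps <- c("compA", "compB", "compC")

# Hypothetical NPClassifier dissimilarities: compA and compB share the
# same pathway, superclass and class, so their dissimilarity is 0
npcDis <- matrix(c(0,   0,   0.5,
                   0,   0,   0.5,
                   0.5, 0.5, 0),
                 nrow = 3, byrow = TRUE, dimnames = list(comps, comps))

# Hypothetical structural dissimilarities from PubChem Fingerprints
fingerDis <- matrix(c(0,   0.4, 0.8,
                      0.4, 0,   0.6,
                      0.8, 0.6, 0),
                    nrow = 3, byrow = TRUE, dimnames = list(comps, comps))

# Mean of the selected methods
meanDis <- (npcDis + fingerDis) / 2

# "Mix": keep the NPClassifier value where it is > 0,
# otherwise use half of the structural dissimilarity
mixDis <- ifelse(npcDis > 0, npcDis, fingerDis / 2)
```

In this example the compA/compB pair gets a mix value of 0.2 (half of 0.4) instead of 0, while pairs that differ in classification keep their NPClassifier value of 0.5.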

If there are unknown compounds, which do not have a corresponding SMILES or InChIKey, they can be handled in three different ways. First, they can be completely removed from the list of compounds and the sample data set, and hence excluded from all analyses. Second, if unknownCompoundsMean = FALSE, unknown compounds will be given a dissimilarity value of 1 to all other compounds. Third, if unknownCompoundsMean = TRUE, unknown compounds will be given a dissimilarity value to all other compounds which equals the mean dissimilarity value between all known compounds. See chemodiv for alternative methods that can be used when most or all compounds are unknown.
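The second and third options can be illustrated with a small base R sketch. The helper `addUnknown` and the matrix `knownDis` are hypothetical, and taking the mean over the off-diagonal pairs only is an assumption made here for illustration; the package may compute it differently.

```r
known <- c("compA", "compB", "compC")

# Hypothetical dissimilarities between the known compounds
knownDis <- matrix(c(0,   0.2, 0.6,
                     0.2, 0,   0.4,
                     0.6, 0.4, 0),
                   nrow = 3, byrow = TRUE, dimnames = list(known, known))

# Hypothetical helper: append one unknown compound to a dissimilarity matrix
addUnknown <- function(disMat, unknownName, useMean = FALSE) {
  # Assumption: the mean is taken over distinct pairs of known compounds
  fillValue <- if (useMean) mean(disMat[lower.tri(disMat)]) else 1
  allNames <- c(rownames(disMat), unknownName)
  out <- matrix(fillValue, nrow = length(allNames), ncol = length(allNames),
                dimnames = list(allNames, allNames))
  out[rownames(disMat), colnames(disMat)] <- disMat  # keep known values
  diag(out) <- 0                                     # self-dissimilarity is 0
  out
}

disOne <- addUnknown(knownDis, "unknown1", useMean = FALSE)
disMean <- addUnknown(knownDis, "unknown1", useMean = TRUE)
```

With these values, the unknown compound has dissimilarity 1 to every known compound in `disOne`, and 0.4 (the mean of 0.2, 0.6 and 0.4) in `disMean`.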

Value

List with compound dissimilarity matrices. A list is always returned, even if only one matrix is calculated. Downstream functions, including calcDiv, calcBetaDiv, calcDivProf, sampDis, molNet and chemoDivPlot, require only the matrix as input (e.g. as fullList$specificMatrix) rather than the whole list.
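A short sketch of pulling a single matrix out of such a list follows. The list `fullList` is built by hand here, and the element name `specificMatrix` is the placeholder used above; the actual element names in the returned list may differ and can be checked with `names()`.

```r
# Hand-built stand-in for the list returned by compDis (hypothetical content)
fullList <- list(
  specificMatrix = matrix(c(0, 0.3,
                            0.3, 0),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(c("compA", "compB"),
                                          c("compA", "compB")))
)

names(fullList)                      # inspect which matrices are available

disMat <- fullList$specificMatrix    # extract one matrix by name
# Equivalent extraction by position: fullList[[1]]
```

The extracted `disMat` is a plain matrix, which is the form the downstream functions expect.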

References

Kim HW, Wang M, Leber CA, Nothias L-F, Reher R, Kang KB, van der Hooft JJJ, Dorrestein PC, Gerwick WH, Cottrell GW. 2021. NPClassifier: A Deep Neural Network-Based Structural Classification Tool for Natural Products. Journal of Natural Products 84: 2795-2807.

Wang Y, Backman TWH, Horan K, Girke T. 2013. fmcsR: mismatch tolerant maximum common substructure searching in R. Bioinformatics 29: 2792-2794.

Examples

data(minimalCompData)
data(minimalNPCTable)
compDis(minimalCompData, type = "NPClassifier",
        npcTable = minimalNPCTable) # Dissimilarity based on NPClassifier

## Not run: compDis(minimalCompData) # Dissimilarity based on Fingerprints

data(alpinaCompData)
data(alpinaNPCTable)
compDis(compoundData = alpinaCompData, type = "NPClassifier",
        npcTable = alpinaNPCTable) # Dissimilarity based on NPClassifier

[Package chemodiv version 0.3.0 Index]