compDis {chemodiv} | R Documentation |
Calculate compound dissimilarities
Description
Function to quantify dissimilarities between phytochemical compounds.
Usage
compDis(
compoundData,
type = "PubChemFingerprint",
npcTable = NULL,
unknownCompoundsMean = FALSE
)
Arguments
compoundData |
Data frame with the chemical compounds of interest, usually the compounds found in the sample dataset. Should have a column named "compound" with common names of the compounds, a column named "smiles" with SMILES IDs of the compounds, and a column named "inchikey" with the InChIKey IDs for the compounds. |
type |
Type of data compound dissimilarity calculations will be based on: "NPClassifier", "PubChemFingerprint" or "fMCS". |
npcTable |
A data frame already generated by NPCTable, optionally edited by hand, to be used when dissimilarities are based on NPClassifier. |
unknownCompoundsMean |
Whether unknown compounds, i.e. ones without SMILES or InChIKey, should be given mean dissimilarity values. If FALSE, these will have dissimilarity 1 to all other compounds. |
Details
This function calculates matrices with pairwise dissimilarities between the chemical compounds in compoundData, to quantify how different the molecules are from each other. It does so in three different ways, based on the biosynthetic classification or molecular structure of the molecules:
Using the classification from the NPClassifier tool, type = "NPClassifier". NPClassifier (Kim et al. 2021) is a deep-learning tool that automatically classifies natural products (i.e. phytochemical compounds) into a hierarchical classification of three levels: pathway, superclass and class. This classification largely corresponds to the biosynthetic groups/pathways the compounds are produced in. Classifications are downloaded from https://npclassifier.ucsd.edu/. NPClassifier does not always manage to classify every compound into all three hierarchical levels. In such cases, it might be beneficial to first run NPCTable, manually edit the resulting data frame with probable classifications if possible (with help from the Supporting Information in Kim et al. 2021), and then supply this classification to the compDis function with the npcTable argument. This will ensure that compound dissimilarities are computed optimally.

Using PubChem Fingerprints, type = "PubChemFingerprint". This is a binary substructure fingerprint with 881 binary variables describing the chemical structure of a compound. With this method, compounds are therefore compared based on how structurally dissimilar the molecules are. See https://pubchem.ncbi.nlm.nih.gov/docs/data-specification for more information. (There are many other types of fingerprints, and ways of calculating compound dissimilarities based on them, see e.g. packages fingerprint and rcdk). Fingerprint data for molecules is downloaded from PubChem. In association with this, there might be a Warning message about closing unused connections, which is not important.

fMCS, flexible Maximum Common Substructure, type = "fMCS". This is a pairwise graph matching concept. The fMCS of two compounds is the largest substructure that occurs in both compounds allowing for atom and/or bond mismatches (Wang et al. 2013). As with the fingerprints, compounds are compared based on how structurally dissimilar the molecules are. While potentially a very accurate similarity measure, fMCS is much more computationally demanding than the other methods, and will take a significant amount of time for larger data sets. Data on molecules is downloaded from PubChem. In association with this, there might be a Warning message about closing unused connections, which is not important.
Dissimilarities using NPClassifier and PubChem Fingerprints are generated by calculating Jaccard (Tanimoto) dissimilarities from a 0/1 table with compounds as rows and group (NPClassifier) or binary fingerprint variable (PubChem Fingerprints) as columns. fMCS generates dissimilarity values by calculating Jaccard dissimilarities based on the number of atoms in the maximum common substructure, allowing for one atom and one bond mismatch. Dissimilarities are outputted as dissimilarity matrices.
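The Jaccard calculation described above can be sketched in base R. This is only an illustration under stated assumptions, not the code compDis itself uses: the helper names jaccard_dis and mcs_dis, the toy compounds and the toy binary variables are all hypothetical, and treating two all-zero rows as identical is an assumption made here.

```r
# Jaccard (Tanimoto) dissimilarity between the rows of a 0/1 table.
# Hypothetical helper, not part of chemodiv.
jaccard_dis <- function(x) {
  x <- as.matrix(x)
  shared <- x %*% t(x)                 # variables equal to 1 in both rows
  n_feat <- rowSums(x)                 # variables equal to 1 in each row
  union <- outer(n_feat, n_feat, "+") - shared
  dis <- 1 - shared / union
  dis[union == 0] <- 0                 # two empty rows: treated as identical (assumption)
  dis
}

# Toy table: 3 hypothetical compounds (rows) x 4 hypothetical binary variables
toy <- rbind(cmpA = c(1, 1, 0, 0),
             cmpB = c(1, 0, 1, 0),
             cmpC = c(1, 1, 0, 0))
toy_dis <- jaccard_dis(toy)

# The same idea applied to atom counts, as described for fMCS:
# n_a and n_b are the atom counts of the two compounds,
# n_mcs is the atom count of their maximum common substructure.
mcs_dis <- function(n_a, n_b, n_mcs) 1 - n_mcs / (n_a + n_b - n_mcs)
```

In this toy table cmpA and cmpC have identical rows, so their dissimilarity is 0, while cmpA and cmpB share one of three variables present in either.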
If dissimilarities are calculated with more than one method, the function will output additional dissimilarity matrices. This always includes a matrix with the mean dissimilarity values of the selected methods. If "NPClassifier" is included in type, a matrix of "mix" values is also calculated. The values in this matrix are the dissimilarities from NPClassifier when these are > 0. For pairs of compounds where the dissimilarity from NPClassifier equals 0 (i.e. when the compounds belong to the same pathway, superclass and class), values are equal to half of the (mean) value(s) of the structural dissimilarity/-ies from PubChem Fingerprints and/or fMCS. With this method, compound dissimilarities are primarily based on NPClassifier, but instead of compounds with identical classification having 0 dissimilarity, these have a dissimilarity based on PubChem Fingerprints and/or fMCS, scaled to always be less (< 0.5) than the dissimilarity between compounds in the same pathway and superclass, but different class.
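The mean and "mix" rules described above can be written out on toy matrices. This is a sketch under stated assumptions rather than the package's internal code: the compound names and all dissimilarity values below are made up for illustration.

```r
cmp <- c("cmpA", "cmpB", "cmpC")   # hypothetical compounds
# Made-up symmetric dissimilarity matrices, one per method
npc_dis  <- matrix(c(0, 0, 1,
                     0, 0, 1,
                     1, 1, 0), 3, 3, dimnames = list(cmp, cmp))
fp_dis   <- matrix(c(0,   0.4, 0.8,
                     0.4, 0,   0.6,
                     0.8, 0.6, 0), 3, 3, dimnames = list(cmp, cmp))
fmcs_dis <- matrix(c(0,   0.2, 0.6,
                     0.2, 0,   0.8,
                     0.6, 0.8, 0), 3, 3, dimnames = list(cmp, cmp))

# Mean matrix: element-wise mean over all selected methods
all_dis <- list(npc_dis, fp_dis, fmcs_dis)
mean_dis <- Reduce(`+`, all_dis) / length(all_dis)

# Mix matrix: keep NPClassifier values where they are > 0; where they are 0,
# use half of the mean structural dissimilarity instead
struct_mean <- (fp_dis + fmcs_dis) / 2
mix_dis <- npc_dis
same_class <- npc_dis == 0
mix_dis[same_class] <- struct_mean[same_class] / 2
```

Here cmpA and cmpB share a classification, so their mix value comes from the structural matrices (half of 0.3), while pairs involving cmpC keep their NPClassifier value of 1.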
If there are unknown compounds, which do not have a corresponding SMILES or InChIKey, this can be handled in three different ways. First, these can be completely removed from the list of compounds and the sample data set, and hence excluded from all analyses. Second, if unknownCompoundsMean = FALSE, unknown compounds will be given a dissimilarity value of 1 to all other compounds. Third, if unknownCompoundsMean = TRUE, unknown compounds will be given a dissimilarity value to all other compounds which equals the mean dissimilarity value between all known compounds. See chemodiv for alternative methods that can be used when most or all compounds are unknown.
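The second and third options can be illustrated on a toy matrix. This is a minimal sketch, not the package's implementation: fill_unknown is a hypothetical helper, the compound names and values are made up, and taking the mean over distinct pairs of known compounds (excluding the zero diagonal) is an assumption about how the mean is defined.

```r
# Hypothetical helper: fill in dissimilarities for compounds lacking
# SMILES/InChIKey, either with 1 or with the mean over known pairs.
fill_unknown <- function(dis, unknown, use_mean = FALSE) {
  known <- setdiff(rownames(dis), unknown)
  known_block <- dis[known, known, drop = FALSE]
  # Mean over distinct pairs of known compounds (diagonal excluded; assumption)
  fill <- if (use_mean) mean(known_block[lower.tri(known_block)]) else 1
  dis[unknown, ] <- fill
  dis[, unknown] <- fill
  diag(dis) <- 0                       # a compound is identical to itself
  dis
}

cmp <- c("cmpA", "cmpB", "cmpC", "cmpU")   # cmpU stands for an unknown compound
d <- matrix(NA_real_, 4, 4, dimnames = list(cmp, cmp))
d[1:3, 1:3] <- c(0,   0.2, 0.6,
                 0.2, 0,   0.4,
                 0.6, 0.4, 0)

d_one  <- fill_unknown(d, "cmpU")                   # like unknownCompoundsMean = FALSE
d_mean <- fill_unknown(d, "cmpU", use_mean = TRUE)  # like unknownCompoundsMean = TRUE
```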
Value
List with compound dissimilarity matrices. A list is always returned, even if only one matrix is calculated. Downstream functions, including calcDiv, calcBetaDiv, calcDivProf, sampDis, molNet and chemoDivPlot, require only the matrix as input (e.g. as fullList$specificMatrix) rather than the whole list.
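Extracting a single matrix from the returned list might look like the sketch below. The list here is a hand-built stand-in for compDis output, and the element name specificMatrix is the placeholder used above, not a real element name; inspect names() on the actual result to see which matrices it holds.

```r
# Stand-in for a list returned by compDis(); values and names are made up
cmp <- c("cmpA", "cmpB")
fullList <- list(
  specificMatrix = matrix(c(0, 0.5,
                            0.5, 0), 2, 2, dimnames = list(cmp, cmp))
)

names(fullList)                       # check which matrices the list holds
disMat <- fullList$specificMatrix     # by name
disMatFirst <- fullList[[1]]          # or by position

# disMat, not fullList, is what would be passed on to downstream functions
is.matrix(disMat)
```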
References
Kim HW, Wang M, Leber CA, Nothias L-F, Reher R, Kang KB, van der Hooft JJJ, Dorrestein PC, Gerwick WH, Cottrell GW. 2021. NPClassifier: A Deep Neural Network-Based Structural Classification Tool for Natural Products. Journal of Natural Products 84: 2795-2807.
Wang Y, Backman TWH, Horan K, Girke T. 2013. fmcsR: mismatch tolerant maximum common substructure searching in R. Bioinformatics 29: 2792-2794.
Examples
data(minimalCompData)
data(minimalNPCTable)
compDis(minimalCompData, type = "NPClassifier",
npcTable = minimalNPCTable) # Dissimilarity based on NPClassifier
## Not run: compDis(minimalCompData) # Dissimilarity based on Fingerprints
data(alpinaCompData)
data(alpinaNPCTable)
compDis(compoundData = alpinaCompData, type = "NPClassifier",
npcTable = alpinaNPCTable) # Dissimilarity based on NPClassifier