chemodiv {chemodiv} | R Documentation
chemodiv: A package for analysing phytochemical diversity
Description
chemodiv is an R package for analysing the chemodiversity of phytochemical data. The package includes a number of functions that enable quantification and visualization of phytochemical diversity and dissimilarity for any type of phytochemical (and similar) samples, such as herbivore defence compounds or volatiles. Importantly, calculations of diversity and dissimilarity can incorporate biosynthetic and/or structural properties of the phytochemical compounds, resulting in more comprehensive quantifications of diversity and dissimilarity. Functions in the package work best for datasets, commonly generated by chemical ecologists using GC-MS, LC-MS or similar methods, in which all or most compounds in the samples have been confidently identified. See Petren et al. 2023a for a detailed description of the package, and Petren et al. 2023b for a more in-depth discussion and review of plant chemodiversity.
Details
Two datasets are needed to use the full set of analyses included in the package.
The first dataset should contain data on the relative
abundance/concentration (i.e. proportion) of different compounds (columns)
in different samples (rows). See the included dataset minimalSampData
for a basic example.
Note that all calculations of diversity, and most calculations of
dissimilarity, are only performed on relative, rather than absolute, values.
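As a minimal sketch of the layout described above, the following base R code builds a small sample data frame and converts absolute values to proportions. The sample names, compound names and peak areas are invented for illustration and do not come from the package.

```r
# Hypothetical peak areas: samples in rows, compounds in columns.
absData <- data.frame(
  limonene     = c(120, 30, 0),
  linalool     = c(60, 30, 50),
  benzaldehyde = c(20, 40, 50),
  row.names    = c("sample1", "sample2", "sample3")
)

# Divide each row by its total so that every sample sums to 1.
sampData <- absData / rowSums(absData)

print(round(sampData, 2))
print(rowSums(sampData))
```

Dividing by the row totals keeps the samples-by-compounds layout intact, so the result can be used wherever a table of proportions is expected.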
The second dataset should contain, in each of three columns in a data frame,
the compound name, SMILES and InChIKey IDs of all the compounds
present in the first dataset. See the included dataset minimalCompData
for a basic example. SMILES and InChIKey
are chemical identifiers that are easily obtained for each compound
by searching for it in PubChem https://pubchem.ncbi.nlm.nih.gov/.
Here, a search with a common name will bring up the compound's
record in the database, where the (isomeric/canonical) SMILES and
InChIKey are included. Various automated tools such as
the PubChem Identifier Exchange Service
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi or
The Chemical Translation Service https://cts.fiehnlab.ucdavis.edu/
can also be used. The user is intentionally required to compile the
chemical identifiers manually to ensure that these are correct,
because lists of compounds very often contain compounds that are wrongly
named, wrongly formatted or listed under various synonyms, which prevents
reliable automatic translation of compound names to SMILES and InChIKey.
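Because the identifiers are compiled by hand, a quick format check can catch copy-and-paste mistakes early. The sketch below uses base R only. The column names (compound, smiles, inchikey) and the helper variable names are assumptions made for illustration, and the identifiers shown should be verified against PubChem before use. The check tests only the format of each entry, not whether the identifier belongs to the named compound.

```r
# Hypothetical compound table; the column names are assumed for illustration.
compData <- data.frame(
  compound = c("limonene", "benzaldehyde"),
  smiles   = c("CC1=CCC(CC1)C(=C)C", "C1=CC=C(C=C1)C=O"),
  inchikey = c("XMGQYMWWDOXHJM-UHFFFAOYSA-N", "HUMNYLRZRPPJDN-UHFFFAOYSA-N"),
  stringsAsFactors = FALSE
)

# A standard InChIKey has 27 characters: 14 letters, a hyphen,
# 10 letters, a hyphen, and 1 letter.
inchikeyPattern <- "^[A-Z]{14}-[A-Z]{10}-[A-Z]$"
validKey <- grepl(inchikeyPattern, compData$inchikey)

# Flag missing or empty SMILES and duplicated compound names.
validSmiles <- !is.na(compData$smiles) & nzchar(compData$smiles)
uniqueName  <- !duplicated(compData$compound)

print(data.frame(compound = compData$compound,
                 validKey, validSmiles, uniqueName))
```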
Note that SMILES IDs might contain the character combination "\C".
If SMILES are typed directly into R, this is interpreted as an
unrecognized escape and results in an error. In this case, an extra
backslash has to be added: "\\C". If the dataset is instead
imported into R as a csv-file or txt-file (recommended), the backslash
is handled automatically and no manual edits are needed.
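The behaviour described above can be illustrated with a short base R sketch. The SMILES string, the compound label and the temporary file are made up for the example, and the raw-string syntax assumes R 4.0.0 or later.

```r
# Typing "C/C=C\C" directly is a parse error, because "\C" is not a valid
# escape sequence in an R string. Doubling the backslash stores a single one.
smilesEscaped <- "C/C=C\\C"
cat(smilesEscaped, "\n")   # the printed string contains one backslash
nchar(smilesEscaped)       # 7 characters: the backslash counts once

# Raw strings (assumes R >= 4.0.0) avoid the doubling altogether.
smilesRaw <- r"(C/C=C\C)"
identical(smilesEscaped, smilesRaw)

# Reading the same SMILES from a file needs no manual escaping.
tmp <- tempfile(fileext = ".csv")
writeLines(c("compound,smiles", "exampleCompound,C/C=C\\C"), tmp)
fromFile <- read.csv(tmp, stringsAsFactors = FALSE)
identical(fromFile$smiles, smilesEscaped)
unlink(tmp)
```

The file written by writeLines contains a single backslash, which is what a spreadsheet export would contain, and read.csv returns it unchanged.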
The second dataset with the chemical IDs is primarily used to construct
one or more dissimilarity matrices with pairwise dissimilarities between
chemical compounds, which can then be used in calculations of phytochemical
diversity and dissimilarity. As noted above, to do this, the compounds
in the samples have to be identified and their chemical IDs listed.
If some compounds in a dataset are unknown, these can be handled in
different ways decided by the user; see compDis for details.
If many or all compounds are unknown, as is common for metabolomics-type
datasets, phytochemical diversity and dissimilarity can still be
calculated using indices that do not consider compound dissimilarities.
Alternatively, other ways to calculate compound dissimilarities,
not based on knowing compound identities, can be used.
For example, cosine dissimilarities between tandem (MS/MS) mass spectra of
metabolomic features can be calculated in the GNPS
framework https://gnps.ucsd.edu (Wang et al. 2016).
A matrix of such dissimilarities can then be supplied as the compDisMat
argument in various functions in the package,
thereby enabling comprehensive quantification of phytochemical diversity
and dissimilarity also for datasets consisting of unidentified compounds.
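To make the idea of a custom dissimilarity matrix concrete, the following base R sketch computes plain cosine dissimilarities between a few invented, already binned spectra. This is a simplification and not the GNPS procedure, which matches peaks between spectra before scoring. The feature names and intensities are hypothetical, and it is assumed here that the row and column names of the matrix should match the compound (feature) names used in the sample dataset.

```r
# Hypothetical binned MS/MS intensities: one row per feature,
# one column per m/z bin.
spectra <- rbind(
  feature1 = c(10, 0, 5, 0),
  feature2 = c(8, 1, 4, 0),
  feature3 = c(0, 9, 0, 7)
)

# Scale each spectrum to unit length, then take all pairwise dot products.
unitSpectra <- spectra / sqrt(rowSums(spectra^2))
cosineSim <- unitSpectra %*% t(unitSpectra)

# Convert similarity to dissimilarity, clamp rounding error to [0, 1]
# and set the diagonal to exactly zero.
featureDisMat <- 1 - cosineSim
featureDisMat <- pmin(pmax(featureDisMat, 0), 1)
diag(featureDisMat) <- 0

print(round(featureDisMat, 3))
```

The result is a square, symmetric matrix with values between 0 and 1, which is the general shape a matrix passed as compDisMat would be expected to have.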
Once the dataset with samples and the dataset with compounds are prepared, these should be imported/constructed as separate data frames in R, and all analyses in the package can then be performed, in largely the same order as they appear in the list below.
Data format checks
Compound classification and dissimilarity
Diversity calculations: calcDiv, calcBetaDiv, calcDivProf
Sample dissimilarities
Molecular network and properties
Chemodiversity and network plots
Shortcut function
Author(s)
Hampus Petren, Tobias G. Koellner, Robert R. Junker
References
Petren H, Koellner TG, Junker RR. 2023a. Quantifying chemodiversity considering biochemical and structural properties of compounds with the R package chemodiv. New Phytologist 237: 2478-2492.
Petren H, Anaia RA, Aragam KS, Braeutigam A, Eckert S, Heinen R, Jakobs R, Ojeda L, Popp M, Sasidharan R, Schnitzler J-P, Steppuhn A, Thon F, Tschikin S, Unsicker SB, van Dam NM, Weisser WW, Wittmann MJ, Yepes S, Ziaja D, Mueller C, Junker RR. 2023b. Understanding the phytochemical diversity of plants: Quantification, variation and ecological function. bioRxiv doi: 10.1101/2023.03.23.533415.
Wang M, Carver JJ, Phelan VV, et al. 2016. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nature Biotechnology 34: 828-837.
See Also
https://github.com/hpetren/chemodiv