SimilarityMetrics {Quartet}    R Documentation
Tree similarity measures
Description
Measure tree similarity or difference.
Usage
SimilarityMetrics(elementStatus, similarity = TRUE)
DoNotConflict(elementStatus, similarity = TRUE)
ExplicitlyAgree(elementStatus, similarity = TRUE)
StrictJointAssertions(elementStatus, similarity = TRUE)
SemiStrictJointAssertions(elementStatus, similarity = TRUE)
SymmetricDifference(elementStatus, similarity = TRUE)
RawSymmetricDifference(elementStatus, similarity = FALSE)
RobinsonFoulds(elementStatus, similarity = FALSE)
MarczewskiSteinhaus(elementStatus, similarity = TRUE)
SteelPenny(elementStatus, similarity = TRUE)
QuartetDivergence(elementStatus, similarity = TRUE)
SimilarityToReference(elementStatus, similarity = TRUE, normalize = FALSE)
Arguments
elementStatus |
Two-dimensional integer array, with rows corresponding to
counts of matching quartets or partitions for each tree, and columns named
according to the output of |
similarity |
Logical specifying whether to calculate the similarity or dissimilarity. |
normalize |
Logical; if |
Details
Estabrook et al. (1985) (table 2) define four similarity metrics in terms of the total number of quartets (N, their Q), the number of quartets resolved in the same manner in two trees (s), the number resolved differently in both trees (d), the number resolved in tree 1 or 2 but unresolved in the other tree (r1, r2), and the number that are unresolved in both trees (u).
The similarity metrics are then given as below. The dissimilarity metrics are their complement (i.e. 1 - similarity), and can be calculated algebraically using the identity N = s + d + r1 + r2 + u.
Although defined using quartets, analogous values can be calculated using partitions – though for a number of reasons, quartets may offer a more meaningful measure of the amount of information shared by two trees (Smith 2020).
Do Not Conflict (DC): (s + r1 + r2 + u) / N
Explicitly Agree (EA): s / N
Strict Joint Assertions (SJA): s / (s + d)
SemiStrict Joint Assertions (SSJA): s / (s + d + u)
(The numerator of the SemiStrict Joint Assertions similarity metric is given in Estabrook et al. (1985) table 2 as s + d, but this is understood, with reference to their text, to be a typographic error.)
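The four metrics above can be checked by hand with a few lines of base R. This is a minimal sketch under stated assumptions: the counts are invented for illustration, and `estabrook_metrics()` is a hypothetical helper, not a function exported by the Quartet package.

```r
# Hypothetical quartet counts for a pair of trees (illustrative values only)
counts <- c(s = 50, d = 10, r1 = 15, r2 = 5, u = 20)

# Illustrative helper (an assumption, not part of the Quartet package).
# N defaults to the identity N = s + d + r1 + r2 + u given in the text.
estabrook_metrics <- function(x, N = sum(x)) {
  s <- x[["s"]]
  d <- x[["d"]]
  r1 <- x[["r1"]]
  r2 <- x[["r2"]]
  u <- x[["u"]]
  c(
    DC   = (s + r1 + r2 + u) / N,  # Do Not Conflict
    EA   = s / N,                  # Explicitly Agree
    SJA  = s / (s + d),            # Strict Joint Assertions
    SSJA = s / (s + d + u)         # SemiStrict Joint Assertions
  )
}

similarity <- estabrook_metrics(counts)
# Dissimilarities are the complements of the similarities
dissimilarity <- 1 - similarity
print(rbind(similarity, dissimilarity))
```

With these counts, the Do Not Conflict dissimilarity reduces to d / N, which illustrates how the identity N = s + d + r1 + r2 + u lets each complement be written algebraically.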
Steel and Penny (1993) propose a further metric, which they denote dQ, which this package calculates using the function SteelPenny():
Steel & Penny's quartet metric (dQ): (s + u) / N
Another take on tree similarity is to consider the symmetric difference: that is, the number of partitions or quartets present in one tree that do not appear in the other, originally used to measure tree similarity by Robinson and Foulds (1981). (Note that, given the familiarity of the Robinson–Foulds distance metric, this quantity is by default expressed as a difference rather than a similarity.)
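Raw symmetric difference (RF): d1 + d2 + r1 + r2, where d1 and d2 count the elements of tree 1 and tree 2 respectively that are resolved differently in the other tree (for quartets, d1 = d2 = d)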
A pair of trees will have a high symmetric difference if they are well-resolved but disagree on many relationships; or if they agree on most relationships but are poorly resolved. As such, it is essential to contextualize the symmetric difference by appropriate normalization (Smith 2019). Multiple approaches to normalization have been proposed:
The total number of resolved quartets or partitions present in both trees (Day 1986):
Symmetric Difference (SD): (2 d + r1 + r2) / (2 d + 2 s + r1 + r2)
The total distinctly resolved quartets or partitions (Marczewski and Steinhaus 1958; Day 1986):
Marczewski-Steinhaus (MS): (2 d + r1 + r2) / (2 d + s + r1 + r2)
The maximum number of quartets or partitions that could have been resolved, given the number of tips (Smith 2019):
Symmetric Divergence: (2 d + r1 + r2) / N
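The effect of each normalization can be seen by applying the formulas above to a single set of counts. The sketch below uses invented quartet counts and plain arithmetic rather than package functions; it assumes d1 = d2 = d, as in the normalized formulas, and takes N from the identity N = s + d + r1 + r2 + u.

```r
# Hypothetical quartet counts (illustrative values only)
counts <- c(s = 50, d = 10, r1 = 15, r2 = 5, u = 20)
s <- counts[["s"]]
d <- counts[["d"]]
r1 <- counts[["r1"]]
r2 <- counts[["r2"]]
N <- sum(counts)  # Assumes N = s + d + r1 + r2 + u

# Raw symmetric difference, taking d1 = d2 = d
raw <- 2 * d + r1 + r2

# The same numerator under the three normalizations listed above
normalized <- c(
  SD         = raw / (2 * d + 2 * s + r1 + r2),  # Resolved in both trees
  MS         = raw / (2 * d + s + r1 + r2),      # Distinctly resolved
  divergence = raw / N                           # Maximum resolvable
)
print(raw)
print(normalized)
```

The numerator is identical in every case; only the denominator changes, so the three values rank the same pair of trees on different scales.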
Finally, in cases where a reconstructed tree r1 is being compared to a reference tree r2 taken to represent "true" relationships, a symmetric difference is not desired. In such settings, the desired score is the expectation that a given quartet's resolution in the reconstructed tree is "correct", given by Asher and Smith (2022):
Similarity to Reference (S2R): (s + (r1 + r2 + u) / 3) / Q
This may optionally be normalized with reference to the maximum possible similarity, (s + d + r2 + (r1 + u) / 3) / Q, subtracting 1/3 (the probability of matching at random) from both the S2R score and maximum possible score before dividing; then, a tree scores zero if it is as different from the true tree as a random or fully unresolved tree, and one if it is as "true" as can be known.
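The normalization described above can be worked through numerically. This is a sketch with invented counts, not output from the package; it assumes that Q equals s + d + r1 + r2 + u, following the statement that N corresponds to Estabrook et al.'s Q.

```r
# Hypothetical quartet counts: tree 1 is the reconstruction, tree 2 the reference
counts <- c(s = 50, d = 10, r1 = 15, r2 = 5, u = 20)
s <- counts[["s"]]
d <- counts[["d"]]
r1 <- counts[["r1"]]
r2 <- counts[["r2"]]
u <- counts[["u"]]
Q <- sum(counts)  # Assumption: Q = s + d + r1 + r2 + u

# Expected proportion of quartets resolved "correctly"
s2r <- (s + (r1 + r2 + u) / 3) / Q

# Best score attainable given what the reference tree resolves
max_s2r <- (s + d + r2 + (r1 + u) / 3) / Q

# Subtract the chance expectation of 1/3 from both before dividing
chance <- 1 / 3
normalized_s2r <- (s2r - chance) / (max_s2r - chance)
print(c(S2R = s2r, maximum = max_s2r, normalized = normalized_s2r))
```

Setting s to zero and moving every quartet into u gives an unnormalized score of exactly 1/3 and hence a normalized score of zero, matching the description of a fully unresolved tree.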
Value
SimilarityMetrics() returns a named two-dimensional array in which each row corresponds to an input tree, and each column corresponds to one of the listed measures.
DoNotConflict() and others return a named vector describing the requested similarity (or difference) between the trees.
Author(s)
Martin R. Smith (martin.smith@durham.ac.uk)
References
Asher R, Smith MR (2022).
“Phylogenetic signal and bias in paleontology.”
Systematic Biology, 71(4), 986–1008.
doi:10.1093/sysbio/syab072.
Day WH (1986).
“Analysis of quartet dissimilarity measures between undirected phylogenetic trees.”
Systematic Biology, 35(3), 325–333.
doi:10.1093/sysbio/35.3.325.
Estabrook GF, McMorris FR, Meacham CA (1985).
“Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units.”
Systematic Zoology, 34(2), 193–200.
doi:10.2307/2413326.
Marczewski E, Steinhaus H (1958).
“On a certain distance of sets and the corresponding distance of functions.”
Colloquium Mathematicae, 6(1), 319–327.
https://eudml.org/doc/210378.
Robinson DF, Foulds LR (1981).
“Comparison of phylogenetic trees.”
Mathematical Biosciences, 53(1-2), 131–147.
doi:10.1016/0025-5564(81)90043-2.
Smith MR (2019).
“Bayesian and parsimony approaches reconstruct informative trees from simulated morphological datasets.”
Biology Letters, 15(2), 20180632.
doi:10.1098/rsbl.2018.0632.
Smith MR (2020).
“Information theoretic Generalized Robinson-Foulds metrics for comparing phylogenetic trees.”
Bioinformatics, 36(20), 5007–5013.
doi:10.1093/bioinformatics/btaa614.
Steel MA, Penny D (1993).
“Distributions of tree comparison metrics—some new results.”
Systematic Biology, 42(2), 126–141.
doi:10.1093/sysbio/42.2.126, http://www.math.canterbury.ac.nz/~m.steel/Non_UC/files/research/distributions.pdf.
See Also
Calculate status of each quartet, the raw material from which the Estabrook et al. metrics are calculated, with QuartetStatus().
Equivalent metrics for bipartition splits: SplitStatus(), CompareSplits().
Examples
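# Load the example trees and tabulate the status of each quartet
data("sq_trees")
sq_status <- QuartetStatus(sq_trees)
# Calculate every listed metric at once
SimilarityMetrics(sq_status)
# Report quartet divergence as a dissimilarity rather than a similarity
QuartetDivergence(sq_status, similarity = FALSE)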
library("TreeTools", quietly = TRUE, warn.conflicts = FALSE)
set.seed(0) # Make the random trees reproducible
# Reference tree with some nodes collapsed, so not every quartet is resolved
reference <- CollapseNode(as.phylo(101, 10), 16:18)
trees <- c(
  reference = reference,
  binaryRef = MakeTreeBinary(reference),
  balanced = BalancedTree(reference),
  pectinate = PectinateTree(reference),
  star = StarTree(reference),
  random = RandomTree(reference),
  random2 = RandomTree(reference)
)
# Compare each tree against the reference tree
elementStatus <- QuartetStatus(trees, reference)
SimilarityToReference(elementStatus)
SimilarityToReference(elementStatus, normalize = TRUE)