group.knowns {TRAMPR} | R Documentation |
Knowns Clustering
Description
Group a TRAMPknowns object so that knowns with similar TRFLP patterns and knowns that share the same species name “group” together. In general, this function will be called automatically whenever appropriate (e.g. when loading a data set or adding new knowns). Please see Details to understand why this function is necessary, and how it works.
The main reason for manually calling group.knowns is to change the default values of the arguments; if you call group.knowns on a TRAMPknowns object, then any subsequent automatic call to group.knowns will use any arguments you passed in the manual group.knowns call (e.g. after doing group.knowns(x, cut.height=20), all future groupings will use cut.height=20).
Usage
group.knowns(x, ...)
## S3 method for class 'TRAMPknowns'
group.knowns(x, dist.method, hclust.method, cut.height, ...)
## S3 method for class 'TRAMP'
group.knowns(x, ...)
Arguments
x |
A |
dist.method |
Distance method used in calculating similarity
between different knowns (see |
hclust.method |
Clustering method used in generating clusters
from the similarity matrix (see |
cut.height |
Passed to |
... |
Arguments passed to further methods. |
Details
group.knowns groups together knowns in a TRAMPknowns object based on two criteria: (1) TRFLP profiles that are very similar across shared enzyme/primer combinations (based on clustering) and (2) TRFLP profiles that belong to the same species (i.e. share a common value in the species column of the info data.frame of x; see TRAMPknowns for more information). This is to solve three issues in TRFLP analysis:
1. The TRFLP profile of a single species can have variation in peak sizes due to DNA sequence variation. By including multiple collections of each species, variation in TRFLP profiles can be accounted for. If a TRAMPknowns object contains multiple collections of a species, these will be aggregated by group.knowns. This aggregation is essential for community analysis, as leaving individual collections will artificially inflate the number of “present species” when running TRAMP.
   Some authors have taken an alternative approach by using a larger tolerance in matching peaks between samples and knowns (effectively increasing accept.error in TRAMP) to account for within-species variation. This is not recommended, as it dramatically increases the risk of incorrect matches.
2. Distinctly different TRFLP profiles may occur within a species (or in some cases within an individual); see Avis et al. (2006). group.knowns looks at the species column of the info data.frame of x and joins any knowns with identical species values as a group. This can also be used where multiple profiles are present in an individual.
3. Different species may share a similar TRFLP profile and therefore be indistinguishable using TRFLP. If these patterns are not grouped, two species will be recorded as present wherever either is present. group.knowns prevents this by joining knowns with “very similar” TRFLP patterns as a group. Ideally, these problematic groups can be resolved by increasing the number of enzyme/primer pairs in the data.
Group names are generated by concatenating all unique (sorted) species names together, separated by commas.
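As a rough illustration of this naming rule in base R (the species names are invented, and the exact separator, a comma with or without a following space, is an assumption rather than something taken from the package source):

```r
## Hypothetical species names for the knowns that ended up in one group.
species <- c("Russula sp.", "Amanita muscaria", "Russula sp.")

## Keep each name once, sort, and join with commas.
## (The ", " separator is an assumption for this sketch.)
group.name <- paste(sort(unique(species)), collapse = ", ")
group.name
```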
To determine if knowns are “similar enough” to form a group, we use R's clustering tools: dist, hclust and cutree. First, we generate a distance matrix of the knowns' profiles using dist, with method dist.method (see Examples below; this is very similar to what TRAMP does, and dist.method should be specified accordingly). We then generate clusters using hclust, with method hclust.method, and “cut” the tree at cut.height using cutree.
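The three clustering steps can be sketched with base R alone. The matrix below is a made-up stand-in for the profile matrix of a knowns object (rows are knowns, columns are enzyme/primer combinations); the peak sizes, the "maximum" and "complete" method choices and the cut height of 2.5 are illustrative assumptions, not necessarily the package defaults.

```r
## Hypothetical profile matrix: one row per known, one column per
## enzyme/primer combination (values are invented peak sizes).
profiles <- rbind(k1 = c(100.0, 250.0),
                  k2 = c(101.0, 251.5),
                  k3 = c(180.0, 320.0),
                  k4 = c(181.0, 321.0))

## 1. Distance matrix between knowns (method assumed for this sketch).
d <- dist(profiles, method = "maximum")

## 2. Hierarchical clustering (method assumed for this sketch).
tree <- hclust(d, method = "complete")

## 3. Cut the tree: knowns joined below the cut height share a cluster.
clusters <- cutree(tree, h = 2.5)
clusters
```

In this toy data k1 and k2 differ by at most 1.5 and k3 and k4 by at most 1, so each pair falls into its own cluster; raising h far enough would merge all four knowns.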
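Knowns are grouped together iteratively, so that all groups sharing a common cluster are merged and all knowns that share a common species name are grouped together. In certain cases this may chain together seemingly unrelated groups: for example, if knowns A and B fall in the same cluster and B shares a species name with C, then A, B and C form one group even if A and C have dissimilar profiles.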
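The chaining behaviour can be illustrated with a small base R sketch. merge_groups is a hypothetical helper written for this illustration, not the package's implementation; it assumes each known has a cluster number (such as cutree returns) and a species name, and it repeatedly propagates the smallest group label until nothing changes.

```r
## Hypothetical helper (not part of TRAMPR): merge cluster membership
## and species names into groups, repeating until labels are stable.
merge_groups <- function(cluster, species) {
  group <- seq_along(cluster)          # start with one group per known
  repeat {
    old <- group
    ## Knowns in the same cluster take the smallest label among them...
    group <- ave(group, cluster, FUN = min)
    ## ...and so do knowns with the same species name.
    group <- ave(group, species, FUN = min)
    if (identical(group, old)) break
  }
  match(group, unique(group))          # relabel groups as 1, 2, 3, ...
}

## Invented input: knowns 1 and 2 share a cluster, knowns 2 and 3
## share a species name, knowns 4 and 5 stand alone.
cluster <- c(1, 1, 2, 3, 4)
species <- c("A", "B", "B", "C", "D")
groups <- merge_groups(cluster, species)
groups
```

Here knowns 1 and 3 share neither a cluster nor a species name, yet they land in the same group because known 2 links them.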
Because group.knowns is generic, it can be run on either a TRAMPknowns or a TRAMP object. When run on a TRAMP object, it updates the TRAMPknowns object (stored as x$knowns), so that subsequent calls to plot.TRAMPknowns or summary.TRAMPknowns (for example) will use the new grouping parameters.
Parameters set by group.knowns are retained as part of the object, so that when adding additional knowns (add.known and combine), or when subsetting a knowns database (see [.TRAMPknowns, aka TRAMPindexing), the same grouping parameters will be used.
Value
For group.knowns.TRAMPknowns, a new TRAMPknowns object. The cluster.pars element will have been updated with new parameters, if any were specified.
For group.knowns.TRAMP, a new TRAMP object, with an updated knowns element. Note that the original TRAMPknowns object (i.e. the one from which the TRAMP object was constructed) will not be modified.
Warning
Warning about missing data: where there are NA values in certain enzyme/primer combinations, NAs may be present in the final distance matrix, in which case hclust cannot be used to generate the clusters. In general, NA values are fine, as long as they are not so widespread that some pair of knowns has no combination in common.
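A minimal base R sketch of the failure mode, using an invented profile matrix (an assumption for illustration, not data from the package) in which two knowns share no enzyme/primer combination:

```r
## Hypothetical profile matrix: k1 and k2 have no enzyme/primer
## combination in common, so their distance cannot be computed.
profiles <- rbind(k1 = c(100, NA),
                  k2 = c(NA, 250),
                  k3 = c(101, 251))

d <- dist(profiles, method = "maximum")
has.na <- anyNA(d)   # is any pairwise distance missing?

## hclust() is then attempted on the incomplete distance matrix;
## the error (if any) is captured rather than stopping the script.
res <- tryCatch(hclust(d), error = function(e) e)
failed <- inherits(res, "error")
c(has.na = has.na, failed = failed)
```

k3 has values for both combinations, so its distances to k1 and k2 are still defined; only the k1 to k2 entry is missing.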
References
Avis PG, Dickie IA, Mueller GM 2006. A ‘dirty’ business: testing the limitations of terminal restriction fragment length polymorphism (TRFLP) analysis of soil fungi. Molecular Ecology 15: 873-882.
See Also
TRAMPknowns, which describes the TRAMPknowns object.
build.knowns, which attempts to generate a knowns database from a TRAMPsamples data set.
plot.TRAMPknowns, which graphically displays the relationships between knowns.
Examples
data(demo.knowns)
data(demo.samples)
demo.knowns <- group.knowns(demo.knowns, cut.height=2.5)
plot(demo.knowns)
## Increasing cut.height makes groups more inclusive:
plot(group.knowns(demo.knowns, cut.height=100))
res <- TRAMP(demo.samples, demo.knowns)
m1.ungrouped <- summary(res)
m1.grouped <- summary(res, group=TRUE)
ncol(m1.grouped) # 94 groups
res2 <- group.knowns(res, cut.height=100)
m2.ungrouped <- summary(res2)
m2.grouped <- summary(res2, group=TRUE)
ncol(m2.grouped) # Now only 38 groups
## group.knowns results in the same distance matrix as produced by
## TRAMP, therefore using the same method (e.g. method="maximum") is
## important. The example below shows how the matrix produced by
## dist(summary(x)) (as calculated by group.knowns) is the same as that
## produced by TRAMP:
f <- function(x, method="maximum") {
  ## Create a pseudo-samples object from our knowns
  y <- x
  y$data$height <- 1
  names(y$info)[names(y$info) == "knowns.pk"] <- "sample.pk"
  names(y$data)[names(y$data) == "knowns.fk"] <- "sample.fk"
  class(y) <- "TRAMPsamples"
  ## Run TRAMP, clean up and return
  ## (If method != "maximum", rescale the error to match that
  ## generated by dist()).
  z <- TRAMP(y, x, method=method)
  if ( method != "maximum" ) z$error <- z$error * z$n
  names(dimnames(z$error)) <- NULL
  z
}
g <- function(x, method="maximum")
  as.matrix(dist(summary(x), method=method))
all.equal(f(demo.knowns, "maximum")$error, g(demo.knowns, "maximum"))
all.equal(f(demo.knowns, "euclidian")$error, g(demo.knowns, "euclidian"))
all.equal(f(demo.knowns, "manhattan")$error, g(demo.knowns, "manhattan"))
## However, TRAMP is over 100 times slower in this special case.
system.time(f(demo.knowns))
system.time(g(demo.knowns))