R: Knowns Clustering

group.knowns {TRAMPR}

R Documentation

Knowns Clustering

Description

Group a TRAMPknowns object so that knowns with similar TRFLP patterns and knowns that share the same species name “group” together. In general, this function will be called automatically whenever appropriate (e.g. when loading a data set or adding new knowns). Please see Details to understand why this function is necessary, and how it works.

The main reason for calling group.knowns manually is to change the default values of its arguments. If you call group.knowns on a TRAMPknowns object, any subsequent automatic call to group.knowns will reuse the arguments you passed in the manual call (e.g. after doing group.knowns(x, cut.height=20), all future groupings will use cut.height=20).

Usage

group.knowns(x, ...)
## S3 method for class 'TRAMPknowns'
group.knowns(x, dist.method, hclust.method, cut.height, ...)
## S3 method for class 'TRAMP'
group.knowns(x, ...)

Arguments

`x`	A `TRAMPknowns` or `TRAMP` object, containing identified TRFLP patterns.
`dist.method`	Distance method used in calculating similarity between different knowns (see `dist`). Valid options include `"maximum"`, `"euclidian"` and `"manhattan"`.
`hclust.method`	Clustering method used in generating clusters from the similarity matrix (see `hclust`).
`cut.height`	Passed to `cutree`; controls how similar members of each group should be (the larger `cut.height`, the more inclusive knowns groups will be).
`...`	Arguments passed to further methods.

Details

group.knowns groups together knowns in a TRAMPknowns object based on two criteria: (1) TRFLP profiles that are very similar across shared enzyme/primer combinations (based on clustering) and (2) TRFLP profiles that belong to the same species (i.e. share a common value in the species column of the info data.frame of x; see TRAMPknowns for more information). This is to solve three issues in TRFLP analysis:

1. The TRFLP profile of a single species can have variation in peak sizes due to DNA sequence variation. By including multiple collections of each species, variation in TRFLP profiles can be accounted for. If a TRAMPknowns object contains multiple collections of a species, these will be aggregated by group.knowns. This aggregation is essential for community analysis, as leaving individual collections will artificially inflate the number of “present species” when running TRAMP.

   Some authors have taken an alternative approach by using a larger tolerance in matching peaks between samples and knowns (effectively increasing accept.error in TRAMP) to account for within-species variation. This is not recommended, as it dramatically increases the risk of incorrect matches.

2. Distinctly different TRFLP profiles may occur within a species (or in some cases within an individual); see Avis et al. (2006). group.knowns looks at the species column of the info data.frame of x and joins any knowns with identical species values as a group. This can also be used where multiple profiles are present in an individual.

3. Different species may share a similar TRFLP profile and therefore be indistinguishable using TRFLP. If these patterns are not grouped, two species will be recorded as present wherever either is present. group.knowns prevents this by joining knowns with “very similar” TRFLP patterns as a group. Ideally, these problematic groups can be resolved by increasing the number of enzyme/primer pairs in the data.

Group names are generated by concatenating all unique (sorted) species names together, separated by commas.
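
The naming rule can be illustrated with base R. This is only a sketch of the rule as described here, not the package's own code; the species labels are hypothetical, and the exact separator (`", "`) is an assumption.

```r
## Hypothetical species labels for the knowns that ended up in one group
species <- c("Russula sp.", "Amanita muscaria", "Russula sp.")

## Unique, sorted, comma-separated (the ", " separator is an assumption)
group.name <- paste(sort(unique(species)), collapse=", ")
group.name
```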

To determine if knowns are “similar enough” to form a group, we use R's clustering tools: dist, hclust and cutree. First, we generate a distance matrix of the knowns' profiles using dist with method dist.method (see Examples below; this is very similar to what TRAMP does, and dist.method should be specified accordingly). We then generate clusters using hclust with method hclust.method, and “cut” the tree at cut.height using cutree.
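
The three clustering steps can be tried on a toy matrix without any TRAMPR objects. This is a minimal sketch: the matrix stands in for the knowns-by-enzyme/primer matrix, and the knowns `k1` to `k4`, the peak sizes, and the choice of `"maximum"` and `"complete"` are assumptions made for illustration.

```r
## Toy matrix: 4 hypothetical knowns (rows) by 3 enzyme/primer
## combinations (columns); values are peak sizes.
profiles <- rbind(k1=c(100, 250, 310),
                  k2=c(101, 250, 311),
                  k3=c(180, 400,  95),
                  k4=c(182, 401,  95))

## Step 1: distance matrix (largest per-column difference per pair)
d <- dist(profiles, method="maximum")

## Step 2: hierarchical clustering of the knowns
tree <- hclust(d, method="complete")

## Step 3: cut the tree; a larger height gives more inclusive groups
groups.tight <- cutree(tree, h=2.5)
groups.loose <- cutree(tree, h=500)
groups.tight
groups.loose
```

With the smaller height, `k1` and `k2` fall into one group and `k3` and `k4` into another; with the larger height, all four knowns fall into a single group.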

Knowns are grouped together iteratively, so that all groups sharing a common cluster are grouped together, and all knowns that share a common species name are grouped together. In certain cases this may chain together seemingly unrelated groups.
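
The chaining effect can be sketched in base R. This is not the package's implementation, only one way to merge on both criteria until nothing changes; the `cluster` and `species` vectors are hypothetical.

```r
## Hypothetical inputs: one cluster id and one species name per known.
cluster <- c(1, 1, 2, 3, 4)
species <- c("A", "B", "B", "C", "D")

## Start from the cluster ids, then repeatedly relabel so that knowns
## sharing a cluster or a species take the smallest label among them.
group <- cluster
repeat {
  old <- group
  group <- ave(group, cluster, FUN=min)
  group <- ave(group, species, FUN=min)
  if ( identical(group, old) ) break
}
group
```

Here the first and third knowns share neither a cluster nor a species name, yet they end up in the same group because the second known links them: it shares a cluster with the first and a species name with the third.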

Because group.knowns is generic, it can be run on either a TRAMPknowns or a TRAMP object. When run on a TRAMP object, it updates the TRAMPknowns object (stored as x$knowns), so that subsequent calls to plot.TRAMPknowns or summary.TRAMPknowns (for example) will use the new grouping parameters.

Parameters set by group.knowns are retained as part of the object, so that when adding additional knowns (add.known and combine), or when subsetting a knowns database (see [.TRAMPknowns, aka TRAMPindexing), the same grouping parameters will be used.

Value

For group.knowns.TRAMPknowns, a new TRAMPknowns object. The cluster.pars element will have been updated with new parameters, if any were specified.

For group.knowns.TRAMP, a new TRAMP object, with an updated knowns element. Note that the original TRAMPknowns object (i.e. the one from which the TRAMP object was constructed) will not be modified.

Warning

Warning about missing data: where NA values occur in certain combinations, NAs may be present in the final distance matrix, and hclust then cannot generate the clusters. In general, NA values are fine; problems arise when two knowns have no enzyme/primer combination with data in common, because dist then returns NA for that pair.
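
The failure can be reproduced with base R alone. This is a sketch on a hypothetical toy matrix, not on TRAMPR data: `k1` and `k3` have no column in which both have a value.

```r
## Toy matrix: k1 and k3 have no enzyme/primer combination in common.
profiles <- rbind(k1=c(100, 250,  NA),
                  k2=c(101, 250, 310),
                  k3=c( NA,  NA, 311))

d <- dist(profiles, method="maximum")
has.na <- any(is.na(d))   # the k1-k3 distance cannot be computed

## hclust cannot work with missing distances; capture the error
## rather than stopping.
res <- tryCatch(hclust(d), error=function(e) e)
failed <- inherits(res, "error")
c(has.na=has.na, failed=failed)
```

The other two pairs still get a distance, because each shares at least one non-missing column.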

References

Avis PG, Dickie IA, Mueller GM 2006. A ‘dirty’ business: testing the limitations of terminal restriction fragment length polymorphism (TRFLP) analysis of soil fungi. Molecular Ecology 15: 873-882.

Examples

data(demo.knowns)
data(demo.samples)

demo.knowns <- group.knowns(demo.knowns, cut.height=2.5)
plot(demo.knowns)

## Increasing cut.height makes groups more inclusive:
plot(group.knowns(demo.knowns, cut.height=100))

res <- TRAMP(demo.samples, demo.knowns)
m1.ungrouped <- summary(res)
m1.grouped <- summary(res, group=TRUE)
ncol(m1.grouped) # 94 groups

res2 <- group.knowns(res, cut.height=100)
m2.ungrouped <- summary(res2)
m2.grouped <- summary(res2, group=TRUE)
ncol(m2.grouped) # Now only 38 groups

## group.knowns results in the same distance matrix as produced by
## TRAMP, therefore using the same method (e.g. method="maximum") is
## important.  The example below shows how the matrix produced by
## dist(summary(x)) (as calculated by group.knowns) is the same as that
## produced by TRAMP:
f <- function(x, method="maximum") {
  ## Create a pseudo-samples object from our knowns
  y <- x
  y$data$height <- 1
  names(y$info)[names(y$info) == "knowns.pk"] <- "sample.pk"
  names(y$data)[names(y$data) == "knowns.fk"] <- "sample.fk"
  class(y) <- "TRAMPsamples"

  ## Run TRAMP, clean up and return
  ## (If method != "maximum", rescale the error to match that
  ## generated by dist()).
  z <- TRAMP(y, x, method=method)
  if ( method != "maximum" ) z$error <- z$error * z$n
  names(dimnames(z$error)) <- NULL
  z
}

g <- function(x, method="maximum")
  as.matrix(dist(summary(x), method=method))

all.equal(f(demo.knowns, "maximum")$error,   g(demo.knowns, "maximum"))
all.equal(f(demo.knowns, "euclidian")$error, g(demo.knowns, "euclidian"))
all.equal(f(demo.knowns, "manhattan")$error, g(demo.knowns, "manhattan"))

## However, TRAMP is over 100 times slower in this special case.
system.time(f(demo.knowns))
system.time(g(demo.knowns))

[Package TRAMPR version 1.0-10 Index]