kdistance {kmer}    R Documentation

K-mer distance matrix computation.

Description

Computes the matrix of pairwise k-mer distances between all sequences in a set.

Usage

kdistance(x, k = 5, method = "edgar", residues = NULL, gap = "-",
  compress = TRUE, ...)

Arguments

x

a matrix of aligned sequences or a list of unaligned sequences. Accepted modes are "character" and "raw" (the latter being applicable for "DNAbin" and "AAbin" objects).

k

integer giving the k-mer size to be used for calculating the distance matrix. Defaults to 5. Note that high values of k can be slow to compute and memory-intensive, because the number of possible k-mers grows as the alphabet size raised to the power k. This is particularly noticeable when the residue alphabet is large.

method

a character string giving the k-mer distance measure to be used. The available options are "edgar" (default; see Edgar (2004) for details) and the standard methods offered by the "dist" function in the stats package ("euclidean", "maximum", "manhattan", "canberra", "binary" and "minkowski").

residues

either NULL (default; the residue alphabet is automatically detected from the sequences), a case-sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. Specifying the residue alphabet is therefore recommended unless x is a "DNAbin" or "AAbin" object.

gap

the character used to represent gaps in the alignment matrix (if applicable). Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise.

compress

logical indicating whether to compress AAbin sequences using the Dayhoff(6) alphabet for k-mer sizes exceeding 4. Defaults to TRUE to avoid memory overflow and excessive computation time.

...

further arguments to be passed to "as.dist".

Details

This function computes the n * n k-mer distance matrix (where n is the number of sequences), returning an object of class "dist". DNA and amino acid sequences can be passed to the function either as a list of non-aligned sequences or as a matrix of aligned sequences, preferably in the "DNAbin" or "AAbin" raw-byte format (Paradis et al. 2004; Paradis 2012; see the ape package documentation for more information on these S3 classes). Character sequences are also supported, but ambiguity codes may not be recognized or treated appropriately. In the raw-byte formats, by contrast, ambiguity codes are counted according to their underlying residue frequencies (e.g. the 5-mer "ACRGT" would contribute 0.5 to the tally for "ACAGT" and 0.5 to that of "ACGGT").
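As a rough illustration of what a k-mer distance matrix is, the sketch below counts overlapping k-mers in base R and passes the count matrix to dist() from the stats package. It is not the implementation used by kdistance: count_kmers() is a hypothetical helper written for this example, the three toy sequences are made up, and no length normalization or "edgar" transformation is applied.

```r
# Illustrative sketch only: count overlapping k-mers in base R and pass the
# counts to stats::dist(). This is NOT the kmer package's implementation.

# count_kmers() is a hypothetical helper, not part of the kmer package
count_kmers <- function(s, k, alphabet = c("A", "C", "G", "T")) {
  # enumerate all length(alphabet)^k possible k-mers in a fixed order
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  kmers <- do.call(paste0, grid)
  # extract every overlapping window of width k
  starts <- seq_len(nchar(s) - k + 1)
  windows <- substring(s, starts, starts + k - 1)
  # tabulate against the full k-mer set so every sequence gets the same columns
  counts <- table(factor(windows, levels = kmers))
  setNames(as.integer(counts), kmers)
}

# toy sequences (invented for this example)
seqs <- c(s1 = "ACGTACGT", s2 = "ACGTTCGT", s3 = "TTTTACGT")

# one row per sequence, one column per possible 2-mer
counts <- t(sapply(seqs, count_kmers, k = 2))

# pairwise distances between the count vectors, returned as a "dist" object
d <- dist(counts, method = "manhattan")
```

The count matrix has one column per possible k-mer, so its width is the alphabet size raised to the power k. This is why large k and large alphabets become expensive.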

To minimize computation time when counting longer k-mers (k > 3), amino acid sequences in the raw "AAbin" format are automatically compressed using the Dayhoff-6 alphabet as detailed in Edgar (2004). Note that amino acid sequences will not be compressed if they are supplied as a list of character vectors rather than an "AAbin" object, in which case the k-mer length should be reduced (k < 4) to avoid excessive memory use and computation time.
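To give a feel for why compression helps, the following base R sketch recodes a character amino acid string into the six Dayhoff classes (AGPST, C, DENQ, FWY, HKR, ILMV) and compares the size of the k-mer space before and after. The helper dayhoff6() and the class labels A to F are assumptions made for this example; the package itself operates on raw "AAbin" bytes and may encode the classes differently.

```r
# Illustrative sketch only: recode amino acids into the six Dayhoff classes
# with base R. The class labels A-F are arbitrary and chosen for this example.

# dayhoff6() is a hypothetical helper, not part of the kmer package
dayhoff6 <- function(s) {
  # classes: AGPST -> A, C -> B, DENQ -> C, FWY -> D, HKR -> E, ILMV -> F
  # chartr() maps all characters simultaneously, so labels are not re-mapped
  chartr("AGPSTCDENQFWYHKRILMV", "AAAAABCCCCDDDEEEFFFF", s)
}

# toy peptide (invented for this example)
compressed <- dayhoff6("MKVLAAGIC")

# number of distinct k-mers that must be tallied when k = 5
full_size <- 20^5        # 20-letter amino acid alphabet
compressed_size <- 6^5   # 6-letter Dayhoff alphabet
```

With k = 5 the compressed alphabet needs 7776 counters per sequence rather than 3.2 million, which is the kind of saving the compress argument is meant to provide.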

Value

an object of class "dist".

Author(s)

Shaun Wilkinson

References

Edgar RC (2004) Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research, 32, 380-385.

Paradis E, Claude J, Strimmer K (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20, 289-290.

Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.

See Also

kcount for k-mer counting, and mbed for leaner distance matrices

Examples

  ## compute a k-mer distance matrix for the woodmouse
  ## dataset (ape package) using a k-mer size of 5
  library(ape)
  data(woodmouse)
  ### trim the poorly resolved ends of the global alignment by
  ### dropping every column that contains an N (raw byte 0xf0)
  woodmouse <- woodmouse[, apply(woodmouse, 2, function(v) !any(v == 0xf0))]
  ### compute the distance matrix
  woodmouse.dist <- kdistance(woodmouse, k = 5)
  ### cluster and plot UPGMA tree
  woodmouse.tree <- as.dendrogram(hclust(woodmouse.dist, "average"))
  plot(woodmouse.tree)

[Package kmer version 1.1.2 Index]