bClust {micropan}

R Documentation

Clustering sequences based on pairwise distances

Description

Sequences are clustered by hierarchical clustering based on a set of pairwise distances. The distances must take values between 0.0 and 1.0, and all pairs not listed are assumed to have distance 1.0.

Usage

bClust(dist.tbl, linkage = "complete", threshold = 0.75, verbose = TRUE)

Arguments

`dist.tbl`	A `tibble` with pairwise distances.
`linkage`	A text indicating what type of clustering to perform, either ‘⁠complete⁠’ (default), ‘⁠average⁠’ or ‘⁠single⁠’.
`threshold`	Specifies the maximum size of a cluster. Must be a distance, i.e. a number between 0.0 and 1.0.
`verbose`	Logical, turns on/off text output during computations.

Details

Computing clusters (gene families) is an essential step in many comparative studies. bClust will assign sequences into gene families by a hierarchical clustering approach. Since the number of sequences may be huge, a full all-against-all distance matrix will be impossible to handle in memory. However, most sequence pairs will have an ‘infinite’ distance between them, and only the pairs with a finite (smallish) distance need to be considered.

This function takes as input the distances in dist.tbl where only the relevant distances are listed. The columns ‘Query’ and ‘Hit’ contain tags identifying pairs of sequences. The column ‘Distance’ contains the distances, always a number from 0.0 to 1.0. Typically, this is the output from bDist. All pairs of sequences not listed are assumed to have distance 1.0, which is considered the ‘infinite’ distance. All sequences must be listed at least once in either column ‘Query’ or ‘Hit’ of the dist.tbl. This should pose no problem, since all sequences must have distance 0.0 to themselves, and should be listed with this distance once (‘Query’ and ‘Hit’ containing the same tag).
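As a minimal sketch of the input layout described above, the following builds a small distance table by hand. The sequence tags and distances are invented for illustration, and a plain data.frame is used so the snippet runs with base R only; converting it with tibble::as_tibble() before calling bClust is an assumption about how one would pass it on, which is why that call is left commented out.

```r
# Hypothetical sequence tags; real tags would typically come from bDist output.
tags <- c("seqA", "seqB", "seqC", "seqD")

# Self-distances (0.0), one row per sequence, so every tag is listed at least once.
self.tbl <- data.frame(Query = tags, Hit = tags, Distance = 0.0,
                       stringsAsFactors = FALSE)

# Only the finite (< 1.0) pairwise distances are listed.
# Pairs left out, e.g. seqB vs seqD, are treated as distance 1.0.
pair.tbl <- data.frame(Query    = c("seqA", "seqA", "seqC"),
                       Hit      = c("seqB", "seqC", "seqD"),
                       Distance = c(0.10, 0.60, 0.30),
                       stringsAsFactors = FALSE)

dist.df <- rbind(self.tbl, pair.tbl)

# Sanity checks matching the requirements stated in the text.
all.listed <- all(tags %in% c(dist.df$Query, dist.df$Hit))
in.range   <- all(dist.df$Distance >= 0 & dist.df$Distance <= 1)

# With the tibble and micropan packages installed, a call could look like this
# (assumption, not run here):
# clst <- bClust(tibble::as_tibble(dist.df), linkage = "complete", threshold = 0.75)
```

Running the two checks before clustering is a cheap way to catch sequences that were dropped from the table or distances that fall outside the 0.0 to 1.0 range.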

The ‘⁠linkage⁠’ defines the type of clusters produced. The ‘⁠threshold⁠’ indicates the size of the clusters. A ‘⁠single⁠’ linkage clustering means all members of a cluster have at least one other member of the same cluster within distance ‘⁠threshold⁠’ from itself. An ‘⁠average⁠’ linkage means all members of a cluster are within the distance ‘⁠threshold⁠’ from the center of the cluster. A ‘⁠complete⁠’ linkage means all members of a cluster are no more than the distance ‘⁠threshold⁠’ away from any other member of the same cluster.

Typically, ‘single’ linkage produces big clusters where members may differ a lot, since they are only required to be close to something, which is close to something, ..., which is close to some other member. On the other extreme, ‘complete’ linkage will produce small and tight clusters, since all must be similar to all. The ‘average’ linkage is between, but closer to ‘complete’ linkage. If you want the ‘threshold’ to specify directly the maximum distance tolerated between two members of the same gene family, you must use ‘complete’ linkage. The ‘single’ linkage is the fastest alternative to compute. Using ‘single’ linkage (note that the default is ‘complete’) and maximum ‘threshold’ (1.0) will produce the largest and fewest clusters possible.
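The effect of the three linkages can be illustrated on a toy example with base R. This sketch uses stats::hclust and cutree on a small dense matrix; it is not a description of how bClust is implemented (bClust is designed to avoid the full matrix), and treating the ‘threshold’ as the cut height is an assumption made for illustration. The tags and distances are invented.

```r
# Three hypothetical sequences forming a chain: A is close to B, B is close to C,
# but A and C are fairly far apart.
tags <- c("seqA", "seqB", "seqC")
D <- matrix(1.0, nrow = 3, ncol = 3, dimnames = list(tags, tags))
diag(D) <- 0.0
D["seqA", "seqB"] <- D["seqB", "seqA"] <- 0.1
D["seqB", "seqC"] <- D["seqC", "seqB"] <- 0.2
D["seqA", "seqC"] <- D["seqC", "seqA"] <- 0.5

threshold <- 0.3

# Build the tree with the given linkage and cut it at the threshold height.
cluster.at <- function(method) {
  tree <- hclust(as.dist(D), method = method)
  cutree(tree, h = threshold)
}

n.single   <- length(unique(cluster.at("single")))    # 1 cluster: the chain holds
n.average  <- length(unique(cluster.at("average")))   # 2 clusters: seqC splits off
n.complete <- length(unique(cluster.at("complete")))  # 2 clusters: seqA-seqC is 0.5 > 0.3
```

With ‘single’ linkage the chain seqA, seqB, seqC ends up in one cluster even though seqA and seqC are 0.5 apart, whereas ‘complete’ linkage keeps seqC separate because that pair exceeds the threshold.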

Value

The function returns a named vector of integers, indicating the cluster membership of every unique sequence from the ‘Query’ or ‘Hit’ columns of the input ‘dist.tbl’. The name of each element indicates the sequence. The numerical values have no meaning as such; they are simply categorical indicators of cluster membership.
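A short sketch of how such a named integer vector can be summarised with base R. The vector below is a hand-made stand-in for bClust output, with invented tags and cluster numbers, so nothing here depends on micropan being installed.

```r
# Hypothetical stand-in for the value returned by bClust.
clst <- c(seqA = 1L, seqB = 1L, seqC = 2L, seqD = 2L, seqE = 3L)

# Number of clusters (gene families).
n.clusters <- length(unique(clst))

# Number of sequences in each cluster.
cluster.sizes <- table(clst)

# Sequence tags grouped by cluster; list names are the cluster numbers as text.
members <- split(names(clst), clst)
```

Since the integers are only labels, comparisons between two clusterings should be based on which sequences share a cluster, not on the numbers themselves.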

Author(s)

Lars Snipen and Kristian Hovde Liland.

Examples

# Loading example BLAST distances
data(xmpl.bdist)

# Clustering with default settings
clst <- bClust(xmpl.bdist)
# Other settings, and verbose
clst <- bClust(xmpl.bdist, linkage = "average", threshold = 0.5, verbose = TRUE)

[Package micropan version 2.1 Index]