R: Clustering sequences based on domain sequence

dClust {micropan}

R Documentation

Clustering sequences based on domain sequence

Description

Proteins are clustered by their sequence of protein domains. A domain sequence is the ordered sequence of domains in the protein. All proteins with identical domain sequences are assigned to the same cluster.

Usage

dClust(hmmer.tbl)

Arguments

hmmer.tbl

A tibble of results from a hmmerScan against a domain database.

Details

A domain sequence is simply the ordered list of domains occurring in a protein. Not all proteins contain known domains, but those that do have one or more domains, and these can be ordered along the protein to form a sequence. Since domains can be more or less conserved, two proteins can differ considerably in their amino acid sequence and still share the same domains. Describing and grouping proteins by their domain sequence was proposed by Snipen & Ussery (2012) as an alternative to clusters based on pairwise alignments (see bClust). Domain sequence clusters are less influenced by gene prediction errors.

The input is a tibble of the type produced by readHmmer. Typically, it is the result of scanning proteins (using hmmerScan) against Pfam-A or any other HMMER3 database of protein domains. It is highly recommended that you remove overlapping hits in ‘hmmer.tbl’ before you pass it as input to dClust. Use the function hmmerCleanOverlap for this. Overlapping hits are in some cases real hits, but often the poorest of them are artifacts.
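The idea of a domain sequence can be illustrated with a small sketch. This is not the code used by dClust. It is a minimal illustration under assumptions: the toy table, its column names ‘Hit’ and ‘Start’, and the comma separator are made up for the example (only ‘Query’ is named on this page), and the dplyr package is assumed to be installed.

```r
library(dplyr)

# Toy table of domain hits. The columns Hit and Start are assumptions for
# this sketch; a real table from readHmmer may be organised differently.
toy.tbl <- tibble(
  Query = c("p1", "p1", "p2", "p2", "p3"),
  Hit   = c("domB", "domA", "domA", "domB", "domA"),
  Start = c(120, 10, 5, 90, 15)
)

# Order the domains along each protein and paste them into one string
dom.seq <- toy.tbl %>%
  arrange(Query, Start) %>%
  group_by(Query) %>%
  summarise(Domain.sequence = paste(Hit, collapse = ","), .groups = "drop")

# Proteins with identical strings get the same cluster number
cluster.info <- unique(dom.seq$Domain.sequence)
toy.clst <- match(dom.seq$Domain.sequence, cluster.info)
names(toy.clst) <- dom.seq$Query
attr(toy.clst, "cluster.info") <- cluster.info
```

Here p1 and p2 both carry domA followed by domB and therefore fall in the same cluster, while p3, which has domA only, forms a cluster of its own.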

Value

The output is a numeric vector with one element for each unique sequence in the ‘Query’ column of the input ‘hmmer.tbl’. Sequences with the same number belong to the same cluster. The name of each element identifies the sequence.

This vector also has an attribute called ‘cluster.info’ which is a character vector containing the domain sequences. The first element is the domain sequence for cluster 1, the second for cluster 2, etc. In this way you can, in addition to clustering the sequences, also see which domains the sequences of a particular cluster share.
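A result of this shape can be inspected with base R alone. The vector below is built by hand to mimic the structure described above. Its sequence names, domain names and the comma separator are hypothetical, not actual dClust output.

```r
# Hypothetical result, built by hand to mimic the described structure
demo.clst <- c(seqA = 1, seqB = 2, seqC = 1)
attr(demo.clst, "cluster.info") <- c("domA,domB", "domC")

# Domain sequence of every cluster, indexed by cluster number
info <- attr(demo.clst, "cluster.info")

# Names of the sequences in cluster 1
members <- names(demo.clst)[demo.clst == 1]

# Domain sequence of the cluster that seqB belongs to
seqB.domains <- info[demo.clst[["seqB"]]]

# Number of sequences in each cluster
sizes <- table(demo.clst)
```

Because the cluster numbers index directly into ‘cluster.info’, looking up the domains of any sequence is a single subsetting step.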

Author(s)

Lars Snipen and Kristian Hovde Liland.

References

Snipen, L., Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications to Escherichia coli. F1000Research, 1:19.

Examples

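# Packages used below (str_c is from stringr; %>% and bind_rows are from dplyr)
library(micropan)
library(dplyr)
library(stringr)

# HMMER3 result files in this package
hf <- file.path(path.package("micropan"), "extdata", 
                str_c("GID", 1:3, "_vs_microfam.hmm.txt.xz"))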

# We need to uncompress them first...
hmm.files <- tempfile(fileext = rep(".xz", length(hf)))
ok <- file.copy(from = hf, to = hmm.files)
hmm.files <- unlist(lapply(hmm.files, xzuncompress))

# Reading the HMMER3 results, cleaning overlaps...
hmmer.tbl <- NULL
for(i in seq_along(hmm.files)){
  readHmmer(hmm.files[i]) %>% 
    hmmerCleanOverlap() %>% 
    bind_rows(hmmer.tbl) -> hmmer.tbl
}

# The clustering
clst <- dClust(hmmer.tbl)

# ...and cleaning up the temporary files
ok <- file.remove(hmm.files)

[Package micropan version 2.1 Index]