dClust {micropan}    R Documentation
Clustering sequences based on domain sequence
Description
Proteins are clustered by their sequence of protein domains. A domain sequence is the ordered sequence of domains in the protein. All proteins with an identical domain sequence are assigned to the same cluster.
Usage
dClust(hmmer.tbl)
Arguments
hmmer.tbl
A tibble of the type produced by readHmmer, see Details.
Details
A domain sequence is simply the ordered list of domains occurring in a protein. Not all proteins
contain known domains, but those that do have one or more, and these can be ordered
to form a sequence. Since domains can be more or less conserved, two proteins can differ considerably in
their amino acid sequence and still share the same domains. Describing and grouping proteins by their
domain sequence was proposed by Snipen & Ussery (2012) as an alternative to clusters based on pairwise
alignments, see bClust. Domain sequence clusters are less influenced by gene prediction errors.
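The idea can be illustrated with a small base R sketch. This is not the package's implementation, only one way to group proteins by ordered domain lists. The toy table and its column names Hit and Start are assumptions made for the illustration; only the Query column is named in this help page.

```r
# Toy stand-in for a cleaned 'hmmer.tbl' (made-up data; the column
# names Hit and Start are assumptions for this illustration)
hits <- data.frame(
  Query = c("p1", "p1", "p2", "p2", "p3", "p4", "p4"),
  Hit   = c("domA", "domB", "domB", "domA", "domC", "domA", "domB"),
  Start = c(10, 150, 20, 200, 5, 12, 160),
  stringsAsFactors = FALSE
)

# Order the hits by start position so domains appear in protein order
hits <- hits[order(hits$Start), ]

# Collapse the ordered domains of each protein into one string
dom.seq <- tapply(hits$Hit, hits$Query, paste, collapse = ",")

# Proteins with identical domain sequences get the same cluster number
cluster.info <- unique(as.vector(dom.seq))
clustering <- setNames(match(as.vector(dom.seq), cluster.info), names(dom.seq))
```

Here p1 and p4 carry the same domains in the same order and end up in one cluster, while p2 has the same two domains in the opposite order and forms its own cluster.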
The input is a tibble of the type produced by readHmmer. Typically, it is the
result of scanning proteins (using hmmerScan) against Pfam-A or any other HMMER3 database
of protein domains. It is highly recommended that you remove overlapping hits in ‘hmmer.tbl’ before
you pass it as input to dClust. Use the function hmmerCleanOverlap for this.
Overlapping hits are in some cases real hits, but often the poorest of them are artifacts.
Value
The output is a numeric vector with one element for each unique sequence in the ‘Query’ column of the input ‘hmmer.tbl’. Sequences with the same number belong to the same cluster. The name of each element identifies the sequence.
This vector also has an attribute called ‘cluster.info’ which is a character vector containing the domain sequences. The first element is the domain sequence for cluster 1, the second for cluster 2, etc. In this way you can, in addition to clustering the sequences, also see which domains the sequences of a particular cluster share.
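As a rough illustration of how such a result could be inspected, the sketch below builds a mock vector by hand with the structure described above (named numeric vector plus a ‘cluster.info’ attribute). The sequence names and domain labels are invented for the example and are not output from dClust.

```r
# Hand-made mock of a dClust() result, for illustration only
clst <- c(seq1 = 1, seq2 = 2, seq3 = 1)
attr(clst, "cluster.info") <- c("domA,domB", "domC")

# Names of the sequences in each cluster
members <- split(names(clst), as.vector(clst))

# Domain sequence shared by the cluster that seq3 belongs to
dom.seq3 <- attr(clst, "cluster.info")[clst[["seq3"]]]

# Number of sequences per cluster
cluster.sizes <- table(as.vector(clst))
```

Indexing ‘cluster.info’ by the cluster number works because its first element describes cluster 1, its second cluster 2, and so on.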
Author(s)
Lars Snipen and Kristian Hovde Liland.
References
Snipen, L., Ussery, D.W. (2012). A domain sequence approach to pangenomics: Applications to Escherichia coli. F1000 Research, 1:19.
See Also
panPrep, hmmerScan, readHmmer, hmmerCleanOverlap, bClust.
Examples
# Attach the packages used below: str_c() is from stringr,
# %>% and bind_rows() are from dplyr
library(micropan)
library(dplyr)
library(stringr)

# HMMER3 result files in this package
hf <- file.path(path.package("micropan"), "extdata",
                str_c("GID", 1:3, "_vs_microfam.hmm.txt.xz"))

# We need to uncompress them first...
hmm.files <- tempfile(fileext = rep(".xz", length(hf)))
ok <- file.copy(from = hf, to = hmm.files)
hmm.files <- unlist(lapply(hmm.files, xzuncompress))

# Reading the HMMER3 results, cleaning overlaps...
hmmer.tbl <- NULL
for(i in 1:3){
  readHmmer(hmm.files[i]) %>%
    hmmerCleanOverlap() %>%
    bind_rows(hmmer.tbl) -> hmmer.tbl
}

# The clustering
clst <- dClust(hmmer.tbl)

# ...and cleaning...
ok <- file.remove(hmm.files)