extractPSSM {protr} | R Documentation |
Compute PSSM (Position-Specific Scoring Matrix) for given protein sequence
Description
This function calculates the PSSM (Position-Specific Scoring Matrix) derived by PSI-Blast for given protein sequence or peptides.
Usage
extractPSSM(
seq,
start.pos = 1L,
end.pos = nchar(seq),
psiblast.path = NULL,
makeblastdb.path = NULL,
database.path = NULL,
iter = 5,
silent = TRUE,
evalue = 10L,
word.size = NULL,
gapopen = NULL,
gapextend = NULL,
matrix = "BLOSUM62",
threshold = NULL,
seg = "no",
soft.masking = FALSE,
culling.limit = NULL,
best.hit.overhang = NULL,
best.hit.score.edge = NULL,
xdrop.ungap = NULL,
xdrop.gap = NULL,
xdrop.gap.final = NULL,
window.size = NULL,
gap.trigger = 22L,
num.threads = 1L,
pseudocount = 0L,
inclusion.ethresh = 0.002
)
Arguments
seq |
Character vector, as the input protein sequence. |
start.pos |
Optional integer denoting the start position of the
fragment window. Default is |
end.pos |
Optional integer denoting the end position of the
fragment window. Default is |
psiblast.path |
Character string indicating the path of the
|
makeblastdb.path |
Character string indicating the path of the
|
database.path |
Character string indicating the path of a reference database (a FASTA file). |
iter |
Number of iterations to perform for PSI-Blast. |
silent |
Logical. Whether the PSI-Blast running output
should be shown or not (May not work on some Windows versions and
PSI-Blast versions), default is |
evalue |
Expectation value (E) threshold for saving hits.
Default is |
word.size |
Word size for wordfinder algorithm. An integer >= 2. |
gapopen |
Integer. Cost to open a gap. |
gapextend |
Integer. Cost to extend a gap. |
matrix |
Character string. The scoring matrix name
(default is |
threshold |
Minimum word score such that the word is added to the BLAST lookup table. A real value >= 0. |
seg |
Character string. Filter query sequence with SEG ( |
soft.masking |
Logical. Apply filtering locations as soft masks?
Default is |
culling.limit |
An integer >= 0. If the query range of a hit is
enveloped by that of at least this many higher-scoring hits,
delete the hit. Incompatible with |
best.hit.overhang |
Best Hit algorithm overhang value
(A real value >= 0 and =< 0.5, recommended value: 0.1).
Incompatible with |
best.hit.score.edge |
Best Hit algorithm score edge value
(A real value >=0 and =< 0.5, recommended value: 0.1).
Incompatible with |
xdrop.ungap |
X-dropoff value (in bits) for ungapped extensions. |
xdrop.gap |
X-dropoff value (in bits) for preliminary gapped extensions. |
xdrop.gap.final |
X-dropoff value (in bits) for final gapped alignment. |
window.size |
An integer >= 0. Multiple hits window size,
To specify 1-hit algorithm, use |
gap.trigger |
Number of bits to trigger gapping. Default is |
num.threads |
Integer. Number of threads (CPUs) to use in the
BLAST search. Default is |
pseudocount |
Integer. Pseudo-count value used when constructing PSSM.
Default is |
inclusion.ethresh |
E-value inclusion threshold for pairwise alignments.
Default is |
Details
For given protein sequences or peptides, PSSM represents the log-likelihood of the substitution of the 20 types of amino acids at that position in the sequence. Note that the output value is not normalized.
Value
The original PSSM, a numeric matrix which has
end.pos - start.pos + 1
columns and 20
named rows.
Note
The function requires the makeblastdb
and psiblast
programs
to be properly installed in the operation system or
their paths provided.
The two command-line programs are included in the NCBI-BLAST+ software package. To install NCBI Blast+ for your operating system, see https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html for detailed instructions.
Ubuntu/Debian users can directly use the command
sudo apt-get install ncbi-blast+
to install NCBI Blast+.
For OS X users, download ncbi-blast- ... .dmg
then install.
For Windows users, download ncbi-blast- ... .exe
then install.
Author(s)
Nan Xiao <https://nanx.me>
References
Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic acids research 25.17 (1997): 3389–3402.
Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356–3363.
Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239–4247.
See Also
extractPSSMFeature extractPSSMAcc
Examples
if (Sys.which("makeblastdb") == "" | Sys.which("psiblast") == "") {
cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+ first")
} else {
x <- readFASTA(system.file(
"protseq/P00750.fasta",
package = "protr"
))[[1]]
dbpath <- tempfile("tempdb", fileext = ".fasta")
invisible(file.copy(from = system.file(
"protseq/Plasminogen.fasta",
package = "protr"
), to = dbpath))
pssmmat <- extractPSSM(seq = x, database.path = dbpath)
dim(pssmmat) # 20 x 562 (P00750: length 562, 20 Amino Acids)
}