extractPSSM {protr}R Documentation

Compute PSSM (Position-Specific Scoring Matrix) for given protein sequence

Description

This function calculates the PSSM (Position-Specific Scoring Matrix) derived by PSI-Blast for given protein sequence or peptides.

Usage

extractPSSM(
  seq,
  start.pos = 1L,
  end.pos = nchar(seq),
  psiblast.path = NULL,
  makeblastdb.path = NULL,
  database.path = NULL,
  iter = 5,
  silent = TRUE,
  evalue = 10L,
  word.size = NULL,
  gapopen = NULL,
  gapextend = NULL,
  matrix = "BLOSUM62",
  threshold = NULL,
  seg = "no",
  soft.masking = FALSE,
  culling.limit = NULL,
  best.hit.overhang = NULL,
  best.hit.score.edge = NULL,
  xdrop.ungap = NULL,
  xdrop.gap = NULL,
  xdrop.gap.final = NULL,
  window.size = NULL,
  gap.trigger = 22L,
  num.threads = 1L,
  pseudocount = 0L,
  inclusion.ethresh = 0.002
)

Arguments

seq

Character vector, as the input protein sequence.

start.pos

Optional integer denoting the start position of the fragment window. Default is 1, i.e. the first amino acid of the given sequence.

end.pos

Optional integer denoting the end position of the fragment window. Default is nchar(seq), i.e. the last amino acid of the given sequence.

psiblast.path

Character string indicating the path of the psiblast program. If NCBI Blast+ was previously installed in the operation system, the path will be automatically detected.

makeblastdb.path

Character string indicating the path of the makeblastdb program. If NCBI Blast+ was previously installed in the system, the path will be automatically detected.

database.path

Character string indicating the path of a reference database (a FASTA file).

iter

Number of iterations to perform for PSI-Blast.

silent

Logical. Whether the PSI-Blast running output should be shown or not (May not work on some Windows versions and PSI-Blast versions), default is TRUE.

evalue

Expectation value (E) threshold for saving hits. Default is 10.

word.size

Word size for wordfinder algorithm. An integer >= 2.

gapopen

Integer. Cost to open a gap.

gapextend

Integer. Cost to extend a gap.

matrix

Character string. The scoring matrix name (default is "BLOSUM62").

threshold

Minimum word score such that the word is added to the BLAST lookup table. A real value >= 0.

seg

Character string. Filter query sequence with SEG ("yes", "window locut hicut", or "no" to disable). Default is "no".

soft.masking

Logical. Apply filtering locations as soft masks? Default is FALSE.

culling.limit

An integer >= 0. If the query range of a hit is enveloped by that of at least this many higher-scoring hits, delete the hit. Incompatible with best.hit.overhang and best_hit_score_edge.

best.hit.overhang

Best Hit algorithm overhang value (A real value >= 0 and =< 0.5, recommended value: 0.1). Incompatible with culling_limit.

best.hit.score.edge

Best Hit algorithm score edge value (A real value >=0 and =< 0.5, recommended value: 0.1). Incompatible with culling_limit.

xdrop.ungap

X-dropoff value (in bits) for ungapped extensions.

xdrop.gap

X-dropoff value (in bits) for preliminary gapped extensions.

xdrop.gap.final

X-dropoff value (in bits) for final gapped alignment.

window.size

An integer >= 0. Multiple hits window size, To specify 1-hit algorithm, use 0.

gap.trigger

Number of bits to trigger gapping. Default is 22.

num.threads

Integer. Number of threads (CPUs) to use in the BLAST search. Default is 1.

pseudocount

Integer. Pseudo-count value used when constructing PSSM. Default is 0.

inclusion.ethresh

E-value inclusion threshold for pairwise alignments. Default is 0.002.

Details

For given protein sequences or peptides, PSSM represents the log-likelihood of the substitution of the 20 types of amino acids at that position in the sequence. Note that the output value is not normalized.

Value

The original PSSM, a numeric matrix which has end.pos - start.pos + 1 columns and 20 named rows.

Note

The function requires the makeblastdb and psiblast programs to be properly installed in the operation system or their paths provided.

The two command-line programs are included in the NCBI-BLAST+ software package. To install NCBI Blast+ for your operating system, see https://blast.ncbi.nlm.nih.gov/doc/blast-help/downloadblastdata.html for detailed instructions.

Ubuntu/Debian users can directly use the command sudo apt-get install ncbi-blast+ to install NCBI Blast+. For OS X users, download ncbi-blast- ... .dmg then install. For Windows users, download ncbi-blast- ... .exe then install.

Author(s)

Nan Xiao <https://nanx.me>

References

Altschul, Stephen F., et al. "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic acids research 25.17 (1997): 3389–3402.

Ye, Xugang, Guoli Wang, and Stephen F. Altschul. "An assessment of substitution scores for protein profile-profile comparison." Bioinformatics 27.24 (2011): 3356–3363.

Rangwala, Huzefa, and George Karypis. "Profile-based direct kernels for remote homology detection and fold recognition." Bioinformatics 21.23 (2005): 4239–4247.

See Also

extractPSSMFeature extractPSSMAcc

Examples

if (Sys.which("makeblastdb") == "" | Sys.which("psiblast") == "") {
  cat("Cannot find makeblastdb or psiblast. Please install NCBI Blast+ first")
} else {
  x <- readFASTA(system.file(
    "protseq/P00750.fasta",
    package = "protr"
  ))[[1]]
  dbpath <- tempfile("tempdb", fileext = ".fasta")
  invisible(file.copy(from = system.file(
    "protseq/Plasminogen.fasta",
    package = "protr"
  ), to = dbpath))

  pssmmat <- extractPSSM(seq = x, database.path = dbpath)

  dim(pssmmat) # 20 x 562 (P00750: length 562, 20 Amino Acids)
}

[Package protr version 1.7-2 Index]