| crossSetSimDisk {protr} | R Documentation | 
Parallel Protein Sequence Similarity Calculation Between Two Sets Based on Sequence Alignment (Disk-Based Version)
Description
Computes, in parallel, the alignment-based similarity between two sets of protein sequences. This version writes the partial results of each batch to disk and merges them at the end, which reduces memory usage.
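The batch-to-disk-and-merge idea described above can be sketched in base R. This is only an illustration, not the package's implementation: `cross_sim_disk_sketch()` and `toy_sim()` are hypothetical names, the toy score is a stand-in for an alignment score, and the loop runs sequentially rather than in parallel.

```r
# Hypothetical stand-in for an alignment-based score: the fraction of
# distinct letters shared by two strings. NOT what protr computes.
toy_sim <- function(a, b) {
  x <- unique(strsplit(a, "")[[1]])
  y <- unique(strsplit(b, "")[[1]])
  length(intersect(x, y)) / length(union(x, y))
}

# Hypothetical sketch: score all pairs batch by batch, cache each batch
# on disk, then merge the cached pieces into an n x m matrix.
cross_sim_disk_sketch <- function(list1, list2, batches = 1, path = tempdir()) {
  # Enumerate every (i, j) pair; i varies fastest (column-major order)
  idx <- expand.grid(i = seq_along(list1), j = seq_along(list2))
  n_pairs <- nrow(idx)
  # Assign consecutive pairs to roughly equal-sized batches
  batch_id <- sort(rep(seq_len(batches), length.out = n_pairs))
  files <- character(0)
  for (b in unique(batch_id)) {
    rows <- which(batch_id == b)
    scores <- mapply(
      function(i, j) toy_sim(list1[[i]], list2[[j]]),
      idx$i[rows], idx$j[rows]
    )
    # Offload this batch to disk so it need not stay in memory
    f <- file.path(path, sprintf("sketch_batch_%03d.rds", b))
    saveRDS(scores, f)
    files <- c(files, f)
    rm(scores)
  }
  # Merge the cached batches in order and clean up the cache files
  merged <- unlist(lapply(files, readRDS))
  unlink(files)
  matrix(merged, nrow = length(list1), ncol = length(list2))
}

a <- list("ACDE", "ACDF")
b <- list("ACDE", "WXYZ", "ACGH")
m <- cross_sim_disk_sketch(a, b, batches = 3)
dim(m)
```

In this sketch only one batch of scores is held in memory at a time during the computation; the full matrix is assembled once, at the merge step.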
Usage
crossSetSimDisk(
  protlist1,
  protlist2,
  cores = 2,
  batches = 1,
  path = tempdir(),
  verbose = FALSE,
  type = "local",
  submat = "BLOSUM62",
  gap.opening = 10,
  gap.extension = 4
)
Arguments
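| protlist1 | A length n list of protein sequences, each element a character string. | 
| protlist2 | A length m list of protein sequences, each element a character string. | 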
| cores | Integer. The number of CPU cores to use for parallel execution. Defaults to 2. | 
| batches | Integer. The number of batches to split the pairwise similarity computations into. This is useful when you have a large number of protein sequences and enough CPU cores, but not enough RAM to compute and hold all the pairwise similarities in a single batch. Defaults to 1. | 
| path | Directory for caching the results in each batch. Defaults to the temporary directory. | 
| verbose | Logical. Print the computation progress? Useful when batches > 1. Defaults to FALSE. | 
| type | Type of alignment. Defaults to "local". | 
| submat | Substitution matrix. Defaults to "BLOSUM62". | 
| gap.opening | The cost required to open a gap of any length in the alignment. Defaults to 10. | 
| gap.extension | The cost to extend the length of an existing gap by 1. Defaults to 4. | 
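As a worked illustration of how the two gap arguments combine, the sketch below assumes the affine convention in which a gap of length k costs gap.opening + k * gap.extension. That convention is an assumption here, not something this page states, and `gap_cost()` is a hypothetical helper, not part of protr.

```r
# Hypothetical helper. Assumed convention: a gap of length k costs
# gap.opening + k * gap.extension (affine gap penalty).
gap_cost <- function(k, gap.opening = 10, gap.extension = 4) {
  stopifnot(all(k >= 1))
  gap.opening + k * gap.extension
}

# Costs of gaps of length 1, 2, and 3 under the default settings
costs <- gap_cost(1:3)
costs
```

Under this assumption, raising gap.opening discourages starting new gaps, while raising gap.extension discourages long ones.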
Value
An n x m similarity matrix, where n and m are the lengths of protlist1 and protlist2, respectively.
Author(s)
Nan Xiao <https://nanx.me>
See Also
See crossSetSim for the in-memory version.
Examples
## Not run: 
# Be careful when testing this since it involves parallelization
# and might produce unpredictable results in some environments
library("protr") # provides readFASTA() and crossSetSimDisk()
library("Biostrings")
library("foreach")
library("doParallel")
s1 <- readFASTA(system.file("protseq/P00750.fasta", package = "protr"))[[1]]
s2 <- readFASTA(system.file("protseq/P08218.fasta", package = "protr"))[[1]]
s3 <- readFASTA(system.file("protseq/P10323.fasta", package = "protr"))[[1]]
s4 <- readFASTA(system.file("protseq/P20160.fasta", package = "protr"))[[1]]
s5 <- readFASTA(system.file("protseq/Q9NZP8.fasta", package = "protr"))[[1]]
set.seed(1010)
plist1 <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 100, replace = TRUE)])
plist2 <- as.list(c(s1, s2, s3, s4, s5)[sample(1:5, 100, replace = TRUE)])
psimmat <- crossSetSimDisk(
  plist1, plist2,
  cores = 2, batches = 10, verbose = TRUE,
  type = "local", submat = "BLOSUM62"
)
## End(Not run)