pacbio {jackalope} | R Documentation |
Create and write PacBio reads to FASTQ file(s).
Description
From either a reference genome or set of variant haplotypes, create PacBio reads
and write them to FASTQ output file(s).
I encourage you to cite the reference below in addition to jackalope if you use this function.
Usage
pacbio(obj,
out_prefix,
n_reads,
chi2_params_s = c(0.01214, -5.12, 675, 48303.0732881,
1.4691051212330266),
chi2_params_n = c(0.00189237136, 2.53944970, 5500),
max_passes = 40,
sqrt_params = c(0.5, 0.2247),
norm_params = c(0, 0.2),
prob_thresh = 0.2,
ins_prob = 0.11,
del_prob = 0.04,
sub_prob = 0.01,
min_read_length = 50,
lognorm_read_length = c(0.200110276521, -10075.4363813,
17922.611306),
custom_read_lengths = NULL,
prob_dup = 0.0,
haplotype_probs = NULL,
sep_files = FALSE,
compress = FALSE,
comp_method = "bgzip",
n_threads = 1L,
read_pool_size = 100L,
show_progress = FALSE,
overwrite = FALSE)
Arguments
obj |
Sequencing object of class |
out_prefix |
Prefix for the output file(s), including entire path except for the file extension. |
n_reads |
Number of reads you want to create. |
chi2_params_s |
Vector containing the 5 parameters for the curve determining
the scale parameter for the chi^2 distribution.
Defaults to c(0.01214, -5.12, 675, 48303.0732881, 1.4691051212330266). |
chi2_params_n |
Vector containing the 3 parameters for the function
determining the n parameter for the chi^2 distribution.
Defaults to c(0.00189237136, 2.53944970, 5500). |
max_passes |
Maximum number of passes for one molecule.
Defaults to 40. |
sqrt_params |
Vector containing the 2 parameters for the square root
function for the quality increase.
Defaults to c(0.5, 0.2247). |
norm_params |
Vector containing the 2 parameters for the normally distributed
noise added to the quality increase square root function.
Defaults to c(0, 0.2). |
prob_thresh |
Upper bound for the modified total error probability.
Defaults to 0.2. |
ins_prob |
Probability for insertions for reads with one pass.
Defaults to 0.11. |
del_prob |
Probability for deletions for reads with one pass.
Defaults to 0.04. |
sub_prob |
Probability for substitutions for reads with one pass.
Defaults to 0.01. |
min_read_length |
Minimum read length for lognormal distribution.
Defaults to 50. |
lognorm_read_length |
Vector containing the 3 parameters for lognormal
read length distribution.
Defaults to c(0.200110276521, -10075.4363813, 17922.611306). |
custom_read_lengths |
Sample read lengths from a vector or column in a
matrix; if a matrix, the second column specifies the sampling weights.
If |
prob_dup |
A single number indicating the probability of duplicates.
Defaults to 0.0. |
haplotype_probs |
Relative probability of sampling each haplotype.
This is ignored if sequencing a reference genome.
|
sep_files |
Logical indicating whether to make separate files for each haplotype.
This argument is coerced to |
compress |
Logical specifying whether or not to compress output file, or
an integer specifying the level of compression, from 1 to 9.
If |
comp_method |
Character specifying which type of compression to use if any
is desired. Options include |
n_threads |
The number of threads to use in processing.
If |
read_pool_size |
The number of reads to store before writing to disk.
Increasing this number should improve speed but take up more memory.
Defaults to 100. |
show_progress |
Logical for whether to show a progress bar.
Defaults to FALSE. |
overwrite |
Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist. |
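The two ways of choosing read lengths described above (lognorm_read_length with min_read_length, or custom_read_lengths) can be sketched in base R. This is an illustration only, not jackalope's internal code. It assumes the three lognormal parameters are ordered (sigma, loc, scale) as in SimLoRD, and that lengths below the minimum are redrawn; neither is stated on this page. The helper sample_lognorm_lengths and the example lengths and weights are hypothetical.

```r
# Illustrative sketch only: this is NOT jackalope's internal code.
set.seed(1)

# Defaults copied from the Usage section.
# Assumption: the parameters are ordered (sigma, loc, scale), as in SimLoRD.
lognorm_read_length <- c(0.200110276521, -10075.4363813, 17922.611306)
min_read_length <- 50

# Hypothetical helper: draw n lengths from a shifted lognormal and
# redraw any that fall below the minimum (redrawing is an assumption).
sample_lognorm_lengths <- function(n, pars, min_len) {
    draw <- function(k) {
        pars[2] + rlnorm(k, meanlog = log(pars[3]), sdlog = pars[1])
    }
    lens <- draw(n)
    too_short <- lens < min_len
    while (any(too_short)) {
        lens[too_short] <- draw(sum(too_short))
        too_short <- lens < min_len
    }
    as.integer(round(lens))
}

lognorm_lens <- sample_lognorm_lengths(10, lognorm_read_length,
                                       min_read_length)

# custom_read_lengths as a two-column matrix:
# column 1 holds the lengths, column 2 the sampling weights.
custom_read_lengths <- cbind(c(1000, 5000, 10000), c(1, 2, 1))

# Weighted sampling that mimics the behavior described for this argument.
custom_lens <- sample(custom_read_lengths[, 1], size = 10, replace = TRUE,
                      prob = custom_read_lengths[, 2])
```

A matrix built this way could then be supplied as pacbio(..., custom_read_lengths = custom_read_lengths).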
Value
Nothing is returned.
ID lines
The ID lines in the FASTQ files are formatted as follows:
@<genome name>-<chromosome name>-<starting position>-<strand>
where genome name is always REF for reference genomes (as opposed to haplotypes).
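The ID format above can be split back into its four fields with a regular expression. The following base R sketch is a hypothetical helper, not part of jackalope. It assumes the strand is written as + or - and that genome names contain no hyphens; this page confirms neither. The example ID is made up for illustration.

```r
# Hypothetical helper, not part of jackalope.
# Assumptions: strand is "+" or "-", and genome names contain no hyphens.
parse_id_line <- function(id) {
    pattern <- "^@([^-]+)-(.+)-([0-9]+)-([+-])$"
    m <- regmatches(id, regexec(pattern, id))[[1]]
    if (length(m) == 0) stop("ID line does not match the expected format")
    list(genome = m[2],
         chromosome = m[3],
         start = as.integer(m[4]),
         strand = m[5])
}

# Made-up ID line in the documented layout.
parsed <- parse_id_line("@REF-chrom1-1234-+")
```

Anchoring the pattern on the trailing position and strand keeps chromosome names that themselves contain hyphens in one piece.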
References
Stöcker, B. K., J. Köster, and S. Rahmann. 2016. SimLoRD: simulation of long read data. Bioinformatics 32:2704–2706.
Examples
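library(jackalope)
# Simulate a reference genome: 10 chromosomes with a mean length of 100 kb
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
# Write 100 simulated PacBio reads using the prefix "pacbio_reads"
pacbio(rg, file.path(dir, "pacbio_reads"), n_reads = 100)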
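A variation on the example above, sketched with only arguments listed in the Usage section: it requests compressed output and allows existing files to be replaced. The prefix pacbio_reads_gz is made up for illustration. The names of the files that get written are not stated on this page, so the sketch lists whatever appears under the prefix and makes no assumption about the extension. The call is skipped if jackalope is not installed.

```r
# Sketch: the simulation is skipped entirely if jackalope is not installed.
out_prefix <- file.path(tempdir(TRUE), "pacbio_reads_gz")

if (requireNamespace("jackalope", quietly = TRUE)) {
    rg <- jackalope::create_genome(10, 100e3, 100)
    jackalope::pacbio(rg, out_prefix, n_reads = 100,
                      compress = TRUE,    # compress the FASTQ output
                      overwrite = TRUE)   # replace files from an earlier run
    # List whatever files were written with this prefix.
    print(list.files(dirname(out_prefix), pattern = "^pacbio_reads_gz"))
}
```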