pacbio {jackalope}

R Documentation

Create and write PacBio reads to FASTQ file(s).

Description

From either a reference genome or set of variant haplotypes, create PacBio reads and write them to FASTQ output file(s). I encourage you to cite the reference below in addition to jackalope if you use this function.

Usage

pacbio(obj,
       out_prefix,
       n_reads,
       chi2_params_s = c(0.01214, -5.12, 675, 48303.0732881,
                         1.4691051212330266),
       chi2_params_n = c(0.00189237136, 2.53944970, 5500),
       max_passes = 40,
       sqrt_params = c(0.5, 0.2247),
       norm_params = c(0, 0.2),
       prob_thresh = 0.2,
       ins_prob = 0.11,
       del_prob = 0.04,
       sub_prob = 0.01,
       min_read_length = 50,
       lognorm_read_length = c(0.200110276521, -10075.4363813,
                               17922.611306),
       custom_read_lengths = NULL,
       prob_dup = 0.0,
       haplotype_probs = NULL,
       sep_files = FALSE,
       compress = FALSE,
       comp_method = "bgzip",
       n_threads = 1L,
       read_pool_size = 100L,
       show_progress = FALSE,
       overwrite = FALSE)

Arguments

`obj`	Sequencing object of class `ref_genome` or `haplotypes`.
`out_prefix`	Prefix for the output file(s), including the entire path but not the file extension.
`n_reads`	Number of reads you want to create.
`chi2_params_s`	Vector containing the 5 parameters for the curve determining the scale parameter for the chi^2 distribution. Defaults to `c(0.01214, -5.12, 675, 48303.0732881, 1.4691051212330266)`.
`chi2_params_n`	Vector containing the 3 parameters for the function determining the n parameter for the chi^2 distribution. Defaults to `c(0.00189237136, 2.53944970, 5500)`.
`max_passes`	Maximal number of passes for one molecule. Defaults to `40`.
`sqrt_params`	Vector containing the 2 parameters for the square root function for the quality increase. Defaults to `c(0.5, 0.2247)`.
`norm_params`	Vector containing the 2 parameters for the normally distributed noise added to the quality-increase square root function. Defaults to `c(0, 0.2)`.
`prob_thresh`	Upper bound for the modified total error probability. Defaults to `0.2`.
`ins_prob`	Probability for insertions for reads with one pass. Defaults to `0.11`.
`del_prob`	Probability for deletions for reads with one pass. Defaults to `0.04`.
`sub_prob`	Probability for substitutions for reads with one pass. Defaults to `0.01`.
`min_read_length`	Minimum read length for lognormal distribution. Defaults to `50`.
`lognorm_read_length`	Vector containing the 3 parameters for lognormal read length distribution. Defaults to `c(0.200110276521, -10075.4363813, 17922.611306)`.
`custom_read_lengths`	Sample read lengths from a vector or column in a matrix; if a matrix, the second column specifies the sampling weights. If `NULL`, it samples read lengths from the lognormal distribution using parameters in `lognorm_read_length`. Defaults to `NULL`.
`prob_dup`	A single number indicating the probability of duplicates. Defaults to `0.0`.
`haplotype_probs`	Relative probability of sampling each haplotype. This is ignored if sequencing a reference genome. `NULL` results in all having the same probability. Defaults to `NULL`.
`sep_files`	Logical indicating whether to make separate files for each haplotype. This argument is coerced to `FALSE` if the `obj` argument is not a `haplotypes` object. Defaults to `FALSE`.
`compress`	Logical specifying whether or not to compress output file, or an integer specifying the level of compression, from 1 to 9. If `TRUE`, a compression level of `6` is used. Defaults to `FALSE`.
`comp_method`	Character specifying which type of compression to use if any is desired. Options include `"gzip"` and `"bgzip"`. This is ignored if `compress` is `FALSE`, and it throws an error if it's set to `"gzip"` when `n_threads > 1` (since I don't have a method to do gzip compression in parallel). Defaults to `"bgzip"`.
`n_threads`	The number of threads to use in processing. If `compress` is `TRUE` or `> 0` (indicating compressed output), setting `n_threads` to `2` or more makes this function first create the uncompressed file(s) using `n_threads` threads, then compress the file(s), also using `n_threads` threads. There is no speed increase if you try to use multiple threads to create compressed output on the fly, so that option is not included. If you want to be conservative with disk space (by not having an uncompressed file present even temporarily), set `n_threads` to `1`. Threads are NOT spread across chromosomes or haplotypes, so you don't need to think about these when choosing this argument's value. However, all threads write to the same file(s), so there are diminishing returns for providing many threads. This argument is ignored if the package was not compiled with OpenMP. Defaults to `1`.
`read_pool_size`	The number of reads to store before writing to disk. Increasing this number should improve speed but take up more memory. Defaults to `100`.
`show_progress`	Logical for whether to show a progress bar. Defaults to `FALSE`.
`overwrite`	Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist. Defaults to `FALSE`.
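
As an illustration of the `custom_read_lengths` argument described above, the following is a minimal sketch that builds a two-column matrix (read lengths, then sampling weights) in base R and passes it to `pacbio`. The lengths and counts are made-up values, the output prefix `pacbio_custom` is a hypothetical name, and normalizing the weights to sum to 1 is a cautious assumption rather than a documented requirement. The call to `pacbio` runs only if jackalope is installed.

```r
# Hypothetical empirical read lengths (bp) and how often each was observed.
lengths <- c(5000, 10000, 15000, 20000)
counts  <- c(10, 40, 35, 15)

# Two-column matrix: column 1 = read lengths, column 2 = sampling weights.
# Normalizing the weights is an assumption made here for safety.
read_len_mat <- cbind(lengths, counts / sum(counts))

# Only run the simulation if jackalope is available.
if (requireNamespace("jackalope", quietly = TRUE)) {
    rg <- jackalope::create_genome(10, 100e3, 100)
    jackalope::pacbio(rg, file.path(tempdir(TRUE), "pacbio_custom"),
                      n_reads = 100,
                      custom_read_lengths = read_len_mat,
                      overwrite = TRUE)
}
```

Passing a plain vector instead of a matrix should sample the supplied lengths without per-length weights, per the argument description.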

Value

Nothing is returned.

ID lines

The ID lines for FASTQ files are formatted as follows:

`@<genome name>-<chromosome name>-<starting position>-<strand>`

where `genome name` is always `REF` for reference genomes (as opposed to haplotypes).
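
The ID-line format above can be split back into its fields with a regular expression. The sketch below is only an illustration: `parse_read_id` is a hypothetical helper (not part of jackalope), the example ID line is made up, and it assumes that the genome name contains no hyphen and that the strand is whatever follows the final separator.

```r
# Parse an ID line of the form
#   @<genome name>-<chromosome name>-<starting position>-<strand>
# Assumptions: the genome name has no hyphen, the starting position is a
# run of digits, and the strand is the text after the last separator.
parse_read_id <- function(id_line) {
    pattern <- "^@([^-]+)-(.+)-([0-9]+)-(.+)$"
    m <- regmatches(id_line, regexec(pattern, id_line))[[1]]
    if (length(m) == 0) stop("ID line does not match the expected format")
    list(genome = m[2],
         chrom  = m[3],
         start  = as.numeric(m[4]),
         strand = m[5])
}

# Made-up ID line for a read simulated from a reference genome.
id <- parse_read_id("@REF-chrom1-1234-+")
```

Matching the position and strand from the end of the line lets chromosome names that themselves contain hyphens still parse, at the cost of assuming the genome name does not.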

References

Stöcker, B. K., J. Köster, and S. Rahmann. 2016. SimLoRD: simulation of long read data. Bioinformatics 32:2704–2706.

Examples


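# Create a random reference genome to sequence
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
# Write 100 simulated PacBio reads using the prefix <dir>/pacbio_reads
pacbio(rg, file.path(dir, "pacbio_reads"), n_reads = 100)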
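
The `compress`, `comp_method`, and `n_threads` arguments interact, so a sketch of a compressed, multi-threaded run may help. This is a hedged example: the prefix `pacbio_bgzip` is a hypothetical name, the names of the resulting files are not assumed (they are discovered with `list.files`), and `n_threads` is ignored if the package was not compiled with OpenMP. It uses `"bgzip"` because `"gzip"` throws an error when `n_threads > 1`.

```r
# Hypothetical output prefix inside the session's temporary directory.
out_prefix <- file.path(tempdir(TRUE), "pacbio_bgzip")
out_files <- character(0)

# Only run the simulation if jackalope is available.
if (requireNamespace("jackalope", quietly = TRUE)) {
    rg <- jackalope::create_genome(10, 100e3, 100)
    # With n_threads >= 2, uncompressed output is written first and
    # then compressed, so extra disk space is needed temporarily.
    jackalope::pacbio(rg, out_prefix, n_reads = 100,
                      compress = TRUE, comp_method = "bgzip",
                      n_threads = 2L, overwrite = TRUE)
    # Find whatever file(s) were written for this prefix.
    out_files <- list.files(dirname(out_prefix),
                            pattern = "^pacbio_bgzip",
                            full.names = TRUE)
}
```

If disk space is tight, setting `n_threads = 1` avoids creating the temporary uncompressed file(s).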

[Package jackalope version 1.1.5 Index]