illumina {jackalope}

R Documentation

Create and write Illumina reads to FASTQ file(s).

Description

From either a reference genome or set of variant haplotypes, create Illumina reads from error profiles and write them to FASTQ output file(s). I encourage you to cite the reference below in addition to jackalope if you use this function.

Usage

illumina(obj,
         out_prefix,
         n_reads,
         read_length,
         paired,
         frag_mean = 400,
         frag_sd = 100,
         matepair = FALSE,
         seq_sys = NULL,
         profile1 = NULL,
         profile2 = NULL,
         ins_prob1 = 0.00009,
         del_prob1 = 0.00011,
         ins_prob2 = 0.00015,
         del_prob2 = 0.00023,
         frag_len_min = NULL,
         frag_len_max = NULL,
         haplotype_probs = NULL,
         barcodes = NULL,
         prob_dup = 0.02,
         sep_files = FALSE,
         compress = FALSE,
         comp_method = "bgzip",
         n_threads = 1L,
         read_pool_size = 1000L,
         show_progress = FALSE,
         overwrite = FALSE)

Arguments

`obj`	Sequencing object of class `ref_genome` or `haplotypes`.
`out_prefix`	Prefix for the output file(s), including entire path except for the file extension.
`n_reads`	Number of reads you want to create.
`read_length`	Length of reads.
`paired`	Logical for whether to use paired-end reads. This argument is changed to `TRUE` if `matepair` is `TRUE`.
`frag_mean`	Mean of the Gamma distribution that generates fragment sizes. Defaults to `400`.
`frag_sd`	Standard deviation of the Gamma distribution that generates fragment sizes. Defaults to `100`.
`matepair`	Logical for whether to simulate mate-pair reads. Defaults to `FALSE`.
`seq_sys`	Full or abbreviated name of sequencing system to use. See "Sequencing systems" section for options. See "Sequencing profiles" section for more information on how this argument, `profile1`, and `profile2` are used to specify profiles. Defaults to `NULL`.
`profile1`	Custom profile file for read 1. See "Sequencing profiles" section for more information on how this argument, `profile2`, and `seq_sys` are used to specify profiles. Defaults to `NULL`.
`profile2`	Custom profile file for read 2. See "Sequencing profiles" section for more information on how this argument, `profile1`, and `seq_sys` are used to specify profiles. Defaults to `NULL`.
`ins_prob1`	Insertion probability for read 1. Defaults to `0.00009`.
`del_prob1`	Deletion probability for read 1. Defaults to `0.00011`.
`ins_prob2`	Insertion probability for read 2. Defaults to `0.00015`.
`del_prob2`	Deletion probability for read 2. Defaults to `0.00023`.
`frag_len_min`	Minimum fragment size. A `NULL` value results in the read length. Defaults to `NULL`.
`frag_len_max`	Maximum fragment size. A `NULL` value results in `2^32-1`, the maximum allowed value. Defaults to `NULL`.
`haplotype_probs`	Relative probability of sampling each haplotype. This is ignored if sequencing a reference genome. `NULL` results in all having the same probability. Defaults to `NULL`.
`barcodes`	Character vector of barcodes for each haplotype, or a single barcode if sequencing a reference genome. `NULL` results in no barcodes. Defaults to `NULL`.
`prob_dup`	A single number indicating the probability of duplicates. Defaults to `0.02`.
`sep_files`	Logical indicating whether to make separate files for each haplotype. This argument is coerced to `FALSE` if the `obj` argument is not a `haplotypes` object. Defaults to `FALSE`.
`compress`	Logical specifying whether or not to compress output file, or an integer specifying the level of compression, from 1 to 9. If `TRUE`, a compression level of `6` is used. Defaults to `FALSE`.
`comp_method`	Character specifying which type of compression to use if any is desired. Options include `"gzip"` and `"bgzip"`. This is ignored if `compress` is `FALSE`, and it throws an error if it's set to `"gzip"` when `n_threads > 1` (since I don't have a method to do gzip compression in parallel). Defaults to `"bgzip"`.
`n_threads`	The number of threads to use in processing. If `compress` is `TRUE` or `> 0` (indicating compressed output), setting `n_threads` to `2` or more makes this function first create the uncompressed file(s) using `n_threads` threads, then compress them, also using `n_threads` threads. There is no speed increase if you try to use multiple threads to create compressed output on the fly, so that option is not included. If you want to be conservative with disk space (by not having an uncompressed file present even temporarily), set `n_threads` to `1`. Threads are NOT spread across chromosomes or haplotypes, so you don't need to think about these when choosing this argument's value. However, all threads write to the same file(s), so there are diminishing returns for providing many threads. This argument is ignored if the package was not compiled with OpenMP. Defaults to `1`.
`read_pool_size`	The number of reads to store before writing to disk. Increasing this number should improve speed but take up more memory. Defaults to `1000`.
`show_progress`	Logical for whether to show a progress bar. Defaults to `FALSE`.
`overwrite`	Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist. Defaults to `FALSE`.
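
The `frag_mean`, `frag_sd`, `frag_len_min`, and `frag_len_max` arguments together describe a bounded Gamma distribution of fragment sizes. The base-R sketch below shows one way to picture this: it converts a mean and standard deviation to Gamma shape and rate by moment matching, then discards draws outside the bounds. The helper names `gamma_params` and `sample_frag_sizes` are hypothetical and not part of jackalope, and rejection sampling is an assumption here; the package's internal sampling may differ.

```r
# Hypothetical helpers (not part of jackalope) illustrating the
# fragment-size arguments with base R only.

# Convert a mean and SD to Gamma shape/rate by moment matching:
# mean = shape / rate, variance = shape / rate^2.
gamma_params <- function(frag_mean, frag_sd) {
  list(shape = (frag_mean / frag_sd)^2,
       rate = frag_mean / frag_sd^2)
}

# Draw n fragment sizes, keeping only those inside the bounds.
# ASSUMPTION: out-of-range draws are rejected and redrawn.
sample_frag_sizes <- function(n, frag_mean = 400, frag_sd = 100,
                              read_length = 100,
                              frag_len_min = NULL, frag_len_max = NULL) {
  # NULL bounds fall back to the documented defaults.
  if (is.null(frag_len_min)) frag_len_min <- read_length
  if (is.null(frag_len_max)) frag_len_max <- 2^32 - 1
  p <- gamma_params(frag_mean, frag_sd)
  out <- numeric(0)
  while (length(out) < n) {
    x <- round(rgamma(n, shape = p$shape, rate = p$rate))
    out <- c(out, x[x >= frag_len_min & x <= frag_len_max])
  }
  out[seq_len(n)]
}

p <- gamma_params(400, 100)   # the documented defaults
set.seed(1)
sizes <- sample_frag_sizes(1e4)
summary(sizes)
```

With the default mean of `400` and standard deviation of `100`, moment matching gives a shape of 16 and a rate of 0.04, so very few draws fall below a 100 bp read length.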

Value

Nothing is returned.

Sequencing profiles

This section outlines how to use the seq_sys, profile1, and profile2 arguments. If all three are NULL (their defaults), a sequencing system is chosen based on the read length. If one or more of them is provided, which ones you may provide depends on whether you want single- or paired-end reads.

For single-end reads

profile2 should be NULL.
Only seq_sys or profile1 should be provided, not both.

For paired-end reads

If providing seq_sys, don't provide either profile1 or profile2.
If providing profile1, you must also provide profile2 (they can be the same if you want) and you cannot provide seq_sys.
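
The rules above can be restated as a small checker. This is a sketch in base R only: `check_profile_args` is a hypothetical helper, not a jackalope function, and the strings it returns are illustrative labels rather than anything the package produces.

```r
# Hypothetical helper (not part of jackalope): restates the rules for
# combining seq_sys, profile1, and profile2 as executable checks.
check_profile_args <- function(paired, seq_sys = NULL,
                               profile1 = NULL, profile2 = NULL) {
  has_sys <- !is.null(seq_sys)
  has_p1 <- !is.null(profile1)
  has_p2 <- !is.null(profile2)

  # All NULL: a system is chosen from the read length.
  if (!has_sys && !has_p1 && !has_p2) {
    return("system chosen from read length")
  }

  if (!paired) {
    # Single-end: profile2 must be NULL; seq_sys XOR profile1.
    if (has_p2) stop("single-end reads: profile2 should be NULL")
    if (has_sys && has_p1) {
      stop("single-end reads: provide seq_sys or profile1, not both")
    }
    return(if (has_sys) "built-in profile" else "custom profile")
  }

  # Paired-end: seq_sys alone, or profile1 and profile2 together.
  if (has_sys) {
    if (has_p1 || has_p2) {
      stop("paired-end reads: don't combine seq_sys with custom profiles")
    }
    return("built-in profile")
  }
  if (!(has_p1 && has_p2)) {
    stop("paired-end reads: profile1 and profile2 must both be provided")
  }
  "custom profiles"
}

check_profile_args(paired = TRUE, seq_sys = "HS25")
check_profile_args(paired = TRUE, profile1 = "p.txt", profile2 = "p.txt")
```

The file name `"p.txt"` is a placeholder; the second call only shows that the same profile may be passed for both reads.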

Sequencing systems

Sequencing system options are the following, where, for each system, "name" is the full name, "abbrev" is the abbreviated name, "max_len" indicates the maximum allowed read length, and "paired" indicates whether paired-end sequencing is allowed.

name	abbrev	max_len	paired
Genome Analyzer I	GA1	44	Yes
Genome Analyzer II	GA2	75	Yes
HiSeq 1000	HS10	100	Yes
HiSeq 2000	HS20	100	Yes
HiSeq 2500	HS25	150	Yes
HiSeqX v2.5 PCR free	HSXn	150	Yes
HiSeqX v2.5 TruSeq	HSXt	150	Yes
MiniSeq TruSeq	MinS	50	No
MiSeq v1	MSv1	250	Yes
MiSeq v3	MSv3	250	Yes
NextSeq 500 v2	NS50	75	Yes
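
To see which systems in the table fit a given run, the table can be copied into a data frame and filtered. This is a sketch in base R: `seq_systems` and `compatible_systems` are hypothetical names, not jackalope objects, and the filter only applies the "max_len" and "paired" columns. It does not reproduce how the package picks a default system.

```r
# The table above as a data frame (hypothetical object, not exported
# by jackalope).
seq_systems <- data.frame(
  name = c("Genome Analyzer I", "Genome Analyzer II", "HiSeq 1000",
           "HiSeq 2000", "HiSeq 2500", "HiSeqX v2.5 PCR free",
           "HiSeqX v2.5 TruSeq", "MiniSeq TruSeq", "MiSeq v1",
           "MiSeq v3", "NextSeq 500 v2"),
  abbrev = c("GA1", "GA2", "HS10", "HS20", "HS25", "HSXn", "HSXt",
             "MinS", "MSv1", "MSv3", "NS50"),
  max_len = c(44L, 75L, 100L, 100L, 150L, 150L, 150L, 50L, 250L,
              250L, 75L),
  paired = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE,
             TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Abbreviations of systems that allow this read length and, when
# paired = TRUE, also allow paired-end sequencing.
compatible_systems <- function(read_length, paired) {
  ok <- seq_systems$max_len >= read_length &
    (seq_systems$paired | !paired)
  seq_systems$abbrev[ok]
}

compatible_systems(200, paired = TRUE)
compatible_systems(50, paired = FALSE)
```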

ID lines

The ID lines for FASTQ files are formatted as follows:

@<genome name>-<chromosome name>-<starting position>-<strand>[/<read#>]

where the part in [] is only for paired-end Illumina reads, and where <genome name> is always REF for reference genomes (as opposed to haplotypes).
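
When working with the output, an ID line in this format can be split back into its fields. The sketch below uses base R only. `parse_id` is a hypothetical helper, and it rests on assumptions not stated above: that the strand is written as `+` or `-`, that the read number is `1` or `2`, and that the genome name contains no hyphen. The example ID is made up for illustration.

```r
# Hypothetical helper (not part of jackalope) that splits an ID line
# into its fields.
# ASSUMPTIONS: strand is "+" or "-", read number is 1 or 2, and the
# genome name contains no "-".
parse_id <- function(id) {
  read_num <- NA_integer_
  # Optional "/<read#>" suffix, present only for paired-end reads.
  if (grepl("/[12]$", id)) {
    read_num <- as.integer(sub("^.*/([12])$", "\\1", id))
    id <- sub("/[12]$", "", id)
  }
  pattern <- "^@([^-]+)-(.+)-([0-9]+)-([+-])$"
  m <- regmatches(id, regexec(pattern, id))[[1]]
  if (length(m) == 0) stop("unrecognized ID line: ", id)
  list(genome = m[2],
       chrom = m[3],
       start = as.numeric(m[4]),
       strand = m[5],
       read = read_num)
}

# Made-up ID line for illustration.
parsed <- parse_id("@REF-chrom0-1234-+/1")
str(parsed)
```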

References

Huang, W., L. Li, J. R. Myers, and G. T. Marth. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594.

Examples


rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
illumina(rg, paste0(dir, "/illumina_reads"), n_reads = 100,
         read_length = 100, paired = FALSE)
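
A paired-end variant of the example above might look like the following sketch. It uses only arguments documented on this page; the choice of `"HS25"` and the output prefix are arbitrary. Because the exact output file names are not described here, the files are listed rather than assumed, and the code is guarded so that it does nothing when jackalope is not installed.

```r
fastq_files <- character(0)
if (requireNamespace("jackalope", quietly = TRUE)) {
  library(jackalope)
  rg <- create_genome(10, 100e3, 100)
  out_dir <- tempdir(TRUE)
  # Paired-end reads with a built-in profile: seq_sys is given, so
  # profile1 and profile2 are left at NULL.
  illumina(rg, file.path(out_dir, "illumina_pe"), n_reads = 100,
           read_length = 100, paired = TRUE, seq_sys = "HS25",
           overwrite = TRUE)
  # List whatever files were written with this prefix.
  fastq_files <- list.files(out_dir, pattern = "^illumina_pe")
  print(fastq_files)
}
```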

[Package jackalope version 1.1.5 Index]