illumina {jackalope} | R Documentation |
Create and write Illumina reads to FASTQ file(s).
Description
From either a reference genome or set of variant haplotypes, create Illumina reads
from error profiles and write them to FASTQ output file(s).
I encourage you to cite the reference below in addition to jackalope
if you use
this function.
Usage
illumina(obj,
out_prefix,
n_reads,
read_length,
paired,
frag_mean = 400,
frag_sd = 100,
matepair = FALSE,
seq_sys = NULL,
profile1 = NULL,
profile2 = NULL,
ins_prob1 = 0.00009,
del_prob1 = 0.00011,
ins_prob2 = 0.00015,
del_prob2 = 0.00023,
frag_len_min = NULL,
frag_len_max = NULL,
haplotype_probs = NULL,
barcodes = NULL,
prob_dup = 0.02,
sep_files = FALSE,
compress = FALSE,
comp_method = "bgzip",
n_threads = 1L,
read_pool_size = 1000L,
show_progress = FALSE,
overwrite = FALSE)
Arguments
obj |
Sequencing object of class |
out_prefix |
Prefix for the output file(s), including entire path except for the file extension. |
n_reads |
Number of reads you want to create. |
read_length |
Length of reads. |
paired |
Logical for whether to use paired-end reads.
This argument is changed to |
frag_mean |
Mean of the Gamma distribution that generates fragment sizes.
Defaults to |
frag_sd |
Standard deviation of the Gamma distribution that generates
fragment sizes.
Defaults to |
matepair |
Logical for whether to simulate mate-pair reads.
Defaults to |
seq_sys |
Full or abbreviated name of sequencing system to use.
See "Sequencing systems" section for options.
See "Sequencing profiles" section for more information on how this argument,
|
profile1 |
Custom profile file for read 1.
See "Sequencing profiles" section for more information on how this argument,
|
profile2 |
Custom profile file for read 2.
See "Sequencing profiles" section for more information on how this argument,
|
ins_prob1 |
Insertion probability for read 1. Defaults to |
del_prob1 |
Deletion probability for read 1. Defaults to |
ins_prob2 |
Insertion probability for read 2. Defaults to |
del_prob2 |
Deletion probability for read 2. Defaults to |
frag_len_min |
Minimum fragment size. A |
frag_len_max |
Maximum fragment size.
A |
haplotype_probs |
Relative probability of sampling each haplotype.
This is ignored if sequencing a reference genome.
|
barcodes |
Character vector of barcodes for each haplotype, or a single barcode
if sequencing a reference genome. |
prob_dup |
A single number indicating the probability of duplicates.
Defaults to |
sep_files |
Logical indicating whether to make separate files for each haplotype.
This argument is coerced to |
compress |
Logical specifying whether or not to compress output file, or
an integer specifying the level of compression, from 1 to 9.
If |
comp_method |
Character specifying which type of compression to use if any
is desired. Options include |
n_threads |
The number of threads to use in processing.
If |
read_pool_size |
The number of reads to store before writing to disk.
Increasing this number should improve speed but take up more memory.
Defaults to |
show_progress |
Logical for whether to show a progress bar.
Defaults to |
overwrite |
Logical for whether to overwrite existing FASTQ file(s) of the same name, if they exist. |
Value
Nothing is returned.
Sequencing profiles
This section outlines how to use the seq_sys
, profile1
,
and profile2
arguments.
If all arguments are NULL
(their defaults), a sequencing system is chosen
based on the read length.
If, however, one or more arguments has been provided, then how they're provided
should depend on whether you want single- or paired-end reads.
For single-end reads
-
profile2
should beNULL
. Only
seq_sys
orprofile1
should be provided, not both.
For paired-end reads
If providing
seq_sys
, don't provide eitherprofile1
orprofile2
.If providing
profile1
, you must also provideprofile2
(they can be the same if you want) and you cannot provideseq_sys
.
Sequencing systems
Sequencing system options are the following, where, for each system, "name" is the full name, "abbrev" is the abbreviated name, "max_len" indicates the maximum allowed read length, and "paired" indicates whether paired-end sequencing is allowed.
name | abbrev | max_len | paired |
Genome Analyzer I | GA1 | 44 | Yes |
Genome Analyzer II | GA2 | 75 | Yes |
HiSeq 1000 | HS10 | 100 | Yes |
HiSeq 2000 | HS20 | 100 | Yes |
HiSeq 2500 | HS25 | 150 | Yes |
HiSeqX v2.5 PCR free | HSXn | 150 | Yes |
HiSeqX v2.5 TruSeq | HSXt | 150 | Yes |
MiniSeq TruSeq | MinS | 50 | No |
MiSeq v1 | MSv1 | 250 | Yes |
MiSeq v3 | MSv3 | 250 | Yes |
NextSeq 500 v2 | NS50 | 75 | Yes |
ID lines
The ID lines for FASTQ files are formatted as such:
@<genome name>-<chromosome name>-<starting position>-<strand>[/<read#>]
where the part in []
is only for paired-end Illumina reads, and where genome name
is always REF
for reference genomes (as opposed to haplotypes).
References
Huang, W., L. Li, J. R. Myers, and G. T. Marth. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28:593–594.
Examples
rg <- create_genome(10, 100e3, 100)
dir <- tempdir(TRUE)
illumina(rg, paste0(dir, "/illumina_reads"), n_reads = 100,
read_length = 100, paired = FALSE)