simPoolseq {poolHelper} | R Documentation |
Simulate Pool-seq data
Description
Simulates pooled sequencing data given a set of parameters and individual genotypes.
Usage
simPoolseq(
  genotypes,
  pools,
  pError,
  sError,
  mCov,
  vCov,
  min.minor,
  minimum = NA,
  maximum = NA
)
Arguments
genotypes |
a list of genotypes, where each entry is a matrix corresponding to a different locus. In each matrix, each column is a different SNP and each row is a different individual. Genotypes should be coded as 0, 1 or 2. |
pools |
a list with a vector containing the size (in number of diploid individuals) of each pool. Thus, if a population was sequenced using a single pool, the vector should contain only one entry. If a population was sequenced using two pools, each with 10 individuals, this vector should contain two entries and both will be 10. |
pError |
an integer representing the value of the error associated with DNA pooling. This value is related to the unequal contribution of both individuals and pools towards the total number of reads observed for a given population - the higher the value, the more unequal the individual and pool contributions. |
sError |
a numeric value giving the error rate associated with the sequencing and mapping process. This error rate is assumed to be symmetric: error(reference -> alternative) = error(alternative -> reference). This number should be between 0 and 1. |
mCov |
an integer that defines the mean depth of coverage to simulate. Please note that this represents the mean coverage across all sites. |
vCov |
an integer that defines the variance of the depth of coverage to simulate. Please note that this represents the variance of the coverage across all sites. |
min.minor |
an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have fewer minor-allele reads than this threshold will be removed from the data. |
minimum |
an optional integer representing the minimum coverage allowed. Sites where the population has a depth of coverage below this threshold are removed from the data. |
maximum |
an optional integer representing the maximum coverage allowed. Sites where the population has a depth of coverage above this threshold are removed from the data. |
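The input layout described above can be sketched in base R without simulating anything. This is a hand-built toy example, not output from run_scrm; the check that pool sizes add up to the number of individuals is an assumption made here for illustration and is not stated in this help page.

```r
# Toy input built by hand (hypothetical data, not produced by run_scrm()):
# one locus with 4 diploid individuals (rows) and 3 SNPs (columns), coded 0/1/2.
set.seed(1)
locus1 <- matrix(sample(0:2, size = 4 * 3, replace = TRUE), nrow = 4, ncol = 3)
genotypes <- list(locus1)

# One population sequenced as two pools of 2 individuals each.
pools <- list(c(2, 2))

# Sanity checks on the layout described in the Arguments section.
stopifnot(is.list(genotypes), all(sapply(genotypes, is.matrix)))
stopifnot(all(unlist(genotypes) %in% 0:2))

# Assumption for this sketch: pool sizes sum to the number of individuals.
stopifnot(sum(unlist(pools)) == nrow(genotypes[[1]]))
```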
Details
Note that this function allows for different combinations of parameters.
Thus, Pool-seq data can be simulated for a variety of parameters. For
instance, different mean depths of coverage can be used to simulate Pool-seq
data. It is also possible to simulate Pool-seq data using different pool
sizes (by changing the pools input) and different values of the
Pool-seq error parameter (pError).
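To get a feel for what a given pair of mCov and vCov values implies, the following base R sketch draws depths of coverage from a negative binomial distribution with that mean and variance. The choice of a negative binomial is an assumption made for this illustration; this help page does not state which distribution simPoolseq uses internally.

```r
mCov <- 100  # mean depth of coverage
vCov <- 250  # variance of the depth of coverage

# Negative binomial parameterised by mean (mu) and dispersion (size):
# variance = mu + mu^2 / size, so size = mu^2 / (variance - mu).
# This requires vCov > mCov.
size <- mCov^2 / (vCov - mCov)

set.seed(42)
depth <- rnbinom(n = 10000, mu = mCov, size = size)

# Empirical mean and variance should be close to mCov and vCov.
c(mean = mean(depth), variance = var(depth))
```

Changing mCov while keeping vCov above it shows how the spread of per-site coverage shifts with the mean.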
Value
a list with three named entries:
reference |
a list with one entry per locus. Each entry is a matrix with the number of reference allele reads. Each column represents a different site. |
alternative |
a list with one entry per locus. Each entry is a matrix with the number of alternative allele reads. Each column represents a different site. |
total |
a list with one entry per locus. Each entry is a matrix with the total depth of coverage. Each column represents a different site. |
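As a rough illustration of how the returned list might be used, the sketch below computes alternative-allele frequencies per site. The `pool` object is a hand-made stand-in with the structure described above, not real output of simPoolseq; the single row per matrix and the check that reference plus alternative reads equal the total are assumptions of this example.

```r
# Hypothetical stand-in for the returned list: one locus, three sites.
pool <- list(
  reference   = list(matrix(c(90, 40, 100), nrow = 1)),
  alternative = list(matrix(c(10, 60, 0), nrow = 1)),
  total       = list(matrix(c(100, 100, 100), nrow = 1))
)

# Alternative-allele frequency at each site, computed locus by locus.
freqs <- Map(function(alt, tot) alt / tot, pool$alternative, pool$total)
freqs[[1]]

# Assumed consistency check: reference + alternative reads equal total depth.
consistent <- mapply(
  function(ref, alt, tot) all(ref + alt == tot),
  pool$reference, pool$alternative, pool$total
)
stopifnot(all(consistent))
```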
Examples
# simulate genotypes for 100 individuals sampled at a single locus
genotypes <- run_scrm(nDip = 100, nloci = 1, theta = 5)
# simulate Pool-seq data assuming a coverage of 100x and two pools of 50 individuals each
simPoolseq(genotypes = genotypes, pools = list(c(50, 50)), pError = 100, sError = 0.001,
           mCov = 100, vCov = 250, min.minor = 0)
# simulate genotypes for 10 individuals sampled at 5 loci
genotypes <- run_scrm(nDip = 10, nloci = 5, theta = 5)
# simulate Pool-seq data assuming a coverage of 100x and a single pool of 10 individuals
simPoolseq(genotypes = genotypes, pools = list(10), pError = 100, sError = 0.001,
           mCov = 100, vCov = 250, min.minor = 0)