R: Compute summary statistics from Pooled DNA sequencing

poolStats {poolABC}

R Documentation

Compute summary statistics from Pooled DNA sequencing

Description

This function combines all the necessary steps to simulate pooled sequencing data and compute summary statistics from that data.

Usage

poolStats(
  parameters,
  model,
  nDip,
  size,
  nLoci,
  nSites,
  mutrate,
  mean,
  variance,
  minimum,
  maximum,
  min.minor = NA,
  force = FALSE
)

Arguments

`parameters`	a named vector of parameters used to create the command line for the scrm package. Each entry of the vector is a different parameter and should be named after the corresponding parameter. The output of the `createParams` function is the intended input.
`model`	a character, either "2pops", "Single" or "Parallel", indicating which model should be simulated.
`nDip`	an integer representing the total number of diploid individuals to simulate. Note that scrm actually simulates haplotypes, so the number of simulated haplotypes is twice this number. Also note that this is the total number of diploid individuals, and this function distributes them equally among the simulated populations.
`size`	a list with one entry per population. Each entry should be a vector containing the size (in number of diploid individuals) of each pool. Thus, if a population was sequenced using a single pool, the vector should contain only one entry. If a population was sequenced using two pools, each with 10 individuals, this vector should contain two entries and both will be 10.
`nLoci`	an integer that represents how many independent loci should be simulated.
`nSites`	an integer that specifies how many base pairs scrm should simulate, i.e. how many sites per locus to simulate.
`mutrate`	a numeric value representing the mutation rate assumed for the simulations.
`mean`	an integer or a vector defining the mean of the negative binomial distribution from which the numbers of reads are drawn. It represents the mean coverage across all sites. If a vector is supplied, the function assumes that each entry of the vector is the mean for a different population.
`variance`	an integer or a vector defining the variance of the negative binomial distribution from which the numbers of reads are drawn. It represents the variance of the total coverage across all sites. If a vector is supplied, the function assumes that each entry of the vector is the variance for a different population.
`minimum`	an integer representing the minimum coverage allowed. Sites where any population has a depth of coverage below this threshold are removed from the data.
`maximum`	an integer representing the maximum coverage allowed. Sites where any population has a depth of coverage above this threshold are removed from the data.
`min.minor`	an integer representing the minimum allowed number of minor-allele reads. Sites that, across all populations, have fewer minor-allele reads than this threshold are removed from the data.
`force`	a logical value indicating whether the required number of loci should be enforced. The default is FALSE. If set to TRUE, additional loci are simulated so that enough loci are likely to remain after filtering to reach the required number.
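
The way `nDip`, `size`, `mean`, `variance`, `minimum` and `maximum` fit together can be illustrated with base R alone. The following is a minimal sketch, not the internal poolABC code: it assumes that the depth of coverage is drawn with `rnbinom` and converts the mean and variance to the `size` argument of `rnbinom` through the standard relation variance = mu + mu^2/size. The names `dipPerPop`, `nb_size`, `depth` and `keep`, and the choice of 1000 sites, are hypothetical.

```r
# Illustrative sketch only; not the internal poolABC implementation.
set.seed(1)

nDip  <- 200   # total diploid individuals across all populations
nPops <- 2     # the "2pops" model has two populations

# Individuals are split equally among populations; scrm simulates haplotypes.
dipPerPop <- nDip / nPops   # 100 diploid individuals per population
nHap      <- 2 * nDip       # 400 haplotypes in total

# One list entry per population; each entry holds the pool sizes (in diploids).
size <- rep(list(rep(50, 2)), nPops)
stopifnot(all(sapply(size, sum) == dipPerPop))

# Per-population mean and variance of the depth of coverage.
mean_cov <- c(100, 80)
var_cov  <- c(200, 180)

# For rnbinom(mu, size): variance = mu + mu^2 / size, so
# size = mu^2 / (variance - mu). This requires variance > mean.
nb_size <- mean_cov^2 / (var_cov - mean_cov)

# Draw a depth of coverage for 1000 sites in each population (one column each).
nSites <- 1000
depth <- mapply(function(mu, k) rnbinom(nSites, mu = mu, size = k),
                mean_cov, nb_size)

# Keep only sites where every population lies within [minimum, maximum].
minimum <- 10
maximum <- 150
keep <- rowSums(depth < minimum | depth > maximum) == 0
depth_filtered <- depth[keep, , drop = FALSE]
```

With these settings a site is dropped as soon as one population falls outside the allowed range, which mirrors the description of `minimum` and `maximum` above.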

Details

The sampled parameter values are incorporated into a command line for the scrm package. Then, genetic data is simulated according to a model of ecotype formation and the sampled parameters. Finally, various summary statistics are calculated from the simulated data.

Value

a list with several named entries. The number of entries depends on the chosen model.

`nPoly`	numeric, mean number of polymorphic sites across all simulated loci.
`nFilter`	numeric, mean number of polymorphic sites retained after filtering across all simulated loci.
`nLoci`	numeric, total number of loci retained after filtering. Summary statistics are calculated for these loci.
`Sf`	numeric, fraction of sites fixed between populations. For the model with two populations, this is a single value. For the four-population models, this includes three values: the first is the fraction of fixed sites between the two populations in the first location, the second value is between the populations in the second location and the third value is the overall fraction of fixed sites, obtained by comparing each population against the other three.
`Sx`	numeric, fraction of exclusive sites per population. When running the model with two populations, this entry has two values - one per population. For the four-population models, there is also one value per population, followed by a fifth value representing the fraction of sites that are segregating in only one of the populations.
`SS`	numeric values representing the fraction of sites shared between populations. For the model with two populations, this is a single value. When running one of the four-population models, this entry has three values. The first is the fraction of shared sites between the two populations in the first location, the second value is between the populations in the second location and the third value is the fraction of shared sites across all four populations.
`Mean_Het`	numeric, expected heterozygosity within each population. This entry has two values for the two-population model and four values for the four-population models.
`SD_Het`	numeric, standard deviation of the expected heterozygosity for each population. This entry has two values for the two-population model and four values for the four-population models.
`Mean_HetBet`	numeric, mean heterozygosity between all pairs of populations. For the two populations model, this is a single value representing the heterozygosity between the two populations. For the four-population models, this entry includes six values. The first value is the heterozygosity between the first and the second population, the second value is between the first and the third population, the third value is between the first and fourth population, the fourth value is between the second and third populations, the fifth value is between the second and fourth population and the sixth value is between the third and fourth populations.
`SD_HetBet`	numeric, standard deviation of the mean heterozygosity between all pairs of populations. For the two populations model, this is a single value representing the standard deviation of heterozygosity between the two populations. When running one of the four-population models, this entry includes six values. The order of those entries is the same as for `Mean_HetBet`.
`Mean_FST`	numeric, mean pairwise FST between populations. For the two populations model, this is a single value representing the mean FST between the two populations. For the four-population models, this entry includes six values. The first value is the mean FST between the first and second populations, the second is between the first and third population, the third is between the second and third populations, the fourth is between the first and fourth populations, the fifth value is between the second and fourth populations and the sixth is between the third and fourth populations.
`SD_FST`	numeric, standard deviation of the mean pairwise FST between populations. For the two populations model, this is a single value representing the standard deviation of the FST between the two populations. When running one of the four-population models, this entry includes six values. The order of those entries is the same as for `Mean_FST`.
`FSTQ1`	numeric, the 5% quantile of the mean pairwise FST distribution. For the two populations model, this is a single value representing the 5% quantile of the FST between the two populations. For the four-population models, this entry includes six values. The order of those entries is the same as for `Mean_FST`.
`FSTQ2`	numeric, the 95% quantile of the mean pairwise FST distribution. For the two populations model, this is a single value representing the 95% quantile of the FST between the two populations. For the four-population models, this entry includes six values. The order of those entries is the same as for `Mean_FST`.
`Dstat`	numeric, value of the D-statistic for various combinations of populations. This entry only exists if a four-population model was selected. It includes three different values. For the first value, P1 was the W ecotype in the first location, P2 was the W ecotype in the second location and P3 was the C ecotype in the first location. For the second value, P1 was again the W ecotype in the first location, but P2 was the C ecotype in the second location and P3 was the C ecotype in the first location. For the third value, P1 was also the W ecotype in the first location, P2 was the C ecotype in the first location and P3 was the W ecotype in the second location. For all combinations, P4 was assumed to be an outgroup fixed, at all sites, for the major allele.
`SD_dstat`	numeric, standard deviation of D-statistic for various combinations of populations. This entry only exists if a four-population model was selected. Each entry is the standard deviation of the corresponding D-statistic in the `Dstat` entry.
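
The ordering of the six pairwise values and the role of P1, P2, P3 and P4 in the D-statistic may be easier to follow with a small base R sketch. It rests on assumptions: the allele frequencies are made up, the within-population heterozygosity uses the usual 2p(1 - p) expression for a biallelic site, and the D-statistic uses the common frequency-based ABBA-BABA form. None of this is confirmed to be the exact computation performed by `poolStats`, and `dstat`, `freqs` and `pair_labels` are hypothetical names. The labels follow the order documented for `Mean_HetBet`; the entry for `Mean_FST` above lists its third and fourth values in a different order.

```r
# Illustrative sketch only; the allele frequencies below are made up.

# Pairs in the order documented for Mean_HetBet: 1-2, 1-3, 1-4, 2-3, 2-4, 3-4.
pairs <- combn(4, 2)
pair_labels <- paste(pairs[1, ], pairs[2, ], sep = "-")

# Minor-allele frequencies at five sites (rows) in P1, P2 and P3 (columns).
freqs <- cbind(P1 = c(0.10, 0.00, 0.50, 0.20, 0.05),
               P2 = c(0.30, 0.10, 0.40, 0.00, 0.25),
               P3 = c(0.60, 0.20, 0.10, 0.30, 0.50))

# Expected heterozygosity within each population: mean of 2p(1 - p) over sites.
mean_het <- colMeans(2 * freqs * (1 - freqs))

# Frequency-based ABBA-BABA D-statistic. P4 is an outgroup fixed for the major
# allele, so its minor-allele frequency is 0 and the (1 - p4) factor equals 1.
dstat <- function(p1, p2, p3) {
  abba <- (1 - p1) * p2 * p3
  baba <- p1 * (1 - p2) * p3
  sum(abba - baba) / sum(abba + baba)
}

D <- dstat(freqs[, "P1"], freqs[, "P2"], freqs[, "P3"])
```

Under this formulation D is bounded between -1 and 1 and equals 0 when P1 and P2 have identical allele frequencies at every site.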

Examples

# create a vector of parameters for a model with two populations
parameters <- createParams(Nref = c(25000, 25000), ratio = c(0.1, 3), pool = c(5, 250),
                           seq = c(0.0001, 0.001), split = c(0, 3), CW = c(1e-13, 1e-3),
                           WC = c(1e-13, 1e-3), bT = c(0, 0.2), model = "2pops")

# simulate a two populations model:
# note that we are using two pools for each population, each with 50 individuals
poolStats(parameters = parameters, model = "2pops", nDip = 200,
          size = rep(list(rep(50, 2)), 2), nLoci = 100, nSites = 2000, mutrate = 2e-8,
          mean = c(100, 80), variance = c(200, 180), minimum = 10, maximum = 150,
          min.minor = 1)

[Package poolABC version 1.0.0 Index]