| multidog {updog} | R Documentation |
Fit flexdog to multiple SNPs.
Description
This is a convenience function that will run flexdog over many SNPs.
Support is provided for parallel computing through the doParallel package.
This function has not been extensively tested. Please report any bugs to
https://github.com/dcgerard/updog/issues.
Usage
multidog(
refmat,
sizemat,
ploidy,
model = c("norm", "hw", "bb", "s1", "s1pp", "f1", "f1pp", "flex", "uniform", "custom"),
nc = 1,
p1_id = NULL,
p2_id = NULL,
bias_init = exp(c(-1, -0.5, 0, 0.5, 1)),
prior_vec = NULL,
...
)
Arguments
refmat |
A matrix of reference read counts. The columns index
the individuals and the rows index the markers (SNPs). This matrix must have
rownames (for the names of the markers) and column names (for the names
of the individuals). These names must match the names in |
sizemat |
A matrix of total read counts. The columns index
the individuals and the rows index the markers (SNPs). This matrix must have
rownames (for the names of the markers) and column names (for the names
of the individuals). These names must match the names in |
ploidy |
The ploidy of the species. Assumed to be the same for each individual. |
model |
What form should the prior (genotype distribution) take? See Details for possible values. |
nc |
The number of computing cores to use when doing parallelization
on your local machine. See the section "Parallel Computation" for how
to implement more complicated evaluation strategies using the
When you are specifying other evaluation strategies using the
The value of |
p1_id |
The ID of the first parent. This should be a character of
length 1. This should correspond to a single column name in |
p2_id |
The ID of the second parent. This should be a character of
length 1. This should correspond to a single column name in |
bias_init |
A vector of initial values for the bias parameter
over the multiple runs of |
prior_vec |
The pre-specified genotype distribution. Only used if
|
... |
Additional parameters to pass to |
Details
You should format your reference counts and total read counts in two separate matrices. The rows should index the markers (SNPs) and the columns should index the individuals. Row names are how we ID the SNPs and column names are how we ID the individuals, and so they are required attributes.
If your data are in VCF files, I would recommend importing them using the VariantAnnotation package from Bioconductor https://bioconductor.org/packages/VariantAnnotation/. It's a great VCF parser.
See the details of flexdog for the possible values of
model.
If model = "f1", model = "s1", model = "f1pp"
or model = "s1pp" then the user may
provide the individual ID for parent(s) via the p1_id
and p2_id arguments.
The output is a list containing two data frames. The first data frame,
called snpdf, contains information on each SNP, such as the allele bias
and the sequencing error rate. The second data frame, called inddf,
contains information on each individual at each SNP, such as the estimated
genotype and the posterior probability of being classified correctly.
SNPs that contain 0 reads (or all missing data) are entirely removed.
Value
A list-like object of two data frames.
snpdfA data frame containing properties of the SNPs (markers). The rows index the SNPs. The variables include:
snpThe name of the SNP (marker).
biasThe estimated allele bias of the SNP.
seqThe estimated sequencing error rate of the SNP.
odThe estimated overdispersion parameter of the SNP.
prop_misThe estimated proportion of individuals misclassified in the SNP.
num_iterThe number of iterations performed during the EM algorithm for that SNP.
llikeThe maximum marginal likelihood of the SNP.
ploidyThe provided ploidy of the species.
modelThe provided model for the prior genotype distribution.
p1refThe user-provided reference read counts of parent 1.
p1sizeThe user-provided total read counts of parent 1.
p2refThe user-provided reference read counts of parent 2.
p2sizeThe user-provided total read counts of parent 2.
Pr_kThe estimated frequency of individuals with genotype k, where k can be any integer between 0 and the ploidy level.
- Model specific parameter estimates
See the return value of
parin the help page offlexdog.
inddfA data frame containing the properties of the individuals at each SNP. The variables include:
snpThe name of the SNP (marker).
indThe name of the individual.
refThe provided reference counts for that individual at that SNP.
sizeThe provided total counts for that individual at that SNP.
genoThe posterior mode genotype for that individual at that SNP. This is the estimated reference allele dosage for a given individual at a given SNP.
postmeanThe posterior mean genotype for that individual at that SNP. This is a continuous genotype estimate of the reference allele dosage for a given individual at a given SNP.
maxpostprobThe maximum posterior probability. This is the posterior probability that the individual was genotyped correctly.
Pr_kThe posterior probability that a given individual at a given SNP has genotype k, where k can vary from 0 to the ploidy level of the species.
logL_kThe genotype log-likelihoods for dosage k for a given individual at a given SNP, where k can vary f rom 0 to the ploidy level of the species.
Parallel Computation
The multidog() function supports parallel computing. It does
so through the future
package.
If you are just running multidog() on a local machine, then you
can use the nc argument to specify the parallelization. Any value
of nc greater than 1 will result in multiple background R sessions to
genotype all of the SNPs. The maximum value of nc you should
try can be found by running future::availableCores(). Running
multidog() using nc is equivalent to setting the future
plan with future::plan(future::multisession, workers = nc).
Using the future package means that different evaluation strategies
are possible. In particular, if you are using a high performance machine,
you can explore using the
future.batchtools
package to evaluate multidog() using schedulers like Slurm
or TORQUE/PBS.
To use a different strategy, set nc = NA and then
run future::plan() prior to
running multidog(). For example, to set up forked R processes
on your current machine (instead of using background R sessions), you would
run (will not work on Windows):
future::plan(future::multicore), followed by
running multidog() with nc = NA. See the examples below.
Author(s)
David Gerard
See Also
flexdog():For the underlying genotyping function.
format_multidog():For converting the output of
multidog()to a matrix.filter_snp():For filtering SNPs using the output of
multidog().
Examples
## Not run:
data("uitdewilligen")
## Run multiple R sessions using the `nc` variable.
mout <- multidog(refmat = t(uitdewilligen$refmat),
sizemat = t(uitdewilligen$sizemat),
ploidy = uitdewilligen$ploidy,
nc = 2)
mout$inddf
mout$snpdf
## Run multiple external R sessions on the local machine.
## Note that we set `nc = NA`.
cl <- parallel::makeCluster(2, timeout = 60)
future::plan(future::cluster, workers = cl)
mout <- multidog(refmat = t(uitdewilligen$refmat),
sizemat = t(uitdewilligen$sizemat),
ploidy = uitdewilligen$ploidy,
nc = NA)
mout$inddf
mout$snpdf
## Close cluster and reset future to current R process
parallel::stopCluster(cl)
future::plan(future::sequential)
## End(Not run)