R: Estimate population allele frequency from the set of Cq...

freqpcr {freqpcr}

R Documentation

Estimate population allele frequency from the set of Cq measurements.

Description

The function estimates the population allele frequency using the dataset of Cq values measured over length(N) bulk samples, each of which has a sample size of N[i] as the number of individuals included. N[i] can be 1, meaning that every individual is analyzed separately. For the ith sample, the four Cq values were measured as housek0[i], target0[i], housek1[i], and target1[i]. The function can estimate up to five parameters simultaneously when the Cq sets are available for more than two (bulk) samples: P, K, targetScale, sdMeasure, and EPCR.
Since v0.3.2, user can also use an experimental ‘continuous model’ by specifying A instead of N. That is, each sample DNA is directly extracted from the environment and the sample allele ratio y follows y ~ Beta(apk, a(1-p)k) instead of y ~ Beta(mk, (n-m)k), m ~ Binomial(n, p), where p and k are the population allele frequency and the gamma shape parameter of the individual DNA yield, respectively. Each element of A, a is a scaling factor of relative DNA contents between the samples. The continuous model is likely when each sample directly comes from the environment e.g., water sampling in an eDNA analysis or cell culture in a petri dish.
Since v0.4.0, freqpcr() also works without specifying housek0 and target0 i.e., it can estimate population allele frequency from \DeltaCq values instead of \Delta\DeltaCq. In this setting, the sizes of targetScale and sdMeasure should be fixed in addition to EPCR and zeroAmount.

Usage

freqpcr(
  N,
  A,
  housek0,
  target0,
  housek1,
  target1,
  P = NULL,
  K = NULL,
  targetScale = NULL,
  sdMeasure = NULL,
  EPCR = 0.99,
  XInit0 = c(P = NULL, K = NULL, targetScale = NULL, sdMeasure = NULL, EPCR = NULL),
  zeroAmount = NULL,
  beta = TRUE,
  diploid = FALSE,
  pvalue = 0.05,
  gradtol = 1e-04,
  steptol = 1e-09,
  iterlim = 100,
  maxtime = 600,
  print.level = 1,
  ...
)

Arguments

`N`	Sample sizes as a numeric vector. `N[i]` signifies the number of individuals (both for haploidy and diploidy) contained in the ith bulk sample. `N` must not contain a missing value (`NA`). If `N` is not applicable (= even not 1), feed `A` instead of `N` and then the estimation process runs with the ‘continuous model’.
`A`	Use instead of `N` in the continuous model. This is a scale factor to control the relative DNA content between samples. `A[i]` can take any positive value, but must not be `NA`. Considering the case you have arranged each sample by e.g. water filrtation or extraction from a culture in a petri dish, it is convenient to define the unit size of `A[i]` == 1.0 to be same as the vessel volume (e.g. 2.0 for two petri dishs, 0.5 for half bottle of water, etc.). When neither `N` nor `A` is specified by the user, the function stops. If both `N` and `A` are specified, only `N` is evaluated.
`housek0`	A numeric vector. In RED-`\Delta\Delta`Cq method, `housek0` is the Cq values of the test sample without the restriction enzyme digestion, which is amplified with the primer set for a housekeeping gene. In general `\Delta\Delta`Cq analyses, `housek0` is defined for the control sample (typically, 100% mutant) solution, which is also amplified with the primer set for the housekeeping gene. Since `v0.4.0`, you can run `freqpcr()` without specifying `housek0` and `target0` (a `\Delta`Cq method). As this setting halves the effective data points, it is recommended to fix other parameters, especially `targetScale`. The four Cq arguments, `housek0`, `target0`, `housek1`, and `target1`, all must have the same data length. They also must be the same length as `N` or `A`. If the Cq dataset has missing values, they must be filled with NA so that the length of the data vectors will not differ.
`target0`	In RED-`\Delta\Delta`Cq method, `target0[i]` signifies the measured Cq value of the ith bulk sample without the digestion, for which both alleles, wild-type (S: susceptible) and mutant (R: resistant to a pesticide), on the target locus are amplified. In general `\Delta\Delta`Cq analyses, `target0` is the Cq values of the pure-S control sample, which is amplified with a R-allele-specific primer set.
`housek1`	The Cq values of the test sample measured on the housekeeping gene after the restriction enzyme digestion (in RED-`\Delta\Delta`Cq method), or the test sample amplified on the housekeeping gene (in general `\Delta\Delta`Cq analyses).
`target1`	For each test sample with unknown allele-ratio, `target1[i]` is defined as the Cq value for the target locus amplified after the restriction enzyme digestion (in RED-`\Delta\Delta`Cq method), or the target locus amplified with the R-allele-specific primer set (in general `\Delta\Delta`Cq analyses).
`P`	Scalar. Population allele frequency from which the test samples are derived. Default is `NULL` and to be estimated. If the parameter is known, it is given as a numeric between 0 and 1.
`K`	Scalar. The gamma shape parameter of the individual DNA yield. Default is `NULL` and to be estimated. If known, given as a positive numeric.
`targetScale`	(`\delta_{T}`) Scalar. The relative template DNA amount of the target locus to the houskeeping locus. If known, given as a positive numeric.
`sdMeasure`	(`\sigma_{c}`) Scalar. The measurement error (standard deviation) on each Cq value following Normal(0, `\sigma_{c}^2`). If known, given as a positive numeric.
`EPCR`	(`\eta`) Scalar. Amplification efficiency per PCR cycle. If known, given as a positive numeric. When `EPCR = 1`, template DNA doubles every cycle (`EPCR + 1 = 2`).
`XInit0`	Optionally the initial value for the parameter optimization can be specified, but it is strongly recommended to keep the argument as is. Unlike `XInit` in `knownqpcr()`, the argument is not directly passed to the optimizer; used only when each parameter is set unknown (the parameter is absent or specified as NULL).
`zeroAmount`	(In RED-`\Delta\Delta`Cq method) residue rate of restriction enzyme digestion, or (in general `\Delta\Delta`Cq analyses) small portion of the off-target allele on the target locus of the test sample, which will be amplified in the PCR. It needs to be always specified by the user as a number between 0 and 1, usually near 0.
`beta`	Whether to use the beta distribution to approximate the sample allele ratio instead of specifying individual gamma distribution for each of the allelic DNA amounts? Default is `TRUE`, which accelerates the calculation.
`diploid`	Is the target organism diploidy? Default is `FALSE`, assuming haploidy. Current implementation of diploidy assumes i.i.d. between the amounts of R and S chromosomes owned by a heterozygote individual, which is unlikely in many animals but necessary for the calculation in a realistic time.
`pvalue`	The two-sided confidence interval is calculated at the last iteration at given significance level. Default is 0.05, which returns the 95% Wald's CI (2.5 to 97.5 percentile) based on the Hessian matrix.
`gradtol`, `steptol`, `iterlim`	Control parameters passed to `nlm()`. `gradtol` and `steptol` are the positive scalars giving the tolerance to terminate the algorithm and the minimum allowable relative step length. `iterlim` specifies the maximum number of iterations to be performed before the program is terminated (and evaluated at the last iteration). Usually 30 iterations are enough.
`maxtime`	A positive scalar to set the maximum calculation time in seconds to abort the optimizer (and return error). The total calculation time largely depends on `N[i]`, the number of individuals contained in each bulk sample.
`print.level`	`print.level=1` (the default) shows the initial values of the parameters and likelihood as well as the output in the last iteration. `print.level=2` shows the parameter values and gradients in every step. `print.level=0` does not output any intermediate state to R console, simply returning the result summary.
`...`	Additional arguments passed to the function.

Value

Object of the S4 class CqFreq. The slot report is a matrix and each row contains the estimated parameter value with 100*(1-pvalue)% confidence interval. The following parameters are returned:

P, the population allele frequency from which the test samples are derived.
K, the gamma shape parameter of the individual DNA yield.
targetScale (\delta_{T}), the relative template DNA amount of the target to the houskeeping loci.
EPCR (\eta), the amplification efficiency per PCR cycle.
sdMeasure or "Cq measurement error" (\sigma_{c}).

Choise of the parameters to be estimated

Estimation is conducted only for parameters where the values are not specified or specified explicitly as NULL. If one feeds a value e.g. K=1 or sdMeasure=0.24, it is then treated as fixed parameter. Notwithstanding, EPCR is estimated only when EPCR = NULL is specified explicitly.
You must verify the size of EPCR and zeroAmount beforehand because they are not estimable from the experiments with unknown allele ratios. Although targetScale and sdMeasure are estimable within freqpcr(), it is better to feed the known values, especially when you have only a few bulk samples (length(N) <= 3). Fixing targetScale and sdMeasure is also strongly recommended when housek0 and target0 are absent (\DeltaCq method). The functions knownqpcr() or knownqpcr_unpaired(), depending on your data format, provide the procedure to estimate the sizes of the experimental parameters using the DNA solutions of known allele mixing ratios.
For the unknown parameters, XInit0 optionally specifies initial values for the optimization using nlm() though the use of internal default is strongly recommended. The specification as a fixed parameter has higher priority than the specification in XInit0. Every user-specified parameter values, fixed or unknown, must be given in linear scale (e.g. between 0 and 1 for the allele frequency); they are transformed internally to log or logit.

Examples


# Dummy Cq dataset with six bulk samples, each of which comprises of eight haploids.
EPCR <- 0.95; zeroAmount <- 0.0016; # True values for the two must be known.
P <- 0.75
dmy_cq <- make_dummy(   rand.seed=1, P=P, K=2, ntrap=6, npertrap=8,
                        scaleDNA=1e-07, targetScale=1.5, baseChange=0.3,
                        EPCR=EPCR, zeroAmount=zeroAmount,
                        sdMeasure=0.3, diploid=FALSE   )
print(dmy_cq)

# Estimation with freqpcr, where P, K, targetScale, and baseChange are marked unknown.
result <- freqpcr( N=dmy_cq@N, housek0=dmy_cq@housek0, target0=dmy_cq@target0,
                   housek1=dmy_cq@housek1, target1=dmy_cq@target1,
                   EPCR=EPCR, zeroAmount=zeroAmount, beta=TRUE, print.level=0 )
print(result)

# Estimation with freqpcr, assuming diploidy.
result <- freqpcr( N=dmy_cq@N, housek0=dmy_cq@housek0, target0=dmy_cq@target0,
                   housek1=dmy_cq@housek1, target1=dmy_cq@target1,
                   EPCR=EPCR, zeroAmount=zeroAmount, beta=TRUE, diploid=TRUE )

# Estimation where you have knowledge on the size of K.
result <- freqpcr( N=dmy_cq@N, housek0=dmy_cq@housek0, target0=dmy_cq@target0,
                   housek1=dmy_cq@housek1, target1=dmy_cq@target1,
                   K=2,
                   EPCR=EPCR, zeroAmount=zeroAmount, beta=TRUE, print.level=2 )
# (>= v0.3.2)
# Provided the model is continuous (specify A instead of N).
result <- freqpcr( A=dmy_cq@N, housek0=dmy_cq@housek0, target0=dmy_cq@target0,
                   housek1=dmy_cq@housek1, target1=dmy_cq@target1,
                   K=2, EPCR=EPCR, zeroAmount=zeroAmount, beta=TRUE, print.level=1 )
# (>= v0.4.0)
# If the dataset lacks control samples (housek0 and target0 are absent).
# Fixing the sizes of targetScale and sdMeasure is strongly recommended.
result <- freqpcr( N=dmy_cq@N, housek1=dmy_cq@housek1, target1=dmy_cq@target1,
                   K=2, EPCR=EPCR, zeroAmount=zeroAmount,
                   targetScale=1.5, sdMeasure=0.3, beta=TRUE, print.level=1 )

[Package freqpcr version 0.4.0 Index]