R: Fast bias-correction for LD Estimation

ldfast {ldsep}

R Documentation

Fast bias-correction for LD Estimation

Description

Estimates the reliability ratios from posterior marginal moments and uses these to correct the biases in linkage disequilibrium estimation caused by genotype uncertainty. These methods are described in Gerard (2021).

Usage

ldfast(
  gp,
  type = c("r", "r2", "z", "D", "Dprime"),
  shrinkrr = TRUE,
  se = TRUE,
  thresh = TRUE,
  upper = 10,
  mode = c("zero", "estimate"),
  win = NULL
)

Arguments

`gp`	A three-way array with dimensions SNPs by individuals by dosage. That is, `gp[i, j, k]` is the posterior probability of dosage `k-1` for individual `j` at SNP `i`.
`type`	What LD measure should we estimate? `"r"` The Pearson correlation. `"r2"` The squared Pearson correlation. `"z"` The Fisher-z transformed Pearson correlation. `"D"` The LD coefficient. `"Dprime"` The standardized LD coefficient. Note that these are all composite measures of LD (see the description in `ldest()`).
`shrinkrr`	A logical. Should we use adaptive shrinkage (Stephens, 2016) to shrink the reliability ratios (`TRUE`) or keep the raw reliability ratios (`FALSE`). Defaults to `TRUE`.
`se`	Should we also return a matrix of standard errors (`TRUE`) or not (`FALSE`)? It is faster to not return standard errors. Defaults to `TRUE`.
`thresh`	A logical. Should we apply an upper bound on the reliability ratios (`TRUE`) or not (`FALSE`).
`upper`	The upper bound on the reliability ratios if `thresh = TRUE`. The default is a generous 10.
`mode`	A character. Only applies if `shrinkrr = TRUE`. When using hierarchical shrinkage on the log of the reliability ratios, should we use zero as the mode (`mode = "zero"`) or estimate it using the procedure of Robertson and Cryer (1974) (`mode = "estimate"`)?
`win`	A positive integer. The window size. This will constrain the correlations calculated to those +/- the window size. This will only improve speed if the window size is much less than the number of SNPs.

Value

A list with some or all of the following elements:

ldmat: The bias-corrected LD matrix.
rr: The estimated reliability ratio for each SNP. This is the multiplicative factor applied to the naive LD estimate for each SNP.
rr_raw: The raw reliability ratios (for the covariance, not the correlation). Only returned if shrinkrr = TRUE.
rr_se: The standard errors for the log-raw reliability ratios for each SNP. That is, we have sd(log(rr_raw)) ~ rr_se. Only returned if shrinkrr = TRUE.
semat: A matrix of standard errors of the corresponding estimators of LD.

Details

Returns consistent and bias-corrected estimates of linkage disequilibrium. The usual measures of LD are implemented: D, D', r, r2, and z (Fisher-z of r). These are all composite measures of LD, not haplotypic measures of LD (see the description in ldest()). They are always appropriate measures of association between loci, but only correspond to haplotypic measures of LD when Hardy-Weinberg equilibrium is fulfilled in autopolyploids.

In order for these estimates to perform well, you need to use posterior genotype probabilities that have been calculated using adaptive priors, i.e. empirical/hierarchical Bayes approaches. There are many approaches that do this, such as updog, polyRAD, fitPoly, or SuperMASSA. Note that GATK uses a uniform prior, so would be inappropriate for use in ldfast().

Calculating standard errors and performing hierarchical shrinkage of the reliability ratios are both rather slow operations compared to just raw method-of-moments based estimation for LD. If you don't need standard errors, you can double your speed by setting se = FALSE. It is not recommended that you disable the hierarchical shrinkage.

Due to sampling variability, the estimates sometime lie outside of the theoretical boundaries of the parameters being estimated. In such cases, we truncate the estimates at the boundary and return NA for the standard errors.

Mathematical formulation

Let

r be the sample correlation of posterior mean genotypes between loci 1 and 2,
a1 be the sample variance of posterior means at locus 1,
a2 be the sample variance of posterior means at locus 2,
b1 be the sample mean of posterior variances at locus 1, and
b2 be the sample mean of posterior variances at locus 2.

Then the estimated Pearson correlation between the genotypes at loci 1 and 2 is

\sqrt{(a1 + b1)/a1}\sqrt{(a2 + b2)/a2}r.

All other LD calculations are based on this equation. In particular, the estimated genotype variances at loci 1 and 2 are a1 + b1 and a2 + b2, respectively, which can be used to calculate D and D'.

The reliability ratio for SNP i is defined by (ai + bi)/ai. By default, we apply ash() (Stephens, 2016) to the log of these reliability ratios before adjusting the Pearson correlation. Standard errors are required before using ash(), but these are easily obtained using the central limit theorem and the delta-method.

Author(s)

David Gerard

References

Gerard, David. Scalable Bias-corrected Linkage Disequilibrium Estimation Under Genotype Uncertainty. Heredity, 127(4), 357–362, 2021. doi:10.1038/s41437-021-00462-5.
T. Robertson and J. D. Cryer. An iterative procedure for estimating the mode. Journal of the American Statistical Association, 69(348):1012–1016, 1974. doi:10.1080/01621459.1974.10480246.
M. Stephens. False discovery rates: a new deal. Biostatistics, 18(2):275–294, 10 2016. doi:10.1093/biostatistics/kxw041.

Examples

data("gp")

ldout <- ldfast(gp, "r")
ldout$ldmat
ldout$rr
ldout$semat

ldout <- ldfast(gp, "D")
ldout$ldmat
ldout$rr
ldout$semat

ldout <- ldfast(gp, "Dprime")
ldout$ldmat
ldout$rr
ldout$semat

[Package ldsep version 2.1.5 Index]