R: LD clumping

bed_clumping {bigsnpr}

R Documentation

LD clumping

Description

For a bigSNP:

snp_pruning(): LD pruning. Similar to "--indep-pairwise (size+1) 1 thr.r2" in PLINK. This function is deprecated (see this article).
snp_clumping() (and bed_clumping()): LD clumping. If you do not provide a statistic to rank SNPs, minor allele frequencies (MAFs) are used, which makes clumping similar to pruning.
snp_indLRLDR(): Get SNP indices of long-range LD regions for the human genome.
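
To make the idea of clumping concrete, here is a minimal base-R sketch of a greedy clumping pass on a small dense matrix. It is an illustration under simplifying assumptions (no missing values, window counted in number of SNPs), not the implementation used by bigsnpr; the function name `toy_clumping` and the simulated data are hypothetical.

```r
# Toy greedy clumping (illustration only; not bigsnpr's code).
# SNPs are visited from most to least important (largest S first). A SNP is
# kept only if its squared correlation with every already-kept SNP within
# `size` columns is at most `thr.r2`.
toy_clumping <- function(X, S, thr.r2 = 0.2, size = ncol(X)) {
  ord  <- order(S, decreasing = TRUE)
  kept <- integer(0)
  for (j in ord) {
    near <- kept[abs(kept - j) <= size]   # kept SNPs inside the window
    r2 <- if (length(near) > 0) {
      cor(X[, j], X[, near, drop = FALSE])^2
    } else {
      numeric(0)
    }
    if (all(r2 <= thr.r2)) kept <- c(kept, j)
  }
  sort(kept)
}

set.seed(1)
n  <- 200
x1 <- rbinom(n, 2, 0.3)
x2 <- x1                   # exact copy of SNP 1 (squared correlation of 1)
x3 <- rbinom(n, 2, 0.4)    # simulated independently of SNP 1
X  <- cbind(x1, x2, x3)

# SNP 2 is dropped because it duplicates the higher-ranked SNP 1
toy_clumping(X, S = c(3, 2, 1))
```

Ranking by `S` is what distinguishes clumping from pruning: within a group of correlated SNPs, the one with the largest statistic is the one that survives.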

Usage

bed_clumping(
  obj.bed,
  ind.row = rows_along(obj.bed),
  S = NULL,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  exclude = NULL,
  ncores = 1
)

snp_clumping(
  G,
  infos.chr,
  ind.row = rows_along(G),
  S = NULL,
  thr.r2 = 0.2,
  size = 100/thr.r2,
  infos.pos = NULL,
  is.size.in.bp = NULL,
  exclude = NULL,
  ncores = 1
)

snp_pruning(
  G,
  infos.chr,
  ind.row = rows_along(G),
  size = 49,
  is.size.in.bp = FALSE,
  infos.pos = NULL,
  thr.r2 = 0.2,
  exclude = NULL,
  nploidy = 2,
  ncores = 1
)

snp_indLRLDR(infos.chr, infos.pos, LD.regions = LD.wiki34)

Arguments

`obj.bed`	Object of type bed, which is the mapping of some bed file. Use `obj.bed <- bed(bedfile)` to get this object.
`ind.row`	An optional vector of the row indices (individuals) that are used. If not specified, all rows are used. Don't use negative indices.
`S`	A vector of column statistics expressing the importance of each SNP (the more important the SNP, the larger its statistic should be). For example, if `S` follows the standard normal distribution and "important" means significantly different from 0, you must use `abs(S)` instead. If not specified, MAFs are computed and used.
`thr.r2`	Threshold over the squared correlation between two SNPs. Default is `0.2`.
`size`	For one SNP, window size around this SNP to compute correlations. Default is `100 / thr.r2` for clumping (0.2 -> 500; 0.1 -> 1000; 0.5 -> 200). If `infos.pos` is not provided (`NULL`, the default), this is a window in number of SNPs; otherwise it is a window in kb (distance computed from `infos.pos`). I recommend that you provide the positions if available.
`exclude`	Vector of SNP indices to exclude regardless of their statistic. For example, it can be used to exclude long-range LD regions (see Price et al., 2008). Another use is thresholding with respect to p-values associated with `S`.
`ncores`	Number of cores used. The default (`1`) does not use parallelism. You may use `nb_cores()`.
`G`	A FBM.code256 (typically `<bigSNP>$genotypes`). You shouldn't have missing values. Also, remember to do quality control, e.g. some algorithms in this package won't work if you use SNPs with 0 MAF.
`infos.chr`	Vector of integers specifying each SNP's chromosome. Typically `<bigSNP>$map$chromosome`.
`infos.pos`	Vector of integers specifying the physical position on a chromosome (in base pairs) of each SNP. Typically `<bigSNP>$map$physical.pos`.
`is.size.in.bp`	Deprecated.
`nploidy`	Number of trials, parameter of the binomial distribution. Default is `2`, which corresponds to diploidy, such as for the human genome.
`LD.regions`	A `data.frame` with columns "Chr", "Start" and "Stop". The default, `LD.wiki34`, is a table of 34 long-range LD regions.
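
The relationship between `thr.r2` and the default `size` can be checked with a few lines of base R. The helper name `default_size` is hypothetical and only mirrors the documented default `size = 100 / thr.r2`; it is not a function of the package.

```r
# Hypothetical helper mirroring the documented default `size = 100 / thr.r2`.
default_size <- function(thr.r2) 100 / thr.r2

thr <- c(0.1, 0.2, 0.5)

# One default window size per threshold; a stricter (smaller) threshold
# gives a wider window.
sizes <- setNames(default_size(thr), thr)
sizes
```

Whether the resulting number is read as a count of SNPs or as kb depends on whether `infos.pos` is supplied, as described for `size` above.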

Value

snp_clumping() (and bed_clumping()): SNP indices that are kept.
snp_indLRLDR(): SNP indices to be used as (part of) the 'exclude' parameter of snp_clumping().
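
As a sketch of how the two return values fit together, the indices from snp_indLRLDR() can be passed as `exclude` to snp_clumping(). This assumes the bigsnpr package is installed and uses its bundled example data; the variable names `CHR`, `POS` and `ind.excl` are only illustrative.

```r
library(bigsnpr)

test <- snp_attachExtdata()
G    <- test$genotypes
CHR  <- test$map$chromosome
POS  <- test$map$physical.pos

# Indices of SNPs falling in the default long-range LD regions (LD.wiki34)
ind.excl <- snp_indLRLDR(infos.chr = CHR, infos.pos = POS)

# Clump while excluding those SNPs
ind.keep <- snp_clumping(G, infos.chr = CHR, infos.pos = POS,
                         thr.r2 = 0.2, exclude = ind.excl)

# Excluded SNPs should not appear among the kept indices
length(intersect(ind.keep, ind.excl))
```

Depending on the positions covered by the data, `ind.excl` may contain few or no indices; the call pattern stays the same.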

References

Price AL, Weale ME, Patterson N, et al. Long-Range LD Can Confound Genome Scans in Admixed Populations. Am J Hum Genet. 2008;83(1):132-135. doi:10.1016/j.ajhg.2008.06.005

Examples

test <- snp_attachExtdata()
G <- test$genotypes

# clumping (prioritizing higher MAF)
ind.keep <- snp_clumping(G, infos.chr = test$map$chromosome,
                         infos.pos = test$map$physical.pos,
                         thr.r2 = 0.1)

# keep most of them -> not much LD in this simulated dataset
length(ind.keep) / ncol(G)
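
The example above relies on the default ranking by MAF. The following sketch shows one way to supply a statistic `S` instead, assuming bigsnpr and bigstatsr are installed. The phenotype `y` is simulated purely for illustration, and using the absolute score from `big_univLinReg()` as `S` is an assumption about what "important" means here, not a recommendation from this page.

```r
library(bigsnpr)

test <- snp_attachExtdata()
G <- test$genotypes

# Simulated phenotype (illustration only; not real data)
set.seed(1)
y <- rnorm(nrow(G))

# Per-SNP association scores; these can be negative, so take the absolute
# value to make "larger" mean "more important"
gwas <- bigstatsr::big_univLinReg(G, y)
S <- abs(gwas$score)

# clumping (prioritizing larger absolute scores)
ind.keep2 <- snp_clumping(G, infos.chr = test$map$chromosome,
                          S = S,
                          infos.pos = test$map$physical.pos,
                          thr.r2 = 0.1)

length(ind.keep2) / ncol(G)
```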

[Package bigsnpr version 1.12.2 Index]