snp.pruning {ASRgenomics}    R Documentation

Reduces the number of redundant markers on a molecular matrix M by pruning


For a given molecular dataset M (coded as 0, 1 and 2), this function produces a reduced molecular matrix by eliminating "redundant" markers using pruning techniques. It finds and drops some of the SNPs that are in high linkage disequilibrium (LD).


snp.pruning(
  M = NULL,
  map = NULL,
  marker = NULL,
  chrom = NULL,
  pos = NULL,
  method = c("correlation"),
  criteria = c("callrate", "maf"),
  pruning.thr = 0.95,
  by.chrom = FALSE,
  window.n = 50,
  overlap.n = 5,
  iterations = 10,
  seed = NULL,
  message = TRUE
)



M: A matrix with marker data in full form (n × p), with n individuals and p markers. Individual and marker names are assigned to rownames and colnames, respectively. Data in the matrix are coded as 0, 1, 2 (integer or numeric) (default = NULL).


map: (Optional) A data frame with the map information, with p rows. If NULL, a dummy map is generated assuming a single chromosome and sequential positions for markers. A map is mandatory if by.chrom = TRUE, in which case chrom must also be non-NULL.


marker: A character indicating the name of the column in data frame map with the identification of markers. This is mandatory if map is provided (default = NULL).


chrom: A character indicating the name of the column in data frame map with the identification of chromosomes. This is mandatory if map is provided (default = NULL).


pos: A character indicating the name of the column in data frame map with the identification of marker positions (default = NULL).


method: A character indicating the method (or algorithm) to be used as reference for identifying redundant markers. The only method currently available is based on correlations (default = "correlation").


criteria: A character indicating the criterion used to choose which marker to drop from a detected redundant pair. Options are: "callrate" (the marker with fewer missing values will be kept) and "maf" (the marker with higher minor allele frequency will be kept) (default = "callrate").


pruning.thr: A threshold value; marker pairs with a Pearson's correlation larger than this value are treated as redundant (default = 0.95).


by.chrom: If TRUE, the pruning is performed independently by chromosome (default = FALSE).


window.n: A numeric value with the number of markers to consider in each window to perform pruning (default = 50).


overlap.n: A numeric value with the number of markers to overlap between consecutive windows (default = 5).


iterations: An integer indicating the number of sequential times the pruning procedure should be executed on the remaining markers. If no markers are dropped in a given iteration/run, the algorithm will stop (default = 10).


seed: An integer to be used as seed for reproducibility. If both markers of a pair have the same value for the chosen criterion, one will be dropped at random (default = NULL).


message: If TRUE, diagnostic messages are printed on screen (default = TRUE).
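The inputs described above can be illustrated with a small base R sketch. The object names (`M_toy`, `map_toy`) and the map column names (`snp_id`, `chr`, `bp`) are hypothetical and chosen only for illustration; the consistency check at the end is an assumption about what a well-formed input looks like, not a documented requirement of the function.

```r
# Toy genotype matrix: 4 individuals x 6 markers, coded 0/1/2.
set.seed(1)
M_toy <- matrix(
  sample(c(0, 1, 2), size = 24, replace = TRUE),
  nrow = 4, ncol = 6,
  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:6))
)

# Hypothetical map with one row per marker (p = 6 rows).
# The column names are arbitrary; they are passed to the function by name.
map_toy <- data.frame(
  snp_id = colnames(M_toy),
  chr    = rep(c("1", "2"), each = 3),
  bp     = c(100, 250, 900, 50, 400, 700),
  stringsAsFactors = FALSE
)

# Assumed sanity check: the map has p rows and names the same markers as M.
stopifnot(nrow(map_toy) == ncol(M_toy),
          setequal(map_toy$snp_id, colnames(M_toy)))

# With ASRgenomics loaded, the call would then take this shape (not run here):
# snp.pruning(M = M_toy, map = map_toy,
#             marker = "snp_id", chrom = "chr", pos = "bp",
#             by.chrom = TRUE)
```

Passing column names as strings lets the map keep whatever naming the genotyping platform produced, without renaming columns first.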


Pruning is recommended as redundancies can affect the quality of matrices used for downstream analyses. The algorithm is based on Pearson's correlation between markers as a proxy for LD. When a pairwise correlation is higher than the selected threshold, one marker of the pair is eliminated according to the chosen criterion: call rate or minor allele frequency. In case of a tie, one marker is dropped at random.
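The pairwise rule described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: `prune_pairs` is a hypothetical helper, it handles a single window in a single pass, it uses only the call rate criterion, it compares the signed correlation against the threshold as literally described, and it breaks ties deterministically (dropping the second marker) instead of at random.

```r
# Toy data: snp2 duplicates snp1 but has one missing call; snp3 is unrelated.
M_pairs <- cbind(
  snp1 = c(0, 1, 2, 1, 0, 2),
  snp2 = c(0, 1, 2, 1, NA, 2),
  snp3 = c(2, 0, 1, 1, 0, 2)
)

# Hypothetical helper: one pass over all marker pairs in a single window.
prune_pairs <- function(M, thr = 0.95) {
  callrate <- colMeans(!is.na(M))                # proportion of non-missing calls
  r <- cor(M, use = "pairwise.complete.obs")     # Pearson correlation as LD proxy
  keep <- rep(TRUE, ncol(M))
  for (i in seq_len(ncol(M) - 1)) {
    for (j in (i + 1):ncol(M)) {
      if (!keep[i] || !keep[j]) next             # skip markers already dropped
      if (!is.na(r[i, j]) && r[i, j] > thr) {
        # Keep the marker with the higher call rate.
        # Assumption: on a tie, marker j is dropped (the package drops at random).
        drop_idx <- if (callrate[i] < callrate[j]) i else j
        keep[drop_idx] <- FALSE
      }
    }
  }
  M[, keep, drop = FALSE]
}

pruned <- prune_pairs(M_pairs, thr = 0.90)
colnames(pruned)   # "snp1" "snp3"
```

Here snp1 and snp2 are perfectly correlated over their shared observations, and snp2 is dropped because it has the lower call rate.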

Filtering markers with qc.filtering before pruning is important. Poor-quality markers (e.g., monomorphic markers) may prevent correlations from being calculated and may affect which markers are eliminated.
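The effect of a monomorphic marker can be seen with base R alone. In this small sketch the object names (`M_mono`, `r_mono`, `maf_values`) and the helper `maf_one` are hypothetical; the allele frequency calculation assumes the 0/1/2 coding counts copies of one allele.

```r
# One polymorphic marker and one monomorphic marker (all individuals coded 2).
M_mono <- cbind(
  snp1 = c(0, 1, 2, 1, 0, 2),
  mono = c(2, 2, 2, 2, 2, 2)
)

# A marker with zero variance has no defined Pearson correlation:
# cor() returns NA (with a warning about zero standard deviation).
r_mono <- suppressWarnings(cor(M_mono[, "snp1"], M_mono[, "mono"]))
is.na(r_mono)   # TRUE

# Minor allele frequency per marker; monomorphic markers have MAF = 0,
# so a MAF filter applied before pruning removes them.
maf_one <- function(x) {
  p <- mean(x, na.rm = TRUE) / 2   # frequency of the counted allele
  min(p, 1 - p)
}
maf_values <- apply(M_mono, 2, maf_one)
maf_values
```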



# Read and filter genotypic data.
M.clean <- qc.filtering(
 M = geno.pine655,
 maf = 0.05,
 marker.callrate = 0.20, ind.callrate = 0.20,
 Fis = 1, heterozygosity = 0.98,
 na.string = "-9",
 plots = FALSE)$M.clean

# Prune correlations > 0.9.
Mpr <- snp.pruning(
 M = M.clean, pruning.thr = 0.90,
 by.chrom = FALSE, window.n = 40, overlap.n = 10)
Mpr$Mpruned[1:5, 1:5]

[Package ASRgenomics version 1.1.3 Index]