R: Quality control filtering of molecular matrix M for...

qc.filtering {ASRgenomics}

R Documentation

Quality control filtering of molecular matrix M for downstream analyses

Description

Reads molecular data coded as 0, 1, 2 and performs basic quality control filtering and simple imputation. The matrix provided is of the full form (n \times p), with n individuals and p markers. Individual and marker names are assigned to rownames and colnames, respectively. Filtering can be done with some of the following options by specifying thresholds for: missing values on individuals, missing values on markers, minor allele frequency, inbreeding Fis value (of markers), and observed heterozygosity (of markers). The string used to identify missing values can be specified. If requested, missing values are imputed with the mean of each SNP.
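
As an illustration of the mean imputation described above, the following is a minimal base R sketch on a toy matrix. The helper `impute_marker_mean()` and the toy data are hypothetical and not part of ASRgenomics; the rounding step simply mirrors the `digits` argument, and the package's internal implementation may differ.

```r
# Toy 0/1/2 genotype matrix: 4 individuals (rows) x 3 markers (columns).
# matrix() fills column by column, so each group of four values is one marker.
M <- matrix(c(0, 1, NA, 2,
              2, NA, 2, 1,
              1, 1, 0, NA),
            nrow = 4, ncol = 3,
            dimnames = list(paste0("ind", 1:4), paste0("snp", 1:3)))

# Hypothetical helper (not an ASRgenomics function):
# replace every missing value by the mean of its marker (column).
impute_marker_mean <- function(M, digits = 2) {
  marker_means <- colMeans(M, na.rm = TRUE)
  # Row/column positions of the missing cells.
  missing_idx <- which(is.na(M), arr.ind = TRUE)
  M[missing_idx] <- marker_means[missing_idx[, "col"]]
  round(M, digits)
}

M.imputed <- impute_marker_mean(M)
M.imputed
```

Imputed cells are generally no longer integers, so a matrix imputed this way should be treated as numeric rather than as strict 0, 1, 2 codes.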

Usage

qc.filtering(
  M = NULL,
  base = FALSE,
  na.string = NA,
  map = NULL,
  marker = NULL,
  chrom = NULL,
  pos = NULL,
  ref = NULL,
  marker.callrate = 1,
  ind.callrate = 1,
  maf = 0,
  heterozygosity = 1,
  Fis = 1,
  impute = FALSE,
  Mrecode = FALSE,
  plots = TRUE,
  digits = 2,
  message = TRUE
)

Arguments

`M`	A matrix with SNP data of full form (`n \times p`), with `n` individuals and `p` markers. Individual and marker names are assigned to `rownames` and `colnames`, respectively. Data in the matrix is coded as 0, 1, 2 (integer or numeric) (default = `NULL`).
`base`	If `TRUE`, matrix `\boldsymbol{M}` is considered to be in bi-allelic SNP data format (character) and the SNPs are recoded to numerical values before performing the quality control filters (default = `FALSE`) (currently deprecated).
`na.string`	A character that will be interpreted as `NA` values (default = `NA`).
`map`	(Optional) A data frame with the map information with `p` rows (default = `NULL`).
`marker`	A character indicating the name of the column in data frame `map` with the identification of markers. This is mandatory if `map` is provided (default = `NULL`).
`chrom`	A character indicating the name of the column in data frame `map` with the identification of chromosomes (default = `NULL`).
`pos`	A character indicating the name of the column in data frame `map` with the identification of marker positions (default = `NULL`).
`ref`	A character indicating the name of the column in the map containing the reference allele for recoding. If absent, then conversion will be based on the major allele (most frequent). The marker information of a given individual with two of the specified major alleles in `ref` will be coded as 2 (default = `NULL`).
`marker.callrate`	A numerical value between 0 and 1 used to remove SNPs with a rate of missing values equal to or larger than this value (default = 1, i.e. no removal).
`ind.callrate`	A numerical value between 0 and 1 used to remove individuals with a rate of missing values equal to or larger than this value (default = 1, i.e. no removal).
`maf`	A numerical value between 0 and 1 used to remove SNPs with a Minor Allele Frequency (MAF) below this value (default = 0, i.e. no removal).
`heterozygosity`	A numeric value indicating the maximum accepted value of observed heterozygosity (Ho) (default = 1, i.e. no removal).
`Fis`	A numeric value indicating the maximum accepted value of inbreeding (Fis), following the equation `|1 - (Ho/He)|` (default = 1, i.e. no removal).
`impute`	If `TRUE` imputation of missing values is done using the mean of each SNP (default = `FALSE`).
`Mrecode`	If `TRUE`, the `\boldsymbol{M}` matrix recoded from bi-allelic to numeric SNP format is returned (default = `FALSE`) (currently deprecated).
`plots`	If `TRUE` generates graphical output of the quality control based on the original input matrix (default = `TRUE`).
`digits`	Number of digits used to round the output matrix (default = 2).
`message`	If `TRUE` diagnostic messages are printed on screen (default = `TRUE`).
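
The per-marker and per-individual quantities that these thresholds refer to can be sketched in base R as follows. This is an illustrative sketch only: the toy matrix is made up, the allele frequency is taken as the mean genotype divided by 2, and the expected heterozygosity is assumed to be the Hardy-Weinberg value `He = 2p(1 - p)`; the package may compute these quantities differently. Only the `|1 - (Ho/He)|` form of Fis comes from the argument description above.

```r
# Toy 0/1/2 matrix: 4 individuals x 3 markers (filled column by column).
M <- matrix(c(0, 1, 2, 1,
              0, 0, 0, NA,
              1, 1, 1, 1),
            nrow = 4,
            dimnames = list(paste0("ind", 1:4), paste0("snp", 1:3)))

# Proportion of missing values per marker and per individual.
miss.marker <- colMeans(is.na(M))
miss.ind    <- rowMeans(is.na(M))

# Frequency of the allele counted by the 0/1/2 coding, and the MAF.
p   <- colMeans(M, na.rm = TRUE) / 2
maf <- pmin(p, 1 - p)

# Observed heterozygosity: proportion of heterozygous calls (code 1).
Ho <- colMeans(M == 1, na.rm = TRUE)

# Expected heterozygosity (assumption: Hardy-Weinberg, 2pq).
He <- 2 * p * (1 - p)

# Inbreeding statistic as given in the argument description.
# It is undefined (NaN) for monomorphic markers, where He = 0.
Fis <- abs(1 - Ho / He)

data.frame(miss.marker, maf, Ho, He, Fis)
```

In this toy example `snp2` is monomorphic, so its Fis is undefined, while `snp3` is heterozygous in every individual and reaches the maximum Ho of 1.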

Details

Warning: The arguments base, ref, and Mrecode are currently deprecated and will be removed in the next version of ASRgenomics. Use the function snp.recode to recode the matrix prior to using qc.filtering.

The filtering process is carried out as expressed in the following simplified pseudo-code, which consists of a loop repeated twice:

for i in 1 to 2
    Filter markers based on call rate.
    Filter individuals based on call rate.
    Filter markers based on minor allele frequency.
    Filter markers based on observed heterozygosity.
    Filter markers based on inbreeding.
end for
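
The loop above can be sketched in base R as shown below. This is a hedged illustration, not the package's code: `qc_sketch()` and `keep_if()` are hypothetical helpers, the thresholds are applied literally as worded in the Arguments section (missing-value rates equal to or larger than the call-rate thresholds are removed), He is assumed to be `2p(1 - p)`, and markers whose statistic is undefined are kept. A plausible reason for the second pass is that removing individuals changes the per-marker statistics, though the source does not state this.

```r
# Hypothetical helper: TRUE where the condition holds or is undefined (NA/NaN).
keep_if <- function(ok) is.na(ok) | ok

# Hypothetical sketch of the two-pass filter (not an ASRgenomics function).
qc_sketch <- function(M, marker.callrate = 1, ind.callrate = 1,
                      maf = 0, heterozygosity = 1, Fis = 1) {
  for (i in 1:2) {
    # Markers: drop if missing rate >= threshold (skipped at the default of 1).
    if (marker.callrate < 1) {
      M <- M[, colMeans(is.na(M)) < marker.callrate, drop = FALSE]
    }
    # Individuals: drop if missing rate >= threshold.
    if (ind.callrate < 1) {
      M <- M[rowMeans(is.na(M)) < ind.callrate, , drop = FALSE]
    }
    # Markers: drop if minor allele frequency is below the threshold.
    p <- colMeans(M, na.rm = TRUE) / 2
    M <- M[, keep_if(pmin(p, 1 - p) >= maf), drop = FALSE]
    # Markers: drop if observed heterozygosity exceeds the threshold.
    Ho <- colMeans(M == 1, na.rm = TRUE)
    M <- M[, keep_if(Ho <= heterozygosity), drop = FALSE]
    # Markers: drop if |1 - Ho/He| exceeds the threshold (He assumed 2pq).
    p  <- colMeans(M, na.rm = TRUE) / 2
    Ho <- colMeans(M == 1, na.rm = TRUE)
    He <- 2 * p * (1 - p)
    M <- M[, keep_if(abs(1 - Ho / He) <= Fis), drop = FALSE]
  }
  M
}

# Toy data: 6 individuals x 5 markers (filled column by column).
M <- matrix(c(0, 1, 2, 1, 0, NA,     # snp1
              NA, NA, NA, 1, 0, 2,   # snp2: half of the calls missing
              0, 0, 0, 0, 0, NA,     # snp3: monomorphic
              1, 1, 1, 1, 1, NA,     # snp4: all heterozygous
              0, 1, 2, 0, 1, 2),     # snp5
            nrow = 6,
            dimnames = list(paste0("ind", 1:6), paste0("snp", 1:5)))

M.qc <- qc_sketch(M, marker.callrate = 0.4, ind.callrate = 0.5,
                  maf = 0.05, heterozygosity = 0.9, Fis = 0.8)
dim(M.qc)
colnames(M.qc)
```

With these toy thresholds, `snp2` fails the marker missing-value filter, `ind6` fails the individual filter, `snp3` fails the MAF filter, and `snp4` fails the heterozygosity filter, leaving `snp1` and `snp5` for five individuals.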

Value

A list with the following elements:

M.clean: the cleaned \boldsymbol{M} matrix after the quality control filters have been applied.
map: if provided, a cleaned map data frame after the quality control filters have been applied.
plot.missing.ind: a plot of missing data per individual (original marker matrix).
plot.missing.SNP: a plot of missing data per SNP (original marker matrix).
plot.heteroz: a plot of observed heterozygosity per SNP (original marker matrix).
plot.Fis: a plot of Fis per SNP (original marker matrix).
plot.maf: a plot of the minor allele frequency (original marker matrix).

Examples

# Example: Pine dataset from ASRgenomics (coded as 0,1,2 with missing as -9).

M.clean <- qc.filtering(
 M = geno.pine926,
 maf = 0.05,
 marker.callrate = 0.9, ind.callrate = 0.9,
 heterozygosity = 0.9, Fis = 0.6,
 na.string = "-9")
ls(M.clean)
M.clean$M.clean[1:5, 1:5]
dim(M.clean$M.clean)
head(M.clean$map)
M.clean$plot.maf
M.clean$plot.missing.ind
M.clean$plot.missing.SNP
M.clean$plot.heteroz
M.clean$plot.Fis


# Example: Salmon dataset (coded as 0,1,2 with missing as NA).

M.clean <- qc.filtering(
 M = geno.salmon,
 maf = 0.02,
 marker.callrate = 0.10, ind.callrate = 0.20,
 heterozygosity = 0.9, Fis = 0.4)
M.clean$M.clean[1:5, 1:5]
dim(M.clean$M.clean)
head(M.clean$map)
M.clean$plot.maf
M.clean$plot.missing.ind
M.clean$plot.missing.SNP
M.clean$plot.heteroz
M.clean$plot.Fis

[Package ASRgenomics version 1.1.4 Index]