qc.filtering {ASRgenomics}    R Documentation

Quality control filtering of molecular matrix M for downstream analyses

Description

Reads molecular data coded as 0, 1, 2 and performs basic quality control filters and simple imputation. The matrix provided is of full form (n × p), with n individuals and p markers. Individual and marker names are assigned to rownames and colnames, respectively. Filtering can be done with some of the following options by specifying thresholds for: missing values on individuals, missing values on markers, minor allele frequency, inbreeding Fis value (of markers), and observed heterozygosity (of markers). The string used for identifying missing values can be specified. If requested, missing values will be imputed based on the mean of each SNP.

Usage

qc.filtering(
  M = NULL,
  base = FALSE,
  na.string = NA,
  map = NULL,
  marker = NULL,
  chrom = NULL,
  pos = NULL,
  ref = NULL,
  marker.callrate = 1,
  ind.callrate = 1,
  maf = 0,
  heterozygosity = 1,
  Fis = 1,
  impute = FALSE,
  Mrecode = FALSE,
  plots = TRUE,
  digits = 2,
  message = TRUE
)

Arguments

M

A matrix with SNP data of full form (n × p), with n individuals and p markers. Individual and marker names are assigned to rownames and colnames, respectively. Data in the matrix is coded as 0, 1, 2 (integer or numeric) (default = NULL).
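As an illustration of the expected layout, the following minimal sketch builds a small matrix by hand. The data, individual names, and marker names are hypothetical and are not part of ASRgenomics.

```r
# Hypothetical toy data: 4 individuals (rows) by 5 markers (columns),
# coded as 0, 1, 2 with NA for missing genotypes.
M <- matrix(
  c(0, 1, 2, NA, 1,
    2, 1, 0, 0, 1,
    1, NA, 2, 2, 0,
    0, 0, 1, 2, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:5))
)

dim(M)                       # n individuals, p markers
rownames(M)                  # individual names
colnames(M)                  # marker names
all(M %in% c(0, 1, 2, NA))   # TRUE when only valid codes are present
```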

base

If TRUE matrix M is considered as bi-allelic SNP data format (character) and the SNPs are recoded to numerical values before performing the quality control filters (default = FALSE) (currently deprecated).

na.string

A character string that will be interpreted as a missing value (default = NA).
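The effect of a missing-value code can be pictured with base R alone. This is only a sketch of the idea, assuming missing genotypes are stored as -9; it is not the package's internal code, and the toy matrix is hypothetical.

```r
# Hypothetical matrix where missing genotypes are stored as -9.
M <- matrix(c(0, -9, 2, 1, 1, -9), nrow = 2,
            dimnames = list(c("ind1", "ind2"), c("snp1", "snp2", "snp3")))

na.string <- "-9"

# Comparing a numeric matrix with a string coerces the numbers to character,
# so cells equal to -9 match "-9" and are replaced by NA.
M[M == na.string] <- NA

sum(is.na(M))   # number of cells now treated as missing
```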

map

(Optional) A data frame with the map information with p rows (default = NULL).

marker

A character indicating the name of the column in data frame map with the identification of markers. This is mandatory if map is provided (default = NULL).

chrom

A character indicating the name of the column in data frame map with the identification of chromosomes (default = NULL).

pos

A character indicating the name of the column in data frame map with the identification of marker positions (default = NULL).
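A map data frame for the arguments map, marker, chrom, and pos could look like the sketch below. The column names snp_id, chr, and bp and all values are hypothetical; any column names can be used as long as they are passed as character strings.

```r
# Hypothetical map for 3 markers; one row per marker.
map <- data.frame(
  snp_id = c("snp1", "snp2", "snp3"),
  chr    = c(1, 1, 2),
  bp     = c(1050, 20480, 733),
  stringsAsFactors = FALSE
)

# The column names would then be supplied as strings, for example:
# qc.filtering(M = M, map = map, marker = "snp_id", chrom = "chr", pos = "bp")

nrow(map)   # should equal the number of markers (columns) in M
```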

ref

A character indicating the name of the column in the map containing the reference allele for recoding. If absent, then conversion will be based on the major allele (most frequent). For a given individual, a marker carrying two copies of the allele specified in ref will be coded as 2 (default = NULL).

marker.callrate

A numerical value between 0 and 1 used to remove SNPs with a rate of missing values equal to or larger than this value (default = 1, i.e. no removal).

ind.callrate

A numerical value between 0 and 1 used to remove individuals with a rate of missing values equal to or larger than this value (default = 1, i.e. no removal).
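The missing-value rates used by marker.callrate and ind.callrate can be inspected beforehand with base R. The sketch below follows the wording of the two arguments (remove when the missing rate is equal to or larger than the threshold); the toy matrix and the threshold of 0.5 are hypothetical, and the package's internal implementation may differ.

```r
# Hypothetical toy data: 3 individuals by 4 markers.
M <- matrix(c(0, NA, 2, 1,
              NA, NA, 1, 0,
              2, 1, 1, NA),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("ind", 1:3), paste0("snp", 1:4)))

miss.marker <- colMeans(is.na(M))   # proportion of missing values per SNP
miss.ind    <- rowMeans(is.na(M))   # proportion of missing values per individual

# Keep SNPs whose missing rate is below the (hypothetical) threshold.
threshold <- 0.5
M.kept <- M[, miss.marker < threshold, drop = FALSE]
colnames(M.kept)
```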

maf

A numerical value between 0 and 1 used to remove SNPs with a Minor Allele Frequency (MAF) below this value (default = 0, i.e. no removal).
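For a matrix coded as 0, 1, 2, the MAF of each marker can be computed directly from the column means. The sketch below uses hypothetical toy data and a hypothetical threshold of 0.05; it illustrates the quantity being filtered rather than the package's own code.

```r
# Hypothetical toy data: 4 individuals by 3 markers.
M <- matrix(c(0, 1, 2,
              0, 2, 2,
              1, 1, NA,
              0, 0, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("ind", 1:4), paste0("snp", 1:3)))

# Frequency of the allele counted by the 0/1/2 coding, ignoring missing values.
p <- colMeans(M, na.rm = TRUE) / 2

# The minor allele frequency is the smaller of p and 1 - p.
maf.values <- pmin(p, 1 - p)

# Keep SNPs whose MAF is not below the threshold.
M.kept <- M[, maf.values >= 0.05, drop = FALSE]
colnames(M.kept)
```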

heterozygosity

A numeric value indicating the maximum value of accepted observed heterozygosity (Ho) (default = 1, i.e. no removal).

Fis

A numeric value indicating the maximum value of accepted inbreeding (Fis) following the equation |1 - (Ho/He)| (default = 1, i.e. no removal).
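The quantities Ho and Fis can be sketched per marker as follows. Ho is taken as the proportion of genotypes coded as 1. The expected heterozygosity He is not defined in this page; the sketch assumes the usual Hardy-Weinberg expectation He = 2p(1 - p), which may differ from the package's internal calculation. The toy matrix is hypothetical.

```r
# Hypothetical toy data: 4 individuals by 3 markers.
M <- matrix(c(0, 1, 0,
              2, 1, 0,
              1, 1, 2,
              1, 1, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("ind", 1:4), paste0("snp", 1:3)))

p  <- colMeans(M, na.rm = TRUE) / 2     # allele frequency per SNP
Ho <- colMeans(M == 1, na.rm = TRUE)    # observed heterozygosity per SNP
He <- 2 * p * (1 - p)                   # assumed expected heterozygosity

Fis.values <- abs(1 - Ho / He)          # |1 - (Ho/He)| as given above
round(cbind(Ho, He, Fis = Fis.values), 2)
```

Monomorphic markers have He = 0, so the ratio Ho/He is undefined for them and needs separate handling.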

impute

If TRUE imputation of missing values is done using the mean of each SNP (default = FALSE).
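Mean imputation per SNP can be pictured with the base R sketch below. The toy matrix is hypothetical, and the code only illustrates the idea described for impute; it is not taken from the package.

```r
# Hypothetical toy data: 4 individuals by 2 markers, filled column by column.
M <- matrix(c(0, 2, NA, 2,
              1, NA, 1, 1),
            nrow = 4,
            dimnames = list(paste0("ind", 1:4), c("snp1", "snp2")))

# Mean of each SNP over the observed genotypes.
snp.means <- colMeans(M, na.rm = TRUE)

# Locate the missing cells and fill each with the mean of its own column.
idx <- which(is.na(M), arr.ind = TRUE)
M.imp <- M
M.imp[idx] <- snp.means[idx[, "col"]]

round(M.imp, 2)
```

Imputed cells are generally not integers, which is why the digits argument controls rounding of the output matrix.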

Mrecode

If TRUE it provides the recoded M matrix from the bi-allelic to numeric SNP (default = FALSE) (currently deprecated).

plots

If TRUE generates graphical output of the quality control based on the original input matrix (default = TRUE).

digits

Number of digits used to round the output matrix (default = 2).

message

If TRUE diagnostic messages are printed on screen (default = TRUE).

Details

Warning: The arguments base, ref, and Mrecode are currently deprecated and will be removed in the next version of ASRgenomics. Use function snp.recode to recode the matrix prior to using qc.filtering.

The filtering process is carried out as expressed in the following simplified pseudo-code, which consists of a loop repeated twice:

for i in 1 to 2

    Filter markers based on call rate.

    Filter individuals based on call rate.

    Filter markers based on minor allele frequency.

    Filter markers based on observed heterozygosity.

    Filter markers based on inbreeding.

end for
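The pseudo-code above can be sketched in base R as follows. The function qc_sketch is hypothetical and is not the implementation used by qc.filtering: it assumes He = 2p(1 - p), skips a filter when its threshold is left at the documented default, and drops markers whose statistic cannot be computed. It only illustrates why the pass is repeated, since removing individuals changes the statistics of the remaining markers.

```r
# Hypothetical two-pass filter following the documented order of steps.
qc_sketch <- function(M, marker.callrate = 1, ind.callrate = 1,
                      maf = 0, heterozygosity = 1, Fis = 1) {
  # Treat undefined statistics (NA/NaN) as failing the filter.
  ok <- function(keep) !is.na(keep) & keep
  for (i in 1:2) {
    if (marker.callrate < 1) {   # markers: missing-value rate
      M <- M[, ok(colMeans(is.na(M)) < marker.callrate), drop = FALSE]
    }
    if (ind.callrate < 1) {      # individuals: missing-value rate
      M <- M[ok(rowMeans(is.na(M)) < ind.callrate), , drop = FALSE]
    }
    if (maf > 0) {               # markers: minor allele frequency
      p <- colMeans(M, na.rm = TRUE) / 2
      M <- M[, ok(pmin(p, 1 - p) >= maf), drop = FALSE]
    }
    if (heterozygosity < 1) {    # markers: observed heterozygosity
      Ho <- colMeans(M == 1, na.rm = TRUE)
      M <- M[, ok(Ho <= heterozygosity), drop = FALSE]
    }
    if (Fis < 1) {               # markers: inbreeding |1 - (Ho/He)|
      p  <- colMeans(M, na.rm = TRUE) / 2
      Ho <- colMeans(M == 1, na.rm = TRUE)
      He <- 2 * p * (1 - p)      # assumed Hardy-Weinberg expectation
      M <- M[, ok(abs(1 - Ho / He) <= Fis), drop = FALSE]
    }
  }
  M
}

# Hypothetical toy data: one empty marker (snp4) and one empty individual (ind4).
M <- matrix(c(0, 1, 2, NA,
              1, 1, 0, NA,
              2, 1, 1, NA,
              NA, NA, NA, NA),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("ind", 1:4), paste0("snp", 1:4)))

out <- qc_sketch(M, marker.callrate = 0.5, ind.callrate = 0.5, maf = 0.05)
dim(out)
```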

Value

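A list with the following elements:

M.clean

The marker matrix after the quality control filters have been applied.

map

The map information after the quality control filters have been applied.

plot.maf

A plot of the minor allele frequency.

plot.missing.ind

A plot of missing values per individual.

plot.missing.SNP

A plot of missing values per SNP.

plot.heteroz

A plot of observed heterozygosity.

plot.Fis

A plot of the inbreeding Fis values.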

Examples

# Example: Pine dataset from ASRgenomics (coded as 0,1,2 with missing as -9).

M.clean <- qc.filtering(
 M = geno.pine926,
 maf = 0.05,
 marker.callrate = 0.9, ind.callrate = 0.9,
 heterozygosity = 0.9, Fis = 0.6,
 na.string = "-9")
ls(M.clean)
M.clean$M.clean[1:5, 1:5]
dim(M.clean$M.clean)
head(M.clean$map)
M.clean$plot.maf
M.clean$plot.missing.ind
M.clean$plot.missing.SNP
M.clean$plot.heteroz
M.clean$plot.Fis


# Example: Salmon dataset (coded as 0,1,2 with missing as NA).

M.clean <- qc.filtering(
 M = geno.salmon,
 maf = 0.02,
 marker.callrate = 0.10, ind.callrate = 0.20,
 heterozygosity = 0.9, Fis = 0.4)
M.clean$M.clean[1:5, 1:5]
dim(M.clean$M.clean)
head(M.clean$map)
M.clean$plot.maf
M.clean$plot.missing.ind
M.clean$plot.missing.SNP
M.clean$plot.heteroz
M.clean$plot.Fis



[Package ASRgenomics version 1.1.4 Index]