qc.filtering {ASRgenomics} | R Documentation |
Quality control filtering of molecular matrix M for downstream analyses
Description
Reads molecular data in the format 0, 1, 2 and performs some basic quality control
filters and simple imputation.
The matrix provided is of the full form (n × p), with n individuals and p markers.
Individual and marker names are assigned to rownames and colnames, respectively.
Filtering can be done with some of the following options by specifying thresholds for:
missing values on individuals, missing values on markers, minor allele frequency,
inbreeding Fis value (of markers), and observed heterozygosity (of markers).
The string used to identify missing values can be specified.
If requested, missing values will be imputed based on the mean of each SNP.
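As an illustration of the last two points, the following base R sketch converts a missing-value string to NA and then replaces each missing genotype with the mean of its SNP. It is not the package's internal code: the toy matrix and the object names (M_raw, M_num, M_imp) are assumptions made for this example only, and the rounding to two digits simply mirrors the default of the digits argument.

```r
# Toy marker matrix (individuals in rows, markers in columns), coded 0/1/2,
# with missing genotypes stored as the string "-9". Hypothetical data.
M_raw <- matrix(
  c("0", "1",  "-9",
    "2", "-9", "1",
    "1", "1",  "2",
    "0", "2",  "2"),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:3))
)

# Replace the missing-value string with NA and convert to numeric,
# keeping the row and column names.
M_raw[M_raw == "-9"] <- NA
M_num <- matrix(as.numeric(M_raw), nrow = nrow(M_raw),
                dimnames = dimnames(M_raw))

# Impute each missing genotype with the mean of its marker (column).
snp_means <- colMeans(M_num, na.rm = TRUE)
M_imp <- M_num
for (j in seq_len(ncol(M_imp))) {
  M_imp[is.na(M_imp[, j]), j] <- snp_means[j]
}

# Round the imputed matrix to two digits.
M_imp <- round(M_imp, 2)
M_imp
```

Note that mean imputation produces non-integer genotype values, which is why a rounding step is relevant for the output matrix.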
Usage
qc.filtering(
  M = NULL,
  base = FALSE,
  na.string = NA,
  map = NULL,
  marker = NULL,
  chrom = NULL,
  pos = NULL,
  ref = NULL,
  marker.callrate = 1,
  ind.callrate = 1,
  maf = 0,
  heterozygosity = 1,
  Fis = 1,
  impute = FALSE,
  Mrecode = FALSE,
  plots = TRUE,
  digits = 2,
  message = TRUE
)
Arguments
M |
A matrix with SNP data of full form ( |
base |
If |
na.string |
A character that will be interpreted as |
map |
(Optional) A data frame with the map information with |
marker |
A character indicating the name of the column in data frame |
chrom |
A character indicating the name of the column in data frame |
pos |
A character indicating the name of the column in data frame |
ref |
A character indicating the name of the column in the map containing the reference allele for
recoding. If absent, then conversion will be based on the major allele (most frequent).
The marker information of a given individuals with two of the specified major alleles
in |
marker.callrate |
A numerical value between 0 and 1 used to remove SNPs with a rate of missing values equal to or larger than this value (default = 1, i.e. no removal). |
ind.callrate |
A numerical value between 0 and 1 used to remove individuals with a rate of missing values equal to or larger than this value (default = 1, i.e. no removal). |
maf |
A numerical value between 0 and 1 used to remove SNPs with a Minor Allele Frequency (MAF) below this value (default = 0, i.e. no removal). |
heterozygosity |
A numeric value indicating the maximum value of accepted observed heterozygosity (Ho) (default = 1, i.e. no removal). |
Fis |
A numeric value indicating the maximum value of accepted inbreeding (Fis) following
the equation |
impute |
If |
Mrecode |
If |
plots |
If |
digits |
Number of digits used to round the output matrix (default = 2). |
message |
If |
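The per-individual and per-marker statistics that the thresholds above refer to can be sketched in base R as follows. This is a minimal illustration on a toy matrix, not the package's implementation. In particular, Fis is computed here with the common textbook form 1 - Ho/He, with He = 2p(1 - p); the exact equation used by qc.filtering is cut off on this page, so treat that formula as an assumption.

```r
# Toy 0/1/2 marker matrix with one missing genotype. Hypothetical data.
M <- matrix(
  c(0, 1, 2, NA,
    0, 1, 1, 1,
    1, 2, 0, 1,
    0, 0, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:4))
)

# Rate of missing values per individual (rows) and per marker (columns).
miss_ind <- rowMeans(is.na(M))
miss_snp <- colMeans(is.na(M))

# Frequency of the allele counted by the 0/1/2 coding, and the MAF.
p   <- colMeans(M, na.rm = TRUE) / 2
maf <- pmin(p, 1 - p)

# Observed heterozygosity: proportion of non-missing genotypes equal to 1.
Ho <- colMeans(M == 1, na.rm = TRUE)

# Expected heterozygosity under Hardy-Weinberg, and Fis (assumed form).
He  <- 2 * p * (1 - p)
Fis <- 1 - Ho / He

round(cbind(miss_snp, maf, Ho, He, Fis), 3)
```

A marker with He = 0 (monomorphic) would give an undefined Fis in this sketch; how the package handles that case is not stated here.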
Details
Warning: The arguments base, ref, and Mrecode are currently deprecated and will
be removed in the next version of ASRgenomics.
Use function snp.recode to recode the matrix prior to using qc.filtering.
The filtering process is carried out as expressed in the following simplified pseudo-code, which consists of a loop repeated twice:
for i in 1 to 2
    Filter markers based on call rate.
    Filter individuals based on call rate.
    Filter markers based on minor allele frequency.
    Filter markers based on observed heterozygosity.
    Filter markers based on inbreeding.
end for
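The pseudo-code above can be sketched in base R as shown below. The function filter_sketch is hypothetical and is not part of ASRgenomics; it only mimics the order of the steps and the threshold semantics described in the Arguments section. Fis is again assumed to be 1 - Ho/He, and monomorphic markers are given Fis = 0 here purely for convenience.

```r
# Hypothetical two-pass filter; a sketch, not the ASRgenomics implementation.
filter_sketch <- function(M, marker.callrate = 1, ind.callrate = 1,
                          maf = 0, heterozygosity = 1, Fis = 1) {
  for (i in 1:2) {
    # 1. Markers: drop those with a missing rate >= marker.callrate.
    M <- M[, which(colMeans(is.na(M)) < marker.callrate), drop = FALSE]
    # 2. Individuals: drop those with a missing rate >= ind.callrate.
    M <- M[which(rowMeans(is.na(M)) < ind.callrate), , drop = FALSE]
    # 3. Markers: drop those with a minor allele frequency below maf.
    p <- colMeans(M, na.rm = TRUE) / 2
    M <- M[, which(pmin(p, 1 - p) >= maf), drop = FALSE]
    # 4. Markers: drop those with observed heterozygosity above the maximum.
    Ho <- colMeans(M == 1, na.rm = TRUE)
    M <- M[, which(Ho <= heterozygosity), drop = FALSE]
    # 5. Markers: drop those with Fis above the maximum (assumed 1 - Ho/He).
    p  <- colMeans(M, na.rm = TRUE) / 2
    Ho <- colMeans(M == 1, na.rm = TRUE)
    He <- 2 * p * (1 - p)
    F_val <- ifelse(He > 0, 1 - Ho / He, 0)
    M <- M[, which(F_val <= Fis), drop = FALSE]
  }
  M
}

# Toy data: ind5 is missing at two of four markers, snpA at one individual.
M <- matrix(
  c(NA, 0,  1,  2,
    0,  1,  1,  0,
    1,  2,  0,  1,
    2,  1,  2,  1,
    1,  NA, NA, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("ind", 1:5), paste0("snp", c("A", "B", "C", "D")))
)

out <- filter_sketch(M, marker.callrate = 0.25, ind.callrate = 0.5)
dim(out)
colnames(out)
```

In this toy case the second pass matters: ind5 is removed in the first pass, which raises the missing rate of snpA from 0.20 to 0.25, so snpA is only removed when the marker filter runs again.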
Value
A list with the following elements:
M.clean: the cleaned M matrix after the quality control filters have been applied.
map: if provided, a cleaned map data frame after the quality control filters have been applied.
plot.missing.ind: a plot of missing data per individual (original marker matrix).
plot.missing.SNP: a plot of missing data per SNP (original marker matrix).
plot.heteroz: a plot of observed heterozygosity per SNP (original marker matrix).
plot.Fis: a plot of Fis per SNP (original marker matrix).
plot.maf: a plot of the minor allele frequency (original marker matrix).
Examples
library(ASRgenomics)

# Example: Pine dataset from ASRgenomics (coded as 0,1,2 with missing as -9).
M.clean <- qc.filtering(
  M = geno.pine926,
  maf = 0.05,
  marker.callrate = 0.9, ind.callrate = 0.9,
  heterozygosity = 0.9, Fis = 0.6,
  na.string = "-9")
ls(M.clean)
M.clean$M.clean[1:5, 1:5]
dim(M.clean$M.clean)
head(M.clean$map)
M.clean$plot.maf
M.clean$plot.missing.ind
M.clean$plot.missing.SNP
M.clean$plot.heteroz
M.clean$plot.Fis

# Example: Salmon dataset (coded as 0,1,2 with missing as NA).
M.clean <- qc.filtering(
  M = geno.salmon,
  maf = 0.02,
  marker.callrate = 0.10, ind.callrate = 0.20,
  heterozygosity = 0.9, Fis = 0.4)
M.clean$M.clean[1:5, 1:5]
dim(M.clean$M.clean)
head(M.clean$map)
M.clean$plot.maf
M.clean$plot.missing.ind
M.clean$plot.missing.SNP
M.clean$plot.heteroz
M.clean$plot.Fis