check_het_and_miss {plinkQC} | R Documentation |
Identification of individuals with outlying missing genotype or heterozygosity rates
Description
Runs and evaluates results from plink –missing (missing genotype rates
per individual) and plink –het (heterozygosity rates per individual).
Non-systematic failures in genotyping and outlying heterozygosity (hz) rates
per individual are often proxies for DNA sample quality. Larger than expected
heterozygosity can indicate possible DNA contamination.
The mean heterozygosity in PLINK is computed as hz_mean = (N-O)/N, where
N: number of non-missing genotypes and O:observed number of homozygous
genotypes for a given individual.
Mean heterozygosity can differ between populations and SNP genotyping panels.
Within a population and genotyping panel, a reduced heterozygosity rate can
indicate inbreeding - these individuals will then likely be returned by
check_relatedness
as individuals that fail the relatedness
filters. check_het_and_miss
creates a scatter plot with the
individuals' missingness rates on x-axis and their heterozygosity rates on
the y-axis.
Usage
check_het_and_miss(
indir,
name,
qcdir = indir,
imissTh = 0.03,
hetTh = 3,
run.check_het_and_miss = TRUE,
label_fail = TRUE,
highlight_samples = NULL,
highlight_type = c("text", "label", "color", "shape"),
highlight_text_size = 3,
highlight_color = "#c51b8a",
highlight_shape = 17,
highlight_legend = FALSE,
interactive = FALSE,
verbose = FALSE,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
path2plink = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam, name.het and name.imiss. |
qcdir |
[character] /path/to/directory where name.het as returned by plink –het and name.imiss as returned by plink –missing will be saved. Per default qcdir=indir. If run.check_het_and_miss is FALSE, it is assumed that plink –missing and plink –het have been run and qcdir/name.imiss and qcdir/name.het are present. User needs writing permission to qcdir. |
imissTh |
[double] Threshold for acceptable missing genotype rate per individual; has to be proportion between (0,1) |
hetTh |
[double] Threshold for acceptable deviation from mean heterozygosity per individual. Expressed as multiples of standard deviation of heterozygosity (het), i.e. individuals outside mean(het) +/- hetTh*sd(het) will be returned as failing heterozygosity check; has to be larger than 0. |
run.check_het_and_miss |
[logical] Should plink –missing and plink
–het be run to determine genotype missingness and heterozygosity rates; if
FALSE, it is assumed that plink –missing and plink –het have been run and
qcdir/name.imiss and qcdir/name.het are present;
|
label_fail |
[logical] Set TRUE, to add fail IDs as text labels in scatter plot. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_het_imiss); all highlight_samples IIDs have to be present in the IIDs of the name.fam file. |
highlight_type |
[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified.Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_het_imiss) via ggplot2::ggsave(p=p_het_imiss , other_arguments) or pdf(outfile) print(p_het_imiss) dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with makers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with makers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
path2plink |
[character] Absolute path to PLINK executable
(https://www.cog-genomics.org/plink/1.9/) i.e.
plink should be accessible as path2plink -h. The full name of the executable
should be specified: for windows OS, this means path/plink.exe, for unix
platforms this is path/plink. If not provided, assumed that PATH set-up works
and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
Details
check_het_and_miss
wraps around
run_check_missingness
,
run_check_heterozygosity
and
evaluate_check_het_and_miss
.
If run.check_het_and_miss is TRUE, run_check_heterozygosity
and
run_check_missingness
are executed ; otherwise it is assumed
that plink –missing and plink –het have been run externally and
qcdir/name.het and qcdir/name.imiss exist. check_het_and_miss
will fail with missing file error otherwise.
For details on the output data.frame fail_imiss and fail_het, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#imiss and https://www.cog-genomics.org/plink/1.9/formats#het
Value
Named [list] with i) fail_imiss [data.frame] containing FID (Family ID), IID (Within-family ID), MISS_PHENO (Phenotype missing? (Y/N)), N_MISS (Number of missing genotype call(s), not including obligatory missings), N_GENO (Number of potentially valid call(s)), F_MISS (Missing call rate) of individuals failing missing genotype check and ii) fail_het [data.frame] containing FID (Family ID), IID (Within-family ID), O(HOM) (Observed number of homozygotes), E(HOM) (Expected number of homozygotes), N(NM) (Number of non-missing autosomal genotypes), F (Method-of-moments F coefficient estimate) of individuals failing outlying heterozygosity check and iii) p_het_imiss, a ggplot2-object 'containing' a scatter plot with the samples' missingness rates on x-axis and their heterozygosity rates on the y-axis, which can be shown by print(p_het_imiss).
Examples
## Not run:
indir <- system.file("extdata", package="plinkQC")
name <- "data"
path2plink <- "path/to/plink"
# whole dataset
fail_het_miss <- check_het_and_miss(indir=indir, name=name,
interactive=FALSE,path2plink=path2plink)
# subset of dataset with sample highlighting
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
package="plinkQC"))
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")
fail_het_miss <- check_het_and_miss(indir=indir, name=name,
interactive=FALSE,path2plink=path2plink,
remove_individuals=remove_individuals_file,
highlight_samples=highlight_samples[,2], highlight_type = c("text", "shape"))
## End(Not run)