check_ancestry {plinkQC} | R Documentation |
Identification of individuals of divergent ancestry
Description
Runs plink --pca on merged genotypes from individuals to be QCed and
individuals of a reference population of known genotypes, and evaluates the
results. Currently, check_ancestry only supports automatic selection of
individuals of European descent. It uses information from principal
components 1 and 2 returned by plink --pca to find the centre of the European
reference samples (mean(PC1_europeanRef), mean(PC2_europeanRef)). It then
computes the maximum Euclidean distance (maxDist) of the European reference
samples from this centre. All study samples whose Euclidean distance from the
centre exceeds the radius r = europeanTh * maxDist, i.e. that fall outside
the circle of radius r around the centre, are considered non-European and
their IDs are returned as failing the ancestry check.
check_ancestry creates a scatter plot of PC1 versus PC2, colour-coded
for samples of the reference populations and the study population.
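The distance rule described above can be illustrated with a minimal base R sketch. This is not the package's internal code: the data frame pcs, its column isEuropeanRef, and the helper dist_to_centre are hypothetical names, and the PC values are made up for illustration.

```r
# Synthetic PC1/PC2 values (made up); in a real analysis these would come
# from prefixMergedDataset.eigenvec. All names here are hypothetical.
pcs <- data.frame(
  IID = c("ref1", "ref2", "ref3", "ref4", "s1", "s2"),
  PC1 = c(0, 2, 0, 2, 1, 5),
  PC2 = c(0, 0, 2, 2, 1, 5),
  isEuropeanRef = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)
europeanTh <- 1.5

# Centre of the European reference samples: mean of PC1 and mean of PC2
ref <- pcs[pcs$isEuropeanRef, ]
centre <- c(mean(ref$PC1), mean(ref$PC2))

# Euclidean distance of points (pc1, pc2) from the centre
dist_to_centre <- function(pc1, pc2) {
  sqrt((pc1 - centre[1])^2 + (pc2 - centre[2])^2)
}

# Largest distance of any European reference sample from the centre
maxDist <- max(dist_to_centre(ref$PC1, ref$PC2))
radius <- europeanTh * maxDist

# Study samples outside the circle fail the ancestry check
study <- pcs[!pcs$isEuropeanRef, ]
study_dist <- dist_to_centre(study$PC1, study$PC2)
fail_ids <- study$IID[study_dist > radius]
```

In this toy data set, s1 sits at the centre and passes, while s2 lies well beyond the scaled radius and is flagged.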
Usage
check_ancestry(
indir,
name,
qcdir = indir,
prefixMergedDataset,
europeanTh = 1.5,
defaultRefSamples = c("HapMap", "1000Genomes"),
refPopulation = c("CEU", "TSI"),
refSamples = NULL,
refColors = NULL,
refSamplesFile = NULL,
refColorsFile = NULL,
refSamplesIID = "IID",
refSamplesPop = "Pop",
refColorsColor = "Color",
refColorsPop = "Pop",
studyColor = "#2c7bb6",
legend_labels_per_row = 6,
run.check_ancestry = TRUE,
interactive = FALSE,
verbose = verbose,
highlight_samples = NULL,
highlight_type = c("text", "label", "color", "shape"),
highlight_text_size = 3,
highlight_color = "#c51b8a",
highlight_shape = 17,
highlight_legend = FALSE,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
keep_individuals = NULL,
remove_individuals = NULL,
exclude_markers = NULL,
extract_markers = NULL,
path2plink = NULL,
showPlinkOutput = TRUE
)
Arguments
indir |
[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed, name.fam files. |
name |
[character] prefix of plink files, i.e. name.bed, name.bim, name.fam. |
qcdir |
[character] /path/to/directory where prefixMergedDataset.eigenvec results as returned by plink --pca should be saved. Per default qcdir=indir. If run.check_ancestry is FALSE, it is assumed that plink --pca prefixMergedDataset has been run and qcdir/prefixMergedDataset.eigenvec exists. The user needs write permission for qcdir. |
prefixMergedDataset |
[character] Prefix of merged dataset (study and reference samples) used in plink --pca, resulting in prefixMergedDataset.eigenvec. |
europeanTh |
[double] Scaling factor of the radius to be drawn around the centre of the European reference samples, with study samples inside this radius considered to be of European descent and samples outside this radius of non-European descent. The unscaled radius is the maximum Euclidean distance of the European reference samples to their centre; the radius used is europeanTh times this distance. |
defaultRefSamples |
[character] Option to use pre-downloaded individual and population identifiers from either the 1000Genomes or HapMap project. If refSamples and refSamplesFile are not provided, the HapMap identifiers (or the 1000Genomes identifiers, if specified) will be used as default and the function will fail if the reference samples in the prefixMergedDataset do not match these reference samples. If refColors and refColorsFile are not provided, this also sets default colors for the reference populations. |
refPopulation |
[vector] Vector with population identifiers of the European reference population. Identifiers have to correspond to population IDs [refColorsPop] in refColorsFile/refColors. |
refSamples |
[data.frame] Dataframe with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsFile/refColors. Either refSamples or refSamplesFile has to be specified, unless the default identifiers selected via defaultRefSamples are used. |
refColors |
[data.frame, optional] Dataframe with population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If not provided and is.null(refColorsFile), default colors are used. |
refSamplesFile |
[character] /path/to/File/with/reference samples. Needs columns with sample identifiers [refSamplesIID] corresponding to IIDs in prefixMergedDataset.eigenvec and population identifier [refSamplesPop] corresponding to population IDs [refColorsPop] in refColorsFile/refColors. |
refColorsFile |
[character, optional] /path/to/File/with/Population/Colors containing population IDs in column [refColorsPop] and corresponding colour-code for PCA plot in column [refColorsColor]. If not provided and is.null(refColors), default colors are used. |
refSamplesIID |
[character] Column name of reference sample IDs in refSamples/refSamplesFile. |
refSamplesPop |
[character] Column name of reference sample population IDs in refSamples/refSamplesFile. |
refColorsColor |
[character] Column name of population colors in refColors/refColorsFile. |
refColorsPop |
[character] Column name of reference sample population IDs in refColors/refColorsFile. |
studyColor |
[character] Colour to be used for study population. |
legend_labels_per_row |
[integer] Number of population names per row in the legend of the PCA plot. |
run.check_ancestry |
[logical] Should plink --pca be run to determine principal components of the merged dataset? If FALSE, it is assumed that plink --pca has been run successfully and qcdir/prefixMergedDataset.eigenvec is present. |
interactive |
[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_ancestry) via ggplot2::ggsave(plot=p_ancestry, other_arguments) or pdf(outfile); print(p_ancestry); dev.off(). |
verbose |
[logical] If TRUE, progress info is printed to standard out. |
highlight_samples |
[character vector] Vector of sample IIDs to highlight in the plot (p_ancestry); all highlight_samples IIDs have to be present in the IIDs of the prefixMergedDataset.fam file. |
highlight_type |
[character] Type of sample highlight: labeling by IID ("text"/"label") and/or highlighting data points in a different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text") or label boxes ("label"). Only one of "text" and "label" can be specified. Text/label size can be specified with highlight_text_size, highlight color with highlight_color, and highlight shape with highlight_shape. |
highlight_text_size |
[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type). |
highlight_color |
[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type). |
highlight_shape |
[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec |
highlight_legend |
[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape". |
legend_text_size |
[integer] Size for legend text. |
legend_title_size |
[integer] Size for legend title. |
axis_text_size |
[integer] Size for axis text. |
axis_title_size |
[integer] Size for axis title. |
title_size |
[integer] Size for plot title. |
keep_individuals |
[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
remove_individuals |
[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals. |
exclude_markers |
[character] Path to file with markers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
extract_markers |
[character] Path to file with markers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers. |
path2plink |
[character] Absolute path to PLINK executable (https://www.cog-genomics.org/plink/1.9/), i.e. plink should be accessible as path2plink -h. The full name of the executable should be specified: for Windows, this means path/plink.exe; for Unix platforms, this is path/plink. If not provided, it is assumed that the PATH set-up works and PLINK will be found by |
showPlinkOutput |
[logical] If TRUE, plink log and error messages are printed to standard out. |
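As an illustration of the inputs described above, the following base R sketch builds refSamples and refColors data frames using the default column names ("IID", "Pop", "Color"), checks that the population IDs are consistent, and writes a two-column individuals file. The sample IDs, family IDs and colour codes are invented for illustration, and the file is written without a header on the assumption that PLINK expects plain FID/IID columns.

```r
# Reference samples: one row per sample, default column names IID and Pop.
# The sample IDs below are invented placeholders.
refSamples <- data.frame(
  IID = c("refA", "refB", "refC"),
  Pop = c("CEU", "TSI", "YRI"),
  stringsAsFactors = FALSE
)

# Colours per population: default column names Pop and Color (arbitrary hex codes)
refColors <- data.frame(
  Pop = c("CEU", "TSI", "YRI"),
  Color = c("#1b9e77", "#d95f02", "#7570b3"),
  stringsAsFactors = FALSE
)

refPopulation <- c("CEU", "TSI")

# Every population used for samples, and every European reference
# population, should have an entry in refColors
missing_colors <- setdiff(refSamples$Pop, refColors$Pop)
missing_ref <- setdiff(refPopulation, refColors$Pop)
stopifnot(length(missing_colors) == 0, length(missing_ref) == 0)

# Individuals file: family ID in column 1, within-family ID in column 2,
# space-delimited (assumed: no header line)
keep <- data.frame(FID = c("fam1", "fam2"), IID = c("ind1", "ind2"),
                   stringsAsFactors = FALSE)
keep_file <- tempfile(fileext = ".txt")
write.table(keep, keep_file, quote = FALSE, row.names = FALSE,
            col.names = FALSE)
```

The path in keep_file could then be supplied as keep_individuals (or, with the same layout, as remove_individuals).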
Value
Named [list] with i) fail_ancestry, a [data.frame] with FID and IID of non-European individuals and ii) p_ancestry, a ggplot2 object containing a scatter plot of PC1 versus PC2 colour-coded for samples of the reference populations and the study population, which can be shown by print(p_ancestry).
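One possible way to use the returned list is to write the failing individuals to a two-column file. The sketch below uses a hand-made stand-in for the return value (the object result and its contents are invented, and p_ancestry is left as a NULL placeholder rather than a real plot), so it runs without PLINK or ggplot2.

```r
# Stand-in for the described return value; IDs are invented and
# p_ancestry is only a placeholder for the ggplot2 object.
result <- list(
  fail_ancestry = data.frame(FID = c("fam7", "fam9"),
                             IID = c("ind7", "ind9"),
                             stringsAsFactors = FALSE),
  p_ancestry = NULL
)

# Number of individuals flagged as non-European
n_fail <- nrow(result$fail_ancestry)

# Write FID and IID as space-delimited columns (assumed: no header line)
remove_file <- tempfile(fileext = ".txt")
write.table(result$fail_ancestry[, c("FID", "IID")], remove_file,
            quote = FALSE, row.names = FALSE, col.names = FALSE)
```

A file in this layout matches the format described for remove_individuals, so it could be reused to exclude these samples in a later run.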
Examples
## Not run:
indir <- system.file("extdata", package="plinkQC")
name <- "data"
# evaluate existing plink --pca results (run.check_ancestry=FALSE)
fail_ancestry <- check_ancestry(indir=indir, name=name,
    refSamplesFile=file.path(indir, "HapMap_ID2Pop.txt"),
    refColorsFile=file.path(indir, "HapMap_PopColors.txt"),
    prefixMergedDataset="data.HapMapIII", interactive=FALSE,
    run.check_ancestry=FALSE)
# highlight samples: second column of the file holds the IIDs
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
    package="plinkQC"))
fail_ancestry <- check_ancestry(indir=indir, name=name,
    refSamplesFile=file.path(indir, "HapMap_ID2Pop.txt"),
    refColorsFile=file.path(indir, "HapMap_PopColors.txt"),
    prefixMergedDataset="data.HapMapIII", interactive=FALSE,
    highlight_samples = highlight_samples[,2],
    run.check_ancestry=FALSE,
    highlight_type = c("text", "shape"))
## End(Not run)
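When calling check_ancestry with run.check_ancestry=FALSE, it may be useful to confirm beforehand that qcdir/prefixMergedDataset.eigenvec exists and can be read. The sketch below writes a tiny stand-in file to a temporary directory so that it runs on its own. The layout (FID, IID, then one column per principal component, whitespace-delimited, no header) is an assumption about PLINK 1.9 default output, and the values are invented.

```r
# Hypothetical locations; replace with the real qcdir and prefix
qcdir <- tempdir()
prefixMergedDataset <- "data.HapMapIII"
eigenvec_file <- file.path(qcdir, paste0(prefixMergedDataset, ".eigenvec"))

# Stand-in file with invented values, only so this sketch is runnable
writeLines(c("fam1 ind1 0.01 -0.02",
             "fam2 ind2 0.03 0.04"), eigenvec_file)

# Stop early with a clear message if the PCA results are missing
if (!file.exists(eigenvec_file)) {
  stop("Missing PCA results: ", eigenvec_file)
}

# Assumed layout: FID, IID, PC1, PC2, ... without a header line
eigenvec <- read.table(eigenvec_file, header = FALSE,
                       stringsAsFactors = FALSE)
colnames(eigenvec) <- c("FID", "IID",
                        paste0("PC", seq_len(ncol(eigenvec) - 2)))
```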