cleanData {plinkQC} R Documentation

Create plink dataset with individuals and markers passing quality control

Description

Individuals that fail per-individual QC and markers that fail per-marker QC are removed from indir/name.bim/.bed/.fam, and a new dataset with the remaining individuals and markers is created as qcdir/name.clean.bim/.bed/.fam.
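The naming convention described above can be sketched in base R. The directory names are placeholders, and plink_files is a hypothetical helper written for this illustration; it is not part of plinkQC.

```r
# Hypothetical helper: build the three PLINK file paths for a given prefix.
plink_files <- function(dir, prefix) {
  file.path(dir, paste0(prefix, c(".bim", ".bed", ".fam")))
}

indir <- "/path/to/indir"   # placeholder input directory
qcdir <- "/path/to/qcdir"   # placeholder output directory
name  <- "data"

# Files read from indir and files expected in qcdir after cleaning.
input_files  <- plink_files(indir, name)
output_files <- plink_files(qcdir, paste0(name, ".clean"))
print(output_files)
```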

Usage

cleanData(
  indir,
  name,
  qcdir = indir,
  filterSex = TRUE,
  filterHeterozygosity = TRUE,
  filterSampleMissingness = TRUE,
  filterAncestry = TRUE,
  filterRelated = TRUE,
  filterSNPMissingness = TRUE,
  lmissTh = 0.01,
  filterHWE = TRUE,
  hweTh = 1e-05,
  filterMAF = TRUE,
  macTh = 20,
  mafTh = NULL,
  path2plink = NULL,
  verbose = FALSE,
  keep_individuals = NULL,
  remove_individuals = NULL,
  exclude_markers = NULL,
  extract_markers = NULL,
  showPlinkOutput = TRUE
)

Arguments

indir

[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed and name.fam.

name

[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam.

qcdir

[character] /path/to/directory where results will be written to. If perIndividualQC was conducted, this directory should be the same as qcdir specified in perIndividualQC, i.e. it contains name.fail.IDs with IIDs of individuals that failed QC. The user needs write permission for qcdir. By default, qcdir=indir.

filterSex

[logical] Set to exclude samples that failed the sex check (via check_sex or perIndividualQC). Requires file qcdir/name.fail-sexcheck.IDs (automatically created by perIndividualQC if do.evaluate_check_sex set to TRUE).

filterHeterozygosity

[logical] Set to exclude samples that failed check for outlying heterozygosity rates (via check_het_and_miss or perIndividualQC). Requires file qcdir/name.fail-het.IDs (automatically created by perIndividualQC if do.evaluate_check_het_and_miss set to TRUE).

filterSampleMissingness

[logical] Set to exclude samples that failed check for excessive missing genotype rates (via check_het_and_miss or perIndividualQC). Requires file qcdir/name.fail-imiss.IDs (automatically created by perIndividualQC if do.evaluate_check_het_and_miss set to TRUE).

filterAncestry

[logical] Set to exclude samples that failed ancestry check (via check_ancestry or perIndividualQC). Requires file qcdir/name.fail-ancestry.IDs (automatically created by perIndividualQC if do.check_ancestry set to TRUE).

filterRelated

[logical] Set to exclude samples that failed relatedness check (via check_relatedness or perIndividualQC). Requires file qcdir/name.fail-IBD.IDs (automatically created by perIndividualQC if do.evaluate_check_relatedness set to TRUE).
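Each of the five per-individual filters above names a fail-ID file it requires in qcdir. A minimal sketch of checking for those files before calling cleanData follows; check_fail_files is a hypothetical helper, not a plinkQC function, and the demonstration uses a scratch directory with an empty placeholder file.

```r
# Suffixes of the fail-ID files named in the argument descriptions above.
fail_suffixes <- c(
  filterSex               = ".fail-sexcheck.IDs",
  filterHeterozygosity    = ".fail-het.IDs",
  filterSampleMissingness = ".fail-imiss.IDs",
  filterAncestry          = ".fail-ancestry.IDs",
  filterRelated           = ".fail-IBD.IDs"
)

# Hypothetical helper: report which fail-ID files are present in qcdir.
check_fail_files <- function(qcdir, name) {
  paths <- file.path(qcdir, paste0(name, fail_suffixes))
  present <- file.exists(paths)
  names(present) <- names(fail_suffixes)
  present
}

# Demonstration in a scratch directory holding a single (empty) fail-ID file.
demo_dir <- file.path(tempdir(), "fail_file_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "data.fail-het.IDs"))

available <- check_fail_files(demo_dir, "data")
print(available)
```

Since each filter is documented as requiring its file, a filter whose file is absent would presumably need to be set to FALSE in the cleanData call.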

filterSNPMissingness

[logical] Set to exclude markers that have excessive missing rates across samples (via check_snp_missingness or perMarkerQC). Requires lmissTh to be set.

lmissTh

[double] Threshold for acceptable variant missing rate across samples.

filterHWE

[logical] Set to exclude markers that fail HWE exact test (via check_hwe or perMarkerQC). Requires hweTh to be set.

hweTh

[double] Significance threshold for deviation from HWE.

filterMAF

[logical] Set to exclude markers that fail minor allele frequency or minor allele count threshold (via check_maf or perMarkerQC). Requires mafTh or macTh to be set.

macTh

[double] Threshold for minor allele count cut-off. If both mafTh and macTh are specified, macTh is used (macTh = mafTh*2*NrSamples).

mafTh

[double] Threshold for minor allele frequency cut-off.
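The relation macTh = mafTh*2*NrSamples stated above can be worked through in a few lines. The sample size is made up for illustration, and maf_to_mac is a hypothetical helper, not part of plinkQC.

```r
# Convert a minor allele frequency threshold into the equivalent minor
# allele count threshold: macTh = mafTh * 2 * NrSamples.
maf_to_mac <- function(mafTh, n_samples) {
  mafTh * 2 * n_samples
}

n_samples <- 1000                       # made-up number of samples
mac_equivalent <- maf_to_mac(0.01, n_samples)
print(mac_equivalent)
```

Under this relation, a minor allele count threshold of 20 matches a minor allele frequency threshold of 0.01 only for a dataset of 1000 samples; for other sample sizes the two thresholds differ.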

path2plink

[character] Absolute path to PLINK executable (https://www.cog-genomics.org/plink/1.9/), i.e. plink should be accessible as path2plink -h. The full name of the executable should be specified: on Windows, this means path/plink.exe; on Unix platforms, path/plink. If not provided, it is assumed that the PATH is set up so that PLINK will be found by exec('plink').
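One way to check, before running any QC step, whether a PLINK executable can be located is sketched below using base R only. find_plink is a hypothetical helper, and it does not reproduce how plinkQC itself resolves the executable.

```r
# Hypothetical helper: return an explicit path if it exists, otherwise look
# the executable up on the PATH; NULL means nothing was found.
find_plink <- function(path2plink = NULL, exe = "plink") {
  if (!is.null(path2plink)) {
    if (file.exists(path2plink)) return(path2plink)
    return(NULL)
  }
  on_path <- unname(Sys.which(exe))
  if (nzchar(on_path)) on_path else NULL
}

# A path that does not exist yields NULL.
not_found <- find_plink(file.path(tempdir(), "no_such_plink"))
print(is.null(not_found))
```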

verbose

[logical] If TRUE, progress info is printed to standard out.

keep_individuals

[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals.

remove_individuals

[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals.

exclude_markers

[character] Path to file with markers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers.

extract_markers

[character] Path to file with markers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers.
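The file formats described for the four filtering arguments above can be produced with base R. The sketch below writes one sample file and one variant file to a temporary directory; all IDs and file names are made up for illustration.

```r
# Made-up sample and variant IDs, for illustration only.
samples_to_remove <- data.frame(
  FID = c("FAM1", "FAM2"),
  IID = c("IND1", "IND2")
)
variants_to_exclude <- c("rs0000001", "rs0000002")

# Sample file: family ID in column 1, within-family ID in column 2,
# space-delimited, without header or quotes.
remove_file <- file.path(tempdir(), "remove_individuals.txt")
write.table(samples_to_remove, remove_file, quote = FALSE,
            row.names = FALSE, col.names = FALSE, sep = " ")

# Variant file: one variant ID per line.
exclude_file <- file.path(tempdir(), "exclude_markers.txt")
writeLines(variants_to_exclude, exclude_file)

print(readLines(remove_file))
```

The resulting paths could then be passed as remove_individuals = remove_file and exclude_markers = exclude_file; files for keep_individuals and extract_markers follow the same two layouts.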

showPlinkOutput

[logical] If TRUE, plink log and error messages are printed to standard out.

Value

Named [list] with i) passIDs, containing a [data.frame] with family [FID] and individual [IID] IDs of samples that pass the QC, ii) failIDs, containing a [data.frame] with family [FID] and individual [IID] IDs of samples that fail the QC.
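A short sketch of how the returned list might be inspected is given below. It uses a mock object with the documented structure rather than real cleanData output; the IDs are made up, and the column names FID and IID are assumed from the description above.

```r
# Mock return value with the documented structure (made-up IDs).
ids_all <- list(
  passIDs = data.frame(FID = c("FAM1", "FAM2", "FAM3"),
                       IID = c("IND1", "IND2", "IND3")),
  failIDs = data.frame(FID = "FAM4", IID = "IND4")
)

# Count passing and failing samples and derive the failure fraction.
n_pass <- nrow(ids_all$passIDs)
n_fail <- nrow(ids_all$failIDs)
fraction_failed <- n_fail / (n_pass + n_fail)
print(c(pass = n_pass, fail = n_fail))
```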

Examples

package.dir <- find.package('plinkQC')
indir <- file.path(package.dir, 'extdata')
qcdir <- tempdir()
name <- "data"
path2plink <- '/path/to/plink'
# the following code is not run on package build, as the path2plink on the
# user system is not known.
## Not run: 
# Run qc on all samples and markers in the dataset
## Run individual QC checks
fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name,
refSamplesFile=paste(qcdir, "/HapMap_ID2Pop.txt",sep=""),
refColorsFile=paste(qcdir, "/HapMap_PopColors.txt", sep=""),
prefixMergedDataset="data.HapMapIII", interactive=FALSE, verbose=FALSE,
path2plink=path2plink)

## Run marker QC checks
fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name,
path2plink=path2plink)

## Create new dataset of individuals and markers passing QC
ids_all <- cleanData(indir=indir, qcdir=qcdir, name=name, macTh=15,
verbose=TRUE, path2plink=path2plink, filterAncestry=FALSE,
filterRelated=TRUE)

# Run qc on subset of samples and markers in the dataset
highlight_samples <- read.table(system.file("extdata", "keep_individuals",
package="plinkQC"))
remove_individuals_file <- system.file("extdata", "remove_individuals",
package="plinkQC")

fail_individuals <- perIndividualQC(indir=indir, qcdir=qcdir, name=name,
dont.check_ancestry = TRUE, interactive=FALSE, verbose=FALSE,
highlight_samples = highlight_samples[,2], highlight_type = "label",
remove_individuals = remove_individuals_file, path2plink=path2plink)

## Run marker QC checks
fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name,
path2plink=path2plink)

## Create new dataset of individuals and markers passing QC
ids_all <- cleanData(indir=indir, qcdir=qcdir, name=name, macTh=15,
verbose=TRUE, path2plink=path2plink, filterAncestry=FALSE,
remove_individuals = remove_individuals_file)

## End(Not run)

[Package plinkQC version 0.3.4 Index]