vcftable {vcfppR} | R Documentation |
Read VCF/BCF contents into an R data structure
Description
The swiss army knife for reading VCF/BCF into R data types rapidly and easily.
Usage
vcftable(
vcffile,
region = "",
samples = "-",
vartype = "all",
format = "GT",
ids = NULL,
qual = 0,
pass = FALSE,
info = TRUE,
collapse = TRUE,
setid = FALSE,
mac = 0
)
Arguments
vcffile |
path to the VCF/BCF file |
region |
region to subset in bcftools-like style: "chr1", "chr1:1-10000000" |
samples |
samples to subset in bcftools-like style. comma separated list of samples to include (or exclude with "^" prefix). e.g. "id01,id02", "^id01,id02". |
vartype |
restrict to a specific type of variants. supports "snps", "indels", "sv", "multisnps", "multiallelics". |
format |
the FORMAT tag to extract. default is "GT". |
ids |
character vector. restrict to sites with ID in the given vector. default NULL won't filter any sites. |
qual |
numeric. restrict to variants with QUAL > qual. |
pass |
logical. restrict to variants with FILTER = "PASS". |
info |
logical. whether to keep the INFO column in the returned list. set to FALSE to drop it. |
collapse |
logical. It acts on the FORMAT. If the FORMAT to extract is "GT", the dim of the raw genotype matrix for diploid samples is (M, 2 * N), where M is #markers and N is #samples. default TRUE will collapse the genotypes for each sample such that the matrix is (M, N). Set this to FALSE if one wants to maintain the phasing order, e.g. "1|0" is parsed as c(1, 0) with collapse=FALSE. If the FORMAT to extract is not "GT", then with collapse=TRUE it will try to turn the list of extracted vectors into a matrix. However, this raises issues when a variant is multiallelic, resulting in more values than for other variants. |
setid |
logical. reset ID column as CHR_POS_REF_ALT. |
mac |
integer. restrict to variants with minor allele count higher than the value. |
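The shape change described for collapse can be illustrated with a small base R sketch. It does not call vcfppR. The toy matrix, the adjacent-column layout of each sample's two alleles, and the rule of summing the two alleles per sample are assumptions made for illustration, not a statement of how vcftable is implemented.

```r
# Hypothetical toy data: M = 3 markers (rows), N = 2 diploid samples.
# Assumed column order: sample1-allele1, sample1-allele2, sample2-allele1, sample2-allele2.
haps <- matrix(
  c(0L, 1L, 1L, 1L,
    0L, 0L, 1L, 0L,
    1L, 1L, 0L, 0L),
  nrow = 3, byrow = TRUE
)

# Assumed collapsing rule: add each sample's two allele columns together.
first_allele  <- seq(1, ncol(haps), by = 2)
second_allele <- seq(2, ncol(haps), by = 2)
collapsed <- haps[, first_allele, drop = FALSE] + haps[, second_allele, drop = FALSE]

dim(haps)       # (M, 2 * N)
dim(collapsed)  # (M, N)
```

The uncollapsed matrix keeps the allele order within each sample, which is why collapse=FALSE is the option to use when phasing matters.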
Details
vcftable uses the C++ API of vcfpp, which is a wrapper of htslib, to read VCF/BCF files.
Thus, it has the full functionality of htslib, such as restricting to specific variant types,
samples and regions. For memory efficiency, vcftable is designed
to parse only one tag at a time in the FORMAT column of the VCF. By default, only the matrix of genotypes,
i.e. the "GT" tag, is returned by vcftable, but many other tags are supported by the format option.
Value
Return a list containing the following components:
- samples: character vector; the sample ids in the VCF file after subsetting
- chr: character vector; the CHR column in the VCF file
- pos: character vector; the POS column in the VCF file
- id: character vector; the ID column in the VCF file
- ref: character vector; the REF column in the VCF file
- alt: character vector; the ALT column in the VCF file
- qual: character vector; the QUAL column in the VCF file
- filter: character vector; the FILTER column in the VCF file
- info: character vector; the INFO column in the VCF file
- format: matrix of either integer or numeric values depending on the tag to extract; the specified tag in the FORMAT column to be extracted
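Because the per-site components listed above are parallel vectors, they can be combined into a site table. The following is a minimal sketch that assumes vcfppR is installed and uses the example file shipped with the package. It prints the component names rather than assuming what the FORMAT matrix is called in the returned list.

```r
if (requireNamespace("vcfppR", quietly = TRUE)) {
  vcffile <- system.file("extdata", "raw.gt.vcf.gz", package = "vcfppR")
  res <- vcfppR::vcftable(vcffile, "chr21:1-5050000", vartype = "snps")

  # One row per site, built from the documented per-site components.
  sites <- data.frame(chr = res$chr, pos = res$pos, ref = res$ref, alt = res$alt,
                      stringsAsFactors = FALSE)
  print(head(sites))

  # Inspect the names of all returned components.
  print(names(res))
}
```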
Author(s)
Zilong Li zilong.dk@gmail.com
Examples
library('vcfppR')
vcffile <- system.file("extdata", "raw.gt.vcf.gz", package="vcfppR")
res <- vcftable(vcffile, "chr21:1-5050000", vartype = "snps")
str(res)
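A further sketch, under the same assumptions (vcfppR installed, bundled example file), combines the samples and collapse arguments. It first reads the sample ids, then excludes the first one with the "^" prefix and keeps the allele order by setting collapse = FALSE. The guard on the number of samples is a precaution, since the content of the example file is not described here.

```r
if (requireNamespace("vcfppR", quietly = TRUE)) {
  vcffile <- system.file("extdata", "raw.gt.vcf.gz", package = "vcfppR")
  all_samples <- vcfppR::vcftable(vcffile, "chr21:1-5050000", vartype = "snps")$samples

  if (length(all_samples) > 1) {
    # Exclude the first sample and keep the two alleles of each genotype separate.
    sub <- vcfppR::vcftable(vcffile,
                            region = "chr21:1-5050000",
                            samples = paste0("^", all_samples[1]),
                            vartype = "snps",
                            collapse = FALSE)
    str(sub)
  }
}
```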