VCF2RADdata {polyRAD} | R Documentation |
Create a RADdata Object from a VCF File
Description
This function reads a Variant Call Format (VCF) file containing allelic read depth
and SNP alignment positions, such as can be produced by TASSEL or GATK, and
generates a RADdata
dataset to be used for genotype calling in
polyRAD.
Usage
VCF2RADdata(file, phaseSNPs = TRUE, tagsize = 80, refgenome = NULL,
tol = 0.01, al.depth.field = "AD", min.ind.with.reads = 200,
min.ind.with.minor.allele = 10, possiblePloidies = list(2),
taxaPloidy = 2L, contamRate = 0.001,
samples = VariantAnnotation::samples(VariantAnnotation::scanVcfHeader(file)),
svparam = VariantAnnotation::ScanVcfParam(fixed = "ALT", info = NA,
geno = al.depth.field,
samples = samples),
yieldSize = 5000, expectedAlleles = 5e+05, expectedLoci = 1e+05,
maxLoci = NA)
Arguments
file |
The path to a VCF file to be read. This can be uncompressed, bgzipped using
Samtools or Bioconductor, or a |
phaseSNPs |
If |
tagsize |
The read length, minus any barcode sequence, that was used for genotyping. In TASSEL,
this is the same as the kmerLength option. This argument is used for grouping
SNPs into haplotypes and is ignored if |
refgenome |
Optional. The name of a FASTA file, or an |
tol |
The proportion by which two SNPs can differ in read depth and still be merged
into one group for phasing. Ignored if |
al.depth.field |
The name of the genotype field in the VCF file that contains read depth at each allele. This should be "AD" unless your format is very unusual. |
min.ind.with.reads |
Integer used for filtering SNPs. To be retained, a SNP must have at least this many samples with reads. |
min.ind.with.minor.allele |
Integer used for filtering SNPs. To be retained, a SNP must have at least this many samples with the minor allele. When there are more than two alleles, at least two alleles must have at least this many samples with reads for the SNP to be retained. |
possiblePloidies |
A list indicating inheritance modes that might be encountered in the
dataset. See |
taxaPloidy |
A single integer, or an integer vector with one value per taxon, indicating
ploidy. See |
contamRate |
A number indicating the expected sample cross-contamination rate. See
|
samples |
A character vector containing the names of samples from the file to
export to the |
svparam |
A |
yieldSize |
An integer indicating the number of lines of the file to read at once. Increasing this number will make the function faster but consume more RAM. |
expectedAlleles |
An integer indicating the approximate number of alleles that are expected to be imported after filtering and phasing. If this number is too low, the function may slow down considerably. Increasing this number increases the amount of RAM used by the function. |
expectedLoci |
An integer indicating the approximate number of loci that are expected to be imported after filtering and phasing. If this number is too low, the function may slow down considerably. Increasing this number increases the amount of RAM used by the function. |
maxLoci |
An integer indicating the approximate maximum number of loci to return. If
provided, the function will stop reading the file once it has found at least
this many loci that pass filtering and phasing. This argument is intended to
be used for generating small |
Details
This function requires the BioConductor package VariantAnnotation. See https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html for installation instructions.
If you anticipate running VCF2RADdata
on the same file more than once,
it is recommended to run bgzip
and indexTabix
from the package
Rsamtools once before running VCF2RADdata
. See examples.
min.ind.with.minor.allele
is used for filtering SNPs as the VCF file is
read. Additionally, because phasing SNPs into haplotypes can cause some
haplotypes to fail to pass this threshold, VCF2RADdata
internally runs
MergeRareHaplotypes
with
min.ind.with.haplotype = min.ind.with.minor.allele
, then
RemoveMonomorphicLoci
, before returning the
final RADdata
object.
Value
A RADdata
object.
Note
In the python
directory of the polyRAD installation, there is a
script called tassel_vcf_tags.py
that can identify the full tag
sequence(s) for every allele imported by VCF2RADdata
.
Author(s)
Lindsay V. Clark
References
Variant Call Format specification: http://samtools.github.io/hts-specs/
TASSEL GBSv2 pipeline: https://bitbucket.org/tasseladmin/tassel-5-source/wiki/Tassel5GBSv2Pipeline
GATK: https://gatk.broadinstitute.org/hc/en-us
Tassel4-Poly: https://github.com/guilherme-pereira/tassel4-poly
See Also
MakeTasselVcfFilter
for filtering to a smaller VCF file before
reading with VCF2RADdata
.
To export to VCF: RADdata2VCF
Other data import functions: readStacks
, readHMC
,
readTagDigger
, readTASSELGBSv2
,
readProcessIsoloci
, readDArTag
Examples
# get the example VCF installed with polyRAD
exampleVCF <- system.file("extdata", "Msi01genes.vcf", package = "polyRAD")
# loading VariantAnnotation namespace takes >10s,
# so is excluded from CRAN checks
require(VariantAnnotation)
# Compress and index the VCF before reading, if not already done
if(!file.exists(paste(exampleVCF, "bgz", sep = "."))){
vcfBG <- bgzip(exampleVCF)
indexTabix(vcfBG, "vcf")
}
# Read into RADdata object
myRAD <- VCF2RADdata(exampleVCF, expectedLoci = 100, expectedAlleles = 500)
# Example of subsetting by genomic region (first 200 kb on Chr01)
mysv <- ScanVcfParam(fixed = "ALT", info = NA, geno = "AD",
samples = samples(scanVcfHeader(exampleVCF)),
which = GRanges("01", IRanges(1, 200000)))
myRAD2 <- VCF2RADdata(exampleVCF, expectedLoci = 100, expectedAlleles = 500,
svparam = mysv, yieldSize = NA_integer_)