vcf2diem {diemr} | R Documentation
Convert vcf files to diem format
Description
Reads vcf files and writes genotypes of the most frequent alleles based on chromosome positions to diem format.
Usage
vcf2diem(SNP, filename, chunk = 1L, requireHomozygous = TRUE)
Arguments
SNP | character vector with a path to the '.vcf' or '.vcf.gz' file, or an object of class vcfR.
filename | character vector with a path where the converted genotypes will be saved.
chunk | numeric indicating how many files the result should be split into, or how many markers each file should contain (see Details).
requireHomozygous | logical indicating whether to require the marker to have at least one homozygous individual for each allele.
Details
Importing vcf files larger than 1GB, or those containing multiallelic
genotypes, into R is not recommended. Instead, provide the path to the
vcf file in SNP. vcf2diem then reads the file line by line, which is
the preferred approach to data conversion, especially for
very large and complex genomic datasets.
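As an illustration of line-by-line reading, the following base R sketch streams a small vcf file through a connection instead of loading it whole. It is not the code vcf2diem uses internally; the file contents and the variable names (vcf_path, n_markers, positions) are made up for the example.

```r
# Write a tiny, made-up vcf to a temporary file so the sketch is self-contained.
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2",
  "chr1\t101\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t1/1",
  "chr1\t250\t.\tC\tT\t40\tPASS\t.\tGT\t0/1\t0/0"
)
vcf_path <- tempfile(fileext = ".vcf")
writeLines(vcf_lines, vcf_path)

# Stream the file one line at a time; only the current line is held in memory.
con <- file(vcf_path, open = "r")
n_markers <- 0L
positions <- character(0)
while (length(line <- readLines(con, n = 1L)) > 0L) {
  if (startsWith(line, "#")) next  # skip meta-information and header lines
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  n_markers <- n_markers + 1L
  # CHROM is the first column and POS the second
  positions <- c(positions, paste(fields[1], fields[2], sep = ":"))
}
close(con)

n_markers
positions
```

Because each line is processed and then discarded, memory use stays roughly constant regardless of the size of the input.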
The number of files vcf2diem creates depends on the chunk argument and the class of the SNP object.
Values of chunk < 100 are interpreted as the number of files into which to split the data in SNP. For a SNP object of class vcfR, the number of markers per file is calculated from the dimensions of SNP. When the class of SNP is character, the number of markers per file is approximated from a model, accompanied by a message. If this number of markers per file is inappropriate for the expected output, provide the intended number of markers per file in chunk, using a value of 100 or more (values greater than 10000 are recommended for genomic data). vcf2diem will scan the whole input file specified in SNP, creating additional output files until the last line in SNP is reached.
Values of chunk >= 100 mean that each output file in diem format will contain chunk lines of the data in SNP.
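The two interpretations of chunk can be summarised in a short base R sketch. The helper markers_per_file is hypothetical and not part of diemr; it assumes the number of markers is known in advance (as for a vcfR object) and that markers are divided evenly with rounding up, which is an assumption about the exact arithmetic.

```r
# Hypothetical helper (not part of diemr): markers per output file implied by chunk.
markers_per_file <- function(chunk, n_markers) {
  stopifnot(chunk >= 1, n_markers >= 1)
  if (chunk < 100) {
    # chunk is the number of output files;
    # rounding up with ceiling() is an assumption of this sketch
    ceiling(n_markers / chunk)
  } else {
    # chunk is the number of markers (lines) per output file
    chunk
  }
}

markers_per_file(chunk = 1, n_markers = 30)      # one file holding all 30 markers
markers_per_file(chunk = 3, n_markers = 30)      # three files, 10 markers each
markers_per_file(chunk = 10000, n_markers = 30)  # 10000 markers per file
```

When SNP is a file path, the number of markers is not known before the file is read, which is why the documented behaviour relies on an approximation in that case.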
When the vcf file contains markers not informative for genome polarisation,
those are removed and listed in a file ending with omittedSites.txt in the
directory specified in the SNP argument or in the working directory.
The omitted loci are identified by their information in the CHROM and POS columns,
and include the QUAL column data. The last column is an integer specifying
the reason why the respective marker was omitted. The reasons why markers are
not informative for genome polarisation using diem are:
Marker has fewer than 2 alleles representing substitutions.
Required homozygous individuals for the 2 most frequent alleles are not present (optional, controlled by the requireHomozygous argument).
The second most frequent allele is found only in one heterozygous individual.
Dataset is invariant for the most frequent allele.
Dataset is invariant for the allele listed as the first ALT in the vcf input.
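The condition controlled by requireHomozygous can be illustrated with a small base R sketch. It is not the diemr implementation: the helper name has_both_homozygotes is hypothetical, and the sketch assumes diploid GT strings in which the two most frequent alleles are coded 0 and 1.

```r
# Hypothetical check (not part of diemr): does a marker have at least one
# homozygous individual for each of the alleles coded "0" and "1"?
has_both_homozygotes <- function(gt) {
  # Split each genotype on the unphased ("/") or phased ("|") separator
  alleles <- strsplit(gt, "[/|]")
  is_hom <- function(a, allele) length(a) == 2L && all(a == allele)
  any(vapply(alleles, is_hom, logical(1), allele = "0")) &&
    any(vapply(alleles, is_hom, logical(1), allele = "1"))
}

has_both_homozygotes(c("0/0", "0/1", "1/1"))  # TRUE: both homozygotes present
has_both_homozygotes(c("0/0", "0/1", "0/1"))  # FALSE: no 1/1 individual
has_both_homozygotes(c("0|0", "./.", "1|1"))  # TRUE: missing genotypes are ignored
```

Under these assumptions, a marker for which the function returns FALSE would be omitted when requireHomozygous = TRUE.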
The CHROM, POS, and QUAL information for loci included in the converted files is listed in the file ending with includedSites.txt. Additional columns show which allele is encoded as 0 in its homozygous state and which is encoded as 2.
Value
No value returned, called for side effects.
Author(s)
Natalia Martinkova
Filip Jagos 521160@mail.muni.cz
Jachym Postulka 506194@mail.muni.cz
Examples
## Not run:
# vcf2diem will write files to a working directory or a specified folder
# make sure the working directory or the folder are at a location with write permission
myofile <- system.file("extdata", "myotis.vcf", package = "diemr")
vcf2diem(SNP = myofile, filename = "test1")
vcf2diem(SNP = myofile, filename = "test2", chunk = 3)
## End(Not run)