gentobd {BinaryDosage}    R Documentation

Convert a gen file to a binary dosage file


Routine to read information from a gen file and create a binary dosage file. Note: This routine can take a long time to run if the gen file is large.


gentobd(
  genfiles,
  snpcolumns = 1L:5L,
  startcolumn = 6L,
  impformat = 3L,
  chromosome = character(),
  header = c(FALSE, TRUE),
  gz = FALSE,
  sep = "\t",
  bdfiles,
  format = 4L,
  subformat = 0L,
  snpidformat = 0L,
  bdoptions = character(0)
)



genfiles: A vector of file names. The first is the name of the gen file. The second is the name of the sample file that contains the subject information.


snpcolumns: Column numbers containing chromosome, snpid, location, reference allele, and alternate allele, respectively. This must be an integer vector. All values must be positive except for the chromosome. The value for the chromosome may be -1 or 0. -1 indicates that the chromosome value is passed to the routine using the chromosome parameter. 0 indicates that the chromosome value is in the snpid and that the snpid has the format chromosome:other_data. Default value is c(1L, 2L, 3L, 4L, 5L).
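As a rough illustration of the 0 case, the chromosome can be recovered from an snpid of the form chromosome:other_data by keeping everything before the first colon. This is a base R sketch with made-up IDs; it is not the package's internal code.

```r
# Hypothetical SNP IDs in the chromosome:other_data form
snpid <- c("1:10000", "1:20000:A:G", "X:5000")

# Drop the first colon and everything after it
chromosome_from_id <- sub(":.*$", "", snpid)
chromosome_from_id
```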


startcolumn: Column number of the first column with genetic probabilities or dosages. Must be an integer value. Default value is 6L.
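The column layout implied by the defaults can be sketched in base R. The two data lines below are invented for illustration, and the variable names snpinfo, genetic, and nsubjects are hypothetical. The sketch only shows how the first five columns and the remaining columns are split; it does not reproduce how gentobd reads the file.

```r
# Made-up gen-style data: 2 SNPs, 2 subjects, 3 probabilities per subject
gen_lines <- c(
  "1\tsnp1\t10000\tA\tG\t1\t0\t0\t0\t1\t0",
  "1\tsnp2\t20000\tC\tT\t0\t0\t1\t0.5\t0.5\t0"
)
gen <- read.table(
  text = gen_lines, sep = "\t",
  colClasses = c("character", "character", "integer",
                 "character", "character", rep("numeric", 6))
)

startcolumn <- 6L
impformat <- 3L

# Columns 1-5: chromosome, snpid, location, reference allele, alternate allele
snpinfo <- gen[, 1:5]
# Columns from startcolumn onward: genetic values for all subjects
genetic <- as.matrix(gen[, startcolumn:ncol(gen)])
# Each subject contributes impformat values per SNP
nsubjects <- ncol(genetic) / impformat
```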


impformat: Number of genetic data values per subject. 1 indicates dosage only, 2 indicates P(g=0) and P(g=1) only, 3 indicates P(g=0), P(g=1), and P(g=2). Default value is 3L.
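The three layouts carry the same information when the probabilities sum to one. The sketch below assumes the usual definition of dosage as the expected alternate-allele count, P(g=1) + 2 * P(g=2). The helper dosage_from_probs is hypothetical and is not part of BinaryDosage.

```r
# Convert one subject's values to a dosage.
# Assumption: dosage = P(g=1) + 2 * P(g=2)
dosage_from_probs <- function(p, impformat) {
  if (impformat == 1L) return(p[1])   # the value is already a dosage
  p0 <- p[1]
  p1 <- p[2]
  # With two values, P(g=2) is whatever probability remains
  p2 <- if (impformat == 3L) p[3] else 1 - p0 - p1
  p1 + 2 * p2
}

d3 <- dosage_from_probs(c(0.1, 0.6, 0.3), 3L)
d2 <- dosage_from_probs(c(0.1, 0.6), 2L)
d1 <- dosage_from_probs(1.2, 1L)
```

All three calls describe the same subject, so they should agree up to floating-point rounding.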


chromosome: Chromosome value to use if the first value of snpcolumns is equal to -1. Default value is character().


header: Indicators of whether the gen and sample files have headers. If the gen file does not have a header, a sample file must be included. Default value is c(FALSE, TRUE).


gz: Indicator of whether the file is compressed using gzip. Default value is FALSE.


sep: Separator used in the gen file. Default value is "\t".


bdfiles: Vector of names of the output files. The binary dosage file name is first. The family and map files follow. For format 4, no family and map file names are needed.


format: The format of the output binary dosage file. Allowed values are 1, 2, 3, and 4. The default value is 4. Using the default value is recommended.


subformat: The subformat of the output binary dosage file. A value of 1 or 3 indicates that only the dosage value is saved. A value of 2 or 4 indicates that the dosage and genetic probabilities will be output. Values of 3 or 4 are only allowed with formats 3 and 4. If a value of zero is provided and genetic probabilities are in the gen file, subformat 2 will be used for formats 1 and 2, and subformat 4 will be used for formats 3 and 4. If the gen file does not contain genetic probabilities, subformat 1 will be used for formats 1 and 2, and subformat 3 will be used for formats 3 and 4. The default value is 0.
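The rule for a subformat of 0 can be written out as a small decision function. This is a sketch of the rule as described here; resolve_subformat and has_probabilities are hypothetical names, not part of the package.

```r
# Pick the subformat actually used, following the description above
resolve_subformat <- function(format, subformat, has_probabilities) {
  if (subformat %in% c(3L, 4L) && !(format %in% c(3L, 4L))) {
    stop("subformats 3 and 4 are only allowed with formats 3 and 4")
  }
  # A nonzero value is taken as given
  if (subformat != 0L) return(as.integer(subformat))
  if (format %in% c(1L, 2L)) {
    if (has_probabilities) 2L else 1L
  } else {
    if (has_probabilities) 4L else 3L
  }
}

resolve_subformat(format = 4L, subformat = 0L, has_probabilities = TRUE)
resolve_subformat(format = 2L, subformat = 0L, has_probabilities = FALSE)
```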


snpidformat: The format in which the SNP ID will be saved. -1 - SNP ID not written. 0 - same as in the gen file. 1 - chromosome:location. 2 - chromosome:location:reference_allele:alternate_allele. If snpidformat is 1 and the gen file uses format 2, an error is generated. Default value is 0.
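The two constructed ID layouts can be illustrated with paste() in base R. The SNP values below are invented, and the sketch shows only the string layout, not how the package stores the IDs.

```r
# Made-up SNP information
chromosome <- c("1", "1")
location   <- c(12345L, 67890L)
reference  <- c("A", "C")
alternate  <- c("G", "T")

# snpidformat 1: chromosome:location
id_format1 <- paste(chromosome, location, sep = ":")
# snpidformat 2: chromosome:location:reference_allele:alternate_allele
id_format2 <- paste(chromosome, location, reference, alternate, sep = ":")
```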


bdoptions: Character array containing any of the following values: "aaf", "maf", "rsq". The presence of any of these values indicates that the specified values should be calculated and stored in the binary dosage file. These values only apply to format 4.
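For orientation, two of these summaries can be sketched from a vector of dosages for one SNP. The sketch assumes that "aaf" means alternate allele frequency and "maf" means minor allele frequency, and it uses the standard estimate of mean dosage divided by two. The package's own calculation may differ in detail, and "rsq" is not sketched here.

```r
# Made-up dosages for one SNP across six subjects
dosage <- c(2, 2, 1, 2, 1.5, 2)

# Assumed definition: alternate allele frequency is half the mean dosage
aaf <- mean(dosage) / 2
# Minor allele frequency is the smaller of the two allele frequencies
maf <- min(aaf, 1 - aaf)
```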




# Find the gen file names
gen3afile <- system.file("extdata", "set3a.imp", package = "BinaryDosage")
gen3asample <- system.file("extdata", "set3a.sample", package = "BinaryDosage")
# Get temporary output file name
bdfiles <- tempfile()
# Convert the file
gentobd(genfiles = c(gen3afile, gen3asample),
        snpcolumns = c(0L, 2L:5L),
        bdfiles = bdfiles)
# Verify the file was written correctly
bdinfo <- getbdinfo(bdfiles = bdfiles)

[Package BinaryDosage version 1.0.0 Index]