ReadMarker {Eagle}    R Documentation

Read marker data.

Description

A function for reading in different types of snp marker data.

Usage

ReadMarker(
  filename = NULL,
  type = "text",
  missing = NULL,
  AA = NULL,
  AB = NULL,
  BB = NULL,
  availmemGb = 16,
  quiet = TRUE
)

Arguments

filename

contains the name of the marker file. The file name needs to be in quotes. If the file is not in the working directory, then the full path to the file is required.

type

specifies the type of file. Choices are 'text' (the default), 'PLINK', and 'vcf'.

missing

the number or character for a missing genotype in the text file. There is no need to specify this for a vcf or PLINK ped file. Missing allele values in a vcf file are coded as "." and missing allele values in a PLINK file must be coded as '0' or '-'.

AA

the character or number corresponding to the 'AA' snp genotype in the marker genotype file. This need only be specified if the file type is 'text'. If a character then it must be in quotes.

AB

the character or number corresponding to the 'AB' snp genotype in the marker genotype file. This need only be specified if the file type is 'text'. This can be left unspecified if there are no heterozygous genotypes (i.e. the individuals are inbred). Only a single heterozygous genotype is allowed ('Eagle' does not distinguish between 'AB' and 'BA'). If specified and a character, it must be in quotes.

BB

the character or number corresponding to the 'BB' snp genotype in the marker genotype file. This need only be specified if the file type is 'text'. If a character, then it must be in quotes.

availmemGb

a numeric value. It specifies the amount of available memory (in Gigabytes). This should be set to be as large as possible for best performance.

quiet

a logical value. If set to FALSE, additional runtime output is printed.

Details

ReadMarker can handle three different types of marker data; namely, genotype data in a plain text file, PLINK ped files, and vcf files.

Reading in a plain text file containing the marker genotypes

To load a text file that contains snp genotypes, run ReadMarker with filename set to the name of the file, and AA, AB, BB set to the corresponding genotype values. The genotype values in the text file can be numeric, character, or a mix of both.

We make the following assumptions: the file is space separated, the rows are the individuals, and the columns are the snp loci.

For example, suppose we have a space separated text file with marker genotype data collected from five snp loci on three individuals, where the snp genotype AA has been coded 0, the snp genotype AB has been coded 1, the snp genotype BB has been coded 2, and missing genotypes are coded as 99:

0 1 2 0 2
1 1 0 2 0
2 2 1 1 99

The file is called geno.txt and is located in the directory /my/dir/.

To load these data, we would use the command

geno_obj <- ReadMarker(filename='/my/dir/geno.txt', AA=0, AB=1, BB=2, type='text', missing=99)

where the results from running the function are placed in geno_obj.
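The example above can be tried end to end with the sketch below. It is only an illustration: the file is written to a temporary directory rather than /my/dir/ (an assumption made so the sketch is self-contained), its layout is checked with base R, and ReadMarker is called only if the Eagle package is installed.

```r
# Write the example genotypes to a temporary file (hypothetical location,
# used here in place of /my/dir/geno.txt).
geno_file <- file.path(tempdir(), "geno.txt")
writeLines(c("0 1 2 0 2",
             "1 1 0 2 0",
             "2 2 1 1 99"), geno_file)

# Sanity check with base R: rows are individuals, columns are snp loci.
raw <- as.matrix(read.table(geno_file, header = FALSE))
dim(raw)                     # number of individuals, number of loci
n_missing <- sum(raw == 99)  # genotypes coded with the missing value

# Call ReadMarker only when Eagle is available.
if (requireNamespace("Eagle", quietly = TRUE)) {
  geno_obj <- Eagle::ReadMarker(filename = geno_file, type = "text",
                                AA = 0, AB = 1, BB = 2, missing = 99)
}
```

Checking the dimensions and the missing value code before calling ReadMarker is a cheap way to catch a wrong separator or an unexpected header line.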

As another example, suppose we have a space separated text file with marker genotype data collected from five snp loci on three individuals, where the snp genotype AA has been coded a/a, the snp genotype AB has been coded a/b, the snp genotype BB has been coded b/b, and missing genotypes are coded as NA:

a/a a/b b/b a/a b/b
a/b a/b a/a b/b a/a
b/b b/b a/b a/b NA

The file is called geno.txt and is located in the same directory from which R is being run (i.e. the working directory).

To load these data, we would use the command

geno_obj <- ReadMarker(filename='geno.txt', AA='a/a', AB='a/b', BB='b/b',
                       type='text', missing='NA')

where the results from running the function are placed in geno_obj.

Reading in a PLINK ped file

PLINK is a well-known toolkit for the analysis of genome-wide association data. See https://www.cog-genomics.org/plink2 for details.

Full details of PLINK ped files can be found at https://www.cog-genomics.org/plink/1.9/formats#ped. Briefly, the PED file is a space delimited file (tabs are not allowed). The first six columns are mandatory:

Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype

Here, these columns can be any values since ReadMarker ignores these columns.

Genotypes (column 7 onwards) can be any character (e.g. 1,2,3,4 or A,C,G,T or anything else) except 0 and -, which denote a missing allele. All markers should be biallelic. All snps must have two alleles specified. Missing alleles (i.e. 0 or -) are allowed. No column headings should be given.

As an example, suppose we have data on three individuals genotyped for four snp loci:

FAM001 101 0 0 1 0 A G C C C G A A
FAM001 201 0 0 2 0 A A C T G G T A
FAM001 300 101 201 2 0 G A T T C G A T

Then to load these data, we would use the command

geno_obj <- ReadMarker(filename='PLINK.ped', type='PLINK')

where geno_obj is used by AM, and the file PLINK.ped is located in the working directory (i.e. the directory from which R is being run).
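As a rough check of the ped layout described above, the following base R sketch writes the three example lines to a temporary file (a hypothetical location, not the working directory used in the text) and confirms that, after the six mandatory columns, each snp contributes two allele columns. It illustrates the file format only and is not how ReadMarker parses the file internally.

```r
# Write the example ped lines to a temporary file (hypothetical location).
ped_file <- file.path(tempdir(), "PLINK.ped")
writeLines(c("FAM001 101 0 0 1 0 A G C C C G A A",
             "FAM001 201 0 0 2 0 A A C T G G T A",
             "FAM001 300 101 201 2 0 G A T T C G A T"), ped_file)

# Split each line on single spaces; a ped file has no header line.
fields   <- strsplit(readLines(ped_file), " ", fixed = TRUE)
n_fields <- lengths(fields)

# Every line must have the same number of fields, and the fields after
# the six mandatory columns must come in allele pairs.
stopifnot(all(n_fields == n_fields[1]), (n_fields[1] - 6) %% 2 == 0)
n_snps <- (n_fields[1] - 6) / 2

# Alleles of the first snp for each individual (columns 7 and 8).
first_snp <- t(sapply(fields, function(x) x[7:8]))

# Call ReadMarker only when Eagle is available.
if (requireNamespace("Eagle", quietly = TRUE)) {
  geno_obj <- Eagle::ReadMarker(filename = ped_file, type = "PLINK")
}
```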

Reading in a vcf file

VCF is a tab separated text file containing meta-information lines, a header line, and data lines. The data lines contain information about a position in the genome.

It is assumed that genotype information has been recorded on samples for each position.

Loci with more than two alleles will be removed automatically.
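To see in advance which loci would be affected, the sketch below flags positions whose ALT field lists more than one alternate allele (alternate alleles are comma separated in a vcf file). The vcf content, sample names, and variable names are all made up for illustration, and this is a base R approximation rather than the filtering code used by ReadMarker.

```r
# A tiny hypothetical vcf: two samples (S1, S2) and three positions.
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
  "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1",
  "1\t200\trs2\tC\tT,G\t.\tPASS\t.\tGT\t0/1\t1/2",
  "1\t300\trs3\tG\tA\t.\tPASS\t.\tGT\t1/1\t./."
)

# Keep only data lines; meta-information and header lines start with '#'.
data_lines <- vcf_lines[!startsWith(vcf_lines, "#")]
cols <- strsplit(data_lines, "\t", fixed = TRUE)

# ALT is column 5; a comma there means more than one alternate allele,
# i.e. the locus has more than two alleles in total.
alt <- vapply(cols, function(x) x[5], character(1))
multi_allelic <- grepl(",", alt, fixed = TRUE)

# IDs (column 3) of the loci that have exactly two alleles.
biallelic_ids <- vapply(cols[!multi_allelic], function(x) x[3], character(1))
```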

Eagle will only accept a single (uncompressed) vcf file. If chromosomal information has been recorded in separate vcf files, these files need to be merged into a single vcf file. This can be done with the "bcftools concat" command from the BCFtools utility set.

Value

To allow Eagle to handle data larger than the memory capacity of a machine, ReadMarker does not load the marker data into memory. Instead, it writes a reformatted version of the marker data, and its transpose, to the hard drive. These two files are temporary and are removed at the end of the R session. ReadMarker returns a list with three elements: tmpM, the full file name (name and path) of the reformatted marker data file; tmpMt, the full file name of the reformatted file for the transpose of the marker data; and dim_of_M, a two-element vector whose first element is the number of individuals and whose second element is the number of marker loci.
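The elements of the returned list can be accessed by name. The sketch below uses a hand-built stand-in list so that it runs without Eagle; the file names M.bin and Mt.bin are invented for illustration and are not the names ReadMarker actually uses.

```r
# Hypothetical stand-in with the element names described above;
# a real object would come from ReadMarker().
geno_obj <- list(tmpM     = file.path(tempdir(), "M.bin"),   # invented name
                 tmpMt    = file.path(tempdir(), "Mt.bin"),  # invented name
                 dim_of_M = c(3, 5))

# Number of individuals and number of marker loci.
n_individuals <- geno_obj$dim_of_M[1]
n_loci        <- geno_obj$dim_of_M[2]

# The reformatted files are temporary, so it can be worth checking that
# they still exist before reusing an object from an earlier session.
files_present <- file.exists(geno_obj$tmpM, geno_obj$tmpMt)
```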

Examples

  #--------------------------------
  #  Example 1
  #-------------------------------
  #
  # Read in the genotype data contained in the text file geno.txt
  #
  # The function system.file() gives the full file name (name + full path).
  complete.name <- system.file('extdata', 'geno.txt', package='Eagle')
  # 
  # The full path and name of the file is
  print(complete.name)
  
  # Here, 0 values are being treated as genotype AA,
  # 1 values are being treated as genotype AB, 
  # and 2 values are being treated as genotype BB. 
  # 4 gigabytes of memory has been specified. 
  # The file is space separated with the rows the individuals
  # and the columns the snp loci.
  geno_obj <- ReadMarker(filename=complete.name, type='text', AA=0, AB=1, BB=2, availmemGb=4) 
   
  # view list contents of geno_obj
  print(geno_obj)

  #--------------------------------
  #  Example 2
  #-------------------------------
  #
  # Read in the allelic data contained in the PLINK ped file geno.ped
  #
  # The function system.file() gives the full file name (name + full path).
  complete.name <- system.file('extdata', 'geno.ped', package='Eagle')

  # 
  # The full path and name of the file is
  print(complete.name)
  
  # Here, the first 6 columns are being ignored and the allelic
  # information in columns 7 - 10002 is being converted into a reformatted file.
  # 4 gigabytes of memory has been specified. 
  # The file is space separated with the rows the individuals
  # and the columns the snp loci.
  geno_obj <- ReadMarker(filename=complete.name, type='PLINK', availmemGb=4) 
   
  # view list contents of geno_obj
  print(geno_obj)



  #--------------------------------
  #  Example 3
  #-------------------------------
  #
  #
  # Read in the genotype data contained in the vcf file geno.vcf
  #
  # The function system.file() gives the full file name (name + full path).
  complete.name <- system.file('extdata', 'geno.vcf', package='Eagle')
  # 
  # The full path and name of the file is
  print(complete.name)
  
  # The file contains 5 marker loci recorded on 3 individuals
  # Two of the loci contain multiple alleles and are removed. 
  # A summary of the file is printed once the file has been read.
  geno_obj <- ReadMarker(filename=complete.name, type="vcf", availmemGb=4) 
   
  # view list contents of geno_obj
  print(geno_obj)


[Package Eagle version 2.5 Index]