getCausalSNPs {PhenotypeSimulator}    R Documentation

Draw random SNPs from genotypes.

Description

Draw random SNPs from the genotypes provided or from external genotype files. When drawing from external genotype files, only the lines of the randomly chosen SNPs are read, which is recommended for large genotype files. See Details for more information. The external-file option currently supports files in simple delimited formats (with a specified delimiter and an optional number of fields to skip) and in the bimbam and oxgen formats.

Usage

getCausalSNPs(
  N,
  NrCausalSNPs = 20,
  genotypes = NULL,
  chr = NULL,
  NrSNPsOnChromosome = NULL,
  NrChrCausal = NULL,
  genoFilePrefix = NULL,
  genoFileSuffix = NULL,
  format = "delim",
  delimiter = ",",
  header = FALSE,
  skipFields = NULL,
  probabilities = FALSE,
  sampleID = "ID_",
  verbose = TRUE
)

Arguments

N

Number [integer] of samples to simulate.

NrCausalSNPs

Number [integer] of SNPs to choose at random.

genotypes

[NrSamples x totalNrSNPs] Matrix of genotypes [integer]/ [double].

chr

Vector of chromosome(s) [integer] to choose NrCausalSNPs from; only used when external genotype data is provided, i.e. !is.null(genoFilePrefix).

NrSNPsOnChromosome

Vector of number(s) of SNPs [integer] per entry in chr (see above); has to be the same length as chr. If not provided, the number of SNPs in each file is determined from its line count (which can be slow for large files). (Optional) header lines are ignored, so the accurate number of SNPs, not the number of lines in the file, should be specified.

NrChrCausal

Number [integer] of causal chromosomes to sample NrCausalSNPs from (as opposed to the actual chromosomes to choose from via chr); only used when external genotype data is provided, i.e. !is.null(genoFilePrefix).

genoFilePrefix

Full path/to/chromosome-wise-genotype-file-ending-before-"chrChromosomeNumber" (no '~' expansion!) [string].

genoFileSuffix

[string] Text following the chromosome number, including the file format extension (e.g. ".csv"). The file described by genoFilePrefix-genoFileSuffix has to be in a text format, i.e. comma/tab/space separated.

format

Name [string] of genotype file format. Options are: "oxgen", "bimbam" or "delim". See readStandardGenotypes for details.

delimiter

Field separator [string] of the genotype file or the genoFilePrefix-genoFileSuffix file if format == 'delim'.

header

[logical] Can be set to indicate if genoFilePrefix-genoFileSuffix file has a header for format == 'delim'. See details.

skipFields

Number [integer] of fields (columns) to skip in genoFilePrefix-genoFileSuffix file if format == 'delim'. See details.

probabilities

[boolean] If set to TRUE, the genotypes in the files described by genoFilePrefix-genoFileSuffix are provided as triplets of probabilities (p(AA), p(Aa), p(aa)) and are converted into their expected genotype frequencies by 0*p(AA) + 1*p(Aa) + 2*p(aa) via probGen2expGen.

sampleID

Prefix [string] for naming samples (will be followed by the sample number from 1 to N when constructing id_samples).

verbose

[boolean] If TRUE, progress info is printed to standard output.
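
The conversion described for the probabilities argument, and the naming scheme described for sampleID, can be illustrated in base R. This is a minimal sketch and not the implementation of probGen2expGen: the helper name expectedGenotype and the assumed input layout (one consecutive triplet per sample) are hypothetical.

```r
# Hypothetical helper (not part of PhenotypeSimulator): convert genotype
# probability triplets into expected genotypes.
# Assumption: `probs` is a numeric vector of consecutive triplets
# (p(AA), p(Aa), p(aa)), one triplet per sample.
expectedGenotype <- function(probs) {
  stopifnot(length(probs) %% 3 == 0)
  triplets <- matrix(probs, ncol = 3, byrow = TRUE)
  # 0 * p(AA) + 1 * p(Aa) + 2 * p(aa) for each row
  as.vector(triplets %*% c(0, 1, 2))
}

# Three samples: a certain AA call, a certain Aa call, an uncertain call
probs <- c(1,   0,   0,
           0,   1,   0,
           0.2, 0.5, 0.3)
expected <- expectedGenotype(probs)

# Sample names built as the sampleID argument describes:
# the prefix followed by the sample number
names(expected) <- paste0("ID_", seq_along(expected))
print(expected)
```

The matrix product keeps the weights 0, 1 and 2 visible in one place, which makes the correspondence to the formula above easy to check.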

Details

In order to choose SNPs from external genotype files without reading them into memory, genotypes for each chromosome need to be accessible as [SNPs x samples] in a separate file, containing "chrChromosomenumber" (e.g. chr22) in the file name (e.g. /path/to/dir/related_nopopstructure_chr22.csv). All genotype files need to be saved in the same directory. genoFilePrefix (/path/to/dir/related_nopopstructure_) and genoFileSuffix (.csv) specify the strings leading and following the "chrChromosomenumber".

If format == "delim", the first column in each file needs to be the SNP_ID; the first row can either contain sample IDs or the first row of genotypes (specified with header). Subsequent columns containing additional SNP information can be skipped by setting skipFields. If format == "oxgen" or "bimbam", files need to be in the oxgen or bimbam format (see readStandardGenotypes for details) and no additional information about delimiter, header or skipFields will be considered.

getCausalSNPs generates a vector of chromosomes from which to sample the SNPs. For each of the chromosomes, it counts the number of SNPs in the chromosome file and creates vectors of random numbers ranging from 1:NrSNPSinFile. Only the lines corresponding to these numbers are then read into R.

The example data provided for chromosome 22 contains genotypes (50 samples) of the first 500 SNPs on chromosome 22 with a minor allele frequency of greater than 2% from the 1000 Genomes project.
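
The file naming scheme and the line-sampling idea described above can be sketched in base R. This is an illustration under stated assumptions, not the code used inside getCausalSNPs: the toy file, the helper readSelectedLines and all variable values are hypothetical, and the file is assumed to be comma-separated without a header.

```r
# Sketch only: base R, no PhenotypeSimulator functions are used.
set.seed(1)

# Build the file name as prefix + "chr<number>" + suffix (toy file in tempdir)
tmpDir <- tempfile("geno")
dir.create(tmpDir)
genoFilePrefix <- file.path(tmpDir, "toy_")
genoFileSuffix <- ".csv"
chr <- 22
genoFile <- paste0(genoFilePrefix, "chr", chr, genoFileSuffix)

# Write a toy [SNPs x samples] file: SNP ID first, then one column per sample
nSNPs <- 20
nSamples <- 5
geno <- matrix(sample(0:2, nSNPs * nSamples, replace = TRUE), nrow = nSNPs)
toy <- data.frame(SNP_ID = paste0("rs", seq_len(nSNPs)), geno)
write.table(toy, genoFile, sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Count SNPs from the line count (subtract 1 if the file had a header)
NrSNPsInFile <- length(readLines(genoFile))

# Hypothetical helper: return only the requested lines, reading the file
# sequentially and discarding everything in between
readSelectedLines <- function(path, lineNumbers) {
  lineNumbers <- sort(unique(lineNumbers))
  con <- file(path, open = "r")
  on.exit(close(con))
  selected <- character(length(lineNumbers))
  current <- 0
  for (i in seq_along(lineNumbers)) {
    gap <- lineNumbers[i] - current - 1
    if (gap > 0) readLines(con, n = gap)  # skip the lines before the target
    selected[i] <- readLines(con, n = 1)
    current <- lineNumbers[i]
  }
  selected
}

# Draw random line numbers in 1:NrSNPsInFile and read only those lines
drawn <- sample(NrSNPsInFile, 3)
lines <- readSelectedLines(genoFile, drawn)
fields <- strsplit(lines, ",", fixed = TRUE)
snpIDs <- vapply(fields, `[`, character(1), 1)

# Drop the SNP ID field; sapply yields one column per SNP: [samples x SNPs]
causal <- sapply(fields, function(x) as.numeric(x[-1]))
colnames(causal) <- snpIDs
print(dim(causal))
```

Passing NrSNPsOnChromosome to getCausalSNPs avoids the counting step, which is the slow part for large files.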

Value

[N x NrCausalSNPs] Matrix of randomly drawn genotypes [integer]/ [double]
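
For genotypes supplied in memory, the shape of the return value can be illustrated by sampling columns of a matrix. This is a hedged sketch in base R with made-up dimensions and names; it does not reproduce the internals of getCausalSNPs.

```r
# Sketch only: illustrates the [N x NrCausalSNPs] shape with toy data.
set.seed(42)
N <- 10
totalNrSNPs <- 50
NrCausalSNPs <- 5

# Toy [N x totalNrSNPs] genotype matrix with entries 0, 1 or 2
genotypes <- matrix(sample(0:2, N * totalNrSNPs, replace = TRUE),
                    nrow = N,
                    dimnames = list(paste0("ID_", seq_len(N)),
                                    paste0("SNP_", seq_len(totalNrSNPs))))

# Draw column indices without replacement and keep all samples
causalIdx <- sample(totalNrSNPs, NrCausalSNPs)
causalSNPs <- genotypes[, causalIdx, drop = FALSE]
print(dim(causalSNPs))
```

Using drop = FALSE keeps the result a matrix even when a single SNP is drawn.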

See Also

standardiseGenotypes

Examples

# Get causal SNPs from genotypes simulated within PhenotypeSimulator
geno <- simulateGenotypes(N=10, NrSNP=10)
causalSNPsFromSimulatedGenoStandardised <- getCausalSNPs(N=10,
                                                         NrCausalSNPs=10,
                                                         genotypes=geno$genotypes)

# Get causal SNPs by sampling lines from large SNP files
genotypeFile <- system.file("extdata/genotypes/",
                            "genotypes_chr22.csv",
                            package = "PhenotypeSimulator")
genoFilePrefix <- gsub("chr.*", "", genotypeFile)
genoFileSuffix <- ".csv"
causalSNPsFromLines <- getCausalSNPs(N=50, NrCausalSNPs=10, chr=22,
                                     genoFilePrefix=genoFilePrefix,
                                     genoFileSuffix=genoFileSuffix)

[Package PhenotypeSimulator version 0.3.4 Index]