create_phenotypes {simplePHENOTYPES}R Documentation

Simulation of single/multiple traits under different models and genetic architectures.

Description

Simulation of single/multiple traits under different models and genetic architectures.

Usage

create_phenotypes(
  geno_obj = NULL,
  geno_file = NULL,
  geno_path = NULL,
  QTN_list = list(add = list(NULL), dom = list(NULL), epi = list(NULL)),
  prefix = NULL,
  rep = NULL,
  ntraits = 1,
  h2 = NULL,
  mean = NULL,
  model = NULL,
  architecture = "pleiotropic",
  add_QTN_num = NULL,
  dom_QTN_num = NULL,
  epi_QTN_num = NULL,
  epi_type = NULL,
  epi_interaction = 2,
  pleio_a = NULL,
  pleio_d = NULL,
  pleio_e = NULL,
  trait_spec_a_QTN_num = NULL,
  trait_spec_d_QTN_num = NULL,
  trait_spec_e_QTN_num = NULL,
  add_effect = NULL,
  big_add_QTN_effect = NULL,
  same_add_dom_QTN = FALSE,
  dom_effect = NULL,
  degree_of_dom = 1,
  epi_effect = NULL,
  type_of_ld = "indirect",
  ld_min = 0.2,
  ld_max = 0.8,
  ld_method = "composite",
  sim_method = "geometric",
  vary_QTN = FALSE,
  cor = NULL,
  cor_res = NULL,
  QTN_variance = FALSE,
  seed = NULL,
  home_dir = NULL,
  output_dir = NULL,
  export_gt = FALSE,
  output_format = "long",
  to_r = FALSE,
  out_geno = NULL,
  chr_prefix = "chr",
  remove_QTN = FALSE,
  warning_file_saver = TRUE,
  constraints = list(maf_above = NULL, maf_below = NULL, hets = NULL),
  maf_cutoff = NULL,
  nrows = Inf,
  na_string = "NA",
  SNP_effect = "Add",
  SNP_impute = "Middle",
  quiet = FALSE,
  verbose = TRUE,
  RNGversion = "3.5.1"
)

Arguments

geno_obj

Marker data set loaded as an R object. Currently either HapMap or numericalized files (code as aa = -1, Aa = 0 and AA = 1, e.g. 'data("SNP55K_maize282_maf04")') are accepted. These and other file formats (VCF, GDS, and Plink Bed/Ped files) may be read from file with 'geno_file' or 'geno_path'. Only one of 'geno_obj', 'geno_file' or 'geno_path' should be provided.

geno_file

Name of a marker data set to be read from a file. If in a different folder, the whole path should be provided. Formats accepted are Numeric, HapMap, VCF, GDS, and Plink Bed/Ped files. Notice that the major allele will always be 1 and the minor allele -1. Thus, when using Plink Bed files, the dosage information will be converted to the opposite value.

geno_path

Path to a folder containing the marker data set file/files (e.g., separated by chromosome). Formats accepted are: Numeric, HapMap, VCF, GDS, and Plink Bed/Ped files

QTN_list

A list of specific markers to be used as QTNs. If one wants to specify the QTNs instead of selecting them randomly, at least one of the following elements should be provided: 'QTN_list$add', 'QTN_list$dom', and/or 'QTN_list$epi'. The element '$add', '$dom', and '$epi' are lists containing a vector of markers for each of the traits to be simulated. For example, to simulate 2 traits controlled by 1 pleiotropic and 2 trait-specific additive QTNs, the user would create a list of marker names 'marker_list <- list(add = list(trait1 = c("marker1", "marker2", "marker3"), trait2 = c("marker1", "marker4", "marker5")))' and set 'QTN_list = marker_list'. On the other hand, to simulate a single trait controlled by 1 additive and 2 dominance QTNs, the marker list would be 'marker_list <- list(add = list("marker1"), dom = list(c("marker2", "marker3")))'. Notice that these vectors with maker names is used in the order they appear. For instance, in the list 'marker_list <- list(add = list(trait9 = c("marker1"), trait4 = c("marker5")))', the vector names itself ("trait9" and "trait4") are ignored and "trait9" will be the vector of markers used to simulate the first trait and "trait4" will be the vector of markers used to simulate the second trait. Also, when using 'QTN_list', many parameters used for selecting QTNs will be ignored (e.g., 'constraints').

prefix

If 'geno_path' points to a folder with files other than the marker data set, a part of the data set name may be used to select the desired files (e.g., prefix = "Chr" would read files Chr1.hmp.txt, ..., Chr10.hmp.txt but not HapMap.hmp.txt).

rep

The number of experiments (replicates of a trait with the same genetic architecture) to be simulated.

ntraits

The number of multi-trait phenotypes to simulate under pleiotropic, partially [pleiotropic], and LD (spurious pleiotropy) architectures (see 'architecture'). If not assigned, a single trait will be simulated. Currently, the only option for the LD architecture is 'ntraits = 2'.

h2

The heritability for each traits being simulated. It could be either a vector with length equals to 'ntraits', or a matrix with ncol equals to 'ntraits'. If the later is used, the simulation will loop over the number of rows and will generate a result for each row. If a single trait is being simulated and h2 is a vector, one simulation of each heritability value will be conducted. Either none or all traits are expected to have 'h2 = 0'.

mean

A vector with the mean (intercept) value for each of the simulated traits. If omitted, the simulated traits will be centered to zero.

model

The genetic model to be assumed. The options are "A" (additive), "D" (dominance), "E" (epistatic) as well as any combination of those models such as "AE", "DE" or "ADE".

architecture

The genetic architecture to be simulated. Should be provided if ‘ntraits' > 1. Possible options are: ’pleiotropic' (default), for traits being controlled by the same QTNs; 'partially', for traits being controlled by pleiotropic and trait-specific QTNs; 'LD', for traits being exclusively controlled by different QTNs in "direct" or "indirect" (See 'type_of_ld', 'ld_min', and 'ld_max' below) linkage disequilibrium. Currently the only option for 'architecture = "LD"' is 'ntraits = 2'.

add_QTN_num

The number of additive quantitative trait nucleotides (QTNs) to be simulated.

dom_QTN_num

The number of dominance QTNs to be simulated.

epi_QTN_num

The number of epistatic (Currently, only additive x additive epistasis are simulated) QTNs to be simulated.

epi_type

to be implemented...

epi_interaction

Number of markers that compose an epistatic QTN. If 'epi_interaction = 2' (default), a 2-way interaction (marker1 x marker2) will be used to simulate epistatic QTNs. If 'epi_interaction = 3' a 3-way interaction (marker1 x marker2 x marker3) will be used instead.

pleio_a

The number of pleiotropic additive QTNs to be used if 'architecture = "partially"'. When 'sim_method = custom' (see below), the first effects will be assigned to the pleiotropic QTNs and the last to the trait-specific ones. For instance, in a scenario where ntraits = 2, pleio_a = 2, trait_spec_a_QTN_num = 1, and add_effect = list( trait1 = c(0.1, 0.2, 0.3), trait2 = c(0.4, 0.5, 0.6)), the trait-specific QTNs for trait 1 and trait 2 will be 0.3 and 0.6, respectively. The first two allelic effects will be assigned to the pleiotropic QTNs.

pleio_d

The number of pleiotropic dominance QTNs to be used if 'architecture = "partially"' (See pleio_a for details).

pleio_e

The number of pleiotropic epistatic QTNs to be used if 'architecture = "partially"' (See pleio_a for details).

trait_spec_a_QTN_num

The number of trait-specific additive QTNs if 'architecture = "partially"'. It should be a vector of length equals to 'ntraits'.

trait_spec_d_QTN_num

The number of trait-specific dominance QTNs if 'architecture = "partially"'. It should be a vector of length equals to 'ntraits'.

trait_spec_e_QTN_num

The number of trait-specific epistatic QTNs if 'architecture = "partially"'. It should be a vector of length equals to 'ntraits'.

add_effect

Additive effect size to be simulated. It may be either a vector (assuming 'ntraits' = 1 or one allelic effect per trait to create a geometric series ['sim_method = "geometric"']) or a list of length = 'ntraits', i.e., if 'ntraits' > 1, a list with one vector of additive effects should be provided for each trait. Unless 'big_add_QTN_effect' is provided, the length of each vector should be equal to the number of additive QTNs being simulated.

big_add_QTN_effect

Additive effect size for one possible major effect quantitative trait nucleotide. If 'ntraits' > 1, big_add_QTN_effect should have length equals 'ntraits'. If 'add_QTN_num' > 1, this large effect will be assigned to the fist QTN.

same_add_dom_QTN

A boolean for selecting markers to be both additive and dominance QTNs.

dom_effect

Similar to the 'add_effect', it could be either a vector or a list. Optional if 'same_add_dom_QTN = TRUE'.

degree_of_dom

If the same set of QTNs are being used for simulating additive and dominance effects, the dominance allelic effect could be a proportion of the additive allelic effect. In other words, 'degree_of_dom' equals to 0.5, 1, 1.5 will simulate, partial dominance, complete dominance and overdominance, respectively.

epi_effect

Epistatic (additive x additive) effect size to be simulated. Similar to the 'add_effect', it could be either a vector or a list.

type_of_ld

Type of LD used to simulate spurious pleiotropy. If "indirect" (default), an intermediate marker is selected from which two adjacent markers (one upstream and another downstream) will be chosen based on its LD with the intermediate marker to be the QTNs. Optionally, in the "direct" method, one marker is selected to be a QTN for trait 1, and a second marker is selected based on its LD with the first selected marker to be the QTN for trait 2.

ld_min

Minimum Linkage disequilibrium for selecting QTNs when 'architecture = LD'. The default is 'ld_min = 0.2' (markers should have a minimum LD of 0.2 to be used as QTNs).

ld_max

Maximum Linkage disequilibrium for selecting QTNs when 'architecture = LD'. The default is 'ld_max = 0.8' (markers should have an LD of at maximum 0.8 to be used as QTNs).

ld_method

Four methods can be used to calculate linkage disequilibrium values: "composite" for LD composite measure (Default), "r" for R coefficient (by EM algorithm assuming HWE, it could be negative), "dprime" for D', and "corr" for correlation coefficient (see snpgdsLDpair from package SNPRelate).

sim_method

Provide the method of simulating allelic effects. The options available are "geometric" and "custom". For multiple QTNs, a geometric series may be simulated, i.e., if add_effect = 0.5, the effect size of the first QTNs will be 0.5, the effect size of the second QTN will be 0.5^2, and the effect of the n^th QTN will be 0.5^n.

vary_QTN

A boolean that determines if the same set of quantitative trait nucleotide (QTN) should be used to generate genetic effects for each experiment ('vary_QTN = FALSE') or if a different set of QTNs should be used for each replication ('vary_QTN = TRUE').

cor

Option to simulate traits with a predefined genetic correlation. It should be a correlation matrix with a number of rows = 'ntraits'. Default = NULL. Notice that when opting for controlling the correlation, the genetic effects are transformed using Cholesky decomposition. In this case, the correlation of genetic effects for different traits will be as provided, but due to the transformation, the actual allelic effects of correlated traits may be different than the input allelic effect.

cor_res

Option to simulate traits with a predefined residual correlation. It should be a correlation matrix with number of rows = 'ntraits'. If NULL, an identity matrix (independent residuals) will be used.

QTN_variance

Whether or not the percentage of the phenotypic variance explained by each QTN (QTN variance / phenotypic variance) should be exported. The default is FALSE. Notice that this is calculated prior to any transformation, such as the whitening/coloring transformation used to assign user-specified correlation to the genetic effect. In may not reflect the actual variance explained when the data is transformed.

seed

Value to be used by set.seed. If NULL (default), runif(1, 0, 1000000) will be used. Notice that at each sampling step, a different seed generated based on the 'seed' parameter used. For example, if one uses 'seed = 123', when simulating the 10th replication of trait 1, the seed to be used is 'round( (123 * 10 * 10) * 1)'. On the other hand, for simulating the 21st replication of trait 2, the seed to be used will be 'round( (123 * 21 * 21) * 2)'. The master seed (unique value required to reproduce results) is saved at the top of the log file. Unless verbose = FALSE the actual seed used in every simulation is exported along with simulated phenotypes.

home_dir

Directory where files should be saved. It may be home_dir = getwd().

output_dir

Name to be used to create a folder inside 'home_dir' and save output files.

export_gt

If TRUE genotypes of selected QTNs will be saved at file. If FALSE (default), only the QTN information will be saved.

output_format

Four options are available for saving simulated phenotypes: 'multi-file', saves each simulation in a separate file; 'long' (default for multiple traits), appends each experiment (rep) to the last one (by row); 'wide', saves experiments by column (default for single trait) and 'gemma', saves .fam files to be used by gemma or other software that uses plink bed files. (renaming .fam file with the same name of the bim and bed files is necessary).

to_r

Option for outputting the simulated results as an R data.frame in addition to saving it to file. If TRUE, results need to be assigned to an R object (see vignette).

out_geno

Optionally saves the numericalized genotype either as "numeric" (see vignettes for an example data), "BED" or "gds". The default is NULL.

chr_prefix

If input file format is VCF and out_geno = "BED", and a prefix is used in the chromosomes names, chr_prefix may be used to avoid issues in converting to bed files (e.g., chr_prefix = "chr" in "chr01").

remove_QTN

Whether or not a copy of the genotypic file should be saved without the simulated QTNs. The default is FALSE. If 'vary_QTN = TRUE', the question "Are you sure that you want to save one genotypic file/rep (remove_QTN = TRUE and vary_QTN = TRUE) [type yes or no] ?" will pop up to avoid saving multiple large files unintentionally

warning_file_saver

Skips the interactive question and saves all files when 'remove_QTN = TRUE' and 'vary_QTN = TRUE'.

constraints

Set constraints for QTN selection. Currently, the options are maf_above (the minimum value of minor allele frequency, a double between 0 - 0.5), maf_below (the maximum value of minor allele frequency, a double between 0 - 0.5),and hets ('include' and 'remove'). All of these options are NULL by default ('list(maf_above = NULL, maf_below = NULL, hets = NULL )' ). For instance, if the parameters used are 'constraints = list(maf_above = 0.3, maf_below = 0.44, hets = "include")', only heterozygote markers with minor allele frequency between 0.3 and 0.44 will be selected to be QTNs. The option "remove" would only select homozygote markers to be QTNs.

maf_cutoff

Option for filtering the data set based on minor allele frequency (Not to be confounded with the constraints option which will only filter possible QTNs). It may be useful when outputting the genotypic data set.

nrows

Option for loading only part of a data set. Used when marker data is in numeric or HapMap format. Please see data.table::fread for details.

na_string

Tell create_phenotypes what character represents missing data (default is "NA"). Used when the input marker data is numeric or HapMap.

SNP_effect

Parameter used for numericalization. The options are: Add (AA = 1, Aa = 0, aa = -1), Dom (AA = -1, Aa = 0, aa = -1), Left (AA = 1, Aa = -1, aa = -1), Right (AA = 1, Aa = 1, aa = -1). The default option is Add.

SNP_impute

Naive imputation for HapMap numericalization. The options are: Major (NA <- 1), Middle (NA <- 0), and Minor (NA <- -1).

quiet

Whether or not the log file should pop up into R once the simulation is done.

verbose

If FALSE, suppress all prints and suppress individual seed numbers from being saved to file. The master seed (unique value required to reproduce results) is saved at the top of the log file.

RNGversion

Parameter to set the random number generator. Different R versions may be selected, the default value is '3.5.1'.

Value

Single or multi-trait phenotypes in one of many formats. Numericalized marker data set with or without the selected QTNs. Diagnostic files (log, QTN information, summary of LD between QTNs, proportion of phenotypic variance explained by each QTN).

Author(s)

Samuel B Fernandes and Alexander E Lipka Last update: Jan 19, 2021

References

Fernandes, S.B., and Lipka, A.E., 2020 simplePHENOTYPES: SIMulation of pleiotropic, linked and epistatic SIMulation of Pleiotropic, Linked and Epistatic PHENOTYPES. BMC Bioinformatics 21(1):491, doi: 10.1186/s12859-020-03804-y

Examples

# Simulate 50 replications of a single phenotype.
data("SNP55K_maize282_maf04")
pheno <- 
  create_phenotypes(
    geno_obj = SNP55K_maize282_maf04,
    add_QTN_num = 3,
    add_effect = 0.2,
    big_add_QTN_effect = 0.9,
    rep = 10,
    h2 = 0.7,
    model = "A",
    to_r = TRUE,
    home_dir = tempdir(),
    quiet = T
    )
# For more examples, please run the following:
# vignette("simplePHENOTYPES")


[Package simplePHENOTYPES version 1.3.0 Index]