simulateSNPglm {scrime} | R Documentation |
Simulation of SNP data
Description
Simulates SNP data. Interactions of some of the simulated SNPs are then used to specify either a binary or a quantitative response by a logistic or linear regression model, respectively.
Usage
simulateSNPglm(n.obs = 1000, n.snp = 50, list.ia = NULL, list.snp = NULL,
beta0 = -0.5, beta = 1.5, maf = 0.25, sample.y = TRUE, p.cutoff = 0.5,
err.fun = NULL, rand = NA, ...)
Arguments
n.obs |
number of observations that should be generated. |
n.snp |
number of SNPs that should be generated. |
list.ia |
a list consisting of numeric vectors (or values) specifying the genotypes
of the SNPs that should be explanatory for the response. Each of these vectors
must be composed of some of the numbers -3, -2, -1, 1, 2, 3, where 1 denotes
the homozygous reference genotype, 2 the heterozygous genotype, and 3 the
homozygous variant genotype, and a minus before these numbers means that
the corresponding SNP should be not of this genotype. If, e.g., one of the
vectors is given by
For more details, see Details. Must be specified if |
list.snp |
a list consisting of numeric vectors specifying the SNPs that compose
the interactions used in the regression model. Each of these vectors must have the
same length as the corresponding vector in |
beta0 |
a numeric value specifying the intercept of the regression model. |
beta |
a non-negative numeric value or vector of the same length as |
maf |
either an integer, or a vector of length 2 or |
sample.y |
should the values of the response in the logistic regression model be randomly drawn
using the probabilities of the respective observations for being a case? If |
p.cutoff |
a probability, i.e.\ a numeric value between 0 and 1, naming the cutoff for an
observation to be called a case if |
err.fun |
a function for generating the error that is added to the linear model to determine
the value of the (quantitative) response. If |
rand |
a numeric value for setting the random number generator in a reproducible state. |
... |
further arguments of the function specified by |
Details
simulateSNPglm
first simulates a matrix consisting of n.obs
observations and n.snp
SNPs, where the minor allele frequencies of these SNPs are given by maf
.
Note that all SNPs are currently simulated independently of each other such that they are unlinked.
Afterwards, the response is determined by a regression model using the specifications of list.ia
,
list.snp
, beta0
and beta
. Depending on whether err.fun
is specified or not,
a linear or a logistic regression model is used, respectively, i.e.\ the response Y
is continuous or binary.
By default, a logistic regression model
logit(Prob(Y = 1
)) = beta0
+ beta[1]
* L_1
+ beta[2]
* L_2
+ ...
is fitted, since err.fun = NULL
.
If both list.ia
and list.snp
are NULL
, then interactions similar to the one
considered, e.g., in Nunkesser et al.\ (2007) or Schwender et al.\ (2007) are used, i.e.\
L_1
= (SNP6 != 1) & (SNP7 == 1)
\ \ \ and
L_2
= (SNP3 == 1) & (SNP9 == 1) & (SNP10 == 1)
,
by setting list.ia = list(c(-1, 1), c(1, 1, 1))
and
list.snp = list(c(6, 7), c(3, 9, 10))
.
Using the above model Prob(Y = 1
) is computed for each observation, and its value of the response is
determined either by a random draw from a Bernoulli distribution using this probability (if sample.y = TRUE
),
or by evaluating if Prob(Y = 1
) >
p.cutoff
(if sample.y = FALSE
).
If err.fun
is specified, then the linear model
Y
= beta0
+ beta[1]
* L_1
+ beta[2]
* L_2
+ ... + error
is used to determine the values of the response Y
, where the values for error are given
by the output of a call of err.fun
.
Value
An object of class simSNPglm
consisting of
x |
a matrix with |
y |
a vector of length |
beta0 |
the value of the intercept. |
beta |
the vector of parameters. |
ia |
a character vector naming the explanatory interactions. |
maf |
a vector of length |
prob |
a vector of length |
err |
a vector of length |
p.cutoff |
the value of |
err.call |
a character string naming the call of the error function (if |
Author(s)
Holger Schwender, holger.schwender@udo.edu
References
Nunkesser, R., Bernholt, T., Schwender, H., Ickstadt, K. and Wegener, I.\ (2007). Detecting High-Order Interactions of Single Nucleotide Polymorphisms Using Genetic Programming. Bioinformatics, 23, 3280-3288.
Schwender, H.\ (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.
See Also
simulateSNPs
, summary.simSNPglm
, simulateSNPcatResponse
Examples
## Not run:
# The simulated data set described in Details.
sim1 <- simulateSNPglm()
sim1
# A bit more information: Table of probabilities of being a case
# vs. numbers of cases and controls.
summary(sim1)
# Calling an observation a case if its probability of being
# a case is larger than 0.5 (the default for p.cutoff).
sim2 <- simulateSNPglm(sample.y = FALSE)
summary(sim2)
# If ((SNP4 != 2) & (SNP3 == 1)), (SNP5 ==3) and
# ((SNP12 !=1) & (SNP9 == 3)) should be the three interactions
# (or variables) that are explanatory for the response,
# list.ia and list.snp are specified as follows.
list.ia <- list(c(-2, 1), 3, c(-1,3))
list.snp <- list(c(4, 3), 5, c(12,9))
# The binary response and the data set consisting of
# 600 observations and 25 SNPs, where the minor allele
# frequency of each SNP is randomly drawn from a
# uniform distribution with minimum 0.1 and maximum 0.4,
# is then generated by
sim3 <- simulateSNPglm(n.obs = 600, n.snp = 25,
list.ia = list.ia, list.snp = list.snp, maf = c(0.1, 0.4))
sim3
summary(sim3)
# If the response should be quantitative, err.fun has
# to be specified. To use a normal distribution with mean 0
# (default in rnorm) and a standard deviation of 2
# as the distribution of the error, call
simulateSNPglm(err.fun = rnorm, sd = 2)
## End(Not run)