simDat.eQTL.scRNAseq {powerEQTL} | R Documentation |
Generate Gene Expression Levels Of One Gene And Genotypes Of One SNP For Subjects With Multiple Cells Based On ZINB Mixed Effects Regression Model
Description
Generate gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells based on a ZINB mixed effects regression model.
Usage
simDat.eQTL.scRNAseq(nSubj = 50,
nCellPerSubj = 100,
zero.p = 0.01,
m.int = 0,
sigma.int = 1,
slope = 1,
theta = 1,
MAF = 0.45)
Arguments
nSubj |
integer. Total number of subjects. |
nCellPerSubj |
integer. Number of cells per subject. |
zero.p |
numeric. Probability that an excess zero occurs. |
m.int |
numeric. Mean of random intercept (see details). |
sigma.int |
numeric. Standard deviation of random intercept (see details). |
slope |
numeric. Slope (see details). |
theta |
numeric. Dispersion parameter of the negative binomial distribution.
The smaller theta is, the larger the variance of the NB distribution is. |
MAF |
numeric. Minor allele frequency of the SNP. |
Details
This function simulates gene expression levels of one gene and genotypes of one SNP for subjects with multiple cells based on a zero-inflated negative binomial (ZINB) regression model with only one covariate: genotype. That is, the read counts of a gene follow a two-component mixture distribution. One component takes only one value: zero. The other component is a negative binomial distribution, which takes non-negative integer values 0, 1, 2, .... The log mean of the negative binomial distribution is a linear function of the genotype.
Denote $Y_{ij}$ as the read counts for the $j$-th cell of the $i$-th subject, $i = 1, \ldots, n$, $j = 1, \ldots, m$, where $n$ is the number of subjects and $m$ is the number of cells per subject. Denote $p$ as the probability that $Y_{ij} = 0$ is an excess zero. With probability $1 - p$, $Y_{ij}$ follows a negative binomial distribution $NB(\mu, \theta)$, where $\mu$ is the mean (i.e., $\mu = E(Y_{ij})$) and $\theta$ is the dispersion parameter. The variance of the NB distribution is $\mu + \mu^2/\theta$.
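The two-component mixture described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the helper name `rzinb_sketch` is hypothetical, and the sketch assumes that `rnbinom()` called with `size = theta` and `mu = mu` corresponds to the NB parameterization used here (variance mu + mu^2/theta).

```r
# Hypothetical helper (not part of powerEQTL): draw n counts from a
# zero-inflated negative binomial with excess-zero probability p.
rzinb_sketch <- function(n, mu, theta, p) {
  # Negative binomial component: mean mu, variance mu + mu^2 / theta
  nb <- rnbinom(n, size = theta, mu = mu)
  # Excess-zero indicator: TRUE with probability p
  excess_zero <- rbinom(n, size = 1, prob = p) == 1
  # Excess zeros overwrite the NB draw; all other cells keep it
  ifelse(excess_zero, 0L, nb)
}

set.seed(1)
y <- rzinb_sketch(n = 1e5, mu = 2, theta = 1, p = 0.01)

# With p close to 0, the sample moments should be near the NB values:
# mean near mu = 2 and variance near mu + mu^2 / theta = 6
mean(y)
var(y)
# Proportion of zeros (NB zeros plus excess zeros)
mean(y == 0)
```

Because a fraction `p` of the draws is forced to zero, the mean of the mixture is `(1 - p) * mu`, slightly below the mean of the NB component alone.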
The relationship between gene expression and genotype for the $i$-th subject is characterized by the equation
$$\mu_i = \exp(\beta_{0i} + \beta_1 x_i),$$
where $\beta_{0i}$ is the random intercept following a normal distribution $N(\beta_0, \sigma^2)$ to account for within-subject correlation of gene expression, $\beta_0$ is the mean of the random intercept, $\sigma$ is the standard deviation of the random intercept, $\beta_1$ is the slope, and $x_i$ is the additive-coded genotype for the SNP with minor allele frequency $MAF$.
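The random-intercept model can be illustrated with a short base R sketch. It is a simplified reading of the description above, not the function's actual implementation: the variable names are hypothetical, the genotypes are uniform placeholders rather than Hardy-Weinberg draws, and excess zeros are left out so that only the subject-level mean structure is shown.

```r
set.seed(2)
n_subj <- 50    # number of subjects (nSubj)
n_cell <- 100   # cells per subject (nCellPerSubj)
beta0  <- 0     # mean of the random intercept (m.int)
sigma  <- 1     # SD of the random intercept (sigma.int)
beta1  <- 1     # slope
theta  <- 1     # NB dispersion parameter

# Placeholder genotypes for illustration only (0, 1, or 2 minor alleles)
x <- sample(0:2, n_subj, replace = TRUE)

# One random intercept per subject, shared by all of that subject's cells
beta0_i <- rnorm(n_subj, mean = beta0, sd = sigma)

# Subject-level NB mean: the log mean is linear in the genotype
mu_i <- exp(beta0_i + beta1 * x)

# Cell-level counts: every cell of subject i is drawn with the same mean mu_i
id     <- rep(seq_len(n_subj), each = n_cell)
counts <- rnbinom(n_subj * n_cell, size = theta, mu = rep(mu_i, each = n_cell))

# Observed per-subject means should track the model means mu_i
obs_mean <- as.numeric(tapply(counts, id, mean))
head(cbind(geno = x, mu_i = mu_i, obs_mean = obs_mean))
```

Sharing one intercept across all cells of a subject is what induces the within-subject correlation: cells from the same subject have the same mean, while different subjects have different means even when they carry the same genotype.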
We assume that the SNP satisfies the Hardy-Weinberg Equilibrium. That is, the probabilities of the 3 genotypes $(0, 1, 2)$ are $(1 - MAF)^2$, $2\,MAF\,(1 - MAF)$, and $MAF^2$, respectively.
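As a quick check of the Hardy-Weinberg assumption, genotypes can be drawn in base R as follows. This is a sketch with hypothetical variable names, and it assumes that the additive coding counts copies of the minor allele.

```r
set.seed(3)
maf    <- 0.45
n_draw <- 10000   # hypothetical number of subjects, for illustration

# Genotype probabilities for 0, 1, 2 copies of the minor allele under HWE
p_geno <- c((1 - maf)^2, 2 * maf * (1 - maf), maf^2)

# Draw additive-coded genotypes directly from these probabilities
x <- sample(0:2, size = n_draw, replace = TRUE, prob = p_geno)

# Equivalent formulation: the minor-allele count is Binomial(2, maf)
x_binom <- rbinom(n_draw, size = 2, prob = maf)

# Both empirical distributions should be close to p_geno
freq <- function(g) as.numeric(table(factor(g, levels = 0:2))) / length(g)
rbind(expected = p_geno, sampled = freq(x), binomial = freq(x_binom))
```

The two formulations agree because, under Hardy-Weinberg Equilibrium, the two alleles of a subject are independent draws that are each the minor allele with probability MAF.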
For simplicity, we assume that excess zeros are caused by technical issues, hence are not related to genotypes.
Value
A data frame with 3 columns:
id |
subject id |
geno |
additive-coded genotype of the SNP |
counts |
gene expression of the gene |
Author(s)
Xianjun Dong <XDONG@rics.bwh.harvard.edu>, Xiaoqi Li <xli85@bwh.harvard.edu>, Tzuu-Wang Chang <Chang.Tzuu-Wang@mgh.harvard.edu>, Scott T. Weiss <restw@channing.harvard.edu>, Weiliang Qiu <weiliang.qiu@gmail.com>
References
Dong X, Li X, Chang T-W, Scherzer CR, Weiss ST, and Qiu W. powerEQTL: An R package and shiny application for sample size and power calculation of bulk tissue and single-cell eQTL analysis. Bioinformatics, 2021; btab385.
Examples
frame = simDat.eQTL.scRNAseq(nSubj = 5,
nCellPerSubj = 3,
zero.p = 0.01,
m.int = 0,
sigma.int = 1,
slope = 1,
theta = 1,
MAF = 0.45)
print(dim(frame))
print(frame[1:10,])