simData {optBiomarker} | R Documentation |
Simulation of microarray data
Description
The function simulates microarray data for a two-group comparison with user-supplied parameters such as the number of biomarkers (genes or proteins), sample size, biological and experimental (technical) variation, replication, differential expression, and correlation between biomarkers.
Usage
simData(nTrain=100,
        nGr1=floor(nTrain/2),
        nBiom=50, nRep=3,
        sdW=1.0,
        sdB=1.0,
        rhoMax=NULL, rhoMin=NULL, nBlock=NULL, bsMin=3, bSizes=NULL, gamma=NULL,
        sigma=0.1, diffExpr=TRUE,
        foldMin=2,
        orderBiom=TRUE,
        baseExpr=NULL)
Arguments
nTrain |
Training set size, i.e., the total number of biological
samples in group 1 ( |
nGr1 |
Size of group 1. Defaults to floor(nTrain/2). |
nBiom |
Number of biomarkers (genes, probes or proteins). |
nRep |
Number of technical replications. |
sdW |
Experimental (technical) variation ( |
sdB |
Biological variation ( |
rhoMax |
Maximum Pearson's correlation coefficient between
biomarkers. To ensure positive definiteness, allowed values are
restricted between 0 and 0.95 inclusive. If |
rhoMin |
Minimum Pearson's correlation coefficient between
biomarkers. To ensure positive definiteness, allowed values are
restricted between 0 and 0.95 inclusive. If |
nBlock |
Number of blocks in the block diagonal (Hub-Toeplitz)
correlation matrix. If |
bsMin |
Minimum block size. |
bSizes |
A vector of length |
gamma |
Specifies a correlation structure. If |
sigma |
Standard deviation of the normal distribution (before truncation) from which fold changes are generated. See Details. |
diffExpr |
Logical. Should a systematic difference be introduced between the data of the two groups? |
foldMin |
Minimum value of fold changes. See details. |
orderBiom |
Logical. Should columns (biomarkers) be arranged in order of differential expression? |
baseExpr |
A vector of length |
Details
Differential expression is introduced by adding z\delta to the data of group 2, where the \delta values are generated from a truncated normal distribution and z is randomly selected from \{-1, 1\} to characterise up- or down-regulation.
Assuming that Y ~ N(\mu, \sigma^2) and that A = [a_1, a_2] is a subset of -Inf < y < Inf, the conditional distribution of Y given that Y lies in A is called the truncated normal distribution:
f(y, \mu, \sigma) = (1/\sigma) \phi((y-\mu)/\sigma) / (\Phi((a_2-\mu)/\sigma) - \Phi((a_1-\mu)/\sigma))
for a_1 <= y <= a_2, and 0 otherwise, where \mu is the mean of the original normal distribution before truncation, \sigma is the corresponding standard deviation, a_2 is the upper truncation point, a_1 is the lower truncation point, \phi(x) is the density of the standard normal distribution, and \Phi(x) is the distribution function of the standard normal distribution. For the simData function, we consider a_1 = log_2(foldMin) and a_2 = Inf. This ensures that the biomarkers are differentially expressed by a fold change of foldMin or more.
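The sampling scheme described above can be illustrated with a short base R sketch. It is not the package's internal code: it draws \delta by inverse-CDF sampling from a normal distribution truncated to [log_2(foldMin), Inf) and attaches a random sign z. The choice of \mu (set here to the lower truncation point) is an assumption made only for illustration, as are the variable names pLow, u and shift.

```r
# Illustrative sketch only; not taken from the optBiomarker source.
set.seed(1)                 # reproducible illustration

nBiom   <- 5                # number of biomarkers
foldMin <- 2                # minimum fold change
sigma   <- 0.1              # SD of the normal distribution before truncation
a1      <- log2(foldMin)    # lower truncation point a_1; upper point a_2 = Inf
mu      <- a1               # ASSUMPTION: mean of the untruncated normal

# Inverse-CDF sampling from N(mu, sigma^2) truncated to [a1, Inf):
# draw u uniformly between Phi((a1 - mu) / sigma) and 1, then map back with qnorm.
pLow  <- pnorm((a1 - mu) / sigma)
u     <- runif(nBiom, min = pLow, max = 1)
delta <- mu + sigma * qnorm(u)

# Random sign z in {-1, 1} to represent up- or down-regulation
z <- sample(c(-1, 1), nBiom, replace = TRUE)

# Quantity z * delta that would be added to the group 2 data
shift <- z * delta
```

Because every \delta is at least log_2(foldMin), each entry of shift has absolute value of at least log_2(foldMin), which corresponds to a fold change of foldMin or more in either direction.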
Value
A data frame of dimension nTrain by nBiom+1. The first column is a factor (class) representing the group memberships of the samples.
Author(s)
Mizanur Khondoker, Till Bachmann, Peter Ghazal
Maintainer: Mizanur Khondoker mizanur.khondoker@gmail.com.
References
Khondoker, M. R., Bachmann, T. T., Mewissen, M., Dickinson, P. et al. (2010). Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules. Journal of Bioinformatics and Computational Biology, 8, 945-965.
Examples
simData(nTrain=10,nBiom=3)
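A slightly longer variant of the call above may help when checking the returned object against the Value section. This is a sketch that assumes the optBiomarker package is installed (the block is skipped otherwise); the object name dat and the seed are arbitrary choices, not part of the package.

```r
# Assumes optBiomarker is installed; does nothing if it is not available.
if (requireNamespace("optBiomarker", quietly = TRUE)) {
  set.seed(123)  # arbitrary seed so the simulated data can be reproduced
  dat <- optBiomarker::simData(nTrain = 10, nBiom = 3)

  # Per the Value section: nTrain rows and nBiom + 1 columns
  print(dim(dat))

  # Per the Value section: the first column is a factor of group memberships
  print(is.factor(dat[[1]]))

  # Group sizes; nGr1 defaults to floor(nTrain/2)
  print(table(dat[[1]]))
}
```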