gen.data {BioMark} | R Documentation |
Simulate data sets
Description
The functions gen.data and gen.data2 generate one or more two-class data
matrices where the first nbiom variables are changed in the treatment class.
The aim is to provide an easy means to evaluate the performance of biomarker
identification methods. Function gen.data samples from a multivariate normal
distribution; gen.data2 generates spiked data either by adding differences to
the first columns, or by multiplying with factors given by the user. Note
that whereas gen.data will provide completely new simulated data, both for
the control and treatment classes, gen.data2 essentially only changes the
biomarker part of the treated class.
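The sampling scheme described for gen.data can be illustrated with a minimal base-R sketch. This is not the BioMark implementation: the helper name simulate_two_class is hypothetical, and the sketch assumes that the group difference is applied as a constant shift to the first nbiom variables of the treated class.

```r
# Hypothetical helper, not part of BioMark: draws one two-class data set
# from a multivariate normal distribution and shifts the first `nbiom`
# variables in the treated class by `group.diff` (an assumed mechanism).
simulate_two_class <- function(ncontrol, ntreated = ncontrol, nvar,
                               nbiom = 5, group.diff = 0.5,
                               means = rep(0, nvar), cormat = diag(nvar)) {
  n <- ncontrol + ntreated
  # Standard normal draws, given the requested correlation structure
  # through the Cholesky factor of cormat
  Z <- matrix(rnorm(n * nvar), nrow = n, ncol = nvar)
  X <- Z %*% chol(cormat)
  # Add the variable means to every row
  X <- sweep(X, 2, means, "+")
  # Shift the biomarker columns in the treated rows only
  treated <- seq(ncontrol + 1, n)
  biom <- seq_len(nbiom)
  X[treated, biom] <- X[treated, biom] + group.diff
  list(X = X,
       Y = factor(rep(c("control", "treated"), c(ncontrol, ntreated))),
       n.biomarkers = nbiom)
}

set.seed(1)
sim <- simulate_two_class(ncontrol = 10, nvar = 50, nbiom = 5, group.diff = 2)
dim(sim$X)
table(sim$Y)
```

Unlike gen.data, the sketch returns a single matrix and does not loop over nsimul data sets.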
Usage
gen.data(ncontrol, ntreated = ncontrol, nvar, nbiom = 5, group.diff = 0.5,
         nsimul = 100, means = rep(0, nvar), cormat = diag(nvar))
gen.data2(X, ncontrol, nbiom, spikeI,
          type = c("multiplicative", "additive"),
          nsimul = 100, stddev = .05)
Arguments
ncontrol, ntreated |
Numbers of objects in the two classes. If only
ncontrol is given, the two classes are assumed to be of equal size,
or, in the case of |
nvar |
Number of variables. |
nbiom |
Number of biomarkers, i.e. the number of variables to be changed in the treatment class compared to the control class. The variables that are changed are always the first variables in the data matrix. |
group.diff |
Group difference: the average difference between the values of the biomarkers in the two classes. |
nsimul |
Number of data sets to simulate. |
means |
Mean values of all variables, a vector. |
cormat |
Correlation matrix to be used in the simulation. Default is the identity matrix. |
X |
Experimental data matrix, without group differences. |
spikeI |
A vector of at least three different numbers, used to generate new values for the biomarker variables in the treated class. |
type |
Whether to use multiplication (useful when simulating cases where things like "twofold differences" are relevant), or addition (in the case of absolute differences in the treatment and control groups). |
stddev |
Additional noise: in every simulation, normally
distributed noise with a standard deviation of
|
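As an illustration of the means and cormat arguments, the following sketch builds a correlation matrix in which neighbouring variables are correlated, checks that it is a valid correlation matrix, and passes it to gen.data using the signature shown under Usage. The AR(1)-style structure, the value of rho, and the choice of means are arbitrary assumptions made for this example only; the call to gen.data is skipped when BioMark is not installed.

```r
nvar <- 50
# AR(1)-style correlation (an arbitrary choice for illustration):
# variables i and j correlate at rho^|i - j|
rho <- 0.6
cormat <- rho^abs(outer(seq_len(nvar), seq_len(nvar), "-"))

# Sanity checks: symmetric, unit diagonal, positive definite
stopifnot(isSymmetric(cormat),
          all(diag(cormat) == 1),
          min(eigen(cormat, symmetric = TRUE,
                    only.values = TRUE)$values) > 0)

# One mean per variable: here the first five variables are centred at 10
means <- rep(c(10, 0), c(5, nvar - 5))

# Only run the simulation when the BioMark package is available
if (requireNamespace("BioMark", quietly = TRUE)) {
  sim <- BioMark::gen.data(10, nvar = nvar, nsimul = 2,
                           means = means, cormat = cormat)
}
```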
Details
The spikeI argument in function gen.data2 provides the numbers that will be
used to artificially "spike" the biomarker variables, either by
multiplication (the default) or by addition. To obtain approximate two-fold
differences, for example, one could use spikeI = c(1.8, 2.0, 2.2). At least
three different values should be given since in most cases more than one set
will be simulated and we require different values in the biomarker
variables.
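The two spiking modes can be sketched in base R as follows. This is an illustration rather than the code used by gen.data2: the helper spike_columns is hypothetical, the way values are drawn from spikeI (one value per biomarker, sampled with replacement) is an assumption, and the additional noise controlled by stddev is left out.

```r
# Hypothetical helper, not part of BioMark: spikes the first `nbiom`
# columns of the treated rows of X with values drawn from `spikeI`.
spike_columns <- function(X, ncontrol, nbiom, spikeI,
                          type = c("multiplicative", "additive")) {
  type <- match.arg(type)
  treated <- seq(ncontrol + 1, nrow(X))
  biom <- seq_len(nbiom)
  # Assumption: one spike value per biomarker, sampled with replacement
  spikes <- sample(spikeI, nbiom, replace = TRUE)
  block <- X[treated, biom, drop = FALSE]
  X[treated, biom] <- switch(type,
    multiplicative = sweep(block, 2, spikes, "*"),
    additive       = sweep(block, 2, spikes, "+"))
  X
}

set.seed(7)
X0 <- matrix(rlnorm(20 * 30), nrow = 20)  # stand-in for experimental data
X1 <- spike_columns(X0, ncontrol = 10, nbiom = 3,
                    spikeI = c(1.8, 2.0, 2.2))
# Ratio of spiked to original values in the treated biomarker block
range(X1[11:20, 1:3] / X0[11:20, 1:3])
```

Control rows and non-biomarker columns are returned unchanged, which mirrors the statement that gen.data2 essentially only changes the biomarker part of the treated class.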
Value
A list with the following elements:
X |
An array of dimension |
Y |
The class vector. |
n.biomarkers |
The number of biomarkers. |
Note that the biomarkers are always in the first nbiom columns of the data
matrix.
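Because the true biomarkers occupy the first columns, a selection method can be scored by counting how many of its top-ranked variables have an index of at most n.biomarkers. The sketch below uses a stand-in list with the same element names rather than a real gen.data result; the array layout (objects x variables x simulations) is inferred from the indexing simdat$X[,,1] in the Examples, and ranking by absolute t statistic is only a placeholder for whichever method is being evaluated.

```r
# Stand-in for the list returned by gen.data()/gen.data2(); the layout
# objects x variables x simulations is an assumption based on the Examples.
set.seed(42)
nobj <- 20; nvar <- 100; nbiom <- 5; nsimul <- 3
sim <- list(X = array(rnorm(nobj * nvar * nsimul), c(nobj, nvar, nsimul)),
            Y = factor(rep(c("control", "treated"), each = nobj / 2)),
            n.biomarkers = nbiom)
# Make the first nbiom variables differ in the treated class
is.treated <- sim$Y == "treated"
sim$X[is.treated, seq_len(nbiom), ] <-
  sim$X[is.treated, seq_len(nbiom), ] + 3

# For each simulated set, rank the variables by absolute t statistic and
# count how many of the top n.biomarkers are true biomarkers
y <- sim$Y
hits <- sapply(seq_len(dim(sim$X)[3]), function(i) {
  Xi <- sim$X[, , i]
  tstat <- apply(Xi, 2, function(v) t.test(v ~ y)$statistic)
  top <- order(abs(tstat), decreasing = TRUE)[seq_len(sim$n.biomarkers)]
  sum(top <= sim$n.biomarkers)
})
hits
```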
Author(s)
Ron Wehrens
Examples
## Not run:
X <- gen.data(10, nvar = 200)
names(X)
dim(X$X)
set.seed(7)
simdat <- gen.data(10, nvar = 1200, nbiom = 22, nsimul = 1,
                   group.diff = 2)
simdat.stab <- get.biom(simdat$X[,,1], simdat$Y, fmethod = "all",
                        type = "stab", ncomp = 3, scale.p = "auto")
## show LASSO success
traceplot(simdat.stab, lty = 1, col = rep(2:1, c(22, 1610)))
data(SpikePos)
real.markers <- which(SpikePos$annotation$found.in.standards > 0)
X.no.diff <- SpikePos$data[1:20, -real.markers]
set.seed(7)
simdat2 <- gen.data2(X.no.diff, ncontrol = 10, nbiom = 22,
                     spikeI = c(1.2, 1.4, 2), nsimul = 1)
## use the class vector that belongs to simdat2 itself
simdat2.stab <- get.biom(simdat2$X[,,1], simdat2$Y,
                         fmethod = "all", type = "stab", ncomp = 3,
                         scale.p = "auto")
## show LASSO success
traceplot(simdat2.stab, lty = 1, col = rep(2:1, c(22, 1610)))
## End(Not run)