gen.data {BioMark}

R Documentation

Simulate data sets

Description

The functions gen.data and gen.data2 generate one or more two-class data matrices where the first nbiom variables are changed in the treatment class. The aim is to provide an easy means to evaluate the performance of biomarker identification methods. Function gen.data samples from a multivariate normal distribution; gen.data2 generates spiked data either by adding differences to the first columns, or by multiplying with factors given by the user. Note that whereas gen.data will provide completely new simulated data, both for the control and treatment classes, gen.data2 essentially only changes the biomarker part of the treated class.
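The idea behind gen.data can be sketched in base R. This is a minimal illustration under stated assumptions, not the package source: the helper `rmvn` is hypothetical, the control rows are assumed to come first, and the group difference is assumed to be a constant shift in the means of the first nbiom variables.

```r
# Illustrative sketch only; not the BioMark source code.
set.seed(1)
ncontrol <- 10; ntreated <- 10
nvar <- 50; nbiom <- 5
group.diff <- 0.5
means <- rep(0, nvar)
cormat <- diag(nvar)

# Shift the means of the first nbiom variables for the treated class
means.treated <- means + rep(c(group.diff, 0), c(nbiom, nvar - nbiom))

# Hypothetical helper: draw n rows from N(mu, Sigma) via the Cholesky factor
rmvn <- function(n, mu, Sigma) {
  R <- chol(Sigma)                       # Sigma = t(R) %*% R
  Z <- matrix(rnorm(n * length(mu)), n)  # independent standard normals
  sweep(Z %*% R, 2, mu, "+")             # add the mean to every column
}

# Assumed layout: control rows first, treated rows after
X <- rbind(rmvn(ncontrol, means, cormat),
           rmvn(ntreated, means.treated, cormat))
Y <- factor(rep(c("control", "treated"), c(ncontrol, ntreated)))
dim(X)
table(Y)
```

Repeating the two sampling calls nsimul times and stacking the results would give a three-way array of the kind described under Value.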

Usage

gen.data(ncontrol, ntreated = ncontrol, nvar, nbiom = 5, group.diff = 0.5,
         nsimul = 100, means = rep(0, nvar), cormat = diag(nvar))
gen.data2(X, ncontrol, nbiom, spikeI,
          type = c("multiplicative", "additive"),
          nsimul = 100, stddev = .05)

Arguments

`ncontrol`, `ntreated`	Numbers of objects in the two classes. If only ncontrol is given, the two classes are assumed to be of equal size, or, in the case of `gen.data2`, the remainder of the samples are taken to be the treatment samples.
`nvar`	Number of variables.
`nbiom`	Number of biomarkers, i.e. the number of variables to be changed in the treatment class compared to the control class. The variables that are changed are always the first variables in the data matrix.
`group.diff`	Group difference: the average difference between values of the biomarkers in the two classes.
`nsimul`	Number of data sets to simulate.
`means`	Mean values of all variables, a vector.
`cormat`	Correlation matrix to be used in the simulation. Default is the identity matrix.
`X`	Experimental data matrix, without group differences.
`spikeI`	A vector of at least three different numbers, used to generate new values for the biomarker variables in the treated class.
`type`	Whether to use multiplication (useful when simulating cases where things like "twofold differences" are relevant), or addition (in the case of absolute differences in the treatment and control groups).
`stddev`	Additional noise: in every simulation, normally distributed noise with a standard deviation of `stddev * mean(spikeI)` will be added to `spikeI` before generating the actual simulated data.

Details

The spikeI argument in function gen.data2 provides the numbers that will be used to artificially "spike" the biomarker variables, either by multiplication (the default) or by addition. To obtain approximate two-fold differences, for example, one could use spikeI = c(1.8, 2.0, 2.2). At least three different values should be given since in most cases more than one set will be simulated and we require different values in the biomarker variables.
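The difference between multiplicative and additive spiking can be illustrated with a small base R sketch. The stand-in matrix `X` and the way spike values are assigned (one randomly drawn value per biomarker column) are assumptions made for illustration; the package may distribute the values differently.

```r
# Illustrative sketch only; the sampling scheme is an assumption,
# not the BioMark implementation.
set.seed(2)
X <- matrix(rlnorm(20 * 30), nrow = 20)  # stand-in for experimental data
ncontrol <- 10; nbiom <- 4
spikeI <- c(1.8, 2.0, 2.2)
stddev <- 0.05

# Rows after the first ncontrol are taken to be the treated samples
treated <- (ncontrol + 1):nrow(X)

# Perturb the spike values with noise of sd stddev * mean(spikeI)
spikes <- spikeI + rnorm(length(spikeI), sd = stddev * mean(spikeI))

# Assumed scheme: one spike value per biomarker column
col.spikes <- sample(spikes, nbiom, replace = TRUE)

# Multiplicative spiking: approximate fold changes
X.mult <- X
X.mult[treated, 1:nbiom] <- sweep(X[treated, 1:nbiom], 2, col.spikes, "*")

# Additive spiking: absolute differences
X.add <- X
X.add[treated, 1:nbiom] <- sweep(X[treated, 1:nbiom], 2, col.spikes, "+")

# Control rows are left untouched
all.equal(X.mult[1:ncontrol, ], X[1:ncontrol, ])
```

In both variants only the first nbiom columns of the treated rows change, which matches the statement that gen.data2 only alters the biomarker part of the treated class.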

Value

A list with the following elements:

`X`	An array of dimension `ncontrol + ntreated` times `nvar` times `nsimul`.
`Y`	The class vector.
`n.biomarkers`	The number of biomarkers.

Note that the biomarkers are always in the first nbiom columns of the data matrix.
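Because the true biomarkers sit in known columns, a list of this shape makes it simple to score a selection method over all simulated data sets. The sketch below builds a stand-in list with the documented elements (it does not call the package) and uses a plain t statistic as a hypothetical selection method; the shift of 2 and the control-first ordering of `Y` are assumptions.

```r
# Illustrative sketch only: a stand-in list with the documented layout.
set.seed(3)
ncontrol <- 10; ntreated <- 10; nvar <- 40; nbiom <- 5; nsimul <- 20
treated <- (ncontrol + 1):(ncontrol + ntreated)

X <- array(rnorm((ncontrol + ntreated) * nvar * nsimul),
           dim = c(ncontrol + ntreated, nvar, nsimul))
# Assumed effect: shift the biomarker columns of the treated rows by 2
X[treated, 1:nbiom, ] <- X[treated, 1:nbiom, ] + 2

sim <- list(X = X,
            Y = factor(rep(c("control", "treated"), c(ncontrol, ntreated))),
            n.biomarkers = nbiom)

# For each simulated data set, rank variables by absolute t statistic and
# count how many of the top n.biomarkers are true biomarkers.
hits <- sapply(seq_len(dim(sim$X)[3]), function(i) {
  tstats <- apply(sim$X[, , i], 2,
                  function(v) t.test(v ~ sim$Y)$statistic)
  top <- order(abs(tstats), decreasing = TRUE)[seq_len(sim$n.biomarkers)]
  sum(top <= sim$n.biomarkers)
})

# Average fraction of true biomarkers recovered per data set
mean(hits) / sim$n.biomarkers
```

The check `top <= sim$n.biomarkers` relies only on the documented convention that biomarkers occupy the first columns.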

Author(s)

Ron Wehrens

Examples

## Not run: 
X <- gen.data(10, nvar = 200)
names(X)
dim(X$X)

set.seed(7)
simdat <- gen.data(10, nvar = 1200, nbiom = 22, nsimul = 1,
                   group.diff = 2)
simdat.stab <- get.biom(simdat$X[,,1], simdat$Y, fmethod = "all",
                        type = "stab", ncomp = 3, scale.p = "auto")
## show LASSO success
traceplot(simdat.stab, lty = 1, col = rep(2:1, c(22, 1178)))

data(SpikePos)
real.markers <- which(SpikePos$annotation$found.in.standards > 0)
X.no.diff <- SpikePos$data[1:20, -real.markers]

set.seed(7)
simdat2 <- gen.data2(X.no.diff, ncontrol = 10, nbiom = 22,
                     spikeI = c(1.2, 1.4, 2), nsimul = 1)
simdat2.stab <- get.biom(simdat2$X[,,1], simdat2$Y,
                         fmethod = "all", type = "stab", ncomp = 3,
                         scale.p = "auto")
## show LASSO success
traceplot(simdat2.stab, lty = 1,
          col = rep(2:1, c(22, ncol(X.no.diff) - 22)))

## End(Not run)

[Package BioMark version 0.4.5 Index]