sim.data {penalizedSVM} | R Documentation |
Simulation of microarray data
Description
Simulation of 'n' samples. Each sample has 'ng' genes, of which only 'nsg' are called significant and influence the class labels. The remaining '(ng - nsg)' genes are called balanced. All gene ratios are drawn from a multivariate normal distribution. Blocks of highly correlated genes can optionally be created.
Usage
sim.data(n = 256, ng = 1000, nsg = 100,
p.n.ratio = 0.5,
sg.pos.factor= 1, sg.neg.factor= -1,
# correlation info:
corr = FALSE, corr.factor = 0.8,
# block info:
blocks = FALSE, n.blocks = 6, nsg.block = 1, ng.block = 5,
seed = 123, ...)
Arguments
n |
number of samples, logistic regression works well if |
ng |
number of genes |
nsg |
number of significant genes |
p.n.ratio |
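proportion of positive genes among the significant genes (default 0.5); e.g. with nsg = 10, p.n.ratio = 0.8 yields 8 positive and 2 negative significant genes (see Example 2) |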
sg.pos.factor |
impact factor of positive significant genes on the classification (default 1) |
sg.neg.factor |
impact factor of negative significant genes on the classification (default -1) |
corr |
are the genes correlated with each other? (default FALSE). See Details |
corr.factor |
correlation factor for genes, between 0 and 1 (default 0.8) |
blocks |
are blocks of highly correlated genes allowed? (default FALSE) |
n.blocks |
number of blocks |
nsg.block |
number of significant genes per block |
ng.block |
number of genes per block |
seed |
seed for the random number generator (default 123) |
... |
additional argument(s) |
Details
If no blocks are defined (n.blocks=0 or blocks=FALSE) and corr=TRUE, a covariance matrix is created for all genes, with correlation decreasing as the distance between gene indices grows: cov(i,j) = cov(j,i) = corr.factor^|i-j|
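The decaying covariance structure described above can be illustrated with a few lines of base R. This is only a sketch, not the internal code of sim.data: the names `ng.demo`, `n.demo` and `rho` are hypothetical, unit variances are assumed, and sampling via the Cholesky factor is just one common way to draw from a multivariate normal distribution.

```r
# Sketch only: build a covariance matrix with entries rho^|i - j| in base R.
# ng.demo, n.demo and rho are illustrative names, not arguments of sim.data.
ng.demo <- 5      # number of genes in this toy example
rho     <- 0.8    # plays the role of corr.factor

# Entry (i, j) is rho^|i - j|, so correlation decays with index distance.
idx   <- seq_len(ng.demo)
sigma <- rho^abs(outer(idx, idx, "-"))

# Draw n.demo samples from N(0, sigma) using the Cholesky factor.
# chol() returns upper-triangular R with t(R) %*% R == sigma, so the rows
# of z %*% R have covariance sigma.
set.seed(123)
n.demo <- 20
z      <- matrix(rnorm(n.demo * ng.demo), nrow = n.demo)
x.demo <- t(z %*% chol(sigma))   # genes in rows, samples in columns
```

Neighbouring genes are then strongly correlated (0.8), genes two positions apart less so (0.64), and so on.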
Value
x |
matrix of simulated data. Genes in rows and samples in columns |
y |
named vector of class labels |
seed |
seed used for the simulation |
Author(s)
Wiebke Werft, Natalia Becker
See Also
Examples
my.seed<-123
# 1. simulate 20 samples, with 100 genes in each. Only the first three genes
# (nsg = 3) have an impact on the class labels.
# All genes are assumed to be i.i.d.
train<-sim.data(n = 20, ng = 100, nsg = 3, corr=FALSE, seed=my.seed )
str(train)
# 2. change the proportion between positive and negative significant genes
#(from 0.5 to 0.8)
train<-sim.data(n = 20, ng = 100, nsg = 10, p.n.ratio = 0.8, seed=my.seed )
rownames(train$x)[1:15]
#  [1] "pos1" "pos2" "pos3" "pos4" "pos5" "pos6" "pos7" "pos8"
#  [9] "neg1" "neg2" "bal1" "bal2" "bal3" "bal4" "bal5"
# 3. assume correlation within the positive significant genes,
# the negative significant genes and the 'balanced' genes separately.
train<-sim.data(n = 20, ng = 100, nsg = 10, corr=TRUE, seed=my.seed )
#cor(t(train$x[1:15,]))
# 4. add 6 blocks of 5 genes each and only one significant gene per block.
# all genes in the block are correlated with constant correlation factor
# corr.factor=0.8
train<-sim.data(n = 20, ng = 100, nsg = 6, corr=TRUE, corr.factor=0.8,
blocks=TRUE, n.blocks=6, nsg.block=1, ng.block=5, seed=my.seed )
str(train)
# first block
#cor(t(train$x[1:5,]))
# second block
#cor(t(train$x[6:10,]))
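
The block structure used in Example 4 (6 blocks of 5 genes, constant correlation 0.8 within a block) can be pictured as a block-diagonal covariance matrix. The following base R sketch is an illustration under assumptions, not the code used by sim.data: unit variances and zero correlation between blocks are assumed, genes outside the blocks are ignored, and all `*.demo` names are hypothetical.

```r
# Sketch only: block-diagonal covariance with constant within-block correlation.
# The *.demo names mirror, but are not, the arguments of sim.data.
n.blocks.demo <- 6     # number of blocks
ng.block.demo <- 5     # genes per block
rho           <- 0.8   # plays the role of corr.factor

# One block: 1 on the diagonal, rho everywhere else.
block <- matrix(rho, nrow = ng.block.demo, ncol = ng.block.demo)
diag(block) <- 1

# Place n.blocks.demo copies of the block along the diagonal;
# entries linking different blocks are zero.
sigma.blocks <- kronecker(diag(n.blocks.demo), block)

dim(sigma.blocks)        # 30 x 30
sigma.blocks[1:5, 1:5]   # first block
```

Under these assumptions, the commented-out cor() calls above should show sample correlations scattered around 0.8 within each block, with noticeable noise for only 20 samples.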