s_generate_data {eclust} | R Documentation |
Generate linear response data and test and training sets for simulation study
Description
create a function that takes as input, the number of genes, the true beta vector, the gene expression matrix created from the generate_blocks function and returns a list of data matrix, as well as correlation matrices, TOM matrices, cluster information, training and test data
Usage
s_generate_data(p, X, beta, binary_outcome = FALSE,
cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
"diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0,
include_interaction = F, signal_to_noise_ratio = 1,
eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"),
cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic",
"gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan",
"canberra", "binary", "minkowski"), n_clusters,
agglomeration_method = c("complete", "average", "ward.D2", "single",
"ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10,
B = 10)
Arguments
p |
number of genes in design matrix |
X |
gene expression matrix of size n x p using the
|
beta |
true beta coefficient vector |
binary_outcome |
Logical. Should a binary outcome be generated. Default
is |
cluster_distance |
character representing which matrix from the training set that you want to use to cluster the genes. Must be one of the following
|
n |
total number of subjects |
n0 |
total number of subjects with E=0 |
include_interaction |
Should an interaction with the environment be generated as part of the response. Default is FALSE. |
signal_to_noise_ratio |
signal to noise ratio, default is 1 |
eclust_distance |
character representing which matrix from the training
set that you want to use to cluster the genes based on the environment. See
|
cluster_method |
Cluster the data using hierarchical clustering or
prototype clustering. Defaults |
cut_method |
what method to use to cut the dendrogram. |
distance_method |
one of "euclidean","maximum","manhattan", "canberra",
"binary","minkowski" to be passed to |
n_clusters |
Number of clusters specified by the user. Only applicable
when |
agglomeration_method |
the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). |
nPC |
number of principal components to extract from each cluster. Default is 1. Only 1 or 2 is allowed. |
K.max |
the maximum number of clusters to consider, must be at least
two. Only used if |
B |
integer, number of Monte Carlo (“bootstrap”) samples. Only used if
|
Details
To generate a binary outcome we first generate a continuous outcome
Y which is X^T \beta
, defined p = 1/(1 + exp(-Y ))
and used
this to generate a two-class outcome z with Pr(z = 1) = p
and
Pr(z = 0) = 1 - p
.
Value
list of (in the following order)
- beta_truth
a 1 column matrix containing the true beta coefficient vector
- similarity
an object of class similarity which is the similarity matrix specified by the
cluster_distance
argument- similarityEclust
an object of class similarity which is the similarity matrix specified by the
eclust_distance
argument- DT
data.table of simulated data from the
s_response
function- Y
The simulated response
- X0
the n0 x p design matrix for the unexposed subjects
- X1
the n1 x p design matrix for the exposed subjects
- X_train
the training design matrix for all subjects
- X_test
the test set design matrix for all subjects
- Y_train
the training set response
- Y_test
the test set response
- DT_train
the training response and training design matrix in a single data.frame object
- DT_test
the test response and training design matrix in a single data.frame object
- S0
a character vector of the active genes i.e. the ones that are associated with the response
- n_clusters_All
the number of clusters identified by using the similarity matrix specified by the
cluster_distance
argument- n_clusters_Eclust
the number of clusters identified by using the similarity matrix specified by the
eclust_distance
argument- n_clusters_Addon
the sum of
n_clusters_All
andn_clusters_Eclust
- clustersAll
the cluster membership of each gene based on the
cluster_distance
matrix- clustersAddon
the cluster membership of each gene based on both the
cluster_distance
matrix and theeclust_distance
matrix. Note that each gene will appear twice here- clustersEclust
the cluster membership of each gene based on the
eclust_distance
matrix- gene_groups_inter
cluster membership of each gene with a penalty factor used for the group lasso
- gene_groups_inter_Addon
cluster membership of each gene with a penalty factor used for the group lasso
- tom_train_all
the TOM matrix based on all training subjects
- tom_train_diff
the absolute difference of the exposed and unexposed TOM matrices:
|TOM_{E=1} - TOM_{E=0}|
- tom_train_e1
the TOM matrix based on training exposed subjects only
- tom_train_e0
the TOM matrix based on training unexposed subjects only
- corr_train_all
the Pearson correlation matrix based on all training subjects
- corr_train_diff
the absolute difference of the exposed and unexposed Pearson correlation matrices:
|Cor_{E=1} - Cor_{E=0}|
- corr_train_e1
the Pearson correlation matrix based on training exposed subjects only
- corr_train_e0
the Pearson correlation matrix based on training unexposed subjects only
- fisherScore
The fisher scoring matrix. see
u_fisherZ
for details- corScor
The correlation scoring matrix:
|Cor_{E=1} + Cor_{E=0} - 2|
- mse_null
The MSE for the null model
- DT_train_folds
The 10 training folds used for the stability measures
- X_train_folds
The 10 X training folds (the same as in DT_train_folds)
- Y_train_folds
The 10 Y training folds (the same as in DT_train_folds)
Note
this function calls the s_response
to generate phenotype as a
function of the gene expression data. This function also returns other
information derived from the simulated data including the test and training
sets, the correlation and TOM matrices and the clusters.
the PCs and averages need to be calculated in the fitting functions, because these will change based on the CV fold
Examples
library(magrittr)
# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"
#in this simulation its blocks 3 and 4 that are important
#leaveOut: optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
minCor = 0.01,
maxCor = 1,
corPower = 1,
propNegativeCor = 0.3,
backgroundNoise = 0.5,
signed = FALSE,
leaveOut = 1:4)
d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
minCor = 0.4,
maxCor = 1,
corPower = 0.3,
propNegativeCor = 0.3,
backgroundNoise = 0.5,
signed = FALSE)
truemodule1 <- d1$setLabels
X <- rbind(d0$datExpr, d1$datExpr) %>%
magrittr::set_colnames(paste0("Gene", 1:p)) %>%
magrittr::set_rownames(paste0("Subject",1:n))
betaMainEffect <- vector("double", length = p)
betaMainInteractions <- vector("double", length = p)
# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
nActive/2, betaMean - 0.1, betaMean + 0.1)
# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1)
beta <- c(betaMainEffect, betaE, betaMainInteractions)
## Not run:
result <- s_generate_data(p = p, X = X,
beta = beta,
include_interaction = TRUE,
cluster_distance = cluster_distance,
n = n, n0 = n0,
eclust_distance = Ecluster_distance,
signal_to_noise_ratio = SNR,
distance_method = distanceMethod,
cluster_method = clustMethod,
cut_method = cutMethod,
agglomeration_method = agglomerationMethod,
nPC = 1)
names(result)
## End(Not run)