s_generate_data_mars {eclust} | R Documentation |
Generate a nonlinear response and training and test sets for the nonlinear simulation study
Description
Takes as input the number of genes, the true beta vector, and the gene expression matrix created by the generate_blocks function, and returns a list containing the data matrices, correlation matrices, TOM matrices, cluster information, and the training and test data.
Usage
s_generate_data_mars(p, X, beta, binary_outcome = FALSE, truemodule, nActive,
cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
"diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0,
include_interaction = F, signal_to_noise_ratio = 1,
eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"),
cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic",
"gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan",
"canberra", "binary", "minkowski"), n_clusters,
agglomeration_method = c("complete", "average", "ward.D2", "single",
"ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10,
B = 10)
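Several arguments in the signature above list all of their allowed values as the default, for example cluster_method = c("hclust", "protoclust"). In R this pattern is conventionally resolved with match.arg(), which selects the first value when the argument is omitted. The sketch below uses a hypothetical function, pick_method (not part of eclust), to show the convention. That s_generate_data_mars resolves its arguments in exactly this way is an assumption.

```r
# Hypothetical helper, for illustration only (not an eclust function).
pick_method <- function(cluster_method = c("hclust", "protoclust")) {
  # match.arg() returns the first choice when the argument is missing
  # and raises an error if the supplied value is not one of the choices.
  match.arg(cluster_method)
}

default_choice  <- pick_method()               # argument omitted
explicit_choice <- pick_method("protoclust")   # explicit, valid choice

# An unsupported value raises an error; it is caught here and mapped to NA.
invalid_choice <- tryCatch(pick_method("kmeans"),
                           error = function(e) NA_character_)
```

Under this convention, omitting an argument is the same as passing the first listed value.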
Arguments
p |
number of genes in design matrix |
X |
gene expression matrix of size n x p using the
|
beta |
true beta coefficient vector |
binary_outcome |
Logical. Should a binary outcome be generated. Default
is |
truemodule |
numeric vector of the true module membership used in the
|
nActive |
number of active genes in the response used in the
|
cluster_distance |
character representing which matrix from the training set that you want to use to cluster the genes. Must be one of the following
|
n |
total number of subjects |
n0 |
total number of subjects with E=0 |
include_interaction |
Should an interaction with the environment be generated as part of the response. Default is FALSE. |
signal_to_noise_ratio |
signal to noise ratio, default is 1 |
eclust_distance |
character representing which matrix from the training
set that you want to use to cluster the genes based on the environment. See
|
cluster_method |
Cluster the data using hierarchical clustering or
prototype clustering. Defaults |
cut_method |
what method to use to cut the dendrogram. |
distance_method |
one of "euclidean","maximum","manhattan", "canberra",
"binary","minkowski" to be passed to |
n_clusters |
Number of clusters specified by the user. Only applicable
when |
agglomeration_method |
the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). |
nPC |
number of principal components. Can be 1 or 2. |
K.max |
the maximum number of clusters to consider, must be at least
two. Only used if |
B |
integer, number of Monte Carlo (“bootstrap”) samples. Only used if
|
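The clustering arguments map naturally onto base R: distance_method onto dist(), agglomeration_method onto hclust(), and, for cut_method = "fixed", n_clusters onto cutree(). The sketch below runs that pipeline on a toy correlation matrix. It is an illustration under assumptions, not the package's internal code: toy_expr, toy_sim and the other variable names are made up, and treating the rows of the similarity matrix as the points to cluster is an assumption.

```r
set.seed(1)

# Toy expression data: 50 subjects, 6 genes in two correlated groups of 3.
# All names here are hypothetical.
n_subjects <- 50
shared_a <- rnorm(n_subjects)
shared_b <- rnorm(n_subjects)
toy_expr <- cbind(
  sapply(1:3, function(i) shared_a + rnorm(n_subjects, sd = 0.1)),
  sapply(1:3, function(i) shared_b + rnorm(n_subjects, sd = 0.1))
)
colnames(toy_expr) <- paste0("Gene", 1:6)

# Similarity matrix (Pearson correlation, as for cluster_distance = "corr").
toy_sim <- cor(toy_expr)

# distance_method -> dist(); agglomeration_method -> hclust()
gene_dist <- dist(toy_sim, method = "euclidean")
gene_tree <- hclust(gene_dist, method = "average")

# cut_method = "fixed" with n_clusters = 2 -> cutree()
toy_clusters <- cutree(gene_tree, k = 2)
```

Here toy_clusters is a named integer vector with one cluster label per gene, which is the general shape one would expect for a per-gene cluster membership.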
Value
A list containing the following elements, in this order:
- beta_truth
a 1 column matrix containing the true beta coefficient vector
- similarity
an object of class similarity which is the similarity matrix specified by the cluster_distance argument
- similarityEclust
an object of class similarity which is the similarity matrix specified by the eclust_distance argument
- DT
data.table of simulated data from the s_response function
- Y
The simulated response
- X0
the n0 x p design matrix for the unexposed subjects
- X1
the n1 x p design matrix for the exposed subjects
- X_train
the training design matrix for all subjects
- X_test
the test set design matrix for all subjects
- Y_train
the training set response
- Y_test
the test set response
- DT_train
the training response and training design matrix in a single data.frame object
- DT_test
the test response and test design matrix in a single data.frame object
- S0
a character vector of the active genes i.e. the ones that are associated with the response
- n_clusters_All
the number of clusters identified by using the similarity matrix specified by the cluster_distance argument
- n_clusters_Eclust
the number of clusters identified by using the similarity matrix specified by the eclust_distance argument
- n_clusters_Addon
the sum of n_clusters_All and n_clusters_Eclust
- clustersAll
the cluster membership of each gene based on the cluster_distance matrix
- clustersAddon
the cluster membership of each gene based on both the cluster_distance matrix and the eclust_distance matrix. Note that each gene will appear twice here
- clustersEclust
the cluster membership of each gene based on the eclust_distance matrix
- gene_groups_inter
cluster membership of each gene with a penalty factor used for the group lasso
- gene_groups_inter_Addon
cluster membership of each gene with a penalty factor used for the group lasso
- tom_train_all
the TOM matrix based on all training subjects
- tom_train_diff
the absolute difference of the exposed and unexposed TOM matrices:
|TOM_{E=1} - TOM_{E=0}|
- tom_train_e1
the TOM matrix based on training exposed subjects only
- tom_train_e0
the TOM matrix based on training unexposed subjects only
- corr_train_all
the Pearson correlation matrix based on all training subjects
- corr_train_diff
the absolute difference of the exposed and unexposed Pearson correlation matrices:
|Cor_{E=1} - Cor_{E=0}|
- corr_train_e1
the Pearson correlation matrix based on training exposed subjects only
- corr_train_e0
the Pearson correlation matrix based on training unexposed subjects only
- fisherScore
The Fisher scoring matrix. See u_fisherZ for details
- corScor
The correlation scoring matrix:
|Cor_{E=1} + Cor_{E=0} - 2|
- mse_null
The MSE for the null model
- DT_train_folds
The 10 training folds used for the stability measures
- X_train_folds
The 10 X training folds (the same as in DT_train_folds)
- Y_train_folds
The 10 Y training folds (the same as in DT_train_folds)
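The correlation-based entries in the list above can be reproduced in a few lines of base R, which may help when checking what each matrix contains. The sketch below follows the two formulas given for corr_train_diff and corScor. The toy matrices x_e0 and x_e1 and all variable names are hypothetical, and the package may compute these quantities differently internally.

```r
set.seed(2)

# Hypothetical toy design matrices: 30 subjects per group, 4 genes.
p_toy <- 4
x_e0 <- matrix(rnorm(30 * p_toy), ncol = p_toy)  # unexposed subjects (E = 0)
x_e1 <- matrix(rnorm(30 * p_toy), ncol = p_toy)  # exposed subjects (E = 1)

# Pearson correlation matrices for each group and for all subjects.
cor_e0  <- cor(x_e0)
cor_e1  <- cor(x_e1)
cor_all <- cor(rbind(x_e0, x_e1))

# |Cor_{E=1} - Cor_{E=0}|
cor_diff <- abs(cor_e1 - cor_e0)

# |Cor_{E=1} + Cor_{E=0} - 2|
cor_scor <- abs(cor_e1 + cor_e0 - 2)
```

Both derived matrices are symmetric with a zero diagonal, because every gene has correlation 1 with itself in each group.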
Examples
library(eclust)
library(magrittr)
# simulation parameters
rho <- 0.90; p <- 500; SNR <- 1; n <- 200; n0 <- n1 <- 100; nActive <- p * 0.10
cluster_distance <- "tom"; Ecluster_distance <- "difftom"
rhoOther <- 0.6; betaMean <- 2; alphaMean <- 1; betaE <- 3
distanceMethod <- "euclidean"; clustMethod <- "hclust"
cutMethod <- "dynamic"; agglomerationMethod <- "average"
# in this simulation, blocks 3 and 4 are the important ones
# leaveOut: optional specification of modules that should be left out
# of the simulation, that is, their genes will be simulated as unrelated
# ("grey"). This can be useful when simulating several sets, in some of which
# a module is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
minCor = 0.01,
maxCor = 1,
corPower = 1,
propNegativeCor = 0.3,
backgroundNoise = 0.5,
signed = FALSE,
leaveOut = 1:4)
d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
minCor = 0.4,
maxCor = 1,
corPower = 0.3,
propNegativeCor = 0.3,
backgroundNoise = 0.5,
signed = FALSE)
truemodule1 <- d1$setLabels
X <- rbind(d0$datExpr, d1$datExpr) %>%
magrittr::set_colnames(paste0("Gene", 1:p)) %>%
magrittr::set_rownames(paste0("Subject",1:n))
betaMainEffect <- vector("double", length = p)
# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
nActive/2, betaMean - 0.1, betaMean + 0.1)
# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)
result <- s_generate_data_mars(p = p, X = X,
beta = beta,
binary_outcome = FALSE,
truemodule = truemodule1,
nActive = nActive,
include_interaction = FALSE,
cluster_distance = cluster_distance,
n = n, n0 = n0,
eclust_distance = Ecluster_distance,
signal_to_noise_ratio = SNR,
distance_method = distanceMethod,
cluster_method = clustMethod,
cut_method = cutMethod,
agglomeration_method = agglomerationMethod,
nPC = 1)
names(result)