s_generate_data_mars {eclust}R Documentation

Generate non linear response and test and training sets for non-linear simulation study

Description

create a function that takes as input, the number of genes, the true beta vector, the gene expression matrix created from the generate_blocks function and returns a list of data matrix, as well as correlation matrices, TOM matrices, cluster information, training and test data

Usage

s_generate_data_mars(p, X, beta, binary_outcome = FALSE, truemodule, nActive,
  cluster_distance = c("corr", "corr0", "corr1", "tom", "tom0", "tom1",
  "diffcorr", "difftom", "corScor", "tomScor", "fisherScore"), n, n0,
  include_interaction = F, signal_to_noise_ratio = 1,
  eclust_distance = c("fisherScore", "corScor", "diffcorr", "difftom"),
  cluster_method = c("hclust", "protoclust"), cut_method = c("dynamic",
  "gap", "fixed"), distance_method = c("euclidean", "maximum", "manhattan",
  "canberra", "binary", "minkowski"), n_clusters,
  agglomeration_method = c("complete", "average", "ward.D2", "single",
  "ward.D", "mcquitty", "median", "centroid"), nPC = 1, K.max = 10,
  B = 10)

Arguments

p

number of genes in design matrix

X

gene expression matrix of size n x p using the generate_blocks function

beta

true beta coefficient vector

binary_outcome

Logical. Should a binary outcome be generated. Default is FALSE. See details on how a binary outcome is generated

truemodule

numeric vector of the true module membership used in the s_response_mars function. Modules 3 and 4 are active in the response. See s_response_mars function for details.

nActive

number of active genes in the response used in the s_response_mars

cluster_distance

character representing which matrix from the training set that you want to use to cluster the genes. Must be one of the following

  • corr, corr0, corr1, tom, tom0, tom1, diffcorr, difftom, corScor, tomScor, fisherScore

n

total number of subjects

n0

total number of subjects with E=0

include_interaction

Should an interaction with the environment be generated as part of the response. Default is FALSE.

signal_to_noise_ratio

signal to noise ratio, default is 1

eclust_distance

character representing which matrix from the training set that you want to use to cluster the genes based on the environment. See cluster_distance for avaialble options. Should be different from cluster_distance. For example, if cluster_distance=corr and EclustDistance=fisherScore. That is, one should be based on correlations ignoring the environment, and the other should be based on correlations accounting for the environment. This function will always return this add on

cluster_method

Cluster the data using hierarchical clustering or prototype clustering. Defaults cluster_method="hclust". Other option is protoclust, however this package must be installed before proceeding with this option

cut_method

what method to use to cut the dendrogram. 'dynamic' refers to dynamicTreeCut library. 'gap' is Tibshirani's gap statistic clusGap using the 'Tibs2001SEmax' rule. 'fixed' is a fixed number specified by the n_clusters argument

distance_method

one of "euclidean","maximum","manhattan", "canberra", "binary","minkowski" to be passed to dist function.

n_clusters

Number of clusters specified by the user. Only applicable when cut_method="fixed"

agglomeration_method

the agglomeration method to be used. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC).

nPC

number of principal components. Can be 1 or 2.

K.max

the maximum number of clusters to consider, must be at least two. Only used if cutMethod='gap'

B

integer, number of Monte Carlo (“bootstrap”) samples. Only used if cutMethod='gap'

Value

list of (in the following order)

beta_truth

a 1 column matrix containing the true beta coefficient vector

similarity

an object of class similarity which is the similarity matrix specified by the cluster_distance argument

similarityEclust

an object of class similarity which is the similarity matrix specified by the eclust_distance argument

DT

data.table of simulated data from the s_response function

Y

The simulated response

X0

the n0 x p design matrix for the unexposed subjects

X1

the n1 x p design matrix for the exposed subjects

X_train

the training design matrix for all subjects

X_test

the test set design matrix for all subjects

Y_train

the training set response

Y_test

the test set response

DT_train

the training response and training design matrix in a single data.frame object

DT_test

the test response and training design matrix in a single data.frame object

S0

a character vector of the active genes i.e. the ones that are associated with the response

n_clusters_All

the number of clusters identified by using the similarity matrix specified by the cluster_distance argument

n_clusters_Eclust

the number of clusters identified by using the similarity matrix specified by the eclust_distance argument

n_clusters_Addon

the sum of n_clusters_All and n_clusters_Eclust

clustersAll

the cluster membership of each gene based on the cluster_distance matrix

clustersAddon

the cluster membership of each gene based on both the cluster_distance matrix and the eclust_distance matrix. Note that each gene will appear twice here

clustersEclust

the cluster membership of each gene based on the eclust_distance matrix

gene_groups_inter

cluster membership of each gene with a penalty factor used for the group lasso

gene_groups_inter_Addon

cluster membership of each gene with a penalty factor used for the group lasso

tom_train_all

the TOM matrix based on all training subjects

tom_train_diff

the absolute difference of the exposed and unexposed TOM matrices: |TOM_{E=1} - TOM_{E=0}|

tom_train_e1

the TOM matrix based on training exposed subjects only

tom_train_e0

the TOM matrix based on training unexposed subjects only

corr_train_all

the Pearson correlation matrix based on all training subjects

corr_train_diff

the absolute difference of the exposed and unexposed Pearson correlation matrices: |Cor_{E=1} - Cor_{E=0}|

corr_train_e1

the Pearson correlation matrix based on training exposed subjects only

corr_train_e0

the Pearson correlation matrix based on training unexposed subjects only

fisherScore

The fisher scoring matrix. see u_fisherZ for details

corScor

The correlation scoring matrix: |Cor_{E=1} + Cor_{E=0} - 2|

mse_null

The MSE for the null model

DT_train_folds

The 10 training folds used for the stability measures

X_train_folds

The 10 X training folds (the same as in DT_train_folds)

Y_train_folds

The 10 Y training folds (the same as in DT_train_folds)

Examples

library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_generate_data_mars(p = p, X = X,
                               beta = beta,
                               binary_outcome = FALSE,
                               truemodule = truemodule1,
                               nActive = nActive,
                               include_interaction = FALSE,
                               cluster_distance = cluster_distance,
                               n = n, n0 = n0,
                               eclust_distance = Ecluster_distance,
                               signal_to_noise_ratio = SNR,
                               distance_method = distanceMethod,
                               cluster_method = clustMethod,
                               cut_method = cutMethod,
                               agglomeration_method = agglomerationMethod,
                               nPC = 1)
names(result)

[Package eclust version 0.1.0 Index]