s_mars_clust {eclust} R Documentation

## Fit MARS Models on Simulated Cluster Summaries

### Description

This function creates summaries of the given clusters (e.g. the 1st PC or the average), and then runs Friedman's MARS on those summaries. It is intended for simulated data where the 'truth' is known, i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016.
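As a rough illustration of what a cluster summary is, the sketch below computes the average and the first principal component scores for each cluster using base R only. The helper `summarise_clusters` and the toy data are hypothetical and are not part of eclust; the package's internal computation (e.g. its centering and scaling choices) may differ.

```r
# Toy data: 6 observations, 4 features in two hypothetical clusters
set.seed(1)
x <- matrix(rnorm(24), nrow = 6,
            dimnames = list(NULL, paste0("Gene", 1:4)))
gene_groups <- data.frame(gene = paste0("Gene", 1:4),
                          cluster = c(1, 1, 2, 2),
                          stringsAsFactors = FALSE)

# Hypothetical helper (not an eclust function): one summary column per cluster
summarise_clusters <- function(x, gene_groups, summary = c("pc", "avg")) {
  summary <- match.arg(summary)
  clusters <- sort(unique(gene_groups$cluster))
  out <- sapply(clusters, function(k) {
    genes <- gene_groups$gene[gene_groups$cluster == k]
    xk <- x[, genes, drop = FALSE]
    if (summary == "avg") {
      rowMeans(xk)                                   # per-subject cluster average
    } else {
      prcomp(xk, center = TRUE, scale. = FALSE)$x[, 1]  # 1st PC scores
    }
  })
  colnames(out) <- paste0(summary, "_cluster", clusters)
  out
}

avg_summ <- summarise_clusters(x, gene_groups, "avg")
pc_summ  <- summarise_clusters(x, gene_groups, "pc")
```

Both results have one row per observation and one column per cluster; these columns, rather than the original features, would then be passed to the MARS fit.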

### Usage

s_mars_clust(x_train, x_test, y_train, y_test, s0, summary = c("pc", "avg"),
model = c("MARS"), exp_family = c("gaussian", "binomial"), gene_groups,
true_beta = NULL, topgenes = NULL, stability = F, filter = F,
include_E = T, include_interaction = F, clust_type = c("CLUST",
"ECLUST"), nPC = 1)


### Arguments

| Argument | Description |
|---|---|
| x_train | ntrain x p matrix of the simulated training set, where ntrain is the number of training observations and p is the total number of predictors. This matrix needs to have named columns representing the feature names or the gene names. |
| x_test | ntest x p matrix of the simulated test set, where ntest is the number of test observations and p is the total number of predictors. This matrix needs to have named columns representing the feature names or the gene names. |
| y_train | Numeric vector of length ntrain representing the responses for the training subjects. If continuous, you must set exp_family = "gaussian". For exp_family = "binomial", it should be either a factor with two levels or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class). |
| y_test | Numeric vector of length ntest representing the responses for the test subjects. If continuous, you must set exp_family = "gaussian". For exp_family = "binomial", it should be either a factor with two levels or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class). |
| s0 | Character vector of the active feature names, i.e., the features in x_train that are truly associated with the response. |
| summary | The summary of each cluster. Can be the principal component ("pc") or the average ("avg"). Default is summary = "pc", which takes the first nPC principal components. Currently a maximum of 2 principal components can be chosen. |
| model | Type of non-linear model to be fit. Currently only Friedman's MARS is supported. |
| exp_family | Response type. See the details for the y_train argument above. |
| gene_groups | data.frame that contains the group membership for each feature. The first column should be called 'gene' and the second column should be called 'cluster'. The 'gene' column identifies the features and must use the same identifiers as the columns of the x_train and x_test matrices. The 'cluster' column is an integer indicating the cluster group membership. A cluster group membership of 0 implies the feature did not cluster into any group. |
| true_beta | Numeric vector of true beta coefficients. |
| topgenes | List of features to keep if filter = TRUE. Default is topgenes = NULL, which means all features are kept for the analysis. |
| stability | Should stability measures be calculated. Default is stability = FALSE. See details. |
| filter | Should the analysis be run on a subset of features. Default is filter = FALSE. |
| include_E | Should the environment variable be included in the regression analysis. Default is include_E = TRUE. |
| include_interaction | Should interaction effects between the features in x_train and the environment variable be fit. Default is include_interaction = FALSE. |
| clust_type | Method used to cluster the features. This is used for naming the output only and has no consequence for the results. clust_type = "CLUST" is the default, which means that the environment variable was not used in the clustering step. clust_type = "ECLUST" means that the environment variable was used in the clustering step. |
| nPC | Number of principal components if summary = "pc". Default is nPC = 1. Can be either 1 or 2. |
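The input shapes described above can be sketched with a toy example. All object values here are made up for illustration, and the `inputs_ok` check is a hypothetical sanity test, not something s_mars_clust is documented to perform.

```r
set.seed(2)
p <- 5
feature_names <- paste0("Gene", 1:p)

# Train and test matrices share the same named columns
x_train <- matrix(rnorm(20 * p), ncol = p, dimnames = list(NULL, feature_names))
x_test  <- matrix(rnorm(10 * p), ncol = p, dimnames = list(NULL, feature_names))

# First column 'gene', second column 'cluster';
# cluster 0 marks a feature that did not join any cluster
gene_groups <- data.frame(gene = feature_names,
                          cluster = c(1L, 1L, 2L, 2L, 0L),
                          stringsAsFactors = FALSE)

# Active features, given by name
s0 <- c("Gene1", "Gene3")

# Hypothetical consistency check on the inputs
inputs_ok <- !is.null(colnames(x_train)) &&
  identical(colnames(x_train), colnames(x_test)) &&
  identical(names(gene_groups)[1:2], c("gene", "cluster")) &&
  all(gene_groups$gene %in% colnames(x_train)) &&
  all(s0 %in% colnames(x_train))
```

Running a check of this kind before a long simulation can catch mismatched identifiers between the matrices and gene_groups early.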

### Details

This function first uses 10-fold cross-validation to tune the degree (either 1 or 2) via the train function with method = "earth", keeping nprune fixed at 1000. The earth function is then called with nk = 1000 and pmethod = "backward" to fit the MARS model using the degree selected by the 10-fold CV.
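The two-step procedure can be sketched roughly as follows. This assumes that train refers to caret::train and earth to earth::earth; the wrapper name `fit_mars_cv` is hypothetical, and the actual s_mars_clust source may pass additional arguments or handle the binomial case differently.

```r
# Hypothetical wrapper (not an eclust function); requires the caret and
# earth packages to be installed when it is called.
fit_mars_cv <- function(x, y) {
  # Step 1: 10-fold CV over degree = 1, 2 with nprune held at 1000
  cv <- caret::train(
    x = x, y = y,
    method = "earth",
    trControl = caret::trainControl(method = "cv", number = 10),
    tuneGrid = expand.grid(degree = 1:2, nprune = 1000)
  )

  # Step 2: refit with earth at the selected degree,
  # using backward pruning and at most 1000 forward-pass terms
  earth::earth(x = x, y = y,
               degree = cv$bestTune$degree,
               nk = 1000, pmethod = "backward")
}
```

Here x would be the matrix of cluster summaries with named columns and y the training response. Degree 1 gives an additive model, while degree 2 also allows pairwise products of hinge functions.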

### Value

This function has two different outputs depending on whether stability = TRUE or stability = FALSE

If stability = TRUE then this function returns a p x 2 data.frame or data.table of regression coefficients without the intercept. The output of this is used for subsequent calculations of stability.

If stability = FALSE then returns a vector with the following elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+) for definitions of each measure of performance):

| Element | Description |
|---|---|
| mse or AUC | Test set mean squared error if exp_family = "gaussian", or test set area under the curve if exp_family = "binomial", calculated using the roc function |
| RMSE | Square root of the mse. Only applicable if exp_family = "gaussian" |
| Shat | Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model |
| TPR | True positive rate |
| FPR | False positive rate |
| Correct Sparsity | Correct true positives + correct true negative coefficients, divided by the total number of features |
| CorrectZeroMain | Proportion of correct true negative main effects |
| CorrectZeroInter | Proportion of correct true negative interactions |
| IncorrectZeroMain | Proportion of incorrect true negative main effects |
| IncorrectZeroInter | Proportion of incorrect true negative interaction effects |
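For orientation, the sketch below computes several of these measures on a made-up example using the usual textbook definitions. The feature names, selected set and predictions are invented, and the definitions in Table 3 of Bhatnagar et al. remain authoritative where they differ.

```r
features <- paste0("Gene", 1:10)
s0    <- c("Gene1", "Gene2", "Gene3")   # truly active features
s_hat <- c("Gene1", "Gene2", "Gene7")   # features selected by a hypothetical fit

tp <- length(intersect(s_hat, s0))                  # selected and active
fp <- length(setdiff(s_hat, s0))                    # selected but inactive
tn <- length(setdiff(features, union(s_hat, s0)))   # neither selected nor active

shat <- length(s_hat)
tpr  <- tp / length(s0)
fpr  <- fp / (length(features) - length(s0))
correct_sparsity <- (tp + tn) / length(features)

# Test set error for a continuous response (toy values)
y_test <- c(1.0, 2.0, 3.0)
y_pred <- c(1.5, 2.0, 2.0)
mse  <- mean((y_test - y_pred)^2)
rmse <- sqrt(mse)
```

In this toy case two of the three active features are recovered and one inactive feature is selected, so 8 of the 10 features are classified correctly.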

### Examples

## Not run:
library(magrittr)

# simulation parameters
rho = 0.90; p = 500; SNR = 1; n = 200; n0 = n1 = 100; nActive = p*0.10
cluster_distance = "tom"; Ecluster_distance = "difftom"; rhoOther = 0.6
betaMean = 2; alphaMean = 1; betaE = 3; distanceMethod = "euclidean"
clustMethod = "hclust"; cutMethod = "dynamic"; agglomerationMethod = "average"

# in this simulation it is blocks 3 and 4 that are important
# leaveOut: optional specification of modules that should be left out
# of the simulation, that is, their genes will be simulated as unrelated
# ("grey"). This can be useful when simulating several sets, in some of which
# a module is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_generate_data_mars(p = p, X = X,
                               beta = beta,
                               binary_outcome = FALSE,
                               truemodule = truemodule1,
                               nActive = nActive,
                               include_interaction = FALSE,
                               cluster_distance = cluster_distance,
                               n = n, n0 = n0,
                               eclust_distance = Ecluster_distance,
                               signal_to_noise_ratio = SNR,
                               distance_method = distanceMethod,
                               cluster_method = clustMethod,
                               cut_method = cutMethod,
                               agglomeration_method = agglomerationMethod,
                               nPC = 1)

mars_res <- s_mars_clust(x_train = result[["X_train"]],
                         x_test = result[["X_test"]],
                         y_train = result[["Y_train"]],
                         y_test = result[["Y_test"]],
                         s0 = result[["S0"]],
                         summary = "pc",
                         exp_family = "gaussian",