s_mars_clust {eclust} | R Documentation |
Fit MARS Models on Simulated Cluster Summaries
Description
This function creates summaries of the given clusters (e.g. the 1st PC or the average) and then runs Friedman's MARS on those summaries. It is intended for simulated data where the 'truth' is known, i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016.
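The cluster summaries described above can be illustrated in base R. This is only a minimal sketch, not the package's internal code: the toy data, the helper name summarise_clusters and the use of prcomp are assumptions made for illustration.

```r
# Hypothetical toy data: 20 samples, 6 features assigned to 2 clusters
set.seed(1)
x <- matrix(rnorm(20 * 6), nrow = 20,
            dimnames = list(NULL, paste0("Gene", 1:6)))
gene_groups <- data.frame(gene = colnames(x),
                          cluster = rep(c(1, 2), each = 3),
                          stringsAsFactors = FALSE)

# Reduce each cluster to one column: its average or its 1st principal component
# (summarise_clusters is an illustrative helper, not an eclust function)
summarise_clusters <- function(x, gene_groups, summary = c("pc", "avg")) {
  summary <- match.arg(summary)
  by_cluster <- split(gene_groups$gene, gene_groups$cluster)
  out <- sapply(by_cluster, function(genes) {
    block <- x[, genes, drop = FALSE]
    if (summary == "avg") {
      rowMeans(block)
    } else {
      prcomp(block, center = TRUE, scale. = FALSE)$x[, 1]
    }
  })
  colnames(out) <- paste0(summary, names(by_cluster))
  out
}

avg_summaries <- summarise_clusters(x, gene_groups, "avg")
pc_summaries  <- summarise_clusters(x, gene_groups, "pc")
dim(pc_summaries)  # one row per sample, one column per cluster
```

Either matrix of summaries could then serve as the predictors of a MARS fit in place of the original features.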
Usage
s_mars_clust(x_train, x_test, y_train, y_test, s0, summary = c("pc", "avg"),
model = c("MARS"), exp_family = c("gaussian", "binomial"), gene_groups,
true_beta = NULL, topgenes = NULL, stability = F, filter = F,
include_E = T, include_interaction = F, clust_type = c("CLUST",
"ECLUST"), nPC = 1)
Arguments
x_train |
|
x_test |
|
y_train |
numeric vector of length |
y_test |
numeric vector of length |
s0 |
character vector of the active feature names, i.e., the features in
|
summary |
the summary of each cluster. Can be the principal component or
average. Default is |
model |
Type of non-linear model to be fit. Currently only Friedman's MARS is supported. |
exp_family |
Response type. See details for |
gene_groups |
data.frame that contains the group membership for each
feature. The first column is called 'gene' and the second column should be
called 'cluster'. The 'gene' column identifies the features and must be the
same identifiers in the |
true_beta |
numeric vector of true beta coefficients |
topgenes |
List of features to keep if |
stability |
Should stability measures be calculated. Default is
|
filter |
Should analysis be run on a subset of features. Default is
|
include_E |
Should the environment variable be included in the
regression analysis. Default is |
include_interaction |
Should interaction effects between the features in
|
clust_type |
Method used to cluster the features. This is used for
naming the output only and has no consequence for the results.
|
nPC |
Number of principal components if |
Details
This function first runs 10-fold cross-validation to tune the degree (either 1 or 2) using the train function with method = "earth" and nprune fixed at 1000. The earth function is then used, with nk = 1000 and pmethod = "backward", to fit the MARS model using the optimal degree from the 10-fold CV.
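The two-step procedure above can be sketched roughly as follows, assuming the caret and earth packages are installed. The toy predictors (pc1, pc2, E) and the response are made up for illustration, and the exact call used inside s_mars_clust may differ.

```r
library(caret)  # train(), trainControl()
library(earth)  # earth()

# Hypothetical data: two cluster summaries plus a binary environment variable
set.seed(123)
n <- 100
x <- data.frame(pc1 = rnorm(n), pc2 = rnorm(n), E = rbinom(n, 1, 0.5))
y <- 2 * pmax(x$pc1, 0) + x$pc2 * x$E + rnorm(n, sd = 0.5)

# Step 1: 10-fold CV over degree = 1 or 2, with nprune held at 1000
cv_fit <- train(x = x, y = y, method = "earth",
                tuneGrid = expand.grid(degree = 1:2, nprune = 1000),
                trControl = trainControl(method = "cv", number = 10))
best_degree <- cv_fit$bestTune$degree

# Step 2: fit MARS with the selected degree, nk = 1000, backward pruning
mars_fit <- earth(x = x, y = y, degree = best_degree,
                  nk = 1000, pmethod = "backward")
summary(mars_fit)
```

Holding nprune at a large value means the cross-validation effectively compares only the additive model (degree 1) against the model with two-way interactions (degree 2).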
Value
This function has two different outputs depending on whether stability = TRUE or stability = FALSE.
If stability = TRUE, the function returns a p x 2 data.frame or data.table of regression coefficients without the intercept. This output is used for subsequent calculations of stability.
If stability = FALSE, the function returns a vector with the following elements (see Table 3: Measures of Performance in Bhatnagar et al. (2016+) for definitions of each measure of performance):
mse or AUC |
Test set
mean squared error if |
RMSE |
Square root of the mse. Only
applicable if |
Shat |
Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model |
TPR |
true positive rate |
FPR |
false positive rate |
Correct Sparsity |
Correct true positives + correct true negative coefficients divided by the total number of features |
CorrectZeroMain |
Proportion of correct true negative main effects |
CorrectZeroInter |
Proportion of correct true negative interactions |
IncorrectZeroMain |
Proportion of incorrect true negative main effects |
IncorrectZeroInter |
Proportion of incorrect true negative interaction effects |
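As a rough illustration of how some of these measures relate to the active set, the sketch below computes them in base R from a hypothetical truth (s0) and a hypothetical set of selected features (shat). It uses the conventional definitions of TPR and FPR, which are assumed here; the authoritative definitions are those in Table 3 of the paper, and all numbers are made up.

```r
# Hypothetical feature universe, true active set and selected set
all_features <- paste0("Gene", 1:10)
s0   <- c("Gene1", "Gene2", "Gene3")   # truly active features
shat <- c("Gene2", "Gene3", "Gene7")   # features with non-zero estimates

Shat <- length(shat)                                 # number selected
true_pos <- length(intersect(shat, s0))              # active and selected
false_pos <- length(setdiff(shat, s0))               # inactive but selected
true_neg <- length(setdiff(all_features, union(s0, shat)))  # inactive, not selected

TPR <- true_pos / length(s0)
FPR <- false_pos / (length(all_features) - length(s0))
correct_sparsity <- (true_pos + true_neg) / length(all_features)

# Test-set error for a continuous response (hypothetical predictions)
y_test <- c(1.0, 2.0, 3.0)
y_hat  <- c(1.5, 2.0, 2.0)
mse  <- mean((y_test - y_hat)^2)
RMSE <- sqrt(mse)

c(Shat = Shat, TPR = TPR, FPR = FPR,
  CorrectSparsity = correct_sparsity, mse = mse, RMSE = RMSE)
```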
Examples
## Not run:
library(magrittr)
# simulation parameters
rho <- 0.90; p <- 500; SNR <- 1; n <- 200; n0 <- n1 <- 100; nActive <- p * 0.10
cluster_distance <- "tom"; Ecluster_distance <- "difftom"
rhoOther <- 0.6; betaMean <- 2; alphaMean <- 1; betaE <- 3
distanceMethod <- "euclidean"; clustMethod <- "hclust"
cutMethod <- "dynamic"; agglomerationMethod <- "average"
# in this simulation it is blocks 3 and 4 that are important
# leaveOut: optional specification of modules that should be left out
# of the simulation, that is, their genes will be simulated as unrelated
# ("grey"). This can be useful when simulating several sets, in some of which
# a module is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
minCor = 0.01,
maxCor = 1,
corPower = 1,
propNegativeCor = 0.3,
backgroundNoise = 0.5,
signed = FALSE,
leaveOut = 1:4)
d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
minCor = 0.4,
maxCor = 1,
corPower = 0.3,
propNegativeCor = 0.3,
backgroundNoise = 0.5,
signed = FALSE)
truemodule1 <- d1$setLabels
X <- rbind(d0$datExpr, d1$datExpr) %>%
magrittr::set_colnames(paste0("Gene", 1:p)) %>%
magrittr::set_rownames(paste0("Subject",1:n))
betaMainEffect <- vector("double", length = p)
# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
nActive/2, betaMean - 0.1, betaMean + 0.1)
# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)
result <- s_generate_data_mars(p = p, X = X,
beta = beta,
binary_outcome = FALSE,
truemodule = truemodule1,
nActive = nActive,
include_interaction = FALSE,
cluster_distance = cluster_distance,
n = n, n0 = n0,
eclust_distance = Ecluster_distance,
signal_to_noise_ratio = SNR,
distance_method = distanceMethod,
cluster_method = clustMethod,
cut_method = cutMethod,
agglomeration_method = agglomerationMethod,
nPC = 1)
mars_res <- s_mars_clust(x_train = result[["X_train"]],
x_test = result[["X_test"]],
y_train = result[["Y_train"]],
y_test = result[["Y_test"]],
s0 = result[["S0"]],
summary = "pc",
exp_family = "gaussian",
gene_groups = result[["clustersAddon"]],
clust_type = "ECLUST")
unlist(mars_res)
## End(Not run)