s_pen_clust {eclust}  R Documentation 
This function creates summaries of the given clusters (e.g. 1st PC or average), and then fits a penalized regression model on those summaries. To be used with simulated data where the 'truth' is known i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016. Can run lasso, elasticnet, SCAD or MCP models
s_pen_clust(x_train, x_test, y_train, y_test, s0, gene_groups,
summary = c("pc", "avg"), model = c("lasso", "elasticnet", "scad", "mcp"),
exp_family = c("gaussian", "binomial"), filter = F, topgenes = NULL,
stability = F, include_E = T, include_interaction = F,
clust_type = c("CLUST", "ECLUST"), number_pc = 1)
x_train 

x_test 

y_train 
numeric vector of length 
y_test 
numeric vector of length 
s0 
chracter vector of the active feature names, i.e., the features in

gene_groups 
data.frame that contains the group membership for each
feature. The first column is called 'gene' and the second column should be
called 'cluster'. The 'gene' column identifies the features and must be the
same identifiers in the 
summary 
the summary of each cluster. Can be the principal component or
average. Default is 
model 
Regression model to be fit on cluster summaries. Default is

exp_family 
Response type. See details for 
filter 
Should analysis be run on a subset of features. Default is

topgenes 
List of features to keep if 
stability 
Should stability measures be calculated. Default is

include_E 
Should the environment variable be included in the
regression analysis. Default is 
include_interaction 
Should interaction effects between the features in

clust_type 
Method used to cluster the features. This is used for
naming the output only and has no consequence for the results.

number_pc 
Number of principal components if 
The stability of feature importance is defined as the variability of feature weights under perturbations of the training set, i.e., small modifications in the training set should not lead to considerable changes in the set of important covariates (Toloşi, L., & Lengauer, T. (2011)). A feature selection algorithm produces a weight, a ranking, and a subset of features. In the CLUST and ECLUST methods, we defined a predictor to be nonzero if its corresponding cluster representative weight was nonzero. Using 10fold cross validation (CV), we evaluated the similarity between two features and their rankings using Pearson and Spearman correlation, respectively. For each CV fold we reran the models and took the average Pearson/Spearman correlation of the 10 choose 2 combinations of estimated coefficients vectors. To measure the similarity between two subsets of features we took the average of the Jaccard distance in each fold. A Jaccard distance of 1 indicates perfect agreement between two sets while no agreement will result in a distance of 0.
This function has two different outputs depending on whether
stability = TRUE
or stability = FALSE
If stability = TRUE
then this function returns a p x 2
data.frame or data.table of regression coefficients without the intercept.
The output of this is used for subsequent calculations of stability.
If stability = FALSE
then returns a vector with the following
elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+)
for definitions of each measure of performance):
mse or AUC 
Test set
mean squared error if 
RMSE 
Square root of the mse. Only
applicable if 
Shat 
Number of nonzero estimated regression coefficients. The nonzero estimated regression coefficients are referred to as being selected by the model 
TPR 
true positive rate 
FPR 
false positive rate 
Correct Sparsity 
Correct true positives + correct true negative coefficients divided by the total number of features 
CorrectZeroMain 
Proportion of correct true negative main effects 
CorrectZeroInter 
Proportion of correct true negative interactions 
IncorrectZeroMain 
Proportion of incorrect true negative main effects 
IncorrectZeroInter 
Proportion of incorrect true negative interaction effects 
nclusters 
number of estimated clusters by the

number_pc=2
will not work if there is only one feature in an
estimated cluster
Toloşi, L., & Lengauer, T. (2011). Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14), 19861994.
Bhatnagar, SR., Yang, Y., Blanchette, M., Bouchard, L., Khundrakpam, B., Evans, A., Greenwood, CMT. (2016+). An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures Preprint
Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 24(5), 719720.
Friedman, J., Hastie, T. and Tibshirani, R. (2008) Regularization Paths for Generalized Linear Models via Coordinate Descent, http://www.stanford.edu/~hastie/Papers/glmnet.pdf
Breheny, P. and Huang, J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Statist., 5: 232253.
library(magrittr)
# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"
#in this simulation its blocks 3 and 4 that are important
#leaveOut: optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 < s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
minCor = 0.01,
maxCor = 1,
corPower = 1,
propNegativeCor = 0.3,
backgroundNoise = 0.5,
signed = FALSE,
leaveOut = 1:4)
d1 < s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
minCor = 0.4,
maxCor = 1,
corPower = 0.3,
propNegativeCor = 0.3,
backgroundNoise = 0.5,
signed = FALSE)
truemodule1 < d1$setLabels
X < rbind(d0$datExpr, d1$datExpr) %>%
magrittr::set_colnames(paste0("Gene", 1:p)) %>%
magrittr::set_rownames(paste0("Subject",1:n))
betaMainEffect < vector("double", length = p)
betaMainInteractions < vector("double", length = p)
# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] < runif(
nActive/2, betaMean  0.1, betaMean + 0.1)
# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] < runif(
nActive/2, betaMean+2  0.1, betaMean+2 + 0.1)
betaMainInteractions[which(betaMainEffect!=0)] < runif(nActive, alphaMean  0.1, alphaMean + 0.1)
beta < c(betaMainEffect, betaE, betaMainInteractions)
result < s_generate_data(p = p, X = X,
beta = beta,
include_interaction = TRUE,
cluster_distance = cluster_distance,
n = n, n0 = n0,
eclust_distance = Ecluster_distance,
signal_to_noise_ratio = SNR,
distance_method = distanceMethod,
cluster_method = clustMethod,
cut_method = cutMethod,
agglomeration_method = agglomerationMethod,
nPC = 1)
pen_res < s_pen_clust(x_train = result[["X_train"]],
x_test = result[["X_test"]],
y_train = result[["Y_train"]],
y_test = result[["Y_test"]],
s0 = result[["S0"]],
gene_groups = result[["clustersAddon"]],
summary = "pc",
model = "lasso",
exp_family = "gaussian",
clust_type = "ECLUST",
include_interaction = TRUE)
unlist(pen_res)