R: Fit Penalized Regression Models on Simulated Cluster...

s_pen_clust {eclust}

R Documentation

Fit Penalized Regression Models on Simulated Cluster Summaries

Description

This function creates summaries of the given clusters (e.g. 1st PC or average), and then fits a penalized regression model on those summaries. To be used with simulated data where the 'truth' is known i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016. Can run lasso, elasticnet, SCAD or MCP models

Usage

s_pen_clust(x_train, x_test, y_train, y_test, s0, gene_groups,
  summary = c("pc", "avg"), model = c("lasso", "elasticnet", "scad", "mcp"),
  exp_family = c("gaussian", "binomial"), filter = F, topgenes = NULL,
  stability = F, include_E = T, include_interaction = F,
  clust_type = c("CLUST", "ECLUST"), number_pc = 1)

Arguments

`x_train`	`ntrain x p` matrix of simulated training set where `ntrain` is the number of training observations and `p` is total number of predictors. This matrix needs to have named columns representing the feature names or the gene names
`x_test`	`ntest x p` matrix of simulated training set where `ntest` is the number of training observations and `p` is total number of predictors. This matrix needs to have named columns representing the feature names or the gene names
`y_train`	numeric vector of length `ntrain` representing the responses for the training subjects. If continuous then you must set `exp_family = "gaussion"`. For `exp_family="binomial"` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class)
`y_test`	numeric vector of length `ntest` representing the responses for the test subjects. If continuous then you must set `exp_family = "gaussion"`. For `exp_family="binomial"` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class).
`s0`	chracter vector of the active feature names, i.e., the features in `x_train` that are truly associated with the response.
`gene_groups`	data.frame that contains the group membership for each feature. The first column is called 'gene' and the second column should be called 'cluster'. The 'gene' column identifies the features and must be the same identifiers in the `x_train,x_test` matrices. The 'cluster' column is a numeric integer indicating the cluster group membership. A cluster group membership of 0 implies the feature did not cluster into any group.
`summary`	the summary of each cluster. Can be the principal component or average. Default is `summary = "pc"` which takes the first `number_pc` principal components. Currently a maximum of 2 principal components can be chosen.
`model`	Regression model to be fit on cluster summaries. Default is `model="lasso"` which corresponds to glmnet mixing parameter `alpha=1`. `model="elasticnet"` corresponds to glmnet mixing parameter `alpha=0.5`, `model="mcp"` and `model="scad"` are the non-convex models from the `ncvreg` package
`exp_family`	Response type. See details for `y_train` argument above.
`filter`	Should analysis be run on a subset of features. Default is `filter = FALSE`
`topgenes`	List of features to keep if `filter=TRUE`. Default is `topgenes = NULL` which means all features are kept for the analysis
`stability`	Should stability measures be calculated. Default is `stability=FALSE`. See details
`include_E`	Should the environment variable be included in the regression analysis. Default is `include_E = TRUE`
`include_interaction`	Should interaction effects between the features in `x_train` and the environment variable be fit. Default is `include_interaction=TRUE`
`clust_type`	Method used to cluster the features. This is used for naming the output only and has no consequence for the results. `clust_type = "CLUST"` is the default which means that the environment varible was not used in the clustering step. `clust_type = "ECLUST"` means that the environment variable was used in the clustering aspect.
`number_pc`	Number of principal components if `summary = "pc"`. Default is `number_pc = 1`. Can be either 1 or 2.

Details

The stability of feature importance is defined as the variability of feature weights under perturbations of the training set, i.e., small modifications in the training set should not lead to considerable changes in the set of important covariates (Toloşi, L., & Lengauer, T. (2011)). A feature selection algorithm produces a weight, a ranking, and a subset of features. In the CLUST and ECLUST methods, we defined a predictor to be non-zero if its corresponding cluster representative weight was non-zero. Using 10-fold cross validation (CV), we evaluated the similarity between two features and their rankings using Pearson and Spearman correlation, respectively. For each CV fold we re-ran the models and took the average Pearson/Spearman correlation of the 10 choose 2 combinations of estimated coefficients vectors. To measure the similarity between two subsets of features we took the average of the Jaccard distance in each fold. A Jaccard distance of 1 indicates perfect agreement between two sets while no agreement will result in a distance of 0.

Value

This function has two different outputs depending on whether stability = TRUE or stability = FALSE

If stability = TRUE then this function returns a p x 2 data.frame or data.table of regression coefficients without the intercept. The output of this is used for subsequent calculations of stability.

If stability = FALSE then returns a vector with the following elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+) for definitions of each measure of performance):

`mse or AUC`	Test set mean squared error if `exp_family = "gaussion"` or test set Area under the curve if `exp_family = "binomial"` calculated using the `roc` function
`RMSE`	Square root of the mse. Only applicable if `exp_family = "gaussion"`
`Shat`	Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model
`TPR`	true positive rate
`FPR`	false positive rate
`Correct Sparsity`	Correct true positives + correct true negative coefficients divided by the total number of features
`CorrectZeroMain`	Proportion of correct true negative main effects
`CorrectZeroInter`	Proportion of correct true negative interactions
`IncorrectZeroMain`	Proportion of incorrect true negative main effects
`IncorrectZeroInter`	Proportion of incorrect true negative interaction effects
`nclusters`	number of estimated clusters by the `cutreeDynamic` function

Note

number_pc=2 will not work if there is only one feature in an estimated cluster

References

Toloşi, L., & Lengauer, T. (2011). Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14), 1986-1994.

Bhatnagar, SR., Yang, Y., Blanchette, M., Bouchard, L., Khundrakpam, B., Evans, A., Greenwood, CMT. (2016+). An analytic approach for interpretable predictive models in high dimensional data, in the presence of interactions with exposures Preprint

Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 24(5), 719-720.

Friedman, J., Hastie, T. and Tibshirani, R. (2008) Regularization Paths for Generalized Linear Models via Coordinate Descent, http://www.stanford.edu/~hastie/Papers/glmnet.pdf

Breheny, P. and Huang, J. (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Statist., 5: 232-253.

Examples

library(magrittr)

# simulation parameters
rho = 0.90; p = 500 ;SNR = 1 ; n = 200; n0 = n1 = 100 ; nActive = p*0.10 ; cluster_distance = "tom";
Ecluster_distance = "difftom"; rhoOther = 0.6; betaMean = 2;
alphaMean = 1; betaE = 3; distanceMethod = "euclidean"; clustMethod = "hclust";
cutMethod = "dynamic"; agglomerationMethod = "average"

#in this simulation its blocks 3 and 4 that are important
#leaveOut:  optional specification of modules that should be left out
#of the simulation, that is their genes will be simulated as unrelated
#("grey"). This can be useful when simulating several sets, in some which a module
#is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)
betaMainInteractions <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
betaMainInteractions[which(betaMainEffect!=0)] <- runif(nActive, alphaMean - 0.1, alphaMean + 0.1)
beta <- c(betaMainEffect, betaE, betaMainInteractions)

result <- s_generate_data(p = p, X = X,
                          beta = beta,
                          include_interaction = TRUE,
                          cluster_distance = cluster_distance,
                          n = n, n0 = n0,
                          eclust_distance = Ecluster_distance,
                          signal_to_noise_ratio = SNR,
                          distance_method = distanceMethod,
                          cluster_method = clustMethod,
                          cut_method = cutMethod,
                          agglomeration_method = agglomerationMethod,
                          nPC = 1)

pen_res <- s_pen_clust(x_train = result[["X_train"]],
                       x_test = result[["X_test"]],
                       y_train = result[["Y_train"]],
                       y_test = result[["Y_test"]],
                       s0 = result[["S0"]],
                       gene_groups = result[["clustersAddon"]],
                       summary = "pc",
                       model = "lasso",
                       exp_family = "gaussian",
                       clust_type = "ECLUST",
                       include_interaction = TRUE)
unlist(pen_res)

[Package eclust version 0.1.0 Index]