s_mars_separate {eclust}

R Documentation

Fit Multivariate Adaptive Regression Splines on Simulated Data

Description

This function fits Friedman's MARS models on the untransformed design matrix. It is intended for simulated data where the 'truth' is known, i.e., you know which features are associated with the response. This function was used to produce the simulation results in Bhatnagar et al. 2016. It uses caret functions to tune the degree and the nprune parameters.

Usage

s_mars_separate(x_train, x_test, y_train, y_test, s0, model = c("MARS"),
  exp_family = c("gaussian", "binomial"), topgenes = NULL, stability = F,
  filter = F, include_E = T, include_interaction = F, ...)

Arguments

`x_train`	`ntrain x p` matrix of the simulated training set, where `ntrain` is the number of training observations and `p` is the total number of predictors. This matrix needs to have named columns representing the feature names or the gene names.
`x_test`	`ntest x p` matrix of the simulated test set, where `ntest` is the number of test observations and `p` is the total number of predictors. This matrix needs to have named columns representing the feature names or the gene names.
`y_train`	numeric vector of length `ntrain` representing the responses for the training subjects. If continuous, you must set `exp_family = "gaussian"`. For `exp_family = "binomial"`, `y_train` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class).
`y_test`	numeric vector of length `ntest` representing the responses for the test subjects. If continuous, you must set `exp_family = "gaussian"`. For `exp_family = "binomial"`, `y_test` should be either a factor with two levels, or a two-column matrix of counts or proportions (the second column is treated as the target class; for a factor, the last level in alphabetical order is the target class).
`s0`	character vector of the active feature names, i.e., the features in `x_train` that are truly associated with the response.
`model`	Type of non-linear model to be fit. Currently only Friedman's MARS is supported.
`exp_family`	Response type. See details for `y_train` argument above.
`topgenes`	List of features to keep if `filter = TRUE`. Default is `topgenes = NULL`, which means all features are kept for the analysis.
`stability`	Should stability measures be calculated. Default is `stability = FALSE`. See the Value section.
`filter`	Should the analysis be run on a subset of features. Default is `filter = FALSE`.
`include_E`	Should the environment variable be included in the regression analysis. Default is `include_E = TRUE`.
`include_interaction`	Should interaction effects between the features in `x_train` and the environment variable be fit. Default is `include_interaction = FALSE`.
`...`	Other parameters passed to the `trainControl` function.
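
As an illustration of the input shapes described above, the following is a minimal base R sketch. The dimensions, gene names, coefficients and the "no"/"yes" labels are made-up assumptions for demonstration only; they are not taken from the package or from the simulation study.

```r
# Hypothetical toy inputs; sizes and names are illustrative assumptions
set.seed(123)
ntrain <- 20; ntest <- 10; p <- 6
gene_names <- paste0("Gene", seq_len(p))

# Design matrices need named columns (feature or gene names)
x_train <- matrix(rnorm(ntrain * p), nrow = ntrain,
                  dimnames = list(NULL, gene_names))
x_test <- matrix(rnorm(ntest * p), nrow = ntest,
                 dimnames = list(NULL, gene_names))

# s0: names of the features truly associated with the response
s0 <- c("Gene1", "Gene2")

# Continuous responses, to be used with exp_family = "gaussian"
y_train <- drop(x_train[, s0] %*% c(2, -1)) + rnorm(ntrain)
y_test <- drop(x_test[, s0] %*% c(2, -1)) + rnorm(ntest)

# Binary response as a two-level factor, for exp_family = "binomial";
# the last level ("yes") would be treated as the target class
y_train_bin <- factor(ifelse(y_train > 0, "yes", "no"),
                      levels = c("no", "yes"))

# Basic consistency checks on the inputs
stopifnot(identical(colnames(x_train), colnames(x_test)),
          all(s0 %in% colnames(x_train)),
          length(y_train) == nrow(x_train),
          length(y_test) == nrow(x_test))
```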

Details

This function first runs 10-fold cross-validation to tune the degree (either 1 or 2) using the train function with method = "earth", with nprune fixed at 1000. The earth function is then used, with nk = 1000 and pmethod = "backward", to fit the MARS model using the optimal degree from the 10-fold CV.
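
The two steps described above can be sketched roughly as follows. This is not the package's internal code: it assumes the caret and earth packages are installed, and the toy data, seed and object names (`cv_fit`, `best_degree`, `mars_fit`) are illustrative assumptions.

```r
library(caret)  # train() and trainControl()
library(earth)  # earth()

# Hypothetical toy data with named columns
set.seed(1)
n <- 100; p <- 5
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("Gene", seq_len(p))))
y <- 2 * pmax(x[, "Gene1"], 0) + x[, "Gene2"] + rnorm(n, sd = 0.5)

# Step 1: 10-fold CV over degree 1 and 2, with nprune held at 1000
cv_fit <- train(x = x, y = y, method = "earth",
                tuneGrid = expand.grid(degree = 1:2, nprune = 1000),
                trControl = trainControl(method = "cv", number = 10))
best_degree <- cv_fit$bestTune$degree

# Step 2: fit the MARS model with earth() at the selected degree
mars_fit <- earth(x = x, y = y, degree = best_degree,
                  nk = 1000, pmethod = "backward")
```

Additional arguments supplied through `...` would go to `trainControl` in the first step, according to the Arguments section.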

Value

This function has two different outputs depending on whether stability = TRUE or stability = FALSE.

If stability = TRUE then this function returns a p x 2 data.frame or data.table of regression coefficients without the intercept. The output of this is used for subsequent calculations of stability.

If stability = FALSE then returns a vector with the following elements (See Table 3: Measures of Performance in Bhatnagar et al (2016+) for definitions of each measure of performance):

`mse or AUC`	Test set mean squared error if `exp_family = "gaussian"`, or test set area under the curve if `exp_family = "binomial"`, calculated using the `roc` function
`RMSE`	Square root of the mse. Only applicable if `exp_family = "gaussian"`
`Shat`	Number of non-zero estimated regression coefficients. The non-zero estimated regression coefficients are referred to as being selected by the model
`TPR`	true positive rate
`FPR`	false positive rate
`Correct Sparsity`	Number of true positive plus true negative coefficients, divided by the total number of features
`CorrectZeroMain`	Proportion of main effects correctly estimated as zero
`CorrectZeroInter`	Proportion of interaction effects correctly estimated as zero
`IncorrectZeroMain`	Proportion of main effects incorrectly estimated as zero
`IncorrectZeroInter`	Proportion of interaction effects incorrectly estimated as zero
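
For orientation, several of the measures above can be computed from a set of selected features and the true active set as in the base R sketch below. It uses the conventional definitions of true and false positive rates, which are assumed here rather than copied from Table 3 of Bhatnagar et al.; the feature names, selected set and predictions are made up.

```r
# Hypothetical inputs: candidate features, true active set, and the
# features selected by a fitted model (non-zero coefficients)
features <- paste0("Gene", 1:10)
s0 <- c("Gene1", "Gene2", "Gene3")
selected <- c("Gene1", "Gene2", "Gene7")

tp <- length(intersect(selected, s0))      # active and selected
fp <- length(setdiff(selected, s0))        # inactive but selected
fn <- length(setdiff(s0, selected))        # active but not selected
tn <- length(features) - tp - fp - fn      # inactive and not selected

Shat <- length(selected)                   # number of selected features
TPR <- tp / length(s0)
FPR <- fp / (length(features) - length(s0))
correct_sparsity <- (tp + tn) / length(features)

# Test set error for a continuous response (made-up predictions)
y_test <- c(1.0, 2.0, 3.0, 4.0)
y_hat <- c(1.5, 2.0, 2.5, 4.0)
mse <- mean((y_test - y_hat)^2)
rmse <- sqrt(mse)
```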

Examples

## Not run: 
library(magrittr)

# simulation parameters
rho = 0.90; p = 500; SNR = 1; n = 200; n0 = n1 = 100; nActive = p * 0.10
cluster_distance = "tom"; Ecluster_distance = "difftom"
rhoOther = 0.6; betaMean = 2; alphaMean = 1; betaE = 3
distanceMethod = "euclidean"; clustMethod = "hclust"
cutMethod = "dynamic"; agglomerationMethod = "average"

# in this simulation it is blocks 3 and 4 that are important
# leaveOut: optional specification of modules that should be left out
# of the simulation, that is, their genes will be simulated as unrelated
# ("grey"). This can be useful when simulating several sets, in some of
# which a module is present while in others it is absent.
d0 <- s_modules(n = n0, p = p, rho = 0, exposed = FALSE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.01,
                maxCor = 1,
                corPower = 1,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE,
                leaveOut = 1:4)

d1 <- s_modules(n = n1, p = p, rho = rho, exposed = TRUE,
                modProportions = c(0.15,0.15,0.15,0.15,0.15,0.25),
                minCor = 0.4,
                maxCor = 1,
                corPower = 0.3,
                propNegativeCor = 0.3,
                backgroundNoise = 0.5,
                signed = FALSE)

truemodule1 <- d1$setLabels

X <- rbind(d0$datExpr, d1$datExpr) %>%
  magrittr::set_colnames(paste0("Gene", 1:p)) %>%
  magrittr::set_rownames(paste0("Subject",1:n))

betaMainEffect <- vector("double", length = p)

# the first nActive/2 in the 3rd block are active
betaMainEffect[which(truemodule1 %in% 3)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean - 0.1, betaMean + 0.1)

# the first nActive/2 in the 4th block are active
betaMainEffect[which(truemodule1 %in% 4)[1:(nActive/2)]] <- runif(
  nActive/2, betaMean+2 - 0.1, betaMean+2 + 0.1)
beta <- c(betaMainEffect, betaE)

result <- s_generate_data_mars(p = p, X = X,
                               beta = beta,
                               binary_outcome = FALSE,
                               truemodule = truemodule1,
                               nActive = nActive,
                               include_interaction = FALSE,
                               cluster_distance = cluster_distance,
                               n = n, n0 = n0,
                               eclust_distance = Ecluster_distance,
                               signal_to_noise_ratio = SNR,
                               distance_method = distanceMethod,
                               cluster_method = clustMethod,
                               cut_method = cutMethod,
                               agglomeration_method = agglomerationMethod,
                               nPC = 1)


mars_res <- s_mars_separate(x_train = result[["X_train"]],
                            x_test = result[["X_test"]],
                            y_train = result[["Y_train"]],
                            y_test = result[["Y_test"]],
                            s0 = result[["S0"]],
                            exp_family = "gaussian")
unlist(mars_res)

## End(Not run)

[Package eclust version 0.1.0 Index]