subsemble {subsemble}    R Documentation
An Ensemble Method for Combining Subset-Specific Algorithm Fits
Description
The Subsemble algorithm partitions the full dataset into subsets of observations, fits a specified underlying algorithm on each subset, and uses a unique form of k-fold cross-validation to output a prediction function that combines the subset-specific fits.
Usage
subsemble(x, y, newx = NULL, family = gaussian(),
learner, metalearner = "SL.glm", subsets = 3, subControl = list(),
cvControl = list(), learnControl = list(), genControl = list(),
id = NULL, obsWeights = NULL, seed = 1, parallel = "seq")
Arguments
x: The data.frame or matrix of predictor variables.
y: The outcome in the training data set. Must be a numeric vector.
newx: The predictor variables in the test data set. The structure should match x.
family: A description of the error distribution and link function to be used in the model. This can be a character string naming a family function, a family function, or the result of a call to a family function. (See ?family for details of family functions.) Currently allows gaussian() or binomial().
learner: A string or character vector naming the prediction algorithm(s) used to train a model on each of the subsets of x. This uses the learner wrapper API provided by the SuperLearner package (e.g. "SL.glm" or "SL.randomForest").
metalearner: A string specifying the prediction algorithm used to learn the optimal weighted combination of the sublearners (i.e., models learned on subsets of the data). This uses the API provided by the SuperLearner package, so, for example, we could use "SL.glm" or "SL.glmnet".
subsets: An integer specifying the number of subsets the data should be partitioned into, a vector of subset labels equal to the number of rows of x, or a user-specified list of row-index vectors defining the subsets (see the sketch following this argument list). Defaults to 3.
subControl: A list of parameters to control the data partitioning (subsetting) process. The logical stratifyCV option stratifies the subsets by a binary outcome (binomial family only) and defaults to TRUE.
cvControl: A list of parameters to control the cross-validation process. The V parameter specifies the number of internal cross-validation folds and defaults to 10.
learnControl: A list of parameters to control the learning process. Currently, the only parameter is multiType, which is either "crossprod" (the default, which trains every learner on every subset) or "divisor" (which trains a single learner on each subset).
genControl: A list of general control parameters. Currently, the only parameter is saveFits, which defaults to TRUE; if FALSE, the subset-specific model fits are not saved in the output (to reduce memory use).
id: Optional cluster identification variable. Passed to the learner algorithms.
obsWeights: Optional observation weights vector. As with id above, this is passed to the learner algorithms.
seed: A random seed to be set (integer); defaults to 1. If NULL, no seed is set.
parallel: A character string specifying optional parallelization. Use "seq" (the default) for sequential computation, or "multicore" to perform the cross-validation and learning steps in parallel over all available cores.
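As a minimal sketch of how the subsetting and learning controls fit together (assuming, as described above, that subsets may also be given as a user-specified list of row-index vectors; x and y are the training objects constructed in the Examples section below):

## A hand-rolled partition of the 400 training rows into two subsets, combined
## with multiType = "divisor" so that each subset is trained with a single learner.
my_subsets <- list(1:200, 201:400)  # assumption: a list of row-index vectors is accepted
fit2 <- subsemble(x = x, y = y, family = binomial(),
                  learner = c("SL.glm", "SL.randomForest"),
                  metalearner = "SL.glm",
                  subsets = my_subsets,
                  learnControl = list(multiType = "divisor"))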
Value
subfits: A list of predictive models, each of which is fit on a subset of the (rows of) data, x.
metafit: The predictive model learned by regressing y on the Z matrix of cross-validated sublearner predictions (see Z below); the type of model is specified by the metalearner argument.
subpred: A data.frame with the predicted values from each sublearner algorithm for the rows in newx.
pred: A vector containing the predicted values from the subsemble for the rows in newx.
Z: The Z matrix (the cross-validated predicted values for each sublearner); see the inspection sketch following this list.
cvRisk: A numeric vector with the k-fold cross-validated risk estimate for each algorithm in the learner library. Note that this does not contain the CV risk estimate for the Subsemble, only for the individual models in the library. (Not enabled yet; set to NULL.)
family: Returns the family argument from above.
subControl: Returns the subControl argument from above.
cvControl: Returns the cvControl argument from above.
learnControl: Returns the learnControl argument from above.
subsets: The list of subsets, which is a list of vectors of row indices. The length of this list equals the number of subsets.
subCVsets: The list of subsets, further broken down into the cross-validation folds that were used. Each subset (top-level list element) is partitioned into V cross-validation folds.
ylim: Returns the range of y.
seed: An integer. Returns the seed argument from above.
runtime: A list of runtimes for the various steps of the algorithm, including the cross-validation (construction of the Z matrix), sublearning and metalearning steps, as well as the total runtime.
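For instance, given a fitted object fit as produced in the Examples below, the returned components can be inspected directly (illustrative only; the component names are those listed above):

dim(fit$Z)               # cross-validated predicted values, one column per sublearner
length(fit$subsets)      # number of subsets the rows were partitioned into
str(fit$subCVsets[[1]])  # the cross-validation folds within the first subset
fit$runtime              # timing breakdown for the cross-validation and learning steps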
Author(s)
Erin LeDell oss@ledell.org
References
LeDell, E. (2015) Scalable Ensemble Learning and Computationally Efficient Variance Estimation (Doctoral Dissertation). University of California, Berkeley, USA.
https://github.com/ledell/phd-thesis/blob/main/ledell-phd-thesis.pdf
Stephanie Sapp, Mark J. van der Laan & John Canny. (2014) Subsemble: An ensemble method for combining subset-specific algorithm fits. Journal of Applied Statistics, 41(6):1247-1259
https://www.tandfonline.com/doi/abs/10.1080/02664763.2013.864263
https://biostats.bepress.com/ucbbiostat/paper313/
See Also
predict.subsemble
Examples
# Load some example data.
library(subsemble)
library(cvAUC) # >= version 1.0.1
data(admissions)
# Training data.
x <- subset(admissions, select = -c(Y))[1:400,]
y <- admissions$Y[1:400]
# Test data.
newx <- subset(admissions, select = -c(Y))[401:500,]
newy <- admissions$Y[401:500]
# Set up the Subsemble.
learner <- c("SL.randomForest", "SL.glm")
metalearner <- "SL.glm"
subsets <- 2
# Train and test the model.
# With learnControl$multiType="crossprod" (the default),
# we ensemble 4 models (2 subsets x 2 learners).
fit <- subsemble(x = x, y = y, newx = newx, family = binomial(),
learner = learner, metalearner = metalearner,
subsets = subsets)
# Evaluate the model by calculating AUC on the test set.
auc <- AUC(predictions = fit$pred, labels = newy)
print(auc) # Test set AUC is: 0.937
# We can also use the predict method to generate predictions on new data afterwards.
pred <- predict(fit, newx)
auc <- AUC(predictions = pred$pred, labels = newy)
print(auc) # Test set AUC is: 0.937
# Modify the learnControl argument and train/eval a new Subsemble.
# With learnControl$multiType="divisor",
# we ensemble only 2 models (one for each subset).
fit <- subsemble(x = x, y = y, newx = newx, family = binomial(),
learner = learner, metalearner = metalearner,
subsets = subsets,
learnControl = list(multiType = "divisor"))
auc <- AUC(predictions = fit$pred, labels = newy)
print(auc) # Test set AUC is: 0.922
# An example using a single learner.
# In this case there are 3 subsets and 1 learner,
# for a total of 3 models in the ensemble.
learner <- c("SL.randomForest")
metalearner <- "SL.glmnet"
subsets <- 3
fit <- subsemble(x = x, y = y, newx = newx, family = binomial(),
learner = learner, metalearner = metalearner,
subsets = subsets)
auc <- AUC(predictions = fit$pred, labels = newy)
print(auc) # Test set AUC is: 0.925
# An example using the full data (i.e. subsets = 1).
# Here, we have an ensemble of 2 models (one for each of the 2 learners).
# This is equivalent to the Super Learner algorithm.
learner <- c("SL.randomForest", "SL.glm")
metalearner <- "SL.glm"
subsets <- 1
fit <- subsemble(x = x, y = y, newx = newx, family = binomial(),
learner = learner, metalearner = metalearner,
subsets = subsets)
auc <- AUC(predictions = fit$pred, labels = newy)
print(auc) # Test set AUC is: 0.935
# Multicore subsemble via the "parallel" package.
# To perform the cross-validation and training steps using all available cores,
# use the parallel = "multicore" option.
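# A sketch of such a call (not run here), reusing the objects defined above:
# fit <- subsemble(x = x, y = y, newx = newx, family = binomial(),
#                  learner = learner, metalearner = metalearner,
#                  subsets = subsets, parallel = "multicore")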
# More examples and information at: https://github.com/ledell/subsemble