Rase {RaSEn} R Documentation
Construct the random subspace ensemble classifier.
Description
RaSE is a general ensemble classification framework for solving sparse classification problems. In the RaSE algorithm, for each of the B1 weak learners, B2 random subspaces are generated and the optimal one is chosen to train the model on the basis of some criterion.
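At a high level, the procedure can be sketched as follows (an illustrative R sketch, not the package implementation; criterion_fn and fit_fn are hypothetical placeholders for the chosen criterion and base classifier):

rase_sketch <- function(xtrain, ytrain, B1, B2, D, criterion_fn, fit_fn) {
  lapply(seq_len(B1), function(b) {
    # draw B2 random subspaces: first a size from 1..D, then that many features
    candidates <- lapply(seq_len(B2), function(j) {
      sample(ncol(xtrain), size = sample(D, 1))
    })
    # score every candidate subspace and keep the one optimizing the criterion
    scores <- vapply(candidates, function(S) {
      criterion_fn(xtrain[, S, drop = FALSE], ytrain)
    }, numeric(1))
    best <- candidates[[which.min(scores)]]
    # train the weak learner on the selected subspace
    list(subspace = best, fit = fit_fn(xtrain[, best, drop = FALSE], ytrain))
  })
}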
Usage
Rase(
xtrain,
ytrain,
xval = NULL,
yval = NULL,
B1 = 200,
B2 = 500,
D = NULL,
dist = NULL,
base = NULL,
super = list(type = c("separate"), base.update = TRUE),
criterion = NULL,
ranking = TRUE,
k = c(3, 5, 7, 9, 11),
cores = 1,
seed = NULL,
iteration = 0,
cutoff = TRUE,
cv = 5,
scale = FALSE,
C0 = 0.1,
kl.k = NULL,
lower.limits = NULL,
upper.limits = NULL,
weights = NULL,
...
)
Arguments
xtrain: n * p observation matrix; n observations, p features.
ytrain: vector of n 0/1 observations (class labels).
xval: observation matrix for validation. Default = NULL.
yval: 0/1 observations for validation. Default = NULL.
B1: the number of weak learners. Default = 200.
B2: the number of subspace candidates generated for each weak learner. Default = 500.
D: the maximal subspace size when generating random subspaces. Default = NULL.
dist: the distribution for features when generating random subspaces. Default = NULL.
base:
the type of base classifier. Default = 'lda'. Can be a single string chosen from the available options, a string vector, or a named probability vector. When it indicates a single type of base classifier, the classical RaSE model (Tian, Y. and Feng, Y., 2021(b)) is fitted. When it is a string vector that includes multiple base classifier types, a super RaSE model (Zhu, J. and Feng, Y., 2021) is fitted by sampling base classifiers with equal probability. It can also be a probability vector with names corresponding to the classifier types, in which case a super RaSE model is trained by sampling base classifiers with the given probabilities. See the sketch after this argument list.
super: a list of control parameters for super RaSE (Zhu, J. and Feng, Y., 2021). Not used when base is a single string. Should be a list with components type and base.update (see Usage for the defaults); base.update indicates whether the sampling distribution over base classifier types is updated during the iterations.
criterion: the criterion to choose the best subspace for each weak learner. For the classical RaSE (when base is a single string), the criterion is evaluated on each of the B2 candidate subspaces and the subspace optimizing it is kept; the examples below use 'ric', 'bic', 'loo', 'training' and 'cv'. Default = NULL.
ranking: whether the function outputs the selected percentage of each feature among the B1 subspaces. Logical, default = TRUE.
k: the number of nearest neighbors considered when base = 'knn'. Default = c(3, 5, 7, 9, 11).
cores: the number of cores used for parallel computing. Default = 1.
seed: the random seed assigned at the start of the algorithm, which can be a real number or NULL. Default = NULL.
iteration: the number of iterations. Default = 0.
cutoff: whether to use the empirically optimal threshold. Logical, default = TRUE. If FALSE, the threshold is set to 0.5.
cv: the number of cross-validation folds. Default = 5. Only useful when criterion = 'cv'.
scale: whether to normalize the data. Logical, default = FALSE.
C0: a positive constant used in the iteration step (when iteration > 0); see Tian, Y. and Feng, Y., 2021(b) for details. Default = 0.1.
kl.k: the number of nearest neighbors used to estimate RIC in a non-parametric way. Default = NULL.
lower.limits: the vector of lower limits for each coefficient in logistic regression (only used when base = 'logistic'). Should be a vector of length equal to the number of variables (the column number of xtrain). Default = NULL.
upper.limits: the vector of upper limits for each coefficient in logistic regression (only used when base = 'logistic'). Should be a vector of length equal to the number of variables (the column number of xtrain). Default = NULL.
weights: observation weights. Should be a vector of length equal to the training sample size (the length of ytrain). Default = NULL.
...: additional arguments.
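As a quick illustration of how base and super can be specified (a sketch; the classifier names follow the examples below):

# Classical RaSE: a single base classifier type.
base_single <- "lda"
# Super RaSE: several types sampled with equal probability.
base_multi <- c("knn", "lda", "logistic")
# Super RaSE: types sampled with the given probabilities (names identify the type).
base_prob <- c(randomforest = 0.2, lda = 0.5, svm = 0.3)
# Control list for super RaSE, matching the Usage defaults.
super_ctrl <- list(type = "separate", base.update = TRUE)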
Value
An object with S3 class 'RaSE' if base indicates a single base classifier, with the following components:
marginal: the marginal probability for each class.
base: the type of base classifier.
criterion: the criterion to choose the best subspace for each weak learner.
B1: the number of weak learners.
B2: the number of subspace candidates generated for each weak learner.
D: the maximal subspace size when generating random subspaces.
iteration: the number of iterations.
fit.list: sequence of B1 fitted base classifiers.
cutoff: the empirically optimal threshold.
subspace: sequence of subspaces corresponding to the B1 weak learners.
ranking: the selected percentage of each feature among the B1 subspaces.
scale: a list of scaling parameters, including the scaling center and the scale parameter for each feature. Equals NULL when scale = FALSE.
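For instance, a fitted 'RaSE' object can be inspected as follows (a sketch reusing xtrain and ytrain from the Examples section; component names as listed above):

fit <- Rase(xtrain, ytrain, B1 = 50, B2 = 50, base = "lda", criterion = "ric")
head(sort(fit$ranking, decreasing = TRUE))  # most frequently selected features
fit$cutoff                                  # empirically optimal threshold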
An object with S3 class 'super_RaSE' if base includes multiple base classifiers or the sampling probabilities of multiple classifiers, with the following components:
marginal: the marginal probability for each class.
base: the list of B1 base classifier types.
criterion: the criterion to choose the best subspace for each weak learner.
B1: the number of weak learners.
B2: the number of subspace candidates generated for each weak learner.
D: the maximal subspace size when generating random subspaces.
iteration: the number of iterations.
fit.list: sequence of B1 fitted base classifiers.
cutoff: the empirically optimal threshold.
subspace: sequence of subspaces corresponding to the B1 weak learners.
ranking.feature: the selected percentage of each feature corresponding to each type of classifier.
ranking.base: the selected percentage of each classifier type among the selected B1 learners.
scale: a list of scaling parameters, including the scaling center and the scale parameter for each feature. Equals NULL when scale = FALSE.
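Similarly, for a 'super_RaSE' object (a sketch under the same assumptions):

fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 100,
    base = c("knn", "lda", "logistic"), criterion = "cv", cv = 5)
fit$ranking.base     # selected percentage of each classifier type
fit$ranking.feature  # feature selection percentages per classifier type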
Author(s)
Ye Tian (maintainer, ye.t@columbia.edu) and Yang Feng. The authors thank Yu Cao (Exeter Finance) and his team for many helpful suggestions and discussions.
References
Tian, Y. and Feng, Y., 2021(a). RaSE: A variable screening framework via random subspace ensembles. Journal of the American Statistical Association, (just-accepted), pp.1-30.
Tian, Y. and Feng, Y., 2021(b). RaSE: Random subspace ensemble classification. Journal of Machine Learning Research, 22(45), pp.1-93.
Zhu, J. and Feng, Y., 2021. Super RaSE: Super Random Subspace Ensemble Classification. https://www.preprints.org/manuscript/202110.0042
Chen, J. and Chen, Z., 2008. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), pp.759-771.
Chen, J. and Chen, Z., 2012. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pp.555-574.
Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, 1973 (pp. 267-281). Akademiai Kiado.
Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics, 6(2), pp.461-464.
See Also
predict.RaSE, RaModel, print.RaSE, print.super_RaSE, RaPlot, RaScreen.
Examples
set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("classification", 1, n = 100, p = 50)
test.data <- RaModel("classification", 1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y
xtest <- test.data$x
ytest <- test.data$y
# test RaSE classifier with LDA base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'lda',
cores = 2, criterion = 'ric')
mean(predict(fit, xtest) != ytest)
## Not run:
# test RaSE classifier with LDA base classifier and 1 iteration round
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 1, base = 'lda',
cores = 2, criterion = 'ric')
mean(predict(fit, xtest) != ytest)
# test RaSE classifier with QDA base classifier and 1 iteration round
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 1, base = 'qda',
cores = 2, criterion = 'ric')
mean(predict(fit, xtest) != ytest)
# test RaSE classifier with kNN base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'knn',
cores = 2, criterion = 'loo')
mean(predict(fit, xtest) != ytest)
# test RaSE classifier with logistic regression base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'logistic',
cores = 2, criterion = 'bic')
mean(predict(fit, xtest) != ytest)
# test RaSE classifier with SVM base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'svm',
cores = 2, criterion = 'training')
mean(predict(fit, xtest) != ytest)
# test RaSE classifier with random forest base classifier
fit <- Rase(xtrain, ytrain, B1 = 20, B2 = 10, iteration = 0, base = 'randomforest',
cores = 2, criterion = 'cv', cv = 3)
mean(predict(fit, xtest) != ytest)
# fit a super RaSE classifier by sampling the base learner from kNN, LDA and
# logistic regression with equal probability
fit <- Rase(xtrain = xtrain, ytrain = ytrain, B1 = 100, B2 = 100,
    base = c("knn", "lda", "logistic"), super = list(type = "separate", base.update = TRUE),
    criterion = "cv", cv = 5, iteration = 1, cores = 2)
mean(predict(fit, xtest) != ytest)
# fit a super RaSE classifier by sampling the base learner from random forest,
# LDA and SVM with probabilities 0.2, 0.5 and 0.3
fit <- Rase(xtrain = xtrain, ytrain = ytrain, B1 = 100, B2 = 100,
    base = c(randomforest = 0.2, lda = 0.5, svm = 0.3),
    super = list(type = "separate", base.update = FALSE),
    criterion = "cv", cv = 5, iteration = 0, cores = 2)
mean(predict(fit, xtest) != ytest)
## End(Not run)