Rase {RaSEn}    R Documentation

Construct the random subspace ensemble classifier.

Description

RaSE is a general ensemble classification framework for sparse classification problems. In the RaSE algorithm, for each of the B1 weak learners, B2 random subspaces are generated and the optimal one is chosen to train that learner according to a given criterion.
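In outline, the procedure looks as follows (a minimal sketch, not the package's implementation; fit_base, crit, and draw_subspace are hypothetical placeholders for the chosen base classifier, the selection criterion, and the subspace distribution):

# Conceptual sketch of the RaSE loop. fit_base(), crit(), and
# draw_subspace() are hypothetical placeholders for the base classifier,
# the selection criterion, and the subspace distribution.
rase_sketch <- function(x, y, B1, B2, D, fit_base, crit, draw_subspace) {
  lapply(seq_len(B1), function(j) {
    # generate B2 candidate subspaces for the j-th weak learner
    candidates <- replicate(B2, draw_subspace(ncol(x), D), simplify = FALSE)
    fits <- lapply(candidates, function(S) fit_base(x[, S, drop = FALSE], y))
    fits[[which.min(vapply(fits, crit, numeric(1)))]]  # keep the best of B2
  })
}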

Usage

Rase(
  xtrain,
  ytrain,
  xval = NULL,
  yval = NULL,
  B1 = 200,
  B2 = 500,
  D = NULL,
  dist = NULL,
  base = NULL,
  super = list(type = c("separate"), base.update = TRUE),
  criterion = NULL,
  ranking = TRUE,
  k = c(3, 5, 7, 9, 11),
  cores = 1,
  seed = NULL,
  iteration = 0,
  cutoff = TRUE,
  cv = 5,
  scale = FALSE,
  C0 = 0.1,
  kl.k = NULL,
  lower.limits = NULL,
  upper.limits = NULL,
  weights = NULL,
  ...
)

Arguments

xtrain

n * p observation matrix. n observations, p features.

ytrain

n 0/1 observations.

xval

observation matrix for validation. Default = NULL. Useful only when criterion = 'validation'.

yval

0/1 observations for validation. Default = NULL. Useful only when criterion = 'validation'.

B1

the number of weak learners. Default = 200.

B2

the number of subspace candidates generated for each weak learner. Default = 500.

D

the maximal subspace size when generating random subspaces. Default = NULL, which is min(sqrt(n0), sqrt(n1), p) when base = 'qda' and min(sqrt(n), p) otherwise. For classical RaSE with a single classifier type, D is a positive integer. For super RaSE with multiple classifier types, D is a vector of the D values used for each base classifier type (the corresponding classifier types should be noted in the names of the vector).

dist

the distribution for features when generating random subspaces. Default = NULL, which represents the uniform distribution. First generate an integer d from 1,...,D uniformly, then uniformly generate a subset with cardinality d.
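For instance, the default draw can be sketched as follows (draw_subspace is a hypothetical helper for illustration, not an exported function):

# Default subspace draw under dist = NULL (a sketch).
draw_subspace <- function(p, D) {
  d <- sample.int(D, 1)   # subspace size d, uniform on 1, ..., D
  sample.int(p, d)        # then a uniform subset of cardinality d
}
draw_subspace(p = 50, D = 7)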

base

the type of base classifier. Default = 'lda'. Can be either a single string chosen from the options below or a string/probability vector. When it indicates a single type of base classifier, the classical RaSE model (Tian, Y. and Feng, Y., 2021(b)) will be fitted. When it is a string vector including multiple base classifier types, a super RaSE model (Zhu, J. and Feng, Y., 2021) will be fitted by sampling base classifiers with equal probability. It can also be a probability vector with names corresponding to the specific classifier types, in which case a super RaSE model will be trained by sampling base classifiers with the given probabilities (a sketch of this sampling step follows the list below).

  • lda: linear discriminant analysis. lda in MASS package.

  • qda: quadratic discriminant analysis. qda in MASS package.

  • knn: k-nearest neighbor. knn, knn.cv in class package and knn3 in caret package.

  • logistic: logistic regression. glm in stats package and glmnet in glmnet package.

  • tree: decision tree. rpart in rpart package.

  • svm: support vector machine. svm in e1071 package.

  • randomforest: random forest. randomForest in randomForest package.

  • gamma: Bayesian classifier for multivariate gamma distribution with independent marginals.
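The sampling step for super RaSE referenced above can be sketched as follows (an illustration only, not the internal code):

# Drawing one base classifier type per weak learner in super RaSE (a sketch).
prob <- c(randomforest = 0.2, lda = 0.5, svm = 0.3)
base.types <- sample(names(prob), size = 10, replace = TRUE, prob = prob)
base.types  # one type per weak learner (here, B1 = 10)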

super

a list of control parameters for super RaSE (Zhu, J. and Feng, Y., 2021). Not used when base is a single string. Should be a list object with the following components:

  • type: the type of super RaSE. Currently the only option is 'separate', meaning that subspace distributions are different for each type of base classifier.

  • base.update: indicates whether the sampling probability of base classifiers should be updated during iterations. Logical, default = TRUE.

criterion

the criterion to choose the best subspace for each weak learner. For the classical RaSE (when base includes a single classifier type), default = 'ric' when base = 'lda', 'qda' or 'gamma'; default = 'ebic' with gam = 0 when base = 'logistic'; default = 'loo' when base = 'knn'; default = 'training' when base = 'tree', 'svm' or 'randomforest'. For the super RaSE (when base indicates multiple classifiers or the sampling probabilities of multiple classifiers), default = 'cv' with the number of folds cv = 5, and it can only be 'cv', 'training' or 'auc'.

  • ric: minimizing ratio information criterion with parametric estimation (Tian, Y. and Feng, Y., 2021(b)). Available when base = 'lda', 'qda', 'gamma' or 'logistic'.

  • nric: minimizing ratio information criterion with non-parametric estimation (Tian, Y. and Feng, Y., 2021(b)). Available when base = 'lda', 'qda', 'gamma' or 'logistic'.

  • training: minimizing training error. Not available when base = 'knn'.

  • loo: minimizing leave-one-out error. Only available when base = 'knn'.

  • validation: minimizing validation error based on the validation data. Available for all base classifiers.

  • auc: minimizing negative area under the ROC curve (AUC). Currently it is estimated on training data via function auc from package ModelMetrics. It is available for all classifier choices.

  • cv: minimizing k-fold cross-validation error, where k equals the value of cv. Default = 5. Not available when base = 'gamma'.

  • aic: minimizing Akaike information criterion (Akaike, H., 1973). Available when base = 'lda' or 'logistic'.

    AIC = -2 * log-likelihood + |S| * 2.

  • bic: minimizing Bayesian information criterion (Schwarz, G., 1978). Available when base = 'lda' or 'logistic'.

    BIC = -2 * log-likelihood + |S| * log(n).

  • ebic: minimizing extended Bayesian information criterion (Chen, J. and Chen, Z., 2008; 2012). A value must be assigned to gam; when gam = 0, it reduces to the classical BIC. Available when base = 'lda' or 'logistic'. A sketch computing all three information criteria appears after this list.

    EBIC = -2 * log-likelihood + |S| * log(n) + 2 * |S| * gam * log(p).
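As referenced above, the three information criteria can be computed directly from a fitted logistic model following these formulas (a sketch for illustration; the package's internal computation may differ, e.g. in whether the intercept is counted in |S|):

# AIC/BIC/EBIC for a logistic model on subspace S (a sketch).
ebic_sketch <- function(x, y, S, gam = 0) {
  fit <- glm(y ~ x[, S, drop = FALSE], family = binomial)
  ll <- as.numeric(logLik(fit))
  n <- length(y); p <- ncol(x); s <- length(S)
  c(aic  = -2 * ll + 2 * s,
    bic  = -2 * ll + s * log(n),
    ebic = -2 * ll + s * log(n) + 2 * s * gam * log(p))
}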

ranking

whether the function outputs the selected percentage of each feature in the B1 subspaces. Logical, default = TRUE.

k

the vector of candidate numbers of nearest neighbors. Only useful when base = 'knn'. Default = c(3, 5, 7, 9, 11).

cores

the number of cores used for parallel computing. Default = 1.

seed

the random seed assigned at the start of the algorithm, which can be a real number or NULL. Default = NULL, in which case no random seed will be set.

iteration

the number of iterations. Default = 0.

cutoff

whether to use the empirically optimal threshold. Logical, default = TRUE. If FALSE, the threshold will be set to 0.5.
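One common way to choose such a threshold, sketched here for illustration (not necessarily the package's exact rule), is to scan candidate cutoffs and keep the one minimizing training error:

# Pick the cutoff minimizing empirical classification error (a sketch).
best_cutoff <- function(score, y, grid = seq(0, 1, by = 0.01)) {
  err <- vapply(grid, function(t) mean((score > t) != y), numeric(1))
  grid[which.min(err)]
}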

cv

the number of cross-validations used. Default = 5. Only useful when criterion = 'cv'.

scale

whether to normalize the data. Logical, default = FALSE.

C0

a positive constant used when iteration > 1. Default = 0.1. See Tian, Y. and Feng, Y., 2021(b) for details.

kl.k

the number of nearest neighbors used to estimate RIC in a non-parametric way. Default = NULL, which means that k0 = floor(sqrt(n0)) and k1 = floor(sqrt(n1)). See Tian, Y. and Feng, Y., 2021(b) for details. Only available when criterion = 'nric'.

lower.limits

the vector of lower limits for each coefficient in logistic regression. Should be a vector of length equal to the number of variables (the column number of xtrain). Each of these must be non-positive. Default = NULL, meaning that lower limits are -Inf for all coefficients. Only available when base = 'logistic'. When it is supplied, function glmnet will be used to fit logistic regression models, in which case the minimum subspace size is required to be larger than 1; the default subspace size distribution then becomes uniform on {2, ..., D}.

upper.limits

the vector of upper limits for each coefficient in logistic regression. Should be a vector of length equal to the number of variables (the column number of xtrain). Each of these must be non-negative. Default = NULL, meaning that upper limits are Inf for all coefficients. Only available when base = 'logistic'. When it is supplied, function glmnet will be used to fit logistic regression models, in which case the minimum subspace size is required to be larger than 1; the default subspace size distribution then becomes uniform on {2, ..., D}.

weights

observation weights. Should be a vector of length equal to the training sample size (the length of ytrain); it will be normalized inside the algorithm. Each component of weights must be non-negative. Default = NULL, representing equal weight for each observation. Only available when base = 'logistic'. When it is supplied, function glmnet will be used to fit logistic regression models, in which case the minimum subspace size is required to be larger than 1; the default subspace size distribution then becomes uniform on {2, ..., D}.
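The three arguments above map directly onto the corresponding glmnet arguments. A standalone sketch of that call with illustrative data (not the internal RaSE code):

library(glmnet)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- rbinom(100, 1, 0.5)
fit <- glmnet(x, y, family = "binomial",
              lower.limits = rep(-1, ncol(x)),  # non-positive lower bounds
              upper.limits = rep(1, ncol(x)),   # non-negative upper bounds
              weights = rep(1, nrow(x)))        # per-observation weights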

...

additional arguments.

Value

An object with S3 class 'RaSE' if base indicates a single base classifier, with the following components:

marginal

the marginal probability for each class.

base

the type of base classifier.

criterion

the criterion to choose the best subspace for each weak learner.

B1

the number of weak learners.

B2

the number of subspace candidates generated for each weak learner.

D

the maximal subspace size when generating random subspaces.

iteration

the number of iterations.

fit.list

sequence of B1 fitted base classifiers.

cutoff

the empirically optimal threshold.

subspace

sequence of subspaces corresponding to the B1 weak learners.

ranking

the selected percentage of each feature in B1 subspaces.

scale

a list of scaling parameters, including the scaling center and the scale parameter for each feature. NULL when the data is not scaled in the RaSE model fitting.
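With a fitted object fit as in the Examples below, these components can be read off directly, for example:

fit$cutoff                                  # empirically optimal threshold
fit$marginal                                # class marginal probabilities
head(sort(fit$ranking, decreasing = TRUE))  # most frequently selected features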

An object with S3 class 'super_RaSE' if base includes multiple base classifiers or the sampling probabilities of multiple classifiers, with the following components:

marginal

the marginal probability for each class.

base

the list of B1 base classifier types.

criterion

the criterion to choose the best subspace for each weak learner.

B1

the number of weak learners.

B2

the number of subspace candidates generated for each weak learner.

D

the maximal subspace size when generating random subspaces.

iteration

the number of iterations.

fit.list

sequence of B1 fitted base classifiers.

cutoff

the empirically optimal threshold.

subspace

sequence of subspaces corresponding to the B1 weak learners.

ranking.feature

the selected percentage of each feature corresponding to each type of classifier.

ranking.base

the selected percentage of each classifier type in the selected B1 learners.

scale

a list of scaling parameters, including the scaling center and the scale parameter for each feature. NULL when the data is not scaled in the RaSE model fitting.

Author(s)

Ye Tian (maintainer, ye.t@columbia.edu) and Yang Feng. The authors thank Yu Cao (Exeter Finance) and his team for many helpful suggestions and discussions.

References

Tian, Y. and Feng, Y., 2021(a). RaSE: A variable screening framework via random subspace ensembles. Journal of the American Statistical Association, (just-accepted), pp.1-30.

Tian, Y. and Feng, Y., 2021(b). RaSE: Random subspace ensemble classification. Journal of Machine Learning Research, 22(45), pp.1-93.

Zhu, J. and Feng, Y., 2021. Super RaSE: Super Random Subspace Ensemble Classification. https://www.preprints.org/manuscript/202110.0042

Chen, J. and Chen, Z., 2008. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), pp.759-771.

Chen, J. and Chen, Z., 2012. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pp.555-574.

Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory, 1973 (pp. 267-281). Akademiai Kiado.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics, 6(2), pp.461-464.

See Also

predict.RaSE, RaModel, print.RaSE, print.super_RaSE, RaPlot, RaScreen.

Examples

set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("classification", 1, n = 100, p = 50)
test.data <- RaModel("classification", 1, n = 100, p = 50)
xtrain <- train.data$x
ytrain <- train.data$y
xtest <- test.data$x
ytest <- test.data$y

# test RaSE classifier with LDA base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'lda',
            cores = 2, criterion = 'ric')
mean(predict(fit, xtest) != ytest)

## Not run: 
# test RaSE classifier with LDA base classifier and 1 iteration round
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 1, base = 'lda',
            cores = 2, criterion = 'ric')
mean(predict(fit, xtest) != ytest)

# test RaSE classifier with QDA base classifier and 1 iteration round
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 1, base = 'qda',
            cores = 2, criterion = 'ric')
mean(predict(fit, xtest) != ytest)

# test RaSE classifier with kNN base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'knn',
            cores = 2, criterion = 'loo')
mean(predict(fit, xtest) != ytest)

# test RaSE classifier with logistic regression base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'logistic',
            cores = 2, criterion = 'bic')
mean(predict(fit, xtest) != ytest)

# test RaSE classifier with SVM base classifier
fit <- Rase(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, base = 'svm',
            cores = 2, criterion = 'training')
mean(predict(fit, xtest) != ytest)

# test RaSE classifier with random forest base classifier
fit <- Rase(xtrain, ytrain, B1 = 20, B2 = 10, iteration = 0, base = 'randomforest',
            cores = 2, criterion = 'cv', cv = 3)
mean(predict(fit, xtest) != ytest)

# fit a super RaSE classifier by sampling base learners from kNN, LDA and
# logistic regression with equal probability
fit <- Rase(xtrain = xtrain, ytrain = ytrain, B1 = 100, B2 = 100,
            base = c("knn", "lda", "logistic"),
            super = list(type = "separate", base.update = TRUE),
            criterion = "cv", cv = 5, iteration = 1, cores = 2)
mean(predict(fit, xtest) != ytest)

# fit a super RaSE classifier by sampling base learners from random forest,
# LDA and SVM with probabilities 0.2, 0.5 and 0.3
fit <- Rase(xtrain = xtrain, ytrain = ytrain, B1 = 100, B2 = 100,
            base = c(randomforest = 0.2, lda = 0.5, svm = 0.3),
            super = list(type = "separate", base.update = FALSE),
            criterion = "cv", cv = 5, iteration = 0, cores = 2)
mean(predict(fit, xtest) != ytest)

## End(Not run)
