RaScreen {RaSEn}    R Documentation

Variable screening via RaSE.

Description

RaSE is a general framework for variable screening. In RaSE screening, to select each of the B1 subspaces, B2 random subspaces are generated and the optimal one is chosen according to some criterion. The proportions (equivalently, percentages) of the B1 selected subspaces in which each variable appears are then used as an importance measure to rank the variables.
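The procedure described above can be sketched in a few lines of base R. This is a toy illustration, not the RaSEn implementation: the simulated data, the BIC-type score for a linear model, and all object names (score, selected, selected_prop) are assumptions made for the example.

```r
# Toy sketch of the screening idea in base R (not the RaSEn implementation).
set.seed(1)
n <- 80; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)   # only features 1 and 2 carry signal

B1 <- 50; B2 <- 20; D <- 3

# BIC-type score of a subspace S under a linear model:
# -2 * log-likelihood + |S| * log(n); smaller is better.
score <- function(S) {
  fit <- lm(y ~ x[, S, drop = FALSE])
  -2 * as.numeric(logLik(fit)) + length(S) * log(n)
}

# For each of the B1 rounds, draw B2 candidate subspaces and keep the best one.
selected <- lapply(seq_len(B1), function(b) {
  candidates <- replicate(B2, {
    d <- sample.int(D, 1)        # subspace size, uniform on 1, ..., D
    sort(sample.int(p, d))       # uniform subset of that size
  }, simplify = FALSE)
  candidates[[which.min(vapply(candidates, score, numeric(1)))]]
})

# Proportion of the B1 selected subspaces that contain each feature.
selected_prop <- tabulate(unlist(selected), nbins = p) / B1
print(round(selected_prop, 2))
```

Features with larger proportions are ranked as more important; in this toy setup the signal features would be expected to appear far more often than the noise features.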

Usage

RaScreen(
  xtrain,
  ytrain,
  xval = NULL,
  yval = NULL,
  B1 = 200,
  B2 = NULL,
  D = NULL,
  dist = NULL,
  model = NULL,
  criterion = NULL,
  k = 5,
  cores = 1,
  seed = NULL,
  iteration = 0,
  cv = 5,
  scale = FALSE,
  C0 = 0.1,
  kl.k = NULL,
  classification = NULL,
  ...
)

Arguments

xtrain

n * p observation matrix. n observations, p features.

ytrain

n 0/1 observations.

xval

observation matrix for validation. Default = NULL. Useful only when criterion = 'validation'.

yval

0/1 observation for validation. Default = NULL. Useful only when criterion = 'validation'.

B1

the number of weak learners. Default = 200.

B2

the number of subspace candidates generated for each weak learner. Default = NULL, which will set B2 = 20*floor(p/D).

D

the maximal subspace size when generating random subspaces. Default = NULL, which means that D = min(sqrt(n0), sqrt(n1), p) when model = 'qda', and D = min(sqrt(n), p) otherwise. Here n0 and n1 are the numbers of training observations in classes 0 and 1.
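As a worked instance of the default formulas for D and B2, the snippet below plugs in n = 100 and p = 100, the sizes used in the Examples section. The class sizes n0 and n1 are made-up values, and how RaScreen rounds a non-integer bound is not stated here, so the qda value is left unrounded.

```r
n <- 100; p <- 100

# Default D for models other than 'qda': min(sqrt(n), p)
D <- min(sqrt(n), p)            # sqrt(100) = 10, so D = 10

# Default B2: 20 * floor(p / D)
B2 <- 20 * floor(p / D)         # 20 * floor(100 / 10) = 200

# Bound used when model = 'qda', with illustrative class sizes
n0 <- 40; n1 <- 60
D_qda_bound <- min(sqrt(n0), sqrt(n1), p)   # about 6.32 before any rounding
```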

dist

the distribution for features when generating random subspaces. Default = NULL, which represents the hierarchical uniform distribution. First generate an integer d from 1,...,D uniformly, then uniformly generate a subset with cardinality d.
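The hierarchical uniform distribution can be simulated directly. The sketch below uses a hypothetical helper, draw_subspace, written for this illustration only. Under this distribution each feature is included with probability (D + 1) / (2 * p), since the expected subspace size is (D + 1) / 2; the simulation checks this empirically.

```r
set.seed(1)
p <- 10; D <- 4

# Hypothetical helper: one draw from the hierarchical uniform distribution.
draw_subspace <- function(p, D) {
  d <- sample.int(D, 1)        # step 1: size d uniform on 1, ..., D
  sort(sample.int(p, d))       # step 2: uniform subset with cardinality d
}

draws <- replicate(20000, draw_subspace(p, D), simplify = FALSE)

# Empirical inclusion frequency of each feature versus the theoretical value.
incl <- tabulate(unlist(draws), nbins = p) / length(draws)
expected <- (D + 1) / (2 * p)   # 0.25 for p = 10, D = 4
print(round(incl, 3))
```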

model

the model to use. Default = 'lda' when classification = TRUE and 'lm' when classification = FALSE.

  • lm: linear regression. Only available for regression.

  • lda: linear discriminant analysis. lda in MASS package. Only available for classification.

  • qda: quadratic discriminant analysis. qda in MASS package. Only available for classification.

  • knn: k-nearest neighbor. knn, knn.cv in class package, knn3 in caret package and knnreg in caret package.

  • logistic: logistic regression. glmnet in glmnet package. Only available for classification.

  • tree: decision tree. rpart in rpart package. Only available for classification.

  • svm: support vector machine. If the user does not specify a kernel, the RBF kernel is used. svm in e1071 package.

  • randomforest: random forest. randomForest in randomForest package and ranger in ranger package.

  • kernelknn: k-nearest neighbor with different kernels. It relies on function KernelKnn in KernelKnn package. Arguments method and weights_function are required. Different choices of multiple arguments are available. See documentation of function KernelKnn for details.

criterion

the criterion to choose the best subspace. Default = 'ric' when model = 'lda', 'qda'; default = 'bic' when model = 'lm' or 'logistic'; default = 'loo' when model = 'knn'; default = 'cv' and set cv = 5 when model = 'tree', 'svm', 'randomforest'.

  • ric: minimizing ratio information criterion (RIC) with parametric estimation (Tian, Y. and Feng, Y., 2020). Available for binary classification and model = 'lda', 'qda', or 'logistic'.

  • nric: minimizing ratio information criterion (RIC) with non-parametric estimation (Tian, Y. and Feng, Y., 2020). Available for binary classification and model = 'lda', 'qda', or 'logistic'.

  • training: minimizing training error/MSE. Not available when model = 'knn'.

  • loo: minimizing leave-one-out error/MSE. Only available when model = 'knn'.

  • validation: minimizing validation error/MSE based on the validation data.

  • cv: minimizing k-fold cross-validation error/MSE. k equals the value of cv. Default = 5.

  • aic: minimizing Akaike information criterion (Akaike, H., 1973). Available when model = 'lm' or 'logistic'.

    AIC = -2 * log-likelihood + |S| * 2.

  • bic: minimizing Bayesian information criterion (Schwarz, G., 1978). Available when model = 'lm' or 'logistic'.

    BIC = -2 * log-likelihood + |S| * log(n).

  • ebic: minimizing extended Bayesian information criterion (Chen, J. and Chen, Z., 2008; 2012). gam value is needed. When gam = 0, it represents BIC. Available when model = 'lm' or 'logistic'.

    eBIC = -2 * log-likelihood + |S| * log(n) + 2 * |S| * gam * log(p).
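The three information criteria above can be evaluated by hand for a single candidate subspace S. The sketch below does so for a linear model on simulated data; the data, the subspace S, and gam = 0.5 are illustrative choices, and the log-likelihood is taken from logLik on an lm fit, which may differ from how RaSEn computes it internally.

```r
set.seed(1)
n <- 60; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + rnorm(n)

S   <- c(1, 3)      # candidate subspace, so |S| = 2
gam <- 0.5          # eBIC parameter

fit <- lm(y ~ x[, S, drop = FALSE])
ll  <- as.numeric(logLik(fit))

aic  <- -2 * ll + length(S) * 2
bic  <- -2 * ll + length(S) * log(n)
ebic <- -2 * ll + length(S) * log(n) + 2 * length(S) * gam * log(p)

print(c(AIC = aic, BIC = bic, eBIC = ebic))
```

Note that the built-in AIC() and BIC() functions also count the intercept and the error variance as parameters, so their values differ from the ones above by a constant that does not depend on S when comparing subspaces of equal size. With gam = 0 the eBIC value reduces to BIC.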

k

the number of nearest neighbors considered. Only useful when model = 'knn' or 'kernel'. k is required to be a positive integer. Default = 5.

cores

the number of cores used for parallel computing. Default = 1.

seed

the random seed assigned at the start of the algorithm, which can be a real number or NULL. Default = NULL, in which case no random seed will be set.

iteration

the number of iterations. Default = 0.

cv

the number of cross-validation folds. Default = 5. Only useful when criterion = 'cv'.

scale

whether to normalize the data. Logical, default = FALSE.

C0

a positive constant used when iteration > 1. See Tian, Y. and Feng, Y., 2021 for details. Default = 0.1.

kl.k

the number of nearest neighbors used to estimate RIC in a non-parametric way. Default = NULL, which means that k0 = floor(sqrt(n0)) and k1 = floor(sqrt(n1)). See Tian, Y. and Feng, Y., 2020 for details. Only available when criterion = 'nric'.

classification

the indicator of the problem type, which can be TRUE, FALSE or NULL. Default = NULL, which will automatically set classification = TRUE if the number of unique response values is at most 10. Otherwise, it will be set to FALSE.
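The automatic rule for classification = NULL amounts to counting distinct response values. A minimal sketch, using a hypothetical helper infer_classification that is not part of RaSEn:

```r
# Hypothetical helper mirroring the rule stated above; not a RaSEn function.
infer_classification <- function(y, classification = NULL) {
  if (is.null(classification)) {
    length(unique(y)) <= 10     # few distinct values: treat as classification
  } else {
    classification              # an explicit TRUE/FALSE is used as given
  }
}

infer_classification(c(0, 1, 1, 0))            # TRUE: two distinct values
infer_classification(seq(0.5, 10, by = 0.5))   # FALSE: twenty distinct values
```

A numeric response with only a few distinct values (for example, counts from 0 to 5) would be treated as classification under this rule, so setting classification explicitly is the safer choice in such cases.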

...

additional arguments.

Value

A list including the following items.

model

the model used in RaSE screening.

criterion

the criterion to choose the best subspace for each weak learner.

B1

the number of selected subspaces.

B2

the number of subspace candidates generated for each of B1 subspaces.

n

the sample size.

p

the dimension of data.

D

the maximal subspace size when generating random subspaces.

iteration

the number of iterations.

selected.perc

A list of length (iteration+1) recording the selected percentages of each feature in B1 subspaces. When it is of length 1, the result will be automatically transformed to a vector.

scale

a list of scaling parameters, including the scaling center and the scale parameter for each feature. NULL when the data is not scaled by RaScreen.
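To illustrate how selected.perc serves as an importance measure, the sketch below ranks features by a made-up vector of selected proportions. The values and feature names are invented for the example and are not package output; RaRank is the package's own function for selecting variables from a fitted object.

```r
# Made-up selected proportions for five features (not RaScreen output).
selected.perc <- c(V1 = 0.62, V2 = 0.05, V3 = 0.48, V4 = 0.01, V5 = 0.30)

# Rank features from most to least frequently selected and keep the top 3.
ranking <- order(selected.perc, decreasing = TRUE)
top3 <- names(selected.perc)[ranking[1:3]]
print(top3)   # "V1" "V3" "V5"
```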

References

Tian, Y. and Feng, Y., 2021(a). RaSE: A variable screening framework via random subspace ensembles. Journal of the American Statistical Association, (just-accepted), pp.1-30.

Tian, Y. and Feng, Y., 2021(b). RaSE: Random subspace ensemble classification. Journal of Machine Learning Research, 22(45), pp.1-93.

Chen, J. and Chen, Z., 2008. Extended Bayesian information criteria for model selection with large model spaces. Biometrika, 95(3), pp.759-771.

Chen, J. and Chen, Z., 2012. Extended BIC for small-n-large-P sparse GLM. Statistica Sinica, pp.555-574.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics, 6(2), pp.461-464.

See Also

Rase, RaRank.

Examples

set.seed(0, kind = "L'Ecuyer-CMRG")
train.data <- RaModel("screening", 1, n = 100, p = 100)
xtrain <- train.data$x
ytrain <- train.data$y

# test RaSE screening with linear regression model and BIC
fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, model = 'lm',
cores = 2, criterion = 'bic')

# Select D variables
RaRank(fit, selected.num = "D")


## Not run: 
# test RaSE screening with knn model and 5-fold cross-validation MSE
fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, model = 'knn',
cores = 2, criterion = 'cv', cv = 5)

# Select n/logn variables
RaRank(fit, selected.num = "n/logn")


# test RaSE screening with SVM and 5-fold cross-validation MSE
fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 50, iteration = 0, model = 'svm',
cores = 2, criterion = 'cv', cv = 5)

# Select n/logn variables
RaRank(fit, selected.num = "n/logn")


# test RaSE screening with logistic regression model and eBIC (gam = 0.5). Set iteration number = 1
train.data <- RaModel("screening", 6, n = 100, p = 100)
xtrain <- train.data$x
ytrain <- train.data$y

fit <- RaScreen(xtrain, ytrain, B1 = 100, B2 = 100, iteration = 1, model = 'logistic',
cores = 2, criterion = 'ebic', gam = 0.5)

# Select n/logn variables from the selected percentage after one iteration round
RaRank(fit, selected.num = "n/logn", iteration = 1)

## End(Not run)

[Package RaSEn version 3.0.0 Index]