R: Cross-validated permutation variable importance measure

CVPVI {vita}

R Documentation

Cross-validated permutation variable importance measure

Description

Compute cross-validated permutation variable importance measure from a random forest for classification and regression.

Usage


## Default S3 method:
CVPVI(X, y, k = 2, mtry= if (!is.null(y) && !is.factor(y))
                        max(floor(ncol(X)/3), 1) else floor(sqrt(ncol(X))),
    ntree = 500, nPerm = 1, parallel = FALSE, ncores = 0, seed = 123, ...)
## S3 method for class 'CVPVI'
print(x, ...)

Arguments

`X`	a data frame or a matrix of predictors.
`y`	a response vector.
`k`	an integer for the number of folds. Default is `k = 2`
`mtry`	Number of variables randomly sampled as candidates at each split for the l-th forest. Note that the default values are different for classification (`mtry=sqrt(p)` where `p` is number of variables in `x`) and regression (`mtry=p/3`).
`ntree`	Number of trees to grow for the l-th forest. Default is `ntree=500`.
`nPerm`	Number of times the l-th data set are permuted per tree for assessing variable fold-specific permutation variable importance. Default is `nPerm=1`.
`parallel`	Should the CVPVI implementation run parallel? Default is `parallel=FALSE` and the number of cores is set to one. The parallelized version of the CVPVI implementation are based on `mclapply` and so are not available on Windows.
`ncores`	The number of cores to use, i.e. at most how many child processes will be run simultaneously. Must be at least one, and parallelization requires at least two cores. If `ncores=0`, then the half of CPU cores on the current host are used.
`seed`	a single integer value to specify seeds. The "combined multiple-recursive generator" from L'Ecuyer (1999) is set as random number generator for the parallelized version of the CVPVI implementation. Default is `seed = 123`.
`...`	optional parameters for `randomForest`
`x`	for the print method, an `CVPVI` object

Details

This method randomly splits the dataset into k sets of equal size. The method constructs k random forests, where the l-th forest is constructed based on observations that are not part of the l-th set. For each forest the fold-specific permutation variable importance measure is computed using all observations in the l-th data set: For each tree, the prediction error on the l-th data set is recorded. Then the same is done after permuting the values of each predictor variable. The differences between the two prediction errors are then averaged over all trees. The cross-validated permutation variable importance is the average of all k-fold-specific permutation variable importances. For classification the mean decrease in accuracy over all classes is used and for regression the mean decrease in MSE.

Value

`fold_varim`	a p by k matrix of fold-specific permutation variable importances. For classification the mean decrease in accuracy over all classes. For regression mean decrease in MSE.
`cv_varim`	cross-validated permutation variable importances. For classification the mean decrease in accuracy over all classes. For regression mean decrease in MSE.
`type`	one of regression, classification

References

Janitza S, Celik E, Boulesteix A-L, (2015), A computationally fast variable importance test for random forest for high dimensional data,Technical Report 185, University of Munich, <http://nbn-resolving.de/urn/resolver.pl?urn=nbn:de:bvb:19-epub-25587-4>

Examples

##############################
#      Classification        #
##############################
## Simulating data
X = replicate(10,rnorm(100))
X= data.frame( X) #"X" can also be a matrix
z  = with(X,5*X1 + 3*X2 + 2*X3 + 1*X4 -
            5*X5 - 9*X6 - 2*X7 + 1*X8 )
pr = 1/(1+exp(-z))         # pass through an inv-logit function
y = as.factor(rbinom(100,1,pr))
##################################################################
# cross-validated permutation variable importance
cv_vi = CVPVI(X,y,k = 2,mtry = 3,ntree = 1000,ncores = 4)
print(cv_vi)

##################################################################
#compare them with the original permutation variable importance
library("randomForest")
cl.rf = randomForest(X,y,mtry = 3,ntree = 1000, importance = TRUE)

round(cbind(importance(cl.rf, type=1, scale=FALSE),cv_vi$cv_varim),digits=5)


###############################
#      Regression            #
##############################

##################################################################
## Simulating data:
X = replicate(10,rnorm(100))
X = data.frame( X) #"X" can also be a matrix
y = with(X,2*X1 + 2*X2 + 2*X3 + 1*X4 - 2*X5 - 2*X6 - 1*X7 + 2*X8 )

##################################################################
# cross-validated permutation variable importance
cv_vi = CVPVI(X,y,k = 3,mtry = 3,ntree = 1000,ncores = 2)
print(cv_vi)
##################################################################
#compare them with the original permutation variable importance
library("randomForest")
reg.rf = randomForest(X,y,mtry = 3,ntree = 1000, importance = TRUE)

round(cbind(importance(reg.rf, type=1, scale=FALSE),cv_vi$cv_varim),digits=5)

[Package vita version 1.0.0 Index]