varSelImpSpecRF {varSelRF} | R Documentation |
Variable selection using the "importance spectrum"
Description
Perform variable selection based on a simple heuristic using the importance spectrum of the original data compared to the importance spectra from the same data with the class labels randomly permuted.
Usage
varSelImpSpecRF(forest, xdata = NULL, Class = NULL,
randomImps = NULL,
threshold = 0.1,
numrandom = 20,
whichImp = "impsUnscaled",
usingCluster = TRUE,
TheCluster = NULL, ...)
Arguments
forest |
A previously fitted random forest (see randomForest). |
xdata |
A data frame or matrix, with subjects/cases in rows and variables in columns. NAs not allowed. |
Class |
The dependent variable; must be a factor. |
randomImps |
A list with the same structure as the object returned by randomVarImpsRF. |
threshold |
The threshold for the selection of variables. See details. |
numrandom |
The number of random permutations of the class labels. |
whichImp |
The variable importance measure to use; the default is "impsUnscaled". |
usingCluster |
If TRUE use a cluster to parallelize the calculations. |
TheCluster |
The name of the cluster, if one is used. |
... |
Not used. |
Details
You can either pass a valid object for randomImps, obtained from a previous call to randomVarImpsRF, or pass a covariate data frame (xdata) and a dependent variable (Class), which will be used to compute the random importances. The former is preferred for normal use, because this function does not return the computed random variable importances, and this computation can be lengthy. If you pass all of randomImps, xdata, and Class, randomImps will be used.
To select variables, start by ordering, from largest (i = 1) to smallest (i = p, where p is the number of variables), the variable importances from the original data and from each of the data sets with permuted class labels; the ordering is done in each data set independently. Compute q_i, the 1 - threshold quantile of the ordered variable importances from the permuted data at ordered position i. Then, starting from i = 1, let i_a be the first i for which the variable importance from the original data is smaller than q_i. Select all variables from i = 1 to i = i_a - 1.
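The selection rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: select_by_spectrum is a hypothetical helper, the inputs are assumed to be a named vector of original importances and a matrix with one column per permuted data set, and selecting every variable when no importance falls below its q_i is an assumption the text does not settle.

```r
# Hypothetical helper, not part of varSelRF: illustrates the selection rule only.
# orig_imp : named numeric vector, importances from the original data
# perm_imp : numeric matrix, one row per variable, one column per permuted data set
select_by_spectrum <- function(orig_imp, perm_imp, threshold = 0.1) {
  p <- length(orig_imp)
  # Order the original importances from largest (i = 1) to smallest (i = p)
  orig_sorted <- sort(orig_imp, decreasing = TRUE)
  # Order each permuted data set independently (column by column)
  perm_sorted <- matrix(apply(perm_imp, 2, sort, decreasing = TRUE), nrow = p)
  # q_i: the 1 - threshold quantile across permutations at ordered position i
  q <- unname(apply(perm_sorted, 1, quantile, probs = 1 - threshold))
  # i_a: first position where the original importance is smaller than q_i
  below <- which(orig_sorted < q)
  # Assumption: if no such position exists, all variables are selected
  i_a <- if (length(below) > 0) below[1] else p + 1
  names(orig_sorted)[seq_len(i_a - 1)]
}

# Toy data: four variables, three permuted data sets (values made up)
orig <- c(v1 = 5, v2 = 3, v3 = 0.5, v4 = 0.2)
perm <- matrix(c(1.0, 0.8, 0.6, 0.1,
                 0.9, 0.7, 0.4, 0.3,
                 1.2, 0.6, 0.5, 0.2), nrow = 4)
selected <- select_by_spectrum(orig, perm, threshold = 0.1)
print(selected)
```

With these toy values the third ordered importance (0.5) is the first to fall below its quantile, so i_a = 3 and only v1 and v2 are selected.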
Value
A vector with the names of the selected variables, ordered by decreasing importance.
Note
The name of this function is related to the idea of the "importance spectrum plot", the term that Friedman & Meulman (2005) use in their paper.
Author(s)
Ramon Diaz-Uriarte rdiaz02@gmail.com
References
Breiman, L. (2001) Random forests. Machine Learning, 45, 5–32.
Diaz-Uriarte, R., Alvarez de Andres, S. (2005) Variable selection from random forests: application to gene expression data. Tech. report. http://ligarto.org/rdiaz/Papers/rfVS/randomForestVarSel.html
Friedman, J., Meulman, J. (2005) Clustering objects on subsets of attributes (with discussion). J. Royal Statistical Society, Series B, 66, 815–850.
See Also
randomForest, varSelRF, varSelRFBoot, randomVarImpsRFplot, randomVarImpsRF
Examples
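library(varSelRF)

## Simulated data: 45 cases, 30 variables; only the first two differ between classes
x <- matrix(rnorm(45 * 30), ncol = 30)
x[1:20, 1:2] <- x[1:20, 1:2] + 2
cl <- factor(c(rep("A", 20), rep("B", 25)))
rf <- randomForest(x, cl, ntree = 200, importance = TRUE)
## Importances from 20 data sets with randomly permuted class labels
rf.rvi <- randomVarImpsRF(x, cl,
                          rf,
                          numrandom = 20,
                          usingCluster = FALSE)
## Select variables by comparing the original and permuted importance spectra
varSelImpSpecRF(rf, randomImps = rf.rvi)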
## Not run:
## Identical, but using a cluster
psockCL <- makeCluster(2, "PSOCK")
clusterSetRNGStream(psockCL, iseed = 456)
clusterEvalQ(psockCL, library(varSelRF))
rf.rvi <- randomVarImpsRF(x, cl,
rf,
numrandom = 20,
usingCluster = TRUE,
TheCluster = psockCL)
varSelImpSpecRF(rf, randomImps = rf.rvi)
stopCluster(psockCL)
## End(Not run)