R: Select predictive variables for random forest by AVI and...

steprfAVI {steprf}

R Documentation

Select predictive variables for random forest by AVI and accuracy in a stepwise algorithm

Description

This function is to select predictive variables for random forest by their averaged variable importance (AVI) that is calculated for each model after excluding the least important variable, and returns the corresponding predictive accuracy. It is also developed for 'steprf' function.

Usage

steprfAVI(
  trainx,
  trainy,
  cv.fold = 10,
  ntree = 500,
  rpt = 20,
  predacc = "VEcv",
  importance = TRUE,
  maxk = c(4),
  nsim = 100,
  min.n.var = 2,
  corr.threshold = 0.5,
  ...
)

Arguments

`trainx`	a dataframe or matrix contains columns of predictor variables.
`trainy`	a vector of response, must have length equal to the number of rows in trainx.
`cv.fold`	integer; number of folds in the cross-validation. if > 1, then apply n-fold cross validation; the default is 10, i.e., 10-fold cross validation that is recommended.
`ntree`	number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times. By default, 500 is used.
`rpt`	iteration number of cross validation.
`predacc`	"VEcv" for vecv for numerical data, or "ccr" (i.e., correct classification rate) or "kappa" for categorical data.
`importance`	importance of predictive variables.
`maxk`	maxk split value. By default, 4 is used.
`nsim`	iteration number for 'avi'. By default, 100 is used.
`min.n.var`	minimum number of predictive variables remained in the final predictive model the default is 2. If 1 is used, then warnings: 'invalid mtry: reset to within valid range' will be issued, which should be ignored.
`corr.threshold`	correlation threshold and the defaults value is 0.5.
`...`	other arguments passed on to randomForest.

Value

A list with the following components: 1) variable.removed: variable removed based on AVI, 2) predictive.accuracy: averaged predictive accuracy of the model after excluding the variable.removed, 3) delta.accuracy: contribution to accuracy by each variable.removed, and 4) predictive.accuracy2: predictive accuracy matrix of the model after excluding the variable.removed for each iteration.

Author(s)

Jin Li

References

Li, J. (2022). Spatial Predictive Modeling with R. Boca Raton, Chapman and Hall/CRC.

Li, J. 2013. Predicting the spatial distribution of seabed gravel content using random forest, spatial interpolation methods and their hybrid methods. Pages 394-400 The International Congress on Modelling and Simulation (MODSIM) 2013, Adelaide.

Li, J., Alvarez, B., Siwabessy, J., Tran, M., Huang, Z., Przeslawski, R., Radke, L., Howard, F. and Nichol, S. (2017). "Application of random forest, generalised linear model and their hybrid methods with geostatistical techniques to count data: Predicting sponge species richness." Environmental Modelling & Software 97: 112-129.

Li, J., Siwabessy, J., Huang, Z., Nichol, S. (2019). "Developing an optimal spatial predictive model for seabed sand content using machine learning, geostatistics and their hybrid methods." Geosciences 9 (4):180.

Li, J., Siwabessy, J., Tran, M., Huang, Z. and Heap, A., 2014. Predicting Seabed Hardness Using Random Forest in R. Data Mining Applications with R. Y. Zhao and Y. Cen. Amsterdam, Elsevier: 299-329.

Liaw, A. and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.

Smith, S.J., Ellis, N., Pitcher, C.R., 2011. Conditional variable importance in R package extendedForest. R vignette [http://gradientforest.r-forge.r-project.org/Conditional-importance.pdf].

Examples


library(spm)

data(petrel)
set.seed(1234)
steprf1 <- steprfAVI(trainx = petrel[, c(1,2, 6:9)], trainy = petrel[, 5],
 rpt = 2, predacc = "VEcv", importance = TRUE, nsim = 3, min.n.var = 2)
steprf1

steprf2 <- steprfAVI(trainx = hard[, -c(1, 17)], trainy = hard[, 17],
 rpt = 2, predacc = "ccr", importance = TRUE, nsim = 3, min.n.var = 2)
steprf2

#plot steprf1 results
library(reshape2)
pa1 <- as.data.frame(steprf1$predictive.accuracy2)
names(pa1) <- steprf1$variable.removed
pa2 <- melt(pa1, id = NULL)
names(pa2) <- c("Variable","VEcv")
library(lattice)
par (font.axis=2, font.lab=2)
with(pa2, boxplot(VEcv~Variable, ylab="VEcv (%)", xlab="Predictive variable removed"))

barplot(steprf1$delta.accuracy, col = (1:length(steprf1$variable.removed)),
names.arg = steprf1$variable.removed, main = "Predictive accuracy vs variable removed",
font.main = 4, cex.names=1, font=2, ylab="Increase in VEcv (%)")

[Package steprf version 1.0.2 Index]