R: Adaptive partition based on random forests

parRF {BAGofT}

R Documentation

Adaptive partition based on random forests

Description

parRF generates an adaptive partition based on the training set data and training set predictions. It controls the group sizes by the covariates data from the validation set.

Usage

parRF(parVar = ".", Kmax = NULL, nmin = NULL, ntree = 60, mtry = NULL, maxnodes = NULL)

Arguments

`parVar`	a character vector that contains the names of the covariates to generate the adaptive partition. The default is taking all the variables except the response from the ‘data’.
`Kmax`	the maximum number of groups. The default is floor(nrow(Test.data)/nmin).
`nmin`	a numerical vector of the training set Pearson residuals from the classifier to test. The default is ceiling(sqrt(nrow(Validation.data))).
`ntree`	number of trees to grow. The default is 60.
`mtry`	number of variables randomly sampled as candidates at each split. The default value is floor( length(parVarNew)/3) where parVarNew is the number of covariates after the preselection.
`maxnodes`	maximum number of terminal nodes trees in the forest can have.

Value

`gup`	a factor that contains the grouping result of the validation set data.
`parRes`	a list that contains the variable importance from the random forest.

References

Zhang, Ding and Yang (2021) "Is a Classification Procedure Good Enough?-A Goodness-of-Fit Assessment Tool for Classification Learning" arXiv preprint arXiv:1911.03063v2 (2021).

Examples

###################################################
# Generate a sample dataset.
###################################################
# set the random seed
set.seed(20)
# set the number of observations
n <- 200

# generate covariates data
x1dat <- runif(n, -3, 3)
x2dat <- rnorm(n, 0, 1)
x3dat <- rchisq(n, 4)

# set coefficients
beta1 <- 1
beta2 <- 1
beta3 <- 1

# calculate the linear predictor data
lindat <- x1dat * beta1 + x2dat * beta2 + x3dat * beta3
# calculate the probabilities by inverse logit link
pdat <- 1/(1 + exp(-lindat))

# generate the response data
ydat <- sapply(pdat, function(x) stats :: rbinom(1, 1, x))

# generate the dataset
dat <- data.frame(y = ydat, x1 = x1dat, x2 = x2dat,
                  x3 = x3dat)

###################################################
# Apply parRF to generate an adaptive partition
###################################################
# number of rows in the dataset
nr <- nrow(dat)
# size of the validation set
ne <- floor(5*nrow(dat)^(1/2))
# obtain the training set size
nt <- nr - ne
# the indices for training set observations
trainIn <- sample(c(1 : nr), nt)

#split the data
datT <- dat[trainIn, ]
datE <- dat[-trainIn, ]
# fit a logistic regression model to test by training data
testModel <- testGlmBi(formula = y ~ x1 + x2 , link = "logit")
# output training set predictions and pearson residuals
testMod <- testModel(Train.data = datT, Validation.data = datE)

# obtain adaptive partition result from parFun
parFun <- parRF(parVar = c("x1", "x2", "x3"))
par <- parFun(Rsp = testMod$Rsp, predT = testMod$predT, res = testMod$res,
              Train.data = datT, Validation.data = datE)

# print the grouping result of the validataion set data
print(par$gup)

# print variable importance from the random forest
print(par$parRes)

[Package BAGofT version 1.0.0 Index]