
vcr.forest.train {classmap}

R Documentation

Prepare for visualization of a random forest classification on training data

Description

Produces the output needed to construct graphical displays such as the classmap and silplot. The user first needs to train a random forest on the data with randomForest::randomForest. The fitted forest is then passed to vcr.forest.train as the argument trainfit.

Usage

vcr.forest.train(X, y, trainfit, type = list(),
                 k = 5, stand = TRUE)

Arguments

`X`	A rectangular matrix or data frame, where the columns (variables) may be of mixed type.
`y`	factor with the given class labels. It is crucial that `X` and `y` are exactly the same as in the call to `randomForest::randomForest`. `y` is allowed to contain `NA`'s.
`trainfit`	the output of a `randomForest::randomForest` training run.
`k`	the number of nearest neighbors used in the farness computation.
`type`	list for specifying some (or all) of the types of the variables (columns) in `X`, used for computing the dissimilarity matrix, as in `cluster::daisy`. The list may contain the following components: `"ordratio"` (ratio scaled variables to be treated as ordinal variables), `"logratio"` (ratio scaled variables that must be logarithmically transformed), `"asymm"` (asymmetric binary) and `"symm"` (symmetric binary variables). Each component's value is a vector, containing the names or the numbers of the corresponding columns of `X`. Variables not mentioned in the `type` list are interpreted as usual (see argument `X`).
`stand`	whether or not to standardize numerical (interval scaled) variables by their range as in the original `cluster::daisy` code for the farness computation. Defaults to `TRUE`.
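
The `type` argument follows the convention of `cluster::daisy`, so a small sketch with `daisy` itself may help show how such a list is written. The data frame `toy` and its column names are hypothetical, and the call below goes to `cluster::daisy` directly; it is not the code that `vcr.forest.train` runs internally.

```r
library(cluster)  # recommended package that ships with R

# Hypothetical toy data: two 0/1 indicators, one numeric column, one factor.
toy <- data.frame(
  has_pic  = c(1, 0, 1, 1, 0),
  private  = c(0, 0, 1, 0, 1),
  n_posts  = c(12, 340, 5, 48, 0),
  category = factor(c("a", "b", "a", "c", "b"))
)

# Declare the two indicator columns as symmetric binary, by name.
# Column numbers would work too: list(symm = c(1, 2)).
toy_type <- list(symm = c("has_pic", "private"))

# With mixed column types daisy computes Gower dissimilarities,
# which scale numeric columns by their range.
d <- daisy(toy, type = toy_type)
dmat <- as.matrix(d)
round(dmat, 3)
```

Columns left out of the list (`n_posts` and `category` here) keep their default treatment: numeric columns are interval scaled and factors are nominal.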

Value

A list with components:

`X`	The data used to train the forest.
`yint`	number of the given class of each case. Can contain `NA`'s.
`y`	given class label of each case. Can contain `NA`'s.
`levels`	levels of `y`
`predint`	predicted class number of each case. For each case this is the class with the highest posterior probability. Always exists.
`pred`	predicted label of each case.
`altint`	number of the alternative class. Among the classes different from the given class, it is the one with the highest posterior probability. Is `NA` for cases whose `y` is missing.
`altlab`	label of the alternative class. Is `NA` for cases whose `y` is missing.
`PAC`	probability of the alternative class. Is `NA` for cases whose `y` is missing.
`figparams`	parameters for computing `fig`, can be used for new data.
`fig`	distance of each case `i` from each class `g`. Always exists.
`farness`	farness of each case from its given class. Is `NA` for cases whose `y` is missing.
`ofarness`	for each case `i`, its lowest `fig[i,g]` to any class `g`. Always exists.
`trainfit`	The trained random forest which was given as an input to this function.
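
To make the relation between these components concrete, here is a minimal base R sketch on made-up numbers. The matrices `post` and `fig` are hypothetical, not output of a real forest. Two points are assumptions rather than statements from this page: `PAC` is computed as the conditional probability p(alt) / (p(given) + p(alt)), which is how the class map papers define it, and `farness` is read off as the entry of `fig` for the given class.

```r
# Hypothetical posterior probabilities: 4 cases (rows), 3 classes (columns).
post <- matrix(c(0.7, 0.2, 0.1,
                 0.3, 0.6, 0.1,
                 0.2, 0.3, 0.5,
                 0.4, 0.4, 0.2),
               ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("a", "b", "c")))
yint <- c(1, 1, NA, 2)  # given class numbers; case 3 has no label

# Predicted class: highest posterior probability, defined for every case.
predint <- max.col(post, ties.method = "first")

# Alternative class: best class other than the given one; NA if y is missing.
altint <- rep(NA_integer_, nrow(post))
PAC    <- rep(NA_real_, nrow(post))
for (i in which(!is.na(yint))) {
  p <- post[i, ]
  p_given <- p[yint[i]]
  p[yint[i]] <- -Inf          # exclude the given class from the search
  altint[i] <- which.max(p)
  # Assumed definition: conditional probability of the alternative class.
  PAC[i] <- p[altint[i]] / (p_given + p[altint[i]])
}

# Hypothetical distances of each case to each class.
fig <- matrix(c(0.1, 0.8, 0.9,
                0.7, 0.2, 0.9,
                0.6, 0.5, 0.3,
                0.4, 0.5, 0.9),
              ncol = 3, byrow = TRUE)
ofarness <- apply(fig, 1, min)                     # lowest fig[i, g] over g
farness  <- fig[cbind(seq_len(nrow(fig)), yint)]   # assumed: fig at given class
```

Case 2 is predicted as class 2 although its given class is 1, so its `PAC` exceeds 0.5. Case 3 has a missing label, so `altint`, `PAC` and `farness` are `NA` for it while `predint` and `ofarness` are still defined.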

Author(s)

Raymaekers J., Rousseeuw P.J.

References

Raymaekers J., Rousseeuw P.J. (2021). Silhouettes and quasi residual plots for neural nets and tree-based classifiers.

Examples

library(classmap) # provides data_instagram and vcr.forest.train
library(randomForest)
data("data_instagram")
traindata <- data_instagram[which(data_instagram$dataType == "train"), -13]
set.seed(71) # randomForest is not deterministic
rfout <- randomForest(y~., data = traindata, keep.forest = TRUE)
mytype <- list(symm = c(1, 5, 7, 8)) # These 4 columns are
# (symmetric) binary variables. The variables that are not
# listed are interval-scaled by default.
x_train <- traindata[, -12]
y_train <- traindata[, 12]
# Prepare for visualization:
vcrtrain <- vcr.forest.train(X = x_train, y = y_train,
                            trainfit = rfout, type = mytype)
confmat.vcr(vcrtrain)
stackedplot(vcrtrain, classCols = c(4, 2))
silplot(vcrtrain, classCols = c(4, 2))
classmap(vcrtrain, "genuine", classCols = c(4, 2))
classmap(vcrtrain, "fake", classCols = c(4, 2))

# For more examples, we refer to the vignette:
## Not run: 
vignette("Random_forest_examples")

## End(Not run)

[Package classmap version 1.2.3 Index]