vcr.forest.train {classmap}    R Documentation

Prepare for visualization of a random forest classification on training data

Description

Produces output for constructing graphical displays such as the classmap and silplot. The user first needs to train a random forest on the data with randomForest::randomForest. The fitted forest is then passed to vcr.forest.train as the trainfit argument.

Usage

vcr.forest.train(X, y, trainfit, type = list(),
                 k = 5, stand = TRUE)

Arguments

X

A rectangular matrix or data frame, where the columns (variables) may be of mixed type.

y

factor with the given class labels. It is crucial that X and y are exactly the same as in the call to randomForest::randomForest. y is allowed to contain NA's.

trainfit

the output of a randomForest::randomForest training run.

k

the number of nearest neighbors used in the farness computation.

type

list for specifying some (or all) of the types of the variables (columns) in X, used for computing the dissimilarity matrix, as in cluster::daisy. The list may contain the following components: "ordratio" (ratio scaled variables to be treated as ordinal variables), "logratio" (ratio scaled variables that must be logarithmically transformed), "asymm" (asymmetric binary) and "symm" (symmetric binary variables). Each component's value is a vector, containing the names or the numbers of the corresponding columns of X. Variables not mentioned in the type list are interpreted as usual (see argument X).

stand

whether or not to standardize numerical (interval scaled) variables by their range as in the original cluster::daisy code for the farness computation. Defaults to TRUE.
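
As an illustration of the type argument, the sketch below builds the same specification twice, once with column numbers and once with column names. The data frame and its column names are hypothetical, and resolve() is a helper written only for this example; it is not part of classmap or cluster.

```r
# Hypothetical data frame; the column names are illustrative only.
X <- data.frame(
  verified  = c(0, 1, 0, 1),        # binary variable
  followers = c(10, 2500, 40, 900), # ratio-scaled variable
  private   = c(1, 0, 0, 1),        # binary variable
  bio_len   = c(0, 120, 15, 60)     # not listed, so treated as usual
)

# Each component is a vector of column numbers ...
type_by_number <- list(symm = c(1, 3), logratio = 2)
# ... or of column names.
type_by_name <- list(symm = c("verified", "private"),
                     logratio = "followers")

# Hypothetical helper: turn names into column positions, so that
# the two specifications can be compared.
resolve <- function(type, X) {
  lapply(type, function(v) if (is.character(v)) match(v, names(X)) else v)
}
resolved <- resolve(type_by_name, X)
print(resolved)
```

Either list could be passed as type; columns that appear in no component (here bio_len) are interpreted as usual.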

Value

A list with components:

X

The data used to train the forest.

yint

number of the given class of each case. Can contain NA's.

y

given class label of each case. Can contain NA's.

levels

levels of y

predint

predicted class number of each case. For each case this is the class with the highest posterior probability. Always exists.

pred

predicted label of each case.

altint

number of the alternative class. Among the classes different from the given class, it is the one with the highest posterior probability. Is NA for cases whose y is missing.

altlab

label of the alternative class. Is NA for cases whose y is missing.

PAC

probability of the alternative class. Is NA for cases whose y is missing.

figparams

parameters for computing fig, can be used for new data.

fig

distance of each case i from each class g. Always exists.

farness

farness of each case from its given class. Is NA for cases whose y is missing.

ofarness

for each case i, the lowest fig[i,g] over all classes g. Always exists.

trainfit

The trained random forest which was given as an input to this function.
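
To make the relation between some of these components concrete, here is a minimal base R sketch that derives predint, pred, altint, altlab, farness and ofarness from made-up inputs. The posterior matrix post and the fig matrix are hypothetical toy values, and reading farness as the fig entry of the given class is an assumption based on the descriptions above. This is not the package's internal code.

```r
# Hypothetical posterior probabilities: 4 cases, 3 classes.
levels_y <- c("a", "b", "c")
post <- matrix(c(0.7, 0.2, 0.1,
                 0.3, 0.1, 0.6,
                 0.4, 0.5, 0.1,
                 0.2, 0.2, 0.6),
               ncol = 3, byrow = TRUE,
               dimnames = list(NULL, levels_y))
yint <- c(1, 3, 1, NA)  # given class numbers; the last label is missing

# predint: class with the highest posterior probability (always defined).
predint <- max.col(post, ties.method = "first")
pred <- levels_y[predint]

# altint: best class among those different from the given class;
# stays NA for cases whose y is missing.
altint <- rep(NA_integer_, nrow(post))
for (i in which(!is.na(yint))) {
  p <- post[i, ]
  p[yint[i]] <- -Inf  # exclude the given class
  altint[i] <- which.max(p)
}
altlab <- levels_y[altint]

# Hypothetical fig matrix: distance of each case (row) to each class (column).
fig <- matrix(c(0.1, 0.8, 0.9,
                0.7, 0.6, 0.2,
                0.5, 0.3, 0.9,
                0.9, 0.9, 0.4),
              ncol = 3, byrow = TRUE)

# farness: fig entry of the given class (assumed reading); NA if y is missing.
farness <- fig[cbind(seq_len(nrow(fig)), yint)]
# ofarness: lowest fig[i, g] over all classes g (always defined).
ofarness <- apply(fig, 1, min)

print(data.frame(yint, predint, pred, altint, altlab, farness, ofarness))
```

In this toy input the third case is misclassified (given class a, predicted class b), so its alternative class coincides with its predicted class, while the fourth case has no given label and therefore no altint, altlab or farness.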

Author(s)

Raymaekers J., Rousseeuw P.J.

References

Raymaekers J., Rousseeuw P.J. (2021). Silhouettes and quasi residual plots for neural nets and tree-based classifiers.

See Also

vcr.forest.newdata, classmap, silplot, stackedplot

Examples

library(randomForest)
data("data_instagram")
traindata <- data_instagram[which(data_instagram$dataType == "train"), -13]
set.seed(71) # randomForest is not deterministic
rfout <- randomForest(y~., data = traindata, keep.forest = TRUE)
mytype <- list(symm = c(1, 5, 7, 8)) # These 4 columns are
# (symmetric) binary variables. The variables that are not
# listed are interval-scaled by default.
x_train <- traindata[, -12]
y_train <- traindata[, 12]
# Prepare for visualization:
vcrtrain <- vcr.forest.train(X = x_train, y = y_train,
                            trainfit = rfout, type = mytype)
confmat.vcr(vcrtrain)
stackedplot(vcrtrain, classCols = c(4, 2))
silplot(vcrtrain, classCols = c(4, 2))
classmap(vcrtrain, "genuine", classCols = c(4, 2))
classmap(vcrtrain, "fake", classCols = c(4, 2))

# For more examples, we refer to the vignette:
## Not run: 
vignette("Random_forest_examples")

## End(Not run)

[Package classmap version 1.2.3]