ordfor {ordinalForest}    R Documentation

Ordinal forests

Description

Constructs prediction rules using the ordinal forest (OF) method presented in Hornung (2020).
The following tasks can be performed using OF: 1) Predicting the values of an ordinal target variable for new observations based on covariate values (see predict.ordfor); 2) Ranking the importances of the covariates with respect to predicting the values of the ordinal target variable.
The default values for the hyperparameters nsets, ntreeperdiv, ntreefinal, npermtrial, and nbest were found to be in a reasonable range in Hornung (2020) and it should not be necessary to alter these values in most situations.
For details on OFs see the 'Details' section below.
NOTE: Starting with package version 2.4, it is also possible to obtain class probability predictions in addition to the class point predictions, and variable importance values based on the class probabilities, by using the (negative) ranked probability score (Epstein, 1969) as performance function (perffunction="probability"). Using the ranked probability score in the variable importance can be expected to deliver more stable variable rankings, because the ranked probability score accounts for the ordinal scale of the dependent variable. In situations in which class predictions are sufficient and class probabilities are not needed, other performance functions may be more suitable. See the subsection "Performance functions" in the "Details" section below for further details.

Usage

ordfor(
  depvar,
  data,
  nsets = 1000,
  ntreeperdiv = 100,
  ntreefinal = 5000,
  importance = c("rps", "accuracy"),
  perffunction = c("equal", "probability", "proportional", "oneclass", "custom"),
  classimp,
  classweights,
  nbest = 10,
  naive = FALSE,
  num.threads = NULL,
  npermtrial = 500,
  permperdefault = FALSE,
  mtry = NULL,
  min.node.size = NULL,
  replace = TRUE,
  sample.fraction = ifelse(replace, 1, 0.632),
  always.split.variables = NULL,
  keep.inbag = FALSE
)

Arguments

depvar

character. Name of the dependent variable in data.

data

data.frame. Data frame containing the covariates and a factor-valued ordinal target variable. The order of the levels of the latter has to correspond to the order of the ordinal classes of the target variable.

nsets

integer. Number of score sets tried prior to the approximation of the optimal score set.

ntreeperdiv

integer. Number of trees in the smaller regression forests constructed for each of the nsets different score sets tried.

ntreefinal

integer. Number of trees in the larger regression forest constructed using the optimized score set (i.e., the OF).

importance

character. The type of variable importance measure to use. The default "rps" uses the ranked probability score as an error measure. If set to "accuracy", the importance measure is based on the accuracy. The latter choice corresponds to the default importance measure of random forests and does not take the ordinal scale of the target variable into account. NOTE: If the ranked probability score is used as performance function (perffunction="probability"), importance is set to "rps" automatically. Preliminary results indicate that the option "rps" might lead to a better discrimination between influential and non-influential covariates.

perffunction

character. Performance function. The default is "equal". See 'Details', subsection 'Performance functions' below and perff.

classimp

character. Class to prioritize if perffunction="oneclass".

classweights

numeric. Needed if perffunction="custom": vector of class weights with length equal to the number of classes. The higher the weight w_j assigned to class j, the higher the accuracy of the OF with respect to discerning observations in class j from observations not in class j will tend to be.

nbest

integer. Number of best score sets used to calculate the optimized score set.

naive

boolean. If set to TRUE, a naive ordinal forest is constructed, that is, the score set used for the classes of the target variable is not optimized, but instead the following (naive) scores are used: 1,2,3,... Note that it is strongly recommended to set naive=FALSE (default). The only advantage of choosing naive=TRUE is that the computational burden is reduced. However, the precision of the predictions of a prediction rule obtained using naive ordinal forest can be considerably worse than that of a corresponding prediction rule obtained using ordinal forest.

num.threads

integer. Number of threads. Default is number of CPUs available (passed to the modified ranger code).

npermtrial

integer. Number of permutations of the class width ordering to try for the 2nd to the nsets-th score set tried prior to the calculation of the optimized score set.

permperdefault

boolean. If set to TRUE, npermtrial different permutations are tried by default for the 2nd to the nsets-th score set used during the optimization, even if J! < nsets. Default is FALSE.

mtry

integer. Number of variables to sample as candidate variables for each split. Default is the (rounded down) square root of the number of variables.

min.node.size

integer. Minimal node size. Default is 5, except if perffunction="probability", in which case the default is 10.

replace

boolean. Sample with replacement. Default is TRUE.

sample.fraction

numeric. Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement.

always.split.variables

character. Character vector with variable names to be always selected in addition to the mtry variables tried for splitting.

keep.inbag

boolean. Save how often observations are in-bag in each tree. Default is FALSE.
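The requirement stated for the data argument, that the factor levels follow the order of the ordinal classes, can be checked before calling ordfor. The following is a minimal base R sketch; the data frame and the variable names risk and age are made up for illustration.

```r
# Hypothetical toy data: 'risk' is the ordinal target, 'age' a covariate.
dat <- data.frame(
  risk = c("low", "high", "medium", "low", "medium"),
  age  = c(34, 61, 47, 29, 52),
  stringsAsFactors = FALSE
)

# factor() sorts levels alphabetically by default, which is wrong here.
default_levels <- levels(factor(dat$risk))

# State the class order explicitly so that level 1 < level 2 < level 3.
dat$risk <- factor(dat$risk, levels = c("low", "medium", "high"))

risk_levels <- levels(dat$risk)     # intended class order
risk_codes  <- as.integer(dat$risk) # class indices 1,...,J
```

The name of the target variable ("risk" in this sketch) is what would be passed as depvar.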

Details

Introduction

The ordinal forest (OF) method allows ordinal regression with high-dimensional and low-dimensional data. After having constructed an OF prediction rule using a training dataset, it can be used to predict the values of the ordinal target variable for new observations. Moreover, by means of the (permutation-based) variable importance measure of OF, it is also possible to rank the covariates with respect to their importance in the prediction of the values of the ordinal target variable.
OF is presented in Hornung (2020). See the latter publication for details on the method. In the following, a brief, practice-orientated introduction to OF is provided.

Methods

The concept of OF is based on the following assumption: There exists a (possibly latent) refined continuous variable y* underlying the observed ordinal target variable y (y in {1,...,J}, J number of classes), where y* determines the values of y. The functional relationship between y* and y takes the form of a monotonically increasing step function. Depending on which of J intervals ]c_1, c_2], ]c_2, c_3], ..., ]c_J, c_{J+1}[ contains the value of y*, the ordinal target variable y takes a different value.
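The step function relating y* to y can be illustrated in base R. This is only a sketch: the cut points and the values of y* are invented, and setting c_1 = -Inf and c_{J+1} = Inf is an assumption made here so that every value of y* falls into some interval.

```r
# Hypothetical cut points c_1 < c_2 < c_3 < c_4 for J = 3 classes.
cuts <- c(-Inf, -0.5, 0.8, Inf)

# Hypothetical values of the latent continuous variable y*.
ystar <- c(-1.2, -0.5, 0.1, 0.8, 2.3)

# right = TRUE gives intervals of the form ]a, b], as in the text above;
# labels = FALSE returns the class index 1,...,J.
y <- cut(ystar, breaks = cuts, labels = FALSE, right = TRUE)
```

Because the mapping is monotonically increasing, sorting the observations by y* also sorts them by y.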

In situations in which the values of the continuous target variable y* are known, they can be used in regression techniques for continuous response variables. The OF method is, however, concerned with settings in which only the values of the classes of the ordinal target variable are given. The main idea of OF is to optimize score values s_1,...,s_J to be used in place of the class values 1,...,J of the ordinal target variable in standard regression forests by maximizing the out-of-bag (OOB) prediction performance measured by a performance function g (see section "Performance functions").
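Using scores in place of class values amounts to a simple lookup. The sketch below assumes an invented score set for J = 3 classes; in OF the scores would come from the optimization, not from the user.

```r
# Ordinal target with levels in class order.
y <- factor(c("low", "high", "medium", "low"),
            levels = c("low", "medium", "high"))

# Hypothetical score set s_1, s_2, s_3 (values chosen for illustration only).
scores <- c(-1.1, 0.2, 1.4)

# Replace each class index j by its score s_j; this numeric vector is what
# a standard regression forest would be fitted to.
yscore <- scores[as.integer(y)]
```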

The approximation of the optimal score set consists of two steps:
1) Construct a large number of regression forests (b in 1,...,nsets) featuring limited numbers of trees, where each of these uses as the values of the target variable a randomly generated score set s_{b,1},...,s_{b,J}. For each forest constructed, calculate the value of the performance function g using the OOB estimated predictions of the values of the ordinal target variable and the corresponding true values.
2) Calculate the approximated optimal score set s_1,...,s_J as a summary over the nbest best score sets generated in 1), that is, those nbest score sets that were associated with the highest values of the performance function g.
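Step 2 can be sketched in base R under simplifying assumptions. Each candidate is represented as a partition of [0,1] with J+1 borders, following the description of bordersb and bordersbest in the 'Value' section. The partitions are drawn here as sorted uniform numbers and the performance values are random placeholders, neither of which is the procedure used by the package.

```r
set.seed(1)
J     <- 4   # number of classes
nsets <- 20  # number of candidate partitions (kept small for illustration)
nbest <- 5   # number of best candidates to summarize

# Assumption: simple random partitions 0 < d_2 < ... < d_J < 1 of [0,1].
bordersb <- t(replicate(nsets, c(0, sort(runif(J - 1)), 1)))

# Placeholder for the OOB performance of the forest built on each candidate.
perffunctionvalues <- runif(nsets)

# Indices of the nbest candidates with the highest performance values.
best <- order(perffunctionvalues, decreasing = TRUE)[seq_len(nbest)]

# Summarize the best candidates by averaging their borders column-wise.
bordersbest <- colMeans(bordersb[best, , drop = FALSE])
```

Averaging sorted border vectors yields a sorted vector again, so the result is itself a valid partition of [0,1].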

After calculating the optimized score set, a larger regression forest is constructed using this optimized score set s_1,...,s_J for the class values 1,...,J of the target variable. This regression forest is the OF prediction rule.

Except in the case of using the (negative) ranked probability score as performance function, prediction is performed by majority voting of the predictions of the individual trees in the OF. If the (negative) ranked probability score is used as performance function, both class predictions and predicted class probabilities are provided: The class probabilities are obtained by averaging over the class probabilities predicted by the individual trees and the class predictions are obtained as the classes with maximum class probabilities.
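The two aggregation schemes can be illustrated with made-up tree outputs. This base R sketch does not use the package's internals; the tree-level predictions and probabilities are invented.

```r
classes <- c("low", "medium", "high")

# Majority voting: hypothetical class predictions of 5 trees (rows)
# for 3 observations (columns).
treepred <- matrix(c("low",    "low",    "medium", "low",    "high",
                     "medium", "medium", "high",   "medium", "low",
                     "high",   "high",   "high",   "medium", "high"),
                   nrow = 5)
majority <- apply(treepred, 2, function(v) {
  names(which.max(table(factor(v, levels = classes))))
})

# Probability averaging: hypothetical class probabilities from 2 trees,
# each a matrix with one row per observation and one column per class.
treeprob <- list(
  matrix(c(0.6, 0.3, 0.1,
           0.2, 0.5, 0.3), nrow = 2, byrow = TRUE),
  matrix(c(0.4, 0.4, 0.2,
           0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
)
avgprob   <- Reduce(`+`, treeprob) / length(treeprob)
classpred <- classes[max.col(avgprob, ties.method = "first")]
```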

OF features a permutation variable importance measure that, if importance is set to "rps" (default), uses the ranked probability score as error measure and otherwise (importance="accuracy") the misclassification error.
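To show why the ranked probability score respects the ordinal scale, the following sketch implements one common, unnormalized form of the score: the sum of squared differences between cumulative predicted and cumulative observed class probabilities. The function name rps and the absence of a normalizing constant are choices made for this illustration and may differ from the package's implementation.

```r
# prob:  matrix of predicted class probabilities (rows = observations,
#        columns = classes in ordinal order)
# ytrue: factor with levels in the same class order
rps <- function(prob, ytrue) {
  J       <- ncol(prob)
  obs     <- diag(J)[as.integer(ytrue), , drop = FALSE]  # one-hot truth
  cumprob <- t(apply(prob, 1, cumsum))
  cumobs  <- t(apply(obs, 1, cumsum))
  mean(rowSums((cumprob - cumobs)^2))
}

ytrue <- factor("c", levels = c("a", "b", "c"))

rps_far   <- rps(matrix(c(1, 0, 0), nrow = 1), ytrue)  # predicts class a
rps_near  <- rps(matrix(c(0, 1, 0), nrow = 1), ytrue)  # predicts class b
rps_exact <- rps(matrix(c(0, 0, 1), nrow = 1), ytrue)  # predicts class c
```

A prediction two classes away from the truth is penalized more than a prediction one class away, whereas the misclassification error treats both as equally wrong.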

Hyperparameters

There are several hyperparameters. In general, however, the user does not have to optimize them, because the default values were seen to be in a reasonable range and the results seem to be quite robust with respect to the choices of the hyperparameter values.

These hyperparameters are described in the following:

Performance functions

As noted above, the different score sets tried during the estimation of the optimal score set are assessed with respect to their OOB prediction performance. The choice of the specific performance function used in these assessments determines the specific kind of performance the ordinal forest should feature:
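As an illustration of how class weights can steer such a performance function, the sketch below computes a weighted mean of per-class Youden indices (sensitivity + specificity - 1 for class j versus all other classes). That this matches the package's internal definitions is an assumption; the helper names youden_class and weighted_perf are hypothetical, and perff is the reference for the actual implementation.

```r
# Youden index for discerning class 'cl' from all other classes.
youden_class <- function(ytrue, ypred, cl) {
  pos  <- ytrue == cl
  sens <- mean(ypred[pos] == cl)    # share of class-cl cases found
  spec <- mean(ypred[!pos] != cl)   # share of other cases not labelled cl
  sens + spec - 1
}

# Weighted mean of the per-class Youden indices.
weighted_perf <- function(ytrue, ypred, weights) {
  cls <- levels(ytrue)
  yj  <- vapply(cls, function(cl) youden_class(ytrue, ypred, cl), numeric(1))
  sum(weights * yj) / sum(weights)
}

lev   <- c("a", "b", "c")
ytrue <- factor(c("a", "a", "b", "b", "c", "c"), levels = lev)
ypred <- factor(c("a", "b", "b", "b", "c", "a"), levels = lev)

perf_equal <- weighted_perf(ytrue, ypred, weights = c(1, 1, 1))
perf_b     <- weighted_perf(ytrue, ypred, weights = c(0, 1, 0))
```

With equal weights every class contributes equally; putting all weight on one class rewards only the ability to separate that class from the rest.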

Value

ordfor returns an object of class ordfor. An object of class "ordfor" is a list containing the following components:

forestfinal

object of class "ranger". Regression forest constructed using the optimized score set (i.e., the OF). Required by predict.ordfor.

bordersbest

vector of length J+1. Average over the nbest best partitions of [0,1]. Required by predict.ordfor.

forests

list of length nsets. The regression forests constructed for the nsets different score sets tried prior to the approximation of the optimal score set.

perffunctionvalues

vector of length nsets. Performance function values for all score sets tried prior to the approximation of the optimal score set.

bordersb

matrix of dimension nsets x (J+1). All nsets partitions of [0,1] considered.

classes

character vector of length J. Classes of the target variable.

nsets

integer. Number of score sets tried prior to the approximation of the optimal score set.

ntreeperdiv

integer. Number of trees per score set considered.

ntreefinal

integer. Number of trees of the OF prediction rule.

perffunction

character. Performance function used.

classimp

character. If perffunction="oneclass": class to prioritize, NA else.

nbest

integer. Number of best score sets used to approximate the optimal score set.

classfreq

table. Class frequencies.

varimp

vector of length p. Permutation variable importance for each covariate. If perffunction="probability", the ranked probability score is used as error measure in the variable importance. For all other choices of the performance function, the misclassification error is used.

References

Examples

## Not run: 
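# Load the example data shipped with the package
data(hearth)

# Use a random half of the observations to keep the run time short
set.seed(123)
hearthsubset <- hearth[sort(sample(seq_len(nrow(hearth)), size=floor(nrow(hearth)/2))),]

# Construct the ordinal forest with reduced settings (for illustration only)
ordforres <- ordfor(depvar="Class", data=hearthsubset, nsets=50, nbest=5, ntreeperdiv=100,
  ntreefinal=1000)
# NOTE: nsets=50 is not enough, because the prediction performance of the resulting
# ordinal forest will be suboptimal!! In practice, nsets=1000 (default value) or a
# larger number should be used.

# Print the fitted object
ordforres

# Covariates ranked by permutation variable importance
sort(ordforres$varimp, decreasing=TRUE)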

## End(Not run)


[Package ordinalForest version 2.4-3 Index]