R: Interval Censored Recursive Forests (ICRF)

icrf {icrf}

R Documentation

Description

icrf implements the ICRF algorithm to estimate the conditional survival probability for interval censored survival data. (It can also be used for right-censored survival data and current status data.) icrf recursively builds random forests, optionally using the extremely randomized trees (ERT) algorithm (see `ERT`), and applies kernel smoothing in the time domain. The icrf package builds on the randomForest package by Andy Liaw and Matthew Wiener. (Quoted statements are from randomForest by Liaw and Wiener unless otherwise mentioned.)

Usage

icrf(x, ...)

## Default S3 method:
icrf(
  x,
  L,
  R,
  tau = max(R[is.finite(R)]) * 1.5,
  bandwidth = NULL,
  quasihonesty = TRUE,
  initialSmoothing = TRUE,
  timeSmooth = NULL,
  xtest = NULL,
  ytest = NULL,
  nfold = 5L,
  ntree = 500L,
  mtry = ceiling(sqrt(p)),
  split.rule = c("Wilcoxon", "logrank", "PetoWilcoxon", "PetoLogrank", "GWRS", "GLR",
    "SWRS", "SLR"),
  ERT = FALSE,
  uniformERT = ERT,
  returnBest = sampsize < n,
  imse.monitor = 1,
  replace = !ERT,
  sampsize = ifelse(ERT, 0.95, 0.632) * n,
  nodesize = 6L,
  maxnodes = NULL,
  importance = FALSE,
  nPerm = 1,
  proximity,
  oob.prox = ifelse(sampsize == n & !replace, FALSE, proximity),
  do.trace = FALSE,
  keep.forest = is.null(xtest),
  keep.inbag = FALSE,
  ...
)

## S3 method for class 'formula'
icrf(
  formula,
  data = NULL,
  data.type = c("interval", "right", "currentstatus"),
  interval.label = c("L", "R"),
  right.label = c("T", "status"),
  currentstatus.label = c("monitor", "status"),
  ...,
  na.action = na.fail,
  epsilon = NULL
)

## S3 method for class 'icrf'
print(x, ...)

Arguments

`x`	a data frame or a matrix of predictors. `x` is not needed when `formula` is specified.
`...`	optional arguments to be passed to icrf.default.
`L`, `R`	the left and right end points of the intervals. `R` should be greater than or equal to `L`. In case of equality, a small number `epsilon` (the smaller of the minimum nonzero interval length and 1e-10) is added so that the two end points differ.
`tau`	the study end time. ([0, `tau`] is the window for the analysis.)
`bandwidth`	a non-negative number. The bandwidth of the kernel smoothing. For faster computing, set `bandwidth = 0` for no smoothing.
`quasihonesty`	if `TRUE`, the terminal node prediction is given by the NPMLE of the interval data. If `FALSE`, the terminal node prediction is given by the average of the conditional probabilities (exploitative).
`initialSmoothing`	if `TRUE`, the initial survival curve used for interval-conditional survival probability estimate is smoothed using the Gaussian kernel.
`timeSmooth`	a numeric vector of time points at which the smoothed survival curves are estimated. It should be in increasing order. If `NULL`, the set of distinct interval end points is used.
`xtest`	a data frame or matrix of predictors for the test dataset.
`ytest`	the true survival curves for the test set, in the form of a data frame or matrix. The number of rows is the same as in `xtest`, and each column corresponds to a time point in `timeSmooth`.
`nfold`	Number of forests to iterate. In practice, a number between 5 and 10 is reasonable.
`ntree`	Number of trees to build within each forest. 'This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.'
`mtry`	Number of candidate predictors tried at each split. The default value is ceiling(sqrt(p)), where p is the number of variables in `x`.
`split.rule`	Splitting rules. See details. The default is `"Wilcoxon"`, or equivalently `"GWRS"`.
`ERT`	If `ERT=TRUE`, the ERT algorithm is applied. If `FALSE`, a comprehensive greedy algorithm (Breiman's random forest algorithm) is applied.
`uniformERT`	Only relevant when `ERT=TRUE`. If `uniformERT=TRUE`, random candidate cutpoints are drawn from a uniform distribution. If `FALSE`, random candidate cutpoints are chosen among the midpoints of two neighboring predictor values.
`returnBest`	If `returnBest=TRUE`, the survival curve estimate at the best iteration is returned. If `FALSE`, the estimate at the last iteration is returned. The best iteration is determined by the type of IMSE measures specified in `imse.monitor`. By default, `returnBest=TRUE` when the out-of-bag sample is available (sampsize < n).
`imse.monitor`	The type of IMSE (1 or 2) used to determine which iteration is the best. See `returnBest`.
`replace`	If `TRUE`, the cases are sampled with replacement; if `FALSE`, without replacement.
`sampsize`	Size of the random sample to draw.
`nodesize`	Each terminal node cannot be smaller than this value. 'Setting this number larger causes smaller trees to be grown (and thus take less time).'
`maxnodes`	Up to how many terminal nodes can a tree have? 'If not given, trees are grown to the maximum possible (subject to limits by nodesize). If set larger than maximum possible, a warning is issued.'
`importance`	If `TRUE`, variable importance measure will be computed.
`nPerm`	How many permutations (of OOB data) to do for variable importance assessment? 'Number larger than 1 gives slightly more stable estimate, but not very effective. Currently only implemented for regression.'
`proximity`	If `TRUE`, proximity measure among the cases is calculated.
`oob.prox`	If `TRUE`, proximity is calculated only on "out-of-bag" data.
`do.trace`	If `TRUE`, intermediate outputs are printed during the tree building procedure. 'If set to some integer, then running output is printed for every do.trace trees.'
`keep.forest`	'If set to FALSE, the forest will not be retained in the output object. If xtest is given, defaults to FALSE.'
`keep.inbag`	'Should an n by ntree matrix be returned that keeps track of which samples are "in-bag" in which trees (but not how many times, if sampling with replacement)'
`formula`, `data.type`, `interval.label`, `right.label`, `currentstatus.label`	`formula` is a formula object, with the response given as a `Surv` object of type 'interval2' or as a `cbind`. Alternatively, the survival outcome may be omitted from the formula, and the labels relevant to the survival outcome can be supplied in `interval.label`, `right.label`, or `currentstatus.label`, with `data.type` specified accordingly.
`data`	a data frame that includes the intervals and the predictor values.
`na.action`	'a function to specify the action to be taken if NAs are found. (NOTE: If given, this argument must be named.)'
`epsilon`	A small positive value needed to discriminate the left and right interval end points for the uncensored data.
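
The default values in the Usage signature and the `epsilon` rule described for `L`, `R` can be reproduced in a few lines of base R. The following is a minimal sketch on made-up toy intervals, not the package's internal code. In particular, adding `epsilon` to `R` (rather than adjusting `L`) is an assumption made for illustration, and the names `len` and `R.adj` are hypothetical.

```r
# Toy interval data (made up for illustration): 5 subjects, 3 predictors.
L <- c(0, 2, 3, 5, 4)
R <- c(1.5, 2, 6, Inf, 7)  # subject 2 is exactly observed (L == R);
                           # subject 4 is right-censored (R = Inf)
n <- length(L)
p <- 3
ERT <- FALSE

# Default study end time: 1.5 times the largest finite right end point.
tau <- max(R[is.finite(R)]) * 1.5

# Default number of candidate predictors tried at each split.
mtry <- ceiling(sqrt(p))

# Default subsample size: 95% of n under ERT, 63.2% of n otherwise.
sampsize <- ifelse(ERT, 0.95, 0.632) * n

# epsilon as described above: the smaller of the minimum nonzero
# interval length and 1e-10.
len <- R - L
epsilon <- min(min(len[len > 0]), 1e-10)

# ASSUMPTION: epsilon is added to R when L == R, so that R > L everywhere.
R.adj <- ifelse(R == L, R + epsilon, R)
```

With these toy values, `tau` is 10.5, `mtry` is 2, and `sampsize` is 3.16.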

Details

Four split.rule options are available: Wilcoxon, logrank, PetoWilcoxon, and PetoLogrank. Their aliases are GWRS, GLR, SWRS, and SLR, respectively. The first two are the generalized Wilcoxon-rank-sum test and the generalized log-rank test proposed in Cho et al. (2020+), and the latter two are the score-based Wilcoxon-rank-sum test and the score-based log-rank test proposed by Peto and Peto (1972), "Asymptotically efficient rank invariant test procedures."
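
Because each rule has two names, it can help to normalize a user-supplied value before comparing or reporting results. The lookup below is a hypothetical helper (`canonical_rule` is not part of icrf) that maps each alias listed above to its long name. It is only a sketch of the naming scheme and says nothing about how the package resolves `split.rule` internally.

```r
# Hypothetical helper, NOT part of the icrf package.
# Maps the aliases GWRS, GLR, SWRS, SLR to their long names.
canonical_rule <- function(split.rule = c("Wilcoxon", "logrank", "PetoWilcoxon",
                                          "PetoLogrank", "GWRS", "GLR", "SWRS",
                                          "SLR")) {
  # With no argument supplied, match.arg() returns the first choice.
  split.rule <- match.arg(split.rule)
  aliases <- c(GWRS = "Wilcoxon", GLR = "logrank",
               SWRS = "PetoWilcoxon", SLR = "PetoLogrank")
  if (split.rule %in% names(aliases)) aliases[[split.rule]] else split.rule
}

canonical_rule()       # "Wilcoxon"
canonical_rule("SLR")  # "PetoLogrank"
```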

Value

An icrf class object which contains the following components in a list:

`call`	the original call to icrf.
`method`	the input values of `split.rule`, `ERT`, `quasihonesty`, `bandwidth`, and the subsample ratio (= sampsize / n).
`predicted`	the estimated survival curves of the training set using out-of-bag samples.
`predictedNO`	the estimated survival curves of the training set using non-out-of-bag samples.
`predictedNO.Sm`	the smoothed survival curves of the training set using non-out-of-bag samples.
`time.points`	time points at which the survival curves are estimated.
`time.points.smooth`	time points at which the smoothed survival curves are estimated.
`imse.oob`	integrated mean squared error (IMSE) measured based on the out-of-bag samples.
`imse.NO`	integrated mean squared error (IMSE) measured based on the non-out-of-bag samples.
`oob.times`	number of times for which each case was 'out-of-bag'.
`importance`	an array of three matrices, each with p (number of predictors) rows and nfold columns. The three matrices measure importance by the increase in IMSE type 1, the increase in IMSE type 2, and the node impurity, respectively.
`importanceSD`	'The "standard errors" of the permutation-based importance measure.' A p by nfold by 2 array corresponding to the first two matrices of the importance array.
`nfold`	number of forests iterated over.
`ntree`	number of trees built.
`mtry`	number of candidate predictors tried at each node.
`forest`	'a list that contains the entire forest;' NULL 'if keep.forest=FALSE.'
`intervals`	n by 2 matrix of the intervals.
`proximity`	if `proximity=TRUE` when icrf is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes).
`inbag`	if `keep.inbag=TRUE`, a matrix of in-bag indicators for the last forest iteration.
`runtime`	start and end times and the elapsed time.
`test`	if a test set is given (through the `xtest` or additionally `ytest` arguments), this component is a list which contains the corresponding predicted values and error measures (IMSEs). If `proximity=TRUE`, there is also a component, proximity, which contains the proximity among the test set as well as the proximity between test and training data.

Author(s)

Hunyong Cho, Nicholas P. Jewell, and Michael R. Kosorok.

References

Cho H., Jewell N. P., and Kosorok M. R. (2020+). "Interval censored recursive forests."

Examples

# rats data example.
# The type of this dataset is current status data.
# Note that this is a toy example. Use a larger ntree and nfold in practice.
data(rat2)


set.seed(2)
# 1. formula (currentstatus)
rats.icrf <-
  icrf(~ dose.lvl + weight + male + cage.no, data = rat2,
       data.type = "currentstatus", currentstatus.label = c("survtime", "tumor"),
       returnBest = TRUE, ntree=10, nfold=3)

# 2. formula containing the interval
# Alternatively, create the interval endpoints and use the Surv object.
L = ifelse(rat2$tumor, 0, rat2$survtime)
R = ifelse(rat2$tumor, rat2$survtime, Inf)
library(survival) # for Surv function
icrf(Surv(L, R, type = "interval2") ~ dose.lvl + weight + male + cage.no, data = rat2,
     ntree=10, nfold=3)

# Or, 3. formula (interval)
rat2b <- cbind(rat2, L = L, R = R)
set.seed(1)
icrf( ~ dose.lvl + weight + male + cage.no, data = rat2b,
     data.type = "interval", interval.label = c("L", "R"),
     ntree=10, nfold=3)

# 4. default method
set.seed(1)
icrf(rat2[, c("dose.lvl", "weight", "male", "cage.no")], L = L, R = R,
     ntree=10, nfold=3)
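
The Description notes that right-censored data are also supported. As a sketch of how such data relate to the interval form built in example 2, the base-R snippet below converts made-up right-censored observations into left and right end points: an observed event gives equal end points, and a censored case gives an interval from the censoring time to `Inf`. The toy values and the names `time`, `status`, `L.rc`, and `R.rc` are illustrative assumptions and do not come from the package or its datasets.

```r
# Made-up right-censored data: status 1 = event observed, 0 = censored.
time   <- c(3.2, 5.0, 1.7, 4.4)
status <- c(1, 0, 1, 0)

# Interval form: exact events have L == R; censored cases have R = Inf.
L.rc <- time
R.rc <- ifelse(status == 1, time, Inf)

cbind(L = L.rc, R = R.rc)
```

The exact observations (rows with `L == R`) are the cases for which the `epsilon` argument matters.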

[Package icrf version 2.0.2 Index]