R: Dichotomize via Recursive Partitioning

rpartD {Qindex}

R Documentation

Dichotomize via Recursive Partitioning

Description

Dichotomize one or more predictors of a Surv, a logical, or a double response, using recursive partitioning and regression trees via rpart.

Usage

rpartD(
  y,
  x,
  check_degeneracy = TRUE,
  cp = .Machine$double.eps,
  maxdepth = 2L,
  ...
)

m_rpartD(y, X, check_degeneracy = TRUE, ...)

Arguments

`y`	a Surv object, a logical vector, or a double vector, the response `y`
`x`	numeric vector, one predictor `x`
`check_degeneracy`	logical scalar, whether to check if the dichotomized value is all-`FALSE` or all-`TRUE` (i.e., degenerate) for any one of the predictors. Default `TRUE`, which produces a warning message on degeneracy.
`cp`	double scalar, complexity parameter, see rpart.control. Default `.Machine$double.eps`, so that a split is enforced no matter how small the improvement in overall `R^2` is
`maxdepth`	positive integer scalar, maximum depth of any node, see rpart.control. Default `2L`, because only the first node is needed
`...`	additional parameters of rpart and/or rpart.control
`X`	numeric matrix, a set of predictors. Each column of `X` is one predictor.

Details

Dichotomize Single Predictor

Function rpartD() dichotomizes one predictor in the following steps:

1. A recursive partitioning and regression tree (rpart) analysis is performed for the response y and the predictor x.
2. The label (see labels.rpart) of the first node of the rpart tree is taken as the dichotomizing rule of the double predictor x. The term dichotomizing rule refers to the combination of an inequality sign (>, >=, < or <=) and a double cutoff threshold a.
3. The dichotomizing rule from Step 2 is further processed, such that
- < a is regarded as >= a
- <= a is regarded as > a
- > a and >= a are kept as is.
This step is necessary so that the rule always reads as greater than, or greater than or equal to, the threshold a.
4. A warning message is produced if the dichotomizing rule, applied to a new double predictor newx, creates an all-TRUE or all-FALSE result. We do not make the algorithm stop, as most regression models in R are capable of handling an all-TRUE or all-FALSE predictor by returning an NA_real_ regression coefficient estimate.
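The steps above can be sketched with rpart alone. This is a minimal illustration on simulated data, not the internal code of rpartD(): the names fit, rule, cutoff and dichotomize are hypothetical, and the sketch assumes the first split on a continuous predictor is reported by rpart as either "<" or ">=", so that Step 3 always yields a ">=" rule.

```r
library(rpart)

# Simulated data (an assumption for illustration): y jumps when x exceeds 0.6
set.seed(1)
x <- runif(200)
y <- 2 * as.numeric(x > 0.6) + rnorm(200, sd = 0.1)

# Step 1: shallow tree; the tiny cp enforces a split whenever one is possible
fit <- rpart(y ~ x, cp = .Machine$double.eps, maxdepth = 2L)

# Step 2: labels() returns "root" first, then the label of the first node
rule <- labels(fit)[2L]

# Cutoff of the primary split at the root, stored in the 'index' column
cutoff <- unname(fit$splits[1L, 'index'])

# Step 3: express the rule on the "greater than or equal to" side
dichotomize <- function(newx) {
  ret <- (newx >= cutoff)
  attr(ret, 'cutoff') <- cutoff
  ret
}

# Step 4: warn, but do not stop, on an all-TRUE or all-FALSE result
res <- dichotomize(c(0.1, 0.9))
if (all(res) || !any(res)) warning('degenerate dichotomized predictor')
```

Returning a closure keeps the cutoff learned from (y, x) attached to the rule, so the same threshold can be applied later to a new predictor.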

Dichotomize Multiple Predictors

Function m_rpartD() dichotomizes each predictor X[,i] based on the response y using function rpartD(). When the multiple dichotomizing rules are applied to a new set of predictors newX:

- A warning message is produced if at least one of the dichotomized predictors is all-TRUE or all-FALSE.
- We do not check whether two or more of the dichotomized predictors are identical to each other. This situation is handled in the helper function coef_dichotom().
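The degeneracy check for several predictors can be sketched in base R as follows. The cutoffs and the matrix newX below are made-up values for illustration, not output of m_rpartD(), and every rule is assumed to have the form "column >= cutoff".

```r
# Hypothetical cutoffs, one per predictor (not taken from any fitted model)
cutoffs <- c(age = 60, g2 = 13, gleason = 11)

# Hypothetical new predictors; column names must match the cutoff names
newX <- cbind(
  age = c(55, 62, 70),
  g2 = c(10.2, 14.8, 12.1),
  gleason = c(5, 6, 7)
)

# Apply each rule "column >= cutoff" to the matching column
ret <- sweep(newX[, names(cutoffs), drop = FALSE],
             MARGIN = 2L, STATS = cutoffs, FUN = `>=`)
attr(ret, 'cutoff') <- cutoffs

# A dichotomized predictor is degenerate when it is all-TRUE or all-FALSE
degenerate <- apply(ret, MARGIN = 2L, FUN = function(z) all(z) || !any(z))
if (any(degenerate)) {
  warning('degenerate dichotomized predictor(s): ',
          paste(names(which(degenerate)), collapse = ', '))
}
```

Here only the gleason column is flagged, because none of its values reaches the made-up cutoff of 11.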

Value

Dichotomize Single Predictor

Function rpartD() returns a function with a double vector parameter newx. The returned value of rpartD(y,x)(newx) is a logical vector with the attribute

attr(,'cutoff'): double scalar, the cutoff value for newx

Dichotomize Multiple Predictors

Function m_rpartD() returns a function with a double matrix parameter newX. The argument for newX must have the same number of columns and the same column names as the input matrix X. The returned value of m_rpartD(y,X)(newX) is a logical matrix with the attribute

attr(,'cutoff'): named double vector, the cutoff values for each predictor in newX

Note

In the future, integer and factor predictors will be supported.

Examples

## Dichotomize Single Predictor
data(cu.summary, package = 'rpart') # see more details from ?rpart::cu.summary
with(cu.summary, rpartD(y = Price, x = Mileage, check_degeneracy = FALSE))
(foo = with(cu.summary, rpartD(y = Price, x = Mileage)))
foo(rnorm(10, mean = 24.5))

## Dichotomize Multiple Predictors
library(survival)
data(stagec, package = 'rpart') # see more details from ?rpart::stagec
nrow(stagec) # 146
(foo = with(stagec[1:100,], m_rpartD(y = Surv(pgtime, pgstat), X = cbind(age, g2, gleason))))
foo(as.matrix(stagec[-(1:100), c('age', 'g2', 'gleason')]))

[Package Qindex version 0.1.5 Index]