R: Optimal Dichotomizing Predictors via Repeated Sample Splits

optimSplit_dichotom {Qindex}

R Documentation

Optimal Dichotomizing Predictors via Repeated Sample Splits

Description

The functions documented here are:

optimSplit_dichotom(): to identify the optimal dichotomizing predictors using repeated sample splits.
split_dichotom(): a helper function to perform a univariable regression model on the test set with a dichotomized predictor, using a dichotomizing rule determined by a recursive partitioning of the training set.
quantile_split_dichotom(): a helper function to locate a quantile of multiple split_dichotom objects, based on the estimated univariable regression coefficient.

Usage

optimSplit_dichotom(formula, data, include, top = 1L, nsplit, ...)

split_dichotom(y, x, index, ...)

quantile_split_dichotom(y, x, indices = rSplit(y, ...), probs = 0.5, ...)

Arguments

`formula`	formula. The left-hand side names a Surv, logical, or double response `y`. The right-hand side gives the candidate numeric predictors in `data`, either as the name of a numeric matrix column (e.g., `y ~ X`), or as the names of several numeric vector columns (e.g., `y ~ x1 + x2 + x3`).
`data`	data.frame, containing the response and predictors in `formula`
`include`	language object, inclusion criteria for the optimal dichotomizing predictors. A suggested choice is `(highX>.15 & highX<.85)` to guarantee a user-desired range of proportions in `highX`. See explanation of `highX` in helper function `split_dichotom()`.
`top`	positive integer scalar, number of optimal dichotomizing predictors, default `1L`
`nsplit`, `...`	additional parameters for function `rSplit()`
`y`	(for helper functions) a Surv object, a logical vector, or a double vector, the response `y`
`x`	(for helper functions) numeric vector, a single predictor `x`
`index`	(for helper function `split_dichotom()`) logical vector, indices of training and test set. `TRUE` elements indicate training subjects and `FALSE` elements indicate test subjects.
`indices`	(optional, for helper function `quantile_split_dichotom()`) a list of logical vectors, the indices of multiple training-test sample splits. Default value is provided by function `rSplit()`.
`probs`	(for helper function `quantile_split_dichotom()`) double scalar, see quantile

Details

Function optimSplit_dichotom() selects the optimal dichotomizing predictors via repeated sample splits. Specifically,

1. Generate multiple training-test sample splits using function rSplit().
2. For each candidate predictor, find the median split_dichotom (using helper function quantile_split_dichotom()) over the multiple sample splits from Step 1.
3. (Optional) Limit the selection to a subset of the candidate predictors. Typically, we would prefer to guarantee a user-desired range of highX (see explanations on highX in section Returns of Helper Functions). A suggested choice is (highX>.15 & highX<.85).
4. Rank the candidate predictors, from either Step 2 or Step 3, in decreasing order of the absolute values of the estimated univariable regression coefficients of the corresponding split_dichotom objects.

The optimal dichotomizing predictors are the ones with the largest absolute values of the estimated univariable regression coefficients of the corresponding split_dichotom objects.
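
The filtering and ranking in Steps 3 and 4 can be sketched in base R. This is only an illustration, not the package's internal code: the summary table `med` and its numbers are made up, and evaluating the `include` language object with eval() inside that table is an assumption about how such a criterion could be applied.

```r
# Made-up median split summaries, one row per candidate predictor
# (hypothetical values, for illustration only)
med <- data.frame(
  predictor = c('x1', 'x2', 'x3', 'x4'),
  cutoff = c(1.2, 300, 3.5, 80),
  highX = c(0.40, 0.05, 0.55, 0.30),
  coef = c(1.10, 2.30, -1.45, 0.20),
  stringsAsFactors = FALSE
)

# Step 3: inclusion criterion as a language object, evaluated in `med`
include <- quote(highX > .15 & highX < .85)
keep <- eval(include, envir = med)
cand <- med[keep, ]

# Step 4: rank by decreasing absolute value of the coefficient
ranked <- cand[order(abs(cand$coef), decreasing = TRUE), ]

# Keep the `top` best-ranked predictors
top <- 2L
head(ranked, n = top)
```

In this toy table, `x2` has the largest absolute coefficient but is dropped by the inclusion criterion, because only 5% of its test-set values fall above the cutoff.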

Value

Function optimSplit_dichotom() returns a data.frame, which contains the response, and only the optimal dichotomizing predictors out of all candidate predictors. Other variables in data, which are not specified in formula, are retained. In addition, the dichotomized values of the optimal dichotomizing predictors, according to their respective dichotomizing rules, are also included. The returned value has attributes,

attr(,'id_top')

positive integer scalar or vector, the indices of the optimal dichotomizing predictors out of all candidate predictors.

attr(,'top')

a diagnostic data.frame of the median split_dichotoms of each of the optimal dichotomizing predictors, with columns

$cutoff: the cutoff threshold, identified in the training set
$highX: proportion of the dichotomizing predictors greater-than or greater-than-or-equal-to the cutoff threshold, in the test set
$coef: the estimated univariable regression coefficient of the dichotomized predictor, in the test set

Details on Helper Functions

Univariable regression model with a dichotomized predictor

Helper function split_dichotom() performs a univariable regression model on the test set with a dichotomized predictor, using a dichotomizing rule determined by a recursive partitioning of the training set. Currently the Cox proportional hazards (coxph) regression for Surv response, logistic (glm) regression for logical response and linear (lm) regression for gaussian response are supported. Specifically, given a training-test sample split,

1. Find the dichotomizing rule of the response y given the predictor x, using function rpartD(), in the training set.
2. Dichotomize the predictor x using the rule identified in Step 1, in the test set.
3. Fit a univariable regression model of the response y on the dichotomized predictor from Step 2, in the test set.
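
A minimal base-R sketch of these three steps for a double response follows. It does not call rpartD(): the hypothetical helper `best_cutoff()` is a stand-in that searches for the single cutoff minimizing the residual sum of squares in the training set, which is what a one-split regression tree optimizes. The simulated data, the alternating training-test `index`, and the `min_n` bound are all assumptions made for illustration.

```r
set.seed(1)
n <- 200L
x <- rnorm(n)
y <- 2 * (x > 0.3) + rnorm(n)                  # simulated response, true step at 0.3
index <- rep(c(TRUE, FALSE), length.out = n)   # TRUE = training, FALSE = test

# Step 1 (stand-in for rpartD()): single cutoff minimizing the residual
# sum of squares, with at least `min_n` subjects on each side
best_cutoff <- function(y, x, min_n = 10L) {
  xs <- sort(unique(x))
  cand <- (head(xs, -1L) + tail(xs, -1L)) / 2  # midpoints between sorted values
  rss <- vapply(cand, function(cc) {
    hi <- x >= cc
    if (sum(hi) < min_n || sum(!hi) < min_n) return(Inf)
    sum((y[hi] - mean(y[hi]))^2) + sum((y[!hi] - mean(y[!hi]))^2)
  }, numeric(1))
  cand[which.min(rss)]
}
cutoff <- best_cutoff(y[index], x[index])

# Step 2: dichotomize the predictor in the test set
high <- x[!index] >= cutoff

# Step 3: univariable linear regression in the test set
fit <- lm(y[!index] ~ high)

# Quantities analogous to the 'cutoff', 'highX' and 'coef' attributes
c(cutoff = cutoff, highX = mean(high), coef = unname(coef(fit)[2L]))
```

Because the cutoff is chosen on the training subjects only, the coefficient estimated on the test subjects is not inflated by the cutoff search itself.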

Quantile of split_dichotom objects

Helper function quantile_split_dichotom() finds the quantile of the univariable regression coefficient (i.e., effect size) of a dichotomized predictor, based on multiple given training-test sample splits. Specifically,

1. For each training-test sample split, fit the univariable regression model based on the dichotomized predictor, using helper function split_dichotom().
2. Find the nearest-even (type = 3) quantile of the estimated univariable regression coefficients obtained in Step 1, based on the user-specified probability probs.

The split_dichotom object from Step 1, whose estimated univariable regression coefficient equals the specified quantile identified in Step 2, is referred to as the quantile of split_dichotom objects based on the multiple given training-test sample splits.
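
The sketch below illustrates why a type = 3 quantile suits this purpose: it always returns one of the observed coefficients, so a specific sample split can be matched to it. Everything here is an assumption made for illustration. The list `indices` is a hypothetical stand-in for the output of rSplit() (20 random splits with 80% training subjects), and the training-set median replaces the rpartD() rule to keep the code short.

```r
set.seed(42)
n <- 120L
x <- rnorm(n)
y <- 1.5 * (x > 0) + rnorm(n)   # simulated data

# Hypothetical stand-in for rSplit(): a list of logical vectors,
# TRUE = training subject, FALSE = test subject
indices <- replicate(
  20L,
  sample(rep(c(TRUE, FALSE), times = c(96L, 24L))),
  simplify = FALSE
)

# Step 1: one test-set coefficient per sample split
coefs <- vapply(indices, function(id) {
  cutoff <- median(x[id])       # simplified rule; the package uses rpartD()
  high <- x[!id] >= cutoff
  unname(coef(lm(y[!id] ~ high))[2L])
}, numeric(1))

# Step 2: nearest-even quantile, which is an observed order statistic
q <- quantile(coefs, probs = 0.5, type = 3)

# The split whose coefficient equals that quantile
which_split <- which(coefs == q)[1L]
coefs[which_split]
```

With an interpolating quantile type (for example the default type = 7), the median of an even number of coefficients would generally fall between two observed values, and no single split would correspond to it.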

Returns of Helper Functions

Helper function split_dichotom(), as well as helper function quantile_split_dichotom(), returns a Cox proportional hazards (coxph), or a logistic (glm), or a linear (lm) regression model, with additional attributes

attr(,'rule'): function, the dichotomizing rule based on the training set
attr(,'cutoff'): numeric scalar, the cutoff threshold based on the training set
attr(,'highX'): double scalar, proportion of numeric predictor x, in the test set, which is greater-than or greater-than-or-equal-to the cutoff threshold attr(, 'cutoff')
attr(,'coef'): double scalar, the estimated univariable regression coefficient of the dichotomized predictor in the test set

Examples

library(survival)
data(pbc, package = 'survival') # see more details from ?survival::pbc
head(pbc2 <- within.data.frame(subset(pbc, status != 1L), expr = {
  death = (status == 2L)
  trt = structure(trt, levels = c('D-penicillamine', 'placebo'), class = 'factor')
  trt = relevel(trt, ref = 'placebo')
}))

# set.seed if needed
m1 = optimSplit_dichotom(
  Surv(time, death) ~ bili + chol + albumin + copper + alk.phos + ast + trig + platelet + protime, 
  data = pbc2, nsplit = 20L, include = (highX > .15 & highX < .85), top = 2L) 
head(m1, n = 10L)
attr(m1, 'top')

[Package Qindex version 0.1.5 Index]