optimSplit_dichotom {Qindex}    R Documentation

Optimal Dichotomizing Predictors via Repeated Sample Splits

Description

The functions documented here are:

optimSplit_dichotom()

to identify the optimal dichotomizing predictors using repeated sample splits.

split_dichotom()

a helper function to fit a univariable regression model on the test set with a dichotomized predictor, using a dichotomizing rule determined by recursive partitioning of the training set.

quantile_split_dichotom()

a helper function to locate a quantile of multiple split_dichotom objects, based on the estimated univariable regression coefficient.

Usage

optimSplit_dichotom(formula, data, include, top = 1L, nsplit, ...)

split_dichotom(y, x, index, ...)

quantile_split_dichotom(y, x, indices = rSplit(y, ...), probs = 0.5, ...)

Arguments

formula

formula. The left-hand side is the name of a Surv, logical, or double response y. The right-hand side gives the candidate numeric predictors in data, either as the name of a numeric matrix column (e.g., y ~ X), or as the names of several numeric vector columns (e.g., y ~ x1 + x2 + x3).

data

data.frame, containing the response and predictors in formula

include

language object, inclusion criteria for the optimal dichotomizing predictors. A suggested choice is (highX>.15 & highX<.85) to guarantee a user-desired range of proportions in highX. See explanation of highX in helper function split_dichotom().
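As an illustration of how a language object such as (highX>.15 & highX<.85) can act as an inclusion criterion, the sketch below evaluates an unevaluated expression against a small diagnostic table. The table tab and its values are hypothetical, and whether optimSplit_dichotom() evaluates include in exactly this way internally is an assumption.

```r
# hypothetical diagnostic table: one row per candidate predictor
tab <- data.frame(
  highX = c(0.05, 0.40, 0.90, 0.60),
  coef = c(1.2, -0.8, 2.0, 0.3),
  row.names = c('x1', 'x2', 'x3', 'x4')
)

# an unevaluated inclusion criterion (a language object)
include <- quote(highX > .15 & highX < .85)

# evaluate the criterion using the columns of `tab` as variables
keep <- eval(include, envir = tab)

# candidate predictors that satisfy the criterion
tab[keep, ]
```

Passing the criterion unevaluated lets the same expression be applied to whatever diagnostic table is available when the selection takes place.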

top

positive integer scalar, number of optimal dichotomizing predictors, default 1L

nsplit, ...

additional parameters for function rSplit()

y

(for helper functions) a Surv object, a logical vector, or a double vector, the response y

x

(for helper functions) numeric vector, a single predictor x

index

(for helper function split_dichotom()) logical vector, indices of training and test set. TRUE elements indicate training subjects and FALSE elements indicate test subjects.

indices

(optional, for helper function quantile_split_dichotom()) a list of logical vectors, the indices of multiple training-test sample splits. Default value is provided by function rSplit().

probs

(for helper function quantile_split_dichotom()) double scalar, see argument probs of function quantile()

Details

Function optimSplit_dichotom() selects the optimal dichotomizing predictors via repeated sample splits. Specifically,

  1. Generate multiple training-test sample splits using function rSplit()

  2. For each candidate predictor, find the median split_dichotom (using helper function quantile_split_dichotom()) of the multiple sample splits from Step 1.

  3. (Optional) limit the selection to a subset of the candidate predictors. Typically, we would prefer to guarantee a user-desired range of highX (see explanations on highX in section Returns of Helper Functions). A suggested choice is (highX>.15 & highX<.85).

  4. Rank the candidate predictors, from either Step 2 or Step 3, by the decreasing order of the absolute values of the estimated univariable regression coefficients of the corresponding split_dichotom objects.

The optimal dichotomizing predictors are the ones with the largest absolute values of the estimated univariable regression coefficients of the corresponding split_dichotom objects.
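The four steps above can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: the simulated data, the helper names one_split() and median_split(), the use of the training-set median as the cutoff (instead of recursive partitioning), and the hand-made random splits standing in for rSplit() are all assumptions.

```r
set.seed(1)
n <- 200L
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * (dat$x1 > 0) + rnorm(n) # only x1 carries signal

# Step 1: repeated training-test splits (TRUE = training); stand-in for rSplit()
splits <- replicate(20L, sample(rep(c(TRUE, FALSE), length.out = n)), simplify = FALSE)

# one split: cutoff from the training set, coefficient from the test set
# (assumption: training-set median replaces the recursive-partitioning rule)
one_split <- function(y, x, index) {
  cutoff <- median(x[index])
  high <- x[!index] >= cutoff
  fit <- lm(y[!index] ~ high)
  c(cutoff = cutoff, highX = mean(high), coef = unname(coef(fit)[2L]))
}

# Step 2: the split whose coefficient is the type-3 median across all splits
median_split <- function(y, x) {
  res <- do.call(rbind, lapply(splits, function(i) one_split(y, x, i)))
  med <- quantile(res[, 'coef'], probs = 0.5, type = 3, names = FALSE)
  res[which(res[, 'coef'] == med)[1L], ]
}
tab <- as.data.frame(do.call(rbind, lapply(
  dat[c('x1', 'x2', 'x3')], function(x) median_split(dat$y, x)
)))

# Step 3: optional inclusion criterion on highX
tab <- tab[eval(quote(highX > .15 & highX < .85), envir = tab), ]

# Step 4: rank by decreasing absolute coefficient and keep the top one
top <- tab[order(abs(tab$coef), decreasing = TRUE), ][1L, ]
top
```

Because the cutoff is learned on the training subjects and the coefficient is estimated on the test subjects, the ranking in Step 4 is based on out-of-sample effect sizes.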

Value

Function optimSplit_dichotom() returns a data.frame, which contains the response, and only the optimal dichotomizing predictors out of all candidate predictors. Other variables in data, which are not specified in formula, are retained. In addition, the dichotomized values of the optimal dichotomizing predictors, according to their respective dichotomizing rules, are also included. The returned value has attributes,

attr(,'id_top')

positive integer scalar or vector, the indices of the optimal dichotomizing predictors out of all candidate predictors.

attr(,'top')

a diagnostic data.frame of the median split_dichotoms of each of the optimal dichotomizing predictors, with columns

$cutoff

the cutoff threshold, identified in the training set

$highX

proportion of the dichotomizing predictors greater-than or greater-than-or-equal-to the cutoff threshold, in the test set

$coef

the estimated univariable regression coefficient of the dichotomized predictor, in the test set

Details on Helper Functions

Univariable regression model with a dichotomized predictor

Helper function split_dichotom() fits a univariable regression model on the test set with a dichotomized predictor, using a dichotomizing rule determined by recursive partitioning of the training set. Currently supported are Cox proportional hazards (coxph) regression for a Surv response, logistic (glm) regression for a logical response, and linear (lm) regression for a double (Gaussian) response. Specifically, given a training-test sample split,

  1. find the dichotomizing rule of the response y given the predictor x, using function rpartD(), in the training set

  2. dichotomize the predictor x using the rule identified in Step 1, in the test set.

  3. run a univariable regression model on the response y on the dichotomized predictor from Step 2, in the test set.
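The three steps above can be sketched for a double response as follows. The internals of rpartD() are not described here, so this sketch calls rpart::rpart() directly with a single split as a stand-in; the simulated data and the use of >= at the cutoff are likewise assumptions.

```r
library(rpart)

set.seed(2)
n <- 300L
x <- runif(n)
y <- 1.5 * (x > 0.6) + rnorm(n, sd = 0.5) # true change point at 0.6
index <- sample(rep(c(TRUE, FALSE), length.out = n)) # TRUE = training

# Step 1: one-split regression tree on the training set (stand-in for rpartD())
train <- data.frame(y = y[index], x = x[index])
tree <- rpart(y ~ x, data = train, control = rpart.control(maxdepth = 1L, cp = 0))
stopifnot(!is.null(tree$splits)) # a split must have been found
cutoff <- unname(tree$splits[1L, 'index'])

# Step 2: dichotomize the predictor in the test set with the training cutoff
high <- x[!index] >= cutoff

# Step 3: univariable linear regression in the test set
fit <- lm(y[!index] ~ high)
c(cutoff = cutoff, highX = mean(high), coef = unname(coef(fit)[2L]))
```

The last line collects the same three diagnostics (cutoff, highX, coef) that are described in section Returns of Helper Functions.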

Quantile of split_dichotom objects

Helper function quantile_split_dichotom() finds the quantile of the univariable regression coefficient (i.e., effect size) of a dichotomized predictor, based on multiple given training-test sample splits. Specifically,

  1. for each training-test sample split, fit the univariable regression model based on the dichotomized predictor, using helper function split_dichotom()

  2. find the nearest-even (type = 3) quantile of the estimated univariable regression coefficients obtained in Step 1, based on the user-specified probability probs

The split_dichotom object from Step 1, whose estimated univariable regression coefficient equals the quantile identified in Step 2, is referred to as the quantile of split_dichotom objects based on the multiple given training-test sample splits.
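A small sketch of why a type = 3 quantile is convenient here: it always returns one of the observed values, so a matching split can be located by equality. The coefficient values below are hypothetical, and picking the first match in case of ties is an assumption.

```r
# hypothetical coefficients from six training-test splits
coefs <- c(0.42, -0.10, 0.35, 0.51, 0.27, 0.60)

# type = 3 returns an order statistic, i.e., one of the observed coefficients
q <- quantile(coefs, probs = 0.5, type = 3, names = FALSE)

# index of the split whose coefficient equals that quantile
id <- which(coefs == q)[1L]

# the default type = 7 may interpolate between two observed coefficients
q7 <- quantile(coefs, probs = 0.5, names = FALSE)
c(type3_is_observed = q %in% coefs, type7_is_observed = q7 %in% coefs)
```

With an even number of splits, an interpolating median would generally not correspond to any single fitted model, whereas the type = 3 quantile always does.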

Returns of Helper Functions

Helper function split_dichotom(), as well as helper function quantile_split_dichotom(), returns a Cox proportional hazards (coxph), or a logistic (glm), or a linear (lm) regression model, with additional attributes

attr(,'rule')

function, the dichotomizing rule based on the training set

attr(,'cutoff')

numeric scalar, the cutoff threshold based on the training set

attr(,'highX')

double scalar, proportion of numeric predictor x, in the test set, which is greater-than or greater-than-or-equal-to the cutoff threshold attr(, 'cutoff')

attr(,'coef')

double scalar, the estimated univariable regression coefficient of the dichotomized predictor in the test set

Examples

library(survival)
data(pbc, package = 'survival') # see more details from ?survival::pbc
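# exclude transplant patients (status == 1), then derive the analysis variables
head(pbc2 <- within.data.frame(subset(pbc, status != 1L), expr = {
  death = (status == 2L) # event indicator: TRUE if the patient died
  trt = structure(trt, levels = c('D-penicillamine', 'placebo'), class = 'factor')
  trt = relevel(trt, ref = 'placebo') # use placebo as the reference level
}))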

# call set.seed() beforehand if reproducible sample splits are needed
m1 = optimSplit_dichotom(
  Surv(time, death) ~ bili + chol + albumin + copper + alk.phos + ast + trig + platelet + protime, 
  data = pbc2, nsplit = 20L, include = (highX > .15 & highX < .85), top = 2L) 
head(m1, n = 10L)
attr(m1, 'top')


[Package Qindex version 0.1.5 Index]