optimSplit_dichotom {Qindex} | R Documentation |
Optimal Dichotomizing Predictors via Repeated Sample Splits
Description
This documentation covers the following functions:
optimSplit_dichotom()
-
to identify the optimal dichotomizing predictors using repeated sample splits.
split_dichotom()
-
a helper function to perform a univariable regression model on the test set with a dichotomized predictor, using a dichotomizing rule determined by a recursive partitioning of the training set.
quantile_split_dichotom()
-
a helper function to locate a quantile of multiple split_dichotom objects, based on the estimated univariable regression coefficient.
Usage
optimSplit_dichotom(formula, data, include, top = 1L, nsplit, ...)
split_dichotom(y, x, index, ...)
quantile_split_dichotom(y, x, indices = rSplit(y, ...), probs = 0.5, ...)
Arguments
formula |
formula.
Left-hand-side is the name of
a Surv, logical, or double response |
data |
data.frame, containing the response and predictors in |
include |
language object,
inclusion criteria for the optimal dichotomizing predictors.
A suggested choice is |
top |
positive integer scalar, number of optimal dichotomizing predictors, default |
nsplit , ... |
additional parameters for function |
y |
(for helper functions)
a Surv object, a logical vector,
or a double vector, the response |
x |
|
index |
(for helper function |
indices |
(optional, for helper function |
probs |
(for helper function |
Details
Function optimSplit_dichotom()
selects the optimal dichotomizing predictors via repeated sample splits.
Specifically,
1. Generate multiple training-test sample splits using function rSplit().
2. For each candidate predictor, find the median split_dichotom (using helper function quantile_split_dichotom()) of the multiple sample splits from Step 1.
3. (Optional) Limit the selection to a subset of the candidate predictors. Typically, we would prefer to guarantee a user-desired range of highX (see the explanation of highX in section Returns of Helper Functions). A suggested choice is (highX>.15 & highX<.85).
4. Rank the candidate predictors, from either Step 2 or Step 3, in decreasing order of the absolute values of the estimated univariable regression coefficients of the corresponding split_dichotom objects.
The optimal dichotomizing predictors are the ones with the largest absolute values of the estimated univariable regression coefficients of the corresponding split_dichotom objects.
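The filtering and ranking steps above can be sketched in base R. This is a minimal illustration only, not the package's internal code: the data.frame med, its column predictor and all numbers in it are made up for the example, and only the column names cutoff, highX and coef follow this documentation.

```r
# Hypothetical summary of the median split_dichotom of four candidate
# predictors; every number below is invented for illustration.
med = data.frame(
  predictor = c('bili', 'chol', 'albumin', 'copper'),
  cutoff = c(2.1, 310, 3.4, 75),
  highX = c(0.30, 0.10, 0.55, 0.40),
  coef = c(1.40, 1.90, -1.10, 0.80)
)

# Optional inclusion criterion, supplied as a language object
include = quote(highX > .15 & highX < .85)
keep = eval(include, envir = med) # evaluate the criterion on the columns of med
cand = med[keep, ]

# Rank by decreasing absolute value of the estimated coefficient
ranked = cand[order(abs(cand$coef), decreasing = TRUE), ]

# Keep the `top` best candidates
top = 2L
optimal = head(ranked, n = top)
print(optimal)
```

In this made-up example the predictor with the largest absolute coefficient is dropped by the inclusion criterion because its highX falls outside the requested range, which is the situation the optional step is meant to guard against.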
Value
Function optimSplit_dichotom()
returns a data.frame,
which contains the response,
and only the optimal dichotomizing predictors out of all candidate predictors.
Other variables in data
, which are not specified in formula
, are retained.
In addition, the dichotomized values of the optimal dichotomizing predictors,
according to their respective dichotomizing rules, are also included.
The returned value has attributes,
attr(,'id_top')
-
positive integer scalar or vector, the indices of the optimal dichotomizing predictors out of all candidate predictors.
attr(,'top')
-
a diagnostic data.frame of the median split_dichotoms of each of the optimal dichotomizing predictors, with columns
$cutoff
-
the cutoff threshold, identified in the training set
$highX
-
proportion of the predictor values greater than (or greater than or equal to) the cutoff threshold, in the test set
$coef
-
the estimated univariable regression coefficient of the dichotomized predictor, in the test set
Details on Helper Functions
Univariable regression model with a dichotomized predictor
Helper function split_dichotom()
performs a univariable regression model on the test set
with a dichotomized predictor,
using a dichotomizing rule determined
by a recursive partitioning of the training set.
Currently, the Cox proportional hazards (coxph) regression for a Surv response,
logistic (glm) regression for a logical response and
linear (lm) regression for a Gaussian response
are supported.
Specifically, given a training-test sample split,
1. find the dichotomizing rule of the response y given the predictor x, using function rpartD(), in the training set;
2. dichotomize the predictor x using the rule identified in Step 1, in the test set;
3. run a univariable regression model of the response y on the dichotomized predictor from Step 2, in the test set.
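The three steps above can be sketched for a Gaussian response as follows. This is a rough illustration under stated assumptions, not the implementation of split_dichotom(): it assumes that the dichotomizing rule behaves like a depth-one rpart::rpart() tree, it uses simulated data, and it always dichotomizes with greater-than-or-equal-to. The object names (d_train, d_test, tree, coef_high) are hypothetical.

```r
library(rpart)

# Simulated data (assumption): the response jumps by 2 where x exceeds 0.3
set.seed(1)
n = 200L
x = rnorm(n)
y = 2 * (x > 0.3) + rnorm(n, sd = 0.5)

# One training-test sample split
id_train = sample.int(n, size = 120L)
d_train = data.frame(y = y[id_train], x = x[id_train])
d_test = data.frame(y = y[-id_train], x = x[-id_train])

# Step 1: find the cutoff in the training set with a depth-one tree
tree = rpart(y ~ x, data = d_train, control = rpart.control(maxdepth = 1L))
cutoff = unname(tree$splits[1L, 'index'])

# Step 2: dichotomize the predictor in the test set using that cutoff
d_test$high = (d_test$x >= cutoff)
highX = mean(d_test$high) # proportion of test-set values at or above the cutoff

# Step 3: univariable linear regression on the dichotomized predictor, test set only
fit = lm(y ~ high, data = d_test)
coef_high = unname(coef(fit)['highTRUE'])
```

Because the cutoff is learned on the training rows only, the coefficient estimated on the test rows is not inflated by the search for the cutoff.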
Quantile of split_dichotom objects
Helper function quantile_split_dichotom()
finds the quantile
of the univariable regression coefficient (i.e., effect size) of a dichotomized predictor,
based on multiple given training-test sample splits.
Specifically,
1. for each training-test sample split, fit the univariable regression model based on the dichotomized predictor, using helper function split_dichotom();
2. find the nearest-even (type = 3) quantile of the estimated univariable regression coefficients obtained in Step 1, based on the user-specified probability probs.
The split_dichotom object from Step 1 whose estimated univariable regression coefficient equals the quantile identified in Step 2 is referred to as the quantile of split_dichotom objects based on the multiple given training-test sample splits.
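A type-3 quantile is always one of the observed values, which is why a single split_dichotom object can be matched to it. The base-R sketch below illustrates the idea with made-up coefficients, one per hypothetical sample split; it is not the code used by quantile_split_dichotom().

```r
# Made-up estimated coefficients, one per training-test sample split
coefs = c(0.42, -0.10, 0.77, 0.31, 0.58)

# Nearest-even order statistic (type = 3): returns an observed value,
# never an interpolation between two values
q = quantile(coefs, probs = 0.5, type = 3L)

# Position of the sample split whose coefficient is that quantile
id = which.min(abs(coefs - q))
coefs[id]
```

With an interpolating quantile type, the result could fall between two observed coefficients and no fitted model would correspond to it.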
Returns of Helper Functions
Helper function split_dichotom()
, as well as helper function quantile_split_dichotom()
, returns
a Cox proportional hazards (coxph),
or a logistic (glm),
or a linear (lm)
regression model,
with additional attributes
attr(,'rule')
-
function, the dichotomizing rule based on the training set
attr(,'cutoff')
-
numeric scalar, the cutoff threshold based on the training set
attr(,'highX')
-
double scalar, proportion of numeric predictor x, in the test set, which is greater-than or greater-than-or-equal-to the cutoff threshold attr(,'cutoff')
attr(,'coef')
-
double scalar, the estimated univariable regression coefficient of the dichotomized predictor in the test set
Examples
library(survival)
data(pbc, package = 'survival') # see more details from ?survival::pbc
# drop subjects with status 1 (transplant); keep censored (0) and dead (2)
head(pbc2 <- within.data.frame(subset(pbc, status != 1L), expr = {
  death = (status == 2L) # logical event indicator
  trt = structure(trt, levels = c('D-penicillamine', 'placebo'), class = 'factor')
  trt = relevel(trt, ref = 'placebo')
}))
# call set.seed() first if reproducible sample splits are needed
m1 = optimSplit_dichotom(
  Surv(time, death) ~ bili + chol + albumin + copper + alk.phos + ast + trig + platelet + protime,
  data = pbc2, nsplit = 20L, include = (highX > .15 & highX < .85), top = 2L)
head(m1, n = 10L) # first 10 rows of the returned data.frame
attr(m1, 'top') # diagnostic data.frame with columns cutoff, highX and coef