PRISM {StratifiedMedicine} | R Documentation |
PRISM: Patient Response Identifier for Stratified Medicine
Description
PRISM algorithm. Given a data-set of (Y, A, X) (Outcome, treatment, covariates),
the PRISM
identifies potential subgroups along with point-estimate and variability
metrics; with and without resampling (bootstrap or cross-validation based). This four
step procedure (filter, ple, submod, param) is flexible and accepts user-inputs at each
step.
Usage
PRISM(
Y,
A = NULL,
X,
Xtest = NULL,
mu_train = NULL,
family = "gaussian",
filter = "glmnet",
ple = "ranger",
submod = NULL,
param = NULL,
meta = ifelse(family == "survival", "T-learner", "X-learner"),
pool = "no",
delta = ">0",
propensity = FALSE,
combine = "SS",
alpha_ovrl = 0.05,
alpha_s = 0.05,
filter.hyper = NULL,
ple.hyper = NULL,
submod.hyper = NULL,
resample = NULL,
stratify = ifelse(!is.null(A), "trt", "no"),
R = NULL,
resample_submod = NULL,
R_submod = NULL,
resample_pool = NULL,
R_pool = NULL,
calibrate = FALSE,
alpha.mat = NULL,
filter.resamp = NULL,
ple.resamp = NULL,
verbose = TRUE,
verbose.resamp = FALSE,
seed = 777,
efficient = TRUE
)
Arguments
Y |
The outcome variable. Must be numeric or survival (ex; Surv(time,cens) ) |
A |
Treatment variable. (Default supports binary treatment, either numeric or factor). "ple_train" accomodates >2 along with binary treatments. |
X |
Covariate space. |
Xtest |
Test set. Default is NULL (no test predictions). Variable types should match X. |
mu_train |
Patient-level estimates in training set (see |
family |
Outcome type. Options include "gaussion" (default), "binomial", and "survival". |
filter |
Filter model to determine variables that are likely associated with the
outcome and/or treatment. Outputs a potential reduce list of varia where X.star
has potentially less variables than X. Default is "glmnet" (elastic net). Other
options include "ranger" (random forest based variable importance with p-values).
See |
ple |
Base-learner used to estimate patient-level equantities, such as the conditional average treatment effect (CATE), E(Y|A=1,X)-E(Y|A=0, X) = CATE(X). Default is random based based through "ranger". "None" uses no ple. See below for details on estimating the treatment contrasts. |
submod |
Subgroup identification model function. Options include tree-methods that target the treatment by variable interaction directly ("lmtree", "glmtree", "mob_weib"), regress the CATE ("rpart_cate", "ctree_cate"), and target prognostic variables ("rpart", "ctree"). Default for family="gaussian" is "lmtree" (MOB with OLS loss). For "binomial" the default is "glmtree" (MOB with binomial loss). Default for "survival" is "lmtree" (log-rank transformation on survival outcomes and then fit MOB-OLS). "None" uses no submod. Currently only available for binary treatments or A=NULL. |
param |
Parameter estimation and inference function. Based on the discovered subgroups, estimate parameter estimates and correspond variability metrics. Options include "lm" (unadjusted linear regression), "dr" (doubly-robust estimator), "gcomp" (G-computation, average the patient-level estimates), "cox" (cox regression), and "rmst" (RMST based estimates as in survRMST package). Default for "gaussian", "binomial" is "dr", while default for "survival" is "cox". Currently only available for binary treatments or A=NULL. |
meta |
Using the ple model as a base learner, meta-learners can be used for estimating patient-level treatment differences. Options include "T-learner" (treatment specific models), "S-learner" (single model), and "X-learner". For family="gaussian" & "binomial", the default is "X-learner", which uses a two-stage regression approach (See Kunzel et al 2019). For "survival", the default is "T-learner". "X-learner" is currently not supported for survival outcomes. |
pool |
Whether to pool the initial identified subgroups (ex: tree nodes). Default = "no". Other options include "trteff" or "trteff_boot" (check if naive or bootstrap treatment estimate is beyond clinical meaningful threshold delta, ex: trteff_boot > 0), and optimal treatment regime (OTR) pooling, "otr:logistic", "otr:rf". "otr:logistic" fits weighted logistic regression with I(mu_1-mu_0>delta) as the outcome, the candidate subgroups as covariates, and weights=abs((mu_1-mu_0) - delta). "otr:rf" follows the same approach but with weighted random forest, and also includes X in the regression. Regardless of the pooling approach, the key output is "trt_assign", a data-frame with the initial subgroups and the pooled subgroups (ex: dopt=1, patient should receive A=1, vs dopt=0, patient should receive A=0). |
delta |
Threshold for defining benefit vs non-benefitting patients. Only applicable for submod="otr", and if pooling is used (see "pool"). Default=">0". |
propensity |
Propensity score estimation, P(A=a|X). Default=FALSE which use the marginal estimates, P(A=a) (applicable for RCT data). If TRUE, will use the "ple" base learner to estimate P(A=a|X). |
combine |
Method of combining group-specific point-estimates. Options include "SS" (sample size weighting), and "maxZ" (see: Mehrotra and Marceau-West). This is used for pooling (ex: within dopt=1 groups, aggregate group-specific treatment estimates), and for calculating the overall population treatment effect estimate. |
alpha_ovrl |
Two-sided alpha level for overall population. Default=0.05 |
alpha_s |
Two-sided alpha level at subgroup level. Default=0.05 |
filter.hyper |
Hyper-parameters for the filter function (must be list). Default is NULL. |
ple.hyper |
Hyper-parameters for the PLE function (must be list). Default is NULL. |
submod.hyper |
Hyper-parameters for the submod function (must be list). Default is NULL. |
resample |
Resampling method for resample-based treatment effect estimates and variability metrics. Options include "Bootstrap" and "CV" (cross-validation). Default=NULL (No resampling). |
stratify |
Stratified resampling? Default="trt" (stratify by A). Other options include "sub" (stratify by the identified subgroups), "trt_sub" (stratify by A and the identified subgroups), and "no" (no stratification). |
R |
Number of resamples (default=NULL; R=100 for Permutation/Bootstrap and R=5 for CV). This resamples the entire PRISM procedure. |
resample_submod |
For submod only, resampling method for treatment effect estimates. Options include "Bootstrap" or NULL (no resampling). |
R_submod |
Number of resamples for resample_submod |
resample_pool |
For submod only, resampling method for pooling step. nly applicable if resample_submod="Bootstrap" and/or pool="trteff_boot". |
R_pool |
Number of resamples for resample_pool |
calibrate |
Bootstrap calibration for nominal alpha (Loh et al 2016). Default=FALSE. For TRUE, outputs the calibrated alpha level and calibrated CIs for the overall population and subgroups. Not applicable for permutation or CV resampling. |
alpha.mat |
Grid of alpha values for calibration. Default=NULL, which uses seq(alpha/1000,alpha,by=0.005) for alpha_ovrl/alpha_s. |
filter.resamp |
Filter function during re-sampling. Default=NULL (uses "filter"). If "None", the "filter" model is not trained in each resample, and instead use filtered variables from the observed data "filter" step. (less computationally expensive). |
ple.resamp |
Ple function during re-sampling. Default=NULL (uses "ple"). If "None", the "ple" model is not training in each resample, and instead the original model estimates are resampled (less computationally expensive). |
verbose |
Detail progress of PRISM? Default=TRUE |
verbose.resamp |
Output iterations during resampling? Default=FALSE |
seed |
Seed for PRISM run (Default=777) |
efficient |
If TRUE (default for PRISM), then models (filter, ple, submod) will store reduced set of outputs for faster speed. |
Details
PRISM is a general framework with five key steps:
0. Estimand: Determine the question of interest (ex: mean treatment difference)
1. Filter (filter): Reduce covariate space by removing noise covariates. Options include elastic net ("glmnet") and random forest variable importance ("ranger").
2. Patient-Level Estimates (ple): Estimate counterfactual patient-level quantities, for example, the conditional average treatment effect (CATE), E(Y|A=1,X)-E(Y|A=0,X). This calls the "ple_train" function, and follows the framework of Kunzel et al 2019. Base-learners include random forest ("ranger"), BART ("bart"), elastic net ("glmnet"), and linear models (LM, GLM, or Cox regression). Meta-learners include the "S-Learner" (single model), "T-learner" (treatment specific models), and "X-learner" (2-stage approach).
3. Subgroup Model (submod): Currently uses tree-based methods to identify predictive and/or prognostic subgroups. Options include MOB OLS ("lmtree"), MOB GLM ("glmtree"), MOB Weibull ("mob_weib"), conditional inference trees ("ctree", Y~ctree(X); "ctree_cate", CATE~ctree(X)), and recursive partitioning and regression trees ("rpart", Y~rpart(X); "rpart_cate", CATE~rpart(X)), and optimal treatment regimes ("otr").
4. Treatment Effect Estimation (param): For the overall population and the discovered subgroups (if any), obtain treatment effect point-estimates and variability metrics. Options include: cox regression ("cox"), double robust estimator ("dr"), linear regression ("lm"), average of patient-level estimates ("gcomp"), and restricted mean survival time ("rmst").
Steps 1-4 also support user-specific models. If treatment is provided (A!=NULL), the default settings are as follows:
Y is continuous (family="gaussian"): Elastic Net Filter ==> X-learner with random forest ==> MOB (OLS) ==> Double Robust estimator
Y is binary (family="binomial"): Elastic Net Filter ==> X-learner with random forest ==> MOB (GLM) ==> Double Robust estimator
Y is right-censored (family="survival"): Elastic Net Filter ==> T-learner with random forest ==> MOB (Weibull) ==> Cox regression
If treatment is not provided (A=NULL), the default settings are as follows:
Y is continuous (family="gaussian"): Elastic Net Filter ==> Random Forest ==> ctree ==> linear regression
Y is binary (family="binomial"): Elastic Net Filter ==> Random Forest ==> ctree ==> linear regression
Y is right-censored (family="survival"): Elastic Net Filter ==> Survival Random Forest ==> ctree ==> RMST
Value
Trained PRISM object. Includes filter, ple, submod, and param outputs.
filter.mod - Filter model
filter.vars - Variables remaining after filtering
ple.fit - Fitted ple model (model fit, other fit outputs)
mu_train - Patient-level estimates (train)
mu_test - Patient-level estimates (test)
submod.fit - Fitted submod model (model fit, other fit outputs)
out.train - Training data-set with identified subgroups
out.test - Test data-set with identified subgroups
Rules - Subgroup rules / definitions
param.dat - Parameter estimates and variablity metrics (depends on param)
resamp_dist - Resampling distributions (NULL if no resampling is done)
References
Friedman, J., Hastie, T. and Tibshirani, R. (2008) Regularization Paths for Generalized Linear Models via Coordinate Descent, https://web.stanford.edu/~hastie/Papers/glmnet.pdf Journal of Statistical Software, Vol. 33(1), 1-22 Feb 2010 Vol. 33(1), 1-22 Feb 2010.
Jemielita T, Mehrotra D. PRISM: Patient Response Identifiers for Stratified Medicine. https://arxiv.org/abs/1912.03337
Hothorn T, Hornik K, Zeileis A (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674.
Wright, M. N. & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw 77:1-17. doi: 10.18637/jss.v077.i01.
Zeileis A, Hothorn T, Hornik K (2008). Model-Based Recursive Partitioning. Journal of Computational and Graphical Statistics, 17(2), 492–514.
Examples
## Load library ##
library(StratifiedMedicine)
## Examples: Continuous Outcome ##
dat_ctns = generate_subgrp_data(family="gaussian")
Y = dat_ctns$Y
X = dat_ctns$X
A = dat_ctns$A
# Run Default: glmnet, ranger (X-learner), lmtree, dr #
res0 = PRISM(Y=Y, A=A, X=X)
summary(res0)
plot(res0)
res1 = PRISM(Y=Y, A=A, X=X, filter="None")
summary(res1)
plot(res1)
# Search for Prognostic Only (omit A from function) #
res3 = PRISM(Y=Y, X=X)
summary(res3)
plot(res3)
## With bootstrap (No filtering) ##
library(ggplot2)
res_boot = PRISM(Y=Y, A=A, X=X, resample = "Bootstrap", R=50, verbose.resamp = TRUE)
# Plot of distributions and P(est>0) #
plot(res_boot, type="resample", estimand = "E(Y|A=1)-E(Y|A=0)")+
geom_vline(xintercept = 0)
aggregate(I(est>0)~Subgrps, data=res_boot$resamp_dist, FUN="mean")
## Examples: Binary Outcome ##
dat_bin = generate_subgrp_data(family="binomial")
Y = dat_bin$Y
X = dat_bin$X
A = dat_bin$A
# Run Default: glmnet, ranger, glmtree, dr #
res0 = PRISM(Y=Y, A=A, X=X)
plot(res0)
# Survival Data ##
library(survival)
library(ggplot2)
require(TH.data); require(coin)
data("GBSG2", package = "TH.data")
surv.dat = GBSG2
# Design Matrices ###
Y = with(surv.dat, Surv(time, cens))
X = surv.dat[,!(colnames(surv.dat) %in% c("time", "cens")) ]
set.seed(513)
A = rbinom( n = dim(X)[1], size=1, prob=0.5 )
# PRISM: glmnet ==> Random Forest to estimate Treatment-Specific RMST
# ==> MOB (Weibull) ==> Cox for HRs#
res_weib = PRISM(Y=Y, A=A, X=X)
plot(res_weib, type="PLE:waterfall")
plot(res_weib)
#PRISM: glmnet ==> Random Forest to estimate Treatment-Specific RMST
#RPART_CATE: Regress RMST on RPART for subgroups #
res_cate = PRISM(Y=Y, A=A, X=X, submod="rpart_cate")
plot(res_cate)
# PRISM: ENET ==> CTREE ==> Cox; with bootstrap #
res_ctree1 = PRISM(Y=Y, A=A, X=X, ple="None", submod = "ctree",
resample="Bootstrap", R=50, verbose.resamp = TRUE)
plot(res_ctree1)
plot(res_ctree1, type="resample", estimand="HR(A=1 vs A=0)")+geom_vline(xintercept = 1)
aggregate(I(est<0)~Subgrps, data=res_ctree1$resamp_dist, FUN="mean")