BoostMLR {BoostMLR}R Documentation

Boosting for Multivariate Longitudinal Response

Description

Function jointly models the multiple longitudinal responses (referred to as multivariate longitudinal response) and multiple covariates and time from a longitudinal study using gradient boosting approach (Pande et al., 2020). Covariates can be time-varying or time-invariant. Special cases include modeling of univariate longitudinal response from a longitudinal study, and univariate or multivariate response from a cross-sectional study. In all cases, responses are assumed to be continuous. The estimated coefficient can be a function of time (referred to as time-varying coefficient in case of a longitudinal study) or a function of pre-specified covariate (in case of a longitudinal or a cross-sectional study) or fixed.

Usage

  BoostMLR(x,
           tm,
           id,
           y,
           Time_Varying = TRUE,
           BS_Time = TRUE,
           nknots_t = 10,
           d_t = 3,
           All_RawX = TRUE,
           RawX_Names,
           nknots_x = 7,
           d_x = 3,
           M = 200,
           nu = 0.05,
           Mod_Grad = TRUE,
           Shrink = FALSE,
           VarFlag = TRUE,
           lower_perc = 0.25,
           upper_perc = 0.75,
           NLambda = 100,
           Verbose = TRUE,
           Trace = FALSE,
           lambda = 0,
           setting_seed = FALSE,
           seed_value = 100L,
           ...)

Arguments

x

Data frame (or matrix) containing x-values (covariates). The number of rows should match with number of rows of response y. Covariates can be time-varying or time-invariant. Missing values are allowed, and are ignored during estimation. If unspecified, model will be fitted with time alone (applicable in the situation when the interest is to obtain an estimated mean response trajectory over time without the influence of any covariates).

tm

Vector of time values, one entry for each row of the response y. In case of a longitudinal study, the estimated coefficient will be a function of tm when Time_Varying = TRUE. If unspecified, data is assumed to be generated from a cross-sectional study, and the relationship between y and x can be obtained. In case of a longitudinal or cross-sectional study, coefficient can be a function of covariate z which is not a part of x by using z in place of tm or it can be fixed when Time_Varying = FALSE.

id

Vector of subject identifier with same length as the number of rows of y. If id is unspecified along with tm, data is assumed to be generated from a cross-sectional study, and the relationship between y and x can be obtained.

y

Data frame (or matrix) containing the y-values (response) in case of multivariate response or a vector of y-values in case of univariate response.

Time_Varying

Time-varying coefficient model or a fixed coefficient model?

BS_Time

If tm is specified, should tm is mapped using B-spline or use original scale of tm? Default is TRUE, which allows mapping of tm using B-spline.

nknots_t

If BS_Time = TRUE, specify number of knots for B-spline of tm.

d_t

If BS_Time = TRUE, specify degree of polynomial for B-spline of tm.

All_RawX

Use original scale of x or map each covariate using B-spline? Default is TRUE, which means original scale of x is used; if FALSE, covariates measured on continuous scale will be mapped using B-splines.

RawX_Names

If All_RawX = FALSE, specify names of the covariates, measured on a continuous scale, that should be used as it is without mapping using B-spline. Note that, even if All_RawX = FALSE, covariates not measured on a continuous scale, such as binary, nominal, and ordinal covariates will be used without mapping.

nknots_x

Specify number of knots for B-spline of x. This can be a vector of length equal to the number of covariates or a scalar. If scalar, same value will be used for all covariates.

d_x

Specify degree of polynomial for B-spline of x. This can be a vector of length equal to the number of covariates or a scalar. If scalar, same value will be used for all covariates.

M

Number of boosting iterations.

nu

Boosting regularization parameter. A value from the interval (0,1].

Mod_Grad

Use a modified gradient? Modified gradient is a special type of gradient that is independent of the correlation coefficient. Pande A. (2017) observed that prediction performance increases under modified gradient.

Shrink

Allow estimated coefficient to shrink to zero using L1 penalization?

VarFlag

Estimate the variance (scale parameter) and correlation parameter for each y? Applicable for a longitudinal study. If VarFlag = FALSE, a fixed value of scale parameter = 1 and correlation parameter = 0 is used.

lower_perc

Lower percentile value is used to determine the lower cut-off for the distribution of parameter estimate. Applicable when Shrink = TRUE. Refer to Pande et al. (2020) for details.

upper_perc

Upper percentile value is used to determine the upper cut-off for the distribution of parameter estimate. Applicable when Shrink = TRUE. Refer to Pande et al. (2020) for details.

NLambda

Number of replications for generating distribution of parameter estimates. Applicable when Shrink = TRUE. Refer to Pande et al. (2020) for details.

Verbose

Print the current stage of boosting iteration?

Trace

Print the current stage of execution? Useful for identifying location in case error occurs.

lambda

Additional penaulty; not implemented at this time.

setting_seed

Set setting_seed = TRUE if you intend to reproduce the result.

seed_value

Seed value.

...

Further arguments passed to or from other methods.

Details

This is a non-parametric approach for joint modeling of a multivariate longitudinal response, which is based on marginal model. Estimation is performed using gradient boosting, a generic form of boosting (Friedman J. H., 2001). Our boosting approach is closely related to component-wise L2 boosting with L1 penalization. Approach can handle high dimensionalilty of covariate and response when some of the covariates and responses are pure noise.

Approach is designed to identify covariates that affect responses differently as different time intervals. This idea is helpful to dissect an overall effect of covariate into different time intervals. For example, some covariates affect response at the beginning of the follow-up whereas others at a later stage.

Shrinking allows for early termination of boosting to prevent overfitting. Also, it provides a parsimonious model by shrinking coefficient for non-informative covariate-response pair to zero.

Value

x

Matrix containing x-values.

id

Vector of subject identifier.

tm

Vector of time values.

y

Matrix containing y-values.

UseRaw

Logical vector indicating indexes of covariates which are used as it is without B-spline mapping.

x_Names

Variable names of x.

y_Names

Variable names of y.

M

Number of boosting iterations. If boosting terminates before a pre-specified M, this indicates the last boosting iteration before termination.

nu

Regularization parameter.

Tm_Beta

An estimate of the parameter beta. This consist of a list of length equal to the number of multivariate response (denoted by L). If Time_Varying = TRUE, each element from the list represents a matrix with number of columns equal to the number of covariates and the number of rows equal to the length of tm. Each column of the matrix represents an estimate of time-varying coefficient for the given covariate. If Time_Varying = FALSE, in place of estimate of time-varying coefficient, a fixed estimate is provided similar to the estimate from a parametric model. The result is provided for covariates who are treated as it is (i.e., the original scale); for covariates who are mapped using B-spline, the estimates are difficult to interprete and therefore the output is NA.

mu

Estimate of the conditional expectation of y corresponding to the M'th boosting iterations.

Error_Rate

Training error rate for each response across the boosting iterations.

Variable_Select

Indexes of important covariates that get picked-up across time and across boosting iterations. Result is shown as a matrix with M rows and H (number of overlapping time intervals) columns, where each element represents index of covariate.

Response_Select

Indexes of important responses that get picked-up across time and across boosting iterations. Result is shown as a matrix with M rows and H columns, where each element represents index of response variable.

VarFlag

Whether the variance (scale parameter) and correlation are estimated?

Time_Varying

Whether estimates are time-varying or fixed?

Phi

Matrix, having dimension M by L, representing an estimate of variance (scale parameter) for each response across the boosting iterations.

Rho

Matrix, having dimension M by L, represent an estimate of correlation for each response across the boosting iterations.

Lambda_List

Estimate of the lambda (the L1 penaulty parameter) for each boosting iterations. Useful for internal calculation.

Grow_Object

Useful for internal calculation.

Author(s)

Amol Pande and Hemant Ishwaran

References

Pande A., Ishwaran H., Blackstone E.H. (2020). Boosting for multivariate longitudinal response.

Pande A., Li L., Rajeswaran J., Ehrlinger J., Kogalur U.B., Blackstone E.H., Ishwaran H. (2017). Boosted multivariate trees for longitudinal data, Machine Learning, 106(2): 277–305.

Pande A. (2017). Boosting for longitudinal data. Ph.D. Dissertation, Miller School of Medicine, University of Miami.

Friedman J.H. (2001). Greedy function approximation: a gradient boosting machine, Ann. of Statist., 5:1189-1232.

See Also

updateBoostMLR, predictBoostMLR, simLong

Examples



##-----------------------------------------------------------------
## Multivariate Longitudinal Response
##-----------------------------------------------------------------

# Simulate data involves 3 response and 4 covariates

dta <- simLong(n = 100, N = 5, rho =.80, model = 1, q_x = 0, 
                                  q_y = 0,type = "corCompSym")$dtaL

# Boosting call: Raw values of covariates, B-spline for time, 
# no shrinkage, no estimate of rho and phi

boost.grow <- BoostMLR(x = dta$features, tm = dta$time, id = dta$id, 
                            y = dta$y, M = 100, VarFlag = FALSE)

# Plot training error
plotBoostMLR(boost.grow$Error_Rate,xlab = "m",ylab = "Training Error")


##-----------------------------------------------------------------
## Laboratory data
##-----------------------------------------------------------------

data(Laboratory_Data, package = "BoostMLR")

Var_Names <- colnames(Laboratory_Data)
x_Names <- setdiff(Var_Names, c("id","time","tbili_po","creat_po"))

dta_id <- Laboratory_Data[,"id"]
dta_time <- Laboratory_Data[,"time"]
dta_x <- Laboratory_Data[,x_Names]
dta_y <- Laboratory_Data[,c("tbili_po","creat_po")]

boost.grow <- BoostMLR(x = dta_x,tm = dta_time,id = dta_id,y = dta_y,
                           Time_Varying = TRUE,BS_Time = TRUE,
                           All_RawX = TRUE,M = 10, VarFlag = TRUE)

##-----------------------------------------------------------------
## Univariate Longitudinal Response
##-----------------------------------------------------------------

# Simulate data involves 1 response and 4 covariates

dta <- simLong(n = 100, N = 5, rho =.80, model = 2, q_x = 0, 
                                  q_y = 0,type = "corCompSym")$dtaL

# Boosting call: B-spline for time and covariates, shrinkage, 
# estimate of rho and phi 

boost.grow <- BoostMLR(x = dta$features, tm = dta$time, id = dta$id, 
                          y = dta$y, M = 100, BS_Time = TRUE,
                          All_RawX = FALSE, Shrink = TRUE,VarFlag = TRUE)

# Plot training error
plotBoostMLR(boost.grow$Error_Rate,xlab = "m",ylab = "Training Error")

# Plot phi
plotBoostMLR(boost.grow$Phi,xlab = "m",ylab = "phi")

# Plot rho
plotBoostMLR(boost.grow$Rho,xlab = "m",ylab = "rho")


##-----------------------------------------------------------------
## Multivariate Longitudinal Response
##-----------------------------------------------------------------

# Simulate data involves 3 response and 4 covariates

dta <- simLong(n = 100, N = 5, rho =.80, model = 1, q_x = 0, 
                                  q_y = 0,type = "corCompSym")$dtaL

# Boosting call: Raw values of covariates, fixed parameter estimates
# instead of time varying, no shrinkage, no estimate of rho and phi

boost.grow <- BoostMLR(x = dta$features, tm = dta$time, id = dta$id, 
                       y = dta$y, M = 100,Time_Varying = FALSE,VarFlag = FALSE)

# Print parameter estimates
boost.grow$Tm_Beta

##-----------------------------------------------------------------
## Multivariate Response from Cross-sectional Data: Estimated 
## coefficient as a function of covariate
##-----------------------------------------------------------------

if (library("mlbench", logical.return = TRUE)) {
data("BostonHousing")

x <- BostonHousing[,c(1:7,9:12)]
tm <- BostonHousing[,8]
id <- 1:nrow(BostonHousing)
y <- BostonHousing[,13:14]

# Boosting call: Raw values of covariates, B-spline for covariate "dis", 
# no shrinkage

boost.grow <- BoostMLR(x = x, tm = tm, id = id, y = y, M = 100,VarFlag = FALSE)

# Plot training error
plotBoostMLR(boost.grow$Error_Rate,xlab = "m",ylab = "Training Error",
                                              legend_fraction_x = 0.2)
}
##-----------------------------------------------------------------
## Univariate Response from Cross-sectional Data: Fixed estimated 
## coefficient
##-----------------------------------------------------------------

if (library("mlbench", logical.return = TRUE)) {
library(mlbench)
data("BostonHousing")

x <- BostonHousing[,1:13]
y <- BostonHousing[,14]

# Boosting call: Raw values of covariates

boost.grow <- BoostMLR(x = x, y = y, M = 100)

# Plot training error
plotBoostMLR(boost.grow$Error_Rate,xlab = "m",ylab = "Training Error",
                                               legend_fraction_x = 0.2)
}


[Package BoostMLR version 1.0.3 Index]