tvcglm {vcrpart}

R Documentation

Coefficient-wise tree-based varying coefficient regression based on generalized linear models

Description

The tvcglm function implements the tree-based varying coefficient regression algorithm for generalized linear models introduced by Burgin and Ritschard (2017). The algorithm approximates varying coefficients by piecewise constant functions using recursive partitioning, i.e., it estimates the selected coefficients separately within strata of the value space of the partitioning variables. Its distinguishing feature is that it builds an individual partition for each varying coefficient, which makes model specification more flexible and allows partitioning variables to be selected separately for each coefficient.

Usage

tvcglm(formula, data, family, 
       weights, subset, offset, na.action = na.omit, 
       control = tvcglm_control(), ...)

tvcglm_control(minsize = 30, mindev = 2.0,
               maxnomsplit = 5, maxordsplit = 9, maxnumsplit = 9,
               cv = TRUE, folds = folds_control("kfold", 5),
               prune = cv, fast = TRUE, center = fast,
               maxstep = 1e3, verbose = FALSE, ...)

Arguments

`formula`	a symbolic description of the model to fit, e.g., `y ~ vc(z1, z2, z3) + vc(z1, z2, by = x1) + vc(z2, z3, by = x2)` where the `vc` terms specify the varying fixed coefficients. The unnamed arguments within `vc` terms are interpreted as partitioning variables (i.e., moderators). The `by` argument specifies the associated predictor variable. If no such predictor variable is specified (e.g., see the first term in the above example formula), the `vc` term is interpreted as a varying intercept, i.e., a nonparametric estimate of the direct effect of the partitioning variables. For details, see `vcrpart-formula`. Note that the global intercept may be removed by a `-1` term, according to the desired interpretation of the model.
`family`	the model family. An object of class `family`.
`data`	a data frame containing the variables in the model.
`weights`	an optional numeric vector of weights to be used in the fitting process.
`subset`	an optional logical or integer vector specifying a subset of `data` to be used in the fitting process.
`offset`	this can be used to specify an a priori known component to be included in the linear predictor during fitting.
`na.action`	a function that indicates what should happen if the data contain `NA`s. The default `na.action = na.omit` is listwise deletion, i.e., observations with missing values on any variable are dropped. See `na.action`.
`control`	a list with control parameters as returned by `tvcglm_control`, or by `tvcm_control` for advanced users.
`minsize`	numeric (vector). The minimum sum of weights in terminal nodes.
`mindev`	numeric scalar. The minimum training error reduction a split must achieve to be accepted as a new split. The main role of this parameter is to save computing time by early stopping. It may be set lower when there are very few partitioning variables, or higher when there are many.
`maxnomsplit`, `maxordsplit`, `maxnumsplit`	integer scalars for split candidate reduction. See `tvcm_control`.
`cv`	logical scalar. Whether or not the `cp` parameter should be cross-validated. If `TRUE`, `cvloss` is called.
`folds`	a list of parameters to create folds as produced by `folds_control`. Is used for cross-validation.
`prune`	logical scalar. Whether or not the initial tree should be pruned by the estimated `cp` parameter from cross-validation. Cannot be `TRUE` if `cv = FALSE`.
`fast`	logical scalar. Whether the approximate model should be used to search for the next split. The approximate search model uses only the observations of the node to split and incorporates the fitted values of the current model as offsets. This reduces the estimation to the coefficients of the added split. If `FALSE`, the accurate search model is used.
`center`	logical scalar. Whether the predictor variables of the update models fitted during the grid search should be centered. Note that `TRUE` will not modify the predictors of the fitted model.
`maxstep`	integer. The maximum number of iterations, i.e., the number of splits to be processed.
`verbose`	logical. Should information about the fitting process be printed to the screen?
`...`	additional arguments passed to the fitting function `fit` or to `tvcm_control`.
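
The approximate search model described for `fast` can be illustrated with base R alone. The following is a minimal sketch under stated assumptions, not the internal code of vcrpart: the simulated data, the variable names (`z`, `x`, `left`, `eta`) and the exact parametrization of the offset model are hypothetical. The node to split is the root, so "the observations of the node" are all observations.

```r
set.seed(1)
n <- 400
z <- runif(n)                       # partitioning variable (moderator)
x <- rnorm(n)                       # predictor with a varying coefficient
beta <- ifelse(z <= 0.5, -1, 1)     # true coefficient changes at z = 0.5
y <- rbinom(n, 1, plogis(0.2 + beta * x))
d <- data.frame(y = y, x = x, z = z)

# Current model: a single constant coefficient for x.
current <- glm(y ~ x, family = binomial(), data = d)
d$eta <- predict(current, type = "link")   # linear predictor, used as offset

# Candidate split of the root node at z <= 0.5 (hypothetical cut point).
d$left <- as.numeric(d$z <= 0.5)

# Accurate search model: re-estimate all coefficients with the split added.
accurate <- glm(y ~ x + left:x, family = binomial(), data = d)

# Approximate search model: keep the current fit fixed as an offset and
# estimate only the one coefficient introduced by the split.
approximate <- glm(y ~ -1 + left:x, family = binomial(), data = d,
                   offset = eta)

# Training error (-2 log-likelihood) reduction under both search models.
red_accurate    <- deviance(current) - deviance(accurate)
red_approximate <- deviance(current) - deviance(approximate)
print(c(accurate = red_accurate, approximate = red_approximate))
```

Because the approximate model is the accurate model with the existing coefficients held fixed, its error reduction cannot exceed that of the accurate model; the gain is that each candidate split requires estimating only one coefficient.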

Details

tvcglm proceeds in two stages. The first, the partitioning stage, builds overly fine partitions for each vc term; the second, the pruning stage, selects the best-sized partitions by collapsing inner nodes. For details on the pruning stage, see tvcm-assessment. The partitioning stage iterates the following steps:

1. Fit the current generalized linear model

   y ~ NodeA:x1 + ... + NodeK:xK

   with glm, where Nodek is a categorical variable with the terminal node labels for the k-th varying coefficient.
2. Search for the globally best split among the candidate splits by an exhaustive search that cycles through all possible splits and evaluates the -2 log-likelihood training error of each.
3. If the -2 log-likelihood training error reduction of the best split is smaller than mindev, or if no candidate split satisfies the minimum node size minsize, stop the algorithm.
4. Otherwise, incorporate the best split and repeat the procedure.

In each iteration, the partitioning stage selects the split that maximizes the reduction in -2 log-likelihood training error relative to the current model. The default stopping parameters are minsize = 30 (a minimum node size of 30) and mindev = 2 (the training error reduction of the best split must be larger than two to continue).
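
One iteration of the partitioning stage can be sketched in base R as follows. This is a simplified illustration, not the vcrpart implementation: it assumes a single varying coefficient, a single numeric partitioning variable and unit weights, and the simulated data and object names (`cuts`, `reduction`, `split_made`) are hypothetical.

```r
set.seed(42)
n <- 300
z <- runif(n)                                   # partitioning variable
x <- rnorm(n)                                   # predictor
y <- rbinom(n, 1, plogis(ifelse(z <= 0.4, -1.5, 1.5) * x))
d <- data.frame(y = y, x = x, z = z)

minsize <- 30   # minimum node size (sum of unit weights)
mindev  <- 2    # minimum training error reduction

# Step 1: fit the current model (one node, so one coefficient for x).
current <- glm(y ~ x, family = binomial(), data = d)

# Step 2: exhaustive search over all candidate cut points of z.
cuts <- sort(unique(d$z))
cuts <- cuts[-length(cuts)]      # the largest value leaves an empty node

reduction <- vapply(cuts, function(cp) {
  dd <- d
  dd$node <- factor(ifelse(dd$z <= cp, "left", "right"))
  if (min(table(dd$node)) < minsize) return(NA_real_)  # violates minsize
  fit <- glm(y ~ node:x, family = binomial(), data = dd)
  deviance(current) - deviance(fit)   # -2 log-likelihood reduction
}, numeric(1))

best <- which.max(reduction)     # which.max skips NA entries

# Steps 3 and 4: stop, or incorporate the best split and continue.
split_made <- length(best) == 1 && reduction[best] >= mindev
if (split_made) {
  d$node <- factor(ifelse(d$z <= cuts[best], "left", "right"))
  current <- glm(y ~ node:x, family = binomial(), data = d)
}
print(split_made)
```

In the full algorithm the same search runs over every vc term, every terminal node and every partitioning variable, and the loop repeats until one of the stopping rules applies.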

The algorithm implements a number of split point reduction methods to decrease the computational complexity. See the arguments maxnomsplit, maxordsplit and maxnumsplit.
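
To see why reducing split candidates matters, the sketch below thins the cut points of a numeric partitioning variable to at most `maxnumsplit` values. The quantile-based thinning shown here is an assumption chosen for illustration; the scheme actually used by vcrpart is documented in tvcm_control and may differ.

```r
set.seed(3)
z <- rnorm(500)          # hypothetical numeric partitioning variable
maxnumsplit <- 9

# Without reduction, every distinct value except the largest is a candidate.
all_cuts <- sort(unique(z))
all_cuts <- all_cuts[-length(all_cuts)]

# Assumed thinning scheme: keep evenly spaced empirical quantiles.
# type = 1 returns observed values rather than interpolated ones.
probs <- seq_len(maxnumsplit) / (maxnumsplit + 1)
reduced_cuts <- unique(quantile(z, probs = probs, type = 1, names = FALSE))

# Number of models to fit per search step, before and after thinning.
print(c(all = length(all_cuts), reduced = length(reduced_cuts)))
```

Each candidate requires one model fit per search step, so the thinning cuts the cost of a step for this variable from hundreds of fits to at most `maxnumsplit`.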

The algorithm can be seen as an extension of CART (Breiman et al., 1984) and PartReg (Wang and Hastie, 2014), with the new feature that partitioning can be processed coefficient-wise.

Value

An object of class tvcm.

Author(s)

Reto Burgin

References

Breiman, L., J. H. Friedman, R. A. Olshen and C. J. Stone (1984). Classification and Regression Trees. New York, USA: Wadsworth.

Wang, J. C. and T. Hastie (2014). Boosted Varying-Coefficient Regression Models for Product Demand Prediction. Journal of Computational and Graphical Statistics, 23(2), 361-382.

Burgin, R. and G. Ritschard (2017). Coefficient-Wise Tree-Based Varying Coefficient Regression with vcrpart. Journal of Statistical Software, 80(6), 1-33.

Examples

## ------------------------------------------------------------------- #  
## Example: Moderated effect of education on poverty
##
## The algorithm is used to find out whether the effect of high
## education 'EduHigh' on poverty 'Poor' is moderated by the civil
## status 'CivStat'. We specify two 'vc' terms in the logistic
## regression model for 'Poor': a first that accounts for the direct
## effect of 'CivStat' and a second that accounts for the moderation of
## 'CivStat' on the relation between 'EduHigh' and 'Poor'. We use here
## the 2-stage procedure with a partitioning- and a pruning stage as
## described in Burgin and Ritschard (2017). 
## ------------------------------------------------------------------- #

data(poverty)
poverty$EduHigh <- 1 * (poverty$Edu == "high")

## fit the model
model.Pov <-
  tvcglm(Poor ~ -1 +  vc(CivStat) + vc(CivStat, by = EduHigh) + NChild, 
         family = binomial(), data = poverty, subset = 1:200,
         control = tvcm_control(verbose = TRUE, papply = lapply,
           folds = folds_control(K = 1, type = "subsampling", seed = 7)))

## diagnosis
plot(model.Pov, "cv")
plot(model.Pov, "coef")
summary(model.Pov)
splitpath(model.Pov, steps = 1:3)
prunepath(model.Pov, steps = 1)

[Package vcrpart version 1.0-5 Index]