contrast {conTree}    R Documentation

Build contrast tree

Description

Build contrast tree

Build boosted contrast tree model

Bootstrap contrast trees

Usage

contrast(
  x,
  y,
  z,
  w = rep(1, nrow(x)),
  cat.vars = NULL,
  not.used = NULL,
  qint = 10,
  xmiss = 9e+35,
  tree.size = 10,
  min.node = 500,
  mode = c("onesamp", "twosamp"),
  type = "dist",
  pwr = 2,
  quant = 0.5,
  nclass = NULL,
  costs = NULL,
  cdfsamp = 500,
  verbose = FALSE,
  tree.store = 1e+06,
  cat.store = 1e+05,
  nbump = 1,
  fnodes = 0.25,
  fsamp = 1,
  doprint = FALSE
)

modtrast(
  x,
  y,
  z,
  w = rep(1, nrow(x)),
  cat.vars = NULL,
  not.used = NULL,
  qint = 10,
  xmiss = 9e+35,
  tree.size = 10,
  min.node = 500,
  learn.rate = 0.1,
  type = c("dist", "diff", "class", "quant", "prob", "maxmean", "diffmean"),
  pwr = 2,
  quant = 0.5,
  cdfsamp = 500,
  verbose = FALSE,
  tree.store = 1e+06,
  cat.store = 1e+05,
  nbump = 1,
  fnodes = 0.25,
  fsamp = 1,
  doprint = FALSE,
  niter = 100,
  doplot = FALSE,
  span = 0,
  plot.span = 0.15,
  print.itr = 10
)

bootcri(
  x,
  y,
  z,
  w = rep(1, nrow(x)),
  cat.vars = NULL,
  not.used = NULL,
  qint = 10,
  xmiss = 9e+35,
  tree.size = 10,
  min.node = 500,
  mode = "onesamp",
  type = "dist",
  pwr = 2,
  quant = 0.5,
  nclass = NULL,
  costs = NULL,
  cdfsamp = 500,
  verbose = FALSE,
  tree.store = 1e+06,
  cat.store = 1e+05,
  nbump = 100,
  fnodes = 1,
  fsamp = 1,
  doprint = FALSE
)

Arguments

x

training input predictor data matrix or data frame. Rows are observations and columns are variables. Must be a numeric matrix or a data frame.

y

vector or matrix containing training outcome values or censoring intervals for each observation. If y is a vector, it is taken to contain uncensored outcome values or another contrasting quantity. If y is a matrix, it is assumed to contain censoring intervals for each observation; see details below

z

vector containing values of a second contrasting quantity for each observation

w

training observation weights

cat.vars

vector of column labels (numbers or names) indicating categorical variables (factors). All variables not so indicated are assumed to be orderable numeric; see details below

not.used

vector of column labels (numbers or names) indicating predictor variables not to be used in the model

qint

maximum number of split evaluation points on each predictor variable

xmiss

missing value flag. Must be numeric and larger than any non-missing predictor/abs(response) variable value. Predictor variable values greater than or equal to xmiss are regarded as missing. Predictor variable data values of NA are internally set to the value of xmiss and thereby regarded as missing

tree.size

maximum number of terminal nodes in generated trees

min.node

minimum number of training observations in each tree terminal node

mode

character string indicating a one-sample or two-sample contrast; see details below for how it works with type

type

type of contrast; see details below for how it works with mode

pwr

center split bias parameter. Larger values produce less center split bias.

quant

specified quantile p (type='quant' only)

nclass

number of classes (type='class' only); default is 2

costs

nclass by nclass misclassification cost matrix (type='class' only); default is equal valued diagonal (error rate)

cdfsamp

maximum subsample size used to compute censored CDF (censoring only)

verbose

a logical flag indicating print/don't print censored CDF computation progress, default FALSE

tree.store

size of internal tree storage. Decrease value in response to memory allocation error. Increase value for very large values of max.trees and/or tree.size, or in response to diagnostic message or erratic program behavior

cat.store

size of internal categorical value storage. Decrease value in response to memory allocation error. Increase value for very large values of max.trees and/or tree.size in the presence of many categorical variables (factors) with many levels, or in response to diagnostic message or erratic program behavior

nbump

number of bootstrap replications

fnodes

top fraction of node criteria used to evaluate trial bumped trees

fsamp

fraction of observations used in each bootstrap sample for bumped trees

doprint

logical flag TRUE/FALSE implies do/don't plot iteration progress

learn.rate

learning rate parameter in (0,1]

niter

number of trees

doplot

a flag to display/not display graphical plots

span

span for qq-plot transformation smoother

plot.span

running median smoother span for discrepancy plot (doplot = TRUE only)

print.itr

tree discrepancy printing iteration interval

Details

The variable xmiss is the missing value flag. It must be numeric and larger than any non-missing predictor/abs(response) variable value. Predictor variable values greater than or equal to xmiss are regarded as missing. Predictor variable data values of NA are internally set to the value of xmiss and thereby regarded as missing.
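The xmiss rule above can be checked before a model is fit. The following is a minimal base-R sketch on a hypothetical toy matrix and response (neither comes from the package); it only mimics the documented recoding of NA values and does not call any conTree function.

```r
# Hypothetical toy predictor matrix and response (illustrative values only).
x <- matrix(c(1.5, NA, 3.2, 4.0, 2.1, NA), ncol = 2)
y <- c(-7.0, 0.5, 2.0)
xmiss <- 9e+35  # default missing value flag shown in the Usage section

# The flag must exceed every non-missing predictor value and every abs(response).
largest <- max(max(x, na.rm = TRUE), max(abs(y)))
flag_ok <- xmiss > largest

# Mimic the documented recoding: NA is set to xmiss, and any predictor
# value greater than or equal to xmiss is regarded as missing.
x_coded <- x
x_coded[is.na(x_coded)] <- xmiss
is_missing <- x_coded >= xmiss
n_missing <- sum(is_missing)
```

If flag_ok were FALSE, a larger xmiss would be needed so that genuine data values are not mistaken for missing ones.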

If the response y is a matrix, it is assumed to contain censoring intervals for each observation. Rows are observations.

Note that censoring is only allowed for type='dist'; see further below.

If x is a data frame and cat.vars (the columns indicating categorical variables) is missing, then components of type factor are treated as categorical variables. Ordered factors should be input as type numeric with appropriate numerical scores. If cat.vars is present, it will override the data frame typing.
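As an illustration of the typing rules above, the sketch below prepares a hypothetical data frame (the column names age, region and grade are invented for this example). It converts an ordered factor to numeric scores and lists the columns that would be treated as categorical when cat.vars is missing. Using the level positions as scores is an assumption; other scores reflecting the ordering could be chosen.

```r
# Hypothetical data frame; column names and values are illustrative only.
df <- data.frame(
  age    = c(23, 41, 35, 58),
  region = factor(c("north", "south", "south", "west")),
  grade  = factor(c("low", "high", "mid", "low"),
                  levels = c("low", "mid", "high"), ordered = TRUE)
)

# Ordered factors should be passed as numeric scores. Here the level
# positions (low = 1, mid = 2, high = 3) serve as the scores.
df$grade <- as.numeric(df$grade)

# Columns of type factor are the ones treated as categorical by default.
inferred_cat <- names(df)[vapply(df, is.factor, logical(1))]
```

Here inferred_cat contains only "region". Supplying cat.vars explicitly would take precedence over this data frame typing.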

The mode argument is either 'onesamp' or 'twosamp'.

When mode is 'twosamp'

When type is a function, it must be a function of three arguments f(y,z,w) where y and z are double vectors and w is a weight vector, not necessarily normalized. The function should return a double vector of length 1 as the result. See example below.
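A minimal sketch of a function satisfying this contract is given here. The name wmad and the choice of a weighted mean absolute difference are assumptions made for illustration, not a discrepancy defined by the package.

```r
# Hypothetical discrepancy: weighted mean absolute difference between y and z.
wmad <- function(y, z, w) {
  w <- w / sum(w)        # weights are not necessarily normalized on input
  sum(w * abs(y - z))    # returns a double vector of length 1
}

# Small hand-checkable inputs (illustrative values only).
y <- c(1, 2, 4)
z <- c(1.5, 2, 3)
w <- c(1, 1, 2)
d <- wmad(y, z, w)
```

Because the weights are normalized inside the function, rescaling w leaves the result unchanged. A function of this form could be supplied as the type argument, for example type = wmad.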

Value

a contrast model object to be used as input to interpretation procedures

a contrast model object to be used with predtrast()

a named list out, with out$bcri containing the bootstrapped discrepancy values

Author(s)

Jerome H. Friedman

References

Jerome H. Friedman (2020). doi:10.1073/pnas.1921562117

Examples

data(census, package = "conTree")
dx <- 1:10000
dxt <- 10001:16281
# Build contrast tree
tree <- contrast(census$xt[dx,], census$yt[dx], census$gblt[dx], type = 'prob')
# Summarize tree
treesum(tree)
# Get terminal node identifiers for regions containing observations 1 through 10
getnodes(tree, x = census$xt[1:10, ])
# Plot nodes
nodeplots(tree, x = census$xt[dx, ], y = census$yt[dx], z = census$gblt[dx])
# Summarize contrast tree against (precomputed) gradient boosting
# on logistic scale using maximum likelihood (GBL)
nodesum(tree, census$xt[dxt,], census$yt[dxt], census$gblt[dxt])
# Use a custom R discrepancy function to build a contrast tree
dfun <- function(y, z, w) {
   w  <- w / sum(w)
   abs(sum(w * (y - z)))
}
tree2 <- contrast(census$xt[dx,], census$yt[dx], census$gblt[dx], type = dfun)
nodesum(tree2, census$xt[dxt,], census$yt[dxt], census$gblt[dxt])
# Generate lack of fit curve
lofcurve(tree, census$xt[dx,], census$yt[dx], census$gblt[dx])
# Build contrast tree boosting models
# Use small # of iterations for illustration (typically >= 200)
modgbl <- modtrast(census$x, census$y, census$gbl, type = 'prob', niter = 10)
# Plot model accuracy as a function of iteration number
xval(modgbl, census$x, census$y, census$gbl, col = 'red')
# Produce predictions from modtrast() for new data.
ypred <- predtrast(modgbl, census$xt, census$gblt, num = modgbl$niter)
# Produce distribution boosting estimates
yhat <- predtrast(modgbl, census$xt, census$gblt, num = modgbl$niter)

[Package conTree version 0.3-1 Index]