contrast {conTree} | R Documentation |
Build contrast tree
Description
contrast() builds a contrast tree.
modtrast() builds a boosted contrast tree model.
bootcri() bootstraps contrast trees.
Usage
contrast(
x,
y,
z,
w = rep(1, nrow(x)),
cat.vars = NULL,
not.used = NULL,
qint = 10,
xmiss = 9e+35,
tree.size = 10,
min.node = 500,
mode = c("onesamp", "twosamp"),
type = "dist",
pwr = 2,
quant = 0.5,
nclass = NULL,
costs = NULL,
cdfsamp = 500,
verbose = FALSE,
tree.store = 1e+06,
cat.store = 1e+05,
nbump = 1,
fnodes = 0.25,
fsamp = 1,
doprint = FALSE
)
modtrast(
x,
y,
z,
w = rep(1, nrow(x)),
cat.vars = NULL,
not.used = NULL,
qint = 10,
xmiss = 9e+35,
tree.size = 10,
min.node = 500,
learn.rate = 0.1,
type = c("dist", "diff", "class", "quant", "prob", "maxmean", "diffmean"),
pwr = 2,
quant = 0.5,
cdfsamp = 500,
verbose = FALSE,
tree.store = 1e+06,
cat.store = 1e+05,
nbump = 1,
fnodes = 0.25,
fsamp = 1,
doprint = FALSE,
niter = 100,
doplot = FALSE,
span = 0,
plot.span = 0.15,
print.itr = 10
)
bootcri(
x,
y,
z,
w = rep(1, nrow(x)),
cat.vars = NULL,
not.used = NULL,
qint = 10,
xmiss = 9e+35,
tree.size = 10,
min.node = 500,
mode = "onesamp",
type = "dist",
pwr = 2,
quant = 0.5,
nclass = NULL,
costs = NULL,
cdfsamp = 500,
verbose = FALSE,
tree.store = 1e+06,
cat.store = 1e+05,
nbump = 100,
fnodes = 1,
fsamp = 1,
doprint = FALSE
)
Arguments
x |
training input predictor data matrix or data frame. Rows are observations and columns are variables. Must be a numeric matrix or a data frame. |
y |
vector or matrix containing training data input outcome values or censoring intervals for each observation. If y is a vector, it contains uncensored outcome values or another contrasting quantity. If y is a matrix, it is assumed to contain censoring intervals for each observation; see details below |
z |
vector containing values of a second contrasting quantity for each observation |
w |
training observation weights |
cat.vars |
vector of column labels (numbers or names) indicating categorical variables (factors). All variables not so indicated are assumed to be orderable numeric; see details below |
not.used |
vector of column labels (numbers or names) indicating predictor variables not to be used in the model |
qint |
maximum number of split evaluation points on each predictor variable |
xmiss |
missing value flag. Must be numeric and larger than any non-missing predictor/abs(response) variable value. Predictor variable values greater than or equal to xmiss are regarded as missing. Predictor variable data values of NA are internally set to the value of xmiss and thereby regarded as missing. |
tree.size |
maximum number of terminal nodes in generated trees |
min.node |
minimum number of training observations in each tree terminal node |
mode |
indicating one or two-sample contrast; see details below for how it works with type |
type |
type of contrast; see details below for how it works with mode |
pwr |
center split bias parameter. Larger values produce less center split bias. |
quant |
specified quantile p (type='quant' only) |
nclass |
number of classes (type='class' only); default is 2 |
costs |
nclass by nclass misclassification cost matrix (type='class' only); default is equal valued diagonal (error rate) |
cdfsamp |
maximum subsample size used to compute censored CDF (censoring only) |
verbose |
a logical flag indicating whether to print censored CDF computation progress; default FALSE |
tree.store |
size of internal tree storage. Decrease value in response to memory allocation error. Increase value for very large values of max.trees and/or tree.size, or in response to diagnostic message or erratic program behavior |
cat.store |
size of internal categorical value storage. Decrease value in response to memory allocation error. Increase value for very large values of max.trees and/or tree.size in the presence of many categorical variables (factors) with many levels, or in response to diagnostic message or erratic program behavior |
nbump |
number of bootstrap replications |
fnodes |
top fraction of node criteria used to evaluate trial bumped trees |
fsamp |
fraction of observations used in each bootstrap sample for bumped trees |
doprint |
logical flag |
learn.rate |
learning rate parameter in |
niter |
number of trees |
doplot |
a flag to display/not display graphical plots |
span |
span for qq-plot transformation smoother |
plot.span |
running median smoother span for discrepancy plot ( |
print.itr |
tree discrepancy printing iteration interval |
Details
The variable xmiss is the missing value flag. It must be numeric and larger than any non-missing predictor/abs(response) variable value. Predictor variable values greater than or equal to xmiss are regarded as missing. Predictor variable data values of NA are internally set to the value of xmiss and thereby regarded as missing.
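The missing-value convention described above can be sketched in base R. This is only an illustration of the documented rule, not the package's internal code; the names x_flagged and is_missing are hypothetical.

```r
# Illustration only: mimics the documented xmiss convention in base R.
xmiss <- 9e+35  # default flag value shown in the Usage section

# Small predictor matrix with two NA entries
x <- matrix(c(1.5, NA, 3, 4, 2000, NA), ncol = 2)

# xmiss must exceed every non-missing value in absolute size
stopifnot(xmiss > max(abs(x), na.rm = TRUE))

# NA entries are replaced by xmiss (hypothetical name: x_flagged) ...
x_flagged <- x
x_flagged[is.na(x_flagged)] <- xmiss

# ... and any value greater than or equal to xmiss counts as missing
is_missing <- x_flagged >= xmiss
```

If real data contain values at or above 9e+35, a larger xmiss would have to be passed explicitly.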
If the response y is a matrix, it is assumed to contain censoring intervals for each observation. Rows are observations. The first and second columns are the lower and upper boundaries of the censoring interval, respectively (they can be the same value for uncensored observations).
- y[,1] = -xmiss implies outcome less than or equal to y[,2] (censored from above)
- y[,2] = xmiss implies outcome greater than or equal to y[,1]
Note that censoring is only allowed for type='dist'; see further below.
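As a minimal sketch of the interval encoding described above, a censored response matrix could be assembled in base R as follows. The data values and the names lower, upper and uncensored are made up for illustration.

```r
# Illustration only: encode censoring intervals as a two-column matrix.
xmiss <- 9e+35  # default flag value shown in the Usage section

lower <- c(2.1, -xmiss, 3.3,   1.0)
upper <- c(2.1,  5.0,   xmiss, 4.0)
y <- cbind(lower, upper)

# Row 1: uncensored outcome 2.1 (both bounds equal)
# Row 2: outcome known only to be <= 5.0 (lower bound is -xmiss)
# Row 3: outcome known only to be >= 3.3 (upper bound is xmiss)
# Row 4: outcome known to lie between 1.0 and 4.0

# Rows whose bounds coincide are the uncensored observations
uncensored <- y[, 1] == y[, 2]
```

A matrix of this shape would be passed as y together with type = 'dist', since censoring is documented only for that contrast type.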
If x is a data frame and cat.vars (the columns indicating categorical variables) is missing, then components of type factor are treated as categorical variables. Ordered factors should be input as type numeric with appropriate numerical scores. If cat.vars is present, it will override the data frame typing.
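The data frame typing rule can be illustrated with a small base R sketch. The data frame df and its columns are invented for this example, and using the integer level codes as scores is just one possible choice of numerical scores.

```r
# Illustration only: prepare a data frame so factor typing matches the rule above.
df <- data.frame(
  age    = c(23, 45, 31),
  region = factor(c("north", "south", "north")),
  grade  = factor(c("low", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE)
)

# Ordered factor -> numeric scores (here the level codes 1, 2, 3)
df$grade <- as.numeric(df$grade)

# Columns that are still factors would be treated as categorical
# when cat.vars is not supplied
cat_cols <- names(df)[vapply(df, is.factor, logical(1))]
```

Passing cat.vars = "region" explicitly should have the same effect here, since cat.vars takes column names or numbers.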
The mode argument is either
- 'onesamp' (default), meaning one x-vector for each (x,z) pair
- 'twosamp', implying a two-sample contrast in which
  - x are predictor variables for both samples
  - y are outcomes for both samples
  - z is the sample identity flag, with z < 0 implying first sample observations and z > 0 the second sample observations
The type argument indicates the type of contrast. It can be either a user-defined function or a string. If mode is 'onesamp' (the default), type can be:
- type = 'dist' (default) implies contrast distribution of y with that of z (y may be censored; see above)
- type = 'diff' implies contrast joint paired values of y and z
- type = 'class' implies classification (contrast class labels): y[i] and z[i] are two class labels (in 1:nclass) for each observation
- type = 'prob' implies contrast predicted with empirical probabilities: y[i] = 0/1 and z[i] is the predicted probability P(y=1) for the i-th observation
- type = 'quant' implies contrast predicted with empirical quantiles: y[i] is the outcome value for the i-th observation and z[i] is the predicted p-th quantile value (see below) for the i-th observation (0 < p < 1)
- type = 'diffmean' implies maximize absolute mean difference between y and z
- type = 'maxmean' implies maximize signed mean difference between y and z
When mode is 'twosamp',
- type = 'dist' (default) implies contrast y distributions of both samples
- type = 'diffmean' implies maximize absolute difference between means of two samples
- type = 'maxmean' implies maximize signed difference between means of two samples
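For the two-sample case, the inputs have to be stacked so that x and y cover both samples and z marks sample membership. The following base R sketch shows one way to arrange simulated data; the sample sizes, variable names and the commented-out call are assumptions for illustration only.

```r
# Illustration only: stack two samples into the x, y, z layout for mode = 'twosamp'.
set.seed(1)
n1 <- 6  # size of first sample (tiny, for display only)
n2 <- 4  # size of second sample

x1 <- matrix(rnorm(n1 * 2), ncol = 2)
x2 <- matrix(rnorm(n2 * 2), ncol = 2)
y1 <- rnorm(n1)
y2 <- rnorm(n2, mean = 1)

x <- rbind(x1, x2)                # predictors for both samples
y <- c(y1, y2)                    # outcomes for both samples
z <- c(rep(-1, n1), rep(1, n2))   # z < 0: first sample, z > 0: second sample

# Not run: requires the conTree package and far more observations,
# since min.node defaults to 500.
# tree <- contrast(x, y, z, mode = "twosamp", type = "dist")
```

Only the sign of z matters for the sample identity according to the description above, so -1 and 1 are a convenient but arbitrary choice.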
When type is a function, it must be a function of three arguments f(y,z,w), where y and z are double vectors and w is a weight vector, not necessarily normalized. The function should return a double vector of length 1 as the result. See example below.
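As a further sketch of the three-argument interface, the function below computes a weighted root-mean-square difference between y and z. The name rms_discrepancy and the choice of criterion are hypothetical; the only thing taken from the description above is the f(y,z,w) signature and the length-1 double return value.

```r
# Illustration only: a user-defined discrepancy with the f(y, z, w) signature.
rms_discrepancy <- function(y, z, w) {
  w <- w / sum(w)            # weights are not necessarily normalized
  sqrt(sum(w * (y - z)^2))   # weighted root-mean-square difference
}

# Quick check on toy vectors
d_same <- rms_discrepancy(c(1, 2, 3), c(1, 2, 3), c(1, 1, 1))
d_diff <- rms_discrepancy(c(0, 0), c(3, 3), c(2, 6))
```

Such a function would be supplied as type = rms_discrepancy in place of a string.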
Value
contrast(): a contrast model object, used as input to interpretation procedures
modtrast(): a contrast model object to be used with predtrast()
bootcri(): a named list with out$bcri, the bootstrapped discrepancy values
Author(s)
Jerome H. Friedman
References
Jerome H. Friedman (2020). doi:10.1073/pnas.1921562117
Examples
data(census, package = "conTree")
dx <- 1:10000; dxt <- 10001:16281;
# Build contrast tree
tree <- contrast(census$xt[dx,], census$yt[dx], census$gblt[dx], type = 'prob')
# Summarize tree
treesum(tree)
# Get terminal node identifiers for regions containing observations 1 through 10
getnodes(tree, x = census$xt[1:10, ])
# Plot nodes
nodeplots(tree, x = census$xt[dx, ], y = census$yt[dx], z = census$gblt[dx])
# Summarize contrast tree against (precomputed) gradient boosting
# on logistic scale using maximum likelihood (GBL)
nodesum(tree, census$xt[dxt,], census$yt[dxt], census$gblt[dxt])
# Use a custom R discrepancy function to build a contrast tree
dfun <- function(y, z, w) {
w <- w / sum(w)
abs(sum(w * (y - z)))
}
tree2 <- contrast(census$xt[dx,], census$yt[dx], census$gblt[dx], type = dfun)
nodesum(tree2, census$xt[dxt,], census$yt[dxt], census$gblt[dxt])
# Generate lack of fit curve
lofcurve(tree, census$xt[dx,], census$yt[dx], census$gblt[dx])
# Build contrast tree boosting models
# Use small # of iterations for illustration (typically >= 200)
modgbl <- modtrast(census$x, census$y, census$gbl, type = 'prob', niter = 10)
# Plot model accuracy as a function of iteration number
xval(modgbl, census$x, census$y, census$gbl, col = 'red')
# Produce predictions from modtrast() for new data.
ypred <- predtrast(modgbl, census$xt, census$gblt, num = modgbl$niter)
# Produce distribution boosting estimates
yhat <- predtrast(modgbl, census$xt, census$gblt, num = modgbl$niter)