etree {etree} | R Documentation
Energy Tree
Description
Fits an Energy Tree for classification or regression.
Usage
etree(
response,
covariates,
weights = NULL,
minbucket = 5,
alpha = 0.05,
R = 1000,
split_type = "coeff",
coeff_split_type = "test",
p_adjust_method = "fdr",
random_covs = NULL
)
Arguments
response |
Response variable, an object of class either
|
covariates |
Set of covariates. Must be provided as a list, where each element is a different variable. Currently available types and the form they need to have to be correctly recognized are the following:
Each element (i.e., variable) in the covariates list must have the same
|
weights |
Optional vector of non-negative integer-valued weights to be used in the fitting process. If not provided, all observations are assumed to have weight equal to 1. |
minbucket |
Positive integer specifying the minimum number of observations that each terminal node must contain. Default is 5. |
alpha |
Nominal level controlling the probability of type I error in the Energy tests of independence used for variable selection. Default is 0.05. |
R |
Number of replicates employed to approximate the sampling distribution of the test statistic in every Energy test of independence. Default is 1000. |
split_type |
Splitting method used when the selected covariate is structured. It has two possible values: "coeff" (the default) and "cluster". See Details. |
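coeff_split_type |
Method to select the split point for the chosen component when the selected covariate is structured and split_type = "coeff". It has two possible values: "test" (the default) and "traditional". See Details. |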
p_adjust_method |
Multiple-testing adjustment method for P-values,
which can be set to any of the values provided by
|
random_covs |
Size of the random subset of covariates to choose from
at each split. If set to |
Details
etree() is the main function of the package of the same name. It fits Energy
Trees from the response variable, the set of covariates, and, optionally,
some further parameters. The call has the same form regardless of the task
type: the choice between classification and regression is made automatically,
depending on the nature of the response variable.
Energy Trees (Giubilei et al., 2022) are a recursive partitioning model built
upon Conditional Trees (Hothorn et al., 2006). At each step of the recursive
procedure, an Energy test of independence (Szekely et al., 2007) is performed
between the response variable and each of the J covariates. If the test of
global independence (defined as the intersection of the J tests of partial
independence) is not rejected at the significance level set by alpha, the
recursion is stopped; otherwise, the covariate most associated with the
response in terms of P-value is selected for splitting. When the selected
covariate is traditional (i.e., numeric or nominal), an Energy test of
independence is performed for each possible split point, and the one yielding
the strongest association with the response is chosen. When the selected
covariate is structured, the split procedure is defined by the value of
split_type, and possibly by that of coeff_split_type.
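The variable selection step described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper names double_center() and perm_pvalue() are hypothetical, the statistic is the squared sample distance covariance of Szekely et al. (2007), and the stopping rule (stop when the smallest adjusted P-value exceeds alpha) is an assumption about how the global test is carried out.

```r
# Double-centre a distance matrix:
# A_ij = d_ij - (row mean i) - (column mean j) + (grand mean).
double_center <- function(d) {
  d <- as.matrix(d)
  sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
}

# Permutation P-value for independence, based on the squared sample distance
# covariance computed from two distance matrices (hypothetical helper).
perm_pvalue <- function(dx, dy, R = 199) {
  A <- double_center(dx)
  B <- double_center(dy)
  n <- nrow(A)
  observed <- mean(A * B)
  replicates <- replicate(R, {
    idx <- sample.int(n)           # permute the observations of the response
    mean(A * B[idx, idx])
  })
  (1 + sum(replicates >= observed)) / (R + 1)
}

set.seed(1)
n  <- 60
x1 <- rnorm(n)                     # covariate related to the response
x2 <- rnorm(n)                     # unrelated covariate
y  <- x1^2 + rnorm(n, sd = 0.1)

covs  <- list(x1 = x1, x2 = x2)
p_raw <- vapply(covs, function(x) perm_pvalue(dist(x), dist(y)), numeric(1))
p_adj <- p.adjust(p_raw, method = "fdr")   # plays the role of p_adjust_method

alpha <- 0.05
# Stop (NA) if no adjusted P-value is below alpha; otherwise pick the smallest.
selected <- if (min(p_adj) > alpha) NA_character_ else names(which.min(p_adj))
selected
```

Working on distance matrices rather than raw values is what lets the same test apply to any covariate type for which a distance is available.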
split_type specifies the splitting method for structured covariates.
It has two possible values:
- "coeff": feature vector extraction is used to transform the selected structured covariate into a set of numeric components using a representation that is specific to its type. The available transformations are cubic B-spline expansions for functional data and shell distributions (Carmi et al., 2007) for graphs, obtained through k-cores (Seidman, 1983), s-cores (Eidsaa and Almaas, 2013), and d-cores (Giatsidis et al., 2013) for binary, weighted, and directed graphs, respectively. Then, the component most associated with the response is selected using Energy tests of independence (Szekely et al., 2007), and the split point for that component is chosen using the method defined by coeff_split_type;
- "cluster": the observed values of the selected structured covariate are used within a Partitioning Around Medoids (Kaufmann and Rousseeuw, 1987) step to split observations into the two kid nodes. Medoid calculation and unit assignment are performed using pam(). Distances are specific to each type of variable (see dist_comp() for details).
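As an illustration of the "cluster" method, the sketch below splits a set of observations into two kid nodes with pam() from the cluster package, applied to a precomputed dissimilarity matrix. The simulated curves and the Euclidean distance are stand-ins chosen for this example; they are assumptions and do not reproduce the type-specific distances computed by dist_comp().

```r
library(cluster)

set.seed(42)
# Twenty toy "curves" observed on 20 points: two groups with different mean levels.
curves <- rbind(
  matrix(rnorm(10 * 20, mean = 0), nrow = 10),
  matrix(rnorm(10 * 20, mean = 3), nrow = 10)
)

# Pairwise dissimilarities (Euclidean here, as a stand-in).
d <- dist(curves)

# Partitioning Around Medoids with two clusters, one per kid node.
fit <- pam(d, k = 2, diss = TRUE)

kid_node <- fit$clustering   # cluster membership (1 or 2) of each observation
medoids  <- fit$id.med       # indices of the two medoid observations
table(kid_node)
```

Because pam() only needs pairwise dissimilarities, the same call works for any covariate type once a suitable distance has been computed.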
coeff_split_type defines the method to select the split point for the
chosen component of the selected structured covariate if and only if
split_type = "coeff". It has two possible values:
- "test": an Energy test of independence (Szekely et al., 2007) is performed for each possible split point of the chosen component, and the one yielding the strongest association with the response is selected;
- "traditional": the split point for the chosen component is selected as the one minimizing the Gini index (for classification) or the RSS (for regression) in the two kid nodes.
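The "traditional" criterion can be sketched in base R as an exhaustive search over candidate cut points of a numeric component. The functions impurity() and best_split() are hypothetical names introduced for this example; using midpoints between consecutive observed values as candidates, weighting the Gini index by node size, and enforcing minbucket inside the search are assumptions rather than a description of the package internals.

```r
# Node impurity: residual sum of squares for a numeric response,
# size-weighted Gini index for a factor response.
impurity <- function(y) {
  if (is.factor(y)) {
    p <- table(y) / length(y)
    length(y) * (1 - sum(p^2))
  } else {
    sum((y - mean(y))^2)
  }
}

# Exhaustive search for the cut point minimising the total impurity
# of the two kid nodes.
best_split <- function(x, y, minbucket = 5) {
  xs   <- sort(unique(x))
  cuts <- head(xs, -1) + diff(xs) / 2          # midpoints between observed values
  score <- vapply(cuts, function(cut) {
    left <- x <= cut
    # Discard cuts that would leave a kid node with too few observations.
    if (sum(left) < minbucket || sum(!left) < minbucket) return(Inf)
    impurity(y[left]) + impurity(y[!left])
  }, numeric(1))
  if (!any(is.finite(score))) return(NA_real_)  # no admissible split
  cuts[which.min(score)]
}

x     <- 1:20
y_reg <- c(rep(0, 10), rep(5, 10))
y_cls <- factor(rep(c("a", "b"), each = 10))

best_split(x, y_reg)   # regression: minimises the RSS
best_split(x, y_cls)   # classification: minimises the Gini index
```

Unlike the "test" option, this criterion involves no resampling, so it is cheaper to evaluate at each candidate split point.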
Value
An object of class "etree", "constparty", and "party".
It stores all the information about the fitted tree. Its elements can be
individually accessed using the $ operator. Their names and content
are the following:
- node: a partynode object representing the basic structure of the tree;
- data: a list containing the data used for the fitting process. Traditional covariates are included in their original form, while structured covariates are stored in the form of components if split_type = "coeff" or as a factor whose levels go from 1 to the total number of observations if split_type = "cluster";
- fitted: a data.frame whose number of rows equals the sample size. It includes the fitted terminal node identifiers (in "(fitted)") and the response values of all observations (in "(response)");
- terms: a terms object;
- names (optional): names of the nodes in the tree. They can be set using a character vector: if its length is smaller than the number of nodes, the remaining nodes have missing names; if its length is larger, the excess names are ignored.
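A short sketch of how these elements might be inspected on a fitted tree. It assumes the etree package is installed (the fit is skipped otherwise); the simulated data and the reduced number of replicates are arbitrary choices made to keep the example quick.

```r
set.seed(123)
nobs  <- 60
x_num <- rnorm(nobs)
x_nom <- factor(rbinom(nobs, size = 1, prob = 0.5))
resp  <- x_num^2

fit <- NULL
if (requireNamespace("etree", quietly = TRUE)) {
  # Regression tree on two traditional covariates, with fewer replicates
  # than the default to reduce the running time.
  fit <- etree::etree(response = resp,
                      covariates = list(x_num, x_nom),
                      R = 200)

  print(class(fit))                       # classes listed under Value
  print(head(fit$fitted))                 # columns "(fitted)" and "(response)"
  print(table(fit$fitted[["(fitted)"]]))  # observations per terminal node
}
```

The column names of fit$fitted contain parentheses, so they are accessed with [[ ]] and a quoted name rather than with a bare $ name.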
References
R. Giubilei, T. Padellini, and P. Brutti (2022). Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates. arXiv preprint. https://arxiv.org/pdf/2207.04430.pdf.
S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt, and E. Shir (2007). A model of internet topology using k-shell decomposition. Proceedings of the National Academy of Sciences, 104(27):11150-11154.
M. Eidsaa and E. Almaas (2013). S-core network decomposition: A generalization of k-core analysis to weighted networks. Physical Review E, 88(6):062819.
C. Giatsidis, D. M. Thilikos, and M. Vazirgiannis (2013). D-cores: measuring collaboration of directed graphs based on degeneracy. Knowledge and information systems, 35(2):311-343.
T. Hothorn, K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674.
L. Kaufmann and P. Rousseeuw (1987). Clustering by means of medoids. Data Analysis based on the L1-Norm and Related Methods, pages 405-416.
S. B. Seidman (1983). Network structure and minimum degree. Social networks, 5(3):269-287.
G. J. Szekely, M. L. Rizzo, and N. K. Bakirov (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769-2794.
See Also
ctree() for the partykit implementation of Conditional Trees
(Hothorn et al., 2006).
Examples
## Covariates
nobs <- 100
cov_num <- rnorm(nobs)
cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5))
cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2))
cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1)
cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun)
## Response variable(s)
resp_reg <- cov_num ^ 2
y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0)
resp_cls <- factor(y)
## Regression ##
etree_fit <- etree(response = resp_reg, covariates = cov_list)
print(etree_fit)
plot(etree_fit)
mean((resp_reg - predict(etree_fit)) ^ 2)
## Classification ##
etree_fit <- etree(response = resp_cls, covariates = cov_list)
print(etree_fit)
plot(etree_fit)
table(resp_cls, predict(etree_fit))