etree {etree}    R Documentation

Energy Tree

Description

Fits an Energy Tree for classification or regression.

Usage

etree(
  response,
  covariates,
  weights = NULL,
  minbucket = 5,
  alpha = 0.05,
  R = 1000,
  split_type = "coeff",
  coeff_split_type = "test",
  p_adjust_method = "fdr",
  random_covs = NULL
)

Arguments

response

Response variable, an object of class either "factor" or "numeric" (for classification and regression, respectively).

covariates

Set of covariates. Must be provided as a list, where each element is a different variable. The currently supported types, and the form each must take to be correctly recognized, are:

  • Numeric: numeric or integer vectors;

  • Nominal: factors;

  • Functions: objects of class "fdata";

  • Graphs: (lists of) objects of class "igraph".

Each element (i.e., variable) in the covariates list must have the same length(), which corresponds to the sample size.
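The equal-length requirement can be checked before calling etree(). The following is a minimal sketch in base R; check_covariates() is a hypothetical helper written for illustration, not a function of the package, and only numeric and nominal covariates are used so that no additional packages are needed.

```r
# Hypothetical helper (not part of etree): verifies that the covariates are
# supplied as a non-empty list whose elements all have the same length().
check_covariates <- function(covariates) {
  stopifnot(is.list(covariates), length(covariates) > 0)
  n <- vapply(covariates, length, integer(1))
  if (length(unique(n)) != 1) {
    stop("all covariates must have the same length (the sample size)")
  }
  n[[1]]  # the common sample size
}

set.seed(1)
nobs <- 50
cov_list <- list(
  rnorm(nobs),                                       # numeric covariate
  factor(sample(c("a", "b"), nobs, replace = TRUE))  # nominal covariate
)
sample_size <- check_covariates(cov_list)
```

The helper returns the common sample size and stops with an error if the lengths differ.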

weights

Optional vector of non-negative integer-valued weights to be used in the fitting process. If not provided, all observations are assumed to have weight equal to 1.

minbucket

Positive integer specifying the minimum number of observations that each terminal node must contain. Default is 5.

alpha

Nominal level controlling the probability of type I error in the Energy tests of independence used for variable selection. Default is 0.05.

R

Number of replicates employed to approximate the sampling distribution of the test statistic in every Energy test of independence. Default is 1000.

split_type

Splitting method used when the selected covariate is structured. It has two possible values: "coeff" for feature vector extraction, and "cluster" for clustering. See Details for further information.

coeff_split_type

Method to select the split point for the chosen component when the selected covariate is structured and split_type = "coeff". It has two possible values: "test", in which case Energy tests of independence are used, and "traditional", to employ traditional methods (Gini index for classification and RSS for regression). See Details for further information.
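To illustrate what a "traditional" split criterion looks like in the regression case, here is a minimal base R sketch that picks the split point minimizing the residual sum of squares (RSS). The function best_rss_split() is hypothetical and ignores minbucket and weights; the package's own implementation may differ.

```r
# Hypothetical helper (not part of etree): exhaustive search for the split
# point s on a numeric component x that minimizes the total RSS of the two
# child nodes {x <= s} and {x > s}.
best_rss_split <- function(x, y) {
  candidates <- sort(unique(x))
  candidates <- candidates[-length(candidates)]  # the largest value cannot split
  rss <- vapply(candidates, function(s) {
    left <- y[x <= s]
    right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  list(split = candidates[which.min(rss)], rss = min(rss))
}

x <- 1:10
y <- c(rep(0, 5), rep(10, 5))
best <- best_rss_split(x, y)
best$split  # 5: the response jumps between x = 5 and x = 6
```

For classification, the same search would use the Gini index in place of the RSS.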

p_adjust_method

Multiple-testing adjustment method for P-values, which can be set to any of the values provided by p.adjust.methods. Default is "fdr" for False Discovery Rate.
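As a small illustration of what the adjustment does, the base R function p.adjust() can be applied directly to a vector of P-values. The P-values below are made up for the example.

```r
# Methods accepted by p.adjust(), and therefore by p_adjust_method
p.adjust.methods

# Made-up raw P-values, e.g. one per covariate
p_raw <- c(0.01, 0.02, 0.03, 0.5)

# "fdr" is an alias for the Benjamini-Hochberg procedure
p_fdr <- p.adjust(p_raw, method = "fdr")
p_fdr  # 0.04 0.04 0.04 0.50
```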

random_covs

Size of the random subset of covariates to choose from at each split. If set to NULL (default), all the covariates are considered each time.
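The idea of restricting each split to a random subset of covariates can be sketched in base R as follows. This only illustrates the sampling step under the assumption that covariates are drawn uniformly without replacement; the variable names are illustrative and the package's internal logic is not shown in this page.

```r
set.seed(1)
covariates <- list(rnorm(10), rnorm(10), factor(rep(c("u", "v"), 5)), runif(10))
random_covs <- 2

# Indices of the covariates considered at one split: all of them when
# random_covs is NULL, otherwise a random subset of the requested size.
candidate_idx <- if (is.null(random_covs)) {
  seq_along(covariates)
} else {
  sort(sample(length(covariates), random_covs))
}
candidate_covs <- covariates[candidate_idx]
```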

Details

etree() is the main function of the package of the same name. It fits an Energy Tree given the response variable, the set of covariates, and optionally some other parameters. The call has the same form regardless of the task type: the choice between classification and regression is made automatically based on the class of the response variable.
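The automatic choice of task can be pictured with the following base R sketch. task_type() is a hypothetical function used only to illustrate the rule stated for the response argument ("factor" for classification, "numeric" for regression); it is not part of the package.

```r
# Hypothetical illustration: map the class of the response to the task type.
task_type <- function(response) {
  if (is.factor(response)) {
    "classification"
  } else if (is.numeric(response)) {
    "regression"
  } else {
    stop("response must be a factor or a numeric vector")
  }
}

task_type(factor(c("a", "b", "a")))  # "classification"
task_type(c(1.2, 3.4, 5.6))          # "regression"
```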

Energy Trees (Giubilei et al., 2022) are a recursive partitioning tree-based model built upon Conditional Trees (Hothorn et al., 2006). At each step of the recursive procedure, an Energy test of independence (Szekely et al., 2007) is performed between the response variable and each of the J covariates. If the hypothesis of global independence (defined as the intersection of the J hypotheses of partial independence) is not rejected at the significance level set by alpha, the recursion stops; otherwise, the covariate most associated with the response, i.e., the one with the smallest P-value, is selected for splitting. When the selected covariate is traditional (i.e., numeric or nominal), an Energy test of independence is performed for each possible split point, and the one yielding the strongest association with the response is chosen. When the selected covariate is structured, the split procedure is defined by the value of split_type, and possibly by that of coeff_split_type.
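The variable selection step described above can be sketched in base R for numeric data. This is a simplified illustration, not the package's implementation: the functions dcov2(), perm_pvalue(), and select_covariate() are hypothetical, the test is a plain permutation test based on the squared sample distance covariance with Euclidean distances, and the stopping rule is taken to be "no adjusted P-value at or below alpha".

```r
# Squared sample distance covariance computed from two distance matrices
# via double centering.
dcov2 <- function(dx, dy) {
  center <- function(d) d - outer(rowMeans(d), colMeans(d), "+") + mean(d)
  mean(center(dx) * center(dy))
}

# Permutation P-value using R replicates: the response distances are
# permuted to approximate the null distribution of the statistic.
perm_pvalue <- function(dx, dy, R = 199) {
  n <- nrow(dx)
  observed <- dcov2(dx, dy)
  permuted <- replicate(R, {
    idx <- sample(n)
    dcov2(dx, dy[idx, idx])
  })
  (1 + sum(permuted >= observed)) / (R + 1)
}

# One selection step: test each covariate, adjust the P-values, then either
# stop or select the covariate with the smallest adjusted P-value.
select_covariate <- function(response, covariates, alpha = 0.05, R = 199,
                             p_adjust_method = "fdr") {
  dy <- as.matrix(dist(response))
  p_raw <- vapply(covariates, function(x) {
    perm_pvalue(as.matrix(dist(x)), dy, R)
  }, numeric(1))
  p_adj <- p.adjust(p_raw, method = p_adjust_method)
  if (min(p_adj) > alpha) {
    list(stop = TRUE, selected = NA_integer_, p_adj = p_adj)
  } else {
    list(stop = FALSE, selected = which.min(p_adj), p_adj = p_adj)
  }
}

set.seed(42)
n <- 80
signal <- rnorm(n)       # covariate that drives the response
noise <- rnorm(n)        # covariate unrelated to the response
response <- signal^2
res <- select_covariate(response, list(signal, noise))
res$selected             # index of the covariate chosen for splitting
```

For structured covariates, the same scheme applies once a suitable distance between observations (e.g., between functions or graphs) replaces dist().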

split_type specifies the splitting method for structured covariates. It has two possible values: "coeff", for feature vector extraction, and "cluster", for clustering.

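coeff_split_type defines the method to select the split point for the chosen component of the selected structured covariate if and only if split_type = "coeff". It has two possible values: "test", in which case Energy tests of independence are used, and "traditional", to employ traditional methods (Gini index for classification and RSS for regression).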

Value

An object of class "etree", "constparty", and "party". It stores all the information about the fitted tree. Its elements can be individually accessed using the $ operator. Their names and content are the following:

References

R. Giubilei, T. Padellini, P. Brutti (2022). Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates. arXiv preprint. https://arxiv.org/pdf/2207.04430.pdf.

S. Carmi, S. Havlin, S. Kirkpatrick, Y. Shavitt, and E. Shir (2007). A model of internet topology using k-shell decomposition. Proceedings of the National Academy of Sciences, 104(27):11150-11154.

M. Eidsaa and E. Almaas (2013). S-core network decomposition: A generalization of k-core analysis to weighted networks. Physical Review E, 88(6):062819.

C. Giatsidis, D. M. Thilikos, and M. Vazirgiannis (2013). D-cores: measuring collaboration of directed graphs based on degeneracy. Knowledge and Information Systems, 35(2):311-343.

T. Hothorn, K. Hornik, and A. Zeileis (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651-674.

L. Kaufman and P. Rousseeuw (1987). Clustering by means of medoids. Data Analysis based on the L1-Norm and Related Methods, pages 405-416.

S. B. Seidman (1983). Network structure and minimum degree. Social Networks, 5(3):269-287.

G. J. Szekely, M. L. Rizzo, and N. K. Bakirov (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769-2794.

See Also

ctree() for the partykit implementation of Conditional Trees (Hothorn et al., 2006).

Examples


library(etree)
set.seed(123)  # for reproducible simulated data

## Covariates
nobs <- 100
cov_num <- rnorm(nobs)                                   # numeric
cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5))    # nominal
cov_gph <- lapply(seq_len(nobs),
                  function(j) igraph::sample_gnp(100, 0.2))  # graphs
cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, length.out = 100),
                                sigma = 1)               # functions
cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun)

## Response variable(s)
resp_reg <- cov_num ^ 2
# Binary label: min-max scale cov_num to [0, 1], then round to 0 or 1
y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0)
resp_cls <- factor(y)

## Regression ##
etree_fit <- etree(response = resp_reg, covariates = cov_list)
print(etree_fit)
plot(etree_fit)
# In-sample mean squared error
mean((resp_reg - predict(etree_fit)) ^ 2)

## Classification ##
etree_fit <- etree(response = resp_cls, covariates = cov_list)
print(etree_fit)
plot(etree_fit)
# In-sample confusion matrix (rows: observed, columns: predicted)
table(resp_cls, predict(etree_fit))



[Package etree version 0.1.0 Index]