eforest {etree}        R Documentation

Energy Forests

Description

Fits an Energy Forest, in the form of either a bagging of Energy Trees or a Random Energy Forest, depending on the value of the random_covs parameter.

Usage

eforest(
  response,
  covariates,
  weights = NULL,
  ntrees = 100,
  ncores = 1L,
  minbucket = 1,
  alpha = 1,
  R = 500,
  split_type = "cluster",
  coeff_split_type = "test",
  p_adjust_method = "fdr",
  perf_metric = NULL,
  random_covs = "auto",
  verbose = FALSE
)

Arguments

response

Response variable, an object of class either "factor" or "numeric" (for classification and regression, respectively).

covariates

Set of covariates. Must be provided as a list, where each element is a different variable. Currently available types and the form they need to have to be correctly recognized are the following:

  • Numeric: numeric or integer vectors;

  • Nominal: factors;

  • Functions: objects of class "fdata";

  • Graphs: (lists of) objects of class "igraph".

Each element (i.e., variable) in the covariates list must have the same length(), which corresponds to the sample size.
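
As a minimal sketch of the length requirement, the following base-R snippet builds a covariates list from one numeric and one nominal variable and checks that every element has the same length(). The names cov_list and n_per_cov are illustrative only. Structured covariates ("fdata" objects and lists of "igraph" objects) are left out so the snippet has no dependencies.

```r
set.seed(1)
nobs <- 50

# One element per variable: a numeric vector and a factor
cov_list <- list(
  rnorm(nobs),
  factor(sample(c("a", "b"), nobs, replace = TRUE))
)

# Every element must have the same length(), i.e., the sample size
n_per_cov <- vapply(cov_list, length, integer(1))
stopifnot(length(unique(n_per_cov)) == 1)
```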

weights

Optional vector of non-negative integer-valued weights to be used in the fitting process. If not provided, all observations are assumed to have weight equal to 1.

ntrees

Number of Energy Trees to grow, i.e., the number of bootstrap samples to be generated and used for fitting.

ncores

Number of cores to use, i.e., the maximum number of child processes that will run simultaneously. Must be exactly 1 on Windows (which uses the master process). ncores corresponds to mc.cores in mclapply(), which is used to grow the individual Energy Trees in parallel.
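
The role of mc.cores can be illustrated with a small, hedged sketch that calls parallel::mclapply() directly. The function fit_one is a hypothetical stand-in for fitting a single tree and is not part of etree. The platform check mirrors the restriction that only one core may be used on Windows.

```r
library(parallel)

# Forking is not available on Windows, so fall back to a single core there
ncores <- if (.Platform$OS.type == "windows") 1L else 2L

# Hypothetical stand-in for fitting tree number b
fit_one <- function(b) b^2

# One task per tree, with at most ncores child processes at a time
res <- mclapply(1:4, fit_one, mc.cores = ncores)
```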

minbucket

Positive integer specifying the minimum number of observations that each terminal node must contain. Default is 1.

alpha

Nominal level controlling the probability of type I error in the Energy tests of independence used for variable selection. Default is 1.

R

Number of replicates employed to approximate the sampling distribution of the test statistic in every Energy test of independence. Default is 500.

split_type

Splitting method used when the selected covariate is structured. It has two possible values: "coeff" for feature vector extraction, and "cluster" for clustering. See the Details of etree() for further information.

coeff_split_type

Method to select the split point for the chosen component when the selected covariate is structured and split_type = "coeff". It has two possible values: "test", in which case Energy tests of independence are used, and "traditional", to employ traditional methods (Gini index for classification and RSS for regression). See the Details of etree() for further information.

p_adjust_method

Multiple-testing adjustment method for P-values, which can be set to any of the values provided by p.adjust.methods. Default is "fdr" for False Discovery Rate.
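
For orientation, the snippet below applies stats::p.adjust() to a small vector of made-up P-values. It shows what the "fdr" adjustment does in isolation and is not taken from the etree internals. The vector p_raw is purely illustrative.

```r
# Illustrative raw P-values (made up for this sketch)
p_raw <- c(0.01, 0.02, 0.03, 0.5)

# All admissible values for p_adjust_method
print(p.adjust.methods)

# "fdr" is an alias for the Benjamini-Hochberg adjustment
p_fdr <- p.adjust(p_raw, method = "fdr")

# A more conservative alternative, for comparison
p_bonf <- p.adjust(p_raw, method = "bonferroni")
```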

perf_metric

Performance metric that is used to compute the Out-Of-Bag score. If NULL, default choices are used: Balanced Accuracy for classification and Root Mean Square Percentage Error for regression. See Details for further information and possible alternatives.

random_covs

Size of the random subset of covariates to choose from at each split. If set to NULL, all the covariates are considered each time, resulting in a bagging of Energy Trees. When random_covs is an integer greater than 1 and less than the total number of covariates, the model is a Random Energy Forest. By default, it is equal to "auto", which implies the square root of the number of covariates for classification, or one third of the number of covariates for regression (in both cases, rounded down to the nearest integer).
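
The "auto" rule described above can be sketched in a few lines of base R. The helper auto_random_covs is hypothetical and only restates the rule in the text: square root of the number of covariates for classification, one third for regression, both rounded down. It is not the package's own code.

```r
# Hypothetical helper restating the "auto" rule from the description
auto_random_covs <- function(n_covs, response) {
  if (is.factor(response)) {
    floor(sqrt(n_covs))   # classification
  } else {
    floor(n_covs / 3)     # regression
  }
}

size_cls <- auto_random_covs(16, factor(c("a", "b")))
size_reg <- auto_random_covs(16, c(1.5, 2.5))
```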

verbose

Logical indicating whether to print a one-line notification when the fitting of each tree is complete.

Details

eforest() generates ntrees bootstrap samples and then calls etree() on each of them. Then, it computes the Out-Of-Bag (OOB) score using the performance metric defined through perf_metric.
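
The bootstrap and Out-Of-Bag logic can be illustrated generically as follows. This is a sketch of the standard technique under the assumption that each bootstrap sample is drawn with replacement and has the same size as the data. The names boot_idx and oob_idx are illustrative and do not come from the package.

```r
set.seed(42)
n <- 20        # sample size
ntrees <- 3    # number of bootstrap samples

# One bootstrap sample of row indices per tree, drawn with replacement
boot_idx <- lapply(seq_len(ntrees), function(b) sample.int(n, n, replace = TRUE))

# Out-Of-Bag observations: those never drawn in the corresponding sample
oob_idx <- lapply(boot_idx, function(idx) setdiff(seq_len(n), idx))
```

Each tree would be fitted on its bootstrap sample, and the observations in the matching oob_idx element could then be used to score it.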

For classification, possible values of perf_metric are "BAcc" and "WBAcc". Both are general enough to be used in multiclass classification problems, while still producing sensible results in the binary case. The two options are based on the same underlying performance metric, the Balanced Accuracy, which is defined as the arithmetic mean of Sensitivity and Specificity. In this framework, Balanced Accuracy is computed using a "One vs. All" approach, i.e., considering one class at a time: positive instances are those belonging to that class, and negatives are the ones belonging to any other class. The "One vs. All" Balanced Accuracies obtained for each class are then averaged. When perf_metric = "BAcc" (default for classification tasks), the average is arithmetic. When perf_metric = "WBAcc", the average is weighted by class sizes, hence giving more importance to the "One vs. All" Balanced Accuracy of larger classes.
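
The two classification metrics can be sketched in base R as follows. The helper bacc_one_vs_all and the toy vectors actual and predicted are hypothetical and only follow the verbal definition above. The package's own implementation may differ in its details.

```r
# "One vs. All" Balanced Accuracy for a single class
bacc_one_vs_all <- function(actual, predicted, cls) {
  pos <- actual == cls
  sens <- mean(predicted[pos] == cls)    # Sensitivity: true positive rate
  spec <- mean(predicted[!pos] != cls)   # Specificity: true negative rate
  (sens + spec) / 2
}

# Toy labels (made up for this sketch)
actual <- factor(c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))
predicted <- factor(c("a", "a", "a", "b", "b", "b", "c", "c", "c", "a"),
                    levels = levels(actual))

# One Balanced Accuracy per class
per_class <- vapply(levels(actual),
                    function(cls) bacc_one_vs_all(actual, predicted, cls),
                    numeric(1))

# "BAcc": arithmetic mean; "WBAcc": mean weighted by class sizes
bacc <- mean(per_class)
wbacc <- weighted.mean(per_class, as.numeric(table(actual)))
```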

For regression, the default value of perf_metric is "RMSPE", namely, Root Mean Square Percentage Error. Other available options are c("MAE", "MAPE", "MedianAE", "MedianAPE", "MSE", "NRMSE", "RAE", "RMSE", "RMLSE"). Each of these names refers to the function of the same name in the package MLmetrics, whose documentation provides more information about their definitions.
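
As a rough illustration of the default regression metric, the snippet below computes a Root Mean Square Percentage Error in base R, assuming the usual definition (square root of the mean squared relative error). The helper rmspe and the toy vectors are hypothetical. MLmetrics remains the authoritative reference for the exact definition used.

```r
# Hypothetical base-R version, assuming the usual RMSPE definition
rmspe <- function(y_pred, y_true) {
  sqrt(mean(((y_true - y_pred) / y_true)^2))
}

# Toy values (made up for this sketch)
y_true <- c(2, 4)
y_pred <- c(1, 5)

score <- rmspe(y_pred, y_true)
```

Because the errors are divided by y_true, the metric is undefined when the response contains zeros.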

Value

Object of class "eforest" with three elements: 1) ensemble, which is a list gathering all the fitted trees; 2) oob_score, an object of class "numeric" representing the OOB score computed using the performance metric defined through perf_metric; 3) perf_metric, an object of class "character" returning the performance metric used for computations.

Examples




## Covariates
set.seed(123)
nobs <- 100
cov_num <- rnorm(nobs)
cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5))
cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2))
cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, length.out = 100), sigma = 1)
cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun)

## Response variable(s)
resp_reg <- cov_num ^ 2
y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0)
resp_cls <- factor(y)

## Regression ##
eforest_fit <- eforest(response = resp_reg, covariates = cov_list, ntrees = 12)
print(eforest_fit$ensemble[[1]])
plot(eforest_fit$ensemble[[1]])
mean((resp_reg - predict(eforest_fit)) ^ 2)

## Classification ##
eforest_fit <- eforest(response = resp_cls, covariates = cov_list, ntrees = 12)
print(eforest_fit$ensemble[[12]])
plot(eforest_fit$ensemble[[12]])
table(resp_cls, predict(eforest_fit))



[Package etree version 0.1.0 Index]