R: Energy Forests

eforest {etree}

R Documentation

Energy Forests

Description

Fits an Energy Forest, in the form of either a bagging of Energy Trees or a Random Energy Forest, depending on the value of the random_covs parameter.

Usage

eforest(
  response,
  covariates,
  weights = NULL,
  ntrees = 100,
  ncores = 1L,
  minbucket = 1,
  alpha = 1,
  R = 500,
  split_type = "cluster",
  coeff_split_type = "test",
  p_adjust_method = "fdr",
  perf_metric = NULL,
  random_covs = "auto",
  verbose = FALSE
)

Arguments

`response`	Response variable, an object of class either `"factor"` or `"numeric"` (for classification and regression, respectively).
`covariates`	Set of covariates. Must be provided as a list, where each element is a different variable. Currently available types and the form they need to have to be correctly recognized are the following: Numeric: numeric or integer vectors; Nominal: factors; Functions: objects of class `"fdata"`; Graphs: (lists of) objects of class `"igraph"`. Each element (i.e., variable) in the covariates list must have the same `length()`, which corresponds to the sample size.
`weights`	Optional vector of non-negative integer-valued weights to be used in the fitting process. If not provided, all observations are assumed to have weight equal to 1.
`ntrees`	Number of Energy Trees to grow, i.e., the number of bootstrap samples to be generated and used for fitting.
`ncores`	Number of cores to use, i.e., the maximum number of child processes run simultaneously. Must be exactly 1 on Windows (which uses the master process). `ncores` corresponds to `mc.cores` in `mclapply()`, which is used to grow the individual Energy Trees in parallel.
`minbucket`	Positive integer specifying the minimum number of observations that each terminal node must contain. Default is 1.
`alpha`	Nominal level controlling the probability of type I error in the Energy tests of independence used for variable selection. Default is 1.
`R`	Number of replicates employed to approximate the sampling distribution of the test statistic in every Energy test of independence. Default is 500.
`split_type`	Splitting method used when the selected covariate is structured. It has two possible values: `"coeff"` for feature vector extraction, and `"cluster"` for clustering. See Details for further information.
`coeff_split_type`	Method to select the split point for the chosen component when the selected covariate is structured and `split_type = "coeff"`. It has two possible values: `"test"`, in which case Energy tests of independence are used, and `"traditional"`, to employ traditional methods (Gini index for classification and RSS for regression). See Details for further information.
`p_adjust_method`	Multiple-testing adjustment method for P-values, which can be set to any of the values provided by `p.adjust.methods`. Default is `"fdr"` for False Discovery Rate.
`perf_metric`	Performance metric that is used to compute the Out-Of-Bag score. If `NULL`, default choices are used: Balanced Accuracy for classification and Root Mean Square Percentage Error for regression. See Details for further information and possible alternatives.
`random_covs`	Size of the random subset of covariates to choose from at each split. If set to `NULL`, all the covariates are considered each time, resulting in a bagging of Energy Trees. When `random_covs` is an integer greater than 1 and less than the total number of covariates, the model is a Random Energy Forest. By default, it is equal to `"auto"`, which implies the square root of the number of covariates for classification, or one third of the number of covariates for regression (in both cases, rounded down to the nearest integer).
`verbose`	Logical indicating whether to print a one-line notification for the conclusion of each tree's fitting process.
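
As a rough illustration of the `"auto"` rule described for `random_covs`, the following base-R sketch computes the subset size from the number of covariates. The helper name `auto_random_covs` is hypothetical and not part of etree; the package's internal handling (for instance, of very small covariate sets) may differ.

```r
# Hypothetical helper (not part of etree): size of the random covariate
# subset implied by random_covs = "auto", following the rule stated above.
auto_random_covs <- function(n_covs, task = c("classification", "regression")) {
  task <- match.arg(task)
  if (task == "classification") {
    floor(sqrt(n_covs))  # square root of the number of covariates, rounded down
  } else {
    floor(n_covs / 3)    # one third of the number of covariates, rounded down
  }
}

auto_random_covs(10, "classification")
auto_random_covs(10, "regression")
```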

Details

eforest() generates ntrees bootstrap samples and then calls etree() on each of them. Then, it computes the Out-Of-Bag (OOB) score using the performance metric defined through perf_metric.
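
The bootstrap and Out-Of-Bag bookkeeping described above can be sketched in base R as follows. This is only an illustration of the general idea, not the code used by etree: the object names (`boot_idx`, `oob_idx`, `oob_trees`) are made up for the example, and how the package aggregates OOB predictions is not shown here.

```r
set.seed(1)
nobs <- 20    # sample size (illustrative)
ntrees <- 5   # number of bootstrap samples, one per tree

# One bootstrap sample per tree: row indices drawn with replacement
boot_idx <- lapply(seq_len(ntrees), function(b) {
  sample.int(nobs, nobs, replace = TRUE)
})

# Out-of-bag observations for each tree: those never drawn in its sample
oob_idx <- lapply(boot_idx, function(idx) setdiff(seq_len(nobs), idx))

# For each observation, the trees for which it is out-of-bag; only these
# trees would contribute to that observation's OOB prediction
oob_trees <- lapply(seq_len(nobs), function(i) {
  which(vapply(oob_idx, function(o) i %in% o, logical(1)))
})
```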

For classification, possible values of perf_metric are "BAcc" and "WBAcc". Both are general enough to be used in multiclass classification problems, while still producing sensible results in binary classification. The two options are based on a common underlying metric, the Balanced Accuracy, which is defined as the arithmetic mean of Sensitivity and Specificity. In this framework, Balanced Accuracy is computed using a "One vs. All" approach, i.e., considering one class at a time: positive instances are those belonging to that class, and negatives are the ones belonging to any other class. The "One vs. All" Balanced Accuracies obtained for each class are then averaged. When perf_metric = "BAcc" (default for classification tasks), the average is arithmetic. When perf_metric = "WBAcc", the average is weighted by class size, hence giving more importance to the "One vs. All" Balanced Accuracy of larger classes.
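
A minimal base-R sketch of the "One vs. All" averaging described above may help make the two options concrete. The function `ova_bacc` is hypothetical and is not the implementation used by etree; it assumes that every level of `y_true` occurs at least once and is not the only class present.

```r
# Hypothetical helper (not the etree implementation): "One vs. All"
# Balanced Accuracy, averaged over classes either arithmetically
# ("BAcc"-style) or weighted by class size ("WBAcc"-style).
ova_bacc <- function(y_true, y_pred, weighted = FALSE) {
  classes <- levels(y_true)
  bacc <- vapply(classes, function(k) {
    pos <- y_true == k               # positives: observations in class k
    sens <- mean(y_pred[pos] == k)   # Sensitivity (true positive rate)
    spec <- mean(y_pred[!pos] != k)  # Specificity (true negative rate)
    (sens + spec) / 2
  }, numeric(1))
  if (weighted) {
    sizes <- as.numeric(table(y_true)[classes])
    sum(bacc * sizes) / sum(sizes)   # average weighted by class size
  } else {
    mean(bacc)                       # plain arithmetic average
  }
}

y_true <- factor(c("a", "a", "a", "a", "b", "b", "c", "c"))
y_pred <- factor(c("a", "a", "a", "b", "b", "b", "c", "a"),
                 levels = levels(y_true))

ova_bacc(y_true, y_pred)                   # arithmetic average over classes
ova_bacc(y_true, y_pred, weighted = TRUE)  # class-size-weighted average
```

In this toy example the largest class has a lower Balanced Accuracy than class "b", so the weighted average comes out lower than the arithmetic one.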

For regression, the default value of perf_metric is "RMSPE", namely, Root Mean Square Percentage Error. Other available options are c("MAE", "MAPE", "MedianAE", "MedianAPE", "MSE", "NRMSE", "RAE", "RMSE", "RMLSE"). Each of these names refers to the function of the same name in the MLmetrics package, whose documentation provides more information about their definitions.
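
For reference, the default regression metric can be written out by hand in base R. The sketch below uses the usual definition of Root Mean Square Percentage Error and a hypothetical function name, `rmspe`; it is meant to convey what the score measures, and the MLmetrics documentation remains the authoritative source for the exact definition used.

```r
# Base-R sketch of the Root Mean Square Percentage Error. Assumes no true
# value is zero, since each error is divided by the true value.
rmspe <- function(y_true, y_pred) {
  sqrt(mean(((y_true - y_pred) / y_true)^2))
}

y_true <- c(2, 4, 5, 10)
y_pred <- c(1, 4, 6, 10)
rmspe(y_true, y_pred)
```

Because errors are scaled by the true values, responses close to zero can inflate this score considerably; in that case one of the absolute-error alternatives listed above may be easier to interpret.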

Value

Object of class "eforest" with three elements: 1) ensemble, a list containing all the fitted trees; 2) oob_score, an object of class "numeric" giving the OOB score computed with the performance metric defined through perf_metric; 3) perf_metric, an object of class "character" giving the name of the performance metric used in the computations.

Examples

## Covariates
set.seed(123)
nobs <- 100
cov_num <- rnorm(nobs)
cov_nom <- factor(rbinom(nobs, size = 1, prob = 0.5))
cov_gph <- lapply(1:nobs, function(j) igraph::sample_gnp(100, 0.2))
cov_fun <- fda.usc::rproc2fdata(nobs, seq(0, 1, len = 100), sigma = 1)
cov_list <- list(cov_num, cov_nom, cov_gph, cov_fun)

## Response variable(s)
resp_reg <- cov_num ^ 2
y <- round((cov_num - min(cov_num)) / (max(cov_num) - min(cov_num)), 0)
resp_cls <- factor(y)

## Regression ##
eforest_fit <- eforest(response = resp_reg, covariates = cov_list, ntrees = 12)
print(eforest_fit$ensemble[[1]])
plot(eforest_fit$ensemble[[1]])
mean((resp_reg - predict(eforest_fit)) ^ 2)

## Classification ##
eforest_fit <- eforest(response = resp_cls, covariates = cov_list, ntrees = 12)
print(eforest_fit$ensemble[[12]])
plot(eforest_fit$ensemble[[12]])
table(resp_cls, predict(eforest_fit))

[Package etree version 0.1.0 Index]