logicDT {logicDT}    R Documentation

Fitting logic decision trees

Description

Main function for fitting logicDT models.

Usage

## Default S3 method:
logicDT(
  X,
  y,
  max_vars = 3,
  max_conj = 3,
  Z = NULL,
  search_algo = "sa",
  cooling_schedule = cooling.schedule(),
  scoring_rule = "auc",
  tree_control = tree.control(),
  gamma = 0,
  simplify = "vars",
  val_method = "none",
  val_frac = 0.5,
  val_reps = 10,
  allow_conj_removal = TRUE,
  conjsize = 1,
  randomize_greedy = FALSE,
  greedy_mod = TRUE,
  greedy_rem = FALSE,
  max_gen = 10000,
  gp_sigma = 0.15,
  gp_fs_interval = 1,
  ...
)

## S3 method for class 'formula'
logicDT(formula, data, ...)

Arguments

X

Matrix or data frame of binary predictors coded as 0 or 1.

y

Response vector. 0-1 coding for binary responses. Otherwise, a regression task is assumed.

max_vars

Maximum total number of predictors contained in the set of conjunctions, counting repeated occurrences. For the set [X_1 \land X_2^c, X_1 \land X_3], this parameter is equal to 4.

max_conj

Maximum number of conjunctions/input variables for the decision trees. For the set [X_1 \land X_2^c, X_1 \land X_3], this parameter is equal to 2.

Z

Optional matrix or data frame of quantitative/continuous covariables. Multiple covariables are allowed for splitting the trees. If leaf regression models (such as four-parameter logistic models) are to be fitted, only the first covariable given is used.

search_algo

Search algorithm for guiding the global search. This can be "sa" for simulated annealing, "greedy" for a greedy search, or "gp" for genetic programming.

cooling_schedule

Cooling schedule parameters if simulated annealing is used. The required object should be created via the function cooling.schedule.

scoring_rule

Scoring rule for guiding the global search. This can be "auc" for the area under the receiver operating characteristic curve (default for binary responses), "deviance" for the deviance, "nce" for the normalized cross entropy, or "brier" for the Brier score. For regression tasks, the MSE (mean squared error) is chosen automatically.

tree_control

Parameters controlling the fitting of decision trees. This should be configured via the function tree.control.

gamma

Complexity penalty added to the score. If gamma > 0 is given, gamma * ||m||_0 is added to the score, where ||m||_0 is the total number of variables contained in the current model m. The main purpose of this penalty is the fitting of logicDT stumps in conjunction with boosting. For regular or bagged logicDT models, the model complexity parameters max_vars and max_conj should be tuned instead.
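
For illustration, a minimal sketch of fitting a penalized logicDT stump (the data and parameter values here are purely illustrative, not a recommendation):

set.seed(1)
X.toy <- matrix(rbinom(100 * 5, 1, 0.5), ncol = 5)
y.toy <- rbinom(100, 1, 0.2 + 0.6 * (X.toy[, 1] == 1 & X.toy[, 2] == 0))
# With gamma = 0.01, a model containing two variables in total is
# penalized by 0.01 * 2 = 0.02 on top of its score
stump <- logicDT(X.toy, y.toy, max_vars = 2, max_conj = 1, gamma = 0.01)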

simplify

Should the final fitted model be simplified? simplify = "conj" removes unnecessary terms as a whole if they cannot improve the score. simplify = "vars" additionally tries to prune individual conjunctions by removing unnecessary variables from them. simplify = "none" leaves the final model unmodified.

val_method

Inner validation method. "rv" performs a repeated validation in which the original data set is split val_reps times into val_frac * 100% validation data and (1 - val_frac) * 100% training data. "bootstrap" draws bootstrap samples and uses the out-of-bag data as validation data. "cv" employs cross-validation with val_reps folds.

val_frac

Only used if val_method = "rv". See description of val_method.

val_reps

Number of inner validation partitionings.
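
A hedged sketch of the three inner validation settings, using binary predictors X and a response y as in the Examples section (all values are illustrative):

# 10 random splits into 50% validation and 50% training data
m1 <- logicDT(X, y, val_method = "rv", val_frac = 0.5, val_reps = 10)
# Out-of-bag validation on 10 bootstrap samples
m2 <- logicDT(X, y, val_method = "bootstrap", val_reps = 10)
# 10-fold cross-validation
m3 <- logicDT(X, y, val_method = "cv", val_reps = 10)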

allow_conj_removal

Should the removal of complete terms/conjunctions be allowed during the search? If a model with exactly the specified number of terms is desired, this should be set to FALSE. If extensive hyperparameter optimizations are feasible, allow_conj_removal = FALSE together with a proper search over max_vars and max_conj is advised for fitting single models. For bagging or boosting with a greedy search, allow_conj_removal = TRUE together with a small value for max_vars = max_conj, e.g., 2 or 3, is recommended.

conjsize

The minimum number of training samples that have to belong to a conjunction. This parameter prevents the inclusion of unnecessarily complex conjunctions that only rarely occur.

randomize_greedy

Should the greedy search be randomized by considering only sqrt(number of neighbour states) neighbours in each iteration, similar to random forests? This speeds up the greedy search but can lead to inferior results.

greedy_mod

Should modifications of conjunctions be considered in a greedy search? greedy_mod = FALSE speeds up the greedy search but can lead to inferior results.

greedy_rem

Should the removal of conjunctions be considered in a greedy search? greedy_rem = FALSE speeds up the greedy search but can lead to inferior results.

max_gen

Maximum number of generations for genetic programming.

gp_sigma

Parameter sigma for fitness sharing in genetic programming. Very small values (e.g., 0.001) are recommended, leading to penalizing only models that yield exactly the same score.

gp_fs_interval

Interval for fitness sharing in genetic programming. The fitness calculation can be computationally expensive if many models exist in one generation. gp_fs_interval = 10 leads to performing fitness sharing only every 10th generation.
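
An illustrative genetic programming call combining these parameters (the values are chosen for demonstration only, not tuned):

# Fitness sharing with a very small sigma, evaluated every 10th generation
fit.gp <- logicDT(X, y, search_algo = "gp", max_gen = 5000,
                  gp_sigma = 0.001, gp_fs_interval = 10)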

...

Arguments passed to logicDT.default.

formula

An object of type formula describing the model to be fitted.

data

A data frame containing the data for the corresponding formula object. It must also contain the quantitative covariables if these are to be included.
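
A minimal sketch of the formula interface, assuming the binary predictors and the response are gathered in a single data frame:

df <- data.frame(X, y = y)
fit <- logicDT(y ~ ., data = df)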

Details

logicDT is a method for finding response-associated interactions between binary predictors. A global search for the best set of predictors and interactions between predictors is performed, trying to find the globally optimal decision trees. On the one hand, this can be seen as a variable selection. On the other hand, Boolean conjunctions between binary predictors can be identified as impactful, which is particularly useful if the corresponding marginal effects are negligible and would thus be missed by the greedy split selection of conventional decision trees.

Three search algorithms are implemented: simulated annealing ("sa"), a greedy search ("greedy"), and genetic programming ("gp"); see the argument search_algo.

Furthermore, the option of a so-called "inner validation" is available. Here, the search is guided using several train-validation splits and the average of the validation performance. This approach is computationally expensive but can lead to more robust single models.

To minimize the computation time, two-dimensional hash tables are used to store evaluated models. This is irrelevant for the greedy search but can substantially reduce the fitting times when employing simulated annealing or genetic programming, especially in combination with an inner validation.
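
After fitting, the effect of this memoization can be inspected through the returned components total_iter and prevented_evals (see Value below):

model$total_iter       # total number of tested configurations
model$prevented_evals  # tree fittings avoided via the hash table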

Value

An object of class logicDT. This is a list containing

disj

A matrix of the identified set of predictors and conjunctions of predictors. Each row corresponds to one term. Each entry corresponds to the column index in X. Negative values indicate negations. Missing values mean that the term does not contain any more variables.
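
For instance, for the set [X_1 \land X_2^c, X_1 \land X_3] from above, disj could look as follows (purely illustrative):

#      [,1] [,2]
# [1,]    1   -2    # X1 AND NOT X2 (negative value = negation)
# [2,]    1    3    # X1 AND X3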

real_disj

Human-readable form of disj in which the variable names are depicted directly.

score

Score of the best model. Smaller values are preferred.

pet

Decision tree fitted on the best set of input terms. This is a list containing the pointer to the C representation of the tree and R representations of the tree structure such as the splits and predictions.

ensemble

List of decision trees. Only relevant if inner validation was used.

total_iter

The total number of search iterations, i.e., tested configurations by fitting a tree (ensemble) and evaluating it.

prevented_evals

The number of tree fittings prevented by using the two-dimensional hash table.

...

Supplied parameters of the function call to logicDT.

Saving and Loading

logicDT models can be saved and loaded using save(...) and load(...). The internal C structures are not saved but are rebuilt from the R representations if necessary.
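
A short sketch (the file name is purely illustrative):

save(model, file = "logicDT_fit.RData")  # internal C structures are dropped
load("logicDT_fit.RData")
preds <- predict(model, X)  # rebuilds the C structures if necessary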

Examples

# Generate toy data
set.seed(123)
maf <- 0.25
n.snps <- 50
N <- 2000
X <- matrix(sample(0:2, n.snps * N, replace = TRUE,
                   prob = c((1-maf)^2, 1-(1-maf)^2-maf^2, maf^2)),
            ncol = n.snps)
colnames(X) <- paste("SNP", 1:n.snps, sep="")
# Split each 0-1-2 coded SNP into two binary variables (e.g., SNP1D, SNP1R)
X <- splitSNPs(X)
Z <- matrix(rnorm(N, 20, 10), ncol = 1)
colnames(Z) <- "E"
Z[Z < 0] <- 0
y <- -0.75 + log(2) * (X[,"SNP1D"] != 0) +
  log(4) * Z/20 * (X[,"SNP2D"] != 0 & X[,"SNP3D"] == 0) +
  rnorm(N, 0, 1)


# Fit and evaluate single logicDT model
model <- logicDT(X[1:(N/2),], y[1:(N/2)],
                 Z = Z[1:(N/2),,drop=FALSE],
                 max_vars = 3, max_conj = 2,
                 search_algo = "sa",
                 tree_control = tree.control(
                   nodesize = floor(0.05 * nrow(X)/2)
                 ),
                 simplify = "vars",
                 allow_conj_removal = FALSE,
                 conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
plot(model)
print(model)

# Fit and evaluate bagged logicDT model
model.bagged <- logicDT.bagging(X[1:(N/2),], y[1:(N/2)],
                                Z = Z[1:(N/2),,drop=FALSE],
                                bagging.iter = 50,
                                max_vars = 3, max_conj = 3,
                                search_algo = "greedy",
                                tree_control = tree.control(
                                  nodesize = floor(0.05 * nrow(X)/2)
                                ),
                                simplify = "vars",
                                conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model.bagged, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
print(model.bagged)

# Fit and evaluate boosted logicDT model
model.boosted <- logicDT.boosting(X[1:(N/2),], y[1:(N/2)],
                                  Z = Z[1:(N/2),,drop=FALSE],
                                  boosting.iter = 50,
                                  learning.rate = 0.01,
                                  subsample.frac = 0.75,
                                  replace = FALSE,
                                  max_vars = 3, max_conj = 3,
                                  search_algo = "greedy",
                                  tree_control = tree.control(
                                    nodesize = floor(0.05 * nrow(X)/2)
                                  ),
                                  simplify = "vars",
                                  conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model.boosted, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
print(model.boosted)

# Calculate VIMs (variable importance measures)
vims <- vim(model.bagged)
plot(vims)
print(vims)

# Single greedy model
model <- logicDT(X[1:(N/2),], y[1:(N/2)],
                 Z = Z[1:(N/2),,drop=FALSE],
                 max_vars = 3, max_conj = 2,
                 search_algo = "greedy",
                 tree_control = tree.control(
                   nodesize = floor(0.05 * nrow(X)/2)
                 ),
                 simplify = "vars",
                 allow_conj_removal = FALSE,
                 conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
plot(model)
print(model)
