logicDT {logicDT}    R Documentation

Fitting logic decision trees

Description

Main function for fitting logicDT models.

Usage

## Default S3 method:
logicDT(
  X,
  y,
  max_vars = 3,
  max_conj = 3,
  Z = NULL,
  search_algo = "sa",
  cooling_schedule = cooling.schedule(),
  scoring_rule = "auc",
  tree_control = tree.control(),
  gamma = 0,
  simplify = "vars",
  val_method = "none",
  val_frac = 0.5,
  val_reps = 10,
  allow_conj_removal = TRUE,
  conjsize = 1,
  randomize_greedy = FALSE,
  greedy_mod = TRUE,
  greedy_rem = FALSE,
  max_gen = 10000,
  gp_sigma = 0.15,
  gp_fs_interval = 1,
  ...
)

## S3 method for class 'formula'
logicDT(formula, data, ...)

Arguments

X

Matrix or data frame of binary predictors coded as 0 or 1.

y

Response vector. 0-1 coding for binary responses. Otherwise, a regression task is assumed.

max_vars

Maximum total number of predictors contained in the set of conjunctions, counting repeated occurrences. For the set [X_1 \land X_2^c, X_1 \land X_3], this parameter is equal to 4.

max_conj

Maximum number of conjunctions/input variables for the decision trees. For the set [X_1 \land X_2^c, X_1 \land X_3], this parameter is equal to 2.

Z

Optional matrix or data frame of quantitative/continuous covariables. Multiple covariables are allowed for splitting the trees. If leaf regression models (such as four-parameter logistic models) are to be fitted, only the first covariable given is used.

search_algo

Search algorithm for guiding the global search. This can be "sa" for simulated annealing, "greedy" for a greedy search, or "gp" for genetic programming.

cooling_schedule

Cooling schedule parameters if simulated annealing is used. The required object should be created via the function cooling.schedule.

scoring_rule

Scoring rule for guiding the global search. This can be "auc" for the area under the receiver operating characteristic curve (default for binary responses), "deviance" for the deviance, "nce" for the normalized cross entropy, or "brier" for the Brier score. For regression tasks, the MSE (mean squared error) is chosen automatically.

tree_control

Parameters controlling the fitting of decision trees. This should be configured via the function tree.control.

gamma

Complexity penalty added to the score. If gamma > 0 is given, gamma * ||m||_0 is added to the score, where ||m||_0 is the total number of variables contained in the current model m. The main purpose of this penalty is the fitting of logicDT stumps in conjunction with boosting. For regular or bagged logicDT models, the model complexity parameters max_vars and max_conj should be tuned instead.
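
For illustration, a minimal sketch of fitting a penalized logicDT stump (the data and parameter values here are purely illustrative, not a recommendation):

set.seed(1)
X.toy <- matrix(rbinom(100 * 5, 1, 0.5), ncol = 5)
y.toy <- rbinom(100, 1, 0.2 + 0.6 * (X.toy[, 1] == 1 & X.toy[, 2] == 0))
# With gamma = 0.01, a model containing two variables in total is
# penalized by 0.01 * 2 = 0.02 on top of its score
stump <- logicDT(X.toy, y.toy, max_vars = 2, max_conj = 1, gamma = 0.01)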

simplify

Should the final fitted model be simplified? simplify = "conj" removes unnecessary terms as a whole if they cannot improve the score. simplify = "vars" additionally tries to prune individual conjunctions by removing unnecessary variables from them. simplify = "none" leaves the final model unmodified.

val_method

Inner validation method. "rv" performs a repeated validation in which the original data set is split val_reps times into val_frac * 100% validation data and (1 - val_frac) * 100% training data. "bootstrap" draws bootstrap samples and uses the out-of-bag data as validation data. "cv" employs cross-validation with val_reps folds.

val_frac

Only used if val_method = "rv". See description of val_method.

val_reps

Number of inner validation partitionings.
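
A hedged sketch of the three inner validation settings, using binary predictors X and a response y as in the Examples section (all values are illustrative):

# 10 random splits into 50% validation and 50% training data
m1 <- logicDT(X, y, val_method = "rv", val_frac = 0.5, val_reps = 10)
# Out-of-bag validation on 10 bootstrap samples
m2 <- logicDT(X, y, val_method = "bootstrap", val_reps = 10)
# 10-fold cross-validation
m3 <- logicDT(X, y, val_method = "cv", val_reps = 10)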

allow_conj_removal

Should the removal of complete terms/conjunctions be allowed during the search? If a model with exactly the specified number of terms is desired, this should be set to FALSE. If extensive hyperparameter optimizations are feasible, allow_conj_removal = FALSE together with a proper search over max_vars and max_conj is advised for fitting single models. For bagging or boosting with a greedy search, allow_conj_removal = TRUE together with a small value for max_vars = max_conj, e.g., 2 or 3, is recommended.

conjsize

The minimum number of training samples that have to belong to a conjunction. This parameter prevents the inclusion of unnecessarily complex conjunctions that only rarely occur.

randomize_greedy

Should the greedy search be randomized by considering only sqrt(number of neighbour states) neighbours in each iteration, similar to random forests? This speeds up the greedy search but can lead to inferior results.

greedy_mod

Should modifications of conjunctions be considered in a greedy search? greedy_mod = FALSE speeds up the greedy search but can lead to inferior results.

greedy_rem

Should the removal of conjunctions be considered in a greedy search? greedy_rem = FALSE speeds up the greedy search but can lead to inferior results.

max_gen

Maximum number of generations for genetic programming.

gp_sigma

Parameter sigma for fitness sharing in genetic programming. Very small values (e.g., 0.001) are recommended, leading to penalizing only models that yield exactly the same score.

gp_fs_interval

Interval for fitness sharing in genetic programming. The fitness calculation can be computationally expensive if many models exist in one generation. gp_fs_interval = 10 leads to performing fitness sharing only every 10th generation.
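
An illustrative genetic programming call combining these parameters (the values are chosen for demonstration only, not tuned):

# Fitness sharing with a very small sigma, evaluated every 10th generation
fit.gp <- logicDT(X, y, search_algo = "gp", max_gen = 5000,
                  gp_sigma = 0.001, gp_fs_interval = 10)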

...

Arguments passed to logicDT.default.

formula

An object of type formula describing the model to be fitted.

data

A data frame containing the data for the corresponding formula object. It must also contain the quantitative covariables if these are to be included.
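
A minimal sketch of the formula interface, assuming the binary predictors and the response are gathered in a single data frame:

df <- data.frame(X, y = y)
fit <- logicDT(y ~ ., data = df)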

Details

logicDT is a method for finding response-associated interactions between binary predictors. A global search for the best set of predictors and interactions between predictors is performed, trying to find the globally optimal decision trees. On the one hand, this can be seen as a variable selection. On the other hand, Boolean conjunctions between binary predictors can be identified as impactful, which is particularly useful if the corresponding marginal effects are negligible and would thus be missed by the greedy split selection of conventional decision trees.

Three search algorithms are implemented: simulated annealing ("sa"), a greedy search ("greedy"), and genetic programming ("gp"); see the argument search_algo.

Furthermore, the option of a so-called "inner validation" is available. Here, the search is guided using several train-validation splits and the average of the validation performance. This approach is computationally expensive but can lead to more robust single models.

To minimize the computation time, two-dimensional hash tables are used to store evaluated models. This is irrelevant for the greedy search but can substantially reduce the fitting times when employing simulated annealing or genetic programming, especially in combination with an inner validation.
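
After fitting, the effect of this memoization can be inspected through the returned components total_iter and prevented_evals (see Value below):

model$total_iter       # total number of tested configurations
model$prevented_evals  # tree fittings avoided via the hash table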

Value

An object of class logicDT. This is a list containing

disj

A matrix of the identified set of predictors and conjunctions of predictors. Each row corresponds to one term. Each entry corresponds to the column index in X. Negative values indicate negations. Missing values mean that the term does not contain any more variables.
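
For instance, for the set [X_1 \land X_2^c, X_1 \land X_3] from above, disj could look as follows (purely illustrative):

#      [,1] [,2]
# [1,]    1   -2    # X1 AND NOT X2 (negative value = negation)
# [2,]    1    3    # X1 AND X3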

real_disj

Human-readable form of disj in which the variable names are depicted directly.

score

Score of the best model. Smaller values are preferred.

pet

Decision tree fitted on the best set of input terms. This is a list containing the pointer to the C representation of the tree and R representations of the tree structure such as the splits and predictions.

ensemble

List of decision trees. Only relevant if inner validation was used.

total_iter

The total number of search iterations, i.e., tested configurations by fitting a tree (ensemble) and evaluating it.

prevented_evals

The number of tree fittings prevented by using the two-dimensional hash table.

...

Supplied parameters of the function call to logicDT.

Saving and Loading

logicDT models can be saved and loaded using save(...) and load(...). The internal C structures are not saved but are rebuilt from the R representations if necessary.
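
A short sketch (the file name is purely illustrative):

save(model, file = "logicDT_fit.RData")  # internal C structures are dropped
load("logicDT_fit.RData")
preds <- predict(model, X)  # rebuilds the C structures if necessary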

Examples

# Generate toy data
set.seed(123)
maf <- 0.25
n.snps <- 50
N <- 2000
X <- matrix(sample(0:2, n.snps * N, replace = TRUE,
                   prob = c((1-maf)^2, 1-(1-maf)^2-maf^2, maf^2)),
            ncol = n.snps)
colnames(X) <- paste("SNP", 1:n.snps, sep="")
# Split each 0-1-2 coded SNP into two binary variables (e.g., SNP1D, SNP1R)
X <- splitSNPs(X)
Z <- matrix(rnorm(N, 20, 10), ncol = 1)
colnames(Z) <- "E"
Z[Z < 0] <- 0
y <- -0.75 + log(2) * (X[,"SNP1D"] != 0) +
  log(4) * Z/20 * (X[,"SNP2D"] != 0 & X[,"SNP3D"] == 0) +
  rnorm(N, 0, 1)


# Fit and evaluate single logicDT model
model <- logicDT(X[1:(N/2),], y[1:(N/2)],
                 Z = Z[1:(N/2),,drop=FALSE],
                 max_vars = 3, max_conj = 2,
                 search_algo = "sa",
                 tree_control = tree.control(
                   nodesize = floor(0.05 * nrow(X)/2)
                 ),
                 simplify = "vars",
                 allow_conj_removal = FALSE,
                 conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
plot(model)
print(model)

# Fit and evaluate bagged logicDT model
model.bagged <- logicDT.bagging(X[1:(N/2),], y[1:(N/2)],
                                Z = Z[1:(N/2),,drop=FALSE],
                                bagging.iter = 50,
                                max_vars = 3, max_conj = 3,
                                search_algo = "greedy",
                                tree_control = tree.control(
                                  nodesize = floor(0.05 * nrow(X)/2)
                                ),
                                simplify = "vars",
                                conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model.bagged, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
print(model.bagged)

# Fit and evaluate boosted logicDT model
model.boosted <- logicDT.boosting(X[1:(N/2),], y[1:(N/2)],
                                  Z = Z[1:(N/2),,drop=FALSE],
                                  boosting.iter = 50,
                                  learning.rate = 0.01,
                                  subsample.frac = 0.75,
                                  replace = FALSE,
                                  max_vars = 3, max_conj = 3,
                                  search_algo = "greedy",
                                  tree_control = tree.control(
                                    nodesize = floor(0.05 * nrow(X)/2)
                                  ),
                                  simplify = "vars",
                                  conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model.boosted, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
print(model.boosted)

# Calculate VIMs (variable importance measures)
vims <- vim(model.bagged)
plot(vims)
print(vims)

# Single greedy model
model <- logicDT(X[1:(N/2),], y[1:(N/2)],
                 Z = Z[1:(N/2),,drop=FALSE],
                 max_vars = 3, max_conj = 2,
                 search_algo = "greedy",
                 tree_control = tree.control(
                   nodesize = floor(0.05 * nrow(X)/2)
                 ),
                 simplify = "vars",
                 allow_conj_removal = FALSE,
                 conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
plot(model)
print(model)
