logicDT {logicDT}    R Documentation
Fitting logic decision trees
Description
Main function for fitting logicDT models.
Usage
## Default S3 method:
logicDT(
  X,
  y,
  max_vars = 3,
  max_conj = 3,
  Z = NULL,
  search_algo = "sa",
  cooling_schedule = cooling.schedule(),
  scoring_rule = "auc",
  tree_control = tree.control(),
  gamma = 0,
  simplify = "vars",
  val_method = "none",
  val_frac = 0.5,
  val_reps = 10,
  allow_conj_removal = TRUE,
  conjsize = 1,
  randomize_greedy = FALSE,
  greedy_mod = TRUE,
  greedy_rem = FALSE,
  max_gen = 10000,
  gp_sigma = 0.15,
  gp_fs_interval = 1,
  ...
)

## S3 method for class 'formula'
logicDT(formula, data, ...)
Arguments
X: Matrix or data frame of binary predictors coded as 0 or 1.
y: Response vector. 0-1 coding for binary responses; otherwise, a regression task is assumed.
max_vars: Maximum number of predictors in the set of predictors. For the set [X1, X2 AND X3], this parameter is equal to 3.
max_conj: Maximum number of conjunctions/input variables for the decision trees. For the set [X1, X2 AND X3], this parameter is equal to 2.
Z: Optional matrix or data frame of quantitative/continuous covariables. Multiple covariables are allowed for splitting the trees. If leaf regression models (such as four-parameter logistic models) shall be fitted, only the first given covariable is used.
search_algo: Search algorithm for guiding the global search. This can either be "sa" for simulated annealing, "greedy" for a greedy search, or "gp" for genetic programming.
cooling_schedule: Cooling schedule parameters if simulated annealing is used. The required object should be created via the function cooling.schedule.
scoring_rule: Scoring rule for guiding the global search. The default "auc" corresponds to the area under the receiver operating characteristic curve and applies to binary responses; for regression tasks, the mean squared error is used.
tree_control: Parameters controlling the fitting of decision trees. This should be configured via the function tree.control.
gamma: Complexity penalty added to the score. If gamma > 0, models containing more variables are penalized accordingly such that sparser models are preferred.
simplify: Should the final fitted model be simplified? This means that unnecessary terms as a whole ("conj") will be removed; "vars" additionally removes unnecessary variables from terms; "none" disables simplification.
val_method: Inner validation method. "rv" for repeated random training-validation splits (see val_frac and val_reps) or "none" (default) for no inner validation.
val_frac: Only used if val_method = "rv". Fraction of the training data used for the validation sets.
val_reps: Number of inner validation partitionings.
allow_conj_removal: Should it be allowed to remove complete terms/conjunctions in the search? If a model with the specified exact number of terms is desired, this should be set to FALSE.
conjsize: The minimum number of training samples that have to belong to a conjunction. This parameter prevents including unnecessarily complex conjunctions that rarely occur.
randomize_greedy: Should the greedy search be randomized by only considering a random subset of the possible modifications in each iteration? This speeds up the search at the cost of possibly missing the best modification.
greedy_mod: Should modifications of conjunctions be considered in a greedy search? If FALSE, only additions and removals of conjunctions are considered, which speeds up the search.
greedy_rem: Should the removal of conjunctions be considered in a greedy search? If FALSE, only additions and modifications of conjunctions are considered, which speeds up the search.
max_gen: Maximum number of generations for genetic programming.
gp_sigma: Parameter sigma for fitness sharing in genetic programming.
gp_fs_interval: Interval for fitness sharing in genetic programming. The fitness calculation can be computationally expensive if many models exist in one generation. If gp_fs_interval > 1, fitness sharing is only performed every gp_fs_interval generations.
...: Arguments passed to logicDT.default.
formula: An object of type formula describing the model to be fitted.
data: A data frame containing the data for the corresponding formula.
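A brief sketch of the formula interface described above; the toy data frame and variable names are purely illustrative:

d <- data.frame(X1 = rbinom(100, 1, 0.5),  # binary predictors coded as 0/1
                X2 = rbinom(100, 1, 0.5),
                y = rbinom(100, 1, 0.5))   # binary response
m <- logicDT(y ~ ., data = d)              # fits via the formula interface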
Details
logicDT is a method for finding response-associated interactions between binary predictors. A global search for the best set of predictors and interactions between predictors is performed, trying to find the globally optimal decision trees. On the one hand, this can be seen as a variable selection. On the other hand, Boolean conjunctions between binary predictors can be identified as impactful, which is particularly useful if the corresponding marginal effects are negligible due to the greedy fashion in which decision trees choose their splits.
Three search algorithms are implemented (a short sketch follows this list):
Simulated annealing: an exhaustive stochastic optimization procedure. Recommended for single models (without [outer] bagging or boosting).
Greedy search: a very fast search that always takes the best possible improvement. Recommended for ensemble models.
Genetic programming: a more or less intensive search maintaining several competitive models in each generation. A niche method which is only recommended if multiple (simple) models explain the variation in the response.
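As a minimal sketch, switching between these strategies only requires changing search_algo (reusing the toy data X and y from the Examples section below; the value "gp" for genetic programming is inferred from the gp_* arguments):

m_sa <- logicDT(X, y, search_algo = "sa",
                cooling_schedule = cooling.schedule())  # simulated annealing
m_greedy <- logicDT(X, y, search_algo = "greedy")       # greedy search
m_gp <- logicDT(X, y, search_algo = "gp",               # genetic programming
                max_gen = 5000, gp_sigma = 0.15)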
Furthermore, the option of a so-called "inner validation" is available. Here, the search is guided using several training-validation splits and the average of the validation performance. This approach is computationally expensive but can lead to more robust single models.
For minimizing the computation time, two-dimensional hash tables are used that store already evaluated models. This is irrelevant for the greedy search but can substantially reduce fitting times when employing simulated annealing or genetic programming, especially when an inner validation is chosen.
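A sketch of such an inner validation, assuming that val_method = "rv" selects repeated random training-validation splits (the values shown correspond to the defaults in Usage):

model.val <- logicDT(X, y, search_algo = "sa",
                     val_method = "rv",  # assumed: repeated random splits
                     val_frac = 0.5,     # fraction used for validation
                     val_reps = 10)      # number of partitionings averaged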
Value
An object of class logicDT. This is a list containing the following components:
disj: A matrix of the identified set of predictors and conjunctions of predictors. Each row corresponds to one term. Each entry corresponds to the column index in X; negative values indicate negations, and missing values mean that the term does not contain further variables.
real_disj: Human-readable form of disj.
score: Score of the best model. Smaller values are preferred.
pet: Decision tree fitted on the best set of input terms. This is a list containing the pointer to the internal C representation of the tree.
ensemble: List of decision trees. Only relevant if inner validation was used.
total_iter: The total number of search iterations, i.e., tested configurations by fitting a tree (ensemble) and evaluating it.
prevented_evals: The number of tree fittings that were avoided by using the two-dimensional hash table.
...: Supplied parameters of the functional call to logicDT.
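For instance, the identified terms and search statistics of a fitted model (such as model from the Examples section) can be inspected directly:

model$real_disj        # human-readable form of the identified terms
model$score            # score of the best model (smaller is better)
model$total_iter       # number of tested configurations
model$prevented_evals  # tree fittings avoided via the hash table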
Saving and Loading
logicDT models can be saved and loaded using save(...) and load(...). The internal C structures will not be saved but are rebuilt from the R representations if necessary.
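A minimal sketch, assuming a fitted model object named model:

save(model, file = "logicDT_model.RData")  # stores the R representation
load("logicDT_model.RData")                # restores the object 'model'
print(model)  # internal C structures are rebuilt on demand if necessary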
References
Lau, M., Schikowski, T. & Schwender, H. (2024). logicDT: A procedure for identifying response-associated interactions between binary predictors. Machine Learning 113(2):933–992. doi: 10.1007/s10994-023-06488-6
Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press. doi: 10.1201/9781315139470
Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. (1983). Optimization by Simulated Annealing. Science 220(4598):671–680. doi: 10.1126/science.220.4598.671
Examples
# Generate toy data
set.seed(123)
maf <- 0.25
n.snps <- 50
N <- 2000
X <- matrix(sample(0:2, n.snps * N, replace = TRUE,
                   prob = c((1-maf)^2, 1-(1-maf)^2-maf^2, maf^2)),
            ncol = n.snps)
colnames(X) <- paste("SNP", 1:n.snps, sep="")
X <- splitSNPs(X)

Z <- matrix(rnorm(N, 20, 10), ncol = 1)
colnames(Z) <- "E"
Z[Z < 0] <- 0

y <- -0.75 + log(2) * (X[,"SNP1D"] != 0) +
  log(4) * Z/20 * (X[,"SNP2D"] != 0 & X[,"SNP3D"] == 0) +
  rnorm(N, 0, 1)
# Fit and evaluate single logicDT model
model <- logicDT(X[1:(N/2),], y[1:(N/2)],
                 Z = Z[1:(N/2),,drop=FALSE],
                 max_vars = 3, max_conj = 2,
                 search_algo = "sa",
                 tree_control = tree.control(
                   nodesize = floor(0.05 * nrow(X)/2)
                 ),
                 simplify = "vars",
                 allow_conj_removal = FALSE,
                 conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
plot(model)
print(model)
# Fit and evaluate bagged logicDT model
model.bagged <- logicDT.bagging(X[1:(N/2),], y[1:(N/2)],
                                Z = Z[1:(N/2),,drop=FALSE],
                                bagging.iter = 50,
                                max_vars = 3, max_conj = 3,
                                search_algo = "greedy",
                                tree_control = tree.control(
                                  nodesize = floor(0.05 * nrow(X)/2)
                                ),
                                simplify = "vars",
                                conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model.bagged, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
print(model.bagged)
# Fit and evaluate boosted logicDT model
model.boosted <- logicDT.boosting(X[1:(N/2),], y[1:(N/2)],
                                  Z = Z[1:(N/2),,drop=FALSE],
                                  boosting.iter = 50,
                                  learning.rate = 0.01,
                                  subsample.frac = 0.75,
                                  replace = FALSE,
                                  max_vars = 3, max_conj = 3,
                                  search_algo = "greedy",
                                  tree_control = tree.control(
                                    nodesize = floor(0.05 * nrow(X)/2)
                                  ),
                                  simplify = "vars",
                                  conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model.boosted, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
print(model.boosted)
# Calculate VIMs (variable importance measures)
vims <- vim(model.bagged)
plot(vims)
print(vims)
# Single greedy model
model <- logicDT(X[1:(N/2),], y[1:(N/2)],
                 Z = Z[1:(N/2),,drop=FALSE],
                 max_vars = 3, max_conj = 2,
                 search_algo = "greedy",
                 tree_control = tree.control(
                   nodesize = floor(0.05 * nrow(X)/2)
                 ),
                 simplify = "vars",
                 allow_conj_removal = FALSE,
                 conjsize = floor(0.05 * nrow(X)/2))
calcNRMSE(predict(model, X[(N/2+1):N,],
                  Z = Z[(N/2+1):N,,drop=FALSE]), y[(N/2+1):N])
plot(model)
print(model)