RCAR {arulesCBA}                                                R Documentation

Regularized Class Association Rules for Multi-class Problems (RCAR+)

Description

Build a classifier based on association rules mined for an input dataset and weighted with LASSO regularized logistic regression following RCAR (Azmi et al., 2019). RCAR+ extends RCAR from a binary classifier to a multi-class classifier and can use support-balanced CARs.

Usage

RCAR(
  formula,
  data,
  lambda = NULL,
  alpha = 1,
  glmnet.args = NULL,
  cv.glmnet.args = NULL,
  parameter = NULL,
  control = NULL,
  balanceSupport = FALSE,
  disc.method = "mdlp",
  verbose = FALSE,
  ...
)

Arguments

formula

A symbolic description of the model to be fitted. Has to be of the form class ~ . or class ~ predictor1 + predictor2.

data

A data.frame or arules::transactions containing the training data. Data frames are automatically discretized and converted to transactions with prepareTransactions().

lambda

The amount of weight given to regularization during the logistic regression learning process. If not specified (NULL) then cross-validation is used to determine the best value (see Details section).

alpha

The elastic net mixing parameter. alpha = 1 is the lasso penalty (the RCAR default), and alpha = 0 is the ridge penalty.

cv.glmnet.args, glmnet.args

A list of arguments passed on to glmnet::cv.glmnet() and glmnet::glmnet(), respectively. See the sketch following this argument list.

parameter, control

Optional parameter and control lists for arules::apriori().

balanceSupport

balanceSupport parameter passed to mineCARs().

disc.method

Discretization method used to discretize numeric input (default: "mdlp"). See discretizeDF.supervised() for more supervised discretization methods.

verbose

Report progress?

...

For convenience, additional parameters can be given here and are used to create the parameter list for arules::apriori() (e.g., to specify the support and confidence thresholds).
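
The following sketch illustrates how mining thresholds and the glmnet argument lists can be passed. The thresholds, nfolds, type.measure and maxit values below are purely illustrative choices (nfolds, type.measure and maxit are standard glmnet::cv.glmnet() and glmnet::glmnet() arguments, not RCAR() defaults):

library(arulesCBA)
data("iris")

# mining thresholds passed via ... end up in the apriori() parameter list
classifier <- RCAR(Species ~ ., iris, support = 0.05, confidence = 0.9)

# arguments for the regularized regression step are passed on as lists
classifier <- RCAR(Species ~ ., iris,
  cv.glmnet.args = list(nfolds = 5, type.measure = "class"),
  glmnet.args = list(maxit = 10^6))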

Details

RCAR+ extends RCAR from a binary classifier to a multi-class classifier using regularized multinomial logistic regression via glmnet.

In arulesCBA, the class variable is always represented by a set of items. For a binary classification problem, we use an item and its complement (typically called <item label>=TRUE and <item label>=FALSE). For a multi-class classification problem, we use one item for each possible class label (format <class item>=<label>). See prepareTransactions() for details.
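
A small sketch of this representation for the iris data (illustrative only):

library(arulesCBA)
data("iris")

trans <- prepareTransactions(Species ~ ., iris)

# the class variable Species is represented by the class items
# Species=setosa, Species=versicolor and Species=virginica
itemLabels(trans)
inspect(trans[1:3])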

RCAR+ first mines CARs to find itemsets (the LHSs of the CARs) that are related to the class items. Then, a transaction x LHS(CAR) coverage matrix X is created. The matrix contains a 1 if the LHS of the CAR applies to the transaction, and a 0 otherwise. A regularized multinomial logistic model is then fitted to predict the true class y for each transaction given X. Note that the RHSs of the CARs are ignored in this process, so the algorithm effectively uses rules consisting of each LHS of a CAR paired with each class label. This is important to keep in mind when interpreting the rules used in the classifier.
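
The following sketch illustrates (but does not exactly reproduce) how such a coverage matrix can be computed with arules::is.subset(); the support and confidence values are only examples:

library(arulesCBA)
data("iris")

trans <- prepareTransactions(Species ~ ., iris)
cars <- mineCARs(Species ~ ., trans,
  parameter = list(support = 0.1, confidence = 0.8))

# rows = transactions, columns = rule LHSs; an entry is TRUE if the LHS
# of the CAR is contained in (covers) the transaction
X <- t(is.subset(lhs(cars), trans, sparse = FALSE))
dim(X)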

If lambda for regularization is not specified during training (lambda = NULL) then cross-validation is used to determine the largest value of lambda such that the error is within 1 standard error of the minimum (see glmnet::cv.glmnet() for how to perform cross-validation in parallel).
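
Cross-validation can be skipped by supplying a fixed value (the value below is purely illustrative):

data("iris")
classifier <- RCAR(Species ~ ., iris, lambda = 0.01)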

For the final classifier, we keep only the rules that have a weight greater than 0 for at least one class label. The rule weights are the beta coefficients of the fitted model.

Prediction for a new transaction is performed in two steps:

  1. Translate the transaction into a 0-1 coverage vector indicating which class association rules' LHSs cover the transaction.

  2. Calculate the predicted label given the multinomial logistic regression model.
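
The following sketch mimics these two steps manually on the training data (for illustration only; the actual implementation may differ, and predict() also takes care of discretizing new raw data):

library(arulesCBA)
data("iris")

classifier <- RCAR(Species ~ ., iris)
trans <- prepareTransactions(Species ~ ., iris)

# step 1: 0-1 coverage matrix for the rules underlying the glmnet model
X <- 1 * t(is.subset(lhs(classifier$model$all_rules), trans, sparse = FALSE))

# step 2: predicted class labels from the multinomial model, using the
# lambda selected by cross-validation (lambda.1se)
pred <- predict(classifier$model$reg_model, newx = X,
  s = classifier$model$cv$lambda.1se, type = "class")
head(pred)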

Value

Returns an object of class CBA representing the trained classifier with the additional field model containing a list with the following elements:

reg_model

the multinomial logistic regression model as an object of class glmnet::glmnet.

cv

only available if lambda = NULL was specified. Contains the results of the cross-validation used to determine lambda. By default, lambda.1se is used to determine lambda.

all_rules

the actual classifier contains only the rules with non-zero weights. This field contains all rules used to build the classifier, including the rules with a weight of zero, and is consistent with the model in reg_model.

Author(s)

Tyler Giallanza and Michael Hahsler

References

M. Azmi, G.C. Runger, and A. Berrado (2019). Interpretable regularized class association rules algorithm for classification in a categorical data space. Information Sciences, Volume 483, May 2019. Pages 313-331.

See Also

Other classifiers: CBA(), CBA_helpers, CBA_ruleset(), FOIL(), LUCS_KDD_CBA, RWeka_CBA

Examples

data("iris")

classifier <- RCAR(Species ~ ., iris)
classifier

# inspect the rule base sorted by the largest class weight
inspect(sort(classifier$rules, by = "weight"))

# make predictions for the first few instances of iris
predict(classifier, head(iris))
table(pred = predict(classifier, iris), true = iris$Species)

# plot the cross-validation curve as a function of lambda and add a
# red line at lambda.1se used to determine lambda.
plot(classifier$model$cv)
abline(v = log(classifier$model$cv$lambda.1se), col = "red")

# plot the coefficient profile plot (regularization path) for each class
# label. Note the line for the chosen lambda is only added to the last plot.
# You can manually add it to the others.
plot(classifier$model$reg_model, xvar = "lambda", label = TRUE)
abline(v = log(classifier$model$cv$lambda.1se), col = "red")

# inspect rule 11, which has a large weight for class virginica
inspect(classifier$model$all_rules[11])

[Package arulesCBA version 1.2.6]