R: Importance of features in a model.

xgb.importance {xgboost}

R Documentation

Importance of features in a model.

Description

Creates a data.table of feature importances in a model.

Usage

xgb.importance(
  feature_names = NULL,
  model = NULL,
  trees = NULL,
  data = NULL,
  label = NULL,
  target = NULL
)

Arguments

`feature_names`	character vector of feature names. If the model already contains feature names, those would be used when `feature_names=NULL` (default value). Non-null `feature_names` could be provided to override those in the model.
`model`	object of class `xgb.Booster`.
`trees`	(only for the gbtree booster) an integer vector of tree indices that should be included into the importance calculation. If set to `NULL`, all trees of the model are parsed. It could be useful, e.g., in multiclass classification to get feature importances for each class separately. IMPORTANT: the tree index in xgboost models is zero-based (e.g., use `trees = 0:4` for first 5 trees).
`data`	deprecated.
`label`	deprecated.
`target`	deprecated.

Details

This function works for both linear and tree models.

For linear models, the importance is the absolute magnitude of linear coefficients. For that reason, in order to obtain a meaningful ranking by importance for a linear model, the features need to be on the same scale (which you also would want to do when using either L1 or L2 regularization).

Value

For a tree model, a data.table with the following columns:

Features names of the features used in the model;
Gain represents fractional contribution of each feature to the model based on the total gain of this feature's splits. Higher percentage means a more important predictive feature.
Cover metric of the number of observation related to this feature;
Frequency percentage representing the relative number of times a feature have been used in trees.

A linear model's importance data.table has the following columns:

Features names of the features used in the model;
Weight the linear coefficient of this feature;
Class (only for multiclass models) class label.

If feature_names is not provided and model doesn't have feature_names, index of the features will be used instead. Because the index is extracted from the model dump (based on C++ code), it starts at 0 (as in C/C++ or Python) instead of 1 (usual in R).

Examples


# binomial classification using gbtree:
data(agaricus.train, package='xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
               eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
xgb.importance(model = bst)

# binomial classification using gblinear:
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, booster = "gblinear",
               eta = 0.3, nthread = 1, nrounds = 20, objective = "binary:logistic")
xgb.importance(model = bst)

# multiclass classification using gbtree:
nclass <- 3
nrounds <- 10
mbst <- xgboost(data = as.matrix(iris[, -5]), label = as.numeric(iris$Species) - 1,
               max_depth = 3, eta = 0.2, nthread = 2, nrounds = nrounds,
               objective = "multi:softprob", num_class = nclass)
# all classes clumped together:
xgb.importance(model = mbst)
# inspect importances separately for each class:
xgb.importance(model = mbst, trees = seq(from=0, by=nclass, length.out=nrounds))
xgb.importance(model = mbst, trees = seq(from=1, by=nclass, length.out=nrounds))
xgb.importance(model = mbst, trees = seq(from=2, by=nclass, length.out=nrounds))

# multiclass classification using gblinear:
mbst <- xgboost(data = scale(as.matrix(iris[, -5])), label = as.numeric(iris$Species) - 1,
               booster = "gblinear", eta = 0.2, nthread = 1, nrounds = 15,
               objective = "multi:softprob", num_class = nclass)
xgb.importance(model = mbst)

[Package xgboost version 1.7.8.1 Index]