plot_permutation_variable_importance {familiar} | R Documentation |
Plot permutation variable importance.
Description
This function plots the data on permutation variable importance
stored in a familiarCollection object.
Usage
plot_permutation_variable_importance(
object,
draw = FALSE,
dir_path = NULL,
split_by = NULL,
color_by = NULL,
facet_by = NULL,
facet_wrap_cols = NULL,
ggtheme = NULL,
discrete_palette = NULL,
x_label = waiver(),
y_label = "feature",
legend_label = waiver(),
plot_title = waiver(),
plot_sub_title = waiver(),
caption = NULL,
x_range = NULL,
x_n_breaks = 5,
x_breaks = NULL,
conf_int_style = c("point_line", "line", "bar_line", "none"),
conf_int_alpha = 0.4,
width = waiver(),
height = waiver(),
units = waiver(),
export_collection = FALSE,
...
)
## S4 method for signature 'ANY'
plot_permutation_variable_importance(
object,
draw = FALSE,
dir_path = NULL,
split_by = NULL,
color_by = NULL,
facet_by = NULL,
facet_wrap_cols = NULL,
ggtheme = NULL,
discrete_palette = NULL,
x_label = waiver(),
y_label = "feature",
legend_label = waiver(),
plot_title = waiver(),
plot_sub_title = waiver(),
caption = NULL,
x_range = NULL,
x_n_breaks = 5,
x_breaks = NULL,
conf_int_style = c("point_line", "line", "bar_line", "none"),
conf_int_alpha = 0.4,
width = waiver(),
height = waiver(),
units = waiver(),
export_collection = FALSE,
...
)
## S4 method for signature 'familiarCollection'
plot_permutation_variable_importance(
object,
draw = FALSE,
dir_path = NULL,
split_by = NULL,
color_by = NULL,
facet_by = NULL,
facet_wrap_cols = NULL,
ggtheme = NULL,
discrete_palette = NULL,
x_label = waiver(),
y_label = "feature",
legend_label = waiver(),
plot_title = waiver(),
plot_sub_title = waiver(),
caption = NULL,
x_range = NULL,
x_n_breaks = 5,
x_breaks = NULL,
conf_int_style = c("point_line", "line", "bar_line", "none"),
conf_int_alpha = 0.4,
width = waiver(),
height = waiver(),
units = waiver(),
export_collection = FALSE,
...
)
Arguments
object |
familiarCollection object, or one or more familiarData
objects, that will be internally converted to a familiarCollection object.
It is also possible to provide a familiarEnsemble or one or more
familiarModel objects together with the data from which the plotted data are
computed prior to export. Paths to such files can also be provided.
|
draw |
(optional) Draws the plot if TRUE.
|
dir_path |
(optional) Path to the directory where created figures are
saved. Output is saved in the variable_importance subdirectory. If NULL,
figures are not saved but are returned instead.
|
split_by |
(optional) Splitting variables. This refers to column names
on which datasets are split. A separate figure is created for each split.
See details for available variables.
|
color_by |
(optional) Variables used to determine fill colour of plot
objects. The variables cannot overlap with those provided to the split_by
argument, but may overlap with other arguments. See details for available
variables.
|
facet_by |
(optional) Variables used to determine how and if facets of
each figure appear. In case the facet_wrap_cols argument is NULL, the
first variable is used to define columns, and the remaining variables are
used to define rows of facets. The variables cannot overlap with those
provided to the split_by argument, but may overlap with other arguments.
See details for available variables.
|
facet_wrap_cols |
(optional) Number of columns to generate when facet
wrapping. If NULL, a facet grid is produced instead.
|
ggtheme |
(optional) ggplot theme to use for plotting.
|
discrete_palette |
(optional) Palette used to fill the bars in case a
non-singular variable was provided to the color_by argument.
|
x_label |
(optional) Label to provide to the x-axis. If NULL, no label
is shown.
|
y_label |
(optional) Label to provide to the y-axis. If NULL, no label
is shown.
|
legend_label |
(optional) Label to provide to the legend. If NULL, the
legend will not have a name.
|
plot_title |
(optional) Label to provide as figure title. If NULL, no
title is shown.
|
plot_sub_title |
(optional) Label to provide as figure subtitle. If
NULL, no subtitle is shown.
|
caption |
(optional) Label to provide as figure caption. If NULL, no
caption is shown.
|
x_range |
(optional) Value range for the x-axis.
|
x_n_breaks |
(optional) Number of breaks to show on the x-axis of the
plot. x_n_breaks is used to determine the x_breaks argument in case it
is unset.
|
x_breaks |
(optional) Break points on the x-axis of the plot.
|
conf_int_style |
(optional) Confidence interval style. See details for
allowed styles.
|
conf_int_alpha |
(optional) Alpha value to determine transparency of
confidence intervals or, alternatively, other plot elements with which the
confidence interval overlaps. Only values between 0.0 (fully transparent)
and 1.0 (fully opaque) are allowed.
|
width |
(optional) Width of the plot. A default value is derived from
the number of facets.
|
height |
(optional) Height of the plot. A default value is derived from
the number of features and the number of facets.
|
units |
(optional) Plot size unit. Either cm (default), mm or in.
|
export_collection |
(optional) Exports the collection if TRUE.
|
... |
Arguments passed on to as_familiar_collection, ggplot2::ggsave, extract_permutation_vimp
familiar_data_names Names of the dataset(s). Only used if the object parameter
is one or more familiarData objects.
collection_name Name of the collection.
filename File name to create on disk.
plot Plot to save, defaults to last plot displayed.
device Device to use. Can either be a device function
(e.g. png), or one of "eps", "ps", "tex" (pictex),
"pdf", "jpeg", "tiff", "png", "bmp", "svg" or "wmf" (windows only). If
NULL (default), the device is guessed based on the filename extension.
path Path of the directory to save plot to: path and filename
are combined to create the fully qualified file name. Defaults to the
working directory.
scale Multiplicative scaling factor.
dpi Plot resolution. Also accepts a string input: "retina" (320),
"print" (300), or "screen" (72). Applies only to raster output types.
limitsize When TRUE (the default), ggsave() will not
save images larger than 50x50 inches, to prevent the common error of
specifying dimensions in pixels.
bg Background colour. If NULL, uses the plot.background fill value
from the plot theme.
create.dir Whether to create new directories if a non-existing
directory is specified in the filename or path (TRUE) or return an
error (FALSE, default). If FALSE and run in an interactive session,
a prompt will appear asking to create a new directory when necessary.
data A dataObject object, data.table or data.frame that
constitutes the data that are assessed.
is_pre_processed Flag that indicates whether the data was already
pre-processed externally, e.g. normalised and clustered. Only used if the
data argument is a data.table or data.frame.
cl Cluster created using the parallel package. This cluster is then
used to speed up computation through parallelisation.
evaluation_times One or more time points that are used in the analysis of
survival problems when data has to be assessed at a set time, e.g.
calibration. If not provided explicitly, this parameter is read from
settings used at creation of the underlying familiarModel objects. Only
used for survival outcomes.
ensemble_method Method for ensembling predictions from models for the
same sample. Available methods are:
metric One or more metrics for assessing model performance. See the
vignette on performance metrics for the available metrics. If not provided
explicitly, this parameter is read from settings used at creation of the
underlying familiarModel objects.
feature_cluster_method The method used to perform clustering. These are
the same methods as for the cluster_method configuration parameter:
none, hclust, agnes, diana and pam.
none cannot be used when extracting data regarding mutual correlation or
feature expressions.
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_linkage_method The method used for agglomerative clustering in
hclust and agnes. These are the same methods as for the
cluster_linkage_method configuration parameter: average, single,
complete, weighted, and ward.
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_cluster_cut_method The method used to divide features into
separate clusters. The available methods are the same as for the
cluster_cut_method configuration parameter: silhouette, fixed_cut and
dynamic_cut.
silhouette is available for all cluster methods, but fixed_cut only
applies to methods that create hierarchical trees (hclust, agnes and
diana). dynamic_cut requires the dynamicTreeCut package and can only
be used with agnes and hclust.
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_threshold The threshold level for pair-wise
similarity that is required to form feature clusters with the fixed_cut
method.
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
feature_similarity_metric Metric to determine pairwise similarity
between features. Similarity is computed in the same manner as for
clustering, and feature_similarity_metric therefore has the same options
as cluster_similarity_metric: mcfadden_r2, cox_snell_r2,
nagelkerke_r2, spearman, kendall and pearson.
If not provided explicitly, this parameter is read from settings used at
creation of the underlying familiarModel objects.
verbose Flag to indicate whether feedback should be provided on the
computation and extraction of various data elements.
message_indent Number of indentation steps for messages shown during
computation and extraction of various data elements.
detail_level (optional) Sets the level at which results are computed
and aggregated.
-
ensemble: Results are computed at the ensemble level, i.e. over all
models in the ensemble. This means that, for example, bias-corrected
estimates of model performance are assessed by creating (at least) 20
bootstraps and computing the model performance of the ensemble model for
each bootstrap.
-
hybrid (default): Results are computed at the level of models in an
ensemble. This means that, for example, bias-corrected estimates of model
performance are directly computed using the models in the ensemble. If there
are at least 20 trained models in the ensemble, performance is computed for
each model, in contrast to ensemble where performance is computed for the
ensemble of models. If there are fewer than 20 trained models in the
ensemble, bootstraps are created so that at least 20 point estimates can be
made.
-
model: Results are computed at the model level. This means that, for
example, bias-corrected estimates of model performance are assessed by
creating (at least) 20 bootstraps and computing the performance of the model
for each bootstrap.
Note that each level of detail has a different interpretation for bootstrap
confidence intervals. For ensemble and model these are the confidence
intervals for the ensemble and an individual model, respectively. That is,
the confidence interval describes the range where an estimate produced by a
respective ensemble or model trained on a repeat of the experiment may be
found with the probability of the confidence level. For hybrid, it
represents the range where any single model trained on a repeat of the
experiment may be found with the probability of the confidence level. By
definition, confidence intervals obtained using hybrid are at least as
wide as those for ensemble. hybrid offers the correct interpretation if
the goal of the analysis is to assess the result of a single, unspecified,
model.
hybrid is generally computationally less expensive than ensemble, which
in turn is somewhat less expensive than model.
A non-default detail_level parameter can be specified for separate
evaluation steps by providing a parameter value in a named list with data
elements, e.g. list("auc_data"="ensemble", "model_performance"="hybrid").
This parameter can be set for the following data elements: auc_data,
decision_curve_analyis, model_performance, permutation_vimp,
ice_data, prediction_data and confusion_matrix.
estimation_type (optional) Sets the type of estimation that should be
possible. This has the following options:
-
point: Point estimates.
-
bias_correction or bc: Bias-corrected estimates. A bias-corrected
estimate is computed from (at least) 20 point estimates, and familiar may
bootstrap the data to create them.
-
bootstrap_confidence_interval or bci (default): Bias-corrected
estimates with bootstrap confidence intervals (Efron and Hastie, 2016). The
number of point estimates required depends on the confidence_level
parameter, and familiar may bootstrap the data to create them.
As with detail_level, a non-default estimation_type parameter can be
specified for separate evaluation steps by providing a parameter value in a
named list with data elements, e.g.
list("auc_data"="bci", "model_performance"="point"). This parameter can be
set for the following data elements: auc_data, decision_curve_analyis,
model_performance, permutation_vimp, ice_data, and prediction_data.
aggregate_results (optional) Flag that signifies whether results
should be aggregated during evaluation. If estimation_type is
bias_correction or bc, aggregation leads to a single bias-corrected
estimate. If estimation_type is bootstrap_confidence_interval or bci,
aggregation leads to a single bias-corrected estimate with lower and upper
boundaries of the confidence interval. This has no effect if
estimation_type is point.
The default value is TRUE, except when assessing model performance
metrics, as the default violin plot requires underlying data.
As with detail_level and estimation_type, a non-default
aggregate_results parameter can be specified for separate evaluation steps
by providing a parameter value in a named list with data elements, e.g.
list("auc_data"=TRUE, "model_performance"=FALSE). This parameter exists
for the same elements as estimation_type.
confidence_level (optional) Numeric value for the level at which
confidence intervals are determined. If bootstraps are used to
determine the confidence intervals, familiar uses the
rule of thumb n = 20 / ci.level to determine the number of required
bootstraps.
The default value is 0.95.
bootstrap_ci_method (optional) Method used to determine bootstrap
confidence intervals (Efron and Hastie, 2016). The following methods are
implemented:
Note that the standard method is not implemented because this method is
often not suitable due to non-normal distributions. The bias-corrected and
accelerated (BCa) method is not implemented yet.
|
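The named-list form described for detail_level, estimation_type and aggregate_results can be sketched as follows. This is a minimal illustration only: the choice of permutation_vimp as the relevant element for this plot is an assumption based on the name of extract_permutation_vimp, and the check at the end is a hypothetical convenience, not something familiar requires.

```r
# Per-element settings as named lists; element names are taken from the
# argument descriptions above.
detail_level <- list("permutation_vimp" = "ensemble", "model_performance" = "hybrid")
estimation_type <- list("permutation_vimp" = "bci", "model_performance" = "point")
aggregate_results <- list("permutation_vimp" = TRUE, "model_performance" = FALSE)

# Data elements for which estimation_type and aggregate_results can be set
# (spelling copied verbatim from the documentation).
allowed_elements <- c(
  "auc_data", "decision_curve_analyis", "model_performance",
  "permutation_vimp", "ice_data", "prediction_data"
)

# Simple sanity check before passing the lists on through `...`.
for (setting in list(detail_level, estimation_type, aggregate_results)) {
  stopifnot(all(names(setting) %in% allowed_elements))
}
```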
Details
This function generates a horizontal barplot that lists features by
the estimated model improvement over that of a dataset where the respective
feature is randomly permuted.
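The idea of permutation variable importance can be illustrated with a minimal base-R sketch. The toy data, the linear model and the helper functions r_squared and permutation_importance are all hypothetical and only serve to show the principle; this is not how familiar computes or stores its results.

```r
# Minimal base-R sketch of permutation variable importance (illustrative only).
set.seed(1)

# Hypothetical toy data: y depends strongly on x1, weakly on x2, not on x3.
n <- 200
toy_data <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy_data$y <- 2 * toy_data$x1 + 0.5 * toy_data$x2 + rnorm(n, sd = 0.5)

model <- lm(y ~ x1 + x2 + x3, data = toy_data)

# Performance metric where higher is better: R-squared on a dataset.
r_squared <- function(model, data) {
  predicted <- predict(model, newdata = data)
  1 - sum((data$y - predicted)^2) / sum((data$y - mean(data$y))^2)
}

# Importance of a feature: performance on the intact data minus the mean
# performance after randomly shuffling that feature.
permutation_importance <- function(model, data, feature, n_repeats = 20) {
  baseline <- r_squared(model, data)
  permuted_scores <- vapply(seq_len(n_repeats), function(ii) {
    permuted_data <- data
    permuted_data[[feature]] <- sample(permuted_data[[feature]])
    r_squared(model, permuted_data)
  }, numeric(1))
  baseline - mean(permuted_scores)
}

features <- c("x1", "x2", "x3")
importance <- vapply(
  features,
  function(feature) permutation_importance(model, toy_data, feature),
  numeric(1)
)

# Features sorted from most to least important.
print(sort(importance, decreasing = TRUE))
```

A larger drop in performance after permutation means that the model relies more heavily on that feature.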
The following splitting variables are available for split_by, color_by
and facet_by:
-
fs_method: feature selection methods.
-
learner: learners.
-
data_set: data sets.
-
metric: the model performance metrics.
-
evaluation_time: the evaluation times (survival outcomes only).
-
similarity_threshold: the similarity threshold used to identify groups
of features to permute simultaneously.
By default, the data is split by fs_method, learner and metric,
faceted by data_set and evaluation_time, and coloured by
similarity_threshold.
Available palettes for discrete_palette are those listed by
grDevices::palette.pals() (requires R >= 4.0.0), grDevices::hcl.pals()
(requires R >= 3.6.0) and rainbow, heat.colors, terrain.colors,
topo.colors and cm.colors, which correspond to the palettes of the same
name in grDevices. If not specified, a default palette based on palettes
in Tableau is used. You may also specify your own palette by using colour
names listed by grDevices::colors() or through hexadecimal RGB strings.
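The palette options can be explored in base R before plotting. The sketch below lists available palette names and builds a custom palette; passing the custom palette as a character vector is an assumption, and the validity check is a hypothetical convenience rather than part of familiar.

```r
# Palette names that may be used (palette.pals() requires R >= 4.0.0,
# hcl.pals() requires R >= 3.6.0).
print(head(grDevices::palette.pals()))
print(head(grDevices::hcl.pals()))

# A custom palette mixing a named colour and hexadecimal RGB strings.
custom_palette <- c("steelblue", "#E69F00", "#009E73")

# Check that every named colour is known to R.
is_hex <- grepl("^#[0-9A-Fa-f]{6}$", custom_palette)
stopifnot(all(custom_palette[!is_hex] %in% grDevices::colors()))

# col2rgb() raises an error for any string that is not a valid colour.
rgb_values <- grDevices::col2rgb(custom_palette)
```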
Labelling methods such as set_fs_method_names or set_feature_names can
be applied to the familiarCollection object to update labels, and order
the output in the figure.
Bootstrap confidence intervals (if present) can be shown using various
styles set by conf_int_style:
-
point_line (default): confidence intervals are shown as lines, on which
the point estimate is likewise shown.
-
line: confidence intervals are shown as lines, but the point
estimate is not shown.
-
bar_line: confidence intervals are shown as lines, with the point
estimate shown as a bar plot with the opacity of conf_int_alpha.
-
none: confidence intervals are not shown. The point estimate is shown as
a bar plot.
For metrics where lower values indicate better model performance, more
negative permutation variable importance values indicate features that are
more important. Because this may cause confusion, values obtained for these
metrics are mirrored around 0.0 for plotting (but not any tabular data
export).
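The mirroring can be pictured with a few made-up numbers. The values and feature names below are hypothetical, and the sketch only shows the sign flip, not familiar's internal code.

```r
# Hypothetical permutation importance values for a lower-is-better metric,
# e.g. an error measure: more negative means more important.
raw_importance <- c(feature_a = -0.30, feature_b = -0.05, feature_c = 0.01)

# Mirroring around 0.0 flips the sign, so that larger plotted values
# correspond to more important features.
plotted_importance <- -raw_importance

print(plotted_importance)
```

Tabular exports would still contain the values in raw_importance, as mirroring is applied for plotting only.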
Value
NULL
or list of plot objects, if dir_path
is NULL
.
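A call might look like the sketch below. It assumes that the familiar package is installed and that an object named collection (hypothetical) holds a familiarCollection created earlier; the call is skipped when either is missing. All argument names are taken from the Usage section.

```r
# Optional arguments, all taken from the Usage section.
plot_args <- list(
  draw = FALSE,
  dir_path = NULL,              # NULL: return plots instead of saving them
  conf_int_style = "bar_line",
  conf_int_alpha = 0.4,
  x_n_breaks = 5
)

# `collection` is hypothetical: a familiarCollection object created earlier.
if (requireNamespace("familiar", quietly = TRUE) && exists("collection")) {
  plots <- do.call(
    familiar::plot_permutation_variable_importance,
    c(list(object = collection), plot_args)
  )
  # With dir_path = NULL, a list of plot objects is expected.
  print(length(plots))
}
```

Setting dir_path to a directory instead would save the figures in its variable_importance subdirectory.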
[Package familiar version 1.4.8 Index]