lift_curve {yardstick}    R Documentation
Lift curve
Description
lift_curve() constructs the full lift curve and returns a tibble. See gain_curve() for a closely related concept.
Usage
lift_curve(data, ...)
## S3 method for class 'data.frame'
lift_curve(
data,
truth,
...,
na_rm = TRUE,
event_level = yardstick_event_level(),
case_weights = NULL
)
Arguments
data
A data.frame containing the columns specified by truth and ....

...
A set of unquoted column names or one or more dplyr selector functions to choose which variables contain the class probabilities. If truth is binary, only one column should be selected, and it should correspond to the level specified in event_level. Otherwise, there should be as many columns as factor levels of truth, in the same order as those levels.

truth
The column identifier for the true class results (that is a factor). This should be an unquoted column name.

na_rm
A logical value indicating whether NA values should be stripped before the computation proceeds.

event_level
A single string. Either "first" or "second" to specify which level of truth to consider as the "event". The default, yardstick_event_level(), uses the first level.

case_weights
The optional column identifier for case weights. This should be an unquoted column name that evaluates to a numeric column in data.
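As a brief illustration of the column selection through ... (both datasets ship with yardstick and also appear in the Examples below), a binary truth takes a single probability column for the event level, while a multiclass truth takes one column per factor level, in level order:

data(two_class_example)

# Binary truth: one probability column, matching the event level
lift_curve(two_class_example, truth, Class1)

# Multiclass truth: one column per level of `obs`, in the same order as the levels
lift_curve(hpc_cv, obs, VF:L)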
Details
There is a ggplot2::autoplot()
method for quickly visualizing the curve.
This works for binary and multiclass output, and also works with grouped data
(i.e. from resamples). See the examples.
Value
A tibble with class lift_df or lift_grouped_df having columns:

- .n: The index of the current sample.
- .n_events: The index of the current unique sample. Values with repeated estimate values are given identical indices in this column.
- .percent_tested: The cumulative percentage of values tested.
- .lift: First calculate the cumulative percentage of true results relative to the total number of true results. Then divide that by .percent_tested.
If using the case_weights argument, all of the above columns will be weighted. This makes the most sense with frequency weights, which are integer weights representing the number of times a particular observation should be repeated.
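For instance, a minimal sketch of frequency-weighted usage; the wts column here is hypothetical and added only for illustration:

library(dplyr)
data(two_class_example)

two_class_example %>%
  mutate(wts = rep(c(1L, 2L), length.out = n())) %>%  # hypothetical integer frequency weights
  lift_curve(truth, Class1, case_weights = wts)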
Gain and Lift Curves
The motivation behind cumulative gain and lift charts is as a visual method to determine the effectiveness of a model when compared to the results one might expect without a model. As an example, without a model, if you were to advertise to a random 10% of your customer base, then you might expect to capture 10% of the total number of positive responses had you advertised to your entire customer base. Given a model that predicts which customers are more likely to respond, the hope is that you can more accurately target 10% of your customer base and capture > 10% of the total number of positive responses.
The calculation to construct lift curves is as follows (see the sketch after this list):

- truth and estimate are placed in descending order by the estimate values (estimate here is a single column supplied in ...).
- The cumulative number of samples with true results relative to the entire number of true results is found.
- The cumulative % found is divided by the cumulative % tested to construct the lift value. This ratio represents the factor of improvement over an uninformed model. Values > 1 represent a valuable model. This is the y-axis of the lift chart.
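As a rough sketch of those three steps (ignoring tied estimate values, which lift_curve() collapses), the lift ratio can be computed by hand with dplyr; the column names here are illustrative, not the ones returned by lift_curve():

library(dplyr)
data(two_class_example)

two_class_example %>%
  arrange(desc(Class1)) %>%                        # step 1: sort by the estimate
  mutate(
    n_tested       = row_number(),
    n_true_found   = cumsum(truth == "Class1"),    # step 2: cumulative true results
    percent_tested = 100 * n_tested / n(),
    percent_found  = 100 * n_true_found / sum(truth == "Class1"),
    lift           = percent_found / percent_tested  # step 3: the lift ratio
  )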
Multiclass
If a multiclass truth column is provided, a one-vs-all approach will be taken to calculate multiple curves, one per level. In this case, there will be an additional column, .level, identifying the "one" column in the one-vs-all calculation.
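A quick way to see the extra .level column, using the hpc_cv data that also appears in the Examples (the dplyr::count() call is only for inspection):

library(dplyr)

lift_curve(hpc_cv, obs, VF:L) %>%
  count(.level)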
Relevant Level
There is no common convention on which factor level should automatically be considered the "event" or "positive" result when computing binary classification metrics. In yardstick, the default is to use the first level. To alter this, change the argument event_level to "second" to consider the last level of the factor the level of interest. For multiclass extensions involving one-vs-all comparisons (such as macro averaging), this option is ignored and the "one" level is always the relevant result.
Author(s)
Max Kuhn
See Also
Other curve metrics: gain_curve(), pr_curve(), roc_curve()
Examples
# ---------------------------------------------------------------------------
# Two class example
# `truth` is a 2 level factor. The first level is `"Class1"`, which is the
# "event of interest" by default in yardstick. See the Relevant Level
# section above.
data(two_class_example)
# Binary metrics using class probabilities take a factor `truth` column,
# and a single class probability column containing the probabilities of
# the event of interest. Here, since `"Class1"` is the first level of
# `"truth"`, it is the event of interest and we pass in probabilities for it.
lift_curve(two_class_example, truth, Class1)
# ---------------------------------------------------------------------------
# `autoplot()`
library(ggplot2)
library(dplyr)
# Use autoplot to visualize
autoplot(lift_curve(two_class_example, truth, Class1))
# Multiclass one-vs-all approach
# One curve per level
hpc_cv %>%
filter(Resample == "Fold01") %>%
lift_curve(obs, VF:L) %>%
autoplot()
# Same as above, but with all of the resamples
hpc_cv %>%
group_by(Resample) %>%
lift_curve(obs, VF:L) %>%
autoplot()