bivariate {PDtoolkit} | R Documentation
Bivariate analysis
Description
bivariate returns the bivariate statistics for risk factors supplied in data frame db.
The procedure expects all risk factors to be categorical, so numeric risk factors should be categorized first.
In addition, the maximum number of groups per risk factor is set to 10: risk factors with more than
10 categories are not processed automatically, but the same statistics can still be produced manually with the
woe.tbl and auc.model functions. The results of both checks (risk factor class and number of categories),
if any issue is identified, are reported in the second element of the function output, the info data frame.
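The two checks described above can be approximated before calling bivariate. The following is a minimal base R sketch, not the package's implementation: precheck_rf and the toy data frame are hypothetical, and it assumes that "categorical" means a character, factor or logical column.

```r
# Hypothetical helper (not part of PDtoolkit): flags risk factors that the
# class check and the 10-category check would be expected to report.
precheck_rf <- function(db, target, max.groups = 10) {
  rf <- setdiff(names(db), target)
  # Assumption: character, factor and logical columns count as categorical
  bad.class <- vapply(db[rf], function(x) {
    !(is.character(x) || is.factor(x) || is.logical(x))
  }, logical(1))
  # Number of distinct values per risk factor (NA counts as a value here)
  n.groups <- vapply(db[rf], function(x) length(unique(x)), integer(1))
  data.frame(rf = rf,
             bad.class = unname(bad.class),
             too.many.groups = unname(n.groups > max.groups),
             stringsAsFactors = FALSE)
}

# Toy data: one categorical and one numeric risk factor
toy <- data.frame(qual = c(0, 1, 0, 1, 0, 0),
                  grade = c("a", "b", "a", "c", "b", "a"),
                  age = c(23, 45, 31, 52, 38, 27),
                  stringsAsFactors = FALSE)
precheck_rf(toy, target = "qual")
```

In this sketch the numeric column age is flagged for its class, so it would need to be binned before a bivariate report could be expected for it.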
The bivariate report (first element of the function output, the results data frame) includes:
rf: Risk factor name.
bin: Risk factor group (bin).
no: Number of observations per bin.
ng: Number of good cases (where target is equal to 0) per bin.
nb: Number of bad cases (where target is equal to 1) per bin.
pct.o: Percentage of observations per bin.
pct.g: Percentage of good cases (where target is equal to 0) per bin.
pct.b: Percentage of bad cases (where target is equal to 1) per bin.
dr: Default rate per bin.
so: Number of all observations.
sg: Number of all good cases.
sb: Number of all bad cases.
dist.g: Distribution of good cases per bin.
dist.b: Distribution of bad cases per bin.
woe: WoE value.
iv.b: Information value per bin.
iv.s: Information value of risk factor (sum of individual bins' information values).
auc: Area under the curve of a simple logistic regression model estimated as y ~ x, where y is the selected target variable and x is the categorical risk factor.
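To make the per-bin quantities concrete, here is a small base R sketch on toy data. It assumes the common convention woe = log(dist.g / dist.b) and iv.b = (dist.g - dist.b) * woe; the package's exact formulas and sign convention are not stated on this page, so compare against the output of woe.tbl before relying on it.

```r
# Toy data: binary target (1 = bad, 0 = good) and one categorical risk factor
y <- c(0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0)
x <- c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c")

no <- as.vector(table(x))           # observations per bin
nb <- as.vector(tapply(y, x, sum))  # bad cases per bin
ng <- no - nb                       # good cases per bin

tbl <- data.frame(bin = sort(unique(x)), no = no, ng = ng, nb = nb)
tbl$dr     <- tbl$nb / tbl$no            # default rate per bin
tbl$dist.g <- tbl$ng / sum(tbl$ng)       # share of all good cases in the bin
tbl$dist.b <- tbl$nb / sum(tbl$nb)       # share of all bad cases in the bin
# Assumed convention: log ratio of good to bad distribution
tbl$woe    <- log(tbl$dist.g / tbl$dist.b)
tbl$iv.b   <- (tbl$dist.g - tbl$dist.b) * tbl$woe
iv.s <- sum(tbl$iv.b)                    # information value of the risk factor

tbl
iv.s
```

A bin with no good or no bad cases would give an infinite or undefined WoE under this convention, so such bins need separate handling.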
The additional info report (second element of the function output, the info data frame), if produced, includes:
rf: Risk factor name.
reason.code: Reason code. Takes the value 1 if an inappropriate risk factor class is identified, and 2 if the maximum number of categories is exceeded.
comment: Reason description.
Usage
bivariate(db, target)
Arguments
db: Data frame of risk factors and target variable supplied for bivariate analysis.
target: Name of target variable within db.
Value
The command bivariate returns a list of two data frames. The first contains the bivariate metrics,
while the second reports the results of the validations explained above
(class of the risk factors and number of categories).
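A short sketch of how the returned list might be unpacked. The toy data frame is made up for illustration, the elements are accessed by position because only their order is stated here, and the call is guarded so the snippet does nothing if PDtoolkit is not installed.

```r
# Toy data: binary target and one categorical risk factor with three bins
toy <- data.frame(
  qual  = c(0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1),
  grade = factor(rep(c("a", "b", "c"), each = 4))
)

if (requireNamespace("PDtoolkit", quietly = TRUE)) {
  res <- PDtoolkit::bivariate(db = toy, target = "qual")
  biv <- res[[1]]  # first element: bivariate metrics, one row per bin
  chk <- res[[2]]  # second element: report of the validation checks
  print(biv[, c("rf", "bin", "woe", "iv.s", "auc")])
  print(chk)
}
```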
See Also
woe.tbl and auc.model for manual bivariate analysis.
Examples
suppressMessages(library(PDtoolkit))
data(gcd)
#categorize numeric risk factors
gcd$age.bin <- ndr.bin(x = gcd$age, y = gcd$qual)[[2]]
gcd$age.bin.1 <- cut(x = gcd$age, breaks = 20)
gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]]
gcd$amount.bin <- ndr.bin(x = gcd$amount, y = gcd$qual)[[2]]
str(gcd)
#select target variable and categorized risk factors
gcd.bin <- gcd[, c("qual", "age.bin", "maturity.bin", "amount.bin")]
#run bivariate analysis on data frame with only categorical risk factors
bivariate(db = gcd.bin, target = "qual")
#run bivariate analysis on data frame with mixed risk factors (categorical and numeric).
#for this example info table is produced
bivariate(db = gcd, target = "qual")
#run woe table for risk factor with more than 10 modalities
woe.tbl(tbl = gcd, x = "age.bin.1", y = "qual")
#calculate auc for risk factor with more than 10 modalities
lr <- glm(qual ~ age.bin.1, family = "binomial", data = gcd)
auc.model(predictions = predict(lr, type = "response", newdata = gcd),
          observed = gcd$qual)
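The AUC of a model of the form y ~ x can also be cross-checked in base R. The sketch below uses toy data rather than gcd and computes a rank-based (Mann-Whitney) AUC with average ranks for ties. It is an independent calculation, not the auc.model implementation, so small differences in tie handling are possible.

```r
# Toy data: binary target and one categorical risk factor with three bins
y <- c(0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1)
x <- factor(rep(c("a", "b", "c"), each = 4))

# Simple logistic regression with the categorical risk factor as predictor
lr <- glm(y ~ x, family = "binomial")
p <- predict(lr, type = "response")

# Rank-based AUC: probability that a randomly chosen bad case receives a
# higher prediction than a randomly chosen good case (ties count as 0.5)
r  <- rank(p)
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```

Because the only predictor is categorical, all observations in the same bin receive the same prediction, so ties are frequent and the treatment of ties matters for the result.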