R: Bivariate analysis

bivariate {PDtoolkit}

R Documentation

Bivariate analysis

Description

bivariate returns the bivariate statistics for risk factors supplied in data frame db.
The implemented procedure expects all risk factors to be categorical, so numeric risk factors should be categorized first. Additionally, the maximum number of groups per risk factor is set to 10, so risk factors with more than 10 categories are not processed automatically; they can still be inspected manually using the woe.tbl and auc.model functions, which produce the same statistics. Any risk factor that fails either check (risk factor class or number of categories) is reported in the second element of the function output, the info data frame.
The bivariate report (the first element of the function output, the results data frame) includes:

rf: Risk factor name.
bin: Risk factor group (bin).
no: Number of observations per bin.
ng: Number of good cases (where target is equal to 0) per bin.
nb: Number of bad cases (where target is equal to 1) per bin.
pct.o: Percentage of observations per bin.
pct.g: Percentage of good cases (where target is equal to 0) per bin.
pct.b: Percentage of bad cases (where target is equal to 1) per bin.
dr: Default rate per bin.
so: Number of all observations.
sg: Number of all good cases.
sb: Number of all bad cases.
dist.g: Distribution of good cases per bin.
dist.b: Distribution of bad cases per bin.
woe: WoE value.
iv.b: Information value per bin.
iv.s: Information value of risk factor (sum of individual bins' information values).
auc: Area under the curve of a simple logistic regression model estimated as y ~ x, where y is the selected target variable and x is the categorical risk factor.
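
As an illustration of how the columns above relate to each other, here is a minimal base R sketch on a made-up data set. It is not the package's implementation: the data frame toy and its columns are hypothetical, the WoE sign convention (log of dist.g over dist.b) is an assumption, and the tie handling in the AUC calculation may differ from that of auc.model.

```r
# Hypothetical toy data: one categorical risk factor and a binary target
# (1 = bad, 0 = good). Not taken from the gcd data set.
toy <- data.frame(
  rf.bin = c("a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
  target = c(0, 0, 0, 1, 0, 0, 1, 1, 1, 0)
)

# Per-bin counts
no <- tapply(toy$target, toy$rf.bin, length)  # observations per bin
nb <- tapply(toy$target, toy$rf.bin, sum)     # bad cases per bin
ng <- no - nb                                 # good cases per bin

res <- data.frame(bin = names(no), no = as.numeric(no),
                  ng = as.numeric(ng), nb = as.numeric(nb))
res$pct.o  <- res$no / sum(res$no)   # share of observations per bin
res$dr     <- res$nb / res$no        # default rate per bin
res$dist.g <- res$ng / sum(res$ng)   # distribution of good cases
res$dist.b <- res$nb / sum(res$nb)   # distribution of bad cases
# Assumed convention: WoE as the log ratio of good to bad distribution
res$woe  <- log(res$dist.g / res$dist.b)
res$iv.b <- (res$dist.g - res$dist.b) * res$woe  # information value per bin
res$iv.s <- sum(res$iv.b)                        # information value of the risk factor

# AUC of the simple model y ~ x, computed with the rank (Mann-Whitney)
# formula; tied predictions count as one half.
fit <- glm(target ~ rf.bin, family = "binomial", data = toy)
p   <- round(predict(fit, type = "response"), 10)  # rounding guards against spurious ties
r   <- rank(p)
n1  <- sum(toy$target == 1)
n0  <- sum(toy$target == 0)
auc <- (sum(r[toy$target == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

res
auc
```

Because x is a single categorical variable, the fitted probabilities equal the default rate of each bin, so the AUC measures how well the ordering of the bins by default rate separates bad from good cases.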

Additional info report (second element of function output - info data frame), if produced, includes:

rf: Risk factor name.
reason.code: Reason code. Takes value 1 if an inappropriate risk factor class is identified, and value 2 if the risk factor exceeds the maximum number of categories.
comment: Reason description.
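
The two checks can be sketched in base R as follows. This is only an illustration of the logic described above: check_rf is a hypothetical helper that is not part of PDtoolkit, the set of classes treated as categorical (character, factor, logical) is an assumption, and the comment strings are made up.

```r
# Hypothetical helper illustrating the two validations; not a PDtoolkit function.
check_rf <- function(db, target, max.cat = 10) {
  rf.names <- setdiff(names(db), target)
  code <- sapply(rf.names, function(rf) {
    x <- db[[rf]]
    # Assumption: these classes count as categorical
    if (!(is.character(x) || is.factor(x) || is.logical(x))) return(1L)
    if (length(unique(x)) > max.cat) return(2L)
    0L
  })
  flagged <- code > 0
  data.frame(
    rf = rf.names[flagged],
    reason.code = unname(code[flagged]),
    comment = unname(ifelse(code[flagged] == 1L,
                            "Inappropriate class of risk factor",
                            "More than the maximum number of categories")),
    stringsAsFactors = FALSE
  )
}

# Made-up data: one numeric risk factor, one valid categorical risk factor,
# and one categorical risk factor with 12 categories.
toy <- data.frame(
  target = rep(c(0, 1), 6),
  amount = seq(100, 1200, by = 100),
  grp    = rep(c("low", "high"), 6),
  id.bin = letters[1:12],
  stringsAsFactors = FALSE
)
info <- check_rf(db = toy, target = "target")
info
```

In this sketch amount is flagged with reason code 1 and id.bin with reason code 2, while grp passes both checks and does not appear in the result.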

Usage

bivariate(db, target)

Arguments

`db`	Data frame of risk factors and target variable supplied for bivariate analysis.
`target`	Name of target variable within `db` argument.

Value

bivariate returns a list of two data frames. The first contains the bivariate metrics; the second reports the results of the validations described above (risk factor class and number of categories).

Examples

suppressMessages(library(PDtoolkit))
data(gcd)
#categorize numeric risk factors
gcd$age.bin <- ndr.bin(x = gcd$age, y = gcd$qual)[[2]]
gcd$age.bin.1 <- cut(x = gcd$age, breaks = 20)
gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]]
gcd$amount.bin <- ndr.bin(x = gcd$amount, y = gcd$qual)[[2]]
str(gcd)
#select target variable and categorized risk factors
gcd.bin <- gcd[, c("qual", "age.bin", "maturity.bin", "amount.bin")]
#run bivariate analysis on data frame with only categorical risk factors
bivariate(db = gcd.bin, target = "qual")
#run bivariate analysis on data frame with mixed risk factors (categorical and numeric)
#the info table is produced for this example
bivariate(db = gcd, target = "qual")
#run woe table for risk factor with more than 10 modalities
woe.tbl(tbl = gcd, x = "age.bin.1", y = "qual")
#calculate auc for risk factor with more than 10 modalities
lr <- glm(qual ~ age.bin.1, family = "binomial", data = gcd)
auc.model(predictions = predict(lr, type = "response", newdata = gcd),
    observed = gcd$qual)