| bivariate {PDtoolkit} | R Documentation | 
Bivariate analysis
Description
bivariate returns the bivariate statistics for the risk factors supplied in the data frame db. 
The implemented procedure expects all risk factors to be categorical, so numeric risk factors should be
categorized first. Additionally, the maximum number of groups per risk factor is set to 10, so risk factors with more than
10 categories are not processed automatically. They can still be inspected manually with the woe.tbl
and auc.model functions, which produce the same statistics. Risk factors that fail either check (risk factor class or
number of categories) are reported in the second element of the function output, the info data frame. 
The bivariate report (the first element of the function output, the results data frame) includes:
- rf: Risk factor name. 
- bin: Risk factor group (bin). 
- no: Number of observations per bin. 
- ng: Number of good cases (where target is equal to 0) per bin. 
- nb: Number of bad cases (where target is equal to 1) per bin. 
- pct.o: Percentage of observations per bin. 
- pct.g: Percentage of good cases (where target is equal to 0) per bin. 
- pct.b: Percentage of bad cases (where target is equal to 1) per bin. 
- dr: Default rate per bin. 
- so: Number of all observations. 
- sg: Number of all good cases. 
- sb: Number of all bad cases. 
- dist.g: Distribution of good cases per bin. 
- dist.b: Distribution of bad cases per bin. 
- woe: WoE value. 
- iv.b: Information value per bin. 
- iv.s: Information value of risk factor (sum of individual bins' information values). 
- auc: Area under curve of a simple logistic regression model estimated as y ~ x, where y is the selected target variable and x is the categorical risk factor.
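The statistics listed above can be sketched in base R as follows. This is a minimal illustration on a made-up data frame, not the package's implementation: the toy data (db, y, x) are hypothetical, the WoE convention log(dist.g / dist.b) and the per-bin information value (dist.g - dist.b) * woe are assumed from common practice, and the AUC is computed with the rank-based (Mann-Whitney) formula rather than with auc.model. Only a subset of the report columns is reproduced.

```r
# Toy data (hypothetical): binary target y and one categorical risk factor x
db <- data.frame(
  y = c(0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0),
  x = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "a"),
  stringsAsFactors = FALSE
)

# Counts per bin: observations, bad cases (y == 1) and good cases (y == 0)
no <- tapply(db$y, db$x, length)
nb <- tapply(db$y, db$x, sum)
ng <- no - nb
tbl <- data.frame(bin = names(no), no = as.vector(no),
                  ng = as.vector(ng), nb = as.vector(nb))

# Default rate per bin and overall totals
tbl$dr <- tbl$nb / tbl$no
tbl$so <- sum(tbl$no)
tbl$sg <- sum(tbl$ng)
tbl$sb <- sum(tbl$nb)

# Distribution of good and bad cases across bins
tbl$dist.g <- tbl$ng / tbl$sg
tbl$dist.b <- tbl$nb / tbl$sb

# Assumed convention: WoE = log(dist.g / dist.b); IV per bin and in total
tbl$woe <- log(tbl$dist.g / tbl$dist.b)
tbl$iv.b <- (tbl$dist.g - tbl$dist.b) * tbl$woe
tbl$iv.s <- sum(tbl$iv.b)

# AUC of the simple logistic regression y ~ x (rank-based formula,
# average ranks are used for tied predictions)
fit <- glm(y ~ x, family = binomial, data = db)
p <- predict(fit, type = "response")
r <- rank(p)
n1 <- sum(db$y == 1)
n0 <- sum(db$y == 0)
auc <- (sum(r[db$y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(tbl)
print(auc)
```

Note that a bin with no good or no bad cases would give an infinite or undefined WoE under this convention, so such bins need separate treatment.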
The additional info report (the second element of the function output, the info data frame), if produced, includes:
- rf: Risk factor name. 
- reason.code: Reason code. Takes the value 1 if an inappropriate risk factor class is identified and the value 2 if the risk factor exceeds the maximum number of categories. 
- comment: Reason description. 
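The two checks behind the info report can be sketched in base R as shown below. This is an illustrative sketch under stated assumptions, not the package's code: the data frame and its column names are hypothetical, character and factor columns are assumed to count as categorical, the class check is assumed to run before the category-count check, and the comment texts are made up.

```r
# Hypothetical data frame with a target and three candidate risk factors
db <- data.frame(
  y = rep(c(0, 1), times = 12),
  rf.num = seq_len(24),                      # numeric: fails the class check
  rf.ok = rep(c("a", "b", "c"), times = 8),  # categorical with 3 groups
  rf.many = rep(letters[1:12], times = 2),   # categorical with 12 groups
  stringsAsFactors = FALSE
)
target <- "y"
max.groups <- 10  # maximum number of groups stated in the description
rf.names <- setdiff(names(db), target)

# Return one info row per risk factor that fails a check, NULL otherwise
check.rf <- function(rf) {
  x <- db[[rf]]
  if (!(is.character(x) || is.factor(x))) {
    # Assumption: only character and factor columns are treated as categorical
    data.frame(rf = rf, reason.code = 1,
               comment = "Inappropriate class of risk factor.")
  } else if (length(unique(x)) > max.groups) {
    data.frame(rf = rf, reason.code = 2,
               comment = "More than 10 categories.")
  } else {
    NULL
  }
}

# Collect the failing risk factors into a single info data frame
checks <- Filter(Negate(is.null), lapply(rf.names, check.rf))
info <- do.call(rbind, checks)
print(info)
```

Risk factors that pass both checks (rf.ok here) produce no row, which mirrors the "if produced" wording of the report description.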
Usage
bivariate(db, target)
Arguments
| db | Data frame of risk factors and target variable supplied for bivariate analysis. | 
| target | Name of target variable within db. | 
Value
The command bivariate returns a list of two data frames. The first one contains the bivariate metrics,
while the second one reports the results of the validations described above
(class of the risk factors and number of categories).
See Also
woe.tbl and auc.model for manual bivariate analysis.
Examples
suppressMessages(library(PDtoolkit))
data(gcd)
#categorize numeric risk factors
gcd$age.bin <- ndr.bin(x = gcd$age, y = gcd$qual)[[2]]
gcd$age.bin.1 <- cut(x = gcd$age, breaks = 20)
gcd$maturity.bin <- ndr.bin(x = gcd$maturity, y = gcd$qual, y.type = "bina")[[2]]
gcd$amount.bin <- ndr.bin(x = gcd$amount, y = gcd$qual)[[2]]
str(gcd)
#select target variable and categorized risk factors
gcd.bin <- gcd[, c("qual", "age.bin", "maturity.bin", "amount.bin")]
#run bivariate analysis on data frame with only categorical risk factors
bivariate(db = gcd.bin, target = "qual")
#run bivariate analysis on data frame with mixed risk factors (categorical and numeric)
#in this example the info table is also produced
bivariate(db = gcd, target = "qual")
#run woe table for risk factor with more than 10 modalities
woe.tbl(tbl = gcd, x = "age.bin.1", y = "qual")
#calculate auc for risk factor with more than 10 modalities
lr <- glm(qual ~ age.bin.1, family = "binomial", data = gcd)
auc.model(predictions = predict(lr, type = "response", newdata = gcd),
    observed = gcd$qual)