nzv {PDtoolkit} | R Documentation |
Near-zero variance
Description
The nzv procedure aims to identify risk factors with low variability (almost constant). These risk factors are usually reviewed by an expert, who decides whether they should be excluded from further modeling.
The nzv output report includes the following metrics:
rf: Risk factor name.
rf.type: Risk factor class. This metric is always one of: numeric or categorical.
sc.num: Number of special cases.
sc.pct: Percentage of special cases in total number of observations.
cc.num: Number of complete cases.
cc.pct: Percentage of complete cases in total number of observations. The sum of this value and sc.pct is equal to 1.
cc.unv: Number of unique values in complete cases.
cc.unv.pct: Percentage of unique values in total number of complete cases.
cc.lbl.1: The most frequent value in complete cases.
cc.frq.1: Number of occurrences of the most frequent value in complete cases.
cc.lbl.2: The second most frequent value in complete cases.
cc.frq.2: Number of occurrences of the second most frequent value in complete cases.
cc.fqr: Frequency ratio, i.e. the number of occurrences of the most frequent value divided by the number of occurrences of the second most frequent value in complete cases.
ind: Indicator which takes the value 1 if the percentage of complete cases is less than 10% and the frequency ratio (cc.fqr) is greater than 19. These values can be used for filtering risk factors that need further expert investigation, but users are also encouraged to derive their own indicators based on the reported metrics.
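To make the complete-case metrics concrete, the following base R sketch computes several of them for a single vector. It is an illustration only and not the PDtoolkit implementation: the helper name nzv_sketch is hypothetical, it handles one vector rather than a data frame, and it leaves out rf, rf.type and ind.

```r
# Illustrative sketch only (hypothetical helper, not the PDtoolkit source).
# Computes some of the metrics listed above for one vector x.
nzv_sketch <- function(x, sc = c(NA, NaN, Inf, -Inf)) {
  is_sc <- x %in% sc                      # TRUE where x is a special case
  cc <- x[!is_sc]                         # complete cases
  tab <- sort(table(cc), decreasing = TRUE)
  has1 <- length(tab) >= 1
  has2 <- length(tab) >= 2
  frq1 <- if (has1) tab[[1]] else NA_integer_
  frq2 <- if (has2) tab[[2]] else NA_integer_
  data.frame(
    sc.num     = sum(is_sc),
    sc.pct     = mean(is_sc),
    cc.num     = length(cc),
    cc.pct     = mean(!is_sc),
    cc.unv     = length(tab),
    cc.unv.pct = length(tab) / length(cc),
    cc.lbl.1   = if (has1) names(tab)[1] else NA_character_,
    cc.frq.1   = frq1,
    cc.lbl.2   = if (has2) names(tab)[2] else NA_character_,
    cc.frq.2   = frq2,
    cc.fqr     = frq1 / frq2,             # NA when there is no second value
    stringsAsFactors = FALSE
  )
}

# Toy input: 95 "a", 4 "b", 1 "c" and one missing value
x <- c(rep("a", 95), rep("b", 4), "c", NA)
m <- nzv_sketch(x)
m$cc.fqr  # 95 / 4 = 23.75
```

In this toy vector the dominant value occurs 95 times and the runner-up 4 times, so the frequency ratio exceeds the threshold of 19 mentioned above.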
Usage
nzv(db, sc = c(NA, NaN, Inf, -Inf))
Arguments
db | Data frame of risk factors supplied for near-zero variance analysis. |
sc | Numeric or character vector with special case elements. Default values are c(NA, NaN, Inf, -Inf). |
Value
The command nzv returns a data frame with the metrics needed to identify near-zero variance risk factors.
For details, see the Description section.
Examples
suppressMessages(library(PDtoolkit))
data(loans)
#artificially add some special values
loans$"Account Balance"[1:10] <- NA
rf.s <- nzv(db = loans, sc = c(NA, NaN, Inf, -Inf))
rf.s
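Since the Description encourages deriving custom indicators from the reported metrics, the sketch below shows one possible way to do so. The helper flag_custom, its thresholds and the toy report are assumptions made for illustration; only the column names rf, cc.unv.pct and cc.fqr are taken from the metric list above.

```r
# Hypothetical helper: flags risk factors by a user-defined rule built on
# two of the reported metrics. The thresholds are illustrative, not package defaults.
flag_custom <- function(report, max_unv_pct = 0.10, min_fqr = 19) {
  stopifnot(all(c("rf", "cc.unv.pct", "cc.fqr") %in% names(report)))
  hit <- report$cc.unv.pct < max_unv_pct & report$cc.fqr > min_fqr
  # which() drops rows where the rule evaluates to NA
  report$rf[which(hit)]
}

# Toy report that mimics the column names of the nzv output (made-up values)
toy <- data.frame(
  rf         = c("x1", "x2", "x3"),
  cc.unv.pct = c(0.02, 0.50, 0.03),
  cc.fqr     = c(40, 1.2, NA),
  stringsAsFactors = FALSE
)
flagged <- flag_custom(toy)
flagged
```

With the report from the example above, the same helper could be called as flag_custom(rf.s), provided the output carries the documented columns.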