rf.clustering {PDtoolkit}    R Documentation

Risk factor clustering

Description

rf.clustering implements correlation-based clustering of risk factors. The clustering procedure is based on hclust from the stats package.

Usage

rf.clustering(db, metric, k = NA)

Arguments

db

Data frame of risk factors supplied for clustering analysis.

metric

Correlation metric used for distance calculation. Available options are:

  • "raw pearson" - distance calculated as as.dist(1 - cor(db, method = "pearson"));

  • "raw spearman" - distance calculated as as.dist(1 - cor(db, method = "spearman"));

  • "common pearson" - distance calculated as as.dist((1 - cor(db, method = "pearson")) / 2);

  • "common spearman" - distance calculated as as.dist((1 - cor(db, method = "spearman")) / 2);

  • "absolute pearson" - distance calculated as as.dist(1 - abs(cor(db, method = "pearson")));

  • "absolute spearman" - distance calculated as as.dist(1 - abs(cor(db, method = "spearman")));

  • "sqrt pearson" - distance calculated as as.dist(sqrt(1 - cor(db, method = "pearson")));

  • "sqrt spearman" - distance calculated as as.dist(sqrt(1 - cor(db, method = "spearman")));

  • "x2y" - distance calculated as as.dist(1 - dx2y(d = db)[[2]]).
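The correlation-based options in the list can be reproduced with base R alone. The following is a minimal sketch on a hypothetical toy data frame (db.toy and its columns are made up for illustration); it shows how the four Pearson variants treat a strongly negative correlation differently.

```r
# Toy data: three numeric risk factors (hypothetical, for illustration only)
set.seed(1)
x1 <- rnorm(50)
db.toy <- data.frame(x1 = x1,
                     x2 = x1 + rnorm(50, sd = 0.1),   # strongly positively correlated with x1
                     x3 = -x1 + rnorm(50, sd = 0.1))  # strongly negatively correlated with x1

cr.p <- cor(db.toy, method = "pearson")

d.raw      <- as.dist(1 - cr.p)        # "raw pearson": values between 0 and 2
d.common   <- as.dist((1 - cr.p) / 2)  # "common pearson": raw distance rescaled to 0-1
d.absolute <- as.dist(1 - abs(cr.p))   # "absolute pearson": sign of the correlation is ignored
d.sqrt     <- as.dist(sqrt(1 - cr.p))  # "sqrt pearson": square root of the raw distance

round(as.matrix(d.raw), 2)
round(as.matrix(d.absolute), 2)
```

Under the raw, common and sqrt variants x1 and x3 end up far apart, whereas the absolute variant places them close together, so the choice of metric decides whether negatively correlated risk factors can share a cluster.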

The x2y metric was proposed by Professor Rama Ramakrishnan, and details can be found on this link. This metric is especially handy if the analyst wants to perform clustering before any binning procedure in order to decrease the number of risk factors. Additionally, the x2y algorithm processes numerical and categorical risk factors at once and is able to identify non-linear relationships between pairs of risk factors. The x2y metric is not symmetric with respect to its inputs x and y; therefore, the arithmetic average of the xy and yx values is used to produce the final value for each pair.
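The averaging step described above can be sketched in a few lines. The matrix m below is a hypothetical set of asymmetric association scores, assumed here to lie on a 0-1 scale; it is not output of dx2y.

```r
# Hypothetical asymmetric association scores (rows: x, columns: y)
m <- matrix(c(1.0, 0.6, 0.1,
              0.4, 1.0, 0.3,
              0.2, 0.5, 1.0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("a", "b", "c")))

# Arithmetic average of the xy and yx values gives a symmetric matrix
m.sym <- (m + t(m)) / 2

# Convert the symmetric association into a distance object
d.x2y <- as.dist(1 - m.sym)
d.x2y
```

Symmetrizing first is what makes the result usable with as.dist and hclust, both of which assume that the distance from a to b equals the distance from b to a.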

k

Number of clusters. If the default value (NA) is passed, an automatic elbow method is used to determine the optimal number of clusters; otherwise, the supplied number of clusters is used.
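The two modes of k can be illustrated with hclust and cutree from the stats package. This is only a sketch: the toy data, the default linkage and the second-difference elbow rule below are assumptions made for illustration, and the rule actually used inside rf.clustering may differ.

```r
# Toy data: three pairs of closely related risk factors (hypothetical)
set.seed(2)
base <- matrix(rnorm(200 * 3), ncol = 3)
db.toy <- data.frame(a1 = base[, 1], a2 = base[, 1] + rnorm(200, sd = 0.2),
                     b1 = base[, 2], b2 = base[, 2] + rnorm(200, sd = 0.2),
                     c1 = base[, 3], c2 = base[, 3] + rnorm(200, sd = 0.2))

d  <- as.dist((1 - cor(db.toy, method = "spearman")) / 2)  # "common spearman"
hc <- hclust(d)  # default linkage of hclust; an assumption for this sketch

# Mode 1: the analyst supplies the number of clusters
cl.fixed <- cutree(hc, k = 3)

# Mode 2: a simple elbow heuristic (assumption, not the package's own rule):
# within-cluster dispersion for each k, then the k with the sharpest bend
dm  <- as.matrix(d)
wss <- sapply(1:(ncol(db.toy) - 1), function(k) {
  cl <- cutree(hc, k = k)
  sum(sapply(unique(cl), function(g) {
    sum(dm[cl == g, cl == g]^2) / (2 * sum(cl == g))
  }))
})
k.elbow <- which.max(diff(diff(wss))) + 1  # largest second difference
cl.auto <- cutree(hc, k = k.elbow)
```

Supplying k directly is useful when the number of clusters is fixed by the modelling policy, while an elbow-type rule lets the data suggest it.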

Value

The function rf.clustering returns a data frame with the risk factors, their assigned clusters and the distance to the centroid (ordered from smallest to largest). The last column (distance to centroid) can be used to select one or more risk factors per cluster.
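As an illustration of how the distance column can drive the selection, the base R sketch below keeps the risk factor closest to the centroid in each cluster. The data frame cr.toy is hypothetical; the column name rf is an assumption, while clusters and dist.to.centroid follow the names used in the Examples.

```r
# Hypothetical result in the shape described above
cr.toy <- data.frame(rf = c("age", "income", "tenure", "balance", "limit"),
                     clusters = c(1, 1, 2, 2, 2),
                     dist.to.centroid = c(0.10, 0.25, 0.05, 0.12, 0.30))

# Order by cluster and distance, then keep the first row of each cluster
cr.ord   <- cr.toy[order(cr.toy$clusters, cr.toy$dist.to.centroid), ]
selected <- cr.ord[!duplicated(cr.ord$clusters), ]
selected

# Keep the two closest risk factors per cluster instead of one
top2 <- do.call(rbind, lapply(split(cr.ord, cr.ord$clusters), head, n = 2))
```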

Examples

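suppressMessages(library(PDtoolkit))
#dplyr provides %>%, group_by and slice used below (attached explicitly in case it is not already)
suppressMessages(library(dplyr))
library(rpart)
data(loans)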
#clustering using the common spearman metric
#first we need to categorize the numeric risk factors
num.rf <- sapply(loans, is.numeric)
num.rf <- names(num.rf)[!names(num.rf) %in% "Creditability" & num.rf]
loans[, num.rf] <- sapply(num.rf, function(x)
  sts.bin(x = loans[, x], y = loans[, "Creditability"])[[2]])
#replace categories with woe values so that all risk factors are numeric
loans.woe <- replace.woe(db = loans, target = "Creditability")[[1]]
cr <- rf.clustering(db = loans.woe[, -which(names(loans.woe) %in% "Creditability")],
                    metric = "common spearman",
                    k = NA)
cr
#select one risk factor per cluster with the minimum distance to the centroid
cr %>% group_by(clusters) %>%
  slice(which.min(dist.to.centroid))

[Package PDtoolkit version 1.2.0 Index]