gower.dist {StatMatch} | R Documentation |
Computes the Gower's Distance
Description
This function computes the Gower's distance (dissimilarity) between units in a dataset or between observations in two distinct datasets.
Usage
gower.dist(data.x, data.y=data.x, rngs=NULL, KR.corr=TRUE, var.weights = NULL,
robcb=NULL)
Arguments
data.x |
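A matrix or a data frame containing the variables to use in the computation of the distance. The type of each column (logical, factor or character, numeric, ordered) determines how it is treated; see Details. Missing values (NA) are allowed. If only data.x is supplied, distances are computed between its rows. |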
data.y |
A numeric matrix or data frame with the same variables, of the same type, as those in data.x. If omitted, it defaults to data.x. |
rngs |
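A vector with the ranges to scale the variables. Its length must be equal to the number of variables in data.x. When rngs=NULL (the default), the range of a numeric variable is computed on the available data; for example, for a variable "X1": rngs["X1"] <- max(data.x[,"X1"], data.y[,"X1"]) - min(data.x[,"X1"], data.y[,"X1"]). |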
KR.corr |
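When TRUE (the default), the Kaufman and Rousseeuw (1990) extension of Gower's dissimilarity coefficient is used; see Details. |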
var.weights |
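By default (NULL) every variable has weight 1 in the overall distance; supply one weight per variable to change this (see Details). |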
robcb |
By default is ( |
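The default range rule quoted for rngs can be illustrated with a minimal base R sketch. The data frames and the variable name X1 below are made up for illustration, and the code only mirrors the formula given in the argument description; it is not the package's internal implementation.

```r
# Hypothetical data: one numeric variable observed in two datasets
data.x <- data.frame(X1 = c(1.0, 4.0, 2.5))
data.y <- data.frame(X1 = c(0.5, 3.0))

# Range computed jointly on data.x and data.y, as in the rngs description
rng_X1 <- max(data.x[, "X1"], data.y[, "X1"]) -
          min(data.x[, "X1"], data.y[, "X1"])      # 4.0 - 0.5 = 3.5

# Scaled absolute differences between every row of data.x and of data.y
d <- abs(outer(data.x$X1, data.y$X1, "-")) / rng_X1
dim(d)                                             # 3 rows, 2 columns
```

Because the range is taken over both datasets together, every scaled difference falls between 0 and 1.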
Details
This function computes distances between records when variables of different types (categorical and continuous) have been observed. To handle the different types of variables, Gower's dissimilarity coefficient (Gower, 1971) is used. By default (KR.corr=TRUE) the Kaufman and Rousseeuw (1990) extension of Gower's dissimilarity coefficient is used.
The final dissimilarity between the ith and jth unit is obtained as a weighted average of the dissimilarities for each variable:
d(i,j) = \frac{\sum_k \delta_{ijk} d_{ijk} w_k}{\sum_k \delta_{ijk} w_k}
In particular, d_{ijk} represents the distance between the ith and jth unit computed on the kth variable, while w_k is the weight assigned to variable k (by default 1 for all the variables, unless different weights are provided by the user with the argument var.weights). The distance d_{ijk} depends on the nature of the variable:
- logical columns are considered as asymmetric binary variables; in this case d_{ijk}=0 if x_{ik} = x_{jk} = TRUE, 1 otherwise;
- factor or character columns are considered as categorical nominal variables, and d_{ijk}=0 if x_{ik}=x_{jk}, 1 otherwise;
- numeric columns are considered as interval-scaled variables, and
d_{ijk} = \frac{\left|x_{ik}-x_{jk}\right|}{R_k}
where R_k is the range of the kth variable. The range is the one supplied with the argument rngs (rngs[k]) or the one computed on the available data (when rngs=NULL);
- ordered columns are considered as categorical ordinal variables, and the values are replaced with the corresponding position index, r_{ik}, in the factor levels. When KR.corr=FALSE these position indexes (which differ from the output of the R function rank) are transformed as follows:
z_{ik} = \frac{r_{ik}-1}{\max\left(r_{ik}\right) - 1}
These new values, z_{ik}, are treated as observations of an interval-scaled variable.
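The difference between the position index r_{ik} and the output of rank, and the rescaling to z_{ik}, can be seen in a small base R sketch. The ordinal variable below is made up, and taking max(r_{ik}) over the observed values is an assumption of this sketch rather than something the text confirms.

```r
# Hypothetical ordinal variable with three levels (illustrative data only)
x <- ordered(c("low", "high", "mid", "low"),
             levels = c("low", "mid", "high"))

r <- as.integer(x)            # position index in the levels: 1 3 2 1
z <- (r - 1) / (max(r) - 1)   # rescaled to [0, 1]: 0.0 1.0 0.5 0.0

# rank() averages tied values, so it differs from the position index
rk <- rank(r)                 # 1.5 4.0 3.0 1.5

# Distance between units 1 and 2, treating z as an interval-scaled variable
d_12 <- abs(z[1] - z[2]) / (max(z) - min(z))   # 1
```

The two tied "low" values share position index 1 but receive rank 1.5, which is why the two quantities are not interchangeable.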
As far as the weight \delta_{ijk} is concerned:
- \delta_{ijk}=0 if x_{ik} = NA or x_{jk} = NA;
- \delta_{ijk}=0 if the variable is asymmetric binary and x_{ik}=x_{jk}=0 or x_{ik} = x_{jk} = FALSE;
- \delta_{ijk}=1 in all the other cases.
In practice, NAs and pairs of cases with x_{ik}=x_{jk}=FALSE do not contribute to the distance computation.
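As a worked illustration of the formula for d(i,j), the following base R sketch combines per-variable distances for one pair of units. The helper gower_pair and all values are hypothetical; it only restates the weighted average and the \delta_{ijk} rules described above and is not the code used by gower.dist.

```r
# Combine per-variable distances into d(i,j); hypothetical helper
gower_pair <- function(d, delta, w = rep(1, length(d))) {
  keep <- delta == 1                     # drop variables with delta_ijk = 0
  sum(d[keep] * w[keep]) / sum(w[keep])
}

# k = 1: logical, FALSE vs FALSE         -> delta = 0 (excluded)
# k = 2: nominal, "red" vs "blue"        -> d = 1
# k = 3: numeric, 2 vs 5 with R_k = 10   -> d = 0.3
# k = 4: numeric, NA vs 1.5              -> delta = 0 (excluded)
d     <- c(1, 1, abs(2 - 5) / 10, NA)
delta <- c(0, 1, 1, 0)

d_equal    <- gower_pair(d, delta)                     # (1 + 0.3) / 2 = 0.65
d_weighted <- gower_pair(d, delta, w = c(1, 2, 5, 2))  # (2*1 + 5*0.3) / 7 = 0.5
```

Only the second and third variables enter either result: the FALSE/FALSE pair and the pair containing NA are dropped from both the numerator and the denominator.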
Value
A matrix object with distances between rows of data.x and those of data.y.
Author(s)
Marcello D'Orazio mdo.statmatch@gmail.com
References
Gower, J. C. (1971), “A general coefficient of similarity and some of its properties”. Biometrics, 27, 623–637.
Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
See Also
Examples
x1 <- as.logical(rbinom(10,1,0.5))
x2 <- sample(letters, 10, replace=TRUE)
x3 <- rnorm(10)
x4 <- ordered(cut(x3, -4:4, include.lowest=TRUE))
xx <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)
# matrix of distances between observations in xx
dx <- gower.dist(xx)
head(dx)
# matrix of distances between the first six obs. in xx
# and the remaining four, with user-supplied variable weights
gower.dist(data.x=xx[1:6,], data.y=xx[7:10,], var.weights = c(1,2,5,2))