distmix {kmed}    R Documentation
Distances for mixed variables data set
Description
This function computes a distance matrix for a mixed variable data set using one of several methods.
Usage
distmix(data, method = "gower", idnum = NULL, idbin = NULL, idcat = NULL)
Arguments
data: A data frame or matrix object.
method: A method to calculate the mixed variables distance (see Details).
idnum: A vector of column indices of the numerical variables.
idbin: A vector of column indices of the binary variables.
idcat: A vector of column indices of the categorical variables.
Details
There are six methods available to calculate the mixed variable distance: gower, wishart, podani, huang, harikumar, and ahmad.
gower

The Gower (1971) distance is the most common distance for a mixed variable
data set. Although the Gower distance can accommodate missing values, a
missing value is not allowed in this function. If there is a missing value,
the Gower distance from the daisy function in the cluster package can be
applied. The Gower distance between objects i and j is computed by
d_{ij} = 1 - s_{ij}, where

s_{ij} = \frac{\sum_{l=1}^p \omega_{ijl} s_{ijl}}{\sum_{l=1}^p \omega_{ijl}}

\omega_{ijl} is a weight in variable l that is usually 1 or 0 (for a missing
value). If variable l is a numerical variable,

s_{ijl} = 1 - \frac{|x_{il} - x_{jl}|}{R_l}

where R_l is the range of variable l; s_{ijl} \in \{0, 1\} if variable l is a
binary/categorical variable, i.e. 1 for a match and 0 for a mismatch.
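As a rough illustration of this formula (a minimal sketch, not the package's internal code; the helper gower_dist and the toy data below are hypothetical), the distance between two rows can be computed by hand when all weights \omega_{ijl} equal 1:

gower_dist <- function(xi, xj, ranges, is_num) {
  ## per-variable similarities s_ijl, assuming no missing values (weights = 1)
  s <- numeric(length(xi))
  for (l in seq_along(xi)) {
    if (is_num[l]) {
      s[l] <- 1 - abs(xi[l] - xj[l]) / ranges[l]   # numerical: range-scaled
    } else {
      s[l] <- as.numeric(xi[l] == xj[l])           # binary/categorical: 1 if equal
    }
  }
  1 - mean(s)                                      # d_ij = 1 - s_ij
}

x <- data.frame(num1 = c(1, 3, 5), num2 = c(10, 20, 40), cat1 = c(1, 2, 1))
rng <- c(diff(range(x$num1)), diff(range(x$num2)), NA)
gower_dist(unlist(x[1, ]), unlist(x[2, ]), ranges = rng,
           is_num = c(TRUE, TRUE, FALSE))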
wishart

Wishart (2003) proposed a different measure from Gower (1971) in the
numerical variable part. Instead of the range, it applies the variance of the
numerical variable in s_{ijl}, such that the distance becomes

d_{ij} = \sqrt{\sum_{l=1}^p \omega_{ijl} \left(\frac{x_{il} - x_{jl}}{\delta_{ijl}}\right)^2}

where \delta_{ijl} = s_l when l is a numerical variable and
\delta_{ijl} \in \{0, 1\} when l is a binary/categorical variable.
podani

Podani (1999) suggested a different method to compute a distance for a mixed
variable data set. The Podani distance is calculated by

d_{ij} = \sqrt{\sum_{l=1}^p \omega_{ijl} \left(\frac{x_{il} - x_{jl}}{\delta_{ijl}}\right)^2}

where \delta_{ijl} = R_l when l is a numerical variable and
\delta_{ijl} \in \{0, 1\} when l is a binary/categorical variable.
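The two measures therefore share the same form and differ only in the denominator used for the numerical variables: the standard deviation for wishart and the range for podani. A minimal sketch (assuming equal weights; the helper mixed_sq_dist and the toy data are hypothetical, not part of kmed):

mixed_sq_dist <- function(xi, xj, delta_num, is_num) {
  d2 <- 0
  for (l in seq_along(xi)) {
    if (is_num[l]) {
      d2 <- d2 + ((xi[l] - xj[l]) / delta_num[l])^2  # scaled numerical term
    } else {
      d2 <- d2 + as.numeric(xi[l] != xj[l])          # 1 for a mismatch, 0 for a match
    }
  }
  sqrt(d2)
}

x <- data.frame(num1 = c(1, 3, 5), num2 = c(10, 20, 40), cat1 = c(1, 2, 1))
is_num <- c(TRUE, TRUE, FALSE)
## wishart-style scaling: standard deviation of each numerical column
mixed_sq_dist(unlist(x[1, ]), unlist(x[2, ]),
              delta_num = c(sd(x$num1), sd(x$num2), NA), is_num = is_num)
## podani-style scaling: range of each numerical column
mixed_sq_dist(unlist(x[1, ]), unlist(x[2, ]),
              delta_num = c(diff(range(x$num1)), diff(range(x$num2)), NA),
              is_num = is_num)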
huang

The Huang (1997) distance between objects i and j is computed by

d_{ij} = \sum_{r=1}^{P_n} (x_{ir} - x_{jr})^2 + \gamma \sum_{s=1}^{P_c} \delta_c(x_{is}, x_{js})

where P_n and P_c are the number of numerical and categorical variables,
respectively,

\gamma = \frac{\sum_{r=1}^{P_n} s_{r}^2}{P_n}

with s_r^2 the variance of numerical variable r, and \delta_c(x_{is}, x_{js})
is the mismatch/simple matching distance (see matching) between objects i and
j in variable s.
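A minimal sketch of this computation for two rows, assuming the sample variance for s_r^2 (the helper huang_dist and the toy categorical columns are hypothetical, not taken from kmed):

huang_dist <- function(xi_num, xj_num, xi_cat, xj_cat, gamma) {
  ## squared Euclidean on the numerical part plus gamma times the
  ## simple matching distance on the categorical part
  sum((xi_num - xj_num)^2) + gamma * sum(xi_cat != xj_cat)
}

num <- iris[1:7, 1:3]
ctg <- data.frame(c1 = c(1, 2, 1, 3, 2, 1, 3),
                  c2 = c(2, 2, 1, 1, 2, 2, 1))
gamma <- mean(apply(num, 2, var))   # average variance of the numerical columns
huang_dist(unlist(num[1, ]), unlist(num[2, ]),
           unlist(ctg[1, ]), unlist(ctg[2, ]), gamma)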
harikumar

Harikumar and PV (2015) proposed a distance for a mixed variable data set:

d_{ij} = \sum_{r=1}^{P_n} |x_{ir} - x_{jr}| + \sum_{s=1}^{P_c} \delta_c(x_{is}, x_{js}) + \sum_{t=1}^{P_b} \delta_b(x_{it}, x_{jt})

where P_b is the number of binary variables, \delta_c(x_{is}, x_{js}) is the
co-occurrence distance (see cooccur), and \delta_b(x_{it}, x_{jt}) is the
Hamming distance.
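A minimal sketch of the numerical and binary parts of this decomposition (Manhattan plus Hamming distance); the categorical part needs the co-occurrence distance learned from the whole data set (see cooccur) and is omitted here, and the helper names are hypothetical:

manhattan_part <- function(xi, xj) sum(abs(xi - xj))   # numerical variables
hamming_part   <- function(xi, xj) sum(xi != xj)       # binary variables

set.seed(2)
num <- iris[1:7, 1:3]
bin <- matrix(sample(1:2, 7 * 3, replace = TRUE), 7, 3)
manhattan_part(unlist(num[1, ]), unlist(num[2, ])) +
  hamming_part(bin[1, ], bin[2, ])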
ahmad

Ahmad and Dey (2007) computed the distance of a mixed variable data set via

d_{ij} = \sum_{r=1}^{P_n} (x_{ir} - x_{jr})^2 + \sum_{s=1}^{P_c} \delta_c(x_{is}, x_{js})

where \delta_c(x_{is}, x_{js}) is the co-occurrence distance (see cooccur).
In the Ahmad and Dey distance, the binary and categorical variables are not
separable, so the co-occurrence distance is based on these two classes
combined, i.e. the binary and categorical variables together. Note that this
function applies the standard version of the squared Euclidean distance, i.e.
without any weight.
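A minimal sketch of this structure, assuming cooccur(data) returns a co-occurrence distance matrix for the combined binary and categorical columns (an assumption; see the cooccur help page for its exact interface):

set.seed(3)
num <- iris[1:7, 1:3]
bincat <- cbind(matrix(sample(1:2, 7 * 2, replace = TRUE), 7, 2),
                matrix(sample(1:3, 7 * 2, replace = TRUE), 7, 2))
d_num <- as.matrix(dist(num))^2   # unweighted squared Euclidean, numerical part
d_cat <- cooccur(bincat)          # assumed call: co-occurrence distance matrix
d_num + d_cat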
At least two of the idnum, idbin, and idcat arguments have to be provided
because this function calculates a mixed variable distance. If the method is
harikumar, at least two categorical variables are required so that the
co-occurrence distance can be computed. The same applies when
method = "ahmad": idbin and idcat together have to cover more than one column.
Otherwise, an error message is returned.
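For instance (a sketch with hypothetical toy columns), a data set containing only numerical and binary variables can be handled by providing idnum and idbin and leaving idcat at its default NULL:

set.seed(4)
numbin <- cbind(iris[1:7, 1:2], matrix(sample(1:2, 7 * 2, replace = TRUE), 7, 2))
distmix(numbin, method = "gower", idnum = 1:2, idbin = 3:4)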
Value
The function returns a distance matrix (n x n).
Author(s)
Weksi Budiaji
Contact: budiaji@untirta.ac.id
References
Ahmad, A., and Dey, L. 2007. A K-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering 63, pp. 503-527.
Gower, J., 1971. A general coefficient of similarity and some of its properties. Biometrics 27, pp. 857-871.
Harikumar, S., PV, S., 2015. K-medoid clustering for heterogeneous data sets. Procedia Computer Science 70, pp. 226-237.
Huang, Z., 1997. Clustering large data sets with mixed numeric and categorical values, in: The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 21-34.
Podani, J., 1999. Extending Gower's general coefficient of similarity to ordinal characters. Taxon 48, pp. 331-340.
Wishart, D., 2003. K-means clustering with outlier detection, mixed variables and missing values, in: Exploratory Data Analysis in Empirical Research: Proceedings of the 25th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Munich, March 14-16, 2001, Springer Berlin Heidelberg, Berlin, Heidelberg. pp. 216-226.
Examples
set.seed(1)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
a1 <- matrix(sample(1:3, 7*3, replace = TRUE), 7, 3)
mixdata <- cbind(iris[1:7,1:3], a, a1)
colnames(mixdata) <- c(paste(c("num"), 1:3, sep = ""),
                       paste(c("bin"), 1:3, sep = ""),
                       paste(c("cat"), 1:3, sep = ""))
distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)
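## The other methods can be requested the same way (a sketch; output not
## shown). harikumar and ahmad additionally need idbin and idcat to cover
## more than one column, which holds here.
distmix(mixdata, method = "wishart", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "podani", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "huang", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "harikumar", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "ahmad", idnum = 1:3, idbin = 4:6, idcat = 7:9)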