distmix {kmed}    R Documentation
Distances for mixed variables data set
Description
This function computes a distance matrix for a mixed variable data set using one of several methods.
Usage
distmix(data, method = "gower", idnum = NULL, idbin = NULL, idcat = NULL)
Arguments
data: A data frame or matrix object.
method: A method to calculate the mixed variables distance (see Details).
idnum: A vector of column indices of the numerical variables.
idbin: A vector of column indices of the binary variables.
idcat: A vector of column indices of the categorical variables.
Details
There are six methods available to calculate the mixed variable distance: gower, wishart, podani, huang, harikumar, and ahmad.
gower

The Gower (1971) distance is the most common distance for a mixed variable
data set. Although the Gower distance can accommodate missing values, a
missing value is not allowed in this function. If there is a missing value,
the Gower distance from the daisy function in the cluster package can be
applied. The Gower distance between objects i and j is computed by
d_{ij} = 1 - s_{ij}, where

s_{ij} = \frac{\sum_{l=1}^p \omega_{ijl} s_{ijl}}{\sum_{l=1}^p \omega_{ijl}}

\omega_{ijl} is a weight in variable l that is usually 1 or 0 (for a missing
value). If variable l is a numerical variable,

s_{ijl} = 1 - \frac{|x_{il} - x_{jl}|}{R_l}

where R_l is the range of variable l; s_{ijl} \in \{0, 1\} if variable l is a
binary/categorical variable, i.e. 1 for a match and 0 for a mismatch.
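As a rough illustration of this formula (a minimal sketch, not the package's internal code; the helper gower_dist and the toy data below are hypothetical), the distance between two rows can be computed by hand when all weights \omega_{ijl} equal 1:

gower_dist <- function(xi, xj, ranges, is_num) {
  ## per-variable similarities s_ijl, assuming no missing values (weights = 1)
  s <- numeric(length(xi))
  for (l in seq_along(xi)) {
    if (is_num[l]) {
      s[l] <- 1 - abs(xi[l] - xj[l]) / ranges[l]   # numerical: range-scaled
    } else {
      s[l] <- as.numeric(xi[l] == xj[l])           # binary/categorical: 1 if equal
    }
  }
  1 - mean(s)                                      # d_ij = 1 - s_ij
}

x <- data.frame(num1 = c(1, 3, 5), num2 = c(10, 20, 40), cat1 = c(1, 2, 1))
rng <- c(diff(range(x$num1)), diff(range(x$num2)), NA)
gower_dist(unlist(x[1, ]), unlist(x[2, ]), ranges = rng,
           is_num = c(TRUE, TRUE, FALSE))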
wishart

Wishart (2003) proposed a different measure from Gower (1971) in the
numerical variable part. Instead of the range, it applies the variance of the
numerical variable in s_{ijl}, such that the distance becomes

d_{ij} = \sqrt{\sum_{l=1}^p \omega_{ijl} \left(\frac{x_{il} - x_{jl}}{\delta_{ijl}}\right)^2}

where \delta_{ijl} = s_l when l is a numerical variable and
\delta_{ijl} \in \{0, 1\} when l is a binary/categorical variable.
podani

Podani (1999) suggested a different method to compute a distance for a mixed
variable data set. The Podani distance is calculated by

d_{ij} = \sqrt{\sum_{l=1}^p \omega_{ijl} \left(\frac{x_{il} - x_{jl}}{\delta_{ijl}}\right)^2}

where \delta_{ijl} = R_l when l is a numerical variable and
\delta_{ijl} \in \{0, 1\} when l is a binary/categorical variable.
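The two measures therefore share the same form and differ only in the denominator used for the numerical variables: the standard deviation for wishart and the range for podani. A minimal sketch (assuming equal weights; the helper mixed_sq_dist and the toy data are hypothetical, not part of kmed):

mixed_sq_dist <- function(xi, xj, delta_num, is_num) {
  d2 <- 0
  for (l in seq_along(xi)) {
    if (is_num[l]) {
      d2 <- d2 + ((xi[l] - xj[l]) / delta_num[l])^2  # scaled numerical term
    } else {
      d2 <- d2 + as.numeric(xi[l] != xj[l])          # 1 for a mismatch, 0 for a match
    }
  }
  sqrt(d2)
}

x <- data.frame(num1 = c(1, 3, 5), num2 = c(10, 20, 40), cat1 = c(1, 2, 1))
is_num <- c(TRUE, TRUE, FALSE)
## wishart-style scaling: standard deviation of each numerical column
mixed_sq_dist(unlist(x[1, ]), unlist(x[2, ]),
              delta_num = c(sd(x$num1), sd(x$num2), NA), is_num = is_num)
## podani-style scaling: range of each numerical column
mixed_sq_dist(unlist(x[1, ]), unlist(x[2, ]),
              delta_num = c(diff(range(x$num1)), diff(range(x$num2)), NA),
              is_num = is_num)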
huang

The Huang (1997) distance between objects i and j is computed by

d_{ij} = \sum_{r=1}^{P_n} (x_{ir} - x_{jr})^2 + \gamma \sum_{s=1}^{P_c} \delta_c(x_{is}, x_{js})

where P_n and P_c are the number of numerical and categorical variables,
respectively,

\gamma = \frac{\sum_{r=1}^{P_n} s_{r}^2}{P_n}

with s_r^2 the variance of numerical variable r, and \delta_c(x_{is}, x_{js})
is the mismatch/simple matching distance (see matching) between objects i and
j in variable s.
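A minimal sketch of this computation for two rows, assuming the sample variance for s_r^2 (the helper huang_dist and the toy categorical columns are hypothetical, not taken from kmed):

huang_dist <- function(xi_num, xj_num, xi_cat, xj_cat, gamma) {
  ## squared Euclidean on the numerical part plus gamma times the
  ## simple matching distance on the categorical part
  sum((xi_num - xj_num)^2) + gamma * sum(xi_cat != xj_cat)
}

num <- iris[1:7, 1:3]
ctg <- data.frame(c1 = c(1, 2, 1, 3, 2, 1, 3),
                  c2 = c(2, 2, 1, 1, 2, 2, 1))
gamma <- mean(apply(num, 2, var))   # average variance of the numerical columns
huang_dist(unlist(num[1, ]), unlist(num[2, ]),
           unlist(ctg[1, ]), unlist(ctg[2, ]), gamma)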
harikumar

Harikumar and PV (2015) proposed a distance for a mixed variable data set:

d_{ij} = \sum_{r=1}^{P_n} |x_{ir} - x_{jr}| + \sum_{s=1}^{P_c} \delta_c(x_{is}, x_{js}) + \sum_{t=1}^{P_b} \delta_b(x_{it}, x_{jt})

where P_b is the number of binary variables, \delta_c(x_{is}, x_{js}) is the
co-occurrence distance (see cooccur), and \delta_b(x_{it}, x_{jt}) is the
Hamming distance.
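A minimal sketch of the numerical and binary parts of this decomposition (Manhattan plus Hamming distance); the categorical part needs the co-occurrence distance learned from the whole data set (see cooccur) and is omitted here, and the helper names are hypothetical:

manhattan_part <- function(xi, xj) sum(abs(xi - xj))   # numerical variables
hamming_part   <- function(xi, xj) sum(xi != xj)       # binary variables

set.seed(2)
num <- iris[1:7, 1:3]
bin <- matrix(sample(1:2, 7 * 3, replace = TRUE), 7, 3)
manhattan_part(unlist(num[1, ]), unlist(num[2, ])) +
  hamming_part(bin[1, ], bin[2, ])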
ahmad

Ahmad and Dey (2007) computed the distance of a mixed variable data set via

d_{ij} = \sum_{r=1}^{P_n} (x_{ir} - x_{jr})^2 + \sum_{s=1}^{P_c} \delta_c(x_{is}, x_{js})

where \delta_c(x_{is}, x_{js}) is the co-occurrence distance (see cooccur).
In the Ahmad and Dey distance, the binary and categorical variables are not
separable, so the co-occurrence distance is based on these two classes
combined, i.e. the binary and categorical variables together. Note that this
function applies the standard version of the squared Euclidean distance, i.e.
without any weight.
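A minimal sketch of this structure, assuming cooccur(data) returns a co-occurrence distance matrix for the combined binary and categorical columns (an assumption; see the cooccur help page for its exact interface):

set.seed(3)
num <- iris[1:7, 1:3]
bincat <- cbind(matrix(sample(1:2, 7 * 2, replace = TRUE), 7, 2),
                matrix(sample(1:3, 7 * 2, replace = TRUE), 7, 2))
d_num <- as.matrix(dist(num))^2   # unweighted squared Euclidean, numerical part
d_cat <- cooccur(bincat)          # assumed call: co-occurrence distance matrix
d_num + d_cat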
At least two of the idnum, idbin, and idcat arguments have to be provided
because this function calculates a mixed variable distance. If the method is
harikumar, at least two categorical variables are required so that the
co-occurrence distance can be computed. The same applies when
method = "ahmad": idbin and idcat together have to cover more than one column.
Otherwise, an error message is returned.
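For instance (a sketch with hypothetical toy columns), a data set containing only numerical and binary variables can be handled by providing idnum and idbin and leaving idcat at its default NULL:

set.seed(4)
numbin <- cbind(iris[1:7, 1:2], matrix(sample(1:2, 7 * 2, replace = TRUE), 7, 2))
distmix(numbin, method = "gower", idnum = 1:2, idbin = 3:4)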
Value
The function returns a distance matrix (n x n).
Author(s)
Weksi Budiaji
Contact: budiaji@untirta.ac.id
References
Ahmad, A., and Dey, L. 2007. A K-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering 63, pp. 503-527.
Gower, J., 1971. A general coefficient of similarity and some of its properties. Biometrics 27, pp. 857-871.
Harikumar, S., PV, S., 2015. K-medoid clustering for heterogeneous data sets. Procedia Computer Science 70, pp. 226-237.
Huang, Z., 1997. Clustering large data sets with mixed numeric and categorical values, in: The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 21-34.
Podani, J., 1999. Extending Gower's general coefficient of similarity to ordinal characters. Taxon 48, pp. 331-340.
Wishart, D., 2003. K-means clustering with outlier detection, mixed variables and missing values, in: Exploratory Data Analysis in Empirical Research: Proceedings of the 25th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Munich, March 14-16, 2001, Springer Berlin Heidelberg, Berlin, Heidelberg. pp. 216-226.
Examples
set.seed(1)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
a1 <- matrix(sample(1:3, 7*3, replace = TRUE), 7, 3)
mixdata <- cbind(iris[1:7,1:3], a, a1)
colnames(mixdata) <- c(paste(c("num"), 1:3, sep = ""),
                       paste(c("bin"), 1:3, sep = ""),
                       paste(c("cat"), 1:3, sep = ""))
distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)
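## The other methods can be requested the same way (a sketch; output not
## shown). harikumar and ahmad additionally need idbin and idcat to cover
## more than one column, which holds here.
distmix(mixdata, method = "wishart", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "podani", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "huang", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "harikumar", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "ahmad", idnum = 1:3, idbin = 4:6, idcat = 7:9)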