R: Co-occurrence distance for binary/categorical data

cooccur {kmed}

R Documentation

Co-occurrence distance for binary/categorical data

Description

This function calculates the co-occurrence distance proposed by Ahmad and Dey (2007).

Usage

cooccur(data)

Arguments

data

A matrix or data frame of binary/categorical variables (see Details).

Details

This function computes the co-occurrence distance, a distance for binary/categorical variables in which the distance between two categories of one variable is based on the distributions of the other variables (see Examples). The Examples use the following data set:

object	x	y	z
1	1	2	2
2	1	2	1
3	2	1	2
4	2	1	2
5	1	1	1
6	2	2	2
7	2	1	2

The co-occurrence distance assigns a distance to each pair of categories within a variable, based on the distribution of the other variables. For example, the distance between categories 1 and 2 in the x variable can differ from the distance between categories 1 and 2 in the z variable. As an illustration, the distance between categories 1 and 2 in the z variable is derived here.

A cross tabulation of the x variable (rows) against the z variable (columns), with the corresponding column proportions to the right of the counts, is

	1	2	||	1	2
1	2	1	||	1.0	0.2
2	0	4	||	0.0	0.8

A cross tabulation of the y variable (rows) against the z variable (columns), with the corresponding column proportions to the right of the counts, is

	1	2	||	1	2
1	1	3	||	0.5	0.6
2	1	2	||	0.5	0.4

Next, the maximum proportion in each row of the two tables is taken, giving 1.0, 0.8, 0.6, and 0.5. The distance between categories 1 and 2 in the z variable is then

\delta_{1,2}^z = \frac{(1.0+0.8+0.6+0.5) - 2}{2} = 0.45

The constant 2 in the formula applies because the z variable depends on the distributions of the 2 other variables, i.e., the x and y variables. The distances between the categories of the x and y variables can be calculated in a similar way.

Thus, the distance between objects 1 and 2 is 0.45. Only the z variable contributes to this distance because objects 1 and 2 have identical values in both the x and y variables.
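
The arithmetic above can be reproduced in a few lines of base R. This is only a sketch of the worked example in this section, not the code used inside cooccur; the helper name category_distance and its arguments are hypothetical.

```r
# Data set from the Details section (objects 1 to 7).
dat <- data.frame(
  x = c(1, 1, 2, 2, 1, 2, 2),
  y = c(2, 2, 1, 1, 1, 2, 1),
  z = c(2, 1, 2, 2, 1, 2, 2)
)

# Hypothetical helper (not part of kmed): distance between two categories
# of `target`, derived from the distributions of all other variables.
category_distance <- function(data, target, cat_a, cat_b) {
  others <- setdiff(names(data), target)
  max_sum <- 0
  for (v in others) {
    # Rows: categories of the other variable; columns: categories of target.
    tab <- table(data[[v]], data[[target]])
    prop <- prop.table(tab, margin = 2)  # column proportions
    cols <- prop[, c(as.character(cat_a), as.character(cat_b)), drop = FALSE]
    # Row-wise maximum of the two column proportions, summed over the rows.
    max_sum <- max_sum + sum(apply(cols, 1, max))
  }
  # Subtract and divide by the number of other variables (2 in the example).
  (max_sum - length(others)) / length(others)
}

delta_z <- category_distance(dat, "z", 1, 2)
print(delta_z)
```

For the z variable the four row maxima are 1.0, 0.8, 0.6, and 0.5, so the helper evaluates (2.9 - 2) / 2, matching the value derived above.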

The data argument can be either a matrix or a data frame whose elements are of class integer. Elements of any other class are converted to integer. If the data consist of a single variable only, the simple matching distance is calculated instead and a warning message appears, because the co-occurrence distance depends on the distribution of other variables and therefore cannot be computed.
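
If the categorical variables are stored as factors, one way to obtain integer codes before calling the function is base R's data.matrix. This is a minimal sketch with a made-up data frame (the column names colour and size are illustrative only); it does not claim to mirror the conversion that cooccur performs internally.

```r
# Illustrative data frame with two factor columns (hypothetical data).
df <- data.frame(
  colour = factor(c("red", "blue", "red", "green")),
  size   = factor(c("S", "L", "L", "S"))
)

# data.matrix() replaces each factor by its internal integer codes,
# which follow the (alphabetical) order of the factor levels.
int_mat <- data.matrix(df)

print(int_mat)
print(is.integer(int_mat))
```

The integer codes carry no ordering information for the distance; they only label the categories.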

Value

The function returns an n x n distance matrix, where n is the number of objects.

Author(s)

Weksi Budiaji
Contact: budiaji@untirta.ac.id

References

Ahmad, A., and Dey, L. 2007. A K-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering 63, pp. 503-527.

Harikumar, S., and PV, S. 2015. K-medoid clustering for heterogeneous data sets. Procedia Computer Science 70, pp. 226-237.

Examples

set.seed(1)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
cooccur(a)
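
The default behaviour of sample() changed in R 3.6.0, so set.seed(1) may not reproduce the data set shown in Details on recent R versions. As a sketch, the same matrix can be entered explicitly; the column names and the requireNamespace guard are additions for illustration.

```r
# Data set from the Details section, entered column by column.
a <- matrix(
  c(1L, 1L, 2L, 2L, 1L, 2L, 2L,   # x
    2L, 2L, 1L, 1L, 1L, 2L, 1L,   # y
    2L, 1L, 2L, 2L, 1L, 2L, 2L),  # z
  nrow = 7, ncol = 3,
  dimnames = list(NULL, c("x", "y", "z"))
)

# Run the distance only when kmed is installed.
if (requireNamespace("kmed", quietly = TRUE)) {
  d <- kmed::cooccur(a)
  # Details reports a distance of 0.45 between objects 1 and 2.
  print(as.matrix(d)[1, 2])
}
```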

[Package kmed version 0.4.2 Index]