cooccur {kmed} | R Documentation |
Co-occurrence distance for binary/categorical data
Description
This function calculates the co-occurrence distance proposed by Ahmad and Dey (2007).
Usage
cooccur(data)
Arguments
data |
A matrix or data frame of binary/categorical variables (see Details). |
Details
This function computes the co-occurrence distance, a distance for binary/categorical variables that is based on the distribution of the other variables (see Examples). The data set used in the Examples is:
object | x | y | z |
1 | 1 | 2 | 2 |
2 | 1 | 2 | 1 |
3 | 2 | 1 | 2 |
4 | 2 | 1 | 2 |
5 | 1 | 1 | 1 |
6 | 2 | 2 | 2 |
7 | 2 | 1 | 2 |
The co-occurrence distance defines the distance between the categories of a binary/categorical variable from the distribution of the other variables. As a result, the distance between categories 1 and 2 in the x variable can differ from the distance between categories 1 and 2 in the z variable. As an example, the distance between categories 1 and 2 in the z variable is derived here.
The cross tabulation of the x variable (rows) against the z variable (columns), with the corresponding column proportions on the right, is
x \ z | 1 | 2 | || | 1 | 2 |
1 | 2 | 1 | || | 1.0 | 0.2 |
2 | 0 | 4 | || | 0.0 | 0.8 |
The cross tabulation of the y variable (rows) against the z variable (columns), with the corresponding column proportions on the right, is
y \ z | 1 | 2 | || | 1 | 2 |
1 | 1 | 3 | || | 0.5 | 0.6 |
2 | 1 | 2 | || | 0.5 | 0.4 |
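The two cross tabulations can be reproduced with base R. This is a minimal sketch, not the package's own code: the object names (`dat`, `tab_xz`, `prop_xz`, and so on) are chosen here for illustration, and the data frame is typed in by hand from the table in this section.

```r
# Data set from the Details section (7 objects, variables x, y, z)
dat <- data.frame(
  x = c(1, 1, 2, 2, 1, 2, 2),
  y = c(2, 2, 1, 1, 1, 2, 1),
  z = c(2, 1, 2, 2, 1, 2, 2)
)

# Counts: rows are categories of x (or y), columns are categories of z
tab_xz <- table(x = dat$x, z = dat$z)
tab_yz <- table(y = dat$y, z = dat$z)

# Column proportions: each column (a category of z) sums to 1
prop_xz <- prop.table(tab_xz, margin = 2)
prop_yz <- prop.table(tab_yz, margin = 2)

prop_xz
prop_yz
```

Using `margin = 2` normalises within each category of z, which is what "column proportion" refers to in the tables.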
Next, the maximum proportion in each row is taken, giving 1.0 and 0.8 from the first table and 0.6 and 0.5 from the second. The distance between categories 1 and 2 in the z variable is
\delta_{1,2}^z = \frac{(1.0+0.8+0.6+0.5) - 2}{2} = 0.45
The constant 2 in the formula applies because the z variable depends on the distributions of 2 other variables, i.e. the x and y variables. The distances between the categories of the x and y variables can be calculated in a similar way.
Thus, the distance between objects 1 and 2 is 0.45. Only the z variable contributes to this distance because objects 1 and 2 have identical values in both the x and y variables.
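The worked example can be checked numerically with a short base R sketch. The helpers `category_distance()` and `object_distance()` are hypothetical names introduced here, not functions of the package. The sketch assumes that every variable has exactly two categories, and that the distance between two objects is the sum of the category distances over the variables on which they differ. That aggregation rule is an assumption that agrees with the example above; it is not taken from the source of `cooccur()`.

```r
# Data set from the Details section (7 objects, variables x, y, z)
dat <- data.frame(
  x = c(1, 1, 2, 2, 1, 2, 2),
  y = c(2, 2, 1, 1, 1, 2, 1),
  z = c(2, 1, 2, 2, 1, 2, 2)
)

# Hypothetical helper: distance between the two categories of `target`,
# based on the column proportions of every other variable in `data`.
# Assumes `target` has exactly two categories.
category_distance <- function(data, target) {
  others <- setdiff(names(data), target)
  row_max_sum <- sum(vapply(others, function(v) {
    prop <- prop.table(table(data[[v]], data[[target]]), margin = 2)
    sum(apply(prop, 1, max))  # row-wise maxima of the column proportions
  }, numeric(1)))
  m <- length(others)         # number of other variables (2 in this example)
  (row_max_sum - m) / m
}

# Hypothetical helper (assumed aggregation): add the category distance of
# each variable on which objects i and j take different values
object_distance <- function(data, i, j) {
  differs <- unlist(data[i, ]) != unlist(data[j, ])
  sum(vapply(names(data)[differs],
             function(v) category_distance(data, v), numeric(1)))
}

category_distance(dat, "z")  # approximately 0.45, as in the formula above
object_distance(dat, 1, 2)   # only z differs, so the same value results
```

Objects with identical values on every variable, such as objects 3 and 4, get a distance of 0 under this sketch because no variable contributes.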
The data argument can be supplied as either a matrix or a data frame whose elements are of integer class. Elements that are not integers are converted to integer class. If the data consist of a single variable, simple matching is calculated instead, because the co-occurrence distance depends on the distribution of other variables and therefore cannot be computed; a warning message appears.
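The single-variable case can be illustrated in base R. This is only a sketch of the idea: the names `v`, `codes`, and `simple_matching` are illustrative, the conversion through `factor()` is one possible way to obtain integer codes and not necessarily what `cooccur()` does internally, and any scaling the package applies to the simple matching distance is not covered here.

```r
# A single categorical variable stored as character values
v <- c("a", "b", "a", "c")

# One possible conversion to integer codes (an assumption for illustration)
codes <- as.integer(factor(v))

# Simple matching on one variable: 0 if two objects share a category,
# 1 otherwise
simple_matching <- outer(codes, codes, FUN = "!=") * 1
simple_matching
```

With only one variable there is no other distribution to condition on, so every pair of distinct categories is treated as equally far apart.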
Value
The function returns an n x n distance matrix, where n is the number of objects (rows) in data.
Author(s)
Weksi Budiaji
Contact: budiaji@untirta.ac.id
References
Ahmad, A., and Dey, L. 2007. A K-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering 63, pp. 503-527.
Harikumar, S., and PV, S. 2015. K-medoid clustering for heterogeneous data sets. Procedia Computer Science 70, pp. 226-237.
Examples
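library(kmed)
set.seed(1)
# 7 objects described by 3 binary variables with categories 1 and 2
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
# 7 x 7 co-occurrence distance matrix
cooccur(a)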