R: Normalized Compression Distance

NC.dist {shipunov}

R Documentation

Normalized Compression Distance

Description

Calculates the normalized compression distance between the rows of a matrix or data frame.

Usage

NC.dist(data, method="gzip", character=TRUE)

Arguments

`data`	Matrix (or data frame) with variables that should be used in the computation of the distance between rows.
`method`	Compression type passed to memCompress(): "gzip", "bzip2", or "xz"; "xz" is very slow.
`character`	If TRUE (default), convert the data to character mode; if FALSE, use it as raw.
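To get a feel for how the 'method' choices differ, the compressed sizes of one vector can be compared directly with memCompress(). This is only a rough sketch: the data are simulated, the variable names are illustrative, and the sizes depend on the input.

```r
# Simulated input, converted to character as in the default mode
set.seed(1)
x <- as.character(round(runif(1000), 3))

# Compressed size in bytes for each compression type;
# memCompress() returns a raw vector, so length() is the byte count
sizes <- sapply(c("gzip", "bzip2", "xz"),
                function(m) length(memCompress(x, type = m)))
sizes
```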

Details

NC.dist() computes the distance based on the sizes of the compressed vectors. It is calculated as

dissimilarity(x, y) = (B(x, y) - max(B(x), B(y))) / min(B(x), B(y))

where B(x) and B(y) are the sizes in bytes of the compressed 'x' and 'y', and B(x, y) is the compressed size in bytes of the concatenated 'x' and 'y'. The algorithm uses the base R function memCompress().
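For a single pair of vectors, the calculation can be sketched with memCompress() alone. This is a minimal illustration, not the package's implementation: ncd_pair() is a hypothetical helper, the numerator is assumed to be grouped as (B(x, y) - max(B(x), B(y))), and the inputs are converted to character as in the default mode.

```r
# Hypothetical helper (not part of shipunov): dissimilarity of two vectors,
# assuming the numerator is (B(x, y) - max(B(x), B(y)))
ncd_pair <- function(x, y, method = "gzip") {
  # memCompress() accepts character or raw vectors and returns a raw vector,
  # so length() gives the compressed size in bytes
  bx  <- length(memCompress(as.character(x), type = method))
  by  <- length(memCompress(as.character(y), type = method))
  bxy <- length(memCompress(as.character(c(x, y)), type = method))
  (bxy - max(bx, by)) / min(bx, by)
}

# Two repetitive vectors over different alphabets (illustrative data)
x <- rep(c("a", "b", "c", "d"), 50)
y <- rep(c("w", "x", "y", "z"), 50)

ncd_pair(x, x)  # a vector against itself
ncd_pair(x, y)  # two unrelated vectors
```

Because the compressor adds a fixed header and cannot always exploit all redundancy, the value for identical inputs is typically small but not exactly zero.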

If the argument is a data frame, NC.dist() internally converts it into a matrix. By default, all columns are converted into character mode (or into raw if 'character=FALSE'). This default behavior makes NC.dist() a universal distance that also tolerates NAs and zeroes.
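The effect of the two modes can be seen on a small example. The sketch below assumes the conversion is comparable to as.matrix() followed by a mode change; the internal code of NC.dist() may differ, and the data frame is made up for illustration.

```r
# Illustrative data frame with a fraction, an NA and a zero
df <- data.frame(a = c(1.5, NA, 0), b = c(2, 3, 4))
m <- as.matrix(df)            # data frame -> numeric matrix

# Character mode: every cell becomes a string, NA stays a missing string
m_chr <- m
mode(m_chr) <- "character"
m_chr

# Raw mode: only integers 0..255 survive; fractions are truncated, and
# NA or out-of-range values become 00 (R emits a warning, suppressed here)
m_raw <- m
suppressWarnings(mode(m_raw) <- "raw")
m_raw
```

In raw mode the NA and the zero in column 'a' end up as the same byte, which is one reason the character default is the more general choice.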

Value

Distance object with distances among the rows of 'data'.

Author(s)

Alexey Shipunov

References

Cilibrasi, R., & Vitanyi, P. M. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523-1545.

Examples


## converts variables into character, universal method
iris.nc <- NC.dist(iris[, -5])
iris.hnc <- hclust(iris.nc, method="ward.D2")
## amazingly, it works even for vectors with length=4 (iris data rows)
plot(prcomp(iris[, -5])$x, col=cutree(iris.hnc, 3))

## use variables as raw; this is good when they are uniform
iris.nc2 <- NC.dist(iris[, -5], character=FALSE)
iris.hnc2 <- hclust(iris.nc2, method="ward.D2")
plot(prcomp(iris[, -5])$x, col=cutree(iris.hnc2, 3))

## bzip2 uses Burrows-Wheeler transform
NC.dist(matrix(runif(100), ncol=10), method="bzip2")

[Package shipunov version 1.17.1 Index]