R: K-means Clustering

Cluster {s2dv}

R Documentation

K-means Clustering

Description

Compute cluster centers and their time series of occurrences, using the K-means clustering method with Euclidean distance, for an input array with any number of dimensions that includes at least time_dim. Specifically, it partitions the array along the time axis into K groups or clusters, in which each space vector/array belongs to (i.e., is a member of) the cluster with the nearest center or centroid. This function is a wrapper of kmeans() and relies on the NbClust package (Charrad et al., 2014, JSS) to determine the optimal number of clusters for K-means clustering if it is not provided by the user.
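
As an informal illustration of the partitioning described above, the sketch below calls stats::kmeans() directly on a small synthetic time-by-space matrix. It is not the internal code of Cluster(); the object names (x, regime, centers_true) and the synthetic data are assumptions made only for this example.

```r
# Minimal sketch with base R only; all object names are illustrative.
set.seed(42)
n_time <- 60
n_space <- 4

# Two well-separated regimes along the time axis (first half, second half)
regime <- rep(c(1, 2), each = n_time / 2)
centers_true <- rbind(c(0, 0, 0, 0), c(5, 5, 5, 5))

# Rows are time steps, columns are spatial points
x <- centers_true[regime, ] +
  matrix(rnorm(n_time * n_space, sd = 0.3), nrow = n_time, ncol = n_space)

# K-means with Euclidean distance; nstart restarts guard against poor initial centers
km <- stats::kmeans(x, centers = 2, nstart = 10)

km$cluster  # cluster membership of each time step
km$centers  # one row per cluster, one column per spatial point
```

Each row (time step) is assigned to the cluster whose center is nearest, which is the sense in which the array is partitioned along the time axis.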

Usage

Cluster(
  data,
  weights = NULL,
  time_dim = "sdate",
  space_dim = NULL,
  nclusters = NULL,
  index = "sdindex",
  ncores = NULL
)

Arguments

`data`	A numeric array with named dimensions that include at least 'time_dim', corresponding to time, and optionally 'space_dim', corresponding to either area averages over a series of domains or the grid points of any spatial grid structure.
`weights`	A numeric array with named dimensions containing multiplicative weights based on the area covered by each domain/region or grid cell of 'data'. Its dimensions must match the 'space_dim' dimensions of 'data'. The default value is NULL, which means no weighting is applied.
`time_dim`	A character string indicating the name of the time dimension in 'data'. The default value is 'sdate'.
`space_dim`	A character vector indicating the names of spatial dimensions in 'data'. The default value is NULL.
`nclusters`	An integer K greater than 1 indicating the number of clusters to be computed, or K initial cluster centers to be used in the method. The default value is NULL, which means that the number of clusters will be determined by NbClust(). In that case, the parameter 'index' needs to be specified so that NbClust() can find the optimal number of clusters for the K-means clustering calculation.
`index`	A character string naming the validity index from the NbClust package used to determine the optimal K when K is not specified with 'nclusters'. The default value is 'sdindex' (Halkidi et al. 2001, JIIS). The indices available in NbClust are "kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", "ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "hubert", "sdindex", and "sdbw". One can also use all of them with the option 'alllong', or all indices except gap, gamma, gplus and tau with 'all'; in both cases the optimal number of clusters K is determined by the majority rule (the maximum of the histogram of the results of all indices with finite solutions). Some indices can be computationally intensive on a large and/or unstructured dataset and/or could lead to numerical singularity.
`ncores`	An integer indicating the number of cores to use for parallel computation. The default value is NULL.
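
The majority rule mentioned for 'index' can be pictured with the short base R sketch below. The per-index values of K are made up for illustration and are not NbClust output, and the names k_by_index, k_finite, votes and k_opt are hypothetical; the sketch only shows the idea of taking the most frequent finite suggestion, not how the package implements it.

```r
# Hypothetical K suggested by each index (illustrative values only)
k_by_index <- c(kl = 3, ch = 3, hartigan = 4, db = 3,
                silhouette = 2, dunn = Inf, sdbw = NA)

# Keep only indices with a finite solution (drops Inf and NA)
k_finite <- k_by_index[is.finite(k_by_index)]

# Histogram of the suggestions, then the K with the most votes
votes <- table(k_finite)
k_opt <- as.integer(names(votes)[which.max(votes)])
k_opt
```

In this sketch a tie would be resolved in favor of the smallest K, because which.max() returns the first maximum; whether the package behaves the same way in a tie is not stated here.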

Value

A list containing:

`$cluster`	An integer array of the occurrence of a cluster along time, i.e., the cluster to which each data member in time is allocated. The dimensions are the same as 'data' without 'space_dim'.
`$centers`	A numeric array of cluster centers or centroids (e.g. [1:K, 1:spatial degrees of freedom]). The remaining dimensions are the same as 'data' except 'time_dim' and 'space_dim'.
`$totss`	A numeric array of the total sum of squares. The dimensions are the same as 'data' except 'time_dim' and 'space_dim'.
`$withinss`	A numeric array of the within-cluster sum of squares, one component per cluster. The first dimension is the number of clusters, and the remaining dimensions are the same as 'data' except 'time_dim' and 'space_dim'.
`$tot.withinss`	A numeric array of the total within-cluster sum of squares, i.e., sum(withinss). The dimensions are the same as 'data' except 'time_dim' and 'space_dim'.
`$betweenss`	A numeric array of the between-cluster sum of squares, i.e., totss - tot.withinss. The dimensions are the same as 'data' except 'time_dim' and 'space_dim'.
`$size`	A numeric array of the number of points in each cluster. The first dimension is the number of clusters, and the remaining dimensions are the same as 'data' except 'time_dim' and 'space_dim'.
`$iter`	A numeric array of the number of (outer) iterations. The dimensions are the same as 'data' except 'time_dim' and 'space_dim'.
`$ifault`	A numeric array of an indicator of a possible algorithm problem. The dimensions are the same as 'data' except 'time_dim' and 'space_dim'.
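
The relationships among the sum-of-squares components listed above can be checked on the output of stats::kmeans(), which this function wraps. The following is a small sketch on synthetic data with illustrative object names (x, km), run on a plain matrix rather than through Cluster().

```r
# Two synthetic groups of 20 points in 2 dimensions (illustrative data)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

km <- stats::kmeans(x, centers = 2, nstart = 5)

# tot.withinss is the sum of the per-cluster within sums of squares
all.equal(km$tot.withinss, sum(km$withinss))

# betweenss is what remains of totss after removing tot.withinss
all.equal(km$betweenss, km$totss - km$tot.withinss)

# The cluster sizes add up to the number of points (time steps)
sum(km$size) == nrow(x)
```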

References

Wilks, 2011, Statistical Methods in the Atmospheric Sciences, 3rd ed., Elsevier, pp 676.

Examples

# Generating synthetic data
a1 <- array(dim = c(200, 4))
mean1 <- 0
sd1 <- 0.3 

c0 <- seq(1, 200)
c1 <- sort(sample(x = 1:200, size = sample(x = 50:150, size = 1), replace = FALSE))
x1 <- c(1, 1, 1, 1)
for (i1 in c1) {
 a1[i1, ] <- x1 + rnorm(4, mean = mean1, sd = sd1)
}

c1p5 <- c0[!(c0 %in% c1)]
c2 <- c1p5[seq(1, length(c1p5), 2)] 
x2 <- c(2, 2, 4, 4)
for (i2 in c2) {
 a1[i2, ] <- x2 + rnorm(4, mean = mean1, sd = sd1)
}

c3 <- c1p5[seq(2, length(c1p5), 2)]
x3 <- c(3, 3, 1, 1)
for (i3 in c3) {
 a1[i3, ] <- x3 + rnorm(4, mean = mean1, sd = sd1)
}

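# Computing the clusters
names(dim(a1)) <- c('sdate', 'space')
# K fixed by the user (three clusters)
res1 <- Cluster(data = a1, weights = array(1, dim = dim(a1)[2]), nclusters = 3)
# 'nclusters' left as NULL, so K is chosen by NbClust() with the default index 'sdindex'
res2 <- Cluster(data = a1, weights = array(1, dim = dim(a1)[2]))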

[Package s2dv version 2.0.0 Index]