R: Trimmed Lloyd k-means for 3D shapes

trimmedLloydShapes {Anthropometry}

R Documentation

Trimmed Lloyd k-means for 3D shapes

Description

The basic foundation of k-means is that the sample mean is the value that minimizes the Euclidean distance from each point, to the centroid of the cluster to which it belongs. Two fundamental concepts of the statistical shape analysis are the Procrustes mean and the Procrustes distance. Therefore, by integrating the Procrustes mean and the Procrustes distance we can use k-means in the shape analysis context.

The k-means method has been proposed by several scientists in different forms. In computer science and pattern recognition the k-means algorithm is often termed the Lloyd algorithm (see Lloyd (1982)).

This function is proposed to incorporate a modification to LloydShapes in order to make the k-means algorithm robust. Robustness is a property very desirable in a lot of applications. As it is well known, the results of the k-means algorithm can be influenced by outliers and extreme data, or bridging points between clusters. Garcia-Escudero et al. (1999) propose a way of making k-means more robust, which combines the k-means idea with an impartial trimming procedure: a proportion alpha (between 0 and 1) of observations are trimmed (the trimmed observations are self-determined by the data). See also trimmedoid.

Note that in the generic name of the k-means algorithm, k refers to the number of clusters to search for. To be more specific in the R code, k is referred to as numClust, see next section arguments.

Usage

trimmedLloydShapes(array3D,n,alpha,numClust,algSteps=10,niter=10,
                   stopCr=0.0001,verbose)

Arguments

`array3D`	Array with the 3D landmarks of the sample objects. Each row corresponds to an observation, and each column corresponds to a dimension (x,y,z).
`n`	Number of individuals.
`alpha`	Proportion of trimmed sample.
`numClust`	Number of clusters.
`algSteps`	Number of steps per initialization. Default value is 10.
`niter`	Number of random initializations (iterations). Default value is 10.
`stopCr`	Relative stopping criteria. Default value is 0.0001.
`verbose`	A logical specifying whether to provide descriptive output about the running process.

Value

A list with the following elements:

asig: Optimal clustering.

cases: Anthropometric cases (optimal centers).

vopt: Optimal objective function.

trimmWomen: List to save the trimmed individual of each iteration.

trimmsIter: Vector with the number of iterations where the optimum was reached. The last number different from NA refers to the last iteration where the final optimum was reached.

bestNstep: Nstep of the iteration where the optimum has reached.

initials: Random initial values used in each iteration. These values can be used by HartiganShapes.

discarded: Discarded (trimmed) observations.

Note

We note that adding a trimmed procedure to the Lloyd algorithm is very direct and easy, while for the Hartigan-Wong algorithm, more modifications of the algorithm are needed, which makes the implementation of its trimmed version difficult.

Author(s)

Amelia Simo

References

Vinue, G., Simo, A., and Alemany, S., (2016). The k-means algorithm for 3D shapes with an application to apparel design, Advances in Data Analysis and Classification 10(1), 103–132.

Lloyd, S. P., (1982). Least Squares Quantization in PCM, IEEE Transactions on Information Theory 28, 129–137.

Dryden, I. L., and Mardia, K. V., (1998). Statistical Shape Analysis, Wiley, Chichester.

Garcia-Escudero, L. A., Gordaliza, A., and Matran, C., (2003). Trimming tools in exploratory data analysis, Journal of Computational and Graphical Statistics 12(2), 434–449.

Garcia-Escudero, L. A., and Gordaliza, A., (1999). Robustness properties of k-means and trimmed k-means, Journal of the American Statistical Association 94(447), 956–969.

Examples

#CLUSTERING INDIVIDUALS ACCORDING TO THEIR SHAPE:
landmarksNoNa <- na.exclude(landmarksSampleSpaSurv)
dim(landmarksNoNa) 
#[1] 574 198 
numLandmarks <- (dim(landmarksNoNa)[2]) / 3
#[1] 66
#As a toy example, only the first 10 individuals are used.
landmarksNoNa_First10 <- landmarksNoNa[1:10, ] 
(numIndiv <- dim(landmarksNoNa_First10)[1])
#[1] 10         
    
array3D <- array3Dlandm(numLandmarks, numIndiv, landmarksNoNa_First10)

numClust <- 2 ; alpha <- 0.01 ; algSteps <- 1 ; niter <- 1 ; stopCr <- 0.0001
#For reproducing results, seed for randomness:
#suppressWarnings(RNGversion("3.5.0"))
#set.seed(2013)
res <- trimmedLloydShapes(array3D, numIndiv, alpha, numClust, 
                          algSteps, niter, stopCr, FALSE)

#Optimal partition and prototypes:
clust <- res$asig 
table(clust)
prototypes <- anthrCases(res)

#Trimmed individuals:
trimmed <- trimmOutl(res)

[Package Anthropometry version 1.19 Index]