ApplyPAM {scellpam}    R Documentation

ApplyPAM

Description

A function to implement the Partitioning-around-medoids algorithm described in
Schubert, E. and Rousseeuw, P.J.: "Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms."
Information Systems, vol. 101, p. 101804, 2021.
doi: https://doi.org/10.1016/j.is.2021.101804
Notice that the actual values of the vectors (instances) are not needed, only their pairwise dissimilarities. To recover the values, look at the data matrix used to generate the dissimilarity matrix.
The number of instances, N, is not passed as an argument because the dissimilarity matrix is N x N, so its size determines N.

Usage

ApplyPAM(
  dissim_file,
  k,
  init_method = "BUILD",
  initial_med = NULL,
  max_iter = 1000L,
  nthreads = 0L
)

Arguments

dissim_file

A string with the name of the binary file that contains the symmetric matrix of dissimilarities. The matrix should have been generated by CalcAndWriteDissimilarityMatrix.

k

A positive integer (the desired number of medoids).

init_method

One of the strings 'PREV', 'BUILD' or 'LAB'. See the original paper for the meaning of the initialization algorithms BUILD and LAB.
'PREV' should be used exclusively to start the second part of the algorithm (optimization) from an initial set of medoids generated by a previous call.
Default: 'BUILD'.

initial_med

A vector with the initial medoids from which to start the optimization. It is used only by the 'PREV' method and must have been obtained as the first element (L$med) of the two-element list returned by a previous call to this function in just-initialize mode (max_iter=0).
Default: empty vector.

max_iter

The maximum number of allowed iterations. 0 means stop immediately after finding initial medoids.
Default: 1000

nthreads

The number of threads to use.
-1 means don't use threads (serial implementation).
0 means let the program choose according to the number of cores and of points.
Any other value forces that number of threads. Choosing more threads than the number of available cores is allowed, but discouraged.
Default: 0

Details

In the returned list L, L$med has as many components as requested medoids and L$clasif has as many components as instances.
Each medoid is expressed in L$med by its number in the array of points (its row in the dissimilarity matrix), starting at 1 (R convention).
L$clasif contains, for each instance, the number of the medoid (i.e., the cluster) to which it has been assigned, according to the order of the medoids in L$med (also starting at 1).
This means that if L$clasif[p] is m, point p belongs to the class grouped around medoid L$med[m].
Moreover, if the dissimilarity matrix contains the cell names as metadata (row names), the returned vector is an R named vector with those names.
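
The indexing convention above can be tried out in base R without the package. The following is a minimal sketch under stated assumptions: the toy data, the choice of medoids and the variable names are all hypothetical, and the list is built by hand to mimic the documented layout, so it is not actual ApplyPAM output.

```r
# Toy data: 6 points on a line, in two obvious groups (illustrative only)
x <- c(a = 0, b = 1, c = 2, d = 10, e = 11, f = 12)
D <- as.matrix(dist(x))          # full N x N dissimilarity matrix

# Hypothetical medoids, given as 1-based row numbers of D
med <- c(b = 2, e = 5)

# Assign every point to its closest medoid: position within 'med', from 1
clasif <- apply(D[, med, drop = FALSE], 1, which.min)

# Hand-built list with the same layout as the documented return value
L <- list(med = med, clasif = clasif)

# Row number and name of the medoid that represents each point
medoid_row  <- L$med[L$clasif]
medoid_name <- names(L$med)[L$clasif]

# Sum of dissimilarities from each point to its own medoid
total_dissim <- sum(D[cbind(seq_along(L$clasif), L$med[L$clasif])])
```

Here L$med[L$clasif] maps every point to the row of its medoid, which is the relation "if L$clasif[p] is m, point p belongs to medoid L$med[m]" applied to all points at once.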

Value

L["med","clasif"] A list of two numeric vectors, L$med and L$clasif. See the Details section for more information.

Examples

# Synthetic problem: 10 random seeds with coordinates in [0..20]
# to which random values in [-0.1..0.1] are added
M <- matrix(0, 100, 500)
rownames(M) <- paste0("rn", 1:100)
for (i in 1:10)
{
 # Seed i: 500 coordinates drawn uniformly from [0, 20]
 p <- 20 * runif(500)
 # Noise in [-0.1, 0.1] for the 10 points generated around this seed
 Rf <- matrix(0.2 * (runif(5000) - 0.5), nrow = 10)
 for (k in 1:10)
 {
  M[10 * (i - 1) + k, ] <- p + Rf[k, ]
 }
}
tmpfile1 <- file.path(tempdir(), "pamtest.bin")
JWriteBin(M, tmpfile1, dtype = "float", dmtype = "full")
tmpdisfile1 <- file.path(tempdir(), "pamDL2.bin")
CalcAndWriteDissimilarityMatrix(tmpfile1, tmpdisfile1, distype = "L2", restype = "float", nthreads = 0)
L <- ApplyPAM(tmpdisfile1,10,init_method="BUILD")
# Final value of sum of distances to closest medoid
GetTD(L,tmpdisfile1)
# Medoids:
L$med
# Name of the medoid to which each individual has been assigned
n<-names(L$med)
n[L$clasif]
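
The just-initialize mode (max_iter=0) and the 'PREV' method described under Arguments can be chained into a two-stage run. The sketch below is one possible way to do so: two_stage_pam is a hypothetical helper, not part of scellpam, and it uses only the ApplyPAM arguments documented on this page.

```r
# Hypothetical helper (not part of scellpam): run initialization and
# optimization as two separate calls to ApplyPAM.
two_stage_pam <- function(dissim_file, k, init_method = "BUILD", max_iter = 1000L) {
  # Stage 1: stop right after finding the initial medoids (max_iter = 0)
  L0 <- scellpam::ApplyPAM(dissim_file, k, init_method = init_method, max_iter = 0L)
  # Stage 2: optimize, starting from the medoids found in stage 1
  L1 <- scellpam::ApplyPAM(dissim_file, k, init_method = "PREV",
                           initial_med = L0$med, max_iter = max_iter)
  # Return both results so that they can be compared, e.g. with GetTD
  list(initial = L0, final = L1)
}
```

With the file created in the example above, a call such as res <- two_stage_pam(tmpdisfile1, 10) would allow comparing GetTD(res$initial, tmpdisfile1) with GetTD(res$final, tmpdisfile1) to see how much the optimization stage improves on the initialization.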

[Package scellpam version 1.4.5 Index]