FilterBySilhouetteQuantile {parallelpam}    R Documentation

FilterBySilhouetteQuantile

Description

Takes a silhouette, as returned by CalculateSilhouette; the list of medoids and class assignments, as returned by ApplyPAM; a quantile; and the files holding the matrices of values and dissimilarities. It builds the corresponding filtered matrices, removing the points whose silhouette is below the given quantile, except those that are medoids.

Usage

FilterBySilhouetteQuantile(
  s,
  L,
  fallcounts,
  ffilcounts,
  falldissim,
  ffildissim,
  q = 0.2,
  addcom = TRUE
)

Arguments

s

A numeric vector with the silhouette coefficient of each point in a classification, as returned by CalculateSilhouette.

L

A list of two numeric vectors, L$med and L$clasif, obtained normally as the object returned by ApplyPAM.

fallcounts

A string with the name of the binary file containing the matrix of data per point. It can be either a full or a sparse matrix.

ffilcounts

A string with the name of the binary file that will contain the selected points. It will have the same kind (full/sparse) and data type as the complete file.

falldissim

A string with the name of the binary file containing the dissimilarity matrix of the complete set of points. It must be a symmetric matrix.

ffildissim

A string with the name of the binary file that will contain the dissimilarity matrix for the remaining points. It will be a symmetric matrix.

q

Quantile to filter. All points whose silhouette is below this quantile will be filtered out. Default: 0.2

addcom

Boolean to indicate whether a comment must be appended to the current comment of the values and dissimilarity matrices to indicate that they are the result of a filtering process. This comment is generated automatically and contains the value of the quantile q. Successive applications add comments at the end of those already present. Default: TRUE
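The selection rule described by q can be sketched in base R. This is only an illustration under stated assumptions: the cutoff is taken here as the empirical quantile given by stats::quantile with its default type, which may differ from the definition the package uses internally, and the values of s and med are made up for the example.

```r
# Hypothetical silhouette values for 10 points (not produced by the package).
s <- c(0.90, 0.10, 0.75, -0.20, 0.60, 0.85, 0.30, 0.05, 0.70, 0.95)
med <- c(8, 6)   # hypothetical medoid indices; point 8 has a low silhouette
q <- 0.2

# Assumed cutoff: the empirical q-quantile of the silhouette values.
cutoff <- quantile(s, probs = q, names = FALSE)

# Keep the points at or above the cutoff, plus every medoid whatever its value.
keep <- s >= cutoff
keep[med] <- TRUE

# Indices of the points that would survive the filter.
which(keep)
```

In this sketch points 4 and 8 fall below the cutoff, but point 8 is retained because it is a medoid.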

Details

The renumbering of indices in the returned list may seem confusing at first, but it is needed to stay consistent with the rest of the package. Note that if the numeric vectors in the input parameter L are named vectors, the point names are kept in the result, so point identity is preserved. Moreover, if the values and dissimilarity input matrices have row and/or column names, these are preserved in the filtered matrices, too.
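The renumbering can be illustrated with a small base R sketch. It is not the package's implementation; the vectors clasif, med and keep are hypothetical and only show how positions shift once points are removed, and how names keep point identity recoverable.

```r
# Hypothetical input: 8 named points in 2 clusters, medoids at original positions 2 and 7.
clasif <- c(p1 = 1, p2 = 1, p3 = 1, p4 = 1, p5 = 2, p6 = 2, p7 = 2, p8 = 2)
med <- c(p2 = 2, p7 = 7)
keep <- c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)  # p1 and p6 are removed

# New position of every surviving point in the filtered set (NA if removed).
newpos <- rep(NA_integer_, length(keep))
newpos[keep] <- seq_len(sum(keep))

# Medoids get their new positions but keep their names.
med_new <- setNames(newpos[med], names(med))

# Cluster labels of the surviving points are unchanged.
clasif_new <- clasif[keep]

med_new
clasif_new
```

Here the medoid p2 moves from position 2 to position 1 and p7 from position 7 to position 5, while names(clasif_new)[med_new] still yields "p2" and "p7".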

Value

Lr["med","clasif"] A list of two numeric vectors.
Lr$med is a modification of the corresponding first element of the passed L parameter.
Lr$clasif has as many components as remaining instances.
Since points will have been removed, medoid numbering is modified. Therefore, Lr$med has the NEW index of each medoid in the filtered set.
Lr$clasif contains the number of the medoid (i.e.: the cluster) to which each instance has been assigned, and therefore does not change.
All indexes start at 1 (R convention). Please see the Details section.
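A quick consistency check on a returned list might look like the sketch below. The helper check_filtered is hypothetical and not part of parallelpam. It assumes that cluster labels run from 1 to the number of medoids and that the i-th medoid carries label i, which should be verified against the ApplyPAM documentation.

```r
# Hypothetical helper, not part of parallelpam.
check_filtered <- function(Lr) {
  k <- length(Lr$med)      # number of medoids (clusters)
  n <- length(Lr$clasif)   # number of remaining points
  all(Lr$med >= 1 & Lr$med <= n) &&          # medoid indices point into the filtered set
    all(Lr$clasif %in% seq_len(k)) &&        # cluster labels lie in 1..k
    all(Lr$clasif[Lr$med] == seq_len(k))     # assumed: medoid i is labelled with cluster i
}

# Made-up filtered result with 6 remaining points and medoids at new positions 1 and 5.
Lr <- list(med = c(1, 5), clasif = c(1, 1, 1, 2, 2, 2))
check_filtered(Lr)
```

Passing medoid indices that still refer to the unfiltered set (for example 2 and 7 with only 6 remaining points) makes the check fail, which is the typical mistake the renumbering is meant to avoid.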

Examples

# Synthetic problem: 10 random seeds with coordinates in [0..20]
# to which random values in [-0.1..0.1] are added
M<-matrix(0,100,500)
rownames(M)<-paste0("rn",c(1:100))
for (i in (1:10))
{
 p<-20*runif(500)
 Rf <- matrix(0.2*(runif(5000)-0.5),nrow=10)
 for (k in (1:10))
 {
  M[10*(i-1)+k,]=p+Rf[k,]
 }
}
tmpfile1=paste0(tempdir(),"/pamtest.bin")
JWriteBin(M,tmpfile1,dtype="float",dmtype="full")
tmpdisfile1=paste0(tempdir(),"/pamDl2.bin")
CalcAndWriteDissimilarityMatrix(tmpfile1,tmpdisfile1,distype="L2",restype="float",nthreads=0)
L <- ApplyPAM(tmpdisfile1,10,init_method="BUILD")
# Show which points are the medoids
L$med
sil <- CalculateSilhouette(L$clasif,tmpdisfile1)
tmpfiltfile1=paste0(tempdir(),"/pamtestfilt.bin")
tmpfiltdisfile1=paste0(tempdir(),"/pamDL2filt.bin")
Lf<-FilterBySilhouetteQuantile(sil,L,tmpfile1,tmpfiltfile1,tmpdisfile1,tmpfiltdisfile1,
                               q=0.4,addcom=TRUE)
# The new medoids are the same points but renumbered, since the filtered set has fewer points
Lf$med

[Package parallelpam version 1.4 Index]