FilterBySilhouetteQuantile {scellpam} | R Documentation |
FilterBySilhouetteQuantile
Description
Takes a silhouette, as returned by CalculateSilhouette, the list of medoids and class assignments, as returned by ApplyPAM,
a quantile, and the binary files holding the matrices of counts and dissimilarities, and builds the corresponding filtered matrices by removing the points (cells) whose silhouette is
below the given quantile, unless they are medoids.
Usage
FilterBySilhouetteQuantile(
s,
L,
fallcounts,
ffilcounts,
falldissim,
ffildissim,
q = 0.2,
addcom = TRUE
)
Arguments
s |
A numeric vector with the silhouette coefficient of each point (cell) in a classification, as returned by CalculateSilhouette. |
L |
A list of two numeric vectors, L$med and L$clasif, normally the object returned by ApplyPAM. |
fallcounts |
A string with the name of the binary file containing the matrix of counts per cell. It can be either a full or a sparse matrix. |
ffilcounts |
A string with the name of the binary file that will contain the selected cells. It will be of the same kind (full/sparse) and data type as the complete file. |
falldissim |
A string with the name of the binary file containing the dissimilarity matrix of the complete set of cells. It must be a symmetric matrix. |
ffildissim |
A string with the name of the binary file that will contain the dissimilarity matrix for the remaining cells. It will be a symmetric matrix. |
q |
Quantile to filter. All points (cells) whose silhouette is below this quantile will be filtered out. Default: 0.2 |
addcom |
Boolean indicating whether a comment must be appended to the current comment of the counts and dissimilarity matrices to record that they are the result of a filtering process. This comment is generated automatically and contains the value of the quantile q. Successive applications add comments after those already present. Default: TRUE |
Details
The renumbering of indices in the returned list may seem confusing at first, but it is needed for consistency with the rest of the package. Note that if the numeric vectors in the input parameter L were named vectors, the cell names are kept in the result, so cell identity is preserved. Likewise, if the input counts and dissimilarity matrices had row and/or column names, these are preserved in the filtered matrices.
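The renumbering described above can be illustrated with a small base R sketch. It is not the package's implementation: the toy data, the `keep` and `newidx` helpers, the use of `quantile()` with its default method, and the `>=` comparison are all assumptions made for illustration, and no binary files are read or written.

```r
# Illustrative sketch only (base R); NOT the scellpam implementation.
# Toy data (hypothetical): 8 cells, 2 clusters, medoids at positions 2 and 6.
s <- c(0.10, 0.90, 0.80, 0.05, 0.70, 0.85, 0.20, 0.60)
names(s) <- paste0("cell", 1:8)
L <- list(med = c(2, 6), clasif = c(1, 1, 1, 1, 2, 2, 2, 2))
names(L$med) <- names(s)[L$med]
names(L$clasif) <- names(s)
q <- 0.25

# Silhouette threshold for the requested quantile
# (assumption: R's default quantile method).
thr <- quantile(s, probs = q, names = FALSE)

# Keep cells at or above the threshold; medoids are always kept.
keep <- s >= thr
keep[L$med] <- TRUE

# Position of each original cell within the filtered set.
newidx <- cumsum(keep)

# Medoids get their NEW positions; cluster numbers stay the same.
Lr <- list(med = newidx[L$med], clasif = L$clasif[keep])

Lr$med     # cell2 and cell6, now at positions 1 and 4
Lr$clasif  # six remaining cells, cluster numbers unchanged
```

In this toy case cells 1 and 4 are dropped, so the medoid originally at position 6 moves to position 4, while its name (cell6) still identifies it.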
Value
Lr["med","clasif"] A list of two numeric vectors.
Lr$med is a modification of the corresponding first element of the passed L parameter.
Lr$clasif has as many components as remaining instances.
Since points (cells) will have been removed, medoid numbering is modified. Therefore, Lr$med has the NEW index of each medoid in the filtered set.
Lr$clasif contains the number of the medoid (i.e., the cluster) to which each remaining instance has been assigned; these cluster numbers do not change.
All indexes start at 1 (R convention). See the Details section.
Examples
# Synthetic problem: 10 random seeds with coordinates in [0..20]
# to which random values in [-0.1..0.1] are added
M<-matrix(0,100,500)
rownames(M)<-paste0("rn",c(1:100))
for (i in (1:10))
{
  p<-20*runif(500)
  Rf <- matrix(0.2*(runif(5000)-0.5),nrow=10)
  for (k in (1:10))
  {
    M[10*(i-1)+k,]=p+Rf[k,]
  }
}
tmpfile1=paste0(tempdir(),"/pamtest.bin")
JWriteBin(M,tmpfile1,dtype="float",dmtype="full")
tmpdisfile1=paste0(tempdir(),"/pamDl2.bin")
CalcAndWriteDissimilarityMatrix(tmpfile1,tmpdisfile1,distype="L2",restype="float",nthreads=0)
L <- ApplyPAM(tmpdisfile1,10,init_method="BUILD")
# Which are the medoids
L$med
sil <- CalculateSilhouette(L$clasif,tmpdisfile1)
tmpfiltfile1=paste0(tempdir(),"/pamtestfilt.bin")
tmpfiltdisfile1=paste0(tempdir(),"/pamDL2filt.bin")
Lf<-FilterBySilhouetteQuantile(sil,L,tmpfile1,tmpfiltfile1,tmpdisfile1,tmpfiltdisfile1,
q=0.4,addcom=TRUE)
# The new medoids are the same points but renumbered, since the filtered set has fewer points
Lf$med