purity {NMF} | R Documentation |
Purity and Entropy of a Clustering
Description
The functions purity and entropy respectively compute the purity and the entropy of a clustering given a priori known classes.
Purity and entropy measure the ability of a clustering method to recover known classes (e.g. when the true class label of each sample is known). Both measures are applicable even when the number of clusters differs from the number of known classes. Kim and Park (2007) used these measures to evaluate the performance of their alternating least-squares NMF algorithm.
Usage
purity(x, y, ...)
entropy(x, y, ...)
## S4 method for signature 'NMFfitXn,ANY'
purity(x, y, method = "best",
...)
## S4 method for signature 'NMFfitXn,ANY'
entropy(x, y, method = "best",
...)
Arguments
x |
an object that can be interpreted as a factor or
can generate such an object, e.g. via a suitable method
|
y |
a factor or an object coerced into a factor that
gives the true class labels for each sample. It may be
missing if |
... |
extra arguments to allow extension, and usually passed to the next method. |
method |
a character string that specifies how the
value is computed. It may be either |
Details
Suppose we are given l categories, while the clustering method generates k clusters.
The purity of the clustering with respect to the known categories is given by:
Purity = \frac{1}{n} \sum_{q=1}^k \max_{1 \leq j \leq l} n_q^j ,
where:
- n is the total number of samples;
- n_q^j is the number of samples in cluster q that belong to original class j (1 \leq j \leq l).
The purity is therefore a real number in [0,1]. The larger the purity, the better the clustering performance.
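As an illustration of the purity formula, here is a minimal base R sketch that computes it from two label vectors. The helper name purity_sketch and the toy labels are hypothetical and are not part of the NMF package; the sketch assumes clusters are placed in the rows of the contingency table and known classes in the columns.

```r
# Hypothetical helper (not the NMF implementation): purity from label vectors.
purity_sketch <- function(clusters, classes) {
  # Contingency table: rows are clusters q, columns are known classes j,
  # so tab[q, j] corresponds to n_q^j in the formula
  tab <- table(clusters, classes)
  # Largest class count within each cluster, summed and divided by n
  sum(apply(tab, 1, max)) / sum(tab)
}

# Toy data: 8 samples, 3 clusters, 3 known classes
clusters <- c(1, 1, 1, 2, 2, 2, 3, 3)
classes  <- c("a", "a", "b", "b", "b", "b", "c", "a")
purity_sketch(clusters, classes)  # (2 + 3 + 1) / 8 = 0.75
```

Here cluster 2 is pure (all three samples are of class "b"), while clusters 1 and 3 mix classes, which pulls the purity below 1.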
The entropy of the clustering with respect to the known categories is given by:
Entropy = - \frac{1}{n \log_2 l} \sum_{q=1}^k \sum_{j=1}^l n_q^j \log_2 \frac{n_q^j}{n_q} ,
where:
- n is the total number of samples;
- n_q is the total number of samples in cluster q (1 \leq q \leq k);
- n_q^j is the number of samples in cluster q that belong to original class j (1 \leq j \leq l).
The smaller the entropy, the better the clustering performance.
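The entropy formula can be sketched in base R in the same way. The helper name entropy_sketch and the toy labels are hypothetical and are not part of the NMF package. The sketch assumes the usual convention that a term with n_q^j = 0 contributes 0 to the sum, and that there are at least two known classes, so that \log_2 l is non-zero.

```r
# Hypothetical helper (not the NMF implementation): entropy from label vectors.
entropy_sketch <- function(clusters, classes) {
  # Contingency table: rows are clusters q, columns are known classes j
  tab <- table(clusters, classes)
  n   <- sum(tab)       # total number of samples
  l   <- ncol(tab)      # number of known classes (assumed >= 2)
  n_q <- rowSums(tab)   # cluster sizes
  # n_q^j * log2(n_q^j / n_q); dividing by n_q recycles over rows
  terms <- tab * log2(tab / n_q)
  # Assumed convention: empty cells contribute 0 (0 * log2(0) := 0)
  terms[tab == 0] <- 0
  -sum(terms) / (n * log2(l))
}

# Toy data: 8 samples, 3 clusters, 3 known classes
clusters <- c(1, 1, 1, 2, 2, 2, 3, 3)
classes  <- c("a", "a", "b", "b", "b", "b", "c", "a")
entropy_sketch(clusters, classes)  # 3 * log2(3) / (8 * log2(3)) = 0.375
```

A clustering in which every cluster contains a single class gives an entropy of 0 under this sketch, consistent with smaller values indicating better performance.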
Value
purity returns the purity (i.e. a single numeric value).
entropy returns the entropy (i.e. a single numeric value).
Methods
- entropy signature(x = "table", y = "missing"): Computes the entropy directly from the contingency table x. This is the workhorse method that is eventually called by all other methods.
- entropy signature(x = "factor", y = "ANY"): Computes the entropy on the contingency table of x and y, which is coerced into a factor if necessary.
- entropy signature(x = "ANY", y = "ANY"): Default method that should work for results of clustering algorithms that have a suitable predict method returning the cluster membership vector: the entropy is computed between predict(x) and y.
- entropy signature(x = "NMFfitXn", y = "ANY"): Computes the best or mean entropy across all NMF fits stored in x.
- purity signature(x = "table", y = "missing"): Computes the purity directly from the contingency table x.
- purity signature(x = "factor", y = "ANY"): Computes the purity on the contingency table of x and y, which is coerced into a factor if necessary.
- purity signature(x = "ANY", y = "ANY"): Default method that should work for results of clustering algorithms that have a suitable predict method returning the cluster membership vector: the purity is computed between predict(x) and y.
- purity signature(x = "NMFfitXn", y = "ANY"): Computes the best or mean purity across all NMF fits stored in x.
References
Kim H and Park H (2007). "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis." _Bioinformatics (Oxford, England)_, *23*(12), pp. 1495-502. ISSN 1460-2059, <URL: http://dx.doi.org/10.1093/bioinformatics/btm134>, <URL: http://www.ncbi.nlm.nih.gov/pubmed/17483501>.
See Also
Other assess: sparseness
Examples
# generate a synthetic dataset with known classes: 50 features, 18 samples (5+5+8)
n <- 50; counts <- c(5, 5, 8);
V <- syntheticNMF(n, counts)
cl <- unlist(mapply(rep, 1:3, counts))
# perform default NMF with rank=2
x2 <- nmf(V, 2)
purity(x2, cl)
entropy(x2, cl)
# perform default NMF with rank=3
x3 <- nmf(V, 3)
purity(x3, cl)
entropy(x3, cl)