Reaper-class {Thresher} | R Documentation |
Class "Reaper"
Description
The Reaper class implements the second step in the algorithm to
combine outlier detection with clustering. The first step, implemented
in the Thresher class, performs principal components analysis and
computes the PC dimension. Features with short loading vectors are
identified as outliers. The remaining features are clustered, based on
the directions of their loading vectors, using mixtures of von
Mises-Fisher distributions.
Usage
Reaper(thresher, useLoadings = FALSE, cutoff = 0.3,
metric = NULL, linkage="ward.D2",
maxSampleGroups = 0, ...)
Arguments
thresher |
A |
useLoadings |
A logical value; should model-based clustering using von Mises-Fisher distributions be performed in the principal component space? |
cutoff |
A real number; the loading vector length used to separate outliers from significant contributors. |
metric |
A character string containing the name of a clustering metric
recognized by either |
linkage |
A character string containing the name of a linkage rule
recognized by |
maxSampleGroups |
An integer; the maximum number of sample groups to be indicated by color in plots of the object. |
... |
Additional arguments to be passed to the
|
Details
Using the dimension computed when constructing the Thresher object, we
compute the lengths of the loading vectors associated with the features
in the data set. Features whose loading vector length is less than a
specified cutoff are identified as outliers and removed. (Based on
extensive simulations, the default cutoff is taken to be 0.3.) We then
refit the Thresher model on the remaining features, which should, in
theory, leave the PC dimension, D, unchanged. We then rescale the
remaining loading vectors to unit length, so they can be viewed as
points on a hypersphere. In order to cluster points on a hypersphere,
we use a model based on a mixture of von Mises-Fisher distributions.
We fit mixtures for every integer N in the range D \le N \le 2D+1;
this range accounts for the possibility that each axis has both
positively and negatively correlated features. The extra +1 handles
the degenerate case when D=0. The best fit is determined using the
Bayesian Information Criterion (BIC). The final step is to compute a
SignalSet; see the description of that class for more details.
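The outlier-removal, refitting, and rescaling steps described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: it assumes that loading vectors are scaled as feature-by-component correlations (eigenvectors of the standardized data multiplied by the component standard deviations), and it fixes D = 2 by hand rather than estimating it as the Thresher constructor does. The names X, D, feature_loadings, delta, and unit are hypothetical, and the mixture-model fit itself is not shown.

```r
# Simulate 300 samples of 16 features: two correlated blocks plus two noise features
set.seed(250264)
sigma <- matrix(0, ncol = 16, nrow = 16)
sigma[1:7, 1:7] <- 0.7
sigma[8:14, 8:14] <- 0.3
diag(sigma) <- 1
X <- matrix(rnorm(300 * 16), nrow = 300) %*% chol(sigma)

D <- 2  # assumed PC dimension; the Thresher constructor estimates this from the data

# Hypothetical helper. Assumption: loadings are eigenvectors scaled by the
# component standard deviations, so each entry is the correlation between
# a feature and a principal component.
feature_loadings <- function(data, D) {
  pca <- prcomp(data, center = TRUE, scale. = TRUE)
  sweep(pca$rotation[, 1:D, drop = FALSE], 2, pca$sdev[1:D], "*")
}

# Length of each feature's loading vector in the D-dimensional PC space
delta <- sqrt(rowSums(feature_loadings(X, D)^2))

# Features with short loading vectors are flagged as outliers
cutoff <- 0.3
keep <- delta >= cutoff

# Refit on the retained features, then rescale each loading vector to
# unit length so the features become points on a hypersphere
refit <- feature_loadings(X[, keep, drop = FALSE], D)
unit <- refit / sqrt(rowSums(refit^2))
```

In this simulation the two unstructured features (columns 15 and 16) should fall below the cutoff, while the rows of unit give the directions that a mixture of von Mises-Fisher distributions would then be fitted to.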
Value
The Reaper function returns an object of the Reaper class.
Objects from the Class
Objects should be defined using the Reaper constructor. In the
simplest case, you simply pass in a previously computed Thresher
object.
Slots
useLoadings: Logical; should model-based clustering be performed in PC space?
keep: Logical vector; which of the features (columns) should be retained as meaningful signal instead of being removed as outliers?
nGroups: Object of class "number or miss"; the optimal number of groups/clusters found by the algorithm. If all of the fits fail, this is NA.
fit: Object of class "fit or miss"; the best mixture model fit. Can be NA if something goes wrong when trying to fit mixture models.
allfits: Object of class "list"; a list, each of whose entries should be the result of fitting a mixture model with a different number of components.
bic: Object of class "number or miss"; the optimal value of the Bayesian Information Criterion; can be NA if all attempts to fit models fail.
metric: A character string; the preferred distance metric for hierarchical clustering. If not specified by the user, then this is computed using the bestMetric function.
signalSet: Object of class "SignalSet".
maxSampleGroups: An integer; the maximum number of sample groups to be distinguished by color in plots of the object.
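As a brief illustration of the slots listed above, the following sketch builds a Reaper object from simulated data (the same simulation used in the Examples section) and inspects a few slots directly. It assumes the Thresher package is installed; the guard simply skips the code otherwise.

```r
if (requireNamespace("Thresher", quietly = TRUE)) {
  library(Thresher)

  # Same simulated covariance structure as in the Examples section
  set.seed(250264)
  sigma1 <- matrix(0, ncol = 16, nrow = 16)
  sigma1[1:7, 1:7] <- 0.7
  sigma1[8:14, 8:14] <- 0.3
  diag(sigma1) <- 1
  st <- SimThresher(sigma1, nSample = 300)
  reap <- Reaper(st)

  print(reap@keep)             # logical vector: TRUE for features retained as signal
  print(reap@nGroups)          # number of clusters chosen, or NA if every fit failed
  print(reap@bic)              # BIC of the selected fit, or NA
  print(reap@metric)           # distance metric used for hierarchical clustering
  print(length(reap@allfits))  # one entry per candidate number of components
  print(is(reap, "Thresher"))  # Reaper extends Thresher
}
```

Because nGroups, fit, and bic can all be NA when the mixture fits fail, code that consumes these slots may want to check for NA before using them.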
Extends
Class "Thresher", directly.
Methods
- makeFigures
signature(object = "Reaper"): This is a convenience function to produce a standard set of figures. In addition to the plots produced for Thresher objects, this function also produces heatmaps where sample clustering depends on either the continuous or binary signal sets. If the DIR argument is non-null, it is treated as the name of an existing directory where the figures are stored as PNG files. Otherwise, the figures are displayed interactively, one at a time, in a window on screen.
- getColors
signature(object = "Reaper"): Returns the vector of colors assigned to the clustered columns in the data set.
- getSplit
signature(object = "Reaper"): Returns the vector of colors assigned to the clustered rows in the data set.
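A short sketch of calling these methods, assuming the Thresher package is installed. The directory name out_dir is a hypothetical choice; since DIR must name an existing directory, the sketch creates it first, and it only writes figures when the R build reports PNG support.

```r
if (requireNamespace("Thresher", quietly = TRUE)) {
  library(Thresher)

  # Same simulated covariance structure as in the Examples section
  set.seed(250264)
  sigma1 <- matrix(0, ncol = 16, nrow = 16)
  sigma1[1:7, 1:7] <- 0.7
  sigma1[8:14, 8:14] <- 0.3
  diag(sigma1) <- 1
  reap <- Reaper(SimThresher(sigma1, nSample = 300))

  cols <- getColors(reap)   # colors assigned to the clustered columns
  split <- getSplit(reap)   # colors assigned to the clustered rows
  print(table(cols))

  # Hypothetical output directory; it must exist before makeFigures is called
  out_dir <- file.path(tempdir(), "reaper-figures")
  dir.create(out_dir, showWarnings = FALSE)
  if (capabilities("png")) {
    makeFigures(reap, DIR = out_dir)  # writes the standard figures as PNG files
    print(list.files(out_dir))
  }
}
```

Leaving DIR at its default instead displays the figures one at a time on screen, which is usually more convenient in an interactive session.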
Author(s)
Kevin R. Coombes <krc@silicovore.com>, Min Wang.
References
Wang M, Abrams ZB, Kornblau SM, Coombes KR. Thresher: determining the number of clusters while removing outliers. BMC Bioinformatics, 2018; 19(1):1-9. doi:10.1186/s12859-017-1998-9.
Wang M, Kornblau SM, Coombes KR. Decomposing the Apoptosis Pathway Into Biologically Interpretable Principal Components. bioRxiv, 2017. doi:10.1101/237883.
Banerjee A, Dhillon IS, Ghosh J, Sra S. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 2005; 6:1345–1382.
Hornik K, Grün B. movMF: An R Package for Fitting Mixtures of von Mises-Fisher Distributions. Journal of Statistical Software, 2014; 58(10):1–31.
See Also
Examples
# Simulate a data set with some structure
set.seed(250264)
sigma1 <- matrix(0, ncol=16, nrow=16)
sigma1[1:7, 1:7] <- 0.7
sigma1[8:14, 8:14] <- 0.3
diag(sigma1) <- 1
st <- SimThresher(sigma1, nSample=300)
# Threshing is completed; now we can reap
reap <- Reaper(st)
screeplot(reap, col='pink', lcol='red')
scatter(reap)
plot(reap)
heat(reap)