mt_cluster_k {mousetrap} | R Documentation |
Estimate optimal number of clusters.
Description
Estimates the optimal number of clusters (k) using various methods.
Usage
mt_cluster_k(
data,
use = "ln_trajectories",
dimensions = c("xpos", "ypos"),
kseq = 2:15,
compute = c("stability", "gap", "jump", "slope"),
method = "hclust",
weights = rep(1, length(dimensions)),
pointwise = TRUE,
minkowski_p = 2,
hclust_method = "ward.D",
kmeans_nstart = 10,
n_bootstrap = 10,
model_based = FALSE,
n_gap = 10,
na_rm = FALSE,
verbose = FALSE
)
Arguments
data |
a mousetrap data object created using one of the mt_import
functions (see mt_example for details). Alternatively, a trajectory
array can be provided directly (in this case |
use |
a character string specifying which trajectory data should be used. |
dimensions |
a character vector specifying which trajectory variables should be used. Can be of length 2 or 3, for two-dimensional or three-dimensional trajectories respectively. |
kseq |
a numeric vector specifying the set of candidates for k. Defaults to
2:15, implying that all values of k within that range are compared using
the metrics specified in compute. |
compute |
character vector specifying the measures to be computed. Can
be any subset of "stability", "gap", "jump", and "slope". |
method |
character string specifying the type of clustering procedure
for the stability-based method. Either |
weights |
numeric vector specifying the relative importance of the
variables specified in dimensions. |
pointwise |
boolean specifying the way in which dissimilarity between
the trajectories is measured. If |
minkowski_p |
an integer specifying the distance metric for the cluster
solution. |
hclust_method |
character string specifying the linkage criterion used.
Passed on to the |
kmeans_nstart |
integer specifying the number of reruns of the kmeans
procedure. Larger numbers minimize the risk of finding local minima. Passed
on to the |
n_bootstrap |
an integer specifying the number of bootstrap comparisons
used by |
model_based |
boolean specifying whether the model-based or the
model-free method should be used by |
n_gap |
integer specifying the number of simulated datasets used by
|
na_rm |
logical specifying whether trajectory points containing NAs should be removed. Removal is done column-wise. That is, if any trajectory has a missing value at, e.g., the 10th recorded position, the 10th position is removed for all trajectories. This is necessary to compute distance between trajectories. |
verbose |
logical indicating whether the function should report its progress. |
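The column-wise removal described for na_rm can be illustrated with a minimal base-R sketch. It assumes a toy trajectory array laid out as trajectories x positions x dimensions; that layout and all object names are assumptions made for this illustration and not necessarily how the package handles missing values internally.

```r
# Hypothetical toy array: 3 trajectories x 5 positions x 2 dimensions.
# The dimension layout is an assumption made for this illustration.
set.seed(1)
traj <- array(
  rnorm(3 * 5 * 2),
  dim = c(3, 5, 2),
  dimnames = list(paste0("id", 1:3), NULL, c("xpos", "ypos"))
)

# Introduce a missing value at the 4th position of the 2nd trajectory.
traj[2, 4, "xpos"] <- NA

# Flag every position at which any trajectory has an NA in any dimension.
has_na <- apply(traj, 2, function(pos) any(is.na(pos)))

# Drop the flagged positions for all trajectories at once.
traj_clean <- traj[, !has_na, , drop = FALSE]

dim(traj_clean)
```

After removal, every trajectory has the same number of positions and no missing values, so distances between trajectories can be computed.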
Details
mt_cluster_k estimates the number of clusters (k) using four
commonly used k-selection methods (specified via compute): cluster
stability (stability), the gap statistic (gap), the jump
statistic (jump), and the slope statistic (slope).
Cluster stability methods select k as the number of clusters for which
the assignment of objects to clusters is most stable across bootstrap
samples. This function implements the model-based and model-free methods
described by Haslbeck & Wulff (2020). See references.
The remaining three methods select k as the value that optimizes the
gap statistic (Tibshirani, Walther, & Hastie, 2001), the jump statistic
(Sugar & James, 2003), and the slope statistic (Fujita, Takahashi, &
Patriota, 2014), respectively.
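To give a feel for how such a statistic yields one value per candidate k, the following is a simplified base-R sketch of the gap statistic idea on toy two-dimensional points. It is not the package's implementation: the helper names within_ss and gap_path are hypothetical, the reference data are drawn uniformly over the observed range, and k is chosen by the plain maximum rather than the standard-error rule of Tibshirani et al. (2001).

```r
# Within-cluster sum of squares for a hierarchical clustering cut at k.
within_ss <- function(x, k) {
  cl <- cutree(hclust(dist(x), method = "ward.D"), k = k)
  sum(vapply(split(as.data.frame(x), cl), function(g) {
    g <- as.matrix(g)
    sum(sweep(g, 2, colMeans(g))^2)
  }, numeric(1)))
}

# Simplified gap statistic: mean log(W_k) over uniform reference data
# minus log(W_k) for the observed data, for each k in kseq.
gap_path <- function(x, kseq = 2:6, n_gap = 10) {
  ranges <- apply(x, 2, range)
  vapply(kseq, function(k) {
    ref <- replicate(n_gap, {
      sim <- apply(ranges, 2, function(r) runif(nrow(x), r[1], r[2]))
      log(within_ss(sim, k))
    })
    mean(ref) - log(within_ss(x, k))
  }, numeric(1))
}

set.seed(42)
# Toy data: three well-separated groups of 20 points in two dimensions.
centers <- rbind(c(0, 0), c(5, 5), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(20, centers[i, 1], 0.3), rnorm(20, centers[i, 2], 0.3))
}))

kseq <- 2:6
gap <- gap_path(x, kseq)          # one gap value per candidate k
k_gap <- kseq[which.max(gap)]     # candidate with the largest gap
```

The vector gap plays the role of a path over kseq, and k_gap is the optimum read off that path.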
For clustering trajectories, it is often useful that the endpoints of all trajectories share the same direction, e.g., that all trajectories end in the top-left corner of the coordinate system (mt_remap_symmetric or mt_align can be used to achieve this). Furthermore, it is recommended to use length normalized trajectories (see mt_length_normalize; Wulff et al., 2019).
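The idea of giving all trajectories a common end direction can be sketched in base R as follows. This only illustrates the concept on hypothetical toy data and a hypothetical helper flip_left; it is not how mt_remap_symmetric or mt_align are implemented.

```r
# Hypothetical toy trajectories stored as x/y coordinate matrices.
trajs <- list(
  cbind(xpos = c(0, -0.5, -1), ypos = c(0, 0.8, 1.5)),  # ends top-left
  cbind(xpos = c(0,  0.4,  1), ypos = c(0, 0.7, 1.5))   # ends top-right
)

# Mirror a trajectory across the y-axis if its final x position is positive,
# so that every trajectory ends on the left side.
flip_left <- function(tr) {
  if (tr[nrow(tr), "xpos"] > 0) tr[, "xpos"] <- -tr[, "xpos"]
  tr
}

aligned <- lapply(trajs, flip_left)

# Final x position of each aligned trajectory.
end_x <- vapply(aligned, function(tr) tr[nrow(tr), "xpos"], numeric(1))
```

Without such a remapping, trajectories that differ only in the side on which they end would be treated as dissimilar by a distance-based clustering.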
Value
A list containing two lists that store the results of the different
methods. kopt contains the estimated k for each of the
methods specified in compute. paths contains the values for
each k in kseq as computed by each of the methods specified
in compute. The values in kopt are optima for each of the
vectors in paths.
Author(s)
Dirk U. Wulff
Jonas M. B. Haslbeck
References
Haslbeck, J. M. B., & Wulff, D. U. (2020). Estimating the Number of Clusters via a Corrected Clustering Instability. Computational Statistics, 35, 1879–1894.
Wulff, D. U., Haslbeck, J. M. B., Kieslich, P. J., Henninger, F., & Schulte-Mecklenbeck, M. (2019). Mouse-tracking: Detecting types in movement trajectories. In M. Schulte-Mecklenbeck, A. Kühberger, & J. G. Johnson (Eds.), A Handbook of Process Tracing Methods (pp. 131-145). New York, NY: Routledge.
Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.
Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 750-763.
Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27-39.
See Also
mt_distmat for more information about how the distance matrix is computed when the hclust method is used.
mt_cluster for performing trajectory clustering with a specified number of clusters.
Examples
## Not run:
# Length normalize trajectories
KH2017 <- mt_length_normalize(KH2017)
# Find k
results <- mt_cluster_k(KH2017, use="ln_trajectories")
# Retrieve results
results$kopt
results$paths
## End(Not run)