mt_cluster_k {mousetrap}

R Documentation

Estimate optimal number of clusters.

Description

Estimates the optimal number of clusters (k) using various methods.

Usage

mt_cluster_k(
  data,
  use = "ln_trajectories",
  dimensions = c("xpos", "ypos"),
  kseq = 2:15,
  compute = c("stability", "gap", "jump", "slope"),
  method = "hclust",
  weights = rep(1, length(dimensions)),
  pointwise = TRUE,
  minkowski_p = 2,
  hclust_method = "ward.D",
  kmeans_nstart = 10,
  n_bootstrap = 10,
  model_based = FALSE,
  n_gap = 10,
  na_rm = FALSE,
  verbose = FALSE
)

Arguments

`data`	a mousetrap data object created using one of the mt_import functions (see mt_example for details). Alternatively, a trajectory array can be provided directly (in this case `use` will be ignored).
`use`	a character string specifying which trajectory data should be used.
`dimensions`	a character vector specifying which trajectory variables should be used. Can be of length 2 or 3, for two-dimensional or three-dimensional trajectories respectively.
`kseq`	a numeric vector specifying the set of candidate values for k. Defaults to 2:15, implying that all values of k within that range are compared using the metrics specified in `compute`.
`compute`	character vector specifying the measures to be computed. Can be any subset of `c("stability","gap","jump","slope")`.
`method`	character string specifying the type of clustering procedure for the stability-based method. Either `hclust` or `kmeans`.
`weights`	numeric vector specifying the relative importance of the variables specified in `dimensions`. Defaults to a vector of 1s implying equal importance. Technically, each variable is rescaled so that the standard deviation matches the corresponding value in `weights`. To use the original variables, set `weights = NULL`.
`pointwise`	boolean specifying the way in which dissimilarity between the trajectories is measured. If `TRUE` (the default), `mt_distmat` measures the average dissimilarity and then sums the results. If `FALSE`, `mt_distmat` measures dissimilarity once (by treating the various points as independent dimensions). This is only relevant if `method` is "hclust". See mt_distmat for further details.
`minkowski_p`	an integer specifying the distance metric for the cluster solution. `minkowski_p = 1` computes the city-block distance, `minkowski_p = 2` (the default) computes the Euclidean distance, `minkowski_p = 3` the cubic distance, etc. Only relevant if `method` is "hclust". See mt_distmat for further details.
`hclust_method`	character string specifying the linkage criterion used. Passed on to the `method` argument of hclust. Default is set to `ward.D`. Only relevant if `method` is "hclust".
`kmeans_nstart`	integer specifying the number of reruns of the kmeans procedure. Larger numbers reduce the risk of ending up in a local minimum. Passed on to the `nstart` argument of kmeans. Only relevant if `method` is "kmeans".
`n_bootstrap`	an integer specifying the number of bootstrap comparisons used by `stability`. See cStability.
`model_based`	boolean specifying whether the model-based or the model-free approach should be used by `stability` when `method` is "kmeans". See cStability and Haslbeck & Wulff (2020).
`n_gap`	integer specifying the number of simulated datasets used by `gap`. See Tibshirani et al. (2001).
`na_rm`	logical specifying whether trajectory points containing NAs should be removed. Removal is done column-wise. That is, if any trajectory has a missing value at, e.g., the 10th recorded position, the 10th position is removed for all trajectories. This is necessary to compute distances between trajectories.
`verbose`	logical indicating whether the function should report its progress.
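
The following base R sketch illustrates the column-wise removal described for `na_rm` and the rescaling described for `weights`. It is not the package's internal code: the toy array, its [trajectory, position, variable] layout, and the rescaling by simple division by the standard deviation are assumptions made for illustration.

```r
# Toy trajectory array: 3 trajectories x 5 positions x 2 variables.
# The [trajectory, position, variable] layout is an assumption of this sketch.
traj <- array(
  c(0, 0, 0,  1, 2, 1,  2, NA, 3,  3, 5, 4,  4, 8, 6,    # xpos
    0, 0, 0,  2, 1, 3,  4, 3, 5,   6, 6, 7,  8, 9, 10),  # ypos
  dim = c(3, 5, 2),
  dimnames = list(NULL, NULL, c("xpos", "ypos"))
)

# Column-wise removal: drop a position for ALL trajectories as soon as
# any trajectory has a missing value at that position.
has_na <- apply(traj, 2, function(slice) any(is.na(slice)))
traj_complete <- traj[, !has_na, , drop = FALSE]

# Rescale each variable so that its standard deviation equals its weight
# (hypothetical rescaling; the package may center or scale differently).
weights <- c(1, 2)
traj_scaled <- traj_complete
for (i in seq_along(weights)) {
  v <- traj_complete[, , i]
  traj_scaled[, , i] <- v / sd(v) * weights[i]
}

dim(traj_complete)          # position 3 has been removed for every trajectory
sd(traj_scaled[, , "ypos"]) # matches weights[2]
```

Here the single NA in the second trajectory removes the third position from all three trajectories, which keeps the trajectories aligned for distance computations.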

Details

mt_cluster_k estimates the number of clusters (k) using four commonly used k-selection methods (specified via compute): cluster stability (stability), the gap statistic (gap), the jump statistic (jump), and the slope statistic (slope).

Cluster stability methods select k as the number of clusters for which the assignment of objects to clusters is most stable across bootstrap samples. This function implements the model-based and model-free methods described by Haslbeck & Wulff (2020). See references.
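
The general idea of stability-based selection can be sketched in base R as follows. This is a simplified, uncorrected illustration and not the procedure implemented in cStability: the simulated data and the helper names `nearest_center`, `disagreement`, and `instability` are hypothetical. Two bootstrap samples are clustered with kmeans, both solutions are used to assign the original objects to clusters, and the share of object pairs on which the two assignments disagree is recorded as instability.

```r
set.seed(1)

# Simulated data: three well-separated groups in two dimensions.
centers <- rbind(c(0, 0), c(10, 0), c(5, 8.66))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1], 0.5), rnorm(30, centers[i, 2], 0.5))
}))

# Assign every row of x to its nearest cluster center.
nearest_center <- function(x, ctr) {
  d <- sapply(seq_len(nrow(ctr)), function(j) {
    rowSums((x - matrix(ctr[j, ], nrow(x), ncol(x), byrow = TRUE))^2)
  })
  max.col(-d, ties.method = "first")
}

# Share of object pairs on which two partitions disagree
# (together in one partition, apart in the other).
disagreement <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  mean(same_a[upper.tri(same_a)] != same_b[upper.tri(same_b)])
}

# Mean disagreement across n_boot pairs of bootstrap samples.
instability <- function(x, k, n_boot = 10) {
  vals <- replicate(n_boot, {
    fit1 <- kmeans(x[sample(nrow(x), replace = TRUE), ], centers = k, nstart = 10)
    fit2 <- kmeans(x[sample(nrow(x), replace = TRUE), ], centers = k, nstart = 10)
    disagreement(nearest_center(x, fit1$centers),
                 nearest_center(x, fit2$centers))
  })
  mean(vals)
}

kseq <- 2:5
inst <- sapply(kseq, function(k) instability(x, k))
names(inst) <- kseq
kopt <- kseq[which.min(inst)]  # k with the most stable assignments
inst
```

Haslbeck & Wulff (2020) discuss why such raw instability values need a correction before they are compared across k, so this sketch should only be read as an outline of the bootstrap logic.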

The remaining three methods select k as the value that optimizes the gap statistic (Tibshirani, Walther, & Hastie, 2001), the jump statistic (Sugar & James, 2003), and the slope statistic (Fujita, Takahashi, & Patriota, 2014), respectively.
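
As an illustration of how one of these statistics selects k, the jump statistic can be sketched in base R. This is a minimal sketch following the description in Sugar & James, not the code used by mt_cluster_k. It assumes simulated data, kmeans as the clustering procedure, squared Euclidean error as the distortion, and the transformation power p/2, where p is the number of dimensions.

```r
set.seed(1)

# Simulated data: three well-separated groups in two dimensions.
centers <- rbind(c(0, 0), c(10, 0), c(5, 8.66))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1], 0.5), rnorm(30, centers[i, 2], 0.5))
}))
n <- nrow(x)
p <- ncol(x)

# Distortion: average within-cluster squared distance per dimension.
kseq <- 1:6
distortion <- sapply(kseq, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss / (n * p)
})

# Transform the distortion curve and take successive differences;
# the transformed distortion for k = 0 is defined as 0.
transformed <- c(0, distortion^(-p / 2))
jumps <- diff(transformed)
names(jumps) <- kseq

# The estimate of k is the value with the largest jump.
kopt <- kseq[which.max(jumps)]
kopt
```

With three clearly separated groups, the transformed distortion rises sharply once k reaches the true number of groups, which is what the largest jump picks up.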

For clustering trajectories, it is often useful if all trajectories end in the same direction, e.g., if all trajectories end in the top-left corner of the coordinate system (mt_remap_symmetric or mt_align can be used to achieve this). Furthermore, it is recommended to use length-normalized trajectories (see mt_length_normalize; Wulff et al., 2019).

Value

A list containing two lists that store the results of the different methods. kopt contains the estimated k for each of the methods specified in compute. paths contains the values for each k in kseq as computed by each of the methods specified in compute. The values in kopt are optima for each of the vectors in paths.

Author(s)

Dirk U. Wulff

Jonas M. B. Haslbeck

References

Haslbeck, J. M. B., & Wulff, D. U. (2020). Estimating the Number of Clusters via a Corrected Clustering Instability. Computational Statistics, 35, 1879–1894.

Wulff, D. U., Haslbeck, J. M. B., Kieslich, P. J., Henninger, F., & Schulte-Mecklenbeck, M. (2019). Mouse-tracking: Detecting types in movement trajectories. In M. Schulte-Mecklenbeck, A. Kühberger, & J. G. Johnson (Eds.), A Handbook of Process Tracing Methods (pp. 131-145). New York, NY: Routledge.

Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423.

Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset. Journal of the American Statistical Association, 98(463), 750-763.

Fujita, A., Takahashi, D. Y., & Patriota, A. G. (2014). A non-parametric method to estimate the number of clusters. Computational Statistics & Data Analysis, 73, 27-39.

Examples


## Not run: 
# Length normalize trajectories
KH2017 <- mt_length_normalize(KH2017)

# Find k
results <- mt_cluster_k(KH2017, use="ln_trajectories")

# Retrieve results
results$kopt
results$paths

## End(Not run)

[Package mousetrap version 3.2.3 Index]