R: Set control values for use with MEDseq

MEDseq_control {MEDseq}

R Documentation

Set control values for use with MEDseq_fit

Description

Supplies a list of arguments (with defaults) for use with MEDseq_fit.

Usage

MEDseq_control(algo = c("EM", "CEM", "cemEM"), 
               init.z = c("kmedoids", "kmodes", "kmodes2", "hc", "random", "list"), 
               z.list = NULL, 
               dist.mat = NULL, 
               unique = TRUE, 
               criterion = c("bic", "icl", "aic", "dbs", "asw", "cv", "nec"), 
               tau0 = NULL, 
               noise.gate = TRUE, 
               random = TRUE,
               do.cv = FALSE, 
               do.nec = FALSE, 
               nfolds = 10L, 
               nstarts = 1L, 
               stopping = c("aitken", "relative"), 
               equalPro = FALSE, 
               equalNoise = FALSE, 
               tol = c(1E-05, 1E-08), 
               itmax = c(.Machine$integer.max, 1000L), 
               opti = c("mode", "medoid", "first", "GA"), 
               ordering = c("none", "decreasing", "increasing"), 
               MaxNWts = 1000L, 
               verbose = TRUE, 
               ...)

Arguments

`algo`	Switch controlling whether models are fit using the `"EM"` (the default) or `"CEM"` algorithm. The option `"cemEM"` allows running the EM algorithm starting from convergence of the CEM algorithm.
`init.z`	The method used to initialise the cluster labels. All options respect the presence of sampling `weights`, if any. Defaults to `"kmedoids"`. Other options include `"kmodes"`, `"kmodes2"`, Ward's hierarchical clustering (`"hc"`, via `hclust`), `"random"` initialisation, and a user-supplied `"list"` (see `z.list` below). For weighted sequences, `"kmedoids"` is itself initialised using Ward's hierarchical clustering. The `"kmodes"` and `"kmodes2"` options both internally call the function `wKModes`, which typically uses random initial modes. Under `"kmodes"`, the algorithm is instead initialised via the medoids of the clusters obtained from a call to `hclust`. The option `"kmodes2"` is slightly faster, by virtue of using the random initial medoids. However, final results are by default still subject to randomness under both options (unless `set.seed` is invoked), as ties for modes and cluster assignments are typically broken at random throughout the algorithm (see the `random` argument below, and in `wKModes` itself).
`z.list`	A user-supplied list of initial cluster allocation matrices, with number of rows given by the number of observations, and numbers of columns given by the range of component numbers being considered. Only relevant if `init.z == "list"`. These matrices are allowed to correspond to either soft or hard clusterings, and will be internally normalised so that the rows sum to 1.
`dist.mat`	An optional distance matrix to use for initialisation when `init.z` is one of `"kmedoids"` or `"hc"`. Defaults to a Hamming distance matrix. This is an experimental feature and should only be tampered with by expert users.
`unique`	A logical indicating whether the model is fit only to the unique observations (defaults to `TRUE`). When there are covariates, this means all unique combinations of covariate and sequence patterns, otherwise only the sequence patterns. When `weights` are not supplied to `MEDseq_fit` and `isTRUE(unique)`, weights are given by the occurrence frequency of the corresponding sequences, and the model is then fit to the unique observations only. When `weights` are supplied and `isTRUE(unique)`, the weights are summed for each set of duplicate observations and assigned to one retained copy of each corresponding unique sequence. Hence, observations with different weights that are otherwise duplicates are treated as duplicates and significant computational gains can be made. In both cases, the results will be unchanged, but setting `unique` to `TRUE` can often be much faster.
`criterion`	When either `G` or `modtype` is a vector, `criterion` governs how the 'best' model is determined when gathering output. Defaults to `"bic"`. Note that all criteria will be returned in any case, if possible.
`tau0`	Prior mixing proportion for the noise component. If supplied, a noise component will be added to the model in the estimation, with `tau0` giving the prior probability of belonging to the noise component for all observations. Typically supplied as a scalar in the interval (0, 1), e.g. `0.1`. Can be supplied as a vector when gating covariates are present and `noise.gate` is `TRUE`.
`noise.gate`	A logical indicating whether gating network covariates influence the mixing proportion for the noise component, if any. Defaults to `TRUE`, but leads to greater parsimony if `FALSE`. Only relevant in the presence of a noise component (i.e. the `"CCN"`, `"UCN"`, `"CUN"`, and `"UUN"` models); only affects estimation in the presence of gating covariates.
`random`	A logical governing how ties for estimated central sequence positions are handled. When `TRUE` (the default), such ties are broken at random. When `FALSE` (the implied default prior to version `1.2.0` of this package), the first candidate state is always chosen. This argument affects all `opti` options. If `verbose` is `TRUE` and there are tie-breaking operations performed, a warning message is printed once per model, regardless of the number of such operations. Note that this argument is also passed to `wKModes` if `init.z` is `"kmodes"` or `"kmodes2"`. In certain rare cases, namely when the `"CEM"` `algo` is invoked with `equalPro` set to `TRUE` and the precision parameter(s) constrained in some way across clusters, this argument additionally governs ties for cluster assignments within `MEDseq_fit`.
`do.cv`	A logical indicating whether cross-validated log-likelihood scores should also be computed (see `nfolds`). Defaults to `FALSE` due to the significant computational burden incurred.
`do.nec`	A logical indicating whether the normalised entropy criterion (NEC) should also be computed (for models with more than one component). Defaults to `FALSE`. When `TRUE`, models with `G=1` are always fitted.
`nfolds`	The number of folds to use when `isTRUE(do.cv)`. Defaults to `10`.
`nstarts`	The number of random initialisations to use when `init.z="random"`. Defaults to `1`. Results will be based on the random start yielding the highest estimated log-likelihood.
`stopping`	The criterion used to assess convergence of the EM/CEM algorithm. The default (`"aitken"`) uses Aitken's acceleration method, otherwise the `"relative"` change in log-likelihood is monitored (which may be less strict).
`equalPro`	Logical variable indicating whether or not the mixing proportions are to be constrained to be equal in the model. Default: `equalPro = FALSE`. Only relevant when `gating` covariates are not supplied within `MEDseq_fit`, otherwise ignored. In the presence of a noise component, only the mixing proportions for the non-noise components are constrained to be equal (by default, see `equalNoise`), after accounting for the noise component.
`equalNoise`	Logical which is only invoked when `isTRUE(equalPro)` and gating covariates are not supplied. Under the default setting (`FALSE`), the mixing proportion for the noise component is estimated, and remaining mixing proportions are equal; when `TRUE` all components, including the noise component, have equal mixing proportions.
`tol`	A vector of length two giving relative convergence tolerances for 1) the log-likelihood of the EM/CEM algorithm, and 2) optimisation in the multinomial logistic regression in the gating network, respectively. The default is `c(1e-05, 1e-08)`. If only one number is supplied, it is used as the tolerance in both cases.
`itmax`	A vector of length two giving integer limits on the number of iterations for 1) the EM/CEM algorithm, and 2) the multinomial logistic regression in the gating network, respectively. The default is `c(.Machine$integer.max, 1000)`. This allows termination of the EM/CEM algorithm to be completely governed by `tol[1]`. If only one number is supplied, it is used as the iteration limit for the EM/CEM algorithm only and the other element of `itmax` retains its usual default. If, for any model with gating covariates, the multinomial logistic regression in the gating network fails to converge in `itmax[2]` iterations at any stage of the EM/CEM algorithm, an appropriate warning will be printed, prompting the user to modify this argument.
`opti`	Character string indicating how central sequence parameters should be estimated. The default `"mode"` is exact and thus this experimental argument should only be tampered with by expert users. The option `"medoid"` fixes the central sequence(s) to be one of the observed sequences (like k-medoids). The other options `"first"` and `"GA"` use stochastic local search with the first-improvement and genetic algorithms, respectively, to mutate the medoid. Pre-computation of the Hamming distance matrix for the observed sequences speeds up computation of all options other than `"mode"`.
`ordering`	Experimental feature that should only be tampered with by experienced users. Allows sequences to be reordered on the basis of the column-wise entropy when `opti` is `"first"` or `"GA"`.
`MaxNWts`	The maximum allowable number of weights in the call to `multinom` for the multinomial logistic regression in the gating network. There is no intrinsic limit in the code, but increasing `MaxNWts` will probably allow fits that are very slow and time-consuming. It may be necessary to increase `MaxNWts` when categorical concomitant variables with many levels are included or the number of components is high.
`verbose`	Logical indicating whether to print messages pertaining to progress to the screen during fitting. Defaults to `TRUE` if the session is interactive, and `FALSE` otherwise. If `FALSE`, warnings and error messages will still be printed to the screen, but everything else will be suppressed.
`...`	Catches unused arguments. Also allows the optional arguments `ztol` and `summ` to be passed to `dbs`, and `summ` to be passed to the ASW computation, as well as the optional `wKModes` arguments `iter.max`, `freq.weighted`, and `fast` (provided `init.z` is one of `"kmodes"` or `"kmodes2"`). In such cases, the `wKModes` argument `random` is already controlled by the `random` argument above.
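
The descriptions of `unique` and `z.list` above can be illustrated with a small base R sketch. This is not the package's internal code. The toy sequences, weights, and object names (`seqs`, `w_unique`, `z_hard`, `z_soft`) are hypothetical, and the sketch only mirrors the documented behaviour: duplicate observations are collapsed with summed weights, and allocation matrices are rescaled so that rows sum to 1.

```r
# Toy data (hypothetical): five sequences stored as strings, with duplicates.
seqs    <- c("A-A-B", "A-B-B", "A-A-B", "B-B-B", "A-B-B")
weights <- c(1.0, 0.5, 2.0, 1.0, 1.5)

# No weights supplied: each unique sequence is weighted by its frequency.
freq     <- table(seqs)

# Weights supplied: weights of duplicates are summed per unique sequence.
w_unique <- tapply(weights, seqs, sum)

# Hard initial allocation for G = 3 from integer labels (a single 1 per row).
labels <- c(1, 2, 1, 3, 2)
z_hard <- outer(labels, 1:3, "==") * 1

# Soft allocation given as unnormalised scores; rescale rows to sum to 1.
z_soft <- matrix(c(2, 1, 1,
                   0, 3, 1), nrow = 2, byrow = TRUE)
z_norm <- z_soft / rowSums(z_soft)

print(w_unique)
print(rowSums(z_norm))
```

A list of such matrices, one per candidate number of components, is the kind of object `z.list` expects when `init.z` is `"list"`.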

Details

MEDseq_control is provided for assigning values and defaults within MEDseq_fit. While the criterion argument controls the choice of the optimal number of components and MEDseq model type (in terms of the constraints or lack thereof on the precision parameters), MEDseq_compare is provided for choosing between fits with different combinations of covariates or different initialisation settings.

Value

A named list in which the names are the names of the arguments and the values are the values supplied to the arguments.
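
As a rough illustration of how a control function of this kind typically resolves its defaults, the sketch below uses a hypothetical stand-in called `toy_control`. It is not the MEDseq source and covers only four of the arguments. It assumes the usual R idiom in which the first element of a vector of choices is the default (via `match.arg`), and it mimics the documented handling of length-one `tol` and `itmax` values.

```r
# Hypothetical stand-in for a control constructor; NOT the MEDseq source code.
toy_control <- function(algo     = c("EM", "CEM", "cemEM"),
                        stopping = c("aitken", "relative"),
                        tol      = c(1e-05, 1e-08),
                        itmax    = c(.Machine$integer.max, 1000L)) {
  # The first listed choice acts as the default when nothing is supplied.
  algo     <- match.arg(algo)
  stopping <- match.arg(stopping)
  # A single tolerance is used for both the EM/CEM and gating steps.
  if (length(tol) == 1L) tol <- rep(tol, 2L)
  # A single iteration limit applies to EM/CEM only; the second keeps its default.
  if (length(itmax) == 1L) itmax <- c(itmax, 1000L)
  # Return a named list whose names match the argument names.
  list(algo = algo, stopping = stopping, tol = tol, itmax = as.integer(itmax))
}

ctrl_toy <- toy_control(algo = "CEM", tol = 1e-06, itmax = 500)
str(ctrl_toy)
```

The returned object is an ordinary list, so individual settings can be inspected with `$`, e.g. `ctrl_toy$algo`.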

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie>

References

Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.

Menardi, G. (2011). Density-based silhouette diagnostics for clustering methods. Statistics and Computing, 21(3): 295-308.

Hoos, H. and Stützle, T. (2004). Stochastic Local Search: Foundations and Applications. The Morgan Kaufmann Series in Artificial Intelligence. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Examples

# The CC MEDseq model is almost equivalent to k-medoids when the
# CEM algorithm is employed, mixing proportions are constrained,
# and the central sequences are restricted to the observed sequences
ctrl  <- MEDseq_control(algo="CEM", equalPro=TRUE, opti="medoid", criterion="asw")

data(mvad)
# Note that ctrl must be explicitly named 'ctrl'
mod   <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="CC", weights=mvad$weight, ctrl=ctrl)

# Alternatively, specify the control arguments directly
mod   <- MEDseq_fit(seqdef(mvad[,17:86]), G=11, modtype="CC", weights=mvad$weight,
                    algo="CEM", equalPro=TRUE, opti="medoid", criterion="asw")

# Note that supplying control arguments via a mix of the ... construct and the named argument 
# 'ctrl', or supplying MEDseq_control output without naming it 'ctrl', can throw an error
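
A further minimal sketch, assuming the MEDseq package is installed, requests a noise component through `tau0` and inspects a few entries of the returned list. The object name `ctrl_noise` and the particular values are illustrative only, and the call is guarded so that nothing runs if the package is unavailable.

```r
# Only run if MEDseq is installed (assumption: it may not be available).
if (requireNamespace("MEDseq", quietly = TRUE)) {
  # Prior noise proportion of 0.1, not influenced by gating covariates,
  # with the 'best' model chosen by ICL rather than the default BIC.
  ctrl_noise <- MEDseq::MEDseq_control(tau0 = 0.1, noise.gate = FALSE,
                                       criterion = "icl", verbose = FALSE)
  # The result is a named list; look at the settings just supplied.
  str(ctrl_noise[c("tau0", "noise.gate", "criterion")])
}
```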

[Package MEDseq version 1.4.1 Index]