R: MEDseq: Mixtures of Exponential-Distance Models with...

MEDseq_fit {MEDseq}

R Documentation

MEDseq: Mixtures of Exponential-Distance Models with Covariates

Description

Fits MEDseq models: mixtures of Exponential-Distance models with gating covariates and sampling weights. Typically used for clustering categorical/longitudinal life-course sequences. Additional arguments are available via the function MEDseq_control.

Usage

MEDseq_fit(seqs, 
           G = 1L:9L, 
           modtype = c("CC", "UC", "CU", "UU", 
                       "CCN", "UCN", "CUN", "UUN"), 
           gating = NULL, 
           weights = NULL, 
           ctrl = MEDseq_control(...), 
           covars = NULL, 
           ...)

## S3 method for class 'MEDseq'
summary(object,
        classification = TRUE,
        parameters = FALSE,
        network = FALSE,
        SPS = FALSE,
        ...)

## S3 method for class 'MEDseq'
print(x,
      digits = 3L,
      ...)

Arguments

`seqs`	A state-sequence object of class `"stslist"` as created by the `seqdef` function in the TraMineR package (which is reexported by MEDseq for convenience). Note that the data set must have equal sequence lengths, the intervals are assumed to be evenly spaced, and missingness is not allowed.
`G`	A positive integer vector specifying the numbers of mixture components (clusters) to fit. Defaults to `G=1:9`.
`modtype`	A vector of character strings indicating the type of MEDseq models to be fitted, in terms of the constraints or lack thereof on the precision parameters. By default, all valid model types are fitted (except some only where `G > 1` or `G > 2`, see `Note`). The models are named `"CC"`, `"CU"`, `"UC"`, `"UU"`, `"CCN"`, `"CUN"`, `"UCN"`, and `"UUN"`. The first letter denotes whether the precision parameters are constrained/unconstrained across clusters. The second letter denotes whether the precision parameters are constrained/unconstrained across sequence positions (i.e. time points). The third letter denotes whether one of the components is constrained to have zero-precision/infinite variance. Such a noise component assumes sequences in that cluster follow a uniform distribution.
`gating`	A `formula` for determining the model matrix for the multinomial logistic regression in the gating network when fixed covariates enter the mixing proportions. Defaults to `~1`, i.e. no covariates. This will be ignored where `G=1`. Continuous, categorical, and/or ordinal covariates are allowed. Logical covariates will be coerced to factors. Interactions, transformations, and higher order terms are permitted: the latter must be specified explicitly using the `AsIs` operator (`I`). The specification of the LHS of the formula is ignored. Intercept terms are included by default.
`weights`	Optional numeric vector containing observation-specific sampling weights, which are accounted for in the model fitting and other functions where applicable. `weights` are always internally normalised to sum to the sample size. See the `unique` argument to `MEDseq_control` to see how incorporating weights also yields computational benefits. Note that `weights` must always be explicitly supplied here; it is not enough to use weights when constructing the state sequence object via `seqdef` (reexported by MEDseq for convenience). If you are using a weighted `"stslist"` state sequence object and do not specify `weights`, you will be prompted to explicitly specify `weights=attr(seqs, "weights")` for a weighted model or `weights=NULL` for an unweighted model.
`ctrl`	A list of control parameters for the EM/CEM and other aspects of the algorithm. The defaults are set by a call to `MEDseq_control`.
`covars`	An optional data frame (or a matrix with named columns) in which to look for the covariates in the `gating` network formula, if any. If not found in `covars`, any supplied `gating` covariates are taken from the environment from which `MEDseq_fit` is called. Try to ensure the names of variables in `covars` do not match any of those in `seqs`.
`...`	Catches unused arguments (see `MEDseq_control`).
`x`, `object`, `digits`, `classification`, `parameters`, `network`, `SPS`	Arguments required for the `print` and `summary` functions: `x` and `object` are objects of class `"MEDseq"` resulting from a call to `MEDseq_fit`, while `digits` gives the number of decimal places to round to for printing purposes (defaults to `3`). `classification`, `parameters`, and `network` are logicals which govern whether a table of the MAP classification of observations, the mixture component parameters, and the gating network coefficients are printed, respectively. `SPS` governs the printing of the relevant quantities in `"summaryMEDseq"` objects when any of `classification`, `parameters`, &/or `network` are `TRUE` (see `MEDseq_clustnames` and `seqformat`).

Details

The function effectively allows 8 different MEDseq precision parameter settings for models with or without gating network covariates. By constraining the mixing proportions to be equal (see equalPro in MEDseq_control) an extra special case is facilitated in the latter case.

While model selection in terms of choosing the optimal number of components and the MEDseq model type is performed within MEDseq_fit, using one of the criterion options within MEDseq_control, choosing between multiple fits with different combinations of covariates or different initialisation settings can be done by supplying objects of class "MEDseq" to MEDseq_compare.

Value

A list (of class "MEDseq") with the following named entries (of which some may be missing, depending on the criterion employed), mostly corresponding to the chosen optimal model (as determined by the criterion within MEDseq_control):

`call`	The matched call.
`data`	The input data, `seqs`.
`modtype`	A character string denoting the MEDseq model type at which the optimal `criterion` occurs.
`G`	The optimal number of mixture components according to `criterion`.
`params`	A list with the following named components: `theta` A matrix with `G` rows and T columns, where T is the number of sequence positions, giving the central sequences of each cluster. The mean of the noise component is not reported, as it does not contribute in any way to the likelihood. A dedicated `print` function is provided. `lambda` A matrix of precision parameters. Will contain `1` row if the 1st letter of `modtype` is "C" and `G` columns otherwise. Will contain `1` column if the 2nd letter of `modtype` is "C" and T columns otherwise, where T is the number of sequence positions. Precision parameter values of zero are reported for the noise component, if any. Note that values of `Inf` are also possible, corresponding to zero-variance, which is most likely under the `"UU"` or `"UUN"` models. A dedicated `print` function is provided. `tau` The mixing proportions: either a vector of length `G` or, if `gating` covariates were supplied, a matrix with an entry for each observation (rows) and component (columns).
`gating`	An object of class `"MEDgating"` (for which dedicated `print`, `summary`, and `predict` methods exist) and either `"multinom"` or `"glm"` (only for single-component models) giving the `multinom` regression coefficients of the `gating` network. If `gating` covariates were NOT supplied (or the best model has just one component), this corresponds to a RHS of `~1`, otherwise the supplied `gating` formula. As such, a fitted `gating` network is always returned even in the absence of supplied covariates or clusters. If there is a noise component (and the option `noise.gate=TRUE` is invoked), its coefficients are those for the last component. Users are cautioned against making inferences about statistical significance from summaries of the coefficients in the gating network. Users are instead advised to use the function `MEDseq_stderr`.
`z`	The final responsibility matrix whose `[i,k]`-th entry is the probability that observation i belongs to the k-th component. If there is a noise component, its values are found in the last column.
`MAP`	The vector of cluster labels for the chosen model corresponding to `z`, i.e. `max.col(z)`. Observations belonging to the noise component, if any, will belong to component `0`.
`BIC`	A matrix of all BIC values with `length{G}` rows and `length(modtype)` columns. See `Note`.
`ICL`	A matrix of all ICL values with `length{G}` rows and `length(modtype)` columns. See `Note`.
`AIC`	A matrix of all AIC values with `length{G}` rows and `length(modtype)` columns. See `Note`.
`DBS`	A matrix of all (weighted) mean/median DBS values with `length{G}` rows and `length(modtype)` columns. See `Note` and `dbs`.
`DBSvals`	A list of lists giving the observation-specific DBS values for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types.
`dbs`	The (weighted) mean/median DBS value corresponding to the optimal model. May not necessarily be the optimal DBS.
`dbsvals`	Observation-specific DBS values corresponding to the optimum model, which may not be optimal in terms of DBS.
`ASW`	A matrix of all (weighted) mean/median ASW values with `length{G}` rows and `length(modtype)` columns. See `Note`.
`ASWvals`	A list of lists giving the observation-specific ASW values for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types.
`asw`	The (weighted) mean/median ASW value corresponding to the optimal model. May not necessarily be the optimal ASW.
`aswvals`	Observation-specific ASW values corresponding to the optimum model, which may not be optimal in terms of ASW.
`LOGLIK`	A matrix of all maximal log-likelihood values with `length{G}` rows and `length(modtype)` columns. See `Note`.
`DF`	A matrix giving the numbers of estimated parameters (i.e. the number of 'used' degrees of freedom) for all visited models, with `length{G}` rows and `length(modtype)` columns. Subtract these numbers from the sample size to get the degrees of freedom. See `Note`.
`ITERS`	A matrix giving the total number of EM/CEM iterations for all visited models, with `length{G}` rows and `length(modtype)` columns. See `Note`.
`CV`	A matrix of all cross-validated log-likelihood values with `length{G}` rows and `length(modtype)` columns, if available. See `Note` and the arguments `do.cv` and `nfolds` to `MEDseq_control`.
`NEC`	A matrix of all NEC values with `length{G}` rows and `length(modtype)` columns, if available. See `Note` and the argument `do.nec` to `MEDseq_control`.
`bic`	The BIC value corresponding to the optimal model. May not necessarily be the optimal BIC.
`icl`	The ICL value corresponding to the optimal model. May not necessarily be the optimal ICL.
`aic`	The AIC value corresponding to the optimal model. May not necessarily be the optimal AIC.
`loglik`	The vector of increasing log-likelihood values for every EM/CEM iteration under the optimal model. The last element of this vector is the maximum log-likelihood achieved by the parameters returned at convergence.
`df`	The number of estimated parameters in the optimal model (i.e. the number of 'used' degrees of freedom). Subtract this number from the sample size to get the degrees of freedom.
`iters`	The total number of EM/CEM iterations for the optimal model.
`cv`	The cross-validated log-likelihood value corresponding to the optimal model, if available. May not necessarily be the optimal one.
`nec`	The NEC value corresponding to the optimal model, if available. May not necessarily be the optimal NEC.
`ZS`	A list of lists giving the `z` matrices for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types.
`uncert`	The uncertainty associated with the `classification`.
`covars`	A data frame gathering the set of covariates used in the `gating` network, if any. Will contain zero columns in the absence of gating covariates. Supplied gating covariates will be excluded if the optimal model has only one component. May have fewer columns than covariates supplied via the `covars` argument also, as only the included covariates are gathered here.

Dedicated plot, print, and summary functions exist for objects of class "MEDseq".

Note

Where BIC, ICL, AIC, DBS, ASW, LOGLIK, DF, ITERS, CV, and NEC contain NA entries, this corresponds to a model which was not run; for instance a UU model is never run for single-component models as it is equivalent to CU, while a UCN model is never run for two-component models as it is equivalent to CCN. As such, one can consider the value as not really missing, but equivalent to the corresponding value. On the other hand, -Inf represents models which were terminated due to error, for which a log-likelihood could not be estimated. These objects all inherit the class "MEDCriterion" for which dedicated print and summary methods exist. For plotting, please see plot.

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie>

References

Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.

Examples


# Load the MVAD data
data(mvad)
mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x) 
                 which(x == "yes")), labels = colnames(mvad[,5:9]))
mvad          <- list(covariates = mvad[c(3:4,10:14,87)],
                      sequences = mvad[,15:86], 
                      weights = mvad[,2])
mvad.cov      <- mvad$covariates

# Create a state sequence object with the first two (summer) time points removed
states        <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels        <- c("Employment", "Further Education", "Higher Education", 
                   "Joblessness", "School", "Training")
mvad.seq      <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels)

# Fit a range of exponential-distance models without clustering
mod0          <- MEDseq_fit(mvad.seq, G=1)


# Fit a range of unweighted mixture models without covariates
# Only consider models with a noise component
# Supply some MEDseq_control() arguments
# mod1        <- MEDseq_fit(mvad.seq, G=9:10, modtype=c("CCN", "CUN", "UCN", "UUN"),
#                           algo="CEM", init.z="kmodes", criterion="icl")

# Fit a model with weights and a gating covariate
# Have the probability of noise-component membership be constant
mod2          <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, 
                            gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE)
                            
# Examine this model in greater detail
summary(mod2, classification=TRUE, parameters=TRUE)
summary(mod2$gating, SPS=TRUE)
print(mod2$params$theta, SPS=TRUE)
plot(mod2, "clusters")

[Package MEDseq version 1.4.1 Index]