MEDseq_fit {MEDseq} | R Documentation |
MEDseq: Mixtures of Exponential-Distance Models with Covariates
Description
Fits MEDseq models: mixtures of Exponential-Distance models with gating covariates and sampling weights. Typically used for clustering categorical/longitudinal life-course sequences. Additional arguments are available via the function MEDseq_control.
Usage
MEDseq_fit(seqs,
           G = 1L:9L,
           modtype = c("CC", "UC", "CU", "UU",
                       "CCN", "UCN", "CUN", "UUN"),
           gating = NULL,
           weights = NULL,
           ctrl = MEDseq_control(...),
           covars = NULL,
           ...)
## S3 method for class 'MEDseq'
summary(object,
        classification = TRUE,
        parameters = FALSE,
        network = FALSE,
        SPS = FALSE,
        ...)
## S3 method for class 'MEDseq'
print(x,
      digits = 3L,
      ...)
Arguments
seqs |
A state-sequence object of class |
G |
A positive integer vector specifying the numbers of mixture components (clusters) to fit. Defaults to |
modtype |
A vector of character strings indicating the type of MEDseq models to be fitted, in terms of the constraints or lack thereof on the precision parameters. By default, all valid model types are fitted (except some only where |
gating |
A |
weights |
Optional numeric vector containing observation-specific sampling weights, which are accounted for in the model fitting and other functions where applicable. |
ctrl |
A list of control parameters for the EM/CEM and other aspects of the algorithm. The defaults are set by a call to |
covars |
An optional data frame (or a matrix with named columns) in which to look for the covariates in the |
... |
Catches unused arguments (see |
x, object, digits, classification, parameters, network, SPS |
Arguments required for the |
Details
The function effectively allows 8 different MEDseq precision parameter settings for models with or without gating network covariates. By constraining the mixing proportions to be equal (see equalPro in MEDseq_control), an extra special case is facilitated in the latter case.
While model selection in terms of choosing the optimal number of components and the MEDseq model type is performed within MEDseq_fit, using one of the criterion options within MEDseq_control, choosing between multiple fits with different combinations of covariates or different initialisation settings can be done by supplying objects of class "MEDseq" to MEDseq_compare.
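The within-fit selection described above amounts to locating the best entry of a criterion matrix indexed by the number of components and the model type. The base-R sketch below illustrates that idea only. The matrix `crit`, its values, and the larger-is-better convention are assumptions made for illustration; they are not output from MEDseq_fit.

```r
# Hypothetical criterion matrix (invented values):
# rows = numbers of components, columns = MEDseq model types.
crit <- matrix(c(-510, -480, -475,   # column "CC"
                 -505, -470, -478,   # column "UC"
                 -500, -468, -472),  # column "CU"
               nrow = 3,
               dimnames = list(G = c("1", "2", "3"),
                               modtype = c("CC", "UC", "CU")))

# Locate the largest entry (assumption: larger values are better).
best <- which(crit == max(crit, na.rm = TRUE), arr.ind = TRUE)[1L, ]

# Translate the row/column position into a number of components and a model type.
best_G       <- as.integer(rownames(crit)[best["row"]])
best_modtype <- colnames(crit)[best["col"]]
```

For criteria where smaller values are preferred, `max` would be replaced by `min` in this sketch.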
Value
A list (of class "MEDseq") with the following named entries (of which some may be missing, depending on the criterion employed), mostly corresponding to the chosen optimal model (as determined by the criterion within MEDseq_control):
call |
The matched call. |
data |
The input data, |
modtype |
A character string denoting the MEDseq model type at which the optimal |
G |
The optimal number of mixture components according to |
params |
A list with the following named components:
|
gating |
An object of class |
z |
The final responsibility matrix whose |
MAP |
The vector of cluster labels for the chosen model corresponding to |
BIC |
A matrix of all BIC values with |
ICL |
A matrix of all ICL values with |
AIC |
A matrix of all AIC values with |
DBS |
A matrix of all (weighted) mean/median DBS values with |
DBSvals |
A list of lists giving the observation-specific DBS values for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types. |
dbs |
The (weighted) mean/median DBS value corresponding to the optimal model. May not necessarily be the optimal DBS. |
dbsvals |
Observation-specific DBS values corresponding to the optimum model, which may not be optimal in terms of DBS. |
ASW |
A matrix of all (weighted) mean/median ASW values with |
ASWvals |
A list of lists giving the observation-specific ASW values for all fitted models. The first level of the list corresponds to numbers of components, the second to the MEDseq model types. |
asw |
The (weighted) mean/median ASW value corresponding to the optimal model. May not necessarily be the optimal ASW. |
aswvals |
Observation-specific ASW values corresponding to the optimum model, which may not be optimal in terms of ASW. |
LOGLIK |
A matrix of all maximal log-likelihood values with |
DF |
A matrix giving the numbers of estimated parameters (i.e. the number of 'used' degrees of freedom) for all visited models, with |
ITERS |
A matrix giving the total number of EM/CEM iterations for all visited models, with |
CV |
A matrix of all cross-validated log-likelihood values with |
NEC |
A matrix of all NEC values with |
bic |
The BIC value corresponding to the optimal model. May not necessarily be the optimal BIC. |
icl |
The ICL value corresponding to the optimal model. May not necessarily be the optimal ICL. |
aic |
The AIC value corresponding to the optimal model. May not necessarily be the optimal AIC. |
loglik |
The vector of increasing log-likelihood values for every EM/CEM iteration under the optimal model. The last element of this vector is the maximum log-likelihood achieved by the parameters returned at convergence. |
df |
The number of estimated parameters in the optimal model (i.e. the number of 'used' degrees of freedom). Subtract this number from the sample size to get the degrees of freedom. |
iters |
The total number of EM/CEM iterations for the optimal model. |
cv |
The cross-validated log-likelihood value corresponding to the optimal model, if available. May not necessarily be the optimal one. |
nec |
The NEC value corresponding to the optimal model, if available. May not necessarily be the optimal NEC. |
ZS |
A list of lists giving the |
uncert |
The uncertainty associated with the |
covars |
A data frame gathering the set of covariates used in the |
Dedicated plot, print, and summary functions exist for objects of class "MEDseq".
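As a rough illustration of how the z, MAP, and uncert entries typically relate to one another in mixture models, the base-R sketch below derives hard labels and uncertainties from a responsibility matrix. It assumes the usual convention that labels are the row-wise argmax of z and that uncertainty is one minus the row-wise maximum. The matrix values are invented, the object names are hypothetical, and any special labelling of a noise component is not handled here.

```r
# Hypothetical 4 x 3 responsibility matrix (each row sums to 1); invented values.
z <- rbind(c(0.90, 0.05, 0.05),
           c(0.10, 0.80, 0.10),
           c(0.30, 0.30, 0.40),
           c(0.05, 0.05, 0.90))

# Hard cluster labels: the column with the largest responsibility in each row.
map_labels <- max.col(z, ties.method = "first")

# Uncertainty of each assignment: one minus the largest responsibility
# (an assumed convention, not confirmed by this help page).
uncert <- 1 - apply(z, 1L, max)

# Cluster sizes implied by the hard labels.
sizes <- table(map_labels)
```

Observations with high values of `uncert` (the third row here) sit between clusters and may deserve a closer look.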
Note
Where BIC, ICL, AIC, DBS, ASW, LOGLIK, DF, ITERS, CV, and NEC contain NA entries, these correspond to models which were not run; for instance, a UU model is never run for single-component models as it is equivalent to CU, while a UCN model is never run for two-component models as it is equivalent to CCN. As such, one can consider the value as not really missing, but equivalent to the corresponding value. On the other hand, -Inf represents models which were terminated due to error, for which a log-likelihood could not be estimated. These objects all inherit the class "MEDCriterion", for which dedicated print and summary methods exist. For plotting, please see plot.
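The distinction between NA (model not run) and -Inf (model terminated due to error) can be checked with base R alone. The sketch below uses a small invented matrix; the object names and values are hypothetical and only mimic the layout described in this note.

```r
# Hypothetical criterion matrix (invented values):
# NA marks a model that was not run, -Inf one that terminated with an error.
crit <- matrix(c(-520, -495, -490,   # column "CU"
                 NA,   -Inf, -470),  # column "UU"
               nrow = 3,
               dimnames = list(G = c("1", "2", "3"),
                               modtype = c("CU", "UU")))

not_run <- is.na(crit)                    # skipped as equivalent to another model
failed  <- is.infinite(crit) & crit < 0   # terminated due to error
usable  <- is.finite(crit)                # fitted, with a finite criterion value

# Tally the three kinds of entries.
counts <- c(not_run = sum(not_run), failed = sum(failed), usable = sum(usable))
```

In this toy matrix, the single-component UU entry is NA, which matches the note's example of UU being equivalent to CU when there is only one component.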
Author(s)
Keefe Murphy - <keefe.murphy@mu.ie>
References
Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021). Clustering longitudinal life-course sequences using mixtures of exponential-distance models. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4): 1414-1451. <doi:10.1111/rssa.12712>.
See Also
seqdef (reexported by MEDseq for convenience), MEDseq_control, MEDseq_compare, plot.MEDseq, predict.MEDgating, MEDseq_stderr, I, MEDseq_clustnames, seqformat
Examples
# Load the MVAD data
data(mvad)
mvad$Location <- factor(apply(mvad[,5:9], 1L, function(x)
                        which(x == "yes")), labels = colnames(mvad[,5:9]))
mvad          <- list(covariates = mvad[c(3:4,10:14,87)],
                      sequences = mvad[,15:86],
                      weights = mvad[,2])
mvad.cov      <- mvad$covariates
# Create a state sequence object with the first two (summer) time points removed
states        <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels        <- c("Employment", "Further Education", "Higher Education",
                   "Joblessness", "School", "Training")
mvad.seq      <- seqdef(mvad$sequences[-c(1,2)], states=states, labels=labels)
# Fit a range of exponential-distance models without clustering
mod0          <- MEDseq_fit(mvad.seq, G=1)
# Fit a range of unweighted mixture models without covariates
# Only consider models with a noise component
# Supply some MEDseq_control() arguments
# mod1        <- MEDseq_fit(mvad.seq, G=9:10, modtype=c("CCN", "CUN", "UCN", "UUN"),
#                           algo="CEM", init.z="kmodes", criterion="icl")
# Fit a model with weights and a gating covariate
# Have the probability of noise-component membership be constant
mod2          <- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights,
                            gating=~ gcse5eq, covars=mvad.cov, noise.gate=FALSE)
# Examine this model in greater detail
summary(mod2, classification=TRUE, parameters=TRUE)
summary(mod2$gating, SPS=TRUE)
print(mod2$params$theta, SPS=TRUE)
plot(mod2, "clusters")