fitMSmix {MSmix}    R Documentation

MLE of mixtures of Mallows models with Spearman distance via EM algorithms

Description

Perform the MLE of mixtures of Mallows models with Spearman distance on full and partial rankings via EM algorithms. Partial rankings with arbitrary missing positions are supported.

print method for class "emMSmix".

Usage

fitMSmix(
  rankings,
  n_clust = 1,
  n_start = 1,
  n_iter = 200,
  mc_em = FALSE,
  eps = 10^(-6),
  init = list(list(rho = NULL, theta = NULL, weights = NULL))[rep(1, n_start)],
  plot_log_lik = FALSE,
  comp_log_lik_part = FALSE,
  plot_log_lik_part = FALSE,
  parallel = FALSE,
  theta_max = 3,
  theta_tol = 1e-05,
  theta_tune = 1,
  subset = NULL,
  item_names = NULL
)

## S3 method for class 'emMSmix'
print(x, ...)

Arguments

rankings

Integer N×n matrix or data frame with partial rankings in each row. Missing positions must be coded as NA.

n_clust

Number of mixture components. Defaults to 1.

n_start

Number of starting points. Defaults to 1.

n_iter

Maximum number of EM iterations. Defaults to 200.

mc_em

Logical: whether the Monte Carlo EM algorithm must be used to complete partial rankings during MLE; see Details. Ignored when rankings does not contain any partial sequence. Defaults to FALSE.

eps

Positive tolerance value for the convergence of the EM algorithm. Defaults to 10^(-6).

init

List of n_start lists with the starting values of the parameters to initialize the EM algorithm. Each list must contain three named objects, namely: 1) rho: integer G×n matrix with the component-specific consensus rankings in each row; 2) theta: numeric vector of G non-negative component-specific precision parameters; 3) weights: numeric vector of G positive mixture weights. Defaults to NULL, meaning that the starting points are automatically generated from the uniform distribution.

plot_log_lik

Logical: whether the iterative log-likelihood values (based on full or augmented rankings) must be plotted. Defaults to FALSE.

comp_log_lik_part

Logical: whether the maximized observed-data log-likelihood value (based on partial rankings) must be returned. Ignored when rankings does not contain any partial sequence or data_augmentation cannot be applied. See Details. Defaults to FALSE.

plot_log_lik_part

Logical: whether the iterative observed-data log-likelihood values (based on partial rankings) must be plotted. Ignored when rankings does not contain any partial sequence. In the presence of partial rankings, this argument is ignored when comp_log_lik_part = FALSE or data_augmentation cannot be applied. Defaults to FALSE.

parallel

Logical: whether parallelization over multiple initializations must be used. Defaults to FALSE.

theta_max

Positive upper bound for the precision parameters. Defaults to 3.

theta_tol

Positive convergence tolerance for the M-step on theta. Defaults to 1e-05.

theta_tune

Positive tuning constant affecting the precision parameters in the Monte Carlo step. Ignored when rankings does not contain any partial sequence or mc_em = FALSE. Defaults to 1.

subset

Optional logical or integer vector specifying the subset of observations, i.e. rows of the rankings, to be kept. Missing values are taken as FALSE.

item_names

Character vector for the names of the items. Defaults to NULL, meaning that colnames(rankings) is used and, if not available, item_names is set equal to "Item1", "Item2", ....

x

An object of class "emMSmix" returned by fitMSmix.

...

Further arguments passed to or from other methods (not used).
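
The input formats described above for rankings and init can be sketched in base R as follows. All object names and values are made up for the illustration, and the sketch assumes the convention that entry (i, j) of rankings is the rank assigned to item j by the i-th ranker.

```r
# Hypothetical data: N = 4 rankings of n = 5 items, with NA marking a missing position.
partial_rankings <- rbind(
  c(1, 2, 3, 4, 5),      # full ranking
  c(1, 2, NA, NA, NA),   # only the two top-ranked items are observed
  c(NA, 4, NA, 1, NA),   # arbitrary missing positions
  c(3, NA, 1, NA, 2)     # only the three top-ranked items are observed
)
colnames(partial_rankings) <- paste0("Item", 1:5)
storage.mode(partial_rankings) <- "integer"

# Sanity checks: observed ranks lie in 1..n and are not repeated within a row.
n <- ncol(partial_rankings)
valid_row <- apply(partial_rankings, 1, function(r) {
  obs <- r[!is.na(r)]
  all(obs %in% seq_len(n)) && !anyDuplicated(obs)
})
n_missing <- rowSums(is.na(partial_rankings))

# Hypothetical starting values: G = 2 components, n_start = 3 starting points.
set.seed(1)
G <- 2
n_start <- 3
make_start <- function(G, n) {
  w <- runif(G)
  list(
    rho = t(replicate(G, sample.int(n))),  # G x n matrix, one permutation per row
    theta = runif(G, min = 0, max = 1),    # G non-negative precision parameters
    weights = w / sum(w)                   # G positive weights, normalised to sum to 1
  )
}
my_init <- replicate(n_start, make_start(G, n), simplify = FALSE)

# These objects could then be supplied as, e.g.,
# fitMSmix(rankings = partial_rankings, n_clust = G, n_start = n_start, init = my_init)
```

Normalising the weights to sum to one is a choice made in this sketch; the argument description only requires them to be positive.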

Details

The EM algorithms are launched from n_start initializations and the best solution in terms of maximized log-likelihood value (based on full or augmented rankings) is returned.

When mc_em = FALSE, the scheme introduced by Crispino et al. (2023) is performed, where partial rankings are augmented with all compatible full rankings. This type of data augmentation is supported up to 10 missing positions in the partial rankings.
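
To see why exhaustive augmentation is capped, note that a row with k missing positions has k! compatible full rankings. The following base R sketch enumerates them for a small example; the helper names perms and complete_ranking are hypothetical and are not part of the package.

```r
# All permutations of a vector (simple recursive helper, fine for small inputs).
perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) cbind(v[i], perms(v[-i]))))
}

# All full rankings compatible with a partial ranking r (NA = missing position).
complete_ranking <- function(r) {
  miss <- which(is.na(r))
  if (length(miss) == 0) return(matrix(r, nrow = 1))
  free_ranks <- setdiff(seq_along(r), r)   # ranks not yet assigned to any item
  fills <- perms(free_ranks)
  out <- matrix(r, nrow = nrow(fills), ncol = length(r), byrow = TRUE)
  out[, miss] <- fills                     # one permutation of the free ranks per row
  out
}

r <- c(NA, 4, NA, 1, NA)
completions <- complete_ranking(r)
nrow(completions)   # 3! = 6 compatible full rankings
factorial(10)       # 3628800 completions for a single row with 10 missing positions
```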

When mc_em = TRUE, the computationally more efficient Monte Carlo EM algorithm introduced by Crispino et al. (2024+) is implemented. With a large number of censored positions and a large sample size, mc_em = TRUE should be preferred.

Regardless of the fitting method adopted for inference on partial rankings, note that setting comp_log_lik_part = TRUE to compute the observed-data log-likelihood values (based on partial rankings) can slow down the procedure when the number of censored positions and the sample size are large.

Value

An object of class "emMSmix", namely a list with the following named components:

mod

List of named objects describing the best fitted model in terms of maximized log-likelihood over the n_start initializations. Its components are listed after the top-level components.

max_log_lik

Maximized log-likelihood values for each initialization.

partial_data

Logical: whether the dataset includes some partially-ranked sequences.

convergence

Binary convergence indicators of the EM algorithm for each initialization: 1 = convergence has been achieved, 0 = otherwise.

record

Best log-likelihood values sequentially achieved over the n_start initializations.

em_settings

List of settings used to fit the model.

call

The matched call.
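
As a rough illustration of how max_log_lik and record relate, the sketch below uses made-up log-likelihood values and assumes that record is the running maximum of max_log_lik across initializations, which is how the description above reads; the package internals are not shown here.

```r
# Hypothetical maximized log-likelihoods from n_start = 5 initializations.
max_log_lik <- c(-512.3, -498.7, -505.1, -498.2, -501.0)

# Initialization whose fit would be kept as the best model.
best_start <- which.max(max_log_lik)

# Running best value after each initialization (assumed meaning of 'record').
record <- cummax(max_log_lik)
```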

The mod sublist contains the following named objects:

rho

Integer G×n matrix with the MLEs of the component-specific consensus rankings in each row.

theta

Numeric vector with the MLEs of the G component-specific precision parameters.

weights

Numeric vector with the MLEs of the G mixture weights.

z_hat

Numeric N×G matrix of the estimated posterior component membership probabilities. Returned when n_clust > 1, otherwise NULL.

map_classification

Integer vector of N mixture component memberships based on the MAP allocation from the z_hat matrix. Returned when n_clust > 1, otherwise NULL.

log_lik

Numeric vector of the log-likelihood values (based on full or augmented rankings) at each iteration.

best_log_lik

Maximized log-likelihood value (based on full or augmented rankings) of the fitted model.

bic

BIC value of the fitted model based on best_log_lik.

log_lik_part

Numeric vector of the observed-data log-likelihood values (based on partial rankings) at each iteration. Returned when rankings contains some partial sequences that can be completed with data_augmentation and plot_log_lik_part = TRUE, otherwise NULL. See Details.

best_log_lik_part

Maximized observed-data log-likelihood value (based on partial rankings) of the fitted model. Returned when rankings contains some partial sequences that can be completed with data_augmentation, otherwise NULL. See Details.

bic_part

BIC value of the fitted model based on best_log_lik_part. Returned when rankings contains some partial sequences that can be completed with data_augmentation, otherwise NULL. See Details.

conv

Binary convergence indicator of the best fitted model: 1 = convergence has been achieved, 0 = otherwise.

augmented_rankings

Integer N×n matrix with rankings completed through the Monte Carlo step in each row. Returned when rankings contains some partial sequences and mc_em = TRUE, otherwise NULL.
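
The link between z_hat and map_classification can be sketched in base R with a made-up matrix of posterior probabilities: the MAP allocation assigns each ranking to the component with the largest posterior membership probability. The values below are hypothetical and tie handling in the package may differ.

```r
# Hypothetical posterior membership probabilities: N = 4 rankings, G = 3 components.
z_hat <- rbind(
  c(0.80, 0.15, 0.05),
  c(0.10, 0.20, 0.70),
  c(0.30, 0.60, 0.10),
  c(0.05, 0.05, 0.90)
)

# MAP allocation: index of the largest posterior probability in each row.
map_classification <- apply(z_hat, 1, which.max)
map_classification   # 1 3 2 3
```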

References

Crispino M, Mollica C and Modugno L (2024+). MSmix: An R Package for clustering partial rankings via mixtures of Mallows Models with Spearman distance. (submitted)

Crispino M, Mollica C, Astuti V and Tardella L (2023). Efficient and accurate inference for mixtures of Mallows models with Spearman distance. Statistics and Computing, 33(98), DOI: 10.1007/s11222-023-10266-8.

Sørensen Ø, Crispino M, Liu Q and Vitelli V (2020). BayesMallows: An R Package for the Bayesian Mallows Model. The R Journal, 12(1), pages 324–342, DOI: 10.32614/RJ-2020-026.

Beckett LA (1993). Maximum likelihood estimation in Mallows’s model using partially ranked data. In Probability models and statistical analyses for ranking data, pages 92–107. Springer New York.

See Also

summary.emMSmix, plot.emMSmix

Examples

## Example 1. Fit the 3-component mixture of Mallows models with Spearman distance
## to the Antifragility dataset.
r_antifrag <- ranks_antifragility[, 1:7]
set.seed(123)
mms_fit <- fitMSmix(rankings = r_antifrag, n_clust = 3, n_start = 10)
mms_fit$mod$rho; mms_fit$mod$theta; mms_fit$mod$weights

## Example 2. Fit the Mallows model with Spearman distance
## to simulated partial rankings through data augmentation.
rank_data <- rbind(c(NA, 4, NA, 1, NA), c(NA, NA, NA, NA, 1), c(2, NA, 1, NA, 3),
                   c(4, 2, 3, 5, 1), c(NA, 4, 1, 3, 2))
mms_fit <- fitMSmix(rankings = rank_data, n_start = 10)
mms_fit$mod$rho; mms_fit$mod$theta

## Example 3. Fit the Mallows model with Spearman distance
## to the Reading genres dataset through Monte Carlo EM.
top5_read <- ranks_read_genres[, 1:11]
mms_fit <- fitMSmix(rankings = top5_read, n_start = 10, mc_em = TRUE)
mms_fit$mod$rho; mms_fit$mod$theta


[Package MSmix version 1.0.2 Index]