MGHM {MixtureMissing}    R Documentation

Multivariate Generalized Hyperbolic Mixture (MGHM)

Description

Carries out model-based clustering using a multivariate generalized hyperbolic mixture (MGHM). The function determines automatically whether the data set is complete or incomplete and fits the appropriate model. In the incomplete case, the data set must be at least bivariate, and missing values are assumed to be missing at random (MAR).

Usage

MGHM(
  X,
  G,
  model = c("GH", "NIG", "SNIG", "SC", "C", "St", "t", "N", "SGH", "HUM", "H", "SH"),
  criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
  max_iter = 20,
  epsilon = 0.01,
  init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
  clusters = NULL,
  outlier_cutoff = 0.95,
  deriv_ctrl = list(eps = 1e-08, d = 1e-04, zero.tol = sqrt(.Machine$double.eps/7e-07),
    r = 6, v = 2, show.details = FALSE),
  progress = TRUE
)

Arguments

X

An n x d matrix or data frame where n is the number of observations and d is the number of variables.

G

An integer vector specifying the numbers of clusters; each value must be at least 1.

model

A string indicating the mixture model to be fitted; "GH" for generalized hyperbolic by default. See the details section for a list of available distributions.

criterion

A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria.

max_iter

(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default.

epsilon

(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm; 0.01 by default.
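As a rough illustration of how an Aitken-based stopping rule uses epsilon, the sketch below applies one common form of the criterion to a log-likelihood trace. The helper name aitken_converged is hypothetical, and the exact variant used inside MGHM is not stated on this page, so treat this as an assumption rather than the package's implementation.

```r
# Aitken-accelerated convergence check on a log-likelihood trace.
# `aitken_converged` is a hypothetical helper, not a MixtureMissing function.
aitken_converged <- function(loglik, epsilon = 0.01) {
  k <- length(loglik)
  if (k < 3) return(FALSE)                 # three consecutive values are needed
  l_prev <- loglik[k - 2]
  l_curr <- loglik[k - 1]
  l_next <- loglik[k]
  denom <- l_curr - l_prev
  if (denom == 0) return(abs(l_next - l_curr) < epsilon)
  a <- (l_next - l_curr) / denom           # Aitken acceleration factor
  l_inf <- l_curr + (l_next - l_curr) / (1 - a)  # estimated limiting value
  abs(l_inf - l_next) < epsilon            # stop when close to the limit
}

# Toy trace that converges geometrically to -50
trace <- -50 - 10 * 0.5^(1:10)
aitken_converged(trace[1:9])   # still too far from the estimated limit
aitken_converged(trace)        # within epsilon of the estimated limit
```

A smaller epsilon demands that the current log-likelihood be closer to its estimated limit, so it generally requires more iterations before stopping.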

init_method

(optional) A string specifying the method used to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", "mclust", and "manual". When "manual" is chosen, the vector clusters of length n must be specified. If the data set is incomplete, missing values are first filled in by mean imputation.
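The mean-imputation step before initialization can be pictured with the base-R sketch below. The helper mean_impute is hypothetical, and k-means is used here only because it ships with base R (the function's default is "kmedoids"); the package's internal code may differ.

```r
# Column-mean imputation followed by k-means to get a starting partition.
# `mean_impute` is a hypothetical helper, not part of MixtureMissing.
mean_impute <- function(X) {
  X <- as.matrix(X)
  for (j in seq_len(ncol(X))) {
    miss <- is.na(X[, j])
    X[miss, j] <- mean(X[, j], na.rm = TRUE)  # mean of the observed values in column j
  }
  X
}

set.seed(1)
X <- cbind(c(1, 2, NA, 10, 11), c(5, NA, 6, 20, 21))
X_filled <- mean_impute(X)

# Initial memberships, one label per observation
init <- kmeans(X_filled, centers = 2)$cluster
```

A vector such as init could in principle be supplied through clusters together with init_method = "manual".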

clusters

(optional) A vector of length n giving the user-specified initial cluster memberships when init_method is set to "manual". Both numeric and character vectors are acceptable. This argument is NULL by default, so it is ignored whenever another initialization method is chosen.

outlier_cutoff

(optional) A number between 0 and 1 indicating the percentile cutoff used for outlier detection. This is only relevant for the t mixture.

deriv_ctrl

(optional) A list containing arguments to control the numerical procedures for calculating the first and second derivatives. Some values are suggested by default. Refer to functions grad and hessian under the package numDeriv for more information.
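The entries of deriv_ctrl have the same names as the method.args accepted by numDeriv's Richardson method. The sketch below passes the default list to grad and hessian on a toy function; that MGHM forwards the list in exactly this way is an assumption, and the function f is purely illustrative.

```r
# Default control list, copied from the Usage section
deriv_ctrl <- list(eps = 1e-08, d = 1e-04,
                   zero.tol = sqrt(.Machine$double.eps / 7e-07),
                   r = 6, v = 2, show.details = FALSE)

# Toy objective with a known gradient and Hessian (illustrative only)
f <- function(x) sum(x^2) + x[1] * x[2]

if (requireNamespace("numDeriv", quietly = TRUE)) {
  g <- numDeriv::grad(f, c(1, 2), method = "Richardson", method.args = deriv_ctrl)
  H <- numDeriv::hessian(f, c(1, 2), method = "Richardson", method.args = deriv_ctrl)
  print(g)  # analytic gradient at (1, 2) is (4, 5)
  print(H)  # analytic Hessian is rbind(c(2, 1), c(1, 2))
}
```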

progress

(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default.

Details

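Besides the generalized hyperbolic distribution, the function can fit mixtures based on its special and limiting cases. Available distributions include

GH - Generalized Hyperbolic
NIG - Normal-Inverse Gaussian
SNIG - Symmetric Normal-Inverse Gaussian
SC - Skew-Cauchy
C - Cauchy
St - Skew-t
t - Student's t
N - Normal or Gaussian
SGH - Symmetric Generalized Hyperbolic
HUM - Hyperbolic Univariate Marginals
H - Hyperbolic
SH - Symmetric Hyperbolic

Available information criteria include

AIC - Akaike information criterion
BIC - Bayesian information criterion
KIC - Kullback information criterion
KICc - Corrected Kullback information criterion
AIC3 - Modified AIC
CAIC - Bozdogan's consistent AIC
AICc - Small-sample version of AIC
ICL - Integrated Completed Likelihood criterion
AWE - Approximate weight of evidence
CLC - Classification likelihood criterion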
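Several of the likelihood-based criteria have simple textbook forms in terms of the final log-likelihood, the total number of free parameters, and the sample size. The sketch below computes five of them with a hypothetical helper info_criteria; the "smaller is better" sign convention and the use of a single total parameter count are assumptions, and the package may scale or sign its values differently.

```r
# Textbook forms of selected information criteria (smaller is better).
# `info_criteria` is a hypothetical helper, not a MixtureMissing function.
info_criteria <- function(loglik, npar, n) {
  c(
    AIC  = -2 * loglik + 2 * npar,
    BIC  = -2 * loglik + npar * log(n),
    AIC3 = -2 * loglik + 3 * npar,
    CAIC = -2 * loglik + npar * (log(n) + 1),
    AICc = -2 * loglik + 2 * npar + 2 * npar * (npar + 1) / (n - npar - 1)
  )
}

# Illustrative numbers only: log-likelihood -100, 5 parameters, 50 observations
ic <- info_criteria(loglik = -100, npar = 5, n = 50)
round(ic, 2)
```

Under this convention, the candidate model with the smallest value of the chosen criterion is preferred.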

Value

An object of class MixtureMissing with:

model

The model used to fit the data set.

pi

Mixing proportions.

mu

Component location vectors.

Sigma

Component dispersion matrices.

beta

Component skewness vectors. Only available if model is GH, NIG, SNIG, SC, SGH, HUM, H, or SH; NULL otherwise.

lambda

Component index parameters. Only available if model is GH, NIG, SNIG, SGH, HUM, H, or SH; NULL otherwise.

omega

Component concentration parameters. Only available if model is GH, NIG, SNIG, SGH, HUM, H, or SH; NULL otherwise.

df

Component degrees of freedom. Only available if model is St or t; NULL otherwise.

z_tilde

An n by G matrix where each row indicates the expected probabilities that the corresponding observation belongs to each cluster.

clusters

A numeric vector of length n indicating cluster memberships determined by the model.
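The relationship between z_tilde and clusters can be sketched as a maximum a posteriori assignment, where each observation goes to the cluster with the largest probability in its row. That the package derives clusters exactly this way is an assumption, and the matrix below is made up for illustration.

```r
# Each row holds one observation's membership probabilities over G = 2 clusters
z_tilde <- rbind(c(0.90, 0.10),
                 c(0.20, 0.80),
                 c(0.55, 0.45))

# Assign each observation to the column with the largest probability
clusters <- apply(z_tilde, 1, which.max)
clusters
```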

outliers

A logical vector of length n indicating observations that are outliers. Only available if model is t; NULL otherwise.

data

The original data set if it is complete; otherwise, this is the data set with missing values imputed by appropriate expectations.

complete

An n by d logical matrix indicating which cells of the data set are not missing.
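A matrix of this kind can be reproduced in base R from the raw data, as in the short sketch below. The toy data and the variable names observed and full_rows are illustrative, not package output.

```r
# Toy incomplete data set with n = 3 observations and d = 2 variables
X <- cbind(c(1.2, NA, 3.1), c(0.5, 0.7, NA))

observed  <- !is.na(X)                    # TRUE where a cell is not missing
full_rows <- rowSums(observed) == ncol(X) # observations with no missing cells
```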

npar

The breakdown of the number of parameters to estimate.

max_iter

Maximum number of iterations allowed in the EM algorithm.

iter_stop

The actual number of iterations needed when fitting the data set.

final_loglik

The final value of log-likelihood.

loglik

All the values of log-likelihood.

AIC

Akaike information criterion.

BIC

Bayesian information criterion.

KIC

Kullback information criterion.

KICc

Corrected Kullback information criterion.

AIC3

Modified AIC.

CAIC

Bozdogan's consistent AIC.

AICc

Small-sample version of AIC.

ent

Entropy.

ICL

Integrated Completed Likelihood criterion.

AWE

Approximate weight of evidence.

CLC

Classification likelihood criterion.

init_method

The initialization method used in model fitting.

References

Browne, R. P. and McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2):176–198.

Wei, Y., Tang, Y., and McNicholas, P. D. (2019). Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data. Computational Statistics & Data Analysis, 130:18–41.

Examples


library(MixtureMissing)

data('bankruptcy')

#++++ With no missing values ++++#

X <- bankruptcy[, 2:3]
mod <- MGHM(X, G = 2, init_method = 'kmedoids', max_iter = 10)

summary(mod)
plot(mod)

#++++ With missing values ++++#

set.seed(1234)

X <- hide_values(bankruptcy[, 2:3], prop_cases = 0.1)
mod <- MGHM(X, G = 2, init_method = 'kmedoids', max_iter = 10)

summary(mod)
plot(mod)


[Package MixtureMissing version 3.0.2]