mixR-package {mixR} | R Documentation |
Finite Mixture Modeling for Raw and Binned Data
Description
The package mixR performs maximum likelihood estimation for finite
mixture models of families including Normal, Weibull, Gamma and Lognormal via the EM algorithm.
It also conducts model selection using information criteria or the bootstrap likelihood ratio
test. The data used for mixture model fitting can be raw data or binned data. The model fitting
is accelerated by the R package Rcpp.
Details
Finite mixture models can be represented by

f(x; \Phi) = \sum_{j = 1}^g \pi_j f_j(x; \theta_j)

where f(x; \Phi) is the probability density function (p.d.f.) or probability mass function
(p.m.f.) of the mixture model, f_j(x; \theta_j) is the p.d.f. or p.m.f. of the jth
component of the mixture model, \pi_j is the proportion of the jth component,
\theta_j is the parameter of the jth component (a scalar or a vector), and
\Phi is the vector of all the parameters of the mixture model. The maximum likelihood
estimate of the parameter vector \Phi can be obtained by
the EM algorithm (Dempster et al., 1977).
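The mixture density above can be evaluated directly in base R. The sketch below (component parameters chosen purely for illustration) computes f(x; \Phi) for a two-component normal mixture:

```r
# f(x; Phi) = sum_j pi_j f_j(x; theta_j) for a normal mixture.
# w = mixing proportions pi_j, mu/s = component means and standard deviations.
dmix <- function(x, w, mu, s) {
  dens <- vapply(seq_along(w),
                 function(j) w[j] * dnorm(x, mu[j], s[j]),
                 numeric(length(x)))
  if (is.matrix(dens)) rowSums(dens) else sum(dens)
}

dmix(c(0, 3), w = c(0.4, 0.6), mu = c(0, 3), s = c(1, 1))
```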
Binned data is sometimes available instead of raw data, for reasons of storage
convenience or necessity. Binned data is recorded in the form (a_i, b_i, n_i),
where a_i is the lower bound of the ith bin, b_i is the upper bound of the ith bin,
and n_i is the number of observations that fall in the ith bin, for i = 1, \dots, r,
with r the total number of bins.
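Raw data can be converted to this form with base R alone. The sketch below (bin width chosen arbitrarily) builds the (a_i, b_i, n_i) triplets as a three-column matrix:

```r
# Convert raw observations to binned form (a_i, b_i, n_i).
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)
breaks <- seq(floor(min(x)), ceiling(max(x)), by = 1)    # unit-width bins
counts <- as.integer(table(cut(x, breaks, include.lowest = TRUE)))
binned <- cbind(a = head(breaks, -1),   # lower bounds a_i
                b = tail(breaks, -1),   # upper bounds b_i
                n = counts)             # counts n_i
```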
To obtain the maximum likelihood estimate of a finite mixture model from binned data, we can
introduce two types of latent variables x and z, where x represents the
value of the unknown raw data, and z is a vector of zeros and a single one indicating the
component that x belongs to. To use the EM algorithm we first write the complete-data
log-likelihood

Q(\Phi; \Phi^{(p)}) = \sum_{j = 1}^{g} \sum_{i = 1}^{r} n_i z_{ij}^{(p)} [\log f_j(x_{ij}^{(p)}; \theta_j) + \log \pi_j]

where z_{ij}^{(p)} and x_{ij}^{(p)} are the expected values of z and x
given the estimate of \Phi at the pth iteration. The estimate of \Phi
is updated iteratively via the E-step, in which we calculate the expected values of
the latent variables x and z, and the M-step, in which we update \Phi by maximizing
the complete-data log-likelihood. The EM algorithm is terminated by a stopping
rule.
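For raw data the two steps take a particularly simple form for a normal mixture. The base-R sketch below is illustrative only (fixed iteration count instead of a stopping rule, two components, no binning); mixR itself implements the EM algorithm in C++ and supports more families and binned data:

```r
# Minimal EM for a two-component normal mixture on raw data.
em_norm2 <- function(x, iter = 200) {
  w  <- c(0.5, 0.5)                                 # initial proportions
  mu <- quantile(x, c(0.25, 0.75), names = FALSE)   # initial means
  s  <- rep(sd(x), 2)                               # initial sds
  for (p in seq_len(iter)) {
    # E-step: expected latent indicators z_ij given the current estimate of Phi
    d <- cbind(w[1] * dnorm(x, mu[1], s[1]),
               w[2] * dnorm(x, mu[2], s[2]))
    z <- d / rowSums(d)
    # M-step: update Phi by maximizing the complete-data log-likelihood
    w  <- colMeans(z)
    mu <- colSums(z * x) / colSums(z)
    s  <- sqrt(colSums(z * outer(x, mu, "-")^2) / colSums(z))
  }
  list(pi = w, mu = mu, sd = s)
}
```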
The M-step of the EM algorithm may or may not have a closed-form solution. When it does
not (e.g. for the Weibull mixture model or Gamma mixture model), an iterative approach
such as Newton's method or the bisection method may be used.
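As an illustration of such a root-finding step, the sketch below solves the weighted score equation for the Weibull shape parameter k with base R's bracketing root finder uniroot(); the equation and the bracketing interval are assumptions made for this example, not code from the package:

```r
# Weighted MLE of the Weibull shape k: root of the profile score equation
#   sum(w x^k log x) / sum(w x^k) - 1/k - sum(w log x) / sum(w) = 0,
# found by bracketing, as an M-step without a closed form might do.
weibull_shape <- function(x, w = rep(1, length(x))) {
  score <- function(k) {
    sum(w * x^k * log(x)) / sum(w * x^k) - 1 / k - sum(w * log(x)) / sum(w)
  }
  uniroot(score, interval = c(0.01, 100))$root
}
```

Given k, the Weibull scale parameter then has a closed-form update.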
For a given data set, when we have no prior information about the number of components
g, its value should be estimated from the data. Because mixture models do not satisfy
the regularity conditions for the likelihood ratio test (which require that the true
parameter under the null hypothesis lie in the interior of the parameter space
of the full model under the alternative hypothesis), a bootstrap approach is usually
used in the literature (see McLachlan (1987, 2004), Feng and McCulloch (1996)). The general
procedure of the bootstrap likelihood ratio test is as follows.
1. For the given data x, estimate \Phi under both the null and the alternative
hypothesis to get \hat\Phi_0 and \hat\Phi_1. Calculate the observed log-likelihoods
\ell(x; \hat\Phi_0) and \ell(x; \hat\Phi_1). The likelihood ratio test statistic is
defined as w_0 = -2(\ell(x; \hat\Phi_0) - \ell(x; \hat\Phi_1)).
2. Generate random data of the same size as the original data x from the model under
the null hypothesis using the estimated parameter \hat\Phi_0, then repeat step 1 using
the simulated data. Repeat this process B times to get a vector of the simulated
likelihood ratio test statistics w_1^{(1)}, \dots, w_1^{(B)}.
3. Calculate the empirical p-value

p = \frac{1}{B} \sum_{i=1}^B I(w_1^{(i)} > w_0)

where I is the indicator function.
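The empirical p-value in step 3 is a one-liner in R; the statistics below are made-up numbers purely to show the computation:

```r
# Empirical p-value: proportion of simulated statistics exceeding w_0.
w0 <- 3.0                          # observed LRT statistic (made up)
w1 <- c(0.8, 1.5, 2.1, 6.3, 9.7)   # B = 5 simulated statistics (made up)
p  <- mean(w1 > w0)                # equals (1/B) * sum(I(w1 > w0))
p                                  # 0.4
```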
This package does the following three things.

1. Fits finite mixture models to both raw data and binned data by the EM algorithm, together with the Newton-Raphson algorithm and the bisection method.
2. Performs the parametric bootstrap likelihood ratio test for two candidate models.
3. Performs model selection by the Bayesian information criterion.

To speed up computation, the EM algorithm is implemented in C++ using Rcpp (Eddelbuettel and Francois (2011)).
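A typical workflow is sketched below, using the function names documented for mixR (mixfit(), select(), bs.test(), rmixnormal()); the simulated data and parameter choices are illustrative, and the block only runs if the package is installed:

```r
# Sketch of a mixR workflow (guarded on the package being available).
if (requireNamespace("mixR", quietly = TRUE)) {
  library(mixR)
  set.seed(4)
  x   <- rmixnormal(500, c(0.4, 0.6), c(2, 5), c(1, 0.7))  # simulate a mixture
  fit <- mixfit(x, ncomp = 2)          # EM fit of a 2-component normal mixture
  sel <- select(x, ncomp = 2:4)        # compare g = 2..4 by BIC
  bst <- bs.test(x, ncomp = c(2, 3))   # bootstrap LRT of g = 2 vs g = 3
}
```

mixfit() also accepts a three-column matrix of binned data in the (a_i, b_i, n_i) form described above.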
Author(s)
Maintainer: Youjiao Yu jiaoisjiao@gmail.com
References
Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1-38, 1977.
Dirk Eddelbuettel and Romain Francois (2011). Rcpp: Seamless R and C++ Integration. Journal of Statistical Software, 40(8), 1-18. URL http://www.jstatsoft.org/v40/i08/.
Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Statist., 7(1):1-26, 01 1979.
Feng, Z. D. and McCulloch, C. E. Using bootstrap likelihood ratios in finite mixture models. Journal of the Royal Statistical Society. Series B (Methodological), pages 609-617, 1996.
Lo, Y., Mendell, N. R., and Rubin, D. B. Testing the number of components in a normal mixture. Biometrika, 88(3):767-778, 2001.
McLachlan, G. J. On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics, pages 318-324, 1987.
McLachlan, G. and Jones, P. Fitting mixture models to grouped and truncated data via the EM algorithm. Biometrics, pages 571-578, 1988.
McLachlan, G. and Peel, D. Finite mixture models. John Wiley & Sons, 2004.