mixR-package {mixR}    R Documentation
Finite Mixture Modeling for Raw and Binned Data
Description
The package mixR performs maximum likelihood estimation for finite
mixture models for families including Normal, Weibull, Gamma and Lognormal via the EM algorithm.
It also conducts model selection by using information criteria or the bootstrap likelihood ratio
test. The data used for mixture model fitting can be raw data or binned data. The model fitting
is accelerated by using the R package Rcpp.
Details
Finite mixture models can be represented by
$$f(x; \Phi) = \sum_{j = 1}^{g} \pi_j f_j(x; \theta_j)$$
where $f(x; \Phi)$ is the probability density function (p.d.f.) or probability mass function
(p.m.f.) of the mixture model, $f_j(x; \theta_j)$ is the p.d.f. or p.m.f. of the $j$th
component of the mixture model, $\pi_j$ is the proportion of the $j$th component,
$\theta_j$ is the parameter of the $j$th component, which can be a scalar or a vector,
$g$ is the number of components, and $\Phi$ is a vector of all the parameters of the
mixture model. The maximum likelihood estimate of the parameter vector $\Phi$ can be
obtained by using the EM algorithm (Dempster et al., 1977).
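As a rough illustration of the mixture density and of EM on raw data, the following base-R sketch fits a normal mixture. The helpers `dmix_normal` and `em_normal`, the simulated data and the starting values are all hypothetical and are not part of mixR; the package's own implementation may differ.

```r
# Density of a g-component normal mixture: sum_j pi_j * dnorm(x, mu_j, sd_j)
dmix_normal <- function(x, pi, mu, sd) {
  dens <- sapply(seq_along(pi), function(j) pi[j] * dnorm(x, mu[j], sd[j]))
  rowSums(matrix(dens, nrow = length(x)))
}

# EM for a normal mixture fitted to raw data (illustrative helper, not mixR code)
em_normal <- function(x, pi, mu, sd, tol = 1e-8, max_iter = 500) {
  loglik_old <- -Inf
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that observation i belongs to component j
    w <- sapply(seq_along(pi), function(j) pi[j] * dnorm(x, mu[j], sd[j]))
    z <- w / rowSums(w)
    # M-step: closed-form weighted updates for the normal family
    nj <- colSums(z)
    pi <- nj / length(x)
    mu <- colSums(z * x) / nj
    sd <- sqrt(colSums(z * outer(x, mu, "-")^2) / nj)
    # Stopping rule (an assumption): small change in the observed log-likelihood
    loglik <- sum(log(dmix_normal(x, pi, mu, sd)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(pi = pi, mu = mu, sd = sd, loglik = loglik, iter = iter)
}

set.seed(1)
x <- c(rnorm(300, mean = 0, sd = 1), rnorm(200, mean = 5, sd = 1.5))
fit <- em_normal(x, pi = c(0.5, 0.5), mu = c(-1, 4), sd = c(1, 1))
```

The list `fit` holds the estimated proportions, means and standard deviations, together with the final log-likelihood.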
Binned data is sometimes available instead of raw data, for reasons of storage
convenience or necessity. Binned data is recorded in the form $(a_i, b_i, n_i)$,
where $a_i$ is the lower bound of the $i$th bin, $b_i$ is the upper bound of the
$i$th bin, and $n_i$ is the number of observations that fall in the $i$th bin, for
$i = 1, \dots, r$, where $r$ is the total number of bins.
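To make the binned format concrete, here is a minimal base-R sketch that turns simulated raw data into a matrix with one row per bin (lower bound, upper bound, count). The data, the unit-width breaks and the column names `a`, `b`, `n` are assumptions chosen for illustration only.

```r
set.seed(2)
x <- rnorm(1000, mean = 10, sd = 2)                    # simulated raw data
brks <- seq(floor(min(x)), ceiling(max(x)), by = 1)    # bin boundaries

# Count the observations in each bin [a_i, b_i); the last bin is closed on the right
counts <- as.vector(table(cut(x, breaks = brks, right = FALSE,
                              include.lowest = TRUE)))

# One row per bin: lower bound a_i, upper bound b_i, count n_i
binned <- cbind(a = head(brks, -1), b = tail(brks, -1), n = counts)
```

Each row of `binned` corresponds to one bin, and the counts add up to the number of raw observations.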
To obtain the maximum likelihood estimate of the finite mixture model for binned data, we can
introduce two types of latent variables $x$ and $z$, where $x$ represents the
value of the unknown raw data, and $z$ is a vector of zeros and a single one indicating the
component that $x$ belongs to. To use the EM algorithm we first write the complete-data
log-likelihood
$$Q(\Phi; \Phi^{(p)}) = \sum_{j = 1}^{g} \sum_{i = 1}^{r} n_i z^{(p)} \left[\log f_j(x^{(p)}; \theta_j) + \log \pi_j\right]$$
where, for bin $i$ and component $j$, $z^{(p)}$ is the expected value of $z$ given the
estimated value of $\Phi$ and the expected value $x^{(p)}$ at the $p$th iteration.
The estimated value of $\Phi$ is updated iteratively by alternating the E-step, in which we
calculate the expected values of the latent variables $x$ and $z$, and the M-step, in which
we update $\Phi$ by maximizing the complete-data log-likelihood. The EM algorithm is
terminated by a stopping rule.
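For the normal family, the expected values of the two latent variables have simple forms: the expected membership of a bin in a component is proportional to the component's weighted bin probability, and the expected raw value is the mean of the component truncated to the bin. The sketch below follows that standard construction; the helper `estep_binned_normal`, the bins and the parameter values are hypothetical, and mixR's internal computation may differ.

```r
# Expected latent values for a normal mixture and bins (a_i, b_i).
# Illustrative helper, not part of mixR.
estep_binned_normal <- function(a, b, pi, mu, sd) {
  g <- length(pi)
  # P[i, j]: probability that component j places an observation in bin i
  P <- sapply(seq_len(g), function(j) pnorm(b, mu[j], sd[j]) - pnorm(a, mu[j], sd[j]))
  P <- matrix(P, nrow = length(a))
  # z[i, j]: expected membership of bin i in component j
  w <- sweep(P, 2, pi, "*")
  z <- w / rowSums(w)
  # x_exp[i, j]: mean of component j truncated to bin i
  x_exp <- sapply(seq_len(g), function(j) {
    alpha <- (a - mu[j]) / sd[j]
    beta  <- (b - mu[j]) / sd[j]
    mu[j] + sd[j] * (dnorm(alpha) - dnorm(beta)) / P[, j]
  })
  list(z = z, x_exp = matrix(x_exp, nrow = length(a)))
}

a <- 0:9    # lower bounds of ten unit-width bins
b <- 1:10   # upper bounds
e <- estep_binned_normal(a, b, pi = c(0.4, 0.6), mu = c(3, 7), sd = c(1, 1.5))
```

Each row of `e$z` sums to one, and each entry of `e$x_exp` lies inside its bin. Bins far out in a component's tail can make the bin probability underflow, so a production implementation would need extra numerical care.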
The M-step of the EM algorithm may not have a closed-form solution (as in, e.g., the Weibull
mixture model or the Gamma mixture model). In that case, an iterative approach such as Newton's
algorithm or the bisection method may be used.
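As an example of an M-step without a closed form, the shape parameter of a Gamma component solves log(shape) - digamma(shape) = log(weighted mean of x) - weighted mean of log(x), which can be handled by Newton's algorithm. The following is a sketch under that assumption; the helper name `gamma_shape_newton`, the starting value and the tolerance are illustrative choices and not taken from mixR.

```r
# Weighted maximum likelihood for a Gamma component via Newton-Raphson.
# w holds the component's posterior weights (all equal for a single Gamma sample).
gamma_shape_newton <- function(x, w = rep(1, length(x)), tol = 1e-10, max_iter = 100) {
  w <- w / sum(w)
  s <- log(sum(w * x)) - sum(w * log(x))        # positive by Jensen's inequality
  alpha <- (3 - s + sqrt((s - 3)^2 + 24 * s)) / (12 * s)   # approximate starting value
  for (iter in seq_len(max_iter)) {
    g  <- log(alpha) - digamma(alpha) - s       # equation to solve: g(alpha) = 0
    dg <- 1 / alpha - trigamma(alpha)           # derivative of g (always negative)
    alpha_new <- alpha - g / dg
    if (alpha_new <= 0) alpha_new <- alpha / 2  # keep the iterate positive
    converged <- abs(alpha_new - alpha) < tol
    alpha <- alpha_new
    if (converged) break
  }
  # The scale has a closed form once the shape is known
  c(shape = alpha, scale = sum(w * x) / alpha)
}

set.seed(3)
x <- rgamma(5000, shape = 3, scale = 2)
est <- gamma_shape_newton(x)
```

Within an EM iteration, `w` would be the current posterior weights of the component, so the same routine serves as that component's M-step.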
For a given data set, when we have no prior information about the number of components
$g$, its value should be estimated from the data. Because mixture models don't satisfy
the regularity condition for the likelihood ratio test (which requires that the true
parameter under the null hypothesis be in the interior of the parameter space
of the full model under the alternative hypothesis), a bootstrap approach is usually
used in the literature (see McLachlan (1987), McLachlan and Peel (2004), Feng and McCulloch (1996)).
The general steps of the bootstrap likelihood ratio test are as follows.
1. For the given data $x$, estimate $\Phi$ under both the null and the alternative hypothesis
   to get $\hat\Phi_0$ and $\hat\Phi_1$. Calculate the observed log-likelihoods
   $\ell(x; \hat\Phi_0)$ and $\ell(x; \hat\Phi_1)$. The likelihood ratio test statistic is defined as
   $$w_0 = -2\left(\ell(x; \hat\Phi_0) - \ell(x; \hat\Phi_1)\right).$$
2. Generate random data of the same size as the original data $x$
   from the model under the null hypothesis using the estimated parameter $\hat\Phi_0$,
   then repeat step 1 using the simulated data. Repeat this process $B$
   times to get a vector of simulated likelihood ratio test statistics $w_1, \dots, w_B$.
3. Calculate the empirical p-value
   $$p = \frac{1}{B} \sum_{i = 1}^{B} I(w_i > w_0),$$
   where $I$ is the indicator function.
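The three steps above can be sketched in base R for the simplest case, one normal component (null) against a two-component normal mixture (alternative). All helper names (`loglik_mix`, `fit1`, `fit2`, `lrt_stat`, `bootstrap_lrt`), the quantile-based starting values and the small number of bootstrap replicates are assumptions made for illustration; this is not the package's own test function.

```r
# Log-likelihood of a normal mixture
loglik_mix <- function(x, pi, mu, sigma) {
  dens <- sapply(seq_along(pi), function(j) pi[j] * dnorm(x, mu[j], sigma[j]))
  sum(log(rowSums(matrix(dens, nrow = length(x)))))
}

# Null model: a single normal component (closed-form maximum likelihood)
fit1 <- function(x) {
  mu <- mean(x)
  sigma <- sqrt(mean((x - mu)^2))
  list(mu = mu, sigma = sigma, loglik = sum(dnorm(x, mu, sigma, log = TRUE)))
}

# Alternative model: two normal components fitted by a plain EM
fit2 <- function(x, max_iter = 200, tol = 1e-6) {
  pi <- c(0.5, 0.5)
  mu <- unname(quantile(x, c(0.25, 0.75)))
  sigma <- rep(sd(x) / 2, 2)
  ll_old <- -Inf
  for (iter in seq_len(max_iter)) {
    w <- cbind(pi[1] * dnorm(x, mu[1], sigma[1]), pi[2] * dnorm(x, mu[2], sigma[2]))
    z <- w / rowSums(w)                                  # E-step
    nj <- colSums(z)                                     # M-step
    pi <- nj / length(x)
    mu <- colSums(z * x) / nj
    sigma <- sqrt(colSums(z * outer(x, mu, "-")^2) / nj)
    sigma <- pmax(sigma, 1e-3 * sd(x))                   # guard against collapse
    ll <- loglik_mix(x, pi, mu, sigma)
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  list(pi = pi, mu = mu, sigma = sigma, loglik = ll)
}

# Likelihood ratio test statistic: -2 * (loglik under H0 - loglik under H1)
lrt_stat <- function(x) -2 * (fit1(x)$loglik - fit2(x)$loglik)

bootstrap_lrt <- function(x, B = 50) {
  w0 <- lrt_stat(x)                                      # step 1
  null_fit <- fit1(x)
  w <- replicate(B, lrt_stat(rnorm(length(x), null_fit$mu, null_fit$sigma)))  # step 2
  list(w0 = w0, w = w, p_value = mean(w > w0))           # step 3
}

set.seed(4)
x <- c(rnorm(150, 0, 1), rnorm(100, 4, 1))
res <- bootstrap_lrt(x, B = 50)
```

A small `res$p_value` is evidence against the one-component model. In practice a larger `B` gives a finer-grained p-value at a proportionally higher cost.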
This package does the following three things.
1. Fits finite mixture models to both raw data and binned data by using the EM algorithm, together with the Newton-Raphson algorithm and the bisection method.
2. Performs the parametric bootstrap likelihood ratio test for two candidate models.
3. Performs model selection by the Bayesian information criterion.
To speed up computation, the EM algorithm is implemented in C++ by using Rcpp (Eddelbuettel and Francois, 2011).
Author(s)
Maintainer: Youjiao Yu <jiaoisjiao@gmail.com>
References
Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1-38, 1977.
Eddelbuettel, D. and Francois, R. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):1-18, 2011. URL http://www.jstatsoft.org/v40/i08/.
Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Statist., 7(1):1-26, 1979.
Feng, Z. D. and McCulloch, C. E. Using bootstrap likelihood ratios in finite mixture models. Journal of the Royal Statistical Society. Series B (Methodological), pages 609-617, 1996.
Lo, Y., Mendell, N. R., and Rubin, D. B. Testing the number of components in a normal mixture. Biometrika, 88(3):767-778, 2001.
McLachlan, G. J. On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics, pages 318-324, 1987.
McLachlan, G. and Jones, P. Fitting mixture models to grouped and truncated data via the EM algorithm. Biometrics, pages 571-578, 1988.
McLachlan, G. and Peel, D. Finite mixture models. John Wiley & Sons, 2004.