R: Model parameter estimation

estimate.mocca {fdaMocca}

R Documentation

Model parameter estimation

Description

Function to estimate model parameters by maximizing the observed log likelihood via an EM algorithm. The estimation procedure is based on an algorithm proposed by James and Sugar (2003).

This function is not normally called directly; it is a service routine for `mocca`. See the documentation of the `mocca` function for more detailed information on the arguments.

Usage

estimate.mocca(data,K=5,q=6,h=2,random=TRUE,B=NULL,svd=TRUE,
       use.covariates=FALSE,stand.cov=TRUE,index.cov=NULL,
       lambda=1.4e-4,EM.maxit=50, EMstep.tol=1e-8,Mstep.maxit=10,
       Mstep.tol=1e-4, EMplot=TRUE,trace=TRUE,n.cores=NULL)

Arguments

`data`	a list containing at least five objects (vectors) named as `x`, `time`, `timeindex`, `curve`, `grid`, `covariates` (optional). See `mocca` for the detailed explanation of each object.
`K`	number of clusters (default: `K=5`).
`q`	number of B-splines used to describe the individual curves. Evenly spaced knots are used (default: `q=6`). Currently only B-splines are implemented, although other basis functions, such as a Fourier basis, could in principle be used.
`h`	a positive integer, the dimension of the parameter vector in the low-dimensional representation of the curves (spline coefficients). `h` should be less than or equal to the number of clusters `K` (default: `h=2`).
`random`	`TRUE/FALSE`, if `TRUE` each subject is initially assigned at random to one of the `K` clusters, otherwise k-means is used to initialize cluster memberships (default: `TRUE`).
`B`	an `N x q` matrix of spline coefficients, the spline approximation of the yearly curves based on `q` splines. If `B=NULL` (default), the coefficients are estimated using `fda::create.bspline.basis`.
`svd`	`TRUE/FALSE`, whether SVD decomposition should be used for the matrix of spline coefficients (default: TRUE).
`use.covariates`	`TRUE/FALSE`, whether covariates should be included when modelling (default: FALSE).
`stand.cov`	`TRUE/FALSE`, whether covariates should be standardized when modelling (default: `TRUE`).
`index.cov`	a vector of indices indicating which covariates should be used when modelling. If `NULL` (default) all present covariates are included.
`lambda`	a positive real number, smoothing parameter value to be used when estimating B-spline coefficients.
`EM.maxit`	a positive integer which gives the maximum number of iterations of the EM algorithm (default: `EM.maxit=50`).
`EMstep.tol`	the tolerance to use within iterative procedure of the EM algorithm (default: EMstep.tol=1e-8).
`Mstep.maxit`	a positive integer which gives the maximum number of iterations of the inner loop of the parameter estimation in the M step (default: `Mstep.maxit=10`).
`Mstep.tol`	the tolerance to use within iterative procedure to estimate model parameters (default: Mstep.tol=1e-4).
`EMplot`	`TRUE/FALSE`, whether plots of cluster means with some summary information should be produced at each iteration of the EM algorithm (default: `TRUE`).
`trace`	`TRUE/FALSE`, whether to print the current values of `\sigma^2` and, for the covariates, `\sigma^2_x` at each iteration of the M step (default: `TRUE`).
`n.cores`	number of cores to be used with parallel computing.
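
As a rough illustration of the `data` argument, the sketch below builds a list with the five required components from simulated curves, using base R only. The interpretation of each component (observations of all curves stacked in `x`, one curve id per observation in `curve`, positions in `grid` stored in `timeindex`) is an assumption made for this example; the `mocca` documentation remains the authoritative description. The helper `check_inputs()` is hypothetical and only encodes the required names and the documented constraint that `h` should not exceed `K`.

```r
set.seed(1)

# Simulated setup (all values are illustrative, not taken from the package)
N    <- 20                                 # number of curves
grid <- seq(0, 1, length.out = 30)         # common evaluation grid
n_t  <- length(grid)

# Assumed layout: observations of all curves stacked into long vectors
curve     <- rep(seq_len(N), each = n_t)   # curve id of each observation
timeindex <- rep(seq_len(n_t), times = N)  # position of each observation in grid
time      <- grid[timeindex]               # observation time
group     <- rep(1:2, length.out = N)      # two groups, used only to simulate
x <- sin(2 * pi * time) * group[curve] + rnorm(N * n_t, sd = 0.1)

dat <- list(x = x, time = time, timeindex = timeindex,
            curve = curve, grid = grid)

# Hypothetical helper: checks the required names and the constraint h <= K
check_inputs <- function(data, K, h) {
  required <- c("x", "time", "timeindex", "curve", "grid")
  missing_names <- setdiff(required, names(data))
  if (length(missing_names) > 0)
    stop("missing components: ", paste(missing_names, collapse = ", "))
  if (h > K) stop("h should be less than or equal to K")
  invisible(TRUE)
}

check_inputs(dat, K = 5, h = 2)
```

With the package installed, a list of this shape would in principle be passed as `estimate.mocca(data = dat, K = 5, h = 2)`, although `mocca` is the intended entry point.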

Value

A list is returned with the following items:

`loglik`	the maximized log likelihood value.
`sig2`	estimated residual variance for the spline coefficients (for the model without covariates), or a vector of the estimated residual variances for the spline coefficients and for the covariates (for the model with covariates).
`conv`	indicates why the EM algorithm terminated: `0` indicates successful completion; `1` indicates that the iteration limit `EM.maxit` has been reached.
`iter`	number of iterations of the EM algorithm taken to reach convergence.
`score.hist`	a matrix of the successive values of the scores (residual variances and log likelihood) up until convergence.
`parameters`	a list containing all the estimated parameters: `\bm\lambda_0`, `\bm\Lambda`, `\bm\alpha_k`, `\bm\Gamma_k` (or `\bm\Delta_k` in presence of covariates), `\pi_k` (cluster membership probabilities), `\sigma^2_x` (residual variance for the covariates if present), `\mathbf{v}_k` (mean values of the covariates for each cluster, in presence of covariates), `k=1,..., K`, where `K` is the number of clusters.
`vars`	a list containing results from the E step of the algorithm: the posterior probabilities `\pi_{k|i}` for each subject, the expected values of `\bm\gamma_i` and `\bm\gamma_i\bm\gamma_i^T`, and the covariance matrix of `\bm\gamma_i` given cluster membership and the observed values of the curve. See Arnqvist and Sjöstedt de Luna (2019) for an explanation of these values.
`data`	a list containing all the original data plus re-arranged functional data and covariates (if supplied) needed for EM-steps.
`design`	a list of spline basis matrices with and without covariates: `FullS.bmat` is the spline basis matrix `\mathbf{S}` computed on the grid of uniquely specified time points; `FullS` is the spline basis matrix `FullS.bmat` or the `\mathbf U` matrix from the svd of `FullS` (if applied); `\mathbf{S}` is the spline basis matrix computed on `timeindex`, a vector of time indices out of the `T` possible in `grid`; the inverse `(\mathbf{S}^T\mathbf{S})^{-1}`; `tag.S` is the matrix `\mathbf{S}` with covariates; `tag.FullS` is the matrix `FullS` with covariates.
`initials`	a list of initial settings: `q` is the spline basis dimension, `N` is the number of objects/curves, `Q` is the spline basis dimension plus the number of covariates (if present), `random` indicates whether random assignment (rather than k-means) was used to initialize cluster memberships, `h` is the vector dimension in the low-dimensional representation of the curves, `K` is the number of clusters, `r` is the number of scalar covariates.
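
The sketch below shows one way the returned list might be inspected after fitting. So that it runs without the package, `fit` is a hand-built stand-in containing only the `loglik`, `conv` and `iter` items described above, with made-up values; `describe_fit()` is a hypothetical helper, not part of fdaMocca.

```r
# Stand-in for a returned object; the values are invented for illustration
fit <- list(loglik = -1234.5, conv = 1, iter = 50)

# Hypothetical helper translating the documented `conv` codes into text
describe_fit <- function(fit) {
  status <- switch(as.character(fit$conv),
                   "0" = "converged",
                   "1" = "stopped at the iteration limit EM.maxit",
                   "unknown termination code")
  sprintf("EM %s after %d iterations (log likelihood %.1f)",
          status, as.integer(fit$iter), fit$loglik)
}

describe_fit(fit)
```

A `conv` value of `1` may be a reason to rerun the estimation with a larger `EM.maxit`.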

Author(s)

Per Arnqvist, Natalya Pya Arnqvist, Sara Sjöstedt de Luna

References

James, G.M., and Sugar, C.A. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98(462), 397–408.

Arnqvist, P., and Sjöstedt de Luna, S. (2019). Model based functional clustering of varved lake sediments. arXiv preprint arXiv:1904.10265.