mocca {fdaMocca}R Documentation

Model-based clustering for functional data with covariates

Description

This function fits a functional clustering model to observed independent functional subjects, where a functional subject consists of a function and possibly a set of covariates. Here, each curve is projected onto a finite dimensional basis and clustering is done on the resulting basis coefficients. However, rather than treating basis coefficients as parameters, mixed effect modelling is used for the coefficients. In the model-based functional clustering approach the functional subjects (i.e. the spline/basis coefficients and the potential covariates) are assumed to follow a multivariate Gaussian mixture model, where the number of distributions in the mixture model corresponds to the number of (predefined) clusters, KK. Given that a functional subject belongs to a cluster kk, the basis coefficients and covariate values are normally distributed with a cluster-specific mean and covariance structure.

An EM-style algorithm based on James and Sugar (2003) is implemented to fit the Gaussian mixture model for a prespecified number of clusters K. The model allows for different cluster covariance structure for the spline coefficients and model coefficients for the covariates. See Arnqvist and Sjöstedt de Luna (2019) for details about differences to the clustering model and its implementation.

The routine calls estimate.mocca for the model fitting.

Usage

mocca(data=stop("No data supplied"), K = 5, q = 6, h = 2,
     use.covariates=FALSE,stand.cov=TRUE,index.cov=NULL,
     random=TRUE, B=NULL,svd=TRUE,lambda=1.4e-4, EM.maxit=50, 
     EMstep.tol=1e-6,Mstep.maxit=20,Mstep.tol=1e-4,EMplot=TRUE,
     trace=FALSE,n.cores=NULL)

Arguments

data

a list containing at least three objects (vectors) named as x, time, curve, and optional timeindex, grid and covariates:

i) suppose we observe NN independent subjects, each consisting of a curve and potentially a set of scalar covariates, where the ithi^{th} curve has been observed at nin_i different time points, i=1,...,Ni=1,...,N. x is a vector of length i=1Nni\sum_{i=1}^N n_i with the first n1n_1 elements representing the observations of the first curve, followed by n2n_2 observations of the second curve, etc;

ii) time is a ini\sum_i n_i vector of the concatenated time points for each curve (tij,j=1,...,ni,i=1,...,N)(t_{ij}, j=1,...,n_i, i=1,...,N), with the first n1n_1 elements being the time points at which the first curve is observed, etc. Often, the time points within each curve are scaled to [0,1][0,1].

iii) timeindex is a ini\sum_i n_i vector of time indices from TT possible from grid. So each observation has a corresponding location (time index) within [0,1][0,1] uniquely specified time points. If not supplied, obtained from time and grid;

iv) curve is a ini\sum_i n_i vector of integers from 1,...,N1,..., N, specifying the subject number for each observation in x;

v) grid is a TT vector of all unique time points (values within [0,1][0,1] interval) for all NN subjects, needed for estimation of the B-spline coefficients in fda::eval.basis(). timeindex and grid together give the timepoint for each subject (curve). If not supplied, obtained from time.

vi) if supplied, covariates is an N×rN \times r matrix (or data frame) of scalar covariates (finite-dimensional covariates).

K

number of clusters (default: K=3).

q

number of B-splines for the individual curves. Evenly spaced knots are used (default: q=6).

h

a positive integer, parameter vector dimension in the low-dimensionality representation of the curves (spline coefficients). hh should be smaller than the number of clusters KK (default: h=2).

use.covariates

TRUE/FALSE, whether covariates should be used when modelling (default: FALSE).

stand.cov

TRUE/FALSE, whether covariates should be standardized when modelling (default: TRUE).

index.cov

a vector of indices indicating which covariates should be used when modelling. If NULL (default) all present covariates are included.

random

TRUE/FALSE, if TRUE the initial cluster belongings is given by uniform distribution, otherwise k-means is used to initialize cluster belongings (default: TRUE).

B

an N×qN \times q matrix of spline coefficients, the spline approximation of the yearly curves based on pp number of splines. If B=NULL (default), the coefficients are estimated using fda:: create.bspline.basis.

svd

TRUE/FALSE, whether SVD decomposition should be used for the matrix of spline coefficients (default: TRUE).

lambda

a positive real number, smoothing parameter value to be used when estimating B-spline coefficients.

EM.maxit

a positive integer which gives the maximum number of iterations for a EM algorithm (default: EM.maxit=50).

EMstep.tol

the tolerance to use within iterative procedure of the EM algorithm (default: EMstep.tol=1e-8).

Mstep.maxit

a positive scalar which gives the maximum number of iterations for an inner loop of the parameter estimation in M step (default: Mstep.maxit=20).

Mstep.tol

the tolerance to use within iterative procedure to estimate model parameters (default: Mstep.tol=1e-4).

EMplot

TRUE/FALSE, whether plots of cluster means with some summary information should be produced at each iteration of the EM algorithm (default: TRUE).

trace

TRUE/FALSE, whether to print the current values of σ2\sigma^2 and σx2\sigma^2_x for the covariates at each iteration of M step (default: FALSE).

n.cores

number of cores to be used with parallel computing. If NULL (default) n.cores is set to the numbers of available cores - 1 (n.cores= detectCores()-1).

Details

A model-based clustering with covariates (mocca) for the functional subjects (curves and potentially covariates) is a gaussian mixture model with KK components. Let gi(t)g_i(t) be the true function (curve) of the ithi^{th} subject, for a set of NN independent subjects. Assume that for each subject we have a vector of observed values of the function gi(t)g_i(t) at times ti1,...,tini,t_{i1},...,t_{in_i}, obtained with some measurement errors. We are interested in clustering the subjects into KK (homogenous) groups. Let yijy_{ij} be the observed value of the iith curve at time point tijt_{ij}. Then

yij=gi(tij)+ϵij,i=1,...,N,j=1,...,ni, y_{ij} = g_i(t_{ij})+ \epsilon_{ij}, i=1,...,N, j=1,...,n_i,

where ϵij\epsilon_{ij} are assumed to be independent and normally distributed measurement errors with mean 00 and variance σ2\sigma^2. Let yi\mathbf{y}_i, gi,\mathbf{g}_i, and ϵi\boldsymbol{\epsilon}_i be the nin_i-dimensional vectors for subject ii, corresponding to the observed values, true values and measurement errors, respectively. Then, in matrix notation, the above could be written as

yi=gi+ϵi,    i=1,,N, \mathbf{y}_i=\mathbf{g}_i+\boldsymbol{\epsilon}_i, ~~~~i=1,\ldots, N,

where ϵi  Nni(0,σ2Ini).\boldsymbol{\epsilon}_i ~\sim ~ N_{n_i}(\mathbf{0},\sigma^2 \mathbf{I}_{n_i}). We further assume that the smooth function gi(t)g_i(t) can be expressed as

gi(t)=ϕT(t)ηi, g_i(t) = \boldsymbol{\phi}^T(t) \boldsymbol{\eta}_i,

where ϕ(t)=(ϕ1(t),,ϕp(t))T\boldsymbol{\phi}(t)=\left(\phi_{1}(t),\ldots,\phi_{p}(t)\right)^T is a pp-dimensional vector of known basis functions evaluated at time t, e.g. B-splines, and ηi\boldsymbol{\eta}_i a pp-dimensional vector of unknown (random) coefficients. The ηi\boldsymbol{\eta}_i's are modelled as

ηi=μzi+γi,   ηi  Np(μzi,Γzi), \boldsymbol{\eta}_i = \boldsymbol{\mu}_{z_i} + \boldsymbol{\gamma}_i, ~~~ \boldsymbol{\eta}_i ~ \sim ~ N_p(\boldsymbol{\mu}_{z_i},\bm{\Gamma}_{z_i}),

where μzi\boldsymbol{\mu}_{z_i} is a vector of expected spline coefficients for a cluster kk and ziz_i denotes the unknown cluster membership, with P(zi=k)=πkP(z_i=k)=\pi_k, k=1,,Kk=1,\ldots,K. The random vector γi\boldsymbol{\gamma}_i corresponds to subject-specific within-cluster variability. Note that this variability is allowed to be different in different clusters, due to Γzi\bm\Gamma_{z_i}. If desirable, given that subject ii belongs to cluster zi=kz_i=k, a further parametrization of μk,  k=1,,K,\boldsymbol{\mu}_{k},~~ k=1,\ldots,K, may prove useful, for producing low-dimensional representations of the curves as suggested by James and Sugar (2003):

μk=λ0+Λαk, \bm\mu_k = \bm\lambda_0+ \bm\Lambda \bm\alpha_k,

where λ0\bm\lambda_0 and αk\bm\alpha_k are pp- and hh-dimensional vectors respectively, and Λ\bm\Lambda is a p×hp \times h matrix with hK1h \leq K-1. Choosing h<K1h<K-1 may be valuable, especially for sparse data. In order to ensure identifiability, some restrictions need to be put on the parameters. Imposing the restriction that k=1Kαk=0\sum_{k=1}^K \bm\alpha_k=\mathbf{0} implies that ϕT(t)λ0\bm\phi^T(t)\bm\lambda_0 can be viewed as the overall mean curve. Depending on the choice of h,ph,p and KK, further restrictions may need to be imposed in order to have identifiability of the parameters (λ0,Γ\bm\lambda_0, \bm\Gamma and αk\bm\alpha_k are confounded if no restrictions are imposed). In vector-notation we thus have

yi=Bi(λ0+Λαzi+γi)+ϵi,  i=1,...,N, \mathbf{y}_i = \mathbf{B}_i(\bm\lambda_0 + \bm\Lambda\bm\alpha_{z_i}+\bm\gamma_i)+\bm\epsilon_i,~~ i=1,...,N,

where Bi\mathbf{B}_i is an ni×pn_i \times p matrix with ϕT(tij)\bm\phi^T(t_{ij}) on the jthj^\textrm{th} row, j=1,,ni.j=1,\ldots,n_i. We will also assume that the γi\bm\gamma_i's, ϵi\bm\epsilon_i's and the ziz_i's are independent. Hence, given that subject ii belongs to cluster zi=kz_i=k we have

yizi=k    Nni(Bi(λ0+Λαk),  BiΓkBiT+σ2Ini). \mathbf{y}_i | z_i=k ~~\sim ~~ N_{n_i}\left(\mathbf{B}_i(\bm\lambda_0 + \bm\Lambda \bm\alpha_k), ~~\mathbf{B}_i \bm\Gamma_k \mathbf{B}_i^T+ \sigma^2\mathbf{I}_{n_i}\right).

Based on the observed data y1,,yN\mathbf{y}_1,\ldots,\mathbf{y}_N, the parameters θ\bm\theta of the model can be estimated by maximizing the observed likelihood

Lo(θy1,,yN)=i=1Nk=1Gπkfk(yi,θ), L_o(\bm\theta|\mathbf{y}_1,\ldots,\mathbf{y}_N)=\prod_{i=1}^N \sum_{k=1}^G \pi_k f_k(\mathbf{y}_i,\bm\theta),

where θ={λ0,Λ,α1,,αK,π1,,πK,σ2,Γ1,,ΓK},\bm\theta = \left\{\bm\lambda_0,\bm\Lambda,\bm\alpha_1,\ldots,\bm\alpha_K,\pi_1,\ldots,\pi_K,\sigma^2,\bm\Gamma_1,\ldots,\bm\Gamma_K\right\}, and fk(yi,θ)f_k(\mathbf{y}_i,\bm\theta) is the normal density given above. Note that here θ\bm\theta will denote all scalar, vectors and matrices of parameters to be estimated. An EM-type algorithm is used to maximize the likelihood above.

If additional covariates have been observed for each subject besides the curves, they can also be included in the model when clustering the subjects. Given that the subject ii belongs to cluster k,(zi=k)k, (z_{i}=k) the rr covariates xiRr\boldsymbol{x}_i \in \mathbf{R}^r are assumed to have mean value υk\boldsymbol{\upsilon}_k and moreover xi=υk+δi+ei,\boldsymbol{x}_{i} = \boldsymbol{\upsilon}_{k} + \boldsymbol{\delta}_{i} + \boldsymbol{e}_i, where we assume that δizi=kNr(0,Dk)\boldsymbol{\delta}_{i}|z_{i}=k \sim N_r(\boldsymbol{0}, \mathbf{D}_k) is the random deviation within cluster and eiNr(0,σx2Ir)\boldsymbol{e}_i \sim N_r(\boldsymbol{0},\sigma_x^2 \mathbf{I}_r) independent remaining unexplained variability. Note that this model also incorporates the dependence between covariates and the random curves via the random basis coefficients. See Arnqvist and Sjöstedt de Luna (2019) for further details. EM-algorithm is implemented to maximize the mixture likelihood.

The method is applied to annually varved lake sediment data from the lake Kassjön in Northern Sweden. See an example and also varve for the data description.

Value

The function returns an object of class "mocca" with the following elements:

loglik

the maximized log likelihood value.

sig2

estimated residual variance for the functional data (for the model without covariates), or a vector of the estimated residual variances for the functional data and for the covariates (for the model with covariates).

conv

indicates why the EM algorithm terminated:

0: indicates successful completion.

1: indicates that the iteration limit EM.maxit has been reached.

iter

number of iterations of the EM algorithm taken to get convergence.

nobs

number of subjects/curves.

score.hist

a matrix of the succesive values of the scores, residual variances and log likelihood, up until convergence.

pars

a list containing all the estimated parameters: λ0\bm\lambda_0, Λ\bm\Lambda, αk\bm\alpha_k, Γk\bm\Gamma_k (or Δk\bm\Delta_k in presence of the covariates), πk\pi_k (probabilities of cluster belongnings), σ2\sigma^2, σx2\sigma^2_x (residual variance for the covariates if present), vk\mathbf{v}_k (mean values of the covariates for each cluster).

vars

a list containing results from the E step of the algorithm: the posterior probabilities for each subject πki\pi_{k|i}'s, the expected values of the γi\bm\gamma_i's, γiγiT\bm\gamma_i\bm\gamma_i^T, and the covariance matrix of γi\bm\gamma_i given cluster membership and the observed values of the curve. See Arnqvist and Sjöstedt de Luna (2019) that explains these values.

data

a list containing all the original data plus re-arranged functional data and covariates (if supplied).

design

a list of spline basis matrices with and without covariates: FullS.bmat is the spline basis matrix S\mathbf{S} computed on the grid of uniquily specified time points; FullS is the spline basis matrix FullS.bmat or U\mathbf U matrix from the svd of FullS (if applied); S\mathbf{S} is the spline basis matrix computed on timeindex, a vector of time indices from TT possible from grid; the inverse (STS)1(\mathbf{S}^T\mathbf{S})^{-1}; tag.S is the matrix S\mathbf{S} with covariates; tag.FullS is the matrix FullS with covariates. See Arnqvist and Sjöstedt de Luna (2019) for further details.

initials

a list of initial settings: qq is the spline basis dimension, NN is the number of subjects/curves, QQ is the number of basis dimension plus the number of covariates (if present), randomrandom is whether k-means was used to initialize cluster belonings, hh is the vector dimension in low-dimensionality representation of the curves, KK is the number of clusters, rr is the number of scalar covariates, mocmoc TRUE/FALSE signaling if the model includes covariates.

Author(s)

Per Arnqvist, Natalya Pya Arnqvist, Sara Sjöstedt de Luna

References

Arnqvist, P., Bigler, C., Renberg, I., Sjöstedt de Luna, S. (2016). Functional clustering of varved lake sediment to reconstruct past seasonal climate. Environmental and Ecological Statistics, 23(4), 513–529.

Abramowicz, K., Arnqvist, P., Secchi, P., Sjöstedt de Luna, S., Vantini, S., Vitelli, V. (2017). Clustering misaligned dependent curves applied to varved lake sediment for climate reconstruction. Stochastic Environmental Research and Risk Assessment. Volume 31.1, 71–85.

Arnqvist, P., and Sjöstedt de Luna, S. (2019). Model based functional clustering of varved lake sediments. arXiv preprint arXiv:1904.10265.

James, G.M., Sugar, C.A. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98.462, 397–408.

See Also

fdaMocca-package

Examples

 
## example with lake sediment data from lake Kassjön...
library(fdaMocca)
data(varve) ## reduced data set

## run without covariates...
m <- mocca(data=varve,K=3,n.cores=2)
m
## some summary information...
summary(m)
criteria.mocca(m)
AIC(m)
BIC(m)
## various plots...
plot(m)
plot(m,select=2)
plot(m,type=2,years=c(-750:750)) 
plot(m,type=2,probs=TRUE,pts=TRUE,years=c(-750:750)) 
plot(m,type=2,pts=TRUE,select=c(3,1),years=c(-750:750))
plot(m,type=3)
plot(m,type=3,covariance=FALSE)


## model with two covariates...
## note, it takes some time to analyze the data...
m1 <- mocca(data=varve, use.covariates=TRUE,index.cov=c(2,3), K=3,n.cores=2)
m1
## summary information...
summary(m1)
criteria.mocca(m1)
## various plots...
plot(m1)
plot(m1,type=2,pts=TRUE,years=c(-750:750)) 
plot(m1,type=3)
plot(m1,type=3,covariance=FALSE)
plot(m1,type=3,covariates=TRUE)

## simple simulated data...
data(simdata)
set.seed(2)
m2 <- mocca(data=simdata,K=2,q=8,h=1,lambda=1e-10,n.cores=2,EMstep.tol=1e-3)
summary(m2)
criteria.mocca(m2)
plot(m2)
plot(m2,select=2)


## even simpler simulated data
##(reduced from 'simdata', EMstep.tol set high, q lower to allow automatic testing)...
library(fdaMocca)
data(simdata0)
set.seed(2)
m3 <- mocca(data=simdata0,K=2,q=5,h=1,lambda=1e-10,n.cores=2,EMstep.tol=.5,
      EMplot=FALSE,B=simdata0$B)
summary(m3)
#plot(m3)
#plot(m3,select=2))
 

[Package fdaMocca version 0.1-1 Index]