mocca {fdaMocca} | R Documentation |
Model-based clustering for functional data with covariates
Description
This function fits a functional clustering model to observed independent functional subjects, where a functional subject consists of a function and possibly a set of covariates. Here, each curve is projected onto a finite dimensional basis and clustering is done on the resulting basis coefficients. However, rather than treating basis coefficients as parameters, mixed effect modelling is used for the coefficients. In the model-based functional clustering approach the functional subjects (i.e. the spline/basis coefficients and the potential covariates) are assumed to follow a multivariate Gaussian mixture model, where the number of distributions in the mixture model corresponds to the number of (predefined) clusters, K
. Given that a functional subject belongs to a cluster k
, the basis coefficients and covariate values are normally distributed with a cluster-specific mean and covariance structure.
An EM-style algorithm based on James and Sugar (2003) is implemented to fit the Gaussian mixture model for a prespecified number of clusters K
. The model allows for different cluster covariance structure for the spline coefficients and model coefficients for the covariates. See Arnqvist and Sjöstedt de Luna (2019) for details about differences to the clustering model and its implementation.
The routine calls estimate.mocca
for the model fitting.
Usage
mocca(data=stop("No data supplied"), K = 5, q = 6, h = 2,
use.covariates=FALSE,stand.cov=TRUE,index.cov=NULL,
random=TRUE, B=NULL,svd=TRUE,lambda=1.4e-4, EM.maxit=50,
EMstep.tol=1e-6,Mstep.maxit=20,Mstep.tol=1e-4,EMplot=TRUE,
trace=FALSE,n.cores=NULL)
Arguments
data |
a list containing at least three objects (vectors) named as i) suppose we observe ii) iii) iv) v) vi) if supplied, |
K |
number of clusters (default: |
q |
number of B-splines for the individual curves. Evenly spaced knots are used (default: |
h |
a positive integer, parameter vector dimension in the low-dimensionality representation of the curves (spline coefficients).
|
use.covariates |
|
stand.cov |
|
index.cov |
a vector of indices indicating which covariates should be used when modelling. If |
random |
|
B |
an |
svd |
|
lambda |
a positive real number, smoothing parameter value to be used when estimating B-spline coefficients. |
EM.maxit |
a positive integer which gives the maximum number of iterations for a EM algorithm (default:
|
EMstep.tol |
the tolerance to use within iterative procedure of the EM algorithm (default: |
Mstep.maxit |
a positive scalar which gives the maximum number of iterations for an inner loop of the parameter estimation in M step (default: |
Mstep.tol |
the tolerance to use within iterative procedure to estimate model parameters (default: Mstep.tol=1e-4). |
EMplot |
|
trace |
|
n.cores |
number of cores to be used with parallel computing. If |
Details
A model-based clustering with covariates (mocca) for the functional subjects (curves and potentially covariates) is a gaussian mixture model with K
components.
Let g_i(t)
be the true function (curve) of the i^{th}
subject, for a set of N
independent subjects. Assume that for each subject we have a vector of observed values of the function g_i(t)
at times t_{i1},...,t_{in_i},
obtained with some measurement errors. We are interested in clustering the subjects into K
(homogenous) groups. Let y_{ij}
be the observed value of the i
th curve at time point t_{ij}
. Then
y_{ij} = g_i(t_{ij})+ \epsilon_{ij}, i=1,...,N, j=1,...,n_i,
where \epsilon_{ij}
are assumed to be independent and normally distributed measurement errors
with mean 0
and variance \sigma^2
. Let \mathbf{y}_i
, \mathbf{g}_i,
and \boldsymbol{\epsilon}_i
be the n_i
-dimensional vectors for subject i
, corresponding to the observed values, true values and measurement errors, respectively. Then, in matrix notation, the above could be written as
\mathbf{y}_i=\mathbf{g}_i+\boldsymbol{\epsilon}_i, ~~~~i=1,\ldots, N,
where \boldsymbol{\epsilon}_i ~\sim ~ N_{n_i}(\mathbf{0},\sigma^2 \mathbf{I}_{n_i}).
We further assume that the smooth function g_i(t)
can be expressed as
g_i(t) = \boldsymbol{\phi}^T(t) \boldsymbol{\eta}_i,
where \boldsymbol{\phi}(t)=\left(\phi_{1}(t),\ldots,\phi_{p}(t)\right)^T
is a p
-dimensional vector of known basis functions evaluated at time t, e.g. B-splines, and \boldsymbol{\eta}_i
a p
-dimensional vector of unknown (random) coefficients. The \boldsymbol{\eta}_i
's are modelled as
\boldsymbol{\eta}_i = \boldsymbol{\mu}_{z_i} + \boldsymbol{\gamma}_i, ~~~ \boldsymbol{\eta}_i ~ \sim ~ N_p(\boldsymbol{\mu}_{z_i},\bm{\Gamma}_{z_i}),
where \boldsymbol{\mu}_{z_i}
is a vector of expected spline coefficients for a cluster k
and z_i
denotes the unknown cluster membership, with P(z_i=k)=\pi_k
, k=1,\ldots,K
. The random vector \boldsymbol{\gamma}_i
corresponds to subject-specific within-cluster variability.
Note that this variability is allowed to be different in different clusters, due to \bm\Gamma_{z_i}
. If desirable, given that subject i
belongs to cluster z_i=k
, a further parametrization of \boldsymbol{\mu}_{k},~~ k=1,\ldots,K,
may prove useful, for producing low-dimensional representations of the curves as suggested by James and Sugar (2003):
\bm\mu_k = \bm\lambda_0+ \bm\Lambda \bm\alpha_k,
where \bm\lambda_0
and \bm\alpha_k
are p
- and h
-dimensional vectors respectively, and \bm\Lambda
is a p \times h
matrix with h \leq K-1
. Choosing h<K-1
may be valuable, especially for sparse data. In order to ensure identifiability, some restrictions need to be put on the parameters. Imposing the restriction that \sum_{k=1}^K \bm\alpha_k=\mathbf{0}
implies that \bm\phi^T(t)\bm\lambda_0
can be viewed as the overall mean curve. Depending on the choice of h,p
and K
, further restrictions may need to be imposed in order to have identifiability of the parameters (\bm\lambda_0, \bm\Gamma
and \bm\alpha_k
are confounded if no restrictions are imposed).
In vector-notation we thus have
\mathbf{y}_i = \mathbf{B}_i(\bm\lambda_0 + \bm\Lambda\bm\alpha_{z_i}+\bm\gamma_i)+\bm\epsilon_i,~~ i=1,...,N,
where \mathbf{B}_i
is an n_i \times p
matrix with \bm\phi^T(t_{ij})
on the j^\textrm{th}
row, j=1,\ldots,n_i.
We will also assume that the \bm\gamma_i
's, \bm\epsilon_i
's and the z_i
's are independent. Hence, given that subject i
belongs to cluster z_i=k
we have
\mathbf{y}_i | z_i=k ~~\sim ~~ N_{n_i}\left(\mathbf{B}_i(\bm\lambda_0 + \bm\Lambda \bm\alpha_k), ~~\mathbf{B}_i \bm\Gamma_k \mathbf{B}_i^T+ \sigma^2\mathbf{I}_{n_i}\right).
Based on the observed data \mathbf{y}_1,\ldots,\mathbf{y}_N
, the parameters \bm\theta
of the model can be estimated by maximizing the observed likelihood
L_o(\bm\theta|\mathbf{y}_1,\ldots,\mathbf{y}_N)=\prod_{i=1}^N \sum_{k=1}^G \pi_k f_k(\mathbf{y}_i,\bm\theta),
where
\bm\theta = \left\{\bm\lambda_0,\bm\Lambda,\bm\alpha_1,\ldots,\bm\alpha_K,\pi_1,\ldots,\pi_K,\sigma^2,\bm\Gamma_1,\ldots,\bm\Gamma_K\right\},
and f_k(\mathbf{y}_i,\bm\theta)
is the normal density given above. Note that here \bm\theta
will denote all scalar, vectors and matrices of parameters to be estimated. An EM-type algorithm is used to maximize the likelihood above.
If additional covariates have been observed for each subject besides the curves, they can also be included in the model when clustering the subjects. Given that the subject i
belongs to cluster k, (z_{i}=k)
the r
covariates \boldsymbol{x}_i \in \mathbf{R}^r
are assumed to have mean value \boldsymbol{\upsilon}_k
and moreover \boldsymbol{x}_{i} = \boldsymbol{\upsilon}_{k} + \boldsymbol{\delta}_{i} + \boldsymbol{e}_i,
where we assume that \boldsymbol{\delta}_{i}|z_{i}=k \sim N_r(\boldsymbol{0}, \mathbf{D}_k)
is the random deviation within cluster and \boldsymbol{e}_i \sim N_r(\boldsymbol{0},\sigma_x^2 \mathbf{I}_r)
independent remaining unexplained variability.
Note that this model also incorporates the dependence between covariates and the random curves via the random basis coefficients. See Arnqvist and Sjöstedt de Luna (2019) for further details.
EM-algorithm is implemented to maximize the mixture likelihood.
The method is applied to annually varved lake sediment data from the lake Kassjön in Northern Sweden. See an example and also varve
for the data description.
Value
The function returns an object of class "mocca"
with the following elements:
loglik |
the maximized log likelihood value. |
sig2 |
estimated residual variance for the functional data (for the model without covariates), or a vector of the estimated residual variances for the functional data and for the covariates (for the model with covariates). |
conv |
indicates why the EM algorithm terminated: 0: indicates successful completion. 1: indicates that the iteration limit |
iter |
number of iterations of the EM algorithm taken to get convergence. |
nobs |
number of subjects/curves. |
score.hist |
a matrix of the succesive values of the scores, residual variances and log likelihood, up until convergence. |
pars |
a list containing all the estimated parameters: |
vars |
a list containing results from the E step of the algorithm: the posterior probabilities for each subject |
data |
a list containing all the original data plus re-arranged functional data and covariates (if supplied). |
design |
a list of spline basis matrices with and without covariates:
|
initials |
a list of initial settings: |
Author(s)
Per Arnqvist, Natalya Pya Arnqvist, Sara Sjöstedt de Luna
References
Arnqvist, P., Bigler, C., Renberg, I., Sjöstedt de Luna, S. (2016). Functional clustering of varved lake sediment to reconstruct past seasonal climate. Environmental and Ecological Statistics, 23(4), 513–529.
Abramowicz, K., Arnqvist, P., Secchi, P., Sjöstedt de Luna, S., Vantini, S., Vitelli, V. (2017). Clustering misaligned dependent curves applied to varved lake sediment for climate reconstruction. Stochastic Environmental Research and Risk Assessment. Volume 31.1, 71–85.
Arnqvist, P., and Sjöstedt de Luna, S. (2019). Model based functional clustering of varved lake sediments. arXiv preprint arXiv:1904.10265.
James, G.M., Sugar, C.A. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98.462, 397–408.
See Also
Examples
## example with lake sediment data from lake Kassjön...
library(fdaMocca)
data(varve) ## reduced data set
## run without covariates...
m <- mocca(data=varve,K=3,n.cores=2)
m
## some summary information...
summary(m)
criteria.mocca(m)
AIC(m)
BIC(m)
## various plots...
plot(m)
plot(m,select=2)
plot(m,type=2,years=c(-750:750))
plot(m,type=2,probs=TRUE,pts=TRUE,years=c(-750:750))
plot(m,type=2,pts=TRUE,select=c(3,1),years=c(-750:750))
plot(m,type=3)
plot(m,type=3,covariance=FALSE)
## model with two covariates...
## note, it takes some time to analyze the data...
m1 <- mocca(data=varve, use.covariates=TRUE,index.cov=c(2,3), K=3,n.cores=2)
m1
## summary information...
summary(m1)
criteria.mocca(m1)
## various plots...
plot(m1)
plot(m1,type=2,pts=TRUE,years=c(-750:750))
plot(m1,type=3)
plot(m1,type=3,covariance=FALSE)
plot(m1,type=3,covariates=TRUE)
## simple simulated data...
data(simdata)
set.seed(2)
m2 <- mocca(data=simdata,K=2,q=8,h=1,lambda=1e-10,n.cores=2,EMstep.tol=1e-3)
summary(m2)
criteria.mocca(m2)
plot(m2)
plot(m2,select=2)
## even simpler simulated data
##(reduced from 'simdata', EMstep.tol set high, q lower to allow automatic testing)...
library(fdaMocca)
data(simdata0)
set.seed(2)
m3 <- mocca(data=simdata0,K=2,q=5,h=1,lambda=1e-10,n.cores=2,EMstep.tol=.5,
EMplot=FALSE,B=simdata0$B)
summary(m3)
#plot(m3)
#plot(m3,select=2))