bllim {xLLiM}    R Documentation
EM Algorithm for Block diagonal Gaussian Locally Linear Mapping
Description
EM Algorithm for Block diagonal Gaussian Locally Linear Mapping
Usage
bllim(tapp,yapp,in_K,in_r=NULL,ninit=20,maxiter=100,verb=0,in_theta=NULL,plot=TRUE)
Arguments
tapp |
An L x N matrix of training responses, with variables in rows and observations in columns |
yapp |
A D x N matrix of training covariates, with variables in rows and observations in columns |
in_K |
Initial number of components or number of clusters |
in_r |
Initial assignments (default NULL). If NULL, the model is initialized with the best initialization among ninit, computed by a joint Gaussian mixture model on both responses and covariates. |
ninit |
Number of random initializations (default 20). Not used if in_r is specified. |
maxiter |
Maximum number of iterations (default 100). The algorithm stops if the number of iterations exceeds maxiter. |
verb |
Verbosity: print out the progression of the algorithm. If verb=0, nothing is printed; if verb=1, the progression is printed out. Default is 0. |
in_theta |
Initial parameters (default NULL), with the same structure as the output of this function. The EM algorithm can be initialized either with initial assignments or with initial parameter values. |
plot |
Displays plots to allow the user to check that the slope heuristics can be applied confidently to select the conditional block structure of predictors, as in the capushe-package. Default is TRUE. |
Details
The BLLiM model implemented in this function addresses the following non-linear mapping issue:
E(Y | X = x) = g(x),
where Y is an L-vector of multivariate responses and X is a large D-vector of covariates' profiles such that D >> L. As gllim and sllim, the bllim function aims at estimating the non-linear regression function g.
First, the methods of this package are based on an inverse regression strategy. The inverse conditional relation p(X | Y) is specified in a way that the forward relation of interest p(Y | X) can be deduced in closed form. Under some hypotheses on covariance structures, the large number D of covariates is handled by this inverse regression trick, which acts as a dimension reduction technique. The number of parameters to estimate is therefore drastically reduced. Second, we propose to approximate the non-linear regression function by a piecewise affine function. Therefore, a hidden discrete variable Z is introduced in order to divide the space into K regions such that an affine model holds between responses Y and variables X in each region k:
X = sum_{k=1}^{K} I(Z = k) (A_k Y + b_k + E_k),
where A_k is a D x L matrix of coefficients for regression k, b_k is a D-vector of intercepts and E_k is a Gaussian noise with covariance matrix Sigma_k.
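As an illustrative sketch (all names and values below are invented for illustration and are not part of xLLiM), data can be simulated from such a piecewise affine inverse model in base R:

```r
# Simulate N observations from a piecewise affine model with K = 2 regions,
# L = 1 response and D = 3 covariates (for simplicity the response is drawn
# from a single standard Gaussian in both regions).
set.seed(1)
N <- 100; D <- 3; K <- 2
Z <- sample(1:K, N, replace = TRUE)            # hidden region labels
Y <- rnorm(N)                                   # 1-dimensional responses
A <- list(matrix(c(1, -2, 0.5), D, 1),          # D x L coefficients, region 1
          matrix(c(-1, 0, 2), D, 1))            # D x L coefficients, region 2
b <- list(rep(0, D), rep(1, D))                 # D-vectors of intercepts
X <- sapply(1:N, function(i)                    # X_i = A_k Y_i + b_k + E_i
  A[[Z[i]]] %*% Y[i] + b[[Z[i]]] + rnorm(D, sd = 0.1))
dim(X)  # D x N: covariates in rows, observations in columns, as xLLiM expects
```

Such a matrix X (and the corresponding 1 x N response matrix) has the row/column layout expected by the tapp and yapp arguments.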
BLLiM is defined as the following hierarchical Gaussian mixture model for the inverse conditional density p(X | Y):
p(X = x | Y = y, Z = k; theta) = N(x; A_k y + b_k, Sigma_k)
p(Y = y | Z = k; theta) = N(y; c_k, Gamma_k)
p(Z = k) = pi_k,
where Sigma_k is a D x D block diagonal covariance structure automatically learnt from data, and theta is the set of parameters
theta = {pi_k, c_k, Gamma_k, A_k, b_k, Sigma_k}_{k=1}^{K}.
The forward conditional density of interest p(Y | X) is deduced from these equations and is also a Gaussian mixture of regressions model.
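For reference, the forward parameters can be obtained from the inverse ones by standard Gaussian conditioning, as in the GLLiM framework (a sketch of the standard derivation, not stated on this page; star superscripts denote forward parameters and ' denotes transposition):

c*_k = A_k c_k + b_k
Gamma*_k = Sigma_k + A_k Gamma_k A_k'
Sigma*_k = (Gamma_k^{-1} + A_k' Sigma_k^{-1} A_k)^{-1}
A*_k = Sigma*_k A_k' Sigma_k^{-1}
b*_k = Sigma*_k (Gamma_k^{-1} c_k - A_k' Sigma_k^{-1} b_k),

so that p(Y = y | X = x, Z = k) = N(y; A*_k x + b*_k, Sigma*_k), with mixture weights proportional to pi_k N(x; c*_k, Gamma*_k).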
For a given number of affine components (or clusters) K and a given block structure, the number of parameters to estimate is:
(K - 1) + K (L + L(L+1)/2 + DL + D + nbpar_Sigma),
where L is the dimension of the response, D is the dimension of covariates and nbpar_Sigma is the total number of parameters in the large covariance matrix Sigma_k in each cluster. This number of parameters depends on the number and size of blocks in each matrix.
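As a hedged illustration, this count can be computed with a small helper (hypothetical, not part of xLLiM); it simply adds up the parameters of pi_k, c_k, Gamma_k, A_k, b_k and Sigma_k, assuming the same block structure size in every cluster:

```r
# Count BLLiM parameters for K clusters, response dimension L, covariate
# dimension D, and nbpar_Sigma parameters in each block diagonal Sigma_k.
nbpar_bllim <- function(K, L, D, nbpar_Sigma) {
  (K - 1) +                    # mixture weights pi_k (sum to one)
    K * (L +                   # means c_k
         L * (L + 1) / 2 +     # symmetric covariances Gamma_k
         D * L +               # regression matrices A_k
         D +                   # intercepts b_k
         nbpar_Sigma)          # block diagonal noise covariances Sigma_k
}
# Full (unstructured) Sigma_k gives nbpar_Sigma = D * (D + 1) / 2;
# diagonal Sigma_k gives nbpar_Sigma = D.
nbpar_bllim(K = 5, L = 2, D = 50, nbpar_Sigma = 50)
```

This makes the dimension reduction visible: with K = 5, L = 2 and D = 50, a diagonal Sigma_k costs 1029 parameters, against 7404 for a full Sigma_k.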
Two hyperparameters must be estimated to run BLLiM:
Number of mixture components (or clusters) K: we propose to use the BIC criterion or the slope heuristics as implemented in capushe-package.
For a given number of clusters K, the block structure of the large covariance matrices specific to each cluster: the size and the number of blocks of each Sigma_k matrix are automatically learnt from data, using an extension of the shock procedure (see shock-package). This procedure is based on a successive thresholding of the sample conditional covariance matrix within clusters, building a collection of block structure candidates. The final block structure is retained using slope heuristics.
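The thresholding idea behind the candidate collection can be sketched in base R (a simplified illustration, not the actual shock implementation): threshold the absolute sample correlation matrix and read off blocks as connected components of the resulting adjacency graph.

```r
# Toy data: 200 observations of 6 variables (variables in columns here,
# because cor() expects that layout); variables 1-2 and 3-4 are correlated.
set.seed(2)
X <- matrix(rnorm(200 * 6), 200, 6)
X[, 2] <- X[, 1] + rnorm(200, sd = 0.1)
X[, 4] <- X[, 3] + rnorm(200, sd = 0.1)
adj <- abs(cor(X)) > 0.5            # adjacency for one candidate threshold
# Label connected components by repeated neighbor expansion (base R only).
blocks <- rep(NA_integer_, ncol(X))
comp <- 0L
for (v in seq_len(ncol(X))) {
  if (is.na(blocks[v])) {
    comp <- comp + 1L
    members <- v
    repeat {
      grown <- which(apply(adj[, members, drop = FALSE], 1, any))
      if (length(grown) == length(members)) break
      members <- grown
    }
    blocks[members] <- comp
  }
}
blocks  # block label of each variable for this threshold
```

Varying the threshold yields the collection of block structure candidates among which slope heuristics select the final structure.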
BLLiM is not only a prediction model but also an interpretable tool. For example, it is useful for the analysis of transcriptomic data. Indeed, if covariates are genes and the response is a phenotype, the model provides clusters of individuals based on the relation between gene expression data and the phenotype, and also leads to the inference of a gene regulatory network specific to each cluster of individuals.
Value
Returns a list with the following elements:
LLf |
Final log-likelihood |
LL |
Log-likelihood value at each iteration of the EM algorithm |
pi |
A vector of length K of mixture weights, i.e. prior probabilities of each component |
c |
An (L x K) matrix of means of responses |
Gamma |
An (L x L x K) array of covariance matrices of responses |
A |
A (D x L x K) array of estimated regression matrices |
b |
A (D x K) matrix of estimated regression intercepts |
Sigma |
A (D x D x K) array of estimated block diagonal covariance matrices of the noise |
r |
An (N x K) matrix of posterior probabilities of cluster assignment for each observation |
nbpar |
The number of parameters estimated in the model |
Author(s)
Emeline Perthame (emeline.perthame@pasteur.fr), Emilie Devijver (emilie.devijver@kuleuven.be), Melina Gallopin (melina.gallopin@u-psud.fr)
References
[1] E. Devijver, M. Gallopin, E. Perthame. Nonlinear network-based quantitative trait prediction from transcriptomic data. Submitted, 2017, available at https://arxiv.org/abs/1701.07899.
See Also
xLLiM-package, emgm, gllim_inverse_map, capushe-package, shock-package
Examples
data(data.xllim)
## Setting 5 components in the model
K = 5
responses = data.xllim[1:2,] # 2 responses in rows and 100 observations in columns
covariates = data.xllim[3:52,] # 50 covariates in rows and 100 observations in columns
## the model can be initialized by running an EM algorithm for Gaussian Mixtures (EMGM)
r = emgm(data.xllim, init=K)
## and then the bllim model is estimated with these initial assignments
# mod = bllim(responses,covariates,in_K=K,in_r=r)
## if initialization is not specified, the model is automatically initialized by EMGM
# mod = bllim(responses,covariates,in_K=K)
## Prediction can be performed using prediction function gllim_inverse_map
# pred = gllim_inverse_map(covariates,mod)$x_exp