fit_glmpca_pois {fastglmpca} | R Documentation |
Fit Poisson GLM-PCA Model to Count Data
Description
Fit a Poisson GLM-PCA model by maximum-likelihood.
Usage
fit_glmpca_pois(
Y,
K,
fit0 = init_glmpca_pois(Y, K),
verbose = TRUE,
control = list()
)
fit_glmpca_pois_control_default()
init_glmpca_pois(
Y,
K,
U,
V,
X = numeric(0),
Z = numeric(0),
B = numeric(0),
W = numeric(0),
fixed_b_cols = numeric(0),
fixed_w_cols = numeric(0),
col_size_factor = TRUE,
row_intercept = TRUE
)
Arguments
Y |
The n x m matrix of counts; all entries of |
K |
Integer 1 or greater specifying the rank of the matrix
factorization. This should only be provided if the initial fit
( |
fit0 |
Initial model fit. It should be an object of class
“glmpca_fit_pois”, such as an output from
|
verbose |
If |
control |
List of control parameters to modify behavior of the optimization algorithm; see “Details”. |
U |
An optional argument giving the initial estimate of the
loadings matrix. It should be an n x K matrix, where n is the
number of rows in the counts matrix |
V |
An optional argument giving is the initial estimate of the
factors matrix. It should be a m x K matrix, where m is the number
of columns in the counts matrix |
X |
Optional argument giving row covariates of the count
matrix |
Z |
Optional argument giving column covariates of the count
matrix |
B |
Optional argument giving the initial estimates for the coefficients of the row covariates. It should be an m x nx matrix, where nx is the number of row covariates. This argument is ignored if X is not provided. |
W |
Optional argument giving the initial estimates for the coefficients of the column covariates. It should be an n x nz matrix, where nz is the number of column covariates. This argument is ignored if Z is not provided. |
fixed_b_cols |
Optional numeric vector specifying which
columns of |
fixed_w_cols |
Optional numeric vector specifying which
columns of |
col_size_factor |
If |
row_intercept |
If |
Details
In generalized principal component analysis (GLM-PCA)
based on a Poisson likelihood, the counts y_{ij}
stored in an
n \times m
matrix Y
are modeled as
y_{ij}
\sim Pois(\lambda_{ij}),
in which the logarithm of each rate
parameter \lambda_{ij}
is defined as a linear combination of
rank-K matrices to be estimated from the data:
\log
\lambda_{ij} = (UDV')_{ij},
where U
and V
are
orthogonal matrices of dimension n \times K
and m
\times K
, respectively, and D
is a diagonal K
\times K
matrix in which the entries along its diagonal are
positive and decreasing. K
is a tuning parameter specifying
the rank of the matrix factorization. This is the same as the
low-rank matrix decomposition underlying PCA (that is, the singular
value decomposition), but because we are not using a linear
(Gaussian) model, this is called “generalized PCA” or
“GLM PCA”.
To allow for additional components that may be fixed,
fit_glmpca_pois
can also fit the more general model
\log \lambda_{ij} = (UDV' + XB' + WZ')_{ij},
in which
X
, Z
are fixed matrices of dimension n \times
n_x
and m \times n_z
, respectively, and
B
, W
are matrices of dimension m \times n_x
and n \times n_z
to be estimated from the data.
fit_glmpca_pois
computes maximum-likelihood estimates (MLEs)
of U
, V
, D
, B
and W
satistifying the
orthogonality constraints for U
and V
and the
additional constraints on D
that the entries are positive and
decreasing. This is accomplished by iteratively fitting a series of
Poisson GLMs, where each of these individual Poissons GLMs is fitted
using a fast “cyclic co-ordinate descent” (CCD) algorithm.
The control
argument is a list in which any of the following
named components will override the default optimization algorithm
settings (as they are defined by
fit_glmpca_pois_control_default
). Additional control
arguments not listed here can be used to control the behaviour of
fpiter
or daarem
; see
the help accompanying these functions for details.
use_daarem
If
use_daarem = TRUE
, the updates are accelerated using DAAREM; seedaarem
for details.tol
This is the value of the “tol” control argument for
fpiter
ordaarem
that determines when to stop the optimization. In brief, the optimization stops when the change in the estimates or in the log-likelihood between two successive updates is less than “tol”.maxiter
This is the value of the “maxiter” control argument for
fpiter
ordaarem
. In brief, it sets the upper limit on the number of CCD updates.convtype
This is the value of the “convtype” control argument for
daarem
. It determines whether the stopping criterion is based on the change in the estimates or the change in the log-likelihood between two successive updates.mon.tol
This is the value of the “mon.tol” control argument for
daarem
. This setting determines to what extent the monotonicity condition can be violated.num_ccd_iter
Number of co-ordinate descent updates to be made to parameters at each iteration of the algorithm.
line_search
If
line_search = TRUE
, a backtracking line search is performed at each iteration of CCD to guarantee improvement in the objective (the log-likelihood).ls_alpha
alpha parameter for backtracking line search. (Should be a number between 0 and 0.5, typically a number near zero.)
ls_beta
beta parameter for backtracking line search controlling the rate at which the step size is decreased. (Should be a number between 0 and 0.5.)
calc_deriv
If
calc_deriv = TRUE
, the maximum gradient ofU
andV
is calculated and stored after each update. This may be useful for assessing convergence of the optimization, though increases overhead.calc_max_diff
If
calc_max_diff = TRUE
, the largest change inU
andV
after each update is calculated and stored. This may be useful for monitoring progress of the optimization algorithm.orthonormalize
If
orthonormalize = TRUE
, the matricesU
andV
are made to be orthogonal after each update step. This improves the speed of convergence without the DAAREM acceleration; however, should not be used whenuse_daarem = TRUE
.
You may use function set_fastglmpca_threads
to adjust
the number of threads used in performing the updates.
Value
An object capturing the state of the model fit. It contains
estimates of U
, V
and D
(stored as matrices
U
, V
and a vector of diagonal entries d
,
analogous to the svd
return value); the other
parameters (X
, B
, Z
, W
); the log-likelihood
achieved (loglik
); information about which columns of
B
and W
are fixed (fixed_b_cols
,
fixed_w_cols
); and a data frame progress
storing
information about the algorithm's progress after each update.
References
Townes, F. W., Hicks, S. C., Aryee, M. J. and Irizarry, R. A. (2019). Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model. Genome Biology 20, 295. doi:10.1186/s13059-019-1861-6
Collins, M., Dasgupta, S. and Schapire, R. E. (2002). A generalization of principal components analysis to the exponential family. In Advances in Neural Information Processing Systems 14.
See Also
Examples
set.seed(1)
n <- 200
p <- 100
K <- 3
dat <- generate_glmpca_data_pois(n,p,K)
fit0 <- init_glmpca_pois(dat$Y,K)
fit <- fit_glmpca_pois(dat$Y,fit0 = fit0)