befa {BayesFM} | R Documentation |
Bayesian Exploratory Factor Analysis
Description
This function implements the Bayesian Exploratory Factor Analysis
(befa
) approach developed in Conti et al. (CFSHP, 2014). It runs a
MCMC sampler for a factor model with dedicated factors, where each manifest
variable is allowed to load on at most one latent factor. The allocation of
the manifest variables to the latent factors is not fixed a priori but
determined stochastically during sampling. The minimum number of variables
dedicated to each factor can be controlled by the user to achieve the desired
level of identification. The manifest variables can be continuous or
dichotomous, and control variables can be introduced as covariates.
Usage
befa(model, data, burnin = 1000, iter = 10000, Nid = 3, Kmax, A0 = 10,
B0 = 10, c0 = 2, C0 = 1, HW.prior = TRUE, nu0 = Kmax + 1, S0 = 1,
kappa0 = 2, xi0 = 1, kappa = 1/Kmax, indp.tau0 = TRUE,
rnd.step = TRUE, n.step = 5, search.delay = min(burnin, 10),
R.delay = min(burnin, 100), dedic.start, alpha.start, sigma.start,
beta.start, R.start, verbose = TRUE)
Arguments
model |
This argument specifies the manifest variables and the covariates used in the model (if any). It can be specified in two different ways:
Binary manifest variables should be specified as logical vectors in
the data frame to be treated as dichotomous. |
data |
Data frame. If missing, parent data frame if used. |
burnin |
Burn-in period of the MCMC sampler. |
iter |
Number of MCMC iterations saved for posterior inference (after burn-in). |
Nid |
Minimum number of manifest variables dedicated to each latent factor for identification. |
Kmax |
Maximum number of latent factors. If missing, the maximum number of
factors that satisfies the identification condition determined by
|
A0 |
Scaling parameters of the variance of the Normal prior on the nonzero factor loadings. Either a scalar or a numeric vector of length equal to the number of manifest variables. |
B0 |
Variances of the Normal prior on the regression coefficients. Either a scalar or a numeric vector of length equal to the number of manifest variables. |
c0 |
Shape parameters of the Inverse-Gamma prior on the idiosyncratic variances. Either a scalar or a numeric vector of length equal to the number of manifest variables. |
C0 |
Scale parameters of the Inverse-Gamma prior on the idiosyncratic variances. Either a scalar or a numeric vector of length equal to the number of manifest variables. |
HW.prior |
If |
nu0 |
Degrees of freedom of the Inverse-Wishart prior on the covariance matrix of the latent factors in the expanded model. |
S0 |
Scale parameters of the Inverse-Wishart prior on the covariance matrix of latent factors in the expanded model:
Either a scalar or a numeric vector of length equal to |
kappa0 |
First shape parameter of the Beta prior distribution on the
probability |
xi0 |
Second shape parameter of the Beta prior distribution on the
probability |
kappa |
Concentration parameters of the Dirichlet prior distribution on the
indicators. Either a scalar or a numeric vector of length equal to
|
indp.tau0 |
If |
rnd.step |
If |
n.step |
Controls the number of intermediate steps in non-identified models:
|
search.delay |
Number of MCMC iterations run with fixed indicator matrix (specified
in |
R.delay |
Number of MCMC iterations run with fixed correlation matrix (specified
in |
dedic.start |
Starting values for the indicators. Vector of integers of length equal
to the number of manifest variables. Each element takes a value among
0, 1, ..., |
alpha.start |
Starting values for the factor loadings. Numeric vector of length
equal to the number of manifest variables. If missing, sampled from a
Normal distribution with zero mean and variance |
sigma.start |
Starting values for the idiosyncratic variances. Numeric vector of length equal to the number of manifest variables. Sampled from prior if missing. |
beta.start |
Starting values for the regression coefficients. Numeric vector of length equal to the total number of regression coefficients, concatenated for all the manifest variables. Sampled from prior if missing. |
R.start |
Starting values for the correlation matrix of the latent factors.
Numeric matrix with |
verbose |
If |
Details
Model specification. The model is specified as follows, for
each observation i = 1, ..., N
:
Y^*_i = \beta X_i + \alpha \theta_i + \epsilon_i
\theta_i \sim \mathcal{N}(0, R)
\epsilon_i \sim \mathcal{N}(0, \Sigma)
\Sigma = diag(\sigma^2_1, ..., \sigma^2_M)
where Y^*_i
is the M
-vector containing the latent
variables underlying the corresponding M
manifest variables
Y_i
, which can be continuous such that
Y_{im} = Y^*_{im}
, or binary with
Y_{im} = 1[Y^*_{im} > 0]
, for m = 1, ..., M
.
The K
-vector \theta_i
contains the latent factors, and
\alpha
is the (M \times K)
-matrix of factor loadings.
The M
-vector \epsilon_i
is the vector of error terms.
Covariates can be included in the Q
-vector X_i
and are related to
the manifest variables through the (M \times Q)
-matrix of
regression coefficients \beta
. Intercept terms are automatically
included, but can be omitted in some or all equations using the usual syntax
for R formulae (e.g., 'Y1 ~ X1 - 1' specifies that that Y1 is regressed on X1
and no intercept is included in the corresponding equation).
The number of latent factors K
is specified as Kmax
. However,
during MCMC sampling the stochastic search process on the matrix \alpha
may produce zero columns, thereby reducing the number of active factors.
The covariance matrix R
of the latent factors is assumed to be a
correlation matrix for identification.
Each row of the factor loading matrix \alpha
contains at most one
nonzero element (dedicated factor model). The allocation of the manifest
variables to the latent factors is indicated by the binary matrix
\Delta
with same dimensions as \alpha
, such that each row
\Delta_m
indicates which factor loading is different from zero, e.g.:
\Delta_m = (0, .., 0, 1, 0, ..., 0) \equiv e_k
indicates that variable m
loads on the k
th factor, where
e_k
is a K
-vector that contains only zeros, besides its k
th
element that equals 1.
Identification. The function verifies that the maximum number of
latent factors Kmax
does not exceed the Ledermann bound. It also
checks that Kmax
is consistent with the identification restriction
specified with Nid
(enough variables should be available to load on
the factors when Kmax
is reached). The default value for Kmax
is the minimum between the Ledermann bound and the maximum number of factors
that can be loaded by Nid
variables. The user is free to select the
level of identification, see CFSHP section 2.2 (non-identified models are
allowed with Nid = 1
).
Note that identification is achieved only with respect to the scale of the
latent factors. Non-identifiability problems may affect the posterior sample
because of column switching and sign switching of the factor loadings.
These issues can be addressed a posteriori with the functions
post.column.switch
and post.sign.switch
.
Prior specification.
The indicators are assumed to have the following probabilities,
for k = 1, ..., K
:
Prob(\Delta_m = e_k \mid \tau_k) = \tau_k
\tau = (\tau_0, \tau_1, ..., \tau_K)
If indp.tau0 = FALSE
, the probabilities are specified as:
\tau = [\tau_0, (1-\tau_0)\tau^*_1, ..., (1-\tau_0)\tau^*_K]
\tau_0 \sim \mathcal{B}eta(\kappa_0, \xi_0)
\tau^* = (\tau^*_1, ..., \tau^*_K) \sim \mathcal{D}ir(\kappa)
with \kappa_0
= kappa0
, \xi_0
= xi0
and
\kappa
= kappa
.
Alternatively, if indp.tau0 = TRUE
, the probabilities are specified
as:
\tau_m = [\tau_{0m}, (1-\tau_{0m})\tau^*_1, ...,
(1-\tau_{0m})\tau^*_K]
\tau_{0m} \sim \mathcal{B}eta(\kappa_0, \xi_0)
for each manifest variable m = 1, ..., M
.
A normal-inverse-Gamma prior distribution is assumed on the nonzero factor loadings and on the idiosyncratic variances:
\sigma^2_m \sim \mathcal{I}nv-\mathcal{G}amma(c_{0m}, C_{0m})
\alpha_m^\Delta \mid \sigma^2_m \sim \mathcal{N}(0, A_{0m}\sigma^2_m)
where \alpha_m^\Delta
denotes the only nonzero loading in the m
th
row of \alpha
.
For the regression coefficients, a multivariate Normal prior distribution is
assumed on each row m = 1, ..., M
of \beta
:
\beta_m \sim \mathcal{N}(0, B_0 I_Q)
The covariates can be different across manifest variables, implying zero
restrictions on the matrix \beta
. To specify covariates, use a list
of formulas as model
(see example below). Intercept terms can be
introduced using
To sample the correlation matrix R
of the latent factors, marginal data
augmentation is implemented (van Dyk and Meng, 2001), see CFSHP section 2.2.
Using the transformation \Omega = \Lambda^{1/2} R \Lambda^{1/2}
, the
parameters \Lambda = diag(\lambda_1, ..., \lambda_K)
are used as
working parameters. These parameters correspond to the variances of
the latent factors in an expanded version of the model where the factors do
not have unit variances. Two prior distributions can be specified on the
covariance matrix \Omega
in the expanded model:
If
HW.prior = FALSE
, inverse-Wishart distribution:\Omega \sim \mathcal{I}nv-\mathcal{W}ishart(\nu_0, diag(S_0))
with
\nu_0
=nu0
andS_0
=S0
.If
HW.prior = TRUE
, Huang-Wand (2013) prior:\Omega \sim \mathcal{I}nv-\mathcal{W}ishart(\nu_0, W), \qquad W = diag(w_1, ..., w_K)
w_k \sim \mathcal{G}amma\left(\frac{1}{2}, \frac{1}{2\nu^*S_{0k}}\right)
with
\nu^*
=nu0
-Kmax
+ 1, and the shape and rate parameters are specified such that the mean of the gamma distribution is equal to\nu^* S_{0k}
, for eachk = 1, ..., K
.
Missing values. Missing values (NA
) are allowed in the
manifest variables. They are drawn from their corresponding conditional
distributions during MCMC sampling. Control variables with missing values
can be passed to the function. However, all the observations with at least
one missing value in the covariates are discarded from the sample (a warning
message is issued in that case).
Value
The function returns an object of class 'befa
' containing the
MCMC draws of the model parameters saved in the following matrices (each
matrix has 'iter
' rows):
-
alpha
: Factor loadings. -
sigma
: Idiosyncratic variances. -
R
: Correlation matrix of the latent factors (off-diagonal elements only). -
beta
: regression coefficients (if any). -
dedic
: indicators (integers indicating on which factors the manifest variable load).
The returned object also contains:
-
nfac
: Vector of number of 'active' factors across MCMC iterations (i.e., factors loaded by at leastNid
manifest variables). -
MHacc
: Logical vector indicating accepted proposals of Metropolis-Hastings algorithm.
The parameters Kmax
and Nid
are saved as object attributes, as
well as the function call and the number of mcmc iterations (burnin
and iter
), and two logical variables indicating if the returned object
has been post processed to address the column switching problem
(post.column.switch
) and the sign switching problem
(post.sign.switch
).
Author(s)
Rémi Piatek remi.piatek@gmail.com
References
G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): “Bayesian Exploratory Factor Analysis”, Journal of Econometrics, 183(1), pages 31-57, doi:10.1016/j.jeconom.2014.06.008.
A. Huang, M.P. Wand (2013): “Simple Marginally Noninformative Prior Distributions for Covariance Matrices”, Bayesian Analysis, 8(2), pages 439-452, doi:10.1214/13-BA815.
D.A. van Dyk, X.-L. Meng (2001): “The Art of Data Augmentation”, Journal of Computational and Graphical Statistics, 10(1), pages 1-50, doi:10.1198/10618600152418584.
See Also
post.column.switch
and post.sign.switch
for column switching and sign switching of the factor loading matrix and of
the correlation matrix of the latent factors to restore identification
a posteriori.
summary.befa
and plot.befa
to summarize
and plot the posterior results.
simul.R.prior
and simul.nfac.prior
to
simulate the prior distribution of the correlation matrix of the factors and
the prior distribution of the indicator matrix, respectively. This is useful
to perform prior sensitivity analysis and to understand the role of the
corresponding parameters in the factor search.
Examples
#### model without covariates
set.seed(6)
# generate fake data with 15 manifest variables and 3 factors
N <- 100 # number of observations
Y <- simul.dedic.facmod(N, dedic = rep(1:3, each = 5))
# run MCMC sampler
# notice: 1000 MCMC iterations for illustration purposes only,
# increase this number to obtain reliable posterior results!
mcmc <- befa(Y, Kmax = 5, iter = 1000)
# post process MCMC draws to restore identification
mcmc <- post.column.switch(mcmc)
mcmc <- post.sign.switch(mcmc)
summary(mcmc) # summarize posterior results
plot(mcmc) # plot posterior results
# summarize highest posterior probability (HPP) model
summary(mcmc, what = 'hppm')
#### model with covariates
# generate covariates and regression coefficients
Xcov <- cbind(1, matrix(rnorm(4*N), ncol = 4))
colnames(Xcov) <- c('(Intercept)', paste0('X', 1:4))
beta <- rbind(rnorm(15), rnorm(15), diag(3) %x% t(rnorm(5)))
# add covariates to previous model
Y <- Y + Xcov %*% beta
# specify model
model <- c('~ X1', # X1 covariate in all equations
paste0('Y', 1:5, ' ~ X2'), # X2 covariate for Y1-Y5 only
paste0('Y', 6:10, ' ~ X3'), # X3 covariate for Y6-Y10 only
paste0('Y', 11:15, ' ~ X4')) # X4 covariate for Y11-Y15 only
model <- lapply(model, as.formula) # make list of formulas
# run MCMC sampler, post process and summarize
mcmc <- befa(model, data = data.frame(Y, Xcov), Kmax = 5, iter = 1000)
mcmc <- post.column.switch(mcmc)
mcmc <- post.sign.switch(mcmc)
mcmc.sum <- summary(mcmc)
mcmc.sum
# compare posterior mean of regression coefficients to true values
beta.comp <- cbind(beta[beta != 0], mcmc.sum$beta[, 'mean'])
colnames(beta.comp) <- c('true', 'mcmc')
print(beta.comp, digits = 3)