R: Bayesian Exploratory Factor Analysis

befa {BayesFM}

R Documentation

Bayesian Exploratory Factor Analysis

Description

This function implements the Bayesian Exploratory Factor Analysis (befa) approach developed in Conti et al. (CFSHP, 2014). It runs a MCMC sampler for a factor model with dedicated factors, where each manifest variable is allowed to load on at most one latent factor. The allocation of the manifest variables to the latent factors is not fixed a priori but determined stochastically during sampling. The minimum number of variables dedicated to each factor can be controlled by the user to achieve the desired level of identification. The manifest variables can be continuous or dichotomous, and control variables can be introduced as covariates.

Usage

befa(model, data, burnin = 1000, iter = 10000, Nid = 3, Kmax, A0 = 10,
  B0 = 10, c0 = 2, C0 = 1, HW.prior = TRUE, nu0 = Kmax + 1, S0 = 1,
  kappa0 = 2, xi0 = 1, kappa = 1/Kmax, indp.tau0 = TRUE,
  rnd.step = TRUE, n.step = 5, search.delay = min(burnin, 10),
  R.delay = min(burnin, 100), dedic.start, alpha.start, sigma.start,
  beta.start, R.start, verbose = TRUE)

Arguments

`model`	This argument specifies the manifest variables and the covariates used in the model (if any). It can be specified in two different ways: A numeric matrix or a data frame containing the manifest variables. This corresponds to a model without covariates, and the argument `data` is not required. A list of model formulas. Each element of the list specifies a manifest variable and its corresponding control variables (e.g., '`Y1 ~ X1 + X2`' to use `X1` and `X2` as control variables for `Y1`). If a formula has no left-hand side variable, covariates on the right-hand side are included in all equations (e.g., '`~ X3`' means that `X3` is used as a control variable for all the manifest variables). Argument `data` can be passed to the function in that case, otherwise parent data frame is used. Binary manifest variables should be specified as logical vectors in the data frame to be treated as dichotomous. `NA` values are accepted in manifest variables only.
`data`	Data frame. If missing, parent data frame if used.
`burnin`	Burn-in period of the MCMC sampler.
`iter`	Number of MCMC iterations saved for posterior inference (after burn-in).
`Nid`	Minimum number of manifest variables dedicated to each latent factor for identification.
`Kmax`	Maximum number of latent factors. If missing, the maximum number of factors that satisfies the identification condition determined by `Nid` and the Ledermann bound is specified (see CFSHP, section 2.2).
`A0`	Scaling parameters of the variance of the Normal prior on the nonzero factor loadings. Either a scalar or a numeric vector of length equal to the number of manifest variables.
`B0`	Variances of the Normal prior on the regression coefficients. Either a scalar or a numeric vector of length equal to the number of manifest variables.
`c0`	Shape parameters of the Inverse-Gamma prior on the idiosyncratic variances. Either a scalar or a numeric vector of length equal to the number of manifest variables.
`C0`	Scale parameters of the Inverse-Gamma prior on the idiosyncratic variances. Either a scalar or a numeric vector of length equal to the number of manifest variables.
`HW.prior`	If `TRUE`, implement Huang-Wand (2013) prior on the covariance matrix of the factors in the expanded model, otherwise use an Inverse-Wishart prior if `FALSE`, see CFSHP section 2.3.5.
`nu0`	Degrees of freedom of the Inverse-Wishart prior on the covariance matrix of the latent factors in the expanded model.
`S0`	Scale parameters of the Inverse-Wishart prior on the covariance matrix of latent factors in the expanded model: if `HW.prior = TRUE`, scale parameter of the Gamma hyperprior distribution on the individual scales of the Inverse-Wishart prior. if `HW.prior = FALSE`, diagonal elements of the scale matrix of the Inverse-Wishart prior on the covariance matrix of the latent factors in the expanded model. Either a scalar or a numeric vector of length equal to `Kmax`.
`kappa0`	First shape parameter of the Beta prior distribution on the probability `\tau_0` that a manifest variable does not load on any factor.
`xi0`	Second shape parameter of the Beta prior distribution on the probability `\tau_0` that a manifest variable does not load on any factor.
`kappa`	Concentration parameters of the Dirichlet prior distribution on the indicators. Either a scalar or a numeric vector of length equal to `Kmax`.
`indp.tau0`	If `TRUE`, specify the alternative prior specification with independent parameters `\tau_{0m}` across manifest variables `m = 1, ..., M`, otherwise use a common parameter `\tau_0` if `FALSE`.
`rnd.step`	If `TRUE`, select randomly the number of intermediate steps in non-identified models at each MCMC iteration, otherwise use a fixed number of steps if `FALSE`.
`n.step`	Controls the number of intermediate steps in non-identified models: if `rnd.step = TRUE`, average number of steps. The number of steps is sampled at each MCMC iteration from 1+Poisson(`n.step`-1). if `rnd.step = FALSE`, fixed number of steps.
`search.delay`	Number of MCMC iterations run with fixed indicator matrix (specified in `dedic.start`) at beginning of MCMC sampling.
`R.delay`	Number of MCMC iterations run with fixed correlation matrix (specified in `dedic.start`) at beginning of MCMC sampling.
`dedic.start`	Starting values for the indicators. Vector of integers of length equal to the number of manifest variables. Each element takes a value among 0, 1, ..., `Kmax`. If missing, random allocation of the manifest variables to the maximum number of factors `Kmax`, with a minimum of `Nid` manifest variables loading on each factor.
`alpha.start`	Starting values for the factor loadings. Numeric vector of length equal to the number of manifest variables. If missing, sampled from a Normal distribution with zero mean and variance `A0`.
`sigma.start`	Starting values for the idiosyncratic variances. Numeric vector of length equal to the number of manifest variables. Sampled from prior if missing.
`beta.start`	Starting values for the regression coefficients. Numeric vector of length equal to the total number of regression coefficients, concatenated for all the manifest variables. Sampled from prior if missing.
`R.start`	Starting values for the correlation matrix of the latent factors. Numeric matrix with `Kmax` rows and columns, and unit diagonal elements. If missing, identity matrix is used.
`verbose`	If `TRUE`, display information on the progress of the function.

Details

Model specification. The model is specified as follows, for each observation i = 1, ..., N:

Y^*_i = \beta X_i + \alpha \theta_i + \epsilon_i

\theta_i \sim \mathcal{N}(0, R)

\epsilon_i \sim \mathcal{N}(0, \Sigma)

\Sigma = diag(\sigma^2_1, ..., \sigma^2_M)

where Y^*_i is the M-vector containing the latent variables underlying the corresponding M manifest variables Y_i, which can be continuous such that Y_{im} = Y^*_{im}, or binary with Y_{im} = 1[Y^*_{im} > 0], for m = 1, ..., M. The K-vector \theta_i contains the latent factors, and \alpha is the (M \times K)-matrix of factor loadings. The M-vector \epsilon_i is the vector of error terms. Covariates can be included in the Q-vector X_i and are related to the manifest variables through the (M \times Q)-matrix of regression coefficients \beta. Intercept terms are automatically included, but can be omitted in some or all equations using the usual syntax for R formulae (e.g., 'Y1 ~ X1 - 1' specifies that that Y1 is regressed on X1 and no intercept is included in the corresponding equation).

The number of latent factors K is specified as Kmax. However, during MCMC sampling the stochastic search process on the matrix \alpha may produce zero columns, thereby reducing the number of active factors.

The covariance matrix R of the latent factors is assumed to be a correlation matrix for identification.

Each row of the factor loading matrix \alpha contains at most one nonzero element (dedicated factor model). The allocation of the manifest variables to the latent factors is indicated by the binary matrix \Delta with same dimensions as \alpha, such that each row \Delta_m indicates which factor loading is different from zero, e.g.:

\Delta_m = (0, .., 0, 1, 0, ..., 0) \equiv e_k

indicates that variable m loads on the kth factor, where e_k is a K-vector that contains only zeros, besides its kth element that equals 1.

Identification. The function verifies that the maximum number of latent factors Kmax does not exceed the Ledermann bound. It also checks that Kmax is consistent with the identification restriction specified with Nid (enough variables should be available to load on the factors when Kmax is reached). The default value for Kmax is the minimum between the Ledermann bound and the maximum number of factors that can be loaded by Nid variables. The user is free to select the level of identification, see CFSHP section 2.2 (non-identified models are allowed with Nid = 1).

Note that identification is achieved only with respect to the scale of the latent factors. Non-identifiability problems may affect the posterior sample because of column switching and sign switching of the factor loadings. These issues can be addressed a posteriori with the functions post.column.switch and post.sign.switch.

Prior specification. The indicators are assumed to have the following probabilities, for k = 1, ..., K:

Prob(\Delta_m = e_k \mid \tau_k) = \tau_k

\tau = (\tau_0, \tau_1, ..., \tau_K)

If indp.tau0 = FALSE, the probabilities are specified as:

\tau = [\tau_0, (1-\tau_0)\tau^*_1, ..., (1-\tau_0)\tau^*_K]

\tau_0 \sim \mathcal{B}eta(\kappa_0, \xi_0)

\tau^* = (\tau^*_1, ..., \tau^*_K) \sim \mathcal{D}ir(\kappa)

with \kappa_0 = kappa0, \xi_0 = xi0 and \kappa = kappa. Alternatively, if indp.tau0 = TRUE, the probabilities are specified as:

\tau_m = [\tau_{0m}, (1-\tau_{0m})\tau^*_1, ..., (1-\tau_{0m})\tau^*_K]

\tau_{0m} \sim \mathcal{B}eta(\kappa_0, \xi_0)

for each manifest variable m = 1, ..., M.

A normal-inverse-Gamma prior distribution is assumed on the nonzero factor loadings and on the idiosyncratic variances:

\sigma^2_m \sim \mathcal{I}nv-\mathcal{G}amma(c_{0m}, C_{0m})

\alpha_m^\Delta \mid \sigma^2_m \sim \mathcal{N}(0, A_{0m}\sigma^2_m)

where \alpha_m^\Delta denotes the only nonzero loading in the mth row of \alpha.

For the regression coefficients, a multivariate Normal prior distribution is assumed on each row m = 1, ..., M of \beta:

\beta_m \sim \mathcal{N}(0, B_0 I_Q)

The covariates can be different across manifest variables, implying zero restrictions on the matrix \beta. To specify covariates, use a list of formulas as model (see example below). Intercept terms can be introduced using

To sample the correlation matrix R of the latent factors, marginal data augmentation is implemented (van Dyk and Meng, 2001), see CFSHP section 2.2. Using the transformation \Omega = \Lambda^{1/2} R \Lambda^{1/2}, the parameters \Lambda = diag(\lambda_1, ..., \lambda_K) are used as working parameters. These parameters correspond to the variances of the latent factors in an expanded version of the model where the factors do not have unit variances. Two prior distributions can be specified on the covariance matrix \Omega in the expanded model:

If HW.prior = FALSE, inverse-Wishart distribution:

\Omega \sim \mathcal{I}nv-\mathcal{W}ishart(\nu_0, diag(S_0))

with \nu_0 = nu0 and S_0 = S0.
If HW.prior = TRUE, Huang-Wand (2013) prior:

\Omega \sim \mathcal{I}nv-\mathcal{W}ishart(\nu_0, W), \qquad W = diag(w_1, ..., w_K)

w_k \sim \mathcal{G}amma\left(\frac{1}{2}, \frac{1}{2\nu^*S_{0k}}\right)

with \nu^* = nu0 - Kmax + 1, and the shape and rate parameters are specified such that the mean of the gamma distribution is equal to \nu^* S_{0k}, for each k = 1, ..., K.

Missing values. Missing values (NA) are allowed in the manifest variables. They are drawn from their corresponding conditional distributions during MCMC sampling. Control variables with missing values can be passed to the function. However, all the observations with at least one missing value in the covariates are discarded from the sample (a warning message is issued in that case).

Value

The function returns an object of class 'befa' containing the MCMC draws of the model parameters saved in the following matrices (each matrix has 'iter' rows):

alpha: Factor loadings.
sigma: Idiosyncratic variances.
R: Correlation matrix of the latent factors (off-diagonal elements only).
beta: regression coefficients (if any).
dedic: indicators (integers indicating on which factors the manifest variable load).

The returned object also contains:

nfac: Vector of number of 'active' factors across MCMC iterations (i.e., factors loaded by at least Nid manifest variables).
MHacc: Logical vector indicating accepted proposals of Metropolis-Hastings algorithm.

The parameters Kmax and Nid are saved as object attributes, as well as the function call and the number of mcmc iterations (burnin and iter), and two logical variables indicating if the returned object has been post processed to address the column switching problem (post.column.switch) and the sign switching problem (post.sign.switch).

Author(s)

Rémi Piatek remi.piatek@gmail.com

References

G. Conti, S. Frühwirth-Schnatter, J.J. Heckman, R. Piatek (2014): “Bayesian Exploratory Factor Analysis”, Journal of Econometrics, 183(1), pages 31-57, doi:10.1016/j.jeconom.2014.06.008.

A. Huang, M.P. Wand (2013): “Simple Marginally Noninformative Prior Distributions for Covariance Matrices”, Bayesian Analysis, 8(2), pages 439-452, doi:10.1214/13-BA815.

D.A. van Dyk, X.-L. Meng (2001): “The Art of Data Augmentation”, Journal of Computational and Graphical Statistics, 10(1), pages 1-50, doi:10.1198/10618600152418584.

Examples

#### model without covariates

set.seed(6)

# generate fake data with 15 manifest variables and 3 factors
N <- 100    # number of observations
Y <- simul.dedic.facmod(N, dedic = rep(1:3, each = 5))

# run MCMC sampler
# notice: 1000 MCMC iterations for illustration purposes only,
#  increase this number to obtain reliable posterior results!
mcmc <- befa(Y, Kmax = 5, iter = 1000)

# post process MCMC draws to restore identification
mcmc <- post.column.switch(mcmc)
mcmc <- post.sign.switch(mcmc)

summary(mcmc)  # summarize posterior results
plot(mcmc)     # plot posterior results

# summarize highest posterior probability (HPP) model
summary(mcmc, what = 'hppm')

#### model with covariates

# generate covariates and regression coefficients
Xcov <- cbind(1, matrix(rnorm(4*N), ncol = 4))
colnames(Xcov) <- c('(Intercept)', paste0('X', 1:4))
beta <- rbind(rnorm(15), rnorm(15), diag(3) %x% t(rnorm(5)))

# add covariates to previous model
Y <- Y + Xcov %*% beta

# specify model
model <- c('~ X1',                        # X1 covariate in all equations
           paste0('Y', 1:5,   ' ~ X2'),   # X2 covariate for Y1-Y5 only
           paste0('Y', 6:10,  ' ~ X3'),   # X3 covariate for Y6-Y10 only
           paste0('Y', 11:15, ' ~ X4'))   # X4 covariate for Y11-Y15 only
model <- lapply(model, as.formula)        # make list of formulas

# run MCMC sampler, post process and summarize
mcmc <- befa(model, data = data.frame(Y, Xcov), Kmax = 5, iter = 1000)
mcmc <- post.column.switch(mcmc)
mcmc <- post.sign.switch(mcmc)
mcmc.sum <- summary(mcmc)
mcmc.sum

# compare posterior mean of regression coefficients to true values
beta.comp <- cbind(beta[beta != 0], mcmc.sum$beta[, 'mean'])
colnames(beta.comp) <- c('true', 'mcmc')
print(beta.comp, digits = 3)

[Package BayesFM version 0.1.7 Index]