R: Multilevel hidden Markov model using Bayesian estimation

mHMM {mHMMbayes}

R Documentation

Multilevel hidden Markov model using Bayesian estimation

Description

mHMM fits a multilevel (also known as mixed or random effects) hidden Markov model (HMM) to intense longitudinal data with categorical, continuous (i.e., normally distributed), or count (i.e., Poisson distributed) observations of multiple subjects using Bayesian estimation, and creates an object of class mHMM. By using a multilevel framework, we allow for heterogeneity in the model parameters between subjects, while estimating one overall HMM. The function includes the possibility to add covariates at level 2 (i.e., at the subject level) and have varying observation lengths over subjects. For a short description of the package see mHMMbayes. See vignette("tutorial-mhmm") for an introduction to multilevel hidden Markov models and the package, and see vignette("estimation-mhmm") for an overview of the used estimation algorithms.

Usage

mHMM(
  s_data,
  data_distr = "categorical",
  gen,
  xx = NULL,
  start_val,
  mcmc,
  return_path = FALSE,
  show_progress = TRUE,
  gamma_hyp_prior = NULL,
  emiss_hyp_prior = NULL,
  gamma_sampler = NULL,
  emiss_sampler = NULL
)

Arguments

`s_data`	A matrix containing the observations to be modeled, where the rows represent the observations over time. In `s_data`, the first column indicates subject id number. Hence, the id number is repeated over rows equal to the number of observations for that subject. The subsequent columns contain the dependent variable(s). Note that in case of categorical dependent variable(s), input are integers (i.e., whole numbers) that range from 1 to `q_emiss` (see below) and is numeric (i.e., not be a (set of) factor variable(s)). The total number of rows are equal to the sum over the number of observations of each subject, and the number of columns are equal to the number of dependent variables (`n_dep`) + 1. The number of observations can vary over subjects.
`data_distr`	String vector with length 1 describing the observation type of the data. Currently supported are `'categorical'` , `'continuous'`, and `'count'`. Note that for multivariate data, all dependent variables are assumed to be of the same observation type. The default equals to `data_distr = 'categorical'`.
`gen`	List containing the following elements denoting the general model properties: `m`: numeric vector with length 1 denoting the number of hidden states `n_dep`: numeric vector with length 1 denoting the number of dependent variables `q_emiss`: only to be specified if the data represents categorical data. Numeric vector with length `n_dep` denoting the number of observed categories for the categorical emission distribution for each of the dependent variables.
`xx`	An optional list of (level 2) covariates to predict the transition matrix and/or the emission probabilities. Level 2 covariate(s) means that there is one observation per subject of each covariate. The first element in the list `xx` is used to predict the transition matrix. Subsequent elements in the list are used to predict the emission distribution of (each of) the dependent variable(s). Each element in the list is a matrix, with the number of rows equal to the number of subjects. The first column of each matrix represents the intercept, that is, a column only consisting of ones. Subsequent columns correspond to covariates used to predict the transition matrix / emission distribution. See Details for more information on the use of covariates. If `xx` is omitted completely, `xx` defaults to `NULL`, resembling no covariates. Specific elements in the list can also be left empty (i.e., set to `NULL`) to signify that either the transition probability matrix or a specific emission distribution is not predicted by covariates.
`start_val`	List containing the start values for the transition probability matrix gamma and the emission distribution(s). The first element of the list contains a `m` by `m` matrix with the start values for gamma. The subsequent `n_dep` elements each contain a matrix with the start values for the emission distribution(s). In case of categorical observations: a `m` by `q_emiss[k]` matrix denoting the probability of observing each categorical outcome (columns) within each state (rows). In case of continuous observations: a `m` by 2 matrix denoting the mean (first column) and standard deviation (second column) of the Normal emission distribution within each state (rows). In case of count observations: a `m` by 1 matrix denoting the mean (column) of the Poisson emission distribution within each state on the natural (i.e., real positive domain, not logarithmic) scale (rows). Note that `start_val` should not contain nested lists (i.e., lists within lists).
`mcmc`	List of Markov chain Monte Carlo (MCMC) arguments, containing the following elements: `J`: numeric vector with length 1 denoting the number of iterations of the MCMC algorithm `burn_in`: numeric vector with length 1 denoting the burn-in period for the MCMC algorithm.
`return_path`	A logical scalar. Should the sampled state sequence obtained at each iteration and for each subject be returned by the function (`sample_path = TRUE`) or not (`sample_path = FALSE`). Note that the sampled state sequence is quite a large object, hence the default setting is `sample_path = FALSE`. Can be used for local decoding purposes.
`show_progress`	A logical scaler. Should the function show a text progress bar in the `R` console to represent the progress of the algorithm (`show_progress = TRUE`) or not (`show_progress = FALSE`). Defaults to `show_progress = TRUE`.
`gamma_hyp_prior`	An optional object of class `mHMM_prior_gamma` containing user specified parameter values for the hyper-prior distribution on the transition probability matrix gamma, generated by the function `prior_gamma`.
`emiss_hyp_prior`	An object of the class `mHMM_prior_emiss` containing user specified parameter values for the hyper-prior distribution on the categorical or Normal emission distribution(s), generated by the function `prior_emiss_cat`, `prior_emiss_cont` or `prior_emiss_count`. Mandatory in case of continuous and count observations, optional in case of categorical observations.
`gamma_sampler`	An optional object of the class `mHMM_pdRW_gamma` containing user specified settings for the proposal distribution of the random walk (RW) Metropolis sampler on the subject level transition probability matrix parameters, generated by the function `pd_RW_gamma`.
`emiss_sampler`	An optional object of the class `mHMM_pdRW_emiss` containing user specified settings for the proposal distribution of the random walk (RW) Metropolis sampler on the subject level emission distribution(s) parameters, generated by either the function `pd_RW_emiss_cat` or `pd_RW_emiss_count`. Only applicable in case of categorical and count observations.

Details

Covariates specified in xx can either be dichotomous or continuous variables. Dichotomous variables have to be coded as 0/1 variables. Categorical or factor variables can as yet not be used as predictor covariates. The user can however break up the categorical variable in multiple dummy variables (i.e., dichotomous variables), which can be used simultaneously in the analysis. Continuous predictors are automatically centered. That is, the mean value of the covariate is subtracted from all values of the covariate such that the new mean equals zero. This is done such that the presented probabilities in the output (i.e., for the population transition probability matrix and population emission probabilities) correspond to the predicted probabilities at the average value of the covariate(s).

Value

mHMM returns an object of class mHMM, which has print and summary methods to see the results. The object contains always the following components:

PD_subj: A list containing one list per subject with the elements trans_prob, cat_emiss, cont_emiss, or count_emiss in case of categorical, continuous, and count observations, respectively, and log_likl, providing the subject parameter estimates over the iterations of the MCMC sampler. trans_prob relates to the transition probabilities gamma, cat_emiss to the categorical emission distribution (emission probabilities), cont_emiss to the continuous emission distributions (subsequently the the emission means and the (fixed over subjects) emission standard deviation), count_emiss to the count emission distribution means in the natural (real-positive) scale, and log_likl to the log likelihood over the MCMC iterations. Iterations are contained in the rows, the parameters in the columns.
gamma_prob_bar: A matrix containing the group level parameter estimates of the transition probabilities over the iterations of the hybrid Metropolis within Gibbs sampler. The iterations of the sampler are contained in the rows, and the columns contain the group level parameter estimates. If covariates were included in the analysis, the group level probabilities represent the predicted probability given that the covariate is at the average value for continuous covariates, or given that the covariate equals zero for dichotomous covariates.
gamma_int_bar: A matrix containing the group level intercepts of the Multinomial logistic regression modeling the transition probabilities over the iterations of the hybrid Metropolis within Gibbs sampler. The iterations of the sampler are contained in the rows, and the columns contain the group level intercepts.
gamma_cov_bar: A matrix containing the group level regression coefficients of the Multinomial logistic regression predicting the transition probabilities over the iterations of the hybrid Metropolis within Gibbs sampler. The iterations of the sampler are contained in the rows, and the columns contain the group level regression coefficients.
gamma_V_int_bar: A matrix containing the variance components for the subject-level intercepts (between subject variances) of the multinomial logistic regression modeling the transition probabilities over the iterations of the hybrid Metropolis within Gibbs sampler. The iterations of the sampler are contained in the rows, and the columns contain the variance components for the subject level intercepts. Note that only the intercept variances (and not the co-variances) are returned.
gamma_int_subj: A list containing one matrix per subject denoting the subject level intercepts of the Multinomial logistic regression modeling the transition probabilities over the iterations of the hybrid Metropolis within Gibbs sampler. The iterations of the sampler are contained in the rows, and the columns contain the subject level intercepts.
gamma_naccept: A matrix containing the number of accepted draws at the subject level RW Metropolis step for each set of parameters of the transition probabilities. The subjects are contained in the rows, and the columns contain the sets of parameters.
input: Overview of used input specifications: the distribution type of the observations data_distr, the number of states m, the number of used dependent variables n_dep, (in case of categorical observations) the number of output categories for each of the dependent variables q_emiss, the number of iterations J and the specified burn in period burn_in of the hybrid Metropolis within Gibbs sampler, the number of subjects n_subj, the observation length for each subject n_vary, and the column names of the dependent variables dep_labels.
sample_path: A list containing one matrix per subject with the sampled hidden state sequence over the hybrid Metropolis within Gibbs sampler. The time points of the dataset are contained in the rows, and the sampled paths over the iterations are contained in the columns. Only returned if return_path = TRUE.

Additionally, in case of categorical observations, the mHMM return object contains:

emiss_prob_bar: A list containing one matrix per dependent variable, denoting the group level emission probabilities of each dependent variable over the iterations of the hybrid Metropolis within Gibbs sampler. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group level emission probabilities. If covariates were included in the analysis, the group level probabilities represent the predicted probability given that the covariate is at the average value for continuous covariates, or given that the covariate equals zero for dichotomous covariates.
emiss_int_bar: A list containing one matrix per dependent variable, denoting the group level intercepts of each dependent variable of the Multinomial logistic regression modeling the probabilities of the emission distribution over the iterations of the hybrid Metropolis within Gibbs sampler. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group level intercepts.
emiss_cov_bar: A list containing one matrix per dependent variable, denoting the group level regression coefficients of the Multinomial logistic regression predicting the emission probabilities within each of the dependent variables over the iterations of the hybrid Metropolis within Gibbs sampler. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group level regression coefficients.
emiss_V_int_bar: A list containing one matrix per dependent variable, denoting the variance components for the subject-level intercepts (between subject variances) of the multinomial logistic regression modeling the categorical emission probabilities over the iterations of the hybrid Metropolis within Gibbs sampler. The iterations of the sampler are contained in the rows, and the columns contain the variance components for the subject level intercepts. Note that only the intercept variances (and not the co-variances) are returned.
emiss_int_subj: A list containing one list per subject denoting the subject level intercepts of each dependent variable of the Multinomial logistic regression modeling the probabilities of the emission distribution over the iterations of the hybrid Metropolis within Gibbs sampler. Each lower level list contains one matrix per dependent variable, in which iterations of the sampler are contained in the rows, and the columns contain the subject level intercepts.
emiss_naccept: A list containing one matrix per dependent variable with the number of accepted draws at the subject level RW Metropolis step for each set of parameters of the emission distribution. The subjects are contained in the rows, and the columns of the matrix contain the sets of parameters.

In case of continuous observations, the mHMM return object contains:

emiss_mu_bar: A list containing one matrix per dependent variable, denoting the group level means of the Normal emission distribution of each dependent variable over the iterations of the Gibbs sampler. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group level emission means. If covariates were included in the analysis, the group level means represent the predicted mean given that the covariate is at the average value for continuous covariates, or given that the covariate equals zero for dichotomous covariates.
emiss_varmu_bar: A list containing one matrix per dependent variable, denoting the variance between the subject level means of the Normal emission distributions. over the iterations of the Gibbs sampler. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group level variance in the mean.
emiss_sd_bar: A list containing one matrix per dependent variable, denoting the (fixed over subjects) standard deviation of the Normal emission distributions over the iterations of the Gibbs sampler. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group level emission variances.
emiss_cov_bar: A list containing one matrix per dependent variable, denoting the group level regression coefficients predicting the emission means within each of the dependent variables over the iterations of the Gibbs sampler. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group level regression coefficients.
emiss_naccept: A list containing one matrix per dependent variable with the number of accepted draws at the subject level RW Metropolis step for each set of parameters of the emission distribution. The subjects are contained in the rows, and the columns of the matrix contain the sets of parameters.
input: Overview of used input specifications: the number of states m, the number of used dependent variables n_dep, the number of iterations J and the specified burn in period burn_in of the hybrid Metropolis within Gibbs sampler, the number of subjects n_subj, the observation length for each subject n_vary, and the column names of the dependent variables dep_labels.

In case of count observations, the mHMM return object contains:

emiss_mu_bar: A list containing one matrix per dependent variable, denoting the group-level means of the Poisson emission distribution of each dependent variable over the MCMC iterations. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group-level Poisson emission means in the positive scale. If covariates were included in the analysis, the group-level means represent the predicted mean counts given that the covariate is at the average value for continuous covariates, or given that the covariate equals zero for dichotomous covariates.
emiss_varmu_bar: A list containing one matrix per dependent variable, denoting the variance between the subject-level means of the Poisson emission distribution(s) over the iterations of the MCMC sampler. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group-level variance(s) between subject-specific means (in the positive scale).
emiss_logmu_bar: A list containing one matrix per dependent variable, denoting the group-level logmeans of the Poisson emission distribution of each dependent variable over the MCMC iterations. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group-level Poisson emission logmeans in the logarithmic scale. If covariates were included in the analysis, the group-level means represent the predicted logmean counts given that the covariate is at the average value for continuous covariates, or given that the covariate equals zero for dichotomous covariates.
emiss_logvarmu_bar: A list containing one matrix per dependent variable, denoting the logvariance between the subject-level logmeans of the Poisson emission distribution(s) over the iterations of the MCMC sampler. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group-level logvariance(s) between subject-specific means (in the logarithmic scale).
emiss_cov_bar: A list containing one matrix per dependent variable, denoting the group level regression coefficients predicting the emission means in the logarithmic scale within each of the dependent variables over the iterations of the MCMC sampler. The iterations of the sampler are contained in the rows of the matrix, and the columns contain the group-level regression coefficients in the logarithmic scale.
input: Overview of used input specifications: the number of states m, the number of used dependent variables n_dep, the number of iterations J and the specified burn in period burn_in of the hybrid Metropolis within Gibbs sampler, the number of subjects n_subj, the observation length for each subject n_vary, and the column names of the dependent variables dep_labels.

References

Rabiner LR (1989). “A tutorial on hidden Markov models and selected applications in speech recognition.” Proceedings of the IEEE, 77(2), 257–286.

Scott SL (2002). “Bayesian methods for hidden Markov models: Recursive computing in the 21st century.” Journal of the American Statistical Association, 97(457), 337–351.

Altman RM (2007). “Mixed hidden Markov models: an extension of the hidden Markov model to the longitudinal data setting.” Journal of the American Statistical Association, 102(477), 201–210.

Rossi PE, Allenby GM, McCulloch R (2012). Bayesian statistics and marketing. John Wiley & Sons.

Zucchini W, MacDonald IL, Langrock R (2017). Hidden Markov models for time series: an introduction using R. Chapman and Hall/CRC.

Examples

###### Example on package (categorical) example data, see ?nonverbal

# specifying general model properties:
m <- 2
n_dep <- 4
q_emiss <- c(3, 2, 3, 2)

# specifying starting values
start_TM <- diag(.8, m)
start_TM[lower.tri(start_TM) | upper.tri(start_TM)] <- .2
start_EM <- list(matrix(c(0.05, 0.90, 0.05,
                          0.90, 0.05, 0.05), byrow = TRUE,
                        nrow = m, ncol = q_emiss[1]), # vocalizing patient
                 matrix(c(0.1, 0.9,
                          0.1, 0.9), byrow = TRUE, nrow = m,
                        ncol = q_emiss[2]), # looking patient
                 matrix(c(0.90, 0.05, 0.05,
                          0.05, 0.90, 0.05), byrow = TRUE,
                        nrow = m, ncol = q_emiss[3]), # vocalizing therapist
                 matrix(c(0.1, 0.9,
                          0.1, 0.9), byrow = TRUE, nrow = m,
                        ncol = q_emiss[4])) # looking therapist

# Run a model without covariate(s):
# Note that for reasons of running time, J is set at a ridiculous low value.
# One would typically use a number of iterations J of at least 1000,
# and a burn_in of 200.
out_2st <- mHMM(s_data = nonverbal,
                gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                start_val = c(list(start_TM), start_EM),
                mcmc = list(J = 11, burn_in = 5))

out_2st
summary(out_2st)

# plot the posterior densities for the transition and emission probabilities
plot(out_2st, component = "gamma", col =c("darkslategray3", "goldenrod"))

# Run a model including a covariate (see ?nonverbal_cov) to predict the
# emission distribution for each of the 4 dependent variables:

n_subj <- 10
xx_emiss <- rep(list(matrix(c(rep(1, n_subj),nonverbal_cov$std_CDI_change),
                            ncol = 2, nrow = n_subj)), n_dep)
xx <- c(list(matrix(1, ncol = 1, nrow = n_subj)), xx_emiss)
out_2st_c <- mHMM(s_data = nonverbal, xx = xx,
                  gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  start_val = c(list(start_TM), start_EM),
                  mcmc = list(J = 11, burn_in = 5))


###### Example on categorical simulated data
# Simulate data for 10 subjects with each 100 observations:
n_t <- 100
n <- 10
m <- 2
n_dep <- 1
q_emiss <- 3
gamma <- matrix(c(0.8, 0.2,
                  0.3, 0.7), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(0.5, 0.5, 0.0,
                        0.1, 0.1, 0.8), nrow = m, ncol = q_emiss, byrow = TRUE))
data1 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, var_gamma = .5, var_emiss = .5)

# Specify remaining required analysis input (for the example, we use simulation
# input as starting values):
n_dep <- 1
q_emiss <- 3

# Run the model on the simulated data:
out_2st_sim <- mHMM(s_data = data1$obs,
                 gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                 start_val = c(list(gamma), emiss_distr),
                 mcmc = list(J = 11, burn_in = 5))

###### Example on continuous simulated data
 # simulating multivariate continuous data
n_t     <- 100
n       <- 10
m       <- 3
n_dep   <- 2

gamma   <- matrix(c(0.8, 0.1, 0.1,
                    0.2, 0.7, 0.1,
                    0.2, 0.2, 0.6), ncol = m, byrow = TRUE)

emiss_distr <- list(matrix(c( 50, 10,
                              100, 10,
                              150, 10), nrow = m, byrow = TRUE),
                    matrix(c(5, 2,
                             10, 5,
                             20, 3), nrow = m, byrow = TRUE))

data_cont <- sim_mHMM(n_t = n_t, n = n, data_distr = 'continuous',
                      gen = list(m = m, n_dep = n_dep),
                      gamma = gamma, emiss_distr = emiss_distr,
                      var_gamma = .1, var_emiss = c(5^2, 0.2^2))

# Specify hyper-prior for the continuous emission distribution
manual_prior_emiss <- prior_emiss_cont(
                        gen = list(m = m, n_dep = n_dep),
                        emiss_mu0 = list(matrix(c(30, 70, 170), nrow = 1),
                                         matrix(c(7, 8, 18), nrow = 1)),
                        emiss_K0 = list(1, 1),
                        emiss_V =  list(rep(5^2, m), rep(0.5^2, m)),
                        emiss_nu = list(1, 1),
                        emiss_a0 = list(rep(1.5, m), rep(1, m)),
                        emiss_b0 = list(rep(20, m), rep(4, m)))

# Run the model on the simulated data:
# Note that for reasons of running time, J is set at a ridiculous low value.
# One would typically use a number of iterations J of at least 1000,
# and a burn_in of 200.
out_3st_cont_sim <- mHMM(s_data = data_cont$obs,
                         data_distr = 'continuous',
                         gen = list(m = m, n_dep = n_dep),
                         start_val = c(list(gamma), emiss_distr),
                         emiss_hyp_prior = manual_prior_emiss,
                         mcmc = list(J = 11, burn_in = 5))




###### Example on multivariate count data
# Simulate data with one covariate for each count dependent variable

n_t     <- 200     # Number of observations on the dependent variable
m       <- 3        # Number of hidden states
n_dep   <- 2        # Number of dependent variables
n_subj  <- 30        # Number of subjects

gamma   <- matrix(c(0.9, 0.05, 0.05,
                    0.2, 0.7, 0.1,
                    0.2,0.3, 0.5), ncol = m, byrow = TRUE)

emiss_distr <- list(matrix(c(20,
                             10,
                             5), nrow = m, byrow = TRUE),
                    matrix(c(50,
                             3,
                             20), nrow = m, byrow = TRUE))

# Define between subject variance to use on the simulating function:
# here, the variance is varied over states within the dependent variable.
var_emiss <- list(matrix(c(5.0, 3.0, 1.5), nrow = m),
                  matrix(c(5.0, 5.0, 5.0), nrow = m))

# Simulate count data:
data_count <- sim_mHMM(n_t = n_t,
                       n = n_subj,
                       data_distr = "count",
                       gen = list(m = m, n_dep = n_dep),
                       gamma = gamma,
                       emiss_distr = emiss_distr,
                       var_gamma = 0.1,
                       var_emiss = var_emiss,
                       return_ind_par = TRUE)


# Transition probabilities
start_gamma <- diag(0.8, m)
start_gamma[lower.tri(start_gamma) | upper.tri(start_gamma)] <-
                                            (1 - diag(start_gamma)) / (m - 1)

# Emission distribution
start_emiss <- list(matrix(c(20,10, 5), nrow = m, byrow = TRUE),
                    matrix(c(50, 3,20), nrow = m, byrow = TRUE))

# Specify hyper-prior for the count emission distribution
manual_prior_emiss <- prior_emiss_count(
  gen = list(m = m, n_dep = n_dep),
  emiss_mu0 = list(matrix(c(20, 10, 5), byrow = TRUE, ncol = m),
                   matrix(c(50, 3, 20), byrow = TRUE, ncol = m)),
  emiss_K0  = rep(list(0.1),n_dep),
  emiss_nu  = rep(list(0.1),n_dep),
  emiss_V   = rep(list(rep(10, m)),n_dep)
)

# Run model
# Note that for reasons of running time, J is set at a ridiculous low value.
# One would typically use a number of iterations J of at least 1000,
# and a burn_in of 200.
out_3st_count_sim <- mHMM(s_data = data_count$obs,
                          data_distr = 'count',
                          gen = list(m = m, n_dep = n_dep),
                          start_val = c(list(start_gamma), start_emiss),
                          emiss_hyp_prior = manual_prior_emiss,
                          mcmc = list(J = 11, burn_in = 5),
                          show_progress = TRUE)

summary(out_3st_count_sim)

[Package mHMMbayes version 1.1.0 Index]