R: Simulate data using a multilevel hidden Markov model

sim_mHMM {mHMMbayes}

R Documentation

Simulate data using a multilevel hidden Markov model

Description

sim_mHMM simulates data for multiple subjects, for which the data have either categorical or continuous (i.e., normally distributed) observations that follow a hidden Markov model (HMM) with a multilevel structure. The multilevel structure implies that each subject is allowed to have its own set of parameters, and that the parameters at the subject level (level 1) are tied together by a population distribution at level 2 for each of the corresponding parameters. The shape of the population distribution for each of the parameters is a normal distribution. In addition to (natural and/or unexplained) heterogeneity between subjects, the subjects parameters can also depend on a covariate.

Usage

sim_mHMM(
  n_t,
  n,
  data_distr = "categorical",
  gen,
  gamma,
  emiss_distr,
  start_state = NULL,
  xx_vec = NULL,
  beta = NULL,
  var_gamma = 0.1,
  var_emiss = NULL,
  return_ind_par = FALSE,
  m,
  n_dep,
  q_emiss,
  log_scale = FALSE
)

Arguments

`n_t`	Numeric vector with length 1 denoting the length of the observed sequence to be simulated for each subject. To only simulate subject specific transition probability matrices gamma and emission distributions (and no data), set `t` to 0.
`n`	Numeric vector with length 1 denoting the number of subjects for which data is simulated.
`data_distr`	A character vector of length 1 denoting the distribution adopted for the data given the hidden states. It can take the values 'categorical', 'continuous', or 'count', standing for the for categorical observations following a Multinomial logit, continuous observations following a normal distribution, and count observations following a Poisson distribution, correspondingly.
`gen`	List containing the following elements denoting the general model properties: `m`: numeric vector with length 1 denoting the number of hidden states `n_dep`: numeric vector with length 1 denoting the number of dependent variables `q_emiss`: only to be specified if the data represents categorical data. Numeric vector with length `n_dep` denoting the number of observed categories for the categorical emission distribution for each of the dependent variables.
`gamma`	A matrix with `m` rows and `m` columns containing the average population transition probability matrix used for simulating the data. That is, the probability to switch from hidden state i (row i) to hidden state j (column j).
`emiss_distr`	A list with `n_dep` elements containing the average population emission distribution(s) of the observations given the hidden states for each of the dependent variables. If `data_distr = 'categorical'`, each element is a matrix with `m` rows and `q_emiss[k]` columns for each of the `k` in `n_dep` emission distribution(s). That is, the probability of observing category q (column q) in state i (row i). If `data_distr = 'continuous'`, each element is a matrix with `m` rows and 2 columns; the first column denoting the mean of state i (row i) and the second column denoting the standard deviation of state i (row i) of the Normal distribution. If `data_distr = 'count'`, each element is a matrix with `m` rows and 1 column; the only column denoting the logmean of state i (row i) of the lognormal distribution used as prior for the Poisson emissions (note: by default the logmeans should be specified in the natural scale; see argument `log_scale` below, unless covariates are used to predict a count emission distribution).
`start_state`	Optional numeric vector with length 1 denoting in which state the simulated state sequence should start. If left unspecified, the simulated state for time point 1 is sampled from the initial state distribution (which is derived from the transition probability matrix gamma).
`xx_vec`	List of 1 + `n_dep` vectors containing the covariate(s) to predict the transition probability matrix `gamma` and/or (specific) emission distribution(s) `emiss_distr` using the regression parameters specified in `beta` (see below). The first element in the list `xx_vec` is used to predict the transition matrix. Subsequent elements in the list are used to predict the emission distribution of (each of) the dependent variable(s). This means that the covariate used to predict `gamma` and `emiss_distr` can either be the same covariate, different covariates, or a covariate for certain elements and none for the other. At this point, it is only possible to use one covariate for both `gamma` and `emiss_distr`. For all elements in the list, the number of observations in the vectors should be equal to the number of subjects to be simulated `n`. If `xx_vec` is omitted completely, `xx_vec` defaults to NULL, resembling no covariates at all. Specific elements in the list can also be left empty (i.e., set to `NULL`) to signify that either the transition probability matrix or (one of) the emission distribution(s) is not predicted by covariates.
`beta`	List of 1 + `n_dep` matrices containing the regression parameters to predict `gamma` and/or `emiss_distr` in combination with `xx_vec` using (Multinomial logistic) regression. The first matrix is used to predict the transition probability matrix `gamma`. The subsequent matrices are used to predict the emission distribution(s) `emiss_distr` of the dependent variable(s). For `gamma` and categorical emission distributions, one regression parameter is specified for each element in `gamma` and `emiss_distr`, with the following exception. The first element in each row of `gamma` and/or `emiss_distr` is used as reference category in the Multinomial logistic regression. As such, no regression parameters can be specified for these parameters. Hence, the first element in the list `beta` to predict `gamma` consist of a matrix with the number of rows equal to `m` and the number of columns equal to `m` - 1. For categorical emission distributions, the subsequent elements in the list `beta` to predict `emiss_distr` consist of a matrix with the number of rows equal to `m` and the number of columns equal to `q_emiss[k]` - 1 for each of the `k` in `n_dep` emission distribution(s). See details for more information. For continuous and count emission distributions, the subsequent elements in the list `beta` consist of a matrix with the number of rows equal to `m` and 1 column. Note that if `beta` is specified, `xx_vec` has to be specified as well. If `beta` is omitted completely, `beta` defaults to NULL, resembling no prediction of `gamma` and `emiss_distr` using covariates. One of the elements in the list can also be left empty (i.e., set to `NULL`) to signify that either the transition probability matrix or a specific emission distribution is not predicted by covariates. If covariates are used to predict a count emission distribution (`data_distr = 'count'`), then the logarithmic scale should be used for the inputs `emiss_distr` and `var_emiss` in addition to `beta`, and also set `log_scale = TRUE`.
`var_gamma`	Either a numeric vector with length 1 or a matrix of (`m` by `m` - 1) elements denoting the amount of variance between subjects in the transition probability matrix. Note that the value(s) correspond to the variance of the parameters of the Multinomial distribution (i.e., the intercepts of the regression equation of the Multinomial distribution used to sample the transition probability matrix), see details below. Also note that if only one variance value is provided, it will be adopted for the complete transition probability matrix, hence the variance is assumed fixed across all components. The default equals 0.1, which corresponds to little variation between subjects. If one wants to simulate data from exactly the same HMM for all subjects, var_gamma should be set to 0. Note that if data for only 1 subject is simulated (i.e., n = 1), `var_gamma` is set to 0.
`var_emiss`	Either a numeric vector with length `n_dep` or a list of `n_dep` matrices denoting the amount of variance between subjects in the emission distribution(s). When opting for a list of matrices: if `data_distr = 'categorical'`, each element of the list is a matrix with `m` rows and `q_emiss[k] - 1` columns for each of the `k` in `n_dep` emission distribution(s) between subject variances; if `data_distr = 'continuous'`, each element is a matrix with `m` rows and 1 column; the single column denoting the variance between the subjects' means of state i (row i) of the normal distribution used as prior for the Normal emissions. If `data_distr = 'count'`, each element is a matrix with `m` rows and 1 column; the single column denoting the logvariance between the subjects' logmeans of state i (row i) of the lognormal distribution used as prior for the Poisson emissions (note: by default the logvariances should be specified in the natural scale; see argument `log_scale` below). For categorical data, this value corresponds to the variance of the parameters of the Multinomial distribution (i.e., the intercepts of the regression equation of the Multinomial distribution used to sample the components of the emission distribution), see details below. For continuous data, this value corresponds to the variance in the mean of the emission distribution(s) across subjects. For count data, it corresponds to the variance in the logmean of the emission distribution(s) across subjects and by should be specified in the natural scale (see argument `log_scale` below). Note that if only one variance value provided for each emission distribution, the variance is assumed fixed across states (and, for the categorical distribution, categories within a state) within an emission distribution. The default equals 0.1, which corresponds to little variation between subjects given categorical observations. If one wants to simulate data from exactly the same HMM for all subjects, var_emiss should be set to a vector of 0's. Note that if data for only 1 subject is simulated (i.e., n = 1), `var_emiss` is set to a vector of 0's.
`return_ind_par`	A logical scalar. Should the subject specific transition probability matrix `gamma` and emission probability matrix `emiss_distr` be returned by the function (`return_ind_par = TRUE`) or not (`return_ind_par = FALSE`). The default equals `return_ind_par = FALSE`.
`m`	The argument `m` is deprecated; please specify using the input parameter `gen`.
`n_dep`	The argument `n_dep` is deprecated; please specify using the input parameter `gen`.
`q_emiss`	The argument `q_emiss` is deprecated; please specify using the input parameter `gen` (only to be specified when simulating categorical data).
`log_scale`	A logical scalar. If `data_distr = 'count'`, should `emiss_distr` (i.e., the sample average logmeans) and `var_emiss` (i.e., the between subject logvariances) of the lognormal prior adopted for the count be specified in the logarithmic scale (`log_scale = TRUE`) or the natural (i.e., real positive numbers) scale (`log_scale = FALSE`). The default equals `log_scale = FALSE`. Note that if covariates `beta` are used to predict the count emission distribution, then the logarithmic scale should be used for the inputs `emiss_distr`, `beta`, and `var_emiss`, and also set `log_scale = TRUE`.

Details

In simulating the data, having a multilevel structure means that the parameters for each subject are sampled from the population level distribution of the corresponding parameter. The user specifies the population distribution for each parameter: the average population transition probability matrix and its variance, and the average population emission distribution and its variance. For now, the variance of the mean population parameters is assumed fixed for all components of the transition probability matrix and for all components of the emission distribution.

One can simulate multivariate data. That is, the hidden states depend on more than 1 observed variable simultaneously. The distributions of multiple dependent variables for multivariate data are assumed to be independent, and all distributions for one dataset have to be of the same type (either categorical or continuous).

Note that the subject specific initial state distributions (i.e., the probability of each of the states at the first time point) needed to simulate the data are obtained from the stationary distributions of the subject specific transition probability matrices gamma.

beta: As the first element in each row of gamma is used as reference category in the Multinomial logistic regression, the first matrix in the list beta used to predict transition probability matrix gamma has a number of rows equal to m and the number of columns equal to m - 1. The first element in the first row corresponds to the probability of switching from state one to state two. The second element in the first row corresponds to the probability of switching from state one to state three, and so on. The last element in the first row corresponds to the probability of switching from state one to the last state. The same principle holds for the second matrix in the list beta used to predict categorical emission distribution(s) emiss_distr: the first element in the first row corresponds to the probability of observing category two in state one. The second element in the first row corresponds to the probability of observing category three is state one, and so on. The last element in the first row corresponds to the probability of observing the last category in state one.

Note that when simulating count data (data_distr = 'count'), by default the emission distribution parameters emiss_distr and var_emiss are specified in the natural (positive real numbers) scale (as opposite to the logarithmic scale). If the user wants to manually specify these values on the logarithmic scale, please set the argument log_scale = TRUE. Also note that if covariates are used to predict a count emission distribution, then the logarithmic scale should be used for the inputs emiss_distr, beta, and var_emiss, and also set log_scale = TRUE.

Value

The following components are returned by the function sim_mHMM:

states: A matrix containing the simulated hidden state sequences, with one row per hidden state per subject. The first column indicates subject id number. The second column contains the simulated hidden state sequence, consecutively for all subjects. Hence, the id number is repeated over the rows (with the number of repeats equal to the length of the simulated hidden state sequence T for each subject).
obs: A matrix containing the simulated observed outputs, with one row per simulated observation per subject. The first column indicates subject id number. The second column contains the simulated observation sequence, consecutively for all subjects. Hence, the id number is repeated over rows (with the number of repeats equal to the length of the simulated observation sequence T for each subject).
gamma: A list containing n elements with the simulated subject specific transition probability matrices gamma. Only returned if return_ind_par is set to TRUE.
emiss_distr: A list containing n elements with the simulated subject specific emission probability matrices emiss_distr. Only returned if return_ind_par is set to TRUE.

Examples


## Examples on univariate categorical data
# simulating data for 10 subjects with each 100 categorical observations
n_t     <- 100
n       <- 10
m       <- 3
n_dep   <- 1
q_emiss <- 4
gamma   <- matrix(c(0.8, 0.1, 0.1,
                    0.2, 0.7, 0.1,
                    0.2, 0.2, 0.6), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(0.5, 0.5, 0.0, 0.0,
                             0.1, 0.1, 0.8, 0.0,
                             0.0, 0.0, 0.1, 0.9), nrow = m, ncol = q_emiss, byrow = TRUE))
data1 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, var_gamma = 1, var_emiss = 1)
head(data1$obs)
head(data1$states)

# including a covariate to predict (only) the transition probability matrix gamma
beta      <- rep(list(NULL), 2)
beta[[1]] <- matrix(c(0.5, 1.0,
                     -0.5, 0.5,
                      0.0, 1.0), byrow = TRUE, ncol = 2)
xx_vec      <- rep(list(NULL),2)
xx_vec[[1]] <-  c(rep(0,5), rep(1,5))
data2 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, beta = beta, xx_vec = xx_vec,
                  var_gamma = 1, var_emiss = 1)


# simulating subject specific transition probability matrices and emission distributions only
n_t <- 0
n <- 5
m <- 3
n_dep   <- 1
q_emiss <- 4
gamma <- matrix(c(0.8, 0.1, 0.1,
                  0.2, 0.7, 0.1,
                  0.2, 0.2, 0.6), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(0.5, 0.5, 0.0, 0.0,
                             0.1, 0.1, 0.8, 0.0,
                             0.0, 0.0, 0.1, 0.9), nrow = m, ncol = q_emiss, byrow = TRUE))
data3 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, var_gamma = 1, var_emiss = 1)
data3

data4 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  gamma = gamma, emiss_distr = emiss_distr, var_gamma = .5, var_emiss = .5)
data4


## Example on multivariate continuous data
# simulating multivariate continuous data
n_t     <- 100
n       <- 10
m       <- 3
n_dep   <- 2

gamma   <- matrix(c(0.8, 0.1, 0.1,
                    0.2, 0.7, 0.1,
                    0.2, 0.2, 0.6), ncol = m, byrow = TRUE)

emiss_distr <- list(matrix(c( 50, 10,
                              100, 10,
                              150, 10), nrow = m, byrow = TRUE),
                    matrix(c(5, 2,
                             10, 5,
                             20, 3), nrow = m, byrow = TRUE))

data_cont <- sim_mHMM(n_t = n_t, n = n, data_distr = 'continuous',
                      gen = list(m = m, n_dep = n_dep),
                      gamma = gamma, emiss_distr = emiss_distr,
                      var_gamma = .5, var_emiss = c(5^2, 0.2^2))

head(data_cont$states)
head(data_cont$obs)


## Example on multivariate count data without covariates
n_t     <- 200     # Number of observations on the dependent variable
m       <- 3        # Number of hidden states
n_dep   <- 2        # Number of dependent variables
n_subj  <- 30        # Number of subjects

gamma   <- matrix(c(0.9, 0.05, 0.05,
                    0.2, 0.7, 0.1,
                    0.2,0.3, 0.5), ncol = m, byrow = TRUE)

emiss_distr <- list(matrix(c(20,
                             10,
                             5), nrow = m, byrow = TRUE),
                    matrix(c(50,
                             3,
                             20), nrow = m, byrow = TRUE))

# Define between subject variance to use on the simulating function:
# here, the variance is varied over states within the dependent variable.
var_emiss <- list(matrix(c(5.0, 3.0, 1.5), nrow = m),
                  matrix(c(5.0, 5.0, 5.0), nrow = m))

# Simulate count data:
data_count <- sim_mHMM(n_t = n_t,
                       n = n_subj,
                       data_distr = "count",
                       gen = list(m = m, n_dep = n_dep),
                       gamma = gamma,
                       emiss_distr = emiss_distr,
                       var_gamma = 0.1,
                       var_emiss = var_emiss,
                       return_ind_par = TRUE)


## Example on multivariate count data with covariates
# Simulate data with one covariate for each count dependent variable

n_t     <- 200     # Number of observations on the dependent variable
m       <- 3        # Number of hidden states
n_dep   <- 3        # Number of dependent variables
n_subj  <- 30        # Number of subjects

gamma   <- matrix(c(0.9, 0.05, 0.05,
                    0.2, 0.7, 0.1,
                    0.2,0.3, 0.5), ncol = m, byrow = TRUE)

emiss_distr <- list(matrix(c(20,
                             10,
                             5), nrow = m, byrow = TRUE),
                    matrix(c(15,
                             2,
                             5), nrow = m, byrow = TRUE),
                    matrix(c(50,
                             3,
                             20), nrow = m, byrow = TRUE))

# Since we have covariates, we specify inputs in the log scale:
emiss_distr_log <- lapply(emiss_distr, function(e) log(e))

# Define list of vectors of covariate values
set.seed(42)
xx_vec <- c(list(NULL),rep(list(rnorm(n_subj,mean = 0, sd = 0.1)),3))

# Define object beta with regression coefficients for the three dependent variables
beta      <- rep(list(NULL), n_dep+1)
beta[[2]] <- matrix(c(1,-1,0), byrow = TRUE, ncol = 1)
beta[[3]] <- matrix(c(2,0,-2), byrow = TRUE, ncol = 1)
beta[[4]] <- matrix(c(-1,3,1), byrow = TRUE, ncol = 1)

# Calculate logvar to use on the simulating function:
logvar <- var_to_logvar(gen = list(m = m, n_dep = n_dep),
                        emiss_mu = emiss_distr,
                        var_emiss =  list(rep(4, m),
                                          rep(2, m),
                                          rep(5, m)),
                        byrow = FALSE)

# Put logvar in the right format:
logvar <- lapply(logvar, function(q) matrix(q, nrow = m))

# Simulate count data:
data_count <- sim_mHMM(n_t = n_t,
                       n = n_subj,
                       data_distr = "count",
                       gen = list(m = m, n_dep = n_dep),
                       gamma = gamma,
                       emiss_distr = emiss_distr_log,
                       xx_vec = xx_vec,
                       beta = beta,
                       var_gamma = 0.1,
                       var_emiss = logvar,
                       return_ind_par = TRUE,
                       log_scale = TRUE)

head(data_count$states)
head(data_count$obs)

[Package mHMMbayes version 1.1.0 Index]