sim_mHMM {mHMMbayes} | R Documentation |
Simulate data using a multilevel hidden Markov model
Description
sim_mHMM
simulates data for multiple subjects, for which the data have
either categorical or continuous (i.e., normally distributed)
observations that follow a hidden Markov model (HMM) with a
multilevel structure. The multilevel structure implies that each subject is
allowed to have its own set of parameters, and that the parameters at the
subject level (level 1) are tied together by a population distribution at
level 2 for each of the corresponding parameters. The shape of the population
distribution for each of the parameters is a normal distribution. In addition
to (natural and/or unexplained) heterogeneity between subjects, the subjects
parameters can also depend on a covariate.
Usage
sim_mHMM(
n_t,
n,
data_distr = "categorical",
gen,
gamma,
emiss_distr,
start_state = NULL,
xx_vec = NULL,
beta = NULL,
var_gamma = 0.1,
var_emiss = NULL,
return_ind_par = FALSE,
m,
n_dep,
q_emiss,
log_scale = FALSE
)
Arguments
n_t |
Numeric vector with length 1 denoting the length of the observed
sequence to be simulated for each subject. To only simulate subject
specific transition probability matrices gamma and emission distributions
(and no data), set |
n |
Numeric vector with length 1 denoting the number of subjects for which data is simulated. |
data_distr |
A character vector of length 1 denoting the distribution adopted for the data given the hidden states. It can take the values 'categorical', 'continuous', or 'count', standing for the for categorical observations following a Multinomial logit, continuous observations following a normal distribution, and count observations following a Poisson distribution, correspondingly. |
gen |
List containing the following elements denoting the general model properties:
|
gamma |
A matrix with |
emiss_distr |
A list with |
start_state |
Optional numeric vector with length 1 denoting in which state the simulated state sequence should start. If left unspecified, the simulated state for time point 1 is sampled from the initial state distribution (which is derived from the transition probability matrix gamma). |
xx_vec |
List of 1 + |
beta |
List of 1 + Note that if |
var_gamma |
Either a numeric vector with length 1 or a matrix of
( |
var_emiss |
Either a numeric vector with length For categorical data, this value corresponds to the variance of the
parameters of the Multinomial distribution (i.e., the intercepts of the
regression equation of the Multinomial distribution used to sample the
components of the emission distribution), see details below. For
continuous data, this value corresponds to the variance in the mean of the
emission distribution(s) across subjects. For count data, it corresponds
to the variance in the logmean of the emission distribution(s) across
subjects and by should be specified in the natural scale (see argument
|
return_ind_par |
A logical scalar. Should the subject specific
transition probability matrix |
m |
The argument |
n_dep |
The argument |
q_emiss |
The argument |
log_scale |
A logical scalar. If |
Details
In simulating the data, having a multilevel structure means that the parameters for each subject are sampled from the population level distribution of the corresponding parameter. The user specifies the population distribution for each parameter: the average population transition probability matrix and its variance, and the average population emission distribution and its variance. For now, the variance of the mean population parameters is assumed fixed for all components of the transition probability matrix and for all components of the emission distribution.
One can simulate multivariate data. That is, the hidden states depend on more than 1 observed variable simultaneously. The distributions of multiple dependent variables for multivariate data are assumed to be independent, and all distributions for one dataset have to be of the same type (either categorical or continuous).
Note that the subject specific initial state distributions (i.e., the probability of each of the states at the first time point) needed to simulate the data are obtained from the stationary distributions of the subject specific transition probability matrices gamma.
beta
: As the first element in each row of gamma
is used as
reference category in the Multinomial logistic regression, the first matrix
in the list beta
used to predict transition probability matrix
gamma
has a number of rows equal to m
and the number of columns
equal to m
- 1. The first element in the first row corresponds to the
probability of switching from state one to state two. The second element in
the first row corresponds to the probability of switching from state one to
state three, and so on. The last element in the first row corresponds to the
probability of switching from state one to the last state. The same principle
holds for the second matrix in the list beta
used to predict
categorical emission distribution(s) emiss_distr
: the first element in
the first row corresponds to the probability of observing category two in
state one. The second element in the first row corresponds to the probability
of observing category three is state one, and so on. The last element in the
first row corresponds to the probability of observing the last category in
state one.
Note that when simulating count data (data_distr = 'count'
), by
default the emission distribution parameters emiss_distr
and
var_emiss
are specified in the natural (positive real numbers) scale
(as opposite to the logarithmic scale). If the user wants to manually
specify these values on the logarithmic scale, please set the argument
log_scale = TRUE
. Also note that if covariates are used to predict a
count emission distribution, then the logarithmic scale should be used for
the inputs emiss_distr
, beta
, and var_emiss
, and also
set log_scale = TRUE
.
Value
The following components are returned by the function sim_mHMM
:
states
A matrix containing the simulated hidden state sequences, with one row per hidden state per subject. The first column indicates subject id number. The second column contains the simulated hidden state sequence, consecutively for all subjects. Hence, the id number is repeated over the rows (with the number of repeats equal to the length of the simulated hidden state sequence
T
for each subject).obs
A matrix containing the simulated observed outputs, with one row per simulated observation per subject. The first column indicates subject id number. The second column contains the simulated observation sequence, consecutively for all subjects. Hence, the id number is repeated over rows (with the number of repeats equal to the length of the simulated observation sequence
T
for each subject).gamma
A list containing
n
elements with the simulated subject specific transition probability matricesgamma
. Only returned ifreturn_ind_par
is set toTRUE
.emiss_distr
A list containing
n
elements with the simulated subject specific emission probability matricesemiss_distr
. Only returned ifreturn_ind_par
is set toTRUE
.
See Also
mHMM
for analyzing multilevel hidden Markov data.
Examples
## Examples on univariate categorical data
# simulating data for 10 subjects with each 100 categorical observations
n_t <- 100
n <- 10
m <- 3
n_dep <- 1
q_emiss <- 4
gamma <- matrix(c(0.8, 0.1, 0.1,
0.2, 0.7, 0.1,
0.2, 0.2, 0.6), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(0.5, 0.5, 0.0, 0.0,
0.1, 0.1, 0.8, 0.0,
0.0, 0.0, 0.1, 0.9), nrow = m, ncol = q_emiss, byrow = TRUE))
data1 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
gamma = gamma, emiss_distr = emiss_distr, var_gamma = 1, var_emiss = 1)
head(data1$obs)
head(data1$states)
# including a covariate to predict (only) the transition probability matrix gamma
beta <- rep(list(NULL), 2)
beta[[1]] <- matrix(c(0.5, 1.0,
-0.5, 0.5,
0.0, 1.0), byrow = TRUE, ncol = 2)
xx_vec <- rep(list(NULL),2)
xx_vec[[1]] <- c(rep(0,5), rep(1,5))
data2 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
gamma = gamma, emiss_distr = emiss_distr, beta = beta, xx_vec = xx_vec,
var_gamma = 1, var_emiss = 1)
# simulating subject specific transition probability matrices and emission distributions only
n_t <- 0
n <- 5
m <- 3
n_dep <- 1
q_emiss <- 4
gamma <- matrix(c(0.8, 0.1, 0.1,
0.2, 0.7, 0.1,
0.2, 0.2, 0.6), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(0.5, 0.5, 0.0, 0.0,
0.1, 0.1, 0.8, 0.0,
0.0, 0.0, 0.1, 0.9), nrow = m, ncol = q_emiss, byrow = TRUE))
data3 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
gamma = gamma, emiss_distr = emiss_distr, var_gamma = 1, var_emiss = 1)
data3
data4 <- sim_mHMM(n_t = n_t, n = n, gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
gamma = gamma, emiss_distr = emiss_distr, var_gamma = .5, var_emiss = .5)
data4
## Example on multivariate continuous data
# simulating multivariate continuous data
n_t <- 100
n <- 10
m <- 3
n_dep <- 2
gamma <- matrix(c(0.8, 0.1, 0.1,
0.2, 0.7, 0.1,
0.2, 0.2, 0.6), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c( 50, 10,
100, 10,
150, 10), nrow = m, byrow = TRUE),
matrix(c(5, 2,
10, 5,
20, 3), nrow = m, byrow = TRUE))
data_cont <- sim_mHMM(n_t = n_t, n = n, data_distr = 'continuous',
gen = list(m = m, n_dep = n_dep),
gamma = gamma, emiss_distr = emiss_distr,
var_gamma = .5, var_emiss = c(5^2, 0.2^2))
head(data_cont$states)
head(data_cont$obs)
## Example on multivariate count data without covariates
n_t <- 200 # Number of observations on the dependent variable
m <- 3 # Number of hidden states
n_dep <- 2 # Number of dependent variables
n_subj <- 30 # Number of subjects
gamma <- matrix(c(0.9, 0.05, 0.05,
0.2, 0.7, 0.1,
0.2,0.3, 0.5), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(20,
10,
5), nrow = m, byrow = TRUE),
matrix(c(50,
3,
20), nrow = m, byrow = TRUE))
# Define between subject variance to use on the simulating function:
# here, the variance is varied over states within the dependent variable.
var_emiss <- list(matrix(c(5.0, 3.0, 1.5), nrow = m),
matrix(c(5.0, 5.0, 5.0), nrow = m))
# Simulate count data:
data_count <- sim_mHMM(n_t = n_t,
n = n_subj,
data_distr = "count",
gen = list(m = m, n_dep = n_dep),
gamma = gamma,
emiss_distr = emiss_distr,
var_gamma = 0.1,
var_emiss = var_emiss,
return_ind_par = TRUE)
## Example on multivariate count data with covariates
# Simulate data with one covariate for each count dependent variable
n_t <- 200 # Number of observations on the dependent variable
m <- 3 # Number of hidden states
n_dep <- 3 # Number of dependent variables
n_subj <- 30 # Number of subjects
gamma <- matrix(c(0.9, 0.05, 0.05,
0.2, 0.7, 0.1,
0.2,0.3, 0.5), ncol = m, byrow = TRUE)
emiss_distr <- list(matrix(c(20,
10,
5), nrow = m, byrow = TRUE),
matrix(c(15,
2,
5), nrow = m, byrow = TRUE),
matrix(c(50,
3,
20), nrow = m, byrow = TRUE))
# Since we have covariates, we specify inputs in the log scale:
emiss_distr_log <- lapply(emiss_distr, function(e) log(e))
# Define list of vectors of covariate values
set.seed(42)
xx_vec <- c(list(NULL),rep(list(rnorm(n_subj,mean = 0, sd = 0.1)),3))
# Define object beta with regression coefficients for the three dependent variables
beta <- rep(list(NULL), n_dep+1)
beta[[2]] <- matrix(c(1,-1,0), byrow = TRUE, ncol = 1)
beta[[3]] <- matrix(c(2,0,-2), byrow = TRUE, ncol = 1)
beta[[4]] <- matrix(c(-1,3,1), byrow = TRUE, ncol = 1)
# Calculate logvar to use on the simulating function:
logvar <- var_to_logvar(gen = list(m = m, n_dep = n_dep),
emiss_mu = emiss_distr,
var_emiss = list(rep(4, m),
rep(2, m),
rep(5, m)),
byrow = FALSE)
# Put logvar in the right format:
logvar <- lapply(logvar, function(q) matrix(q, nrow = m))
# Simulate count data:
data_count <- sim_mHMM(n_t = n_t,
n = n_subj,
data_distr = "count",
gen = list(m = m, n_dep = n_dep),
gamma = gamma,
emiss_distr = emiss_distr_log,
xx_vec = xx_vec,
beta = beta,
var_gamma = 0.1,
var_emiss = logvar,
return_ind_par = TRUE,
log_scale = TRUE)
head(data_count$states)
head(data_count$obs)