R: Simulate data from a hidden generalised linear Markov model.

reglhmm {eglhmm}

R Documentation

Simulate data from a hidden generalised linear Markov model.

Description

Takes a specification of the model and simulates data from that model. The model may be specified in terms of its individual components (the default method). These components include a data frame that provides the predictor variables, and various parameters of the model. For the "eglhmm" method the model is specified as a fitted model, an object of class "eglhmm".

Usage

reglhmm(x,...)
## Default S3 method:
reglhmm(x, formula, response, cells=NULL, data=NULL, nobs=NULL,
                         distr=c("Gaussian","Poisson","Binomial","Dbd","Multinom"),
                         phi, Rho, sigma, size, ispd=NULL, ntop=NULL, zeta=NULL,
                         missFrac = 0, fep=NULL,
                         contrast=c("treatment","sum","helmert"),...)
## S3 method for class 'eglhmm'
reglhmm(x, missFrac = NULL, ...)

Arguments

`x`	For the default method, the transition probability matrix of the hidden Markov chain. For the `"eglhmm"` method, an object of class `"eglhmm"` as returned by the function `eglhmm()`.
`formula`	The formula specifying the generalised linear model from which data are to be simulated. The predictor variables in this formula must include a factor `state`, which specifies the state of the hidden Markov chain. The formula must also determine a design matrix whose number of columns equals the length of the coefficient vector `phi` (and the length of `psi` in the case of the Gaussian distribution); otherwise an error is thrown. It is advisable to specify the formula in the form `y~0+state+...`, where `...` represents the predictors in the model other than `state`. Of course `phi` must be supplied in a manner that is consistent with this structure.
`response`	A character vector of length 2, specifying the names of the responses. Ignored unless `distr` is `"Multinom"`. If `distr` is `"Multinom"` and if `response` is provided appropriately, then the simulated data are bivariate multinomial.
`cells`	A character vector specifying the names of the factors which determine the “cells” of the model. These factors must be columns of the data frame `data`. (See below.) Each cell corresponds to a time series of (simulated) observations. If `cells` is not supplied (left equal to `NULL`) then the model is taken to have a single cell, i.e. data from a “simple” hidden Markov model is generated. The parameters of that model may be time-varying, and still depend on the predictors specified by `formula`.
`data`	A data frame containing the predictor variables referred to by `formula`, i.e. the predictors for the model from which data are to be simulated. If `data` is not specified, then `nobs` (see below) must be. If `data` is not specified then `formula` must have the structure `y ~ state` or preferably `y ~ 0 + state`. Of course `phi` must be specified in a consistent manner.
`nobs`	Integer scalar. The number of observations to be generated in the setting in which the generalised linear model in question is vacuous. Ignored if `data` is supplied.
`distr`	Character string specifying the distribution of the “emissions” from the model, i.e., of the observations. This distribution determines “emission probabilities”.
`phi`	A numeric vector specifying the coefficients of the linear predictor of the generalised linear model. The length of `phi` must be equal to the number of columns of the design matrix determined by `formula` and `data`. The entries of `phi` must match up appropriately with the columns of the design matrix.
`Rho`	A matrix, a list of two matrices, or a three-dimensional array specifying the emission probabilities for a multinomial distribution. Ignored unless `distr` is `"Multinom"`.
`sigma`	A numeric vector of length equal to the number of states. Its `i`th entry is the standard deviation of the (Gaussian) distribution corresponding to the `i`th state. Ignored unless `distr` is `"Gaussian"`.
`size`	Integer scalar. The number of trials (sample size) from which the number of “successes” are counted, in the context of the binomial distribution. (I.e. the `size` parameter of `rbinom()`.) Ignored unless `distr` is `"Binomial"`.
`ispd`	An optional numeric vector specifying the initial state probability distribution of the model. If `ispd` is not provided then it is taken to be the stationary/steady state distribution determined by the transition probability matrix `x`. If specified, `ispd` must be a probability vector of length equal to the number of rows (equivalently the number of columns) of `x`.
`ntop`	Integer scalar, strictly greater than 1. The maximum possible value of the db distribution. See `db()`. Used only if `distr` is `"Dbd"`.
`zeta`	Logical scalar. Should zero origin indexing be used? I.e. should the range of values of the db distribution be taken to be `{0,1,2,...,ntop}` rather than `{1,2,...,ntop}`? Used only if `distr` is `"Dbd"`.
`missFrac`	A non-negative scalar, less than 1. Data will be randomly set equal to `NA` with probability `missFrac`. Note that for the `"eglhmm"` method, if `missFrac` is not supplied then it is extracted from `x`.
`fep`	A list of length 1 or 2. The first entry of this list is a logical scalar. If this is `TRUE`, then the first entry of the simulated emissions (or at least one entry of the first pair of simulated emissions) is forced to be “present”, i.e. non-missing. The second entry of `fep`, if present, is a numeric scalar between 0 and 1 (i.e. a probability). It is equal to the probability that both entries of the first pair of emissions are present. It is ignored if the emissions are univariate. If the emissions are bivariate but the second entry of `fep` is not provided, then this second entry defaults to the “overall” probability that both entries of a pair of emissions are present, given that at least one is present. This probability is calculated from `nafrac`.
`contrast`	A character string, one of “treatment”, “sum” or “helmert”, specifying what contrast (for unordered factors) to use in constructing the design matrix. (The contrast for ordered factors, which has no relevance in this context, is left at its default value of `"contr.poly"`.) Note that the meaning of the coefficient vector `phi` depends on the contrast specified, so make sure that the contrast is the same as the one you had in mind when you specified `phi`. Note that for the `"eglhmm"` method, `contrast` is extracted from `x`.
`...`	Not used.
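
The requirement that the design matrix have as many columns as `phi` has entries can be checked in advance with base R. The following is a minimal sketch only: the data frame `dat` and the factor `site` are hypothetical, and it assumes (without confirmation from the package source) that the design matrix is built in the manner of `model.matrix()`, with each observation paired with each state.

```r
# Hypothetical predictors: two hidden states and a three-level factor `site`.
# Assumption: every predictor combination is paired with every state.
dat <- expand.grid(state = factor(1:2), site = factor(c("a", "b", "c")))

# No intercept, so `state` gets one indicator column per state; `site`
# is coded with sum-to-zero contrasts (compare contrast = "sum").
X <- model.matrix(~ 0 + state + site, data = dat,
                  contrasts.arg = list(site = "contr.sum"))

# Two state effects followed by two `site` contrasts: four coefficients.
phi <- c(0, log(5), -0.3, 0.1)
stopifnot(ncol(X) == length(phi))

# Linear predictor for every (state, site) combination.
eta <- drop(X %*% phi)
```

Changing the contrast changes the columns of `X`, and hence the interpretation of the trailing entries of `phi`, which is why the contrast must agree with the one used when `phi` was chosen.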

Value

A data frame with the same columns as those of `data` and an added column, whose name is determined from `formula`, containing the simulated response.

Remark

Although this documentation refers to “generalised linear models”, the only such models currently (13/02/2024) available are the Gaussian model with the identity link, the Poisson model with the log link, and the Binomial model with the logit link. The Multinomial model, which is also available, is not exactly a generalised linear model; it might be thought of as an “extended” generalised linear model. Other models may be added at a future date.
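
As an illustration of the three links named above, the following base R sketch maps a value of the linear predictor to the corresponding mean (or success probability). The list `inv_link` is a hypothetical helper written for this illustration; it is not part of the package.

```r
# Inverse link functions for the three families named in the Remark.
# `inv_link` is a hypothetical helper, not a package object.
inv_link <- list(
  Gaussian = function(eta) eta,          # identity link
  Poisson  = function(eta) exp(eta),     # log link
  Binomial = function(eta) plogis(eta)   # logit link
)

eta     <- log(5)
mu_pois <- inv_link$Poisson(eta)   # Poisson mean for linear predictor log(5)
p_binom <- inv_link$Binomial(0)    # success probability for linear predictor 0
```

This is why, under the log link, a state coefficient of `log(5)` corresponds to a Poisson mean of 5.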

Author(s)

Rolf Turner rolfturner@posteo.net

References

T. Rolf Turner, Murray A. Cameron, and Peter J. Thomson (1998). Hidden Markov chains in generalized linear models. Canadian Journal of Statistics 26, pp. 107 – 125, DOI: https://doi.org/10.2307/3315677.

Rolf Turner (2008). Direct maximization of the likelihood of a hidden Markov model. Computational Statistics and Data Analysis 52, pp. 4147 – 4160, DOI: https://doi.org/10.1016/j.csda.2008.01.029

Examples

    loc4 <- c("LngRf","BondiE","BondiOff","MlbrOff")
    SCC4 <- SydColCount[SydColCount$locn %in% loc4,] 
    SCC4$locn <- factor(SCC4$locn) # Get rid of unused levels.
    rownames(SCC4) <- 1:nrow(SCC4)
    Tpm   <- matrix(c(0.91,0.09,0.36,0.64),byrow=TRUE,ncol=2)
    Phi   <- c(0,log(5),-0.34,0.03,-0.32,0.14,-0.05,-0.14)
    # The "state effects" are 1 and 5.
    Dat   <- SCC4[,1:3]
    fmla  <- y~0+state+locn+depth
    cells <- c("locn","depth")
# The default method.
    X     <- reglhmm(Tpm,formula=fmla,cells=cells,data=Dat,distr="P",phi=Phi,
                    missFrac=0.75,contrast="sum")
# The "eglhmm" method.
    fit <- eglhmm(y~locn+depth,data=SCC4,cells=cells,K=2,
                 verb=TRUE,distr="P")
    Y   <- reglhmm(fit)
# Vacuous generalised linear model.
    Z   <- reglhmm(Tpm,formula=y~0+state,nobs=300,distr="P",phi=log(c(2,7)))
    # The "state effects" are 2 and 7.
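
For orientation, the vacuous Poisson case above can be imitated in base R. This is a rough sketch of the idea under stated assumptions, not the package's implementation: the seed and the value of `missFrac` are arbitrary, and the names `ispd`, `state` and `y` are used only for illustration. The initial state is drawn from the stationary distribution of `Tpm`, as described for the `ispd` argument.

```r
set.seed(42)  # arbitrary seed, for reproducibility of the sketch only
Tpm  <- matrix(c(0.91, 0.09, 0.36, 0.64), byrow = TRUE, ncol = 2)
phi  <- log(c(2, 7))   # state effects 2 and 7 under the log link
nobs <- 300
K    <- nrow(Tpm)

# Stationary distribution: solve ispd %*% Tpm = ispd with sum(ispd) = 1.
A      <- t(diag(K) - Tpm)
A[K, ] <- 1                       # replace one equation by the sum constraint
ispd   <- solve(A, c(rep(0, K - 1), 1))

# Simulate the hidden Markov chain.
state    <- integer(nobs)
state[1] <- sample.int(K, 1, prob = ispd)
for (i in 2:nobs) {
  state[i] <- sample.int(K, 1, prob = Tpm[state[i - 1], ])
}

# Poisson emissions whose mean depends on the current state.
y <- rpois(nobs, lambda = exp(phi[state]))

# Set each observation to NA independently with probability missFrac.
missFrac <- 0.1                   # hypothetical value
y[runif(nobs) < missFrac] <- NA
```
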

[Package eglhmm version 0.1-3 Index]