R: Fit a hidden Markov model to discrete data.

hmm {hmm.discnp}

R Documentation

Fit a hidden Markov model to discrete data.

Description

Effects a maximum likelihood fit of a hidden Markov model to discrete data where the observations come from one of a number of finite discrete distributions, depending on the (hidden) state of the Markov chain. These distributions (the “emission probabilities”) are specified non-parametrically. The observations may be univariate, independent bivariate, or dependent bivariate. By default this function uses the EM algorithm. In the univariate setting it may alternatively use a “brute force” method.

Usage

hmm(y, yval=NULL, par0=NULL, K=NULL, rand.start=NULL,
    method=c("EM","bf","LM","SD"), hglmethod=c("fortran","oraw","raw"),
    optimiser=c("nlm","optim"), optimMethod=NULL, stationary=cis,
    mixture=FALSE, cis=TRUE, indep=NULL, tolerance=1e-4, digits=NULL,
    verbose=FALSE, itmax=200, crit=c("PCLL","L2","Linf","ABSGRD"),
    X=NULL,keep.y=FALSE, keep.X=keep.y,
    addIntercept=TRUE, lmc=10, hessian=FALSE,...)

Arguments

`y`	A vector or a list of vectors, or one or two column matrix (bivariate setting) or a list of such matrices; missing values are allowed. If `y` is a vector, or list of vectors (of discrete data) these vectors are coerced to one column matrices. The entries of these vectors or matrices may be numeric or character and are assumed to constitute discrete data.
`yval`	A vector (of length `m`, say) of possible values for the data or a list of two such vectors (of lengths `m1` and `m2`, say, one for each of the two variates in the bivariate settings). These vectors default to the sorted unique values of the respective variates as provided in `y`. If `yval` is supplied and any value of `y` does not match some value of `yval`, then an error is thrown. The argument `yval` is provided so as to allow for fitting of models to data in which some of the data values “of interest” were never observed. The estimated emission probabilities of such “never observed” values will of course be zero.
`par0`	An optional (named) list of starting values for the parameters of the model, with components `tpm` (transition probability matrix), optionally `ispd` (initial state probability distribution) and `Rho`. The object `Rho` specifies the probability that the observations take on each of the possible values of the variate or variates, given the state of the hidden Markov chain. See Details. Note that in the case of independent bivariate data `Rho` is a list of two matrices. These matrices may (and in general will) have different row dimensions, but must have identical column dimensions (equal to `K`, the number of states; see below). If the model is stationary (i.e. if `stationary` is `TRUE`) then you should almost surely not specify the `ispd` component of `par0`. If you do specify it, it really only makes sense to specify it to be the stationary distribution determined by `tpm` and this is a waste of time since this is what the code will take `ispd` to be if you leave it unspecified. If `par0` is not specified, starting values are created by the (undocumented) function `init.all()`.
`K`	The number of states in the hidden Markov chain; if `par0` is not specified `K` MUST be; if `par0` is specified, `K` is ignored. Note that `K=1` is acceptable; if `K` is 1 then all observations are treated as being independent and the non-parametric estimate of the distribution of the observations is calculated in the “obvious” way.
`rand.start`	Either a logical scalar or a list consisting of two logical scalars which must be named `tpm` and `Rho`. If the former, it is converted internally into a list with entries named `tpm` and `Rho`, both having the same value as the original argument. If `tpm` is TRUE then the function init.all() chooses entries for the starting value of `tpm` at random; likewise for `Rho`. If left `NULL`, this argument defaults to `list(tpm=FALSE,Rho=FALSE)`.
`method`	Character string, either `"bf"`, `"EM"`, `"LM"` or `"SD"` (i.e. use numerical maximisation via either `nlm()` or `optim()`, the EM algorithm, the Levenberg-Marquardt algorithm, or the method of steepest descent). May be abbreviated. Currently the `"bf"`, `"LM"` and `"SD"` methods can be used only in the univariate setting, handle only stationary models (see below) and do not do mixtures.
`hglmethod`	Character string; one of `"fortran"`, `"oraw"` or `"raw"`. May be abbreviated. This argument determines the procedure by which the hessian, gradient and log likelihood of the model and data are calculated. If this is argument is equal to `"fortran"` (the default) then (obviously!) dynamically loaded fortran subroutines are used. The other two possibilities effect the calculations in raw R; `"oraw"` (“o” for “original” uses code that is essentially a direct transcription of the fortran code, do-loops being replaced by for-loops. With method `"raw"` the for-loops are eliminated and matrix-vector calculations are applied. The `"oraw"` method is about 25 times slower than the `"fortran"` method and the `"raw"` method is (surprisingly?) even worse; it is more than 30 times slower. The “raw” methods are present mainly for debugging purposes and would not usually be used in practice. This argument is used only if the `method` is `"LM"` or `"SD"` (and is involved only peripherally in the latter instance). It is ignored otherwise.
`optimiser`	Character string specifying the optimiser to use when the “`"bf"`” method of optimisation is chosen. It should be one of `"nlm"` or `"optim"`, and may be abbreviated. Ignored unless `method="bf"`.
`optimMethod`	Character string specifying the optimisation method to be used by `optim()`. Should be one of `"Nelder-Mead"`, `"BFGS"`, `"CG"`, `"L-BFGS-B"`, `"SANN"`, or `"Brent"`. Ignored if the `method` is not `"bf"` or if the optimiser is not `"optim"`.
`stationary`	Logical scalar. If `TRUE` then the model is fitted under the stationarity assumption, i.e. that the Markov chain was in steady state at the time that observations commenced. In this case the initial state probability distribution is estimated as the stationary distribution determined by the (estimated) transition probability matrix. Otherwise if `cis` (see below) is `TRUE` the initial state probability distribution is estimated as the mean of the vectors of conditional probabilities of the states, given the observation sequences, at time `t=1`. If `stationary` is `TRUE` and `cis` is `FALSE` an error is thrown. Currently if the method is `"bf"`, `"LM"` or `"SD"`, and `stationary` is `FALSE`, then an error is thrown.
`mixture`	A logical scalar; if TRUE then a mixture model (all rows of the transition probability matrix are identical) is fitted rather than a general hidden Markov model. Currently an error is thrown if `mixture=TRUE` and the method is `"bf"`, `"LM"` or `"SD"`.
`cis`	A logical scalar specifying whether there should be a constant initial state probability distribution. If `stationary` is `FALSE` and `cis` is `FALSE` then the initial state probability distribution for a given observation sequence is equal to 1 where the (first) maximum of the vector of conditional probabilities of the states, given the observation sequences, at time `t=1`, occurs, and is 0 elsewhere. If `stationary` is `TRUE` and `cis` is `FALSE` an error is given.
`indep`	Logical scalar. Should the bivariate model be fitted under the assumption that the two variables are (conditionally) independent give the state? If this argument is left as `NULL` its value is inferred from the structure of `Rho` in `par0` if the latter is supplied. If the data are bivariate and neither `indep` nor `par0` is supplied, then an error is given. If the data are bivariate and if the value of `indep` is inconsistent with the structure of `par0$Rho` then an error is given. If the data are univariate then `indep` is ignored.
`tolerance`	If the value of the quantity used for the stopping criterion is less than tolerance then the algorithm is considered to have converged. Ignored if `method="bf"`. Defaults to `1e-4`.
`digits`	Integer scalar. The number of digits to which to print out “progress reports” (when `verbose` is `TRUE`). There is a “sensible” default (calculated from `tolerance`). Not used if the method is `"bf"`.
`verbose`	A logical scalar determining whether to print out details of the progress of the algorithm. If the method is `"EM"`, `"LM"` or `"SD"` then when `verbose` is `TRUE` information about the convergence criteria is printed out at every step that the algorithm takes. If `method="bf"` then the value of `verbose` determines the value of the argument `print.level` of `nlm()` or the value of the argument `trace` of `optim()`. In the first case, if `verbose` is `TRUE` then `print.level` is set to 2, otherwise it is set to 0. In the second case, if `verbose` is `TRUE` then `trace` is set to 6, otherwise it is set to 0.
`itmax`	When the method is `"EM"`, `"LM"` or `"SD"` this is the maximum number of steps that the algorithm takes. If the convergence criterion has not been met by the time `itmax` steps have been performed, a warning message is printed out, and the function stops. A value is returned by the function anyway, with the logical component `converged` set to `FALSE`. When `method="bf"` the `itmax` argument is passed to `nlm()` as the value of `iterlim` or to `optim()` as the value of `maxit`. If the (somewhat obscure) convergence criteria of `nlm()` or `optim()` have not been met by the time `itmax` “iterations” have been performed, the algorithm ceases. In this case, if `nlm()` is used. the value of `code` in the object returned set equal to 4 and if `optim()` is used then the value of `convergence` returned is set equal to 1. Note that the value of `code`, respectively `convergence` is returned as the `converged` component of the object returned by `hmm()`. A value of 1 indicates successful completion of the `nlm()` procedure. A value of 0 indicates successful completion of the `optim()` procedure.
`crit`	The name of the stopping criterion used. When `method="EM"` it must be one of `"PCLL"` (percent change in log-likelihood; the default), `"L2"` (L-2 norm, i.e. square root of sum of squares of change in coefficients), or `"Linf"` (L-infinity norm, i.e. maximum absolute value of change in coefficients). When `method="LM"` or `method="SD"` there is a fourth possibility, namely `"ABSGRD"` the (maximum) absolute value of the gradient. It may not be advisable to use this criterion in the current context (i.e. that of discrete non-parametric distributions). See Warnings. This argument defaults to `"PCLL"`. It is ignored if `method="bf"`. (The `nlm()` and `optim()` functions have their own obscure stopping criteria.)
`X`	An optional numeric matrix, or a list of such matrices, of “auxiliary” predictors. The use of such predictors is (currently, at least) applicable only in the univariate emissions setting. If `X` is a list it must be of the same length as `y` and all entries of this list must have the same number of columns. If the columns of any entry of the list are named, then they must be named for all entries, and the column names must be the same for all entries. The number of rows of each entry must be equal to the length of the corresponding entry of `y`. If `X` is a matrix then `y` should be a vector or one-column matrix (or a list with a single entry equal to such). There may be at most one constant column in `X` or the components thereof. If there are any constant columns there must be precisely one (in all components of `X`), it must be the first column and all of its entries must be equal to `1`. If the columns have names, the names of this first column must be `"Intercept"`. Note that `X` (or its entries) must be a numeric matrix (or must be numeric matrices) — not data frames! Factor predictors are not permitted. It may be possible to use factor predictors by supplying `X` or its entries as the output of `model.matrix()`; this will depend on circumstances. The fitted coefficients that are produced when `X` is supplied, are (to put it mildly) a bit difficult to interpret. See Fitted Coefficients of Auxiliary Predictors for some discussion.
`keep.y`	Logical scalar; should the observations `y` be returned as a component of the value of this function?
`keep.X`	Logical scalar; should the predictors `X` be returned as a component of the value of this function? Note that the value of `keep.X` will be silently set equal to `FALSE` unless it actually “makes sense” to keep `X`. I.e. unless the observations are univariate and `X` is actually supplied, i.e. is not `NULL`.
`addIntercept`	Logical scalar. Should a column of ones, corresponding to an intercept term, be prepended to each of the matrices in the list `X`? If each of these matrices already has an initial column of ones, then setting `addIntercept=TRUE` results in an error being thrown. If this is not the case, then by default an initial column of ones is added.
`lmc`	Numeric scalar. The (initial) “Levenberg-Marquardt correction” parameter. Used only if `method="LM"`, otherwise ignored.
`hessian`	Logical scalar. Should the hessian matrix be returned? This argument is relevant only if `method="bf"` (in which case it is passed along to `hmmNumOpt()`) and is ignored otherwise. This argument should be set to `TRUE` only if you really want the hessian matrix. Setting it to `TRUE` causes a substantial delay between the time when `hmm()` finishes its iterations and when it actually returns a value.
`...`	Additional arguments passed to `hmmNumOpt()`. There is one noteworthy argument `useAnalGrad` which is used “directly” by `hmmNumOpt()`. This argument is a logical scalar and if it is `TRUE` then calls to `nlm()` or `optim()` are structured so that an analytic calculation of the gradient vector (implemented by the internal function `get.gl()` is applied. If it is `FALSE` then finite difference methods are used to calculate the gradient vector. If this argument is not specified it defaults to `FALSE`. Note that the name of this argument cannot be abbreviated. Other “additional arguments” may be supplied for the control of `nlm()` and are passed on appropriately to `nlm()`. These are used only if `method="bf"` and if `optimiser="nlm"`. These “...” arguments might typically include `gradtol`, `stepmax` and `steptol`. They should NOT include `print.level` or `iterlim`. The former argument is automatically passed to `nlm()` as `0` if `verbose` is `FALSE` and as `2` if `verbose` is `TRUE`. The latter argument is automatically passed to `nlm()` with the value of `itmax`.

Details

Univariate case: In the univariate case the emission probabilities are specified by means of a data frame Rho. The first column of Rho, named "y", is a factor consisting of the possible values of the emissions, repeated K times (where K is the number of states). The second column, named states, is a factor consisting of integer values 1, 2, ..., K. Each of these values is repeated m times where m is the length of yval. Further columns of Rho are numeric and consist of coefficients of the linear predictor of the probabilities of the various values of y. If X is NULL then Rho has only one further column named Intercept.

If X is not NULL then the Intercept column is present only if addIntercept is TRUE. There as many (other, in addition to the possible Intercept column) numeric columns as there are columns in X or in the matrices in the list X. The names of these columns are taken to be the column names of X or the first entry of X if such column names are present. Otherwise the names default to V1, V2 ....

The probabilities of the emissions taking on their various possible values are given by

\Pr(Y = y_i | \boldsymbol{x}, \textrm{state}=S) = \ell_i/\sum_{j=1}^m \ell_j

where \ell_j is the j\textrm{th} entry of \boldsymbol{\beta}^{\top}\boldsymbol{x} and where in turn \boldsymbol{x} is the vector of predictors and \boldsymbol{\beta} is the coefficient vector in the linear predicator that corresponds to y_i and the hidden state S. For identifiability the vectors \boldsymbol{\beta} corresponding to the first value of Y (the first level of Rho$y) are set equal to the zero vector for all values of the state S.

Note that the Rho component of the starting values par0 may be specified as a matrix of probabilities, with rows corresponding to possible values of the observations and columns corresponding to states. That is the Rho component of par0 may be provided in the form \textrm{Rho} = [\rho_{ij}] where \rho_{ij} = \Pr(Y = y_i | S = j). This is permissible as long as X is NULL and may be found to be more convenient and intuitive. If the starting value for Rho is provided in matrix form it is (silently) converted internally into the data frame form, by the (undocumented) function cnvrtRho().

When argument X is not NULL, it is difficult to specify a “reasonable” value for the Rho component of par0. One might try to specify par0$Rho in the data frame form. The question of how to specify the columns of par0$Rho corresponding to the auxiliary predictors (columns of X or of the entries of X) is a thorny one.

It is permissible in these circumstances to specify par0$Rho as a matrix of probabilities, just as one would do if X were NULL. In this setting the (undocumented) function checkStartVal() converts the matrix of probabilities to data frame form and then appends columns, all of whose entries are 0, corresponding to the auxiliary predictors. When par0 is unspecified, the (undocumented) function init.all() performs similar construction to accommodate a non-NULL value of X. Whether the resulting starting value for Rho makes any real sense, is questionable. However little else can be done.
Independent bivariate case: the emission probabilities are specified by a list of two matrices. In this setting \Pr(Y_1,Y_2) = (y_{i1},y_{i2}) | S = j) = \rho^{(1)}_{i_1,j} \rho^{(2)}_{i_2,j} where R^{(k)} = [\rho^{(k)}_{ij}] (k = 1,2) are the two emission probability matrices.
Dependent bivariate case: the emission probabilities are specified by a three dimensional array. In this setting \Pr((Y_1,Y_2) = (y_{i1},y_{i2}) | S = j) = \rho_{i_1,i_2,j} where R = [\rho_{ijk}] is the emission probability array.

The hard work of calculating the recursive probabilities used to fit the model is done by a Fortran subroutine recurse (actually coded in Ratfor) which is dynamically loaded. In the univariate case, when X is provided, the estimation of the “linear predictor” vectors \boldsymbol{\beta} is handled by the function multinom() from the nnet package. Note that this is a “Recommended” package and is thereby automatically available (i.e. does not have to be installed).

Value

A list with components:

`Rho`	The fitted value of the data frame, list of two matrices, or array `Rho` (in the case of a univariate model, a bivariate independent model or a bivariate dependent model respectively) specifying the distributions of the observations (the “emission” probabilities).
`Rho.matrix`	Present only in the univariate setting. A matrix whose entries are the (fitted) emission probabilities, row corresponding to values of the emissions and columns to states. The columns sum to 1. This component provides the same information as `Rho`, but in a more readily interpretable form.
`tpm`	The fitted value of the transition probability matrix `tpm`.
`stationary`	Logical scalar; the value of the `stationary` argument.
`ispd`	The fitted initial state probability distribution, or a matrix of initial state probability distributions, one (column) of `ispd` for each observation sequence. If `stationary` is `TRUE` then `ispd` is assumed to be the (unique) stationary distribution for the chain, and thereby determined by the transition probability matrix `tpm`. If `stationary` is `FALSE` and `cis` is `TRUE` then `ispd` is estimated as the mean of the vectors of conditional probabilities of the states, given the observation sequences, at time `t=1`. If `cis` is `FALSE` then `ispd` is a matrix whose columns are the vectors of conditional probabilities of the states, given the observation sequences, at time `t=1`, as described above. (If there is only one observation sequence, then this — one-column — matrix is converted into a vector.)
`log.like`	The final (maximal, we hope!) value of the log likelihood, as determined by the maximisation procedure.
`grad`	The gradient of the log likelihood. Present only if the method is `"LM"` or `"bf"` and in the latter case then only if the optimiser is `nlm()`.
`hessian`	The hessian of the log likelihood. Present only if the method is `"LM"` or `"bf"`.
`stopCrit`	A vector of the (final) values of the stopping criteria, with names `"PCLL"`, `"L2"`, `"Linf"` unless the method is `"LM"` or `"SD"` in which case this vector has a fourth entry named `"ABSGRD"`.
`par0`	The starting values used by the algorithms. Either the argument `par0`, or a similar object with either or both components (`tpm` and `Rho`) being created by `rand.start()`.
`npar`	The number of parameters in the fitted model. Equal to `nispar + ntpmpar + nrhopar` where (1) `nispar` is `0` if `stationary` is `TRUE` and is `K-1` otherwise; (2) `ntpmpar` is `K(K-1)` (3) `nrhopar` is `(nrow(Rho) - K)(ncol(Rho)-2)` for univariate models `K*(sum(sapply(Rho,nrow))-K)` for bivariate independent models `prod(dim(Rho))-K` for bivariate dependent models.
`bicm`	Numeric scalar. The number by which `npar` is multiplied to form the `BIC` criterion. It is essentially the log of the number of observations. See the code of `hmm()` for details.
`converged`	A logical scalar indicating whether the algorithm converged. If the EM, LM or steepest descent algorithm was used it simply indicates whether the stopping criterion was met before the maximum number (`itmax`) of steps was exceeded. If `method="bf"` then `converged` is based on the `code` component of the object returned by the optimiser when `nlm()` was used, or on the `convergence` component when `optim()` was used. In these cases `converged` has an attribute (`code` or `convergence` respectively) giving the (integer) value of the relevant component. Note that in the `nlm()` case a value of `code` equal to 2 indicates “probable” convergence, and a value of 3 indicates “possible” convergence. However in this context `converged` is set equal to `TRUE` only if `code` is 1.
`nstep`	The number of steps performed by the algorithm if the method was `"EM"`, `"LM"` or `"SD"`. The value of `nstep` is set equal to the `iterations` component of the value returned by `nlm()` if `method="bf"`.
`prior.emsteps`	The number of EM steps that were taken before the method was switched from `"EM"` to `"bf"` or to `"LM"`. Present only in values returned under the `"bf"` or `"LM"` methods after a switch from `"EM"` and is equal to `0` if either of these methods was specified in the initial call (rather than arising as the result of a switch).
`ylengths`	Integer vector of the lengths of the observation sequences (number of rows if the observations are in the form of one or two column matrices).
`nafrac`	A real number between 0 and 1 or a pair (two dimensional vector) of such numbers. Each number is the the fraction of missing values if the corresponding components of the observations.
`y`	An object of class `"tidyList"`. It is a tidied up version of the observations; i.e. the observations `y` after the application of the undocumented function `tidyList()`. Present only if `keep.y` is `TRUE`.
`X`	An object of class `"tidyList"`. It is tidied up version of the predictor matrix or list of predictor matrices; i.e. the argument `X` after the application of `tidyList()` (with argument `rp` set to `"predictor"`. Present only if `X` is supplied, is an appropriate argument, and if `keep.X` is `TRUE`.
`parity`	Character string; `"univar"` if the data were univariate, `"bivar"` if they were bivariate.
`numeric`	Logical scalar; `TRUE` if the (original) data were numeric, `FALSE` otherwise.
`AIC`	The value of AIC `= -2log.like + 2npar` for the fitted model.
`BIC`	The value of BIC `= -2log.like + log(nobs)npar` for the fitted model. In the forgoing `nobs` is the number of observations. This is the number of non-missing values in `unlist(y)` in the univariate setting and one half of this number in the bivariate setting.
`args`	A list of argument values supplied. This component is returned in the interest of making results reproducible. It is also needed to facilitate the updating of a model via the update method for the class `hmm.discnp`, `update.hmm.discnp()`. It has components: `method` `optimiser` `optimMethod` `stationary` `mixture` `cis` `tolerance` `itmax` `crit` `addIntercept`

Thanks

A massive nest of bugs was eliminated in the transition from version 3.0-8 to version 3.0-9. These bugs arose in the context of using auxiliary predictor variables (argument X). The handling of such auxiliary predictors was completely messed up. I am grateful to Leah Walker for pointing out the problem to me.

Warnings

The ordering of the (hidden) states can be arbitrary. What the estimation procedure decides to call “state 1” may not be what you think of as being state number 1. The ordering of the states will be affected by the starting values used.

Some experiences with using the "ABSGRD" stopping criterion indicate that it may be problematic in the context of discrete non-parametric distributions. For example a value of 1854.955 was returned after 200 LM steps in one (non-convergent, of course!) attempt at fitting a model. The stopping criterion "PCLL" in this example took the “reasonable” value of 0.03193748 when iterations ceased.

Notes — Various

This function used to have an argument newstyle, a logical scalar (defaulting to TRUE) indicating whether (in the univariate setting) the emission probabilities should be represented in “logistic” form. (See Details, Univariate case:, above.) Now the emission probabilities are always represented in the “logistic” form. The component Rho of the starting parameter values par0 may still be supplied as a matrix of probabilities (with columns summing to 1), but this component is converted (internally, silently) to the logistic form.

The object returned by this function also has (in the univariate setting), in addition to the component Rho, a component Rho.matrix giving the emission probabilities in the more readily interpretable matrix-of-probabilities form. (See Value above.)

The package used to require the argument y to be a matrix in the case of multiple observed sequences. If the series were of unequal length the user was expected to pad them out with NAs to equalize the lengths.

The old matrix format for multiple observation sequences was permitted for a while (and the matrix was internally changed into a list) but this is no longer allowed. If y is indeed given as a matrix then this corresponds to a single observation sequence and it must have one (univariate setting) or two (bivariate setting) columns which constitute the observations of the respective variates.

If K=1 then tpm, ispd, converged, and nstep are all set equal to NA in the list returned by this function.

The estimate of ispd in the non-stationary setting is inevitably very poor, unless the number of sequences of observations (the length of the list y) is very large. We have in effect “less than one” relevant observation for each such sequence.

The returned values of tpm and Rho (or the entries of Rho when Rho is a list) have dimension names. These are formed from the argument yval if this is supplied, otherwise from the sorted unique values of the observations in y. Likewise the returned value of ispd is a named vector, the names being the same as the row (and column) names of tpm.

If method is equal to "EM" there may be a decrease (!!!) in the log likelihood at some EM step. This is “theoretically impossible” but can occur in practice due to an intricacy in the way that the EM algorithm treats ispd when stationary is TRUE. It turns out to be effectively impossible to maximise the expected log likelihood unless the term in that quantity corresponding to ispd is ignored (whence it is ignored). Ignoring this term is “asymptotically negligible” but can have the unfortunate effect of occasionally leading to a decrease in the log likelihood.

If such a decrease is detected, then the algorithm terminates and issues a message to the effect that the decrease occurred. The message suggests that another method be used and that perhaps the results from the penultimate EM step (which are returned by this function) be used as starting values.

It seems to me that it should be the case that such a decrease in the log likelihood can occur only if stationary is TRUE. However I have encountered instances in which a decrease occurred when stationary was FALSE. I have yet to figure out/track down what is going on here.

Note on `method`

If the method is "EM" it is actually possible for the log likelihood to decrease at some EM step. This is “impossible in an ideal world” but can happen to the fact the EM algorithm, as implemented in this package at least, cannot maximise the expected log likelihood if the component corresponding to the initial state probability distribution is taken into consideration. This component should ideally be maximised subject to the constraint that t(P)%*%ispd = ispd, but this constraint seems to effectively impossible to impose. Lagrangian multipliers don't cut it. Hence the summand in question is ignored at the M-step. This usually works alright since the summand is asymptotically negligible, but things can sometimes go wrong. If such a decrease occurs, an error is thrown.

In previous versions of this package, instead of throwing an error the hmm() function would automatically switch to either the "bf" or the "LM" method, depending whether a matrix X of auxiliary predictors is supplied, starting from the penultimate parameter estimates produced by the EM algorithm. However this appears not to be a good idea; those “penultimate estimates” appear not to be good starting values for the other methods. Hence an error is now thrown and the user is explicitly instructed to invoke a different method, “starting from scratch”.

Fitted Coefficients of the Predictors

It is of course of interest to understand the meaning of the coefficients that are fitted to the predictors in the model. If X is supplied then the number of predictors is (as a rule) one (for the intercept) plus the number of columns in each entry of X. We say “as a rule” because, e.g., the entries of X could each have an “intercept” column, or the addIntercept argument could be FALSE. If X is not supplied there is only one predictor, named Intercept.

The interpretation of these predictor coefficients is a bit subtle. To get an idea of what it's all about, consider the output from example 4. (See Examples). The fitted coefficients in question are to be found in columns 3 and onward of the component Rho of the object returned by hmm(). In the context of example 4, this object is fit.wap. (The suffix wap stands for “with auxiliary predictors”.)

  fit.wap$Rho
       y state Intercept     ma.com      nh.com     bo.com
  1   lo     1 1.3810463  0.4527982 -3.27161353 -1.9563915
  2  mlo     1 0.1255631 -1.1402546 -1.37713744  0.5946980
  3    m     1 0.7356526  0.1523734 -2.70841817 -0.1794645
  4  mhi     1 0.8479798 -0.2438988 -1.12544989 -0.9650320
  5   hi     1 0.0000000  0.0000000  0.00000000  0.0000000
  6   lo     2 3.9439410 -0.8355306 -0.77702276  1.4963631
  7  mlo     2 2.6189880 -1.9373885 -0.09190623  0.8316870
  8    m     2 2.1457317 -1.7276183  0.19524655 -0.3249485
  9  mhi     2 1.8834139 -1.3760011 -0.59806309  1.2828365
  10  hi     2 0.0000000  0.0000000  0.00000000  0.0000000

If you multiply the matrix consisting of the predictor coefficients (columns 3 to 6 of Rho in this instance) times a vector of predictors you get, for each state, the “exponential form” of the probabilities (“pre-probabilities”) for each of the possible y-values, given the vector of predictors.

E.g. set x <- c(1,1,0,0). This vector picks up the intercept and indicates that the Malabar outfall has been commissioned, the North Head outfall has not been commissioned, and the Bondi Offshore outfall has not been commissioned.

Now set:

    pp1 <- (as.matrix(fit.wap$Rho)[,3:6]%*%x)[1:5]
    pp2 <- (as.matrix(fit.wap$Rho)[,3:6]%*%x)[6:10]

Note that pp1 consists of “exponential probabilities” corresponding to state 1, and pp2 consists of “exponential probabilities” corresponding to state 2. To convert the foregoing pre-probabilities to the actual probabilities of the y-values, we apply the — undocumented — function expForm2p():

    p1 <- expForm2p(pp1)
    p2 <- expForm2p(pp2)

The value of p1 is

[1] 0.52674539 0.03051387 0.20456767 0.15400019 0.08417288

and that of p2 is

[1] 0.78428283 0.06926632 0.05322204 0.05819340 0.03503541

Note that p1 and p2 each sum to 1, as they should/must do. This says, e.g., that when the system is in state 2, and Malabar has been commissioned but North Head and Bondi Offshore have not, the (estimated) probability that y is "mhi" (medium-high) is 0.05819340.

It may be of some interest to test the hypothesis that the predictors have any actual predictive power at all:

    fit.nap <- hmm(xxx,yval=Yval,K=2,verb=TRUE)
    # "nap" <--> no aux. preds

There is a bit of a problem here, in that the likelihood decreases at EM step 65. (See the warning message.)

We can check on this problem by refitting using method="LM".

    fit.nap.lm <- hmm(xxx,yval=Yval,par0=fit.nap,method="LM",verb=TRUE)

Doing so produces only a small improvement in the log likelihood (from -1821.425 to -1820.314), so we really could have ignored the problem. We can now do anova(fit.wap,fit.nap) which gives

    $stat
    [1] 153.5491

    $df
    [1] 24

    $pvalue
    [1] 7.237102e-21

Thus the p-value is effectively zero, saying that in this instance the auxiliary predictors appear to have a “significant” impact on the fit.

Author(s)

Rolf Turner r.turner@auckland.ac.nz

References

Rabiner, L. R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE vol. 77, pp. 257 – 286, 1989.

Zucchini, W. and Guttorp, P., "A hidden Markov model for space-time precipitation," Water Resources Research vol. 27, pp. 1917-1923, 1991.

MacDonald, I. L., and Zucchini, W., "Hidden Markov and Other Models for Discrete-valued Time Series", Chapman & Hall, London, 1997.

Liu, Limin, "Hidden Markov Models for Precipitation in a Region of Atlantic Canada", Master's Report, University of New Brunswick, 1997.

Examples

# TO DO: Create one or more bivariate examples.
#
# The value of itmax in the following examples is so much
# too small as to be risible.  This is just to speed up the
# R CMD check process.
# 1.
Yval <- LETTERS[1:10]
Tpm  <- matrix(c(0.75,0.25,0.25,0.75),ncol=2,byrow=TRUE)
Rho  <- cbind(c(rep(1,5),rep(0,5)),c(rep(0,5),rep(1,5)))/5
rownames(Rho) <- Yval
set.seed(42)
xxx  <- rhmm(ylengths=rep(1000,5),nsim=1,tpm=Tpm,Rho=Rho,yval=Yval,drop=TRUE)
fit  <- hmm(xxx,par0=list(tpm=Tpm,Rho=Rho),itmax=10)
print(fit$Rho) # A data frame
print(cnvrtRho(fit$Rho)) # A matrix of probabilities
                         # whose columns sum to 1.

# 2.
# See the help for logLikHmm() for how to generate y.num.
## Not run: 
   fit.num     <- hmm(y.num,K=2,verb=TRUE,itmax=10)
   fit.num.mix <- hmm(y.num,K=2,verb=TRUE,mixture=TRUE,itmax=10)
   print(fit.num[c("tpm","Rho")])

## End(Not run)
# Note that states 1 and 2 get swapped.

# 3.
xxx <- with(SydColDisc,split(y,f=list(locn,depth)))
Yval <- c("lo","mlo","m","mhi","hi")
# Two states: above and below the thermocline.
fitSydCol <- hmm(xxx,yval=Yval,K=2,verb=TRUE,itmax=10)

# 4.
X <- split(SydColDisc[,c("ma.com","nh.com","bo.com")],
           f=with(SydColDisc,list(locn,depth)))
X <- lapply(X,function(x){
                 as.matrix(as.data.frame(lapply(x,as.numeric)))-1})
fit.wap <- hmm(xxx,yval=Yval,K=2,X=X,verb=TRUE,itmax=10)
# wap <--> with auxiliary predictors.

# 5.
## Not run:  # Takes too long.
fitlm <- hmm(xxx,yval=Yval,K=2,method="LM",verb=TRUE)
fitem <- hmm(xxx,yval=Yval,K=2,verb=TRUE)
# Algorithm terminates due to a decrease in the log likelihood
# at EM step 64.
newfitlm <- hmm(xxx,yval=Yval,par0=fitem,method="LM",verb=TRUE)
# The log likelihood improves from -1900.988 to -1820.314

## End(Not run)

# 6.
fitLesCount <- hmm(lesionCount,K=2,itmax=10) # Two states: relapse and remission.

[Package hmm.discnp version 3.0-9 Index]