hmm {hmm.discnp} | R Documentation |
Fit a hidden Markov model to discrete data.
Description
Effects a maximum likelihood fit of a hidden Markov model to discrete data where the observations come from one of a number of finite discrete distributions, depending on the (hidden) state of the Markov chain. These distributions (the “emission probabilities”) are specified non-parametrically. The observations may be univariate, independent bivariate, or dependent bivariate. By default this function uses the EM algorithm. In the univariate setting it may alternatively use a “brute force” method.
Usage
hmm(y, yval=NULL, par0=NULL, K=NULL, rand.start=NULL,
method=c("EM","bf","LM","SD"), hglmethod=c("fortran","oraw","raw"),
optimiser=c("nlm","optim"), optimMethod=NULL, stationary=cis,
mixture=FALSE, cis=TRUE, indep=NULL, tolerance=1e-4, digits=NULL,
verbose=FALSE, itmax=200, crit=c("PCLL","L2","Linf","ABSGRD"),
X=NULL,keep.y=FALSE, keep.X=keep.y,
addIntercept=TRUE, lmc=10, hessian=FALSE,...)
Arguments
y |
A vector or a list of vectors, or one or two column matrix
(bivariate setting) or a list of such matrices; missing values
are allowed. If |
yval |
A vector (of length The argument |
par0 |
An optional (named) list of starting values for the
parameters of the model, with components If the model is stationary (i.e. if If |
K |
The number of states in the hidden Markov chain; if Note that |
rand.start |
Either a logical scalar or a list consisting of two logical
scalars which must be named |
method |
Character string, either |
hglmethod |
Character string; one of |
optimiser |
Character string specifying the optimiser to use when the
“ |
optimMethod |
Character string specifying the optimisation method to be used by
|
stationary |
Logical scalar. If |
mixture |
A logical scalar; if TRUE then a mixture model (all rows of the
transition probability matrix are identical) is fitted rather
than a general hidden Markov model. Currently an error is
thrown if |
cis |
A logical scalar specifying whether there should be a
constant initial state probability
distribution. If |
indep |
Logical scalar. Should the bivariate model be fitted under the
assumption that the two variables are (conditionally) independent
give the state? If this argument is left as |
tolerance |
If the value of the quantity used for the stopping criterion
is less than tolerance then the algorithm is considered to
have converged. Ignored if |
digits |
Integer scalar. The number of digits to which to print out
“progress reports” (when |
verbose |
A logical scalar determining whether to print out details of
the progress of the algorithm. If the method is |
itmax |
When the method is |
crit |
The name of the stopping criterion used. When |
X |
An optional numeric matrix, or a list of such
matrices, of “auxiliary” predictors. The use of
such predictors is (currently, at least) applicable only in the
univariate emissions setting. If There may be at most one constant column in Note that The fitted coefficients that are produced when |
keep.y |
Logical scalar; should the observations |
keep.X |
Logical scalar; should the predictors |
addIntercept |
Logical scalar. Should a column of ones, corresponding to
an intercept term, be prepended to each of the matrices in
the list |
lmc |
Numeric scalar. The (initial) “Levenberg-Marquardt
correction” parameter. Used only if |
hessian |
Logical scalar. Should the hessian matrix be
returned? This argument is relevant only if |
... |
Additional arguments passed to Other “additional arguments” may be supplied for the
control of |
Details
-
Univariate case: In the univariate case the emission probabilities are specified by means of a data frame
Rho
. The first column ofRho
, named"y"
, is a factor consisting of the possible values of the emissions, repeatedK
times (whereK
is the number of states). The second column, namedstates
, is a factor consisting of integer values1, 2, ..., K
. Each of these values is repeatedm
times wherem
is the length ofyval
. Further columns ofRho
are numeric and consist of coefficients of the linear predictor of the probabilities of the various values ofy
. IfX
isNULL
thenRho
has only one further column namedIntercept
.If
X
is notNULL
then theIntercept
column is present only ifaddIntercept
isTRUE
. There as many (other, in addition to the possibleIntercept
column) numeric columns as there are columns inX
or in the matrices in the listX
. The names of these columns are taken to be the column names ofX
or the first entry ofX
if such column names are present. Otherwise the names default toV1
,V2
....The probabilities of the emissions taking on their various possible values are given by
\Pr(Y = y_i | \boldsymbol{x}, \textrm{state}=S) = \ell_i/\sum_{j=1}^m \ell_j
where
\ell_j
is thej\textrm{th}
entry of\boldsymbol{\beta}^{\top}\boldsymbol{x}
and where in turn\boldsymbol{x}
is the vector of predictors and\boldsymbol{\beta}
is the coefficient vector in the linear predicator that corresponds toy_i
and the hidden stateS
. For identifiability the vectors\boldsymbol{\beta}
corresponding to the first value ofY
(the first level ofRho$y
) are set equal to the zero vector for all values of the stateS
.Note that the
Rho
component of the starting valuespar0
may be specified as a matrix of probabilities, with rows corresponding to possible values of the observations and columns corresponding to states. That is theRho
component ofpar0
may be provided in the form\textrm{Rho} = [\rho_{ij}]
where\rho_{ij} = \Pr(Y = y_i | S = j)
. This is permissible as long asX
isNULL
and may be found to be more convenient and intuitive. If the starting value forRho
is provided in matrix form it is (silently) converted internally into the data frame form, by the (undocumented) functioncnvrtRho()
.When argument
X
is notNULL
, it is difficult to specify a “reasonable” value for theRho
component ofpar0
. One might try to specifypar0$Rho
in the data frame form. The question of how to specify the columns ofpar0$Rho
corresponding to the auxiliary predictors (columns ofX
or of the entries ofX
) is a thorny one.It is permissible in these circumstances to specify
par0$Rho
as a matrix of probabilities, just as one would do ifX
wereNULL
. In this setting the (undocumented) functioncheckStartVal()
converts the matrix of probabilities to data frame form and then appends columns, all of whose entries are 0, corresponding to the auxiliary predictors. Whenpar0
is unspecified, the (undocumented) functioninit.all()
performs similar construction to accommodate a non-NULL
value ofX
. Whether the resulting starting value forRho
makes any real sense, is questionable. However little else can be done. -
Independent bivariate case: the emission probabilities are specified by a list of two matrices. In this setting
\Pr(Y_1,Y_2) = (y_{i1},y_{i2}) | S = j) = \rho^{(1)}_{i_1,j} \rho^{(2)}_{i_2,j}
whereR^{(k)} = [\rho^{(k)}_{ij}]
(k = 1,2
) are the two emission probability matrices. -
Dependent bivariate case: the emission probabilities are specified by a three dimensional array. In this setting
\Pr((Y_1,Y_2) = (y_{i1},y_{i2}) | S = j) = \rho_{i_1,i_2,j}
whereR = [\rho_{ijk}]
is the emission probability array.
The hard work of calculating the recursive probabilities used
to fit the model is done by a Fortran subroutine recurse
(actually coded in Ratfor) which is dynamically loaded. In the
univariate case, when X
is provided, the estimation of the
“linear predictor” vectors \boldsymbol{\beta}
is handled by the function multinom()
from the nnet
package. Note that this is a “Recommended” package
and is thereby automatically available (i.e. does not have to
be installed).
Value
A list with components:
Rho |
The fitted value of the data frame, list of two matrices,
or array |
Rho.matrix |
Present only in the univariate setting. A matrix
whose entries are the (fitted) emission probabilities,
row corresponding to values of the emissions and columns
to states. The columns sum to 1. This component provides
the same information as |
tpm |
The fitted value of the transition probability matrix |
stationary |
Logical scalar; the value of the |
ispd |
The fitted initial state probability distribution, or a matrix
of initial state probability distributions, one (column) of
If If |
log.like |
The final (maximal, we hope!) value of the log likelihood, as determined by the maximisation procedure. |
grad |
The gradient of the log likelihood. Present only if the
method is |
hessian |
The hessian of the log likelihood. Present only if the
method is |
stopCrit |
A vector of the (final) values of the stopping criteria, with
names |
par0 |
The starting values used by the algorithms. Either the argument
|
npar |
The number of parameters in the fitted model. Equal to
|
bicm |
Numeric scalar. The number by which |
converged |
A logical scalar indicating whether the algorithm converged.
If the EM, LM or steepest descent algorithm was used it simply
indicates whether the stopping criterion was met before
the maximum number ( Note that in the |
nstep |
The number of steps performed by the algorithm if the method
was |
prior.emsteps |
The number of EM steps that were taken before the method was
switched from |
ylengths |
Integer vector of the lengths of the observation sequences (number of rows if the observations are in the form of one or two column matrices). |
nafrac |
A real number between 0 and 1 or a pair (two dimensional vector) of such numbers. Each number is the the fraction of missing values if the corresponding components of the observations. |
y |
An object of class |
X |
An object of class |
parity |
Character string; |
numeric |
Logical scalar; |
AIC |
The value of AIC |
BIC |
The value of BIC |
args |
A list of argument values supplied. This component is
returned in the interest of making results reproducible.
It is also needed to facilitate the updating of a model
via the update method for the class It has components:
|
Thanks
A massive nest of bugs was eliminated in the transition from version
3.0-8 to version 3.0-9. These bugs arose in the context of using
auxiliary predictor variables (argument X
). The handling of
such auxiliary predictors was completely messed up. I am grateful
to Leah Walker for pointing out the problem to me.
Warnings
The ordering of the (hidden) states can be arbitrary. What the estimation procedure decides to call “state 1” may not be what you think of as being state number 1. The ordering of the states will be affected by the starting values used.
Some experiences with using the "ABSGRD"
stopping
criterion indicate that it may be problematic in the context of
discrete non-parametric distributions. For example a value of
1854.955 was returned after 200 LM steps in one (non-convergent,
of course!) attempt at fitting a model. The stopping criterion
"PCLL"
in this example took the “reasonable”
value of 0.03193748 when iterations ceased.
Notes — Various
This function used to have an argument newstyle
,
a logical scalar (defaulting to TRUE
) indicating whether
(in the univariate setting) the emission probabilities
should be represented in “logistic” form. (See
Details, Univariate case:, above.) Now the
emission probabilities are always represented in the
“logistic” form. The component Rho
of the
starting parameter values par0
may still be supplied
as a matrix of probabilities (with columns summing to 1), but
this component is converted (internally, silently) to the
logistic form.
The object returned by this function also has (in the univariate
setting), in addition to the component Rho
, a component
Rho.matrix
giving the emission probabilities in the
more readily interpretable matrix-of-probabilities form. (See
Value above.)
The package used to require the argument y
to
be a matrix in the case of multiple observed sequences.
If the series were of unequal length the user was expected to
pad them out with NAs to equalize the lengths.
The old matrix format for multiple observation sequences was
permitted for a while (and the matrix was internally changed into
a list) but this is no longer allowed. If y
is indeed
given as a matrix then this corresponds to a single observation
sequence and it must have one (univariate setting) or two
(bivariate setting) columns which constitute the observations
of the respective variates.
If K=1
then tpm
, ispd
, converged
,
and nstep
are all set equal to NA
in the list
returned by this function.
The estimate of ispd
in the non-stationary setting
is inevitably very poor, unless the number of sequences of
observations (the length of the list y
) is very large.
We have in effect “less than one” relevant observation for
each such sequence.
The returned values of tpm
and Rho
(or the entries
of Rho
when Rho
is a list) have dimension names.
These are formed from the argument yval
if this is
supplied, otherwise from the sorted unique values of the
observations in y
. Likewise the returned value of
ispd
is a named vector, the names being the same as the
row (and column) names of tpm
.
If method
is equal to "EM"
there may be a
decrease (!!!) in the log likelihood at some EM step.
This is “theoretically impossible” but can occur in
practice due to an intricacy in the way that the EM algorithm
treats ispd
when stationary
is TRUE
.
It turns out to be effectively impossible to maximise the expected
log likelihood unless the term in that quantity corresponding
to ispd
is ignored (whence it is ignored).
Ignoring this term is “asymptotically negligible” but
can have the unfortunate effect of occasionally leading to a
decrease in the log likelihood.
If such a decrease is detected, then the algorithm terminates and issues a message to the effect that the decrease occurred. The message suggests that another method be used and that perhaps the results from the penultimate EM step (which are returned by this function) be used as starting values.
It seems to me that it should be the case that such a
decrease in the log likelihood can occur only if stationary
is TRUE
. However I have encountered instances in which
a decrease occurred when stationary
was FALSE
.
I have yet to figure out/track down what is going on here.
Note on method
If the method
is "EM"
it is actually possible
for the log likelihood to decrease at some EM step.
This is “impossible in an ideal world” but can happen
to the fact the EM algorithm, as implemented in this package
at least, cannot maximise the expected log likelihood if the
component corresponding to the initial state probability
distribution is taken into consideration. This component
should ideally be maximised subject to the constraint that
t(P)%*%ispd = ispd
, but this constraint seems to
effectively impossible to impose. Lagrangian multipliers
don't cut it. Hence the summand in question is ignored at
the M-step. This usually works alright since the summand
is asymptotically negligible, but things can sometimes go
wrong. If such a decrease occurs, an error is thrown.
In previous versions of this package, instead of throwing
an error the hmm()
function would automatically switch
to either the "bf"
or the "LM"
method, depending
whether a matrix X
of auxiliary predictors is supplied,
starting from the penultimate parameter estimates produced
by the EM algorithm. However this appears not to be a good
idea; those “penultimate estimates” appear not to be
good starting values for the other methods. Hence an error
is now thrown and the user is explicitly instructed to invoke
a different method, “starting from scratch”.
Fitted Coefficients of the Predictors
It is of course of interest to understand the meaning of the
coefficients that are fitted to the predictors in the model.
If X
is supplied then the number of predictors is (as a rule)
one (for the intercept) plus the number of columns in each entry
of X
. We say “as a rule” because, e.g., the entries
of X
could each have an “intercept” column, or the
addIntercept
argument could be FALSE
. If X
is not supplied there is only one predictor, named Intercept
.
The interpretation of these predictor coefficients is a bit subtle.
To get an idea of what it's all about, consider the output from
example 4
. (See Examples). The fitted coefficients
in question are to be found in columns 3 and onward of the component
Rho
of the object returned by hmm()
. In the context
of example 4
, this object is fit.wap
. (The suffix
wap
stands for “with auxiliary predictors”.)
fit.wap$Rho y state Intercept ma.com nh.com bo.com 1 lo 1 1.3810463 0.4527982 -3.27161353 -1.9563915 2 mlo 1 0.1255631 -1.1402546 -1.37713744 0.5946980 3 m 1 0.7356526 0.1523734 -2.70841817 -0.1794645 4 mhi 1 0.8479798 -0.2438988 -1.12544989 -0.9650320 5 hi 1 0.0000000 0.0000000 0.00000000 0.0000000 6 lo 2 3.9439410 -0.8355306 -0.77702276 1.4963631 7 mlo 2 2.6189880 -1.9373885 -0.09190623 0.8316870 8 m 2 2.1457317 -1.7276183 0.19524655 -0.3249485 9 mhi 2 1.8834139 -1.3760011 -0.59806309 1.2828365 10 hi 2 0.0000000 0.0000000 0.00000000 0.0000000
If you multiply the matrix consisting of the predictor coefficients
(columns 3 to 6 of Rho
in this instance) times a vector of
predictors you get, for each state, the “exponential form”
of the probabilities (“pre-probabilities”) for each of the
possible y
-values, given the vector of predictors.
E.g. set x <- c(1,1,0,0)
. This vector picks up the intercept
and indicates that the Malabar outfall has been commissioned,
the North Head outfall has not been commissioned, and the Bondi
Offshore outfall has not been commissioned.
Now set:
pp1 <- (as.matrix(fit.wap$Rho)[,3:6]%*%x)[1:5] pp2 <- (as.matrix(fit.wap$Rho)[,3:6]%*%x)[6:10]
Note that pp1
consists of “exponential
probabilities” corresponding to state 1, and pp2
consists of “exponential probabilities” corresponding
to state 2. To convert the foregoing pre-probabilities to the
actual probabilities of the y
-values, we apply the —
undocumented — function expForm2p()
:
p1 <- expForm2p(pp1) p2 <- expForm2p(pp2)
The value of p1
is
[1] 0.52674539 0.03051387 0.20456767 0.15400019 0.08417288
and that of p2
is
[1] 0.78428283 0.06926632 0.05322204 0.05819340 0.03503541
Note that p1
and p2
each sum to 1, as they should/must
do. This says, e.g., that when the system is in state 2, and
Malabar has been commissioned but North Head and Bondi Offshore
have not, the (estimated) probability that y
is "mhi"
(medium-high) is 0.05819340.
It may be of some interest to test the hypothesis that the predictors have any actual predictive power at all:
fit.nap <- hmm(xxx,yval=Yval,K=2,verb=TRUE) # "nap" <--> no aux. preds
There is a bit of a problem here, in that the likelihood decreases at EM step 65. (See the warning message.)
We can check on this problem by refitting using method="LM".
fit.nap.lm <- hmm(xxx,yval=Yval,par0=fit.nap,method="LM",verb=TRUE)
Doing so produces only a small improvement in the log likelihood
(from -1821.425 to -1820.314), so we really could have ignored the
problem. We can now do anova(fit.wap,fit.nap)
which gives
$stat [1] 153.5491 $df [1] 24 $pvalue [1] 7.237102e-21
Thus the p-value is effectively zero, saying that in this instance the auxiliary predictors appear to have a “significant” impact on the fit.
Author(s)
Rolf Turner
r.turner@auckland.ac.nz
References
Rabiner, L. R., "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE vol. 77, pp. 257 – 286, 1989.
Zucchini, W. and Guttorp, P., "A hidden Markov model for space-time precipitation," Water Resources Research vol. 27, pp. 1917-1923, 1991.
MacDonald, I. L., and Zucchini, W., "Hidden Markov and Other Models for Discrete-valued Time Series", Chapman & Hall, London, 1997.
Liu, Limin, "Hidden Markov Models for Precipitation in a Region of Atlantic Canada", Master's Report, University of New Brunswick, 1997.
See Also
Examples
# TO DO: Create one or more bivariate examples.
#
# The value of itmax in the following examples is so much
# too small as to be risible. This is just to speed up the
# R CMD check process.
# 1.
Yval <- LETTERS[1:10]
Tpm <- matrix(c(0.75,0.25,0.25,0.75),ncol=2,byrow=TRUE)
Rho <- cbind(c(rep(1,5),rep(0,5)),c(rep(0,5),rep(1,5)))/5
rownames(Rho) <- Yval
set.seed(42)
xxx <- rhmm(ylengths=rep(1000,5),nsim=1,tpm=Tpm,Rho=Rho,yval=Yval,drop=TRUE)
fit <- hmm(xxx,par0=list(tpm=Tpm,Rho=Rho),itmax=10)
print(fit$Rho) # A data frame
print(cnvrtRho(fit$Rho)) # A matrix of probabilities
# whose columns sum to 1.
# 2.
# See the help for logLikHmm() for how to generate y.num.
## Not run:
fit.num <- hmm(y.num,K=2,verb=TRUE,itmax=10)
fit.num.mix <- hmm(y.num,K=2,verb=TRUE,mixture=TRUE,itmax=10)
print(fit.num[c("tpm","Rho")])
## End(Not run)
# Note that states 1 and 2 get swapped.
# 3.
xxx <- with(SydColDisc,split(y,f=list(locn,depth)))
Yval <- c("lo","mlo","m","mhi","hi")
# Two states: above and below the thermocline.
fitSydCol <- hmm(xxx,yval=Yval,K=2,verb=TRUE,itmax=10)
# 4.
X <- split(SydColDisc[,c("ma.com","nh.com","bo.com")],
f=with(SydColDisc,list(locn,depth)))
X <- lapply(X,function(x){
as.matrix(as.data.frame(lapply(x,as.numeric)))-1})
fit.wap <- hmm(xxx,yval=Yval,K=2,X=X,verb=TRUE,itmax=10)
# wap <--> with auxiliary predictors.
# 5.
## Not run: # Takes too long.
fitlm <- hmm(xxx,yval=Yval,K=2,method="LM",verb=TRUE)
fitem <- hmm(xxx,yval=Yval,K=2,verb=TRUE)
# Algorithm terminates due to a decrease in the log likelihood
# at EM step 64.
newfitlm <- hmm(xxx,yval=Yval,par0=fitem,method="LM",verb=TRUE)
# The log likelihood improves from -1900.988 to -1820.314
## End(Not run)
# 6.
fitLesCount <- hmm(lesionCount,K=2,itmax=10) # Two states: relapse and remission.