| crossval {eglhmm} | R Documentation |
Cross validate an extended generalised linear hidden Markov model.
Description
Calculates a number of cross validation log likelihood values
for a hidden Markov model (usually one fitted by the eglhmm()
function).
Usage
crossval(model,data,nrep,frac=0.8,type,id="id",minNcomp=100,
seed=NULL,crossVerb=FALSE,lastPar=NULL,...)
Arguments
model |
A list having components with names selected from those
of objects returned by |
data |
A data frame containing the observations to which the cross validation are to be fitted. It is of course up to the user to ensure that a specified
value of |
nrep |
Positive nteger scalar; the number of replications, i.e. the
number of cross validation calculations undertaken. If argument
|
frac |
Postive scaler, less than 1. The fraction of the
data randomly selected to be used as |
type |
Integer scalar, equal to either 1 or 2, which determines the nature
of the sampling used to produce the training and validation data.
If If Obviously it is sensible to use |
id |
Character scalar specifying the column to be used to determine the
individual independent time series that make up the data. Ignored
unless |
minNcomp |
Integer scalar specifying the minimum number of components (number
of levels of If the number of components is less than the default value of
|
seed |
Positive integer scalar to be used as a seed for the random
number generator (so as to enable reproducibility). If not
supplied, it is randomly chosen from the sequence |
crossVerb |
Logical scalar. Should brief “progress reports” (letting
the user know what is happening with respect to replicate |
lastPar |
The last values of the (relevant) fitted paramenters, provided as
an attribute of a component of the list returned by the function
currently under consideration (i.e. |
... |
Possible additional arguments to be passed to |
Details
On each replication a random subset comprising frac of
the data is selected to serve as training data. The complement
of this subset is used as validation data. The model specified
by model is fitted to the training data. It is possible
to over-ride some of the details of the specifications producing
model, via the ... argument of crossval().
After the model is fitted to the training data, the log likelihood
of the validation data is calculated on the basis of that
fitted model.
If the procedure for fitting a model to the training data fails
to converge, then the corresponding entry of the list returned by
this function is NA. In this case, the entry is assigned an
attribute lastPar, (the estimates of the model parameters
that were current when the fitting algorithm terminated)
which will in turn have attributes trnDat, valDat
(the training data in question and the corresponding validation
data), and seed (the value of the seed that was set before
the sampling that determined trnDat and valDat
took place). The value of seed is SEEDS[i]
(if nrep>1 and the entry in question was the ith
entry of the returned list) or the value of the seed
argument of this function or its random replacement if this
argument was not supplied (if nrep==1).
The attribute lastPar enables the user to continue the
procedure for fitting a model to the training data, starting
from where the procedure, that failed to converge, left off.
Continuing the procedure is easily effected by calling
crossval() with argument par0 set equal to the
lastPar attribute of the relevant entry of the list that
was previously returned by this function.
If type==1 then the training and validation data are
created in a somewhat subtle manner. The procedure necessitates
referring to the “original” data. The data frame that
is passed to eglhmm() is a “replicated” version
of the original data, with each row of the original data being
repeated once for each level of state (and a "state"
column — factor — being added to the resulting data frame).
If type==2 the procedure is conceptually simpler.
The procedure in the type==1 setting is as follows:
Let
Sbe the set of indices of all non-missing values in the column of the original data that contains the emissions.Select at random a subset
VofSso that the size ofVis the fractionfracof the size ofS.Let
Tbe the complement inSofV.The training data are formed by replacing by
NAall those values of the emissions column in the original data whose indices are inT.The validation data are formed by replacing by
NAall those values of the emissions column in the original data, whose indices are inV.Then replicate both the training and validation data in the manner described above.
If
type==2then the training data are formed by selecting at random a fractionfracof the levels of the column ofdatanamed"id". (If there is no such column, an error is thrown.) The training data then consist of those rows ofdatacorresponding to the selected levels ofid. The validation data then consist of those rows ofdatawhich are not in the training data.
Value
If nrep>1 the returned value is list of length nrep.
The ith entry of this list is the log likelihood of the
validation data with respect to the model fitted to the training
data, for the ith random selection of these two subsets.
This entry will be NA if the attempt to fit a model
to the training data was unsuccessful. The ith entry
has an attribute seed (singular) which is the value of
the seed that was set prior to the random sampling that chose
the training and validation data. If the ith entry is
NA it will also have an attribute lastPar which
in turn will have attributes trnDat and valDat.
See Details.
If nrep>1 then the returned value also has an attribute
seeds (plural) which is vector of length nrep+1,
consisting of the “auxiliary” seed vector SEEDS
(see the argument seed) together with the “over all”
seed (possibly equal to the seed argument) that was set for
the random number generator before any sampling was undertaken.
Note that the ith entry of this seeds attribute
is the same as the seed attribute of the ith entry
of the returned valuel
If nrep==1 then the returned value is a single numeric
scalar which is the log likelihood of the validation data or
NA if the fitting procedure did not converge for the
training data. It has an attribute seed which is equal
to the seed argument or its random replacement. If the
value is NA then it has a further attribute lastPar.
(See above.)
Note
If the function fails to fit the model, obtained from the training
data, to the validation data, then the value returned is NA.
This value will have an attribute lastPar. This attribute
will in turn have attributes, trnDat and valDat,
the training data and validation data which were being used
in the failed fitting procedure. Supplying an appropriate
value of lastPar enables the continuation of the fitting
procedure, starting from where the procedure previously left off.
See Details for a little more information.
Author(s)
Rolf Turner
r.turner@auckland.ac.nz
References
Celeux, Gilles and Durand, Jean-Baptiste (2008). Selecting
hidden Markov model state number with cross-validated likelihood.
Computational Statistics 23 541–564, DOI
10.1007/s00180-007-0097-1.
Smyth, Padhraic (2000). Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing 9 63–72.
See Also
eglhmm()
Examples
## Not run:
ids <- paste0("s",1001:1101)
cc <- ccSim[ccSim$id
cc$id <- factor(cc$id)
cvll1 <- vector("list",9)
set.seed(42)
SEEDS <- sample(1:1e6,9)
for(k in 1:9) {
cat("k =",k,"started\n")
fit <- eglhmm(categMC ~ 1, distr="M", method="em", data=cc, K=k,
itmax=1500,cells="id",verb=TRUE)
cvll1[[k]] <- crossval(fit,nrep=5,type=2,seed=SEEDS[k],tolerance=1e-4,
verbose=FALSE,crossVerb=TRUE)
cat("k =",k,"finished\n")
}
## End(Not run)
fit <- eglhmm(y ~ 1, data=Downloads,K=4,distr="P",verb=TRUE,cf="singlecell")
# Use artifically low value of itmax so that crossval() fails to
# fit the model to the training data
cvll2 <- crossval(fit,nrep=5,type=1,verbose=TRUE,seed=322596,itmax=5)
cvll3 <- crossval(fit,type=1,verbose=TRUE,lastPar=attr(cvll2[[1]],"lastPar"))
# So cvll3 carried on, in one instance, from where the first
# attempted fit in cvll2 gave up.