crossval {eglhmm} | R Documentation |
Cross validate an extended generalised linear hidden Markov model.
Description
Calculates a number of cross validation log likelihood values
for a hidden Markov model (usually one fitted by the eglhmm()
function).
Usage
crossval(model,data,nrep,frac=0.8,type,id="id",minNcomp=100,
seed=NULL,crossVerb=FALSE,lastPar=NULL,...)
Arguments
model |
A list having components with names selected from those
of objects returned by |
data |
A data frame containing the observations to which the cross validation are to be fitted. It is of course up to the user to ensure that a specified
value of |
nrep |
Positive nteger scalar; the number of replications, i.e. the
number of cross validation calculations undertaken. If argument
|
frac |
Postive scaler, less than 1. The fraction of the
data randomly selected to be used as |
type |
Integer scalar, equal to either 1 or 2, which determines the nature
of the sampling used to produce the training and validation data.
If If Obviously it is sensible to use |
id |
Character scalar specifying the column to be used to determine the
individual independent time series that make up the data. Ignored
unless |
minNcomp |
Integer scalar specifying the minimum number of components (number
of levels of If the number of components is less than the default value of
|
seed |
Positive integer scalar to be used as a seed for the random
number generator (so as to enable reproducibility). If not
supplied, it is randomly chosen from the sequence |
crossVerb |
Logical scalar. Should brief “progress reports” (letting
the user know what is happening with respect to replicate |
lastPar |
The last values of the (relevant) fitted paramenters, provided as
an attribute of a component of the list returned by the function
currently under consideration (i.e. |
... |
Possible additional arguments to be passed to |
Details
On each replication a random subset comprising frac
of
the data is selected to serve as training data. The complement
of this subset is used as validation data. The model specified
by model
is fitted to the training data. It is possible
to over-ride some of the details of the specifications producing
model
, via the ...
argument of crossval()
.
After the model is fitted to the training data, the log likelihood
of the validation data is calculated on the basis of that
fitted model.
If the procedure for fitting a model to the training data fails
to converge, then the corresponding entry of the list returned by
this function is NA
. In this case, the entry is assigned an
attribute lastPar
, (the estimates of the model parameters
that were current when the fitting algorithm terminated)
which will in turn have attributes trnDat
, valDat
(the training data in question and the corresponding validation
data), and seed
(the value of the seed that was set before
the sampling that determined trnDat
and valDat
took place). The value of seed
is SEEDS[i]
(if nrep>1
and the entry in question was the i
th
entry of the returned list) or the value of the seed
argument of this function or its random replacement if this
argument was not supplied (if nrep==1
).
The attribute lastPar
enables the user to continue the
procedure for fitting a model to the training data, starting
from where the procedure, that failed to converge, left off.
Continuing the procedure is easily effected by calling
crossval()
with argument par0
set equal to the
lastPar
attribute of the relevant entry of the list that
was previously returned by this function.
If type==1
then the training and validation data are
created in a somewhat subtle manner. The procedure necessitates
referring to the “original” data. The data frame that
is passed to eglhmm()
is a “replicated” version
of the original data, with each row of the original data being
repeated once for each level of state
(and a "state"
column — factor — being added to the resulting data frame).
If type==2
the procedure is conceptually simpler.
The procedure in the type==1
setting is as follows:
Let
S
be the set of indices of all non-missing values in the column of the original data that contains the emissions.Select at random a subset
V
ofS
so that the size ofV
is the fractionfrac
of the size ofS
.Let
T
be the complement inS
ofV
.The training data are formed by replacing by
NA
all those values of the emissions column in the original data whose indices are inT
.The validation data are formed by replacing by
NA
all those values of the emissions column in the original data, whose indices are inV
.Then replicate both the training and validation data in the manner described above.
If
type==2
then the training data are formed by selecting at random a fractionfrac
of the levels of the column ofdata
named"id"
. (If there is no such column, an error is thrown.) The training data then consist of those rows ofdata
corresponding to the selected levels ofid
. The validation data then consist of those rows ofdata
which are not in the training data.
Value
If nrep>1
the returned value is list of length nrep
.
The i
th entry of this list is the log likelihood of the
validation data with respect to the model fitted to the training
data, for the i
th random selection of these two subsets.
This entry will be NA
if the attempt to fit a model
to the training data was unsuccessful. The i
th entry
has an attribute seed
(singular) which is the value of
the seed that was set prior to the random sampling that chose
the training and validation data. If the i
th entry is
NA
it will also have an attribute lastPar
which
in turn will have attributes trnDat
and valDat
.
See Details.
If nrep>1
then the returned value also has an attribute
seeds
(plural) which is vector of length nrep+1
,
consisting of the “auxiliary” seed vector SEEDS
(see the argument seed
) together with the “over all”
seed (possibly equal to the seed
argument) that was set for
the random number generator before any sampling was undertaken.
Note that the i
th entry of this seeds
attribute
is the same as the seed
attribute of the i
th entry
of the returned valuel
If nrep==1
then the returned value is a single numeric
scalar which is the log likelihood of the validation data or
NA
if the fitting procedure did not converge for the
training data. It has an attribute seed
which is equal
to the seed
argument or its random replacement. If the
value is NA
then it has a further attribute lastPar
.
(See above.)
Note
If the function fails to fit the model, obtained from the training
data, to the validation data, then the value returned is NA
.
This value will have an attribute lastPar
. This attribute
will in turn have attributes, trnDat
and valDat
,
the training data and validation data which were being used
in the failed fitting procedure. Supplying an appropriate
value of lastPar
enables the continuation of the fitting
procedure, starting from where the procedure previously left off.
See Details for a little more information.
Author(s)
Rolf Turner
r.turner@auckland.ac.nz
References
Celeux, Gilles and Durand, Jean-Baptiste (2008). Selecting
hidden Markov model state number with cross-validated likelihood.
Computational Statistics 23 541–564, DOI
10.1007/s00180-007-0097-1
.
Smyth, Padhraic (2000). Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing 9 63–72.
See Also
eglhmm()
Examples
## Not run:
ids <- paste0("s",1001:1101)
cc <- ccSim[ccSim$id
cc$id <- factor(cc$id)
cvll1 <- vector("list",9)
set.seed(42)
SEEDS <- sample(1:1e6,9)
for(k in 1:9) {
cat("k =",k,"started\n")
fit <- eglhmm(categMC ~ 1, distr="M", method="em", data=cc, K=k,
itmax=1500,cells="id",verb=TRUE)
cvll1[[k]] <- crossval(fit,nrep=5,type=2,seed=SEEDS[k],tolerance=1e-4,
verbose=FALSE,crossVerb=TRUE)
cat("k =",k,"finished\n")
}
## End(Not run)
fit <- eglhmm(y ~ 1, data=Downloads,K=4,distr="P",verb=TRUE,cf="singlecell")
# Use artifically low value of itmax so that crossval() fails to
# fit the model to the training data
cvll2 <- crossval(fit,nrep=5,type=1,verbose=TRUE,seed=322596,itmax=5)
cvll3 <- crossval(fit,type=1,verbose=TRUE,lastPar=attr(cvll2[[1]],"lastPar"))
# So cvll3 carried on, in one instance, from where the first
# attempted fit in cvll2 gave up.