R: Cross validate an extended generalised linear hidden Markov...

crossval {eglhmm}

R Documentation

Cross validate an extended generalised linear hidden Markov model.

Description

Calculates a number of cross validation log likelihood values for a hidden Markov model (usually one fitted by the eglhmm() function).

Usage

    crossval(model,data,nrep,frac=0.8,type,id="id",minNcomp=100,
             seed=NULL,crossVerb=FALSE,lastPar=NULL,...)

Arguments

`model`	A list having components with names selected from those of objects returned by `eglhmm()` e.g. `distr`, `theta`, `formula` etc. Typically `model` will be of class `"eglhmm"` and will have been returned by the `eglhmm()` function.
`data`	A data frame containing the observations to which the cross validation are to be fitted. It is of course up to the user to ensure that a specified value of `data` makes sense, i.e. is consistent with the other arguments.
`nrep`	Positive nteger scalar; the number of replications, i.e. the number of cross validation calculations undertaken. If argument `lastPar` (see below) is supplied then `nrep` is ignored and is silently set equal to 1. If `lastPar` is `NULL` then `nrep` must be supplied, otherwise an error is thrown.
`frac`	Postive scaler, less than 1. The fraction of the data randomly selected to be used as `training data` on each replication. (The remaining data, i.e. those data not used as training data, are used as `validation` data.
`type`	Integer scalar, equal to either 1 or 2, which determines the nature of the sampling used to produce the training and validation data. If `type=1` then these data sets are obtained by sampling data points individually. The training data are obtained by setting a fraction `1 - frac` of the observed emission values (those which are not missing already) equal to `NA`. The validation data are the complement of the training data. If `type=2` then the components of `data` are randomly sampled (and used in their entirety, either for training or for validation). The components are determined from the column of `data` the name of which is specified by the argument `id`; if there is no such column, then an error is thrown. Sampling the components means sampling the levels of (the factor) `data[,id]`. Obviously it is sensible to use `type=2` only if `data` has a large number of components. By default this number is required to be at least 100. (See `minNcomp` below.)
`id`	Character scalar specifying the column to be used to determine the individual independent time series that make up the data. Ignored unless `type=2`.
`minNcomp`	Integer scalar specifying the minimum number of components (number of levels of `data["id"]`) that `data` must have in order for `type=2` to used. Ignored if `type` is equal to 1. If the number of components is less than the default value of `minNcomp` (i.e. 100) then it is strongly recommended that `type=1` be used instead.
`seed`	Positive integer scalar to be used as a seed for the random number generator (so as to enable reproducibility). If not supplied, it is randomly chosen from the sequence `1:1e6`. Note that if `nrep > 1` then after this seed is set, a vector `SEEDS` of “auxiliary” seeds, of length `nrep`, is chosen from the sequence `1:1e6` and the seed is set from the corresponding entry of this vector at the start of each replication. If `nrep==1` then the sampling for the single replication that occurs is determined by `seed`.
`crossVerb`	Logical scalar. Should brief “progress reports” (letting the user know what is happening with respect to replicate `repl`, for each `repl`) be produced?
`lastPar`	The last values of the (relevant) fitted paramenters, provided as an attribute of a component of the list returned by the function currently under consideration (i.e. `crossval()`, whenever the process of fitting the model in question to the training data did not converge. These values can be used as starting values so as to carry on with the fitting process from where the previous attempt left off.
`...`	Possible additional arguments to be passed to `eglhmm()` via `update()`, e.g. `itmax`, `tolerance`, ... .

Details

On each replication a random subset comprising frac of the data is selected to serve as training data. The complement of this subset is used as validation data. The model specified by model is fitted to the training data. It is possible to over-ride some of the details of the specifications producing model, via the ... argument of crossval(). After the model is fitted to the training data, the log likelihood of the validation data is calculated on the basis of that fitted model.

If the procedure for fitting a model to the training data fails to converge, then the corresponding entry of the list returned by this function is NA. In this case, the entry is assigned an attribute lastPar, (the estimates of the model parameters that were current when the fitting algorithm terminated) which will in turn have attributes trnDat, valDat (the training data in question and the corresponding validation data), and seed (the value of the seed that was set before the sampling that determined trnDat and valDat took place). The value of seed is SEEDS[i] (if nrep>1 and the entry in question was the ith entry of the returned list) or the value of the seed argument of this function or its random replacement if this argument was not supplied (if nrep==1).

The attribute lastPar enables the user to continue the procedure for fitting a model to the training data, starting from where the procedure, that failed to converge, left off. Continuing the procedure is easily effected by calling crossval() with argument par0 set equal to the lastPar attribute of the relevant entry of the list that was previously returned by this function.

If type==1 then the training and validation data are created in a somewhat subtle manner. The procedure necessitates referring to the “original” data. The data frame that is passed to eglhmm() is a “replicated” version of the original data, with each row of the original data being repeated once for each level of state (and a "state" column — factor — being added to the resulting data frame). If type==2 the procedure is conceptually simpler. The procedure in the type==1 setting is as follows:

Let S be the set of indices of all non-missing values in the column of the original data that contains the emissions.
Select at random a subset V of S so that the size of V is the fraction frac of the size of S.
Let T be the complement in S of V.
The training data are formed by replacing by NA all those values of the emissions column in the original data whose indices are in T.
The validation data are formed by replacing by NA all those values of the emissions column in the original data, whose indices are in V.
Then replicate both the training and validation data in the manner described above.

If type==2 then the training data are formed by selecting at random a fraction frac of the levels of the column of data named "id". (If there is no such column, an error is thrown.) The training data then consist of those rows of data corresponding to the selected levels of id. The validation data then consist of those rows of data which are not in the training data.

Value

If nrep>1 the returned value is list of length nrep. The ith entry of this list is the log likelihood of the validation data with respect to the model fitted to the training data, for the ith random selection of these two subsets. This entry will be NA if the attempt to fit a model to the training data was unsuccessful. The ith entry has an attribute seed (singular) which is the value of the seed that was set prior to the random sampling that chose the training and validation data. If the ith entry is NA it will also have an attribute lastPar which in turn will have attributes trnDat and valDat. See Details.

If nrep>1 then the returned value also has an attribute seeds (plural) which is vector of length nrep+1, consisting of the “auxiliary” seed vector SEEDS (see the argument seed) together with the “over all” seed (possibly equal to the seed argument) that was set for the random number generator before any sampling was undertaken. Note that the ith entry of this seeds attribute is the same as the seed attribute of the ith entry of the returned valuel

If nrep==1 then the returned value is a single numeric scalar which is the log likelihood of the validation data or NA if the fitting procedure did not converge for the training data. It has an attribute seed which is equal to the seed argument or its random replacement. If the value is NA then it has a further attribute lastPar. (See above.)

Note

If the function fails to fit the model, obtained from the training data, to the validation data, then the value returned is NA. This value will have an attribute lastPar. This attribute will in turn have attributes, trnDat and valDat, the training data and validation data which were being used in the failed fitting procedure. Supplying an appropriate value of lastPar enables the continuation of the fitting procedure, starting from where the procedure previously left off. See Details for a little more information.

Author(s)

Rolf Turner r.turner@auckland.ac.nz

References

Celeux, Gilles and Durand, Jean-Baptiste (2008). Selecting hidden Markov model state number with cross-validated likelihood. Computational Statistics 23 541–564, DOI 10.1007/s00180-007-0097-1.

Smyth, Padhraic (2000). Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing 9 63–72.

Examples

## Not run: 
ids   <- paste0("s",1001:1101)
cc  <- ccSim[ccSim$id 
cc$id <- factor(cc$id)
cvll1 <- vector("list",9)
set.seed(42)
SEEDS <- sample(1:1e6,9)
for(k in 1:9) {
    cat("k =",k,"started\n")
    fit  <- eglhmm(categMC ~ 1, distr="M", method="em", data=cc, K=k,
                   itmax=1500,cells="id",verb=TRUE)
    cvll1[[k]] <- crossval(fit,nrep=5,type=2,seed=SEEDS[k],tolerance=1e-4,
                           verbose=FALSE,crossVerb=TRUE)
    cat("k =",k,"finished\n")
}

## End(Not run)
fit   <- eglhmm(y ~ 1, data=Downloads,K=4,distr="P",verb=TRUE,cf="singlecell")
# Use artifically low value of itmax so that crossval() fails to
# fit the model to the training data
cvll2 <- crossval(fit,nrep=5,type=1,verbose=TRUE,seed=322596,itmax=5)
cvll3 <- crossval(fit,type=1,verbose=TRUE,lastPar=attr(cvll2[[1]],"lastPar"))
# So cvll3 carried on, in one instance, from where the first
# attempted fit in cvll2 gave up.

[Package eglhmm version 0.1-3 Index]