crossval {eglhmm}R Documentation

Cross validate an extended generalised linear hidden Markov model.

Description

Calculates a number of cross validation log likelihood values for a hidden Markov model (usually one fitted by the eglhmm() function).

Usage

    crossval(model,data,nrep,frac=0.8,type,id="id",minNcomp=100,
             seed=NULL,crossVerb=FALSE,lastPar=NULL,...)

Arguments

model

A list having components with names selected from those of objects returned by eglhmm() e.g. distr, theta, formula etc. Typically model will be of class "eglhmm" and will have been returned by the eglhmm() function.

data

A data frame containing the observations to which the cross validation are to be fitted.

It is of course up to the user to ensure that a specified value of data makes sense, i.e. is consistent with the other arguments.

nrep

Positive nteger scalar; the number of replications, i.e. the number of cross validation calculations undertaken. If argument lastPar (see below) is supplied then nrep is ignored and is silently set equal to 1. If lastPar is NULL then nrep must be supplied, otherwise an error is thrown.

frac

Postive scaler, less than 1. The fraction of the data randomly selected to be used as training data on each replication. (The remaining data, i.e. those data not used as training data, are used as validation data.

type

Integer scalar, equal to either 1 or 2, which determines the nature of the sampling used to produce the training and validation data. If type=1 then these data sets are obtained by sampling data points individually. The training data are obtained by setting a fraction 1 - frac of the observed emission values (those which are not missing already) equal to NA. The validation data are the complement of the training data.

If type=2 then the components of data are randomly sampled (and used in their entirety, either for training or for validation). The components are determined from the column of data the name of which is specified by the argument id; if there is no such column, then an error is thrown. Sampling the components means sampling the levels of (the factor) data[,id].

Obviously it is sensible to use type=2 only if data has a large number of components. By default this number is required to be at least 100. (See minNcomp below.)

id

Character scalar specifying the column to be used to determine the individual independent time series that make up the data. Ignored unless type=2.

minNcomp

Integer scalar specifying the minimum number of components (number of levels of data["id"]) that data must have in order for type=2 to used. Ignored if type is equal to 1.

If the number of components is less than the default value of minNcomp (i.e. 100) then it is strongly recommended that type=1 be used instead.

seed

Positive integer scalar to be used as a seed for the random number generator (so as to enable reproducibility). If not supplied, it is randomly chosen from the sequence 1:1e6. Note that if nrep > 1 then after this seed is set, a vector SEEDS of “auxiliary” seeds, of length nrep, is chosen from the sequence 1:1e6 and the seed is set from the corresponding entry of this vector at the start of each replication. If nrep==1 then the sampling for the single replication that occurs is determined by seed.

crossVerb

Logical scalar. Should brief “progress reports” (letting the user know what is happening with respect to replicate repl, for each repl) be produced?

lastPar

The last values of the (relevant) fitted paramenters, provided as an attribute of a component of the list returned by the function currently under consideration (i.e. crossval(), whenever the process of fitting the model in question to the training data did not converge. These values can be used as starting values so as to carry on with the fitting process from where the previous attempt left off.

...

Possible additional arguments to be passed to eglhmm() via update(), e.g. itmax, tolerance, ... .

Details

On each replication a random subset comprising frac of the data is selected to serve as training data. The complement of this subset is used as validation data. The model specified by model is fitted to the training data. It is possible to over-ride some of the details of the specifications producing model, via the ... argument of crossval(). After the model is fitted to the training data, the log likelihood of the validation data is calculated on the basis of that fitted model.

If the procedure for fitting a model to the training data fails to converge, then the corresponding entry of the list returned by this function is NA. In this case, the entry is assigned an attribute lastPar, (the estimates of the model parameters that were current when the fitting algorithm terminated) which will in turn have attributes trnDat, valDat (the training data in question and the corresponding validation data), and seed (the value of the seed that was set before the sampling that determined trnDat and valDat took place). The value of seed is SEEDS[i] (if nrep>1 and the entry in question was the ith entry of the returned list) or the value of the seed argument of this function or its random replacement if this argument was not supplied (if nrep==1).

The attribute lastPar enables the user to continue the procedure for fitting a model to the training data, starting from where the procedure, that failed to converge, left off. Continuing the procedure is easily effected by calling crossval() with argument par0 set equal to the lastPar attribute of the relevant entry of the list that was previously returned by this function.

If type==1 then the training and validation data are created in a somewhat subtle manner. The procedure necessitates referring to the “original” data. The data frame that is passed to eglhmm() is a “replicated” version of the original data, with each row of the original data being repeated once for each level of state (and a "state" column — factor — being added to the resulting data frame). If type==2 the procedure is conceptually simpler. The procedure in the type==1 setting is as follows:

Value

If nrep>1 the returned value is list of length nrep. The ith entry of this list is the log likelihood of the validation data with respect to the model fitted to the training data, for the ith random selection of these two subsets. This entry will be NA if the attempt to fit a model to the training data was unsuccessful. The ith entry has an attribute seed (singular) which is the value of the seed that was set prior to the random sampling that chose the training and validation data. If the ith entry is NA it will also have an attribute lastPar which in turn will have attributes trnDat and valDat. See Details.

If nrep>1 then the returned value also has an attribute seeds (plural) which is vector of length nrep+1, consisting of the “auxiliary” seed vector SEEDS (see the argument seed) together with the “over all” seed (possibly equal to the seed argument) that was set for the random number generator before any sampling was undertaken. Note that the ith entry of this seeds attribute is the same as the seed attribute of the ith entry of the returned valuel

If nrep==1 then the returned value is a single numeric scalar which is the log likelihood of the validation data or NA if the fitting procedure did not converge for the training data. It has an attribute seed which is equal to the seed argument or its random replacement. If the value is NA then it has a further attribute lastPar. (See above.)

Note

If the function fails to fit the model, obtained from the training data, to the validation data, then the value returned is NA. This value will have an attribute lastPar. This attribute will in turn have attributes, trnDat and valDat, the training data and validation data which were being used in the failed fitting procedure. Supplying an appropriate value of lastPar enables the continuation of the fitting procedure, starting from where the procedure previously left off. See Details for a little more information.

Author(s)

Rolf Turner r.turner@auckland.ac.nz

References

Celeux, Gilles and Durand, Jean-Baptiste (2008). Selecting hidden Markov model state number with cross-validated likelihood. Computational Statistics 23 541–564, DOI 10.1007/s00180-007-0097-1.

Smyth, Padhraic (2000). Model selection for probabilistic clustering using cross-validated likelihood. Statistics and Computing 9 63–72.

See Also

eglhmm()

Examples

## Not run: 
ids   <- paste0("s",1001:1101)
cc  <- ccSim[ccSim$id 
cc$id <- factor(cc$id)
cvll1 <- vector("list",9)
set.seed(42)
SEEDS <- sample(1:1e6,9)
for(k in 1:9) {
    cat("k =",k,"started\n")
    fit  <- eglhmm(categMC ~ 1, distr="M", method="em", data=cc, K=k,
                   itmax=1500,cells="id",verb=TRUE)
    cvll1[[k]] <- crossval(fit,nrep=5,type=2,seed=SEEDS[k],tolerance=1e-4,
                           verbose=FALSE,crossVerb=TRUE)
    cat("k =",k,"finished\n")
}

## End(Not run)
fit   <- eglhmm(y ~ 1, data=Downloads,K=4,distr="P",verb=TRUE,cf="singlecell")
# Use artifically low value of itmax so that crossval() fails to
# fit the model to the training data
cvll2 <- crossval(fit,nrep=5,type=1,verbose=TRUE,seed=322596,itmax=5)
cvll3 <- crossval(fit,type=1,verbose=TRUE,lastPar=attr(cvll2[[1]],"lastPar"))
# So cvll3 carried on, in one instance, from where the first
# attempted fit in cvll2 gave up.


[Package eglhmm version 0.1-3 Index]