R: Maximum Likelihood Estimation (MLE) of a semi-Markov chain

fitsmm {smmR}

R Documentation

Maximum Likelihood Estimation (MLE) of a semi-Markov chain

Description

Maximum Likelihood Estimation of a semi-Markov chain starting from one or several sequences. This estimation can be parametric or non-parametric, non-censored, censored at the beginning and/or at the end of the sequence, with one or several trajectories. Several parametric distributions are considered (Uniform, Geometric, Poisson, Discrete Weibull and Negative Binomial).

Usage

fitsmm(
  sequences,
  states,
  type.sojourn = c("fij", "fi", "fj", "f"),
  distr = "nonparametric",
  init.estim = "mle",
  cens.beg = FALSE,
  cens.end = FALSE
)

Arguments

`sequences`	A list of vectors representing the sequences.
`states`	Vector of state space (of length s).
`type.sojourn`	Type of sojourn time (for more explanations, see Details).
`distr`	By default `"nonparametric"` for the non-parametric estimation case. If the parametric estimation case is desired, `distr` should be: A matrix of distributions of dimension `(s, s)` if `type.sojourn = "fij"`; A vector of distributions of length `s` if `type.sojourn = "fi"` or `"fj"`; A distribution if `type.sojourn = "f"`. The distributions to be used in `distr` must be one of `"unif"`, `"geom"`, `"pois"`, `"dweibull"`, `"nbinom"`.
`init.estim`	Optional. `init.estim` gives the method used to estimate the initial distribution. The following methods are proposed: `init.estim = "mle"`: the classical Maximum Likelihood Estimator is used to estimate the initial distribution `init`; `init.estim = "limit"`: the initial distribution is replaced by the limit (stationary) distribution of the semi-Markov chain; `init.estim = "freq"`: the initial distribution is replaced by the frequencies of each state in the sequences; `init.estim = "unif"`: the initial probability of each state is equal to `1 / s`, with `s` the number of states.
`cens.beg`	Optional. A logical value indicating whether or not sequences are censored at the beginning.
`cens.end`	Optional. A logical value indicating whether or not sequences are censored at the end.

Details

This function estimates a semi-Markov model in parametric or non-parametric case, taking into account the type of sojourn time and the censoring described in references. The non-parametric estimation concerns sojourn time distributions defined by the user. For the parametric estimation, several discrete distributions are considered (see below).

The difference between the Markov model and the semi-Markov model concerns the modeling of the sojourn time. With a Markov chain, the sojourn time distribution is modeled by a Geometric distribution (in discrete time). With a semi-Markov chain, the sojourn time can be any arbitrary distribution. In this package, the available distribution for a semi-Markov model are :

Uniform: f(x) = \frac{1}{n} for 1 \le x \le n. n is the parameter;
Geometric: f(x) = \theta (1-\theta)^{x - 1} for x = 1, 2,\ldots,n, 0 < \theta < 1, \theta is the probability of success. \theta is the parameter;
Poisson: f(x) = \frac{\lambda^x exp(-\lambda)}{x!} for x = 0, 1, 2,\ldots,n, with \lambda > 0. \lambda is the parameter;
Discrete Weibull of type 1: f(x)=q^{(x-1)^{\beta}}-q^{x^{\beta}}, x = 1, 2,\ldots,n, with 0 < q < 1, the first parameter and \beta > 0 the second parameter. (q, \beta) are the parameters;
Negative binomial: f(x)=\frac{\Gamma(x+\alpha)}{\Gamma(\alpha) x!} p^{\alpha} (1 - p)^x, for x = 0, 1, 2,\ldots,n, \Gamma is the Gamma function, \alpha is the parameter of overdispersion and p is the probability of success, 0 < p < 1. (\alpha, p) are the parameters;
Non-parametric.

We define :

the semi-Markov kernel q_{ij}(k) = P( J_{m+1} = j, T_{m+1} - T_{m} = k | J_{m} = i );
the transition matrix (p_{trans}(i,j))_{i,j} \in states of the embedded Markov chain J = (J_m)_m, p_{trans}(i,j) = P( J_{m+1} = j | J_m = i );
the initial distribution \mu_i = P(J_1 = i) = P(Z_1 = i), i \in 1, 2, \dots, s;
the conditional sojourn time distributions (f_{ij}(k))_{i,j} \in states,\ k \in N ,\ f_{ij}(k) = P(T_{m+1} - T_m = k | J_m = i, J_{m+1} = j ), f is specified by the argument param in the parametric case and by distr in the non-parametric case.

The maximum likelihood estimation of the transition matrix of the embedded Markov chain is \widehat{p_{trans}}(i,j) = \frac{N_{ij}}{N_{i.}}.

Five methods are proposed for the estimation of the initial distribution :

Maximum Likelihood Estimator:: The Maximum Likelihood Estimator for the initial distribution. The formula is: \widehat{\mu_i} = \frac{Nstart_i}{L}, where Nstart_i is the number of occurences of the word i (of length k) at the beginning of each sequence and L is the number of sequences. This estimator is reliable when the number of sequences L is high.
Limit (stationary) distribution:: The limit (stationary) distribution of the semi-Markov chain is used as a surrogate of the initial distribution.
Frequencies of each state:: The initial distribution is replaced by taking the frequencies of each state in the sequences.
Uniform distribution:: The initial probability of each state is equal to 1 / s, with s, the number of states.

Note that q_{ij}(k) = p_{trans}(i,j) \ f_{ij}(k) in the general case (depending on the present state and on the next state). For particular cases, we replace f_{ij}(k) by f_{i.}(k) (depending on the present state i), f_{.j}(k) (depending on the next state j) and f_{..}(k) (depending neither on the present state nor on the next state).

In this package we can choose different types of sojourn time. Four options are available for the sojourn times:

depending on the present state and on the next state (fij);
depending only on the present state (fi);
depending only on the next state (fj);
depending neither on the current, nor on the next state (f).

If type.sojourn = "fij", distr is a matrix of dimension (s, s) (e.g., if the 1st element of the 2nd column is "pois", that is to say we go from the first state to the second state following a Poisson distribution). If type.sojourn = "fi" or "fj", distr must be a vector (e.g., if the first element of vector is "geom", that is to say we go from (or to) the first state to (or from) any state according to a Geometric distribution). If type.sojourn = "f", distr must be one of "unif", "geom", "pois", "dweibull", "nbinom" (e.g., if distr is equal to "nbinom", that is to say that the sojourn time when going from one state to another state follows a Negative Binomial distribution). For the non-parametric case, distr is equal to "nonparametric" whatever type of sojourn time given.

If the sequence is censored at the beginning and/or at the end, cens.beg must be equal to TRUE and/or cens.end must be equal to TRUE. All the sequences must be censored in the same way.

Value

Returns an object of S3 class smmfit (inheriting from the S3 class smm and smmnonparametric class if distr = "nonparametric" or smmparametric otherwise). The S3 class smmfit contains:

All the attributes of the S3 class smmnonparametric or smmparametric depending on the type of estimation;
An attribute M which is an integer giving the total length of the set of sequences sequences (sum of all the lengths of the list sequences);
An attribute loglik which is a numeric value giving the value of the log-likelihood of the specified semi-Markov model based on the sequences;
An attribute sequences which is equal to the parameter sequences of the function fitsmm (i.e. a list of sequences used to estimate the Markov model).

References

V. S. Barbu, N. Limnios. (2008). Semi-Markov Chains and Hidden Semi-Markov Models Toward Applications - Their Use in Reliability and DNA Analysis. New York: Lecture Notes in Statistics, vol. 191, Springer.

Examples

states <- c("a", "c", "g", "t")
s <- length(states)

# Creation of the initial distribution
vect.init <- c(1 / 4, 1 / 4, 1 / 4, 1 / 4)

# Creation of the transition matrix
pij <- matrix(c(0, 0.2, 0.5, 0.3, 
                0.2, 0, 0.3, 0.5, 
                0.3, 0.5, 0, 0.2, 
                0.4, 0.2, 0.4, 0), 
              ncol = s, byrow = TRUE)

# Creation of the distribution matrix

distr.matrix <- matrix(c(NA, "pois", "geom", "nbinom", 
                         "geom", NA, "pois", "dweibull",
                         "pois", "pois", NA, "geom", 
                         "pois", "geom", "geom", NA), 
                       nrow = s, ncol = s, byrow = TRUE)

# Creation of an array containing the parameters
param1.matrix <- matrix(c(NA, 2, 0.4, 4, 
                          0.7, NA, 5, 0.6, 
                          2, 3, NA, 0.6, 
                          4, 0.3, 0.4, NA), 
                        nrow = s, ncol = s, byrow = TRUE)

param2.matrix <- matrix(c(NA, NA, NA, 0.6, 
                          NA, NA, NA, 0.8, 
                          NA, NA, NA, NA, 
                          NA, NA, NA, NA), 
                        nrow = s, ncol = s, byrow = TRUE)

param.array <- array(c(param1.matrix, param2.matrix), c(s, s, 2))

# Specify the semi-Markov model
semimarkov <- smmparametric(states = states, init = vect.init, ptrans = pij, 
                            type.sojourn = "fij", distr = distr.matrix, 
                            param = param.array)

seqs <- simulate(object = semimarkov, nsim = c(1000, 10000, 2000), seed = 100)

# Estimation of simulated sequences
est <- fitsmm(sequences = seqs, states = states, type.sojourn = "fij", 
              distr = distr.matrix)

[Package smmR version 1.0.3 Index]