R: Point by point estimates of a k-th order drifting Markov...

fitdmm {drimmR}

R Documentation

Point by point estimates of a k-th order drifting Markov Model

Description

Estimates the d+1 points of support transition matrices and the initial law (over the |E|^{k} words of length k) of a k-th order drifting Markov Model from one or several sequences.

Usage

fitdmm(
  sequences,
  order,
  degree,
  states,
  init.estim = c("mle", "freq", "prod", "stationary", "unif"),
  fit.method = c("sum"),
  ncpu = 2
)

Arguments

`sequences`	A list of character vector(s) representing one (several) sequence(s)
`order`	Order of the Markov chain
`degree`	Degree of the polynomials (e.g., linear drifting if `degree`=1, etc.)
`states`	Vector of the state space, of length s > 1
`init.estim`	Default = "mle". Method used to estimate the initial law. If `init.estim` = "mle", the classical Maximum Likelihood Estimator is used. If `init.estim` = "freq", the initial distribution is estimated by the frequencies of the words of length k over all sequences. If `init.estim` = "prod", it is estimated by the product of the frequencies of each letter (over all sequences) in the word of length k. If `init.estim` = "stationary", it is estimated by the stationary law of the point of support transition matrices of each letter. If `init.estim` = "unif", the probability of each letter is set to `\frac{1}{s}`. Alternatively, `init.estim` can be a user-supplied vector of length `|E|^k`. See Details for the formulas.
`fit.method`	If `sequences` is a list of several character vectors of the same length, the usual least squares estimator (LSE) over the sample paths is used when `fit.method` = "sum" (a list containing a single character vector is a special case).
`ncpu`	Default = 2. Number of cores used to parallelize the computation. If `ncpu` = -1, all available cores are used.
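The constraints stated above for `sequences` and `states` can be checked before fitting. The following is a minimal sketch in base R. `check_sequences` is a hypothetical helper that is not part of drimmR, the toy sequences are made up, and the requirement that every symbol belongs to `states` is an assumption rather than documented behaviour.

```r
# Hypothetical helper (not part of drimmR): validate input before fitting.
check_sequences <- function(sequences, states) {
  if (!is.list(sequences) ||
      !all(vapply(sequences, is.character, logical(1)))) {
    stop("`sequences` must be a list of character vectors")
  }
  # The state space must have length s > 1.
  if (length(states) < 2) {
    stop("`states` must contain more than one state")
  }
  # Assumption: every observed symbol should be one of the declared states.
  if (!all(unlist(sequences) %in% states)) {
    stop("`sequences` contains symbols that are not in `states`")
  }
  # fit.method = "sum" is described for sequences of the same length.
  if (length(unique(lengths(sequences))) != 1L) {
    stop("all sequences must have the same length")
  }
  invisible(TRUE)
}

states <- c("a", "c", "g", "t")
toy <- list(c("a", "c", "g", "t"), c("t", "t", "a", "c"))
check_sequences(toy, states)
```

The helper stops with an error message on the first violated constraint and returns TRUE invisibly otherwise.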

Details

The fitdmm function creates a drifting Markov model object dmm.

Let E = {1, \ldots, s}, s < \infty, be the finite state space of a random system whose time evolution is governed by a discrete-time stochastic process with values in E. A sequence X_0, X_1, \ldots, X_n with state space E is said to be a linear drifting Markov chain (of order 1) of length n between the Markov transition matrices \Pi_0 and \Pi_1 if the distribution of X_t, t = 1, \ldots, n, is defined by P(X_t = v \mid X_{t-1} = u, X_{t-2}, \ldots) = \Pi_{\frac{t}{n}}(u, v), \; u, v \in E, where \Pi_{\frac{t}{n}}(u, v) = (1 - \frac{t}{n}) \Pi_0(u, v) + \frac{t}{n} \Pi_1(u, v), \; u, v \in E. The linear drifting Markov model of order 1 can be generalized to a polynomial drifting Markov model of order k and degree d. Let \Pi_{\frac{i}{d}} = (\Pi_{\frac{i}{d}}(u_1, \dots, u_k, v))_{u_1, \dots, u_k, v \in E}, i = 0, \ldots, d, be d + 1 Markov transition matrices (of order k) over a state space E.
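The linear drift formula can be illustrated numerically. The following is a minimal sketch in base R for the order 1, degree 1 case only. The matrices `Pi0` and `Pi1`, the chain length `n` and the function name `linear_drift` are made up for illustration and are not part of drimmR.

```r
# Hypothetical support matrices on a 2-state space (each row sums to 1).
Pi0 <- matrix(c(0.9, 0.1,
                0.2, 0.8), nrow = 2, byrow = TRUE)
Pi1 <- matrix(c(0.3, 0.7,
                0.6, 0.4), nrow = 2, byrow = TRUE)

# Transition matrix at position t of a chain of length n:
# Pi_{t/n} = (1 - t/n) * Pi0 + (t/n) * Pi1
linear_drift <- function(t, n, Pi0, Pi1) {
  w <- t / n
  (1 - w) * Pi0 + w * Pi1
}

n <- 10
Pi_mid <- linear_drift(5, n, Pi0, Pi1)  # halfway between Pi0 and Pi1
print(Pi_mid)
print(rowSums(Pi_mid))                  # rows remain probability vectors
```

Because the weights 1 - t/n and t/n are non-negative and sum to 1, every interpolated matrix is again a transition matrix, with `linear_drift(0, ...)` returning `Pi0` and `linear_drift(n, ...)` returning `Pi1`.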

The estimation of DMMs is carried out for four different types of data:

One can observe a single sample path:

It is denoted by H(m,n):= (X_0,X_1, \ldots,X_{m}), where m denotes the length of the sample path and n the length of the drifting Markov chain. Two cases can be considered:

m=n (a complete sample path),
m < n (an incomplete sample path).

One can also observe H i.i.d. sample paths:

They are denoted by H_i(m_i,n_i), i=1, \ldots, H. Two cases are considered:

m_i=n_i=n \forall i=1, \ldots, H (complete sample paths of drifting Markov chains of the same length),
n_i=n \forall i=1, \ldots, H (incomplete sample paths of drifting Markov chains of the same length). In this case, the usual LSE over the sample paths is used.

The initial distribution of a k-th order drifting Markov Model is defined as \mu_i = P(X_1 = i). The initial distribution of the first k letters can be freely specified by the user, but five methods are proposed for estimating it:

Estimation based on the Maximum Likelihood Estimator: The Maximum Likelihood Estimator for the initial distribution. The formula is \widehat{\mu_i} = \frac{Nstart_i}{L}, where Nstart_i is the number of occurrences of the word i (of length k) at the beginning of each sequence and L is the number of sequences. This estimator is reliable when the number of sequences L is high.
Estimation based on the frequency: The initial distribution is estimated by taking the frequencies of the words of length k over all sequences. The formula is \widehat{\mu_i} = \frac{N_i}{N}, where N_i is the number of occurrences of the word i (of length k) in the sequences and N is the sum of the lengths of the sequences.
Estimation based on the product of the frequencies of each state: The initial distribution is estimated by using the product of the frequencies of each state (over all the sequences) in the word of length k.
Estimation based on the stationary law of the point of support transition matrix for a word of length k: The initial distribution is estimated using \mu(\Pi_{\frac{k-1}{n}})
Estimation based on the uniform law: \frac{1}{s}
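Four of the estimators above ("mle", "freq", "prod" and "unif") can be sketched in base R directly from their formulas. This is an illustration under stated assumptions, not the package's internal code. The toy sequences and all object names are hypothetical, and extending the uniform law from letters to words as (1/s)^k is an assumption.

```r
# Toy data: three hypothetical sequences over the states a, c, g, t.
seqs <- list(c("a", "c", "g", "t", "a", "c"),
             c("a", "c", "c", "g", "t", "t"),
             c("g", "t", "a", "c", "g", "a"))
states <- c("a", "c", "g", "t")
s <- length(states)
k <- 2                   # word length (order of the chain)
L <- length(seqs)        # number of sequences
N <- sum(lengths(seqs))  # sum of the sequence lengths

# All |E|^k words of length k, one row per word.
grid <- as.matrix(expand.grid(rep(list(states), k), stringsAsFactors = FALSE))
words <- apply(grid, 1, paste, collapse = "")

# Word of length k starting at position i of sequence x.
word_at <- function(x, i, k) paste(x[i:(i + k - 1)], collapse = "")

# "mle": Nstart_i / L, the share of sequences that start with word i.
first_words <- vapply(seqs, function(x) word_at(x, 1, k), character(1))
mu_mle <- table(factor(first_words, levels = words)) / L

# "freq": N_i / N, with N the sum of the sequence lengths as in the formula.
# With this denominator the estimates sum to slightly less than 1 when k > 1,
# because a sequence of length n contains only n - k + 1 words.
all_words <- unlist(lapply(seqs, function(x) {
  vapply(seq_len(length(x) - k + 1), function(i) word_at(x, i, k), character(1))
}))
mu_freq <- table(factor(all_words, levels = words)) / N

# "prod": product of the single-letter frequencies of the letters in the word.
letter_freq <- table(factor(unlist(seqs), levels = states)) / N
mu_prod <- apply(grid, 1, function(w) prod(letter_freq[w]))
names(mu_prod) <- words

# "unif": 1/s per letter; (1/s)^k per word is an assumption made here.
mu_unif <- setNames(rep((1 / s)^k, length(words)), words)

print(mu_mle[mu_mle > 0])
print(round(mu_prod, 4))
```

In this toy data set only three sequences are available, so "mle" puts all its mass on the two words that start a sequence, which illustrates why that estimator is recommended only when L is large.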

Value

An object of class dmm

Author(s)

Geoffray Brelurut, Alexandre Seiller

References

Barbu VS, Vergne N (2018). “Reliability and survival analysis for drifting Markov models: modelling and estimation.” Methodology and Computing in Applied Probability, 1–33. doi: 10.1007/s11009-018-9682-8, https://doi.org/10.1007/s11009-018-9682-8.

Vergne N (2008). “Drifting Markov models with polynomial drift and applications to DNA sequences.” Statistical Applications in Genetics and Molecular Biology, 7(1). doi: 10.2202/1544-6115.1326, https://doi.org/10.2202/1544-6115.1326.

Examples

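# Load the lambda data set shipped with drimmR
data(lambda, package = "drimmR")
states <- c("a", "c", "g", "t")
order <- 1   # first-order Markov chain
degree <- 1  # linear drift
# Fit the model, estimating the initial law from word frequencies
fitdmm(lambda, order, degree, states, init.estim = "freq", fit.method = "sum")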

[Package drimmR version 1.0.1 Index]