fitdmm {drimmR}    R Documentation

Point by point estimates of a k-th order drifting Markov Model

Description

Estimation of the d+1 points of support transition matrices and of the $|E|^k$ initial law of a k-th order drifting Markov Model, starting from one or several sequences.

Usage

fitdmm(
  sequences,
  order,
  degree,
  states,
  init.estim = c("mle", "freq", "prod", "stationary", "unif"),
  fit.method = c("sum"),
  ncpu = 2
)

Arguments

sequences

A list of character vector(s) representing one (several) sequence(s)

order

Order of the Markov chain

degree

Degree of the polynomials (e.g., linear drifting if degree=1, etc.)

states

Vector of the state space, of length s > 1

init.estim

Default="mle". Method used to estimate the initial law. If init.estim = "mle", the classical Maximum Likelihood Estimator is used. If init.estim = "freq", the initial distribution is estimated by taking the frequencies of the words of length k over all sequences. If init.estim = "prod", it is estimated by using the product of the frequencies of each letter (over all the sequences) in the word of length k. If init.estim = "stationary", it is estimated by using the stationary law of the point of support transition matrix of each letter. If init.estim = "unif", the initial probability of each letter is set to $\frac{1}{s}$. Alternatively, init.estim can be a user-supplied vector of length $|E|^k$. See Details for the formulas.

fit.method

If sequences is a list of several character vectors of the same length, the usual LSE over the sample paths is proposed when fit.method="sum" (a list of a single character vector is its special case).

ncpu

Default=2. Number of cores used to parallelize the computation. If ncpu=-1, all available cores are used.

Details

The fitdmm function creates a drifting Markov model object dmm.

Let $E = \{1, \ldots, s\}$, $s < \infty$, be the finite state space of a random system whose time evolution is governed by a discrete-time stochastic process with values in $E$. A sequence $X_0, X_1, \ldots, X_n$ with state space $E = \{1, 2, \ldots, s\}$ is said to be a linear drifting Markov chain (of order 1) of length $n$ between the Markov transition matrices $\Pi_0$ and $\Pi_1$ if the distribution of $X_t$, $t = 1, \ldots, n$, is defined by $P(X_t = v \mid X_{t-1} = u, X_{t-2}, \ldots) = \Pi_{\frac{t}{n}}(u, v)$, $u, v \in E$, where $\Pi_{\frac{t}{n}}(u, v) = (1 - \frac{t}{n}) \Pi_0(u, v) + \frac{t}{n} \Pi_1(u, v)$, $u, v \in E$.

The linear drifting Markov model of order 1 can be generalized to a polynomial drifting Markov model of order $k$ and degree $d$. Let $\Pi_{\frac{i}{d}} = (\Pi_{\frac{i}{d}}(u_1, \dots, u_k, v))_{u_1, \dots, u_k, v \in E}$, $i = 0, \ldots, d$, be $d+1$ Markov transition matrices (of order $k$) over a state space $E$.
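The linear drift formula can be illustrated with a minimal base R sketch. The two 2-state matrices and the helper name drift_matrix are hypothetical and chosen only for illustration; this is not the code used inside fitdmm.

```r
# Hypothetical 2-state support matrices (each row sums to 1); illustrative values only.
Pi0 <- matrix(c(0.9, 0.1,
                0.2, 0.8), nrow = 2, byrow = TRUE)
Pi1 <- matrix(c(0.3, 0.7,
                0.6, 0.4), nrow = 2, byrow = TRUE)

# Transition matrix at position pos of a chain of length n:
# Pi_{pos/n} = (1 - pos/n) * Pi0 + (pos/n) * Pi1
drift_matrix <- function(pos, n, Pi0, Pi1) {
  w <- pos / n
  (1 - w) * Pi0 + w * Pi1
}

n <- 10
Pi_mid <- drift_matrix(5, n, Pi0, Pi1)
rowSums(Pi_mid)  # a convex combination of stochastic matrices stays stochastic
```

At pos = 0 the sketch returns Pi0 and at pos = n it returns Pi1, which matches the role of the two points of support in the definition above.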

The estimation of DMMs is carried out for 4 different types of data :

One can observe one sample path :

It is denoted by $H(m, n) := (X_0, X_1, \ldots, X_m)$, where $m$ denotes the length of the sample path and $n$ the length of the drifting Markov chain. Two cases can be considered:

  1. m=n (a complete sample path),

  2. m < n (an incomplete sample path).

One can also observe $H$ i.i.d. sample paths :

They are denoted by $H_i(m_i, n_i)$, $i = 1, \ldots, H$. Two cases are considered :

  1. $m_i = n_i = n$ for all $i = 1, \ldots, H$ (complete sample paths of drifting Markov chains of the same length),

  2. $n_i = n$ for all $i = 1, \ldots, H$ (incomplete sample paths of drifting Markov chains of the same length). In this case, the usual LSE over the sample paths is used.

The initial distribution of a k-th order drifting Markov Model is defined as $\mu_i = P(X_1 = i)$. The initial distribution of the first k letters is freely customisable by the user, but five methods are proposed for its estimation :

Estimation based on the Maximum Likelihood Estimator:

The Maximum Likelihood Estimator for the initial distribution. The formula is: $\widehat{\mu_i} = \frac{Nstart_i}{L}$, where $Nstart_i$ is the number of occurrences of the word $i$ (of length $k$) at the beginning of each sequence and $L$ is the number of sequences. This estimator is reliable when the number of sequences $L$ is high.
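The estimator above can be sketched in base R as follows. The toy sequences are hypothetical, and the code only illustrates the formula; it is not taken from the drimmR sources.

```r
# Hypothetical toy sequences over the states a, c, g, t.
seqs <- list(
  c("a", "c", "g", "t", "a"),
  c("a", "c", "c", "g", "t"),
  c("g", "t", "a", "a", "c")
)
k <- 2  # order of the chain, so initial words have length k

# First word of length k in each sequence
start_words <- vapply(seqs, function(x) paste(x[1:k], collapse = ""), character(1))

# mu_hat_i = Nstart_i / L, with L the number of sequences
mu_mle <- table(start_words) / length(seqs)
mu_mle
```

Words that never start a sequence do not appear in the table and implicitly receive probability 0, which is one reason the estimator needs many sequences to be reliable.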

Estimation based on the frequency:

The initial distribution is estimated by taking the frequencies of the words of length $k$ over all sequences. The formula is $\widehat{\mu_i} = \frac{N_i}{N}$, where $N_i$ is the number of occurrences of the word $i$ (of length $k$) in the sequences and $N$ is the sum of the lengths of the sequences.
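A minimal base R sketch of this frequency estimate is given below. It assumes that words are counted with overlap (every position that starts a complete word of length k); this counting convention, the toy sequences and the helper words_of are assumptions made for illustration and may differ from what fitdmm does internally.

```r
# Hypothetical toy sequences over the states a, c, g, t.
seqs <- list(
  c("a", "c", "g", "t", "a"),
  c("a", "c", "c", "g", "t"),
  c("g", "t", "a", "a", "c")
)
k <- 2

# All overlapping words of length k in one sequence (assumed counting convention)
words_of <- function(x, k) {
  starts <- seq_len(length(x) - k + 1)
  vapply(starts, function(i) paste(x[i:(i + k - 1)], collapse = ""), character(1))
}

all_words <- unlist(lapply(seqs, words_of, k = k))
N_i <- table(all_words)        # occurrences of each word of length k
N <- sum(lengths(seqs))        # sum of the sequence lengths, as in the formula
mu_freq <- N_i / N
mu_freq
```

Under this counting convention and for k > 1, the counts add up to fewer than N words, so the estimates sum to slightly less than 1; a renormalisation step would be needed if a proper probability vector is required.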

Estimation based on the product of the frequencies of each state:

The initial distribution is estimated by using the product of the frequencies of each state (over all the sequences) in the word of length $k$.
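As a sketch of the product rule, the base R code below computes single-state frequencies and multiplies them along each word of length k. The toy sequences are hypothetical, and the word ordering produced by expand.grid is an arbitrary choice made here, not necessarily the ordering used by fitdmm.

```r
# Hypothetical toy sequences over the states a, c, g, t.
seqs <- list(
  c("a", "c", "g", "t", "a"),
  c("a", "c", "c", "g", "t"),
  c("g", "t", "a", "a", "c")
)
states <- c("a", "c", "g", "t")
k <- 2

# Frequency of each single state over all sequences
letters_all <- factor(unlist(seqs), levels = states)
freq <- table(letters_all) / length(letters_all)

# Enumerate all s^k words of length k and multiply the state frequencies
grid <- expand.grid(rep(list(states), k), stringsAsFactors = FALSE)
mu_prod <- apply(grid, 1, function(w) prod(freq[w]))
names(mu_prod) <- apply(grid, 1, paste, collapse = "")

sum(mu_prod)  # the products over all words sum to 1
```

Because the products factorise over positions, the resulting vector is always a proper probability distribution over the $s^k$ words.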

Estimation based on the stationary law of point of support transition matrix for a word of length k :

The initial distribution is estimated using $\mu(\Pi_{\frac{k-1}{n}})$.
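For intuition, the stationary law of a single transition matrix can be computed in base R as sketched below. The example is restricted to a first-order chain, where the transition matrix is square; the matrix P and the helper stationary_law are hypothetical, and the handling of higher orders in fitdmm is not shown here.

```r
# Hypothetical first-order transition matrix (rows sum to 1).
P <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)

# Stationary law: the left eigenvector of P for eigenvalue 1, normalised to sum to 1
stationary_law <- function(P) {
  e <- eigen(t(P))
  v <- Re(e$vectors[, which.min(abs(e$values - 1))])
  v / sum(v)
}

mu_stat <- stationary_law(P)
mu_stat %*% P  # equals mu_stat up to numerical error
```

Normalising by the sum also removes the arbitrary sign of the eigenvector returned by eigen.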

Estimation based on the uniform law :

The initial probability of each state is $\frac{1}{s}$.

Value

An object of class dmm

Author(s)

Geoffray Brelurut, Alexandre Seiller

References

Barbu VS, Vergne N (2018). “Reliability and survival analysis for drifting Markov models: modelling and estimation.” Methodology and Computing in Applied Probability, 1–33. doi: 10.1007/s11009-018-9682-8, https://doi.org/10.1007/s11009-018-9682-8.

Vergne N (2008). “Drifting Markov models with polynomial drift and applications to DNA sequences.” Statistical Applications in Genetics and Molecular Biology, 7(1). doi: 10.2202/1544-6115.1326, https://doi.org/10.2202/1544-6115.1326.

Examples

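library(drimmR)
# Load the lambda dataset shipped with drimmR
data(lambda, package = "drimmR")
states <- c("a", "c", "g", "t")
order <- 1   # first-order chain
degree <- 1  # linear drift
# Fit the model, estimating the initial law from word frequencies
fitdmm(lambda, order, degree, states, init.estim = "freq", fit.method = "sum")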

[Package drimmR version 1.0.1 Index]