fitmm {smmR} | R Documentation |
Maximum Likelihood Estimation (MLE) of a k-th order Markov chain
Description
Maximum Likelihood Estimation of the transition matrix and initial distribution of a k-th order Markov chain starting from one or several sequences.
Usage
fitmm(sequences, states, k = 1, init.estim = "mle")
Arguments
sequences |
A list of vectors representing the sequences. |
states |
Vector of state space (of length s). |
k |
Order of the Markov chain. |
init.estim |
Optional.
|
Details
Let X_1, X_2, ..., X_n
be a trajectory of length n
of
the Markov chain X = (X_m)_{m \in N}
of order k = 1
with
transition matrix p_{trans}(i,j) = P(X_{m+1} = j | X_m = i)
. The
maximum likelihood estimation of the transition matrix is
\widehat{p_{trans}}(i,j) = \frac{N_{ij}}{N_{i.}}
, where N_{ij}
is the number of transitions from state i
to state j
and
N_{i.}
is the number of transition from state i
to any state.
For k > 1
we have similar expressions.
The initial distribution of a k-th order Markov chain is defined as
\mu_i = P(X_1 = i)
. Five methods are proposed for the estimation
of the latter :
- Maximum Likelihood Estimator:
The Maximum Likelihood Estimator for the initial distribution. The formula is:
\widehat{\mu_i} = \frac{Nstart_i}{L}
, whereNstart_i
is the number of occurences of the wordi
(of lengthk
) at the beginning of each sequence andL
is the number of sequences. This estimator is reliable when the number of sequencesL
is high.- Stationary distribution:
The stationary distribution is used as a surrogate of the initial distribution. If the order of the Markov chain is more than one, the transition matrix is converted into a square block matrix in order to estimate the stationary distribution. This method may take time if the order of the Markov chain is high (more than three (3)).
- Frequencies of each word:
The initial distribution is estimated by taking the frequencies of the words of length
k
for all sequences. The formula is\widehat{\mu_i} = \frac{N_i}{N}
, whereN_i
is the number of occurences of the wordi
(of lengthk
) in the sequences andN
is the sum of the lengths of the sequences.- Product of the frequencies of each state:
The initial distribution is estimated by using the product of the frequencies of each state (for all the sequences) in the word of length
k
.- Uniform distribution:
The initial probability of each state is equal to
1 / s
, withs
, the number of states.
Value
An object of class S3 mmfit
(inheriting from the S3 class mm).
The S3 class mmfit
contains:
All the attributes of the S3 class mm;
An attribute
M
which is an integer giving the total length of the set of sequencessequences
(sum of all the lengths of the listsequences
);An attribute
loglik
which is a numeric value giving the value of the log-likelihood of the specified Markov model based on thesequences
;An attribute
sequences
which is equal to the parametersequences
of the functionfitmm
(i.e. a list of sequences used to estimate the Markov model).
See Also
Examples
states <- c("a", "c", "g", "t")
s <- length(states)
k <- 2
init <- rep.int(1 / s ^ k, s ^ k)
p <- matrix(0.25, nrow = s ^ k, ncol = s)
# Specify a Markov model of order 2
markov <- mm(states = states, init = init, ptrans = p, k = k)
seqs <- simulate(object = markov, nsim = c(1000, 10000, 2000), seed = 150)
est <- fitmm(sequences = seqs, states = states, k = 2)