score {rEMM} | R Documentation |
Score a New Sequence Given an EMM
Description
Calculates a score of how likely it is that a new sequence was generated by the same process as the sequences used to build the EMM.
Usage
## S4 method for signature 'EMM,matrix'
score(x, newdata, method = c("product", "log_sum", "sum",
"log_odds", "supported_transitions", "supported_states",
"sum_transitions", "log_loss", "likelihood", "log_likelihood", "AIC"),
match_cluster = "exact", random = FALSE, prior = TRUE, normalize = TRUE,
initial_transition = FALSE, threshold = NA)
## S4 method for signature 'EMM,EMM'
score(x, newdata, method = c("product", "log_sum", "sum",
"supported_transitions"), match_cluster = "exact", random = FALSE, prior = TRUE,
initial_transition = FALSE)
Arguments
x |
an |
newdata |
sequence or another |
method |
method to calculate the score (see details) |
match_cluster |
do the new observations have to fall within
the threshold of the cluster ( |
random |
logical; should the order of newdata be randomized? Can be used to compare the score with the actual score. |
prior |
logical; add one to each transition count. This is equal to start with a count of one for each transition, i.e. initially all transitions are equally likely. It prevents the product of probabilities to be zero if a transition was never observed. |
normalize |
logical; normalize the score by the length of the sequence. |
initial_transition |
logical; include the initial transition in the computation? |
threshold |
minimum count threshold used by supported transitions and supported states. |
Details
The scores for a new sequence of length
can be computed
by the following methods. For
match_cluster="exact"
or "nn"
:
- "product"
-
Product of transition probabilities along the path of
in the model. A single missing transition (transition probability of zero) will result in a score of 0. Use
prior
to avoid this.where
is the transition probability between the state representing positions
and
in the sequence.
- "sum"
-
Average of transition probabilities along the path of
in the model.
- "log_sum"
-
Average of the log of the transition probabilities along the path of
in the model. The ranking of the scores is equivalent to the product of probabilities, however, the calculation is more reliable since the product of probabilities might become a very small number.
A single missing transition (transition probability of zero) will result in a score of neg. infinity. Use
prior
to avoid this. - "supported_transitions"
Fraction of transitions in the new sequence
supported (present) in the model after assigning each data point in
to a state in the model.
- "supported_states"
Fraction of points in the new sequence
for which a state (cluster) exists in the model.
match_cluster
is always"exact"
because for"nn"
this measure would always give 1. Note that this measure ignores transition information.If threshold is given, then only states with a count greater than the given threshold are counted as supported.
- "sum_transitions"
Sum of the counts on the edges in the model on the path of sequence
normalized by the total number of transition counts in the model.
where
is the transition count between the state representing positions
and
in the sequence.
If threshold is given, then only transitions with a count greater than the given threshold are counted as supported.
- "likelihood", "log_likelihood"
The likelihood of the model given the new data is the unnormalized product score (product of transition probabilities).
- "log_loss"
The average log loss is defined as
It represents the average compression rate of the new sequence given the model.
- "AIC"
Akaike Information Criterion corrected for finite sample size.
where
and
is the model complexity measured by the number of non-zero entries in the transition matrix. We use the likelihood of the model given by the proportion of supported transitions. AIC can be used for model selection where the smallest value indicates the preferred model.
where
represents the
-th data point in the new sequence,
is the transition probability from state
to state
in the model,
is the state the
-th data point (
) in
the new sequence is assigned to.
is an indicator function which is 0 for
and 1 otherwise.
For match_cluster="weighted"
:
- "product"
-
Weighted version of the product of probabilities. The weight is the similarity between a new data point and the state in the model it is assigned to.
- "sum"
-
Weighted version of the sum of probabilities.
- "log_sum"
-
Weighted version of the sum of the log of probabilities.
- "supported_states"
-
Same as
"supported_states"
but instead of counting the supported states, the similarityis used as a weight. Threshold is not implemented.
where is a modified and normalized
similarity function given by
where
is the distance measure and
is the threshold that
was used to create the model.
Value
A scalar score value.
See Also
transition
to access transition probabilities
and find_clusters
for assigning observations to states/clusters.
Examples
data("EMMsim")
emm <- EMM(threshold = .2)
emm <- build(emm, EMMsim_train)
# default is method "product". The score is much higher compared to a randomized order.
score(emm, EMMsim_test)
score(emm, EMMsim_test, random = TRUE)
### create shuffled data (destroy temporal relationship)
### and create noisy data
test_shuffled <- EMMsim_test[sample(1:nrow(EMMsim_test)), ]
test_noise <- jitter(EMMsim_test, amount = .3)
### helper for plotting
mybars <- function(...) {
oldpar <- par(mar = c(5, 10, 4, 2))
ss <- rbind(...)
barplot(
ss[, ncol(ss):1],
xlim = c(-1, 4),
beside = TRUE,
horiz = TRUE,
las = 2,
legend = rownames(ss)
)
par(oldpar)
}
### compare various scores
methods <- c(
"product",
"sum",
"log_sum",
"supported_states",
"supported_transitions",
"sum_transitions",
"log_loss",
"likelihood"
)
### default is exact matching
clean <-
sapply(
methods,
FUN = function(m)
score(emm, EMMsim_test, method = m)
)
shuffled <-
sapply(
methods,
FUN = function(m)
score(emm, test_shuffled, method = m)
)
noise <-
sapply(
methods,
FUN = function(m)
score(emm, test_noise, method = m)
)
mybars(shuffled, noise, clean)
### weighted matching is better for noisy data
clean <-
sapply(
methods,
FUN = function(m)
score(emm, EMMsim_test, method = m,
match = "weighted")
)
shuffled <-
sapply(
methods,
FUN = function(m)
score(emm, test_shuffled, method = m,
match = "weighted")
)
noise <-
sapply(
methods,
FUN = function(m)
score(emm, test_noise, method = m,
match = "weighted")
)
mybars(shuffled, noise, clean)