prune {PST} | R Documentation |
Prune a probabilistic suffix tree
Description
Prune a PST, using either a gain function, a maximal depth or a list of nodes to keep or remove. Optionally, nodes are not removed from the tree but tagged as deleted, helping to visualize the pruning process.
Usage
## S4 method for signature 'PSTf'
prune(object, nmin, L, gain, C, keep, drop, state, delete = TRUE, lik = TRUE)
Arguments
object |
a probabilistic suffix tree, i.e., an object of class PSTf. |
nmin |
integer. All strings having counts less than nmin are removed. |
L |
integer. If specified, the tree is cut at depth L, that is, all nodes with depth > L are removed. |
gain |
character. Function for measuring information gain, e.g. "G1" or "G2" as used in the Examples. See Details. |
C |
numeric. Cutoff value to use with the gain function. |
keep |
character. A vector of character strings containing the names of the nodes to keep in the tree. All nodes that are not a suffix of contexts in keep are removed from the tree. |
drop |
character. A vector of character strings containing the names of the nodes to remove from the tree. All nodes that are a suffix of contexts in drop are removed from the tree as well. |
state |
character. All nodes corresponding to contexts which include state are removed from the tree. |
delete |
Logical. If FALSE, the pruned nodes are not removed from the tree but tagged as pruned=FALSE, so that when plotting the pruned tree these nodes will appear surrounded by red lines (the color can be changed). |
lik |
Logical. If TRUE, the log-likelihood of the pruned model, i.e. the likelihood of the training sequences given the model, is computed and stored in the 'logLik' slot of the PST. Setting to FALSE will spare the time required to compute the likelihood. |
Details
The initial tree returned by the pstree function may yield an overly complex model containing all contexts of maximal length $L$ and frequency $N(c) \geq nmin$ found in the learning sample. The pruning stage potentially reduces the number of nodes in the tree, and thus the model complexity. It compares the conditional probabilities associated with a node labelled by a subsequence $c = c_1, c_2, \ldots, c_k$ to the conditional probabilities of its parent node, labelled by the longest suffix of $c$, $c_2, \ldots, c_k$. The general idea is to remove a node if it does not contribute additional information with respect to its parent in predicting the next symbol, that is if $\hat{P}(\sigma | c_1, c_2, \ldots, c_k)$ is not significantly different from $\hat{P}(\sigma | c_2, \ldots, c_k)$ for all $\sigma \in A$, where $A$ is the alphabet.
The pruning procedure starts from the terminal nodes and is applied recursively until all terminal nodes remaining in the tree represent an information gain relative to their parent.
A gain function, whose outcome will determine the pruning decision, is used to compare the two probability distributions. The gain function is driven by a cut-off, and different values of this parameter will yield more or less complex trees. A method for selecting the pruning cut-off is described in the tune help page.
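The bottom-up procedure described above can be sketched in base R on a toy tree. This is only an illustration under stated assumptions, not the PST package's implementation: the list-based tree, the "-" separated context names, the helper names (parent_of, is_terminal, prune_sketch, ratio_gain) and all probabilities are hypothetical, and every node's parent is assumed to be present in the tree.

```r
# Toy tree: names are contexts (symbols separated by "-", oldest first),
# values are next-symbol probability vectors. "e" is the root (empty context).
# All numbers are made up for illustration.
tree <- list(
  "e"     = c(a = 0.50, b = 0.50),
  "a"     = c(a = 0.70, b = 0.30),
  "b"     = c(a = 0.30, b = 0.70),
  "a-a"   = c(a = 0.72, b = 0.28),
  "b-a"   = c(a = 0.20, b = 0.80),
  "a-b"   = c(a = 0.31, b = 0.69),
  "b-a-b" = c(a = 0.30, b = 0.70)
)

# Parent = longest proper suffix of the context (drop the first symbol).
parent_of <- function(ctx) {
  s <- strsplit(ctx, "-", fixed = TRUE)[[1]]
  if (length(s) == 1) "e" else paste(s[-1], collapse = "-")
}

# A node is terminal if no other node has it as parent.
is_terminal <- function(ctx, tree) {
  others <- setdiff(names(tree), c("e", ctx))
  !any(vapply(others, parent_of, character(1)) == ctx)
}

# Repeatedly remove terminal nodes that show no gain relative to their parent,
# until every remaining terminal node represents a gain.
prune_sketch <- function(tree, gain_fun) {
  repeat {
    nodes <- setdiff(names(tree), "e")
    terminal <- nodes[vapply(nodes, is_terminal, logical(1), tree = tree)]
    keep_it <- vapply(terminal, function(ctx) {
      gain_fun(tree[[ctx]], tree[[parent_of(ctx)]])
    }, logical(1))
    to_drop <- terminal[!keep_it]
    if (length(to_drop) == 0) break
    tree[to_drop] <- NULL
  }
  tree
}

# Hypothetical gain test: some ratio node/parent is >= C or <= 1/C (C = 1.2).
ratio_gain <- function(p_node, p_parent, C = 1.2) {
  r <- p_node / p_parent
  any(r >= C | r <= 1 / C)
}

pruned <- prune_sketch(tree, ratio_gain)
names(pruned)
```

In this toy tree, removing "b-a-b" in the first pass turns "a-b" into a terminal node, which is then examined (and removed) in the second pass; this is the recursive aspect of the procedure.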
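A first implemented gain function, which is used by the Learn-PSA algorithm, is based on the ratio between $\hat{P}(\sigma | c_1, c_2, \ldots, c_k)$ and $\hat{P}(\sigma | c_2, \ldots, c_k)$ for each $\sigma \in A$, where $c = c_1, c_2, \ldots, c_k$ is the context labelling the node and $A$ is the alphabet. A node represents an information gain if for any symbol $\sigma \in A$ the ratio is greater than or equal to the cut-off $C$ or lower than or equal to $1/C$, that is if

$$G_1(c) = \sum_{\sigma \in A} 1\left[ \frac{\hat{P}(\sigma | c_1, c_2, \ldots, c_k)}{\hat{P}(\sigma | c_2, \ldots, c_k)} \geq C \;\text{ or }\; \frac{\hat{P}(\sigma | c_1, c_2, \ldots, c_k)}{\hat{P}(\sigma | c_2, \ldots, c_k)} \leq \frac{1}{C} \right] \geq 1$$

where $C$ is a user defined cut-off value. Nodes that do not satisfy the above condition are pruned. For $C = 1$ no node is removed, since even a node having a next-symbol probability distribution similar to that of its parent satisfies the above condition.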
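As a minimal sketch of this ratio-based test in base R (not the package's internal code): the function name g1_has_gain and the probability vectors are hypothetical, and all probabilities are assumed to be strictly positive so that the ratios are well defined.

```r
# Ratio-based gain test: TRUE if, for at least one symbol, the ratio between
# the node's and the parent's next-symbol probability is >= C or <= 1/C.
# p_node, p_parent: probability vectors over the same alphabet (all > 0).
g1_has_gain <- function(p_node, p_parent, C) {
  ratio <- p_node / p_parent
  sum(ratio >= C | ratio <= 1 / C) >= 1
}

p_parent <- c(a = 0.70, b = 0.30)   # made-up parent distribution

near <- g1_has_gain(c(a = 0.72, b = 0.28), p_parent, C = 1.2)  # close to parent
far  <- g1_has_gain(c(a = 0.20, b = 0.80), p_parent, C = 1.2)  # differs from parent
same <- g1_has_gain(p_parent, p_parent, C = 1)                 # identical, C = 1

c(near = near, far = far, same = same)
```

The last call illustrates why no node is removed for a cut-off of 1: a ratio of exactly 1 already meets the condition.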
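The context algorithm uses another gain function, namely

$$G_2(c) = \sum_{\sigma \in A} \hat{P}(\sigma | c) \log\left( \frac{\hat{P}(\sigma | c)}{\hat{P}(\sigma | suf(c))} \right) N(c) > C$$

where $c$ is the context labelling the terminal node, $suf(c)$ is the longest suffix of $c$ (the context labelling its parent node), and $N(c)$ is the number of occurrences of $c$ in the data. The cutoff $C$ is specified on the scale of $\chi^2$-quantiles (Maechler & Buehlmann, 2004)

$$C = \frac{1}{2} \chi^2_{\nu, 1-\alpha}$$

where $\chi^2_{\nu, 1-\alpha}$ is the $(1-\alpha)$ quantile of a $\chi^2$ distribution with $\nu = |A| - 1$ degrees of freedom. The cutoff $C$ is a threshold for the difference of deviances between a tree $S_1$ and its subtree $S_2$ obtained by pruning the terminal node $c$. Typical values for $\alpha$ are 5% and 1%, yielding cutoffs based on the 95% and 99% quantiles respectively. For more details, see Gabadinho & Ritschard (2016).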
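The second gain function and its cutoff can be sketched in base R as follows. This is an illustration under stated assumptions rather than the package's implementation: the helper names g2_gain and g2_cutoff, the probability vectors and the count n are hypothetical, and all probabilities are assumed to be strictly positive. Only qchisq() from the stats package is used.

```r
# Gain of a node relative to its parent: N(c) times the Kullback-Leibler
# divergence between the node's and the parent's next-symbol distributions.
g2_gain <- function(p_node, p_parent, n) {
  sum(p_node * log(p_node / p_parent)) * n
}

# Cutoff on the chi-squared quantile scale, with |A| - 1 degrees of freedom.
g2_cutoff <- function(alpha, alphabet_size) {
  qchisq(1 - alpha, df = alphabet_size - 1) / 2
}

# Two-symbol alphabet and alpha = 5%: same value as qchisq(0.95, 1) / 2
C95 <- g2_cutoff(alpha = 0.05, alphabet_size = 2)

p_parent <- c(a = 0.70, b = 0.30)   # made-up parent distribution
n <- 50                             # made-up count N(c)

gain_far  <- g2_gain(c(a = 0.20, b = 0.80), p_parent, n)  # differs from parent
gain_near <- g2_gain(c(a = 0.72, b = 0.28), p_parent, n)  # close to parent

gain_far > C95    # node would be kept
gain_near > C95   # node would be pruned
```

Because the gain is multiplied by the count, a given difference between the two distributions is more likely to exceed the cutoff for frequently observed contexts than for rare ones.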
Value
A probabilistic suffix tree, i.e., an object of class PSTf.
Author(s)
Alexis Gabadinho
References
Bejerano, G. & Yona, G. (2001). Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17, pp. 23-43.
Gabadinho, A. & Ritschard, G. (2016). Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package. Journal of Statistical Software, 72(3), pp. 1-39.
Maechler, M. & Buehlmann, P. (2004). Variable Length Markov Chains: Methodology, Computing, and Software. Journal of Computational and Graphical Statistics, 13, pp. 435-455.
Ron, D., Singer, Y. & Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, pp. 117-149.
See Also
Examples
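## Load the package; seqdef() is provided by TraMineR
library(PST)
library(TraMineR)
data(s1)
s1.seq <- seqdef(s1)
## Fit the initial tree with maximal depth L=3
S1 <- pstree(s1.seq, L=3, nmin=2, ymin=0.001)
## -- Prune with gain function G1 and cutoff C=1.20;
## -- with delete=FALSE the pruned nodes are tagged, not removed
S1.p1 <- prune(S1, gain="G1", C=1.20, delete=FALSE)
summary(S1.p1)
plot(S1.p1)
## -- Prune with gain function G2 and a cutoff on the chi-squared
## -- quantile scale (alpha = 5%, 1 degree of freedom)
C95 <- qchisq(0.95,1)/2
S1.p2 <- prune(S1, gain="G2", C=C95, delete=FALSE)
plot(S1.p2)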