R: PST based pattern mining

pmine {PST}

R Documentation

PST based pattern mining

Description

Mine for (sub)sequences satisfying user-defined criteria in a state sequence object.

Usage

## S4 method for signature 'PSTf,stslist'
pmine(object, data, l, pmin=0, pmax=1, prefix, lag, average=FALSE,
output="sequences", with.prefix=TRUE, sorted=TRUE, decreasing=TRUE, score.norm=FALSE)

Arguments

`object`	A fitted PST, that is an object of class PSTf as returned by the `pstree` or `prune` method.
`data`	A sequence object of class 'stslist' as defined with the `seqdef` function of the `TraMineR` library.
`l`	integer. Length of the subsequence to search for.
`pmin`	numeric. (Sub)sequences having an average or per-state probability greater than or equal to `pmin` are selected. Defaults to 0, meaning no lower threshold for the probability.
`pmax`	numeric. (Sub)sequences having an average or per-state probability less than or equal to `pmax` are selected. Defaults to 1, meaning no upper threshold for the probability.
`prefix`	character. Subsequences are searched in sequences starting with `'prefix'`, where `'prefix'` is a string representing a subsequence with states separated by `'-'`. This option can be used to search for the most likely patterns in sequences starting with `'prefix'`.
`lag`	integer. The `lag` first states in the sequence are omitted. If `prefix` is
`average`	logical. If `TRUE`, `pmin` and `pmax` are compared with the average state probability in the (sub)sequence. If `FALSE`, (sub)sequences in which every state probability is less than `pmax` or greater than `pmin` are selected.
`output`	character. If `output='sequences'`, the whole sequence(s) in which the user-defined criteria are satisfied are returned. If `output='patterns'`, only the (sub)sequences satisfying the user-defined criteria are returned.
`with.prefix`	logical. If `output='patterns'`, should the patterns in the output be preceded by their prefix, that is, by the whole subsequence preceding the pattern?
`sorted`	logical. If `sorted=TRUE`, selected patterns or sequences are sorted according to their score, i.e., their average probability.
`decreasing`	logical. If `sorted=TRUE`, should the sort order be decreasing or increasing?
`score.norm`	logical. If `TRUE`, the score attached to each selected pattern or (sub)-sequence (the weights in the returned sequence object) is the average per state probability, and is thus normalized by the length of the pattern. If `FALSE`, the score is the whole (sub)-sequence probability.

Details

The likelihood P^S(x) of a whole sequence x is computed from the state probabilities at each position in the sequence. However, the likelihood of the first states is usually lower than that of states at later positions, because less memory is available for prediction. A sequence may therefore not appear very likely if its first state has a low relative frequency, even if the model predicts high probabilities for the states at later positions.
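As a rough illustration of this effect, the sketch below multiplies a set of per-position state probabilities to obtain a sequence likelihood, then recomputes it after dropping the first positions. The probabilities are made-up numbers, not output from a fitted PST, and the variable names are hypothetical; the code uses base R only and does not reproduce the internals of `pmine`.

```r
# Hypothetical per-position state probabilities for one sequence of length 6
# (illustrative values only, not taken from a fitted PST)
state.prob <- c(0.05, 0.60, 0.90, 0.95, 0.90, 0.95)

# Whole-sequence likelihood: product of the state probabilities
seq.lik <- prod(state.prob)

# Same quantity on the log scale, which is numerically safer for long sequences
seq.loglik <- sum(log(state.prob))

# Likelihood of the positions after the first `lag` states
lag <- 2
kept <- (lag + 1):length(state.prob)
lagged.lik <- prod(state.prob[kept])

# Geometric mean of the state probabilities, i.e. a length-normalized score
geo.mean <- exp(mean(log(state.prob)))
lagged.geo.mean <- exp(mean(log(state.prob[kept])))

print(c(whole = seq.lik, lagged = lagged.lik))
print(c(whole = geo.mean, lagged = lagged.geo.mean))
```

With these numbers the low probability of the first state dominates the whole-sequence likelihood, while the lagged likelihood reflects the high probabilities at the later positions.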

The pmine function allows for advanced pattern mining with user-defined parameters. It is controlled by the lag and pmin arguments. For example, by setting lag=2 and pmin=0.40 (example 1), we select all sequences whose average state probability (the geometric mean is used) at positions lag+1, ..., ℓ is above pmin. Instead of considering the average state probability at positions lag+1, ..., ℓ, it is also possible to select frequent patterns that do not contain any state with probability below the threshold. This avoids selecting sequences that have many states with high probability but one or several states with a low probability.
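The difference between the two selection criteria can be sketched in base R. The probabilities, the names `probs`, `threshold`, and `geo.mean`, and the comparison with `>=` are assumptions made for illustration; this is not the code used by `pmine`.

```r
# Geometric mean of a vector of probabilities
geo.mean <- function(p) exp(mean(log(p)))

# Hypothetical state probabilities at positions lag+1, ..., l for two sequences
probs <- list(
  A = c(0.95, 0.95, 0.95, 0.10, 0.95, 0.95),  # one very unlikely state
  B = rep(0.50, 6)                            # uniformly moderate probabilities
)

threshold <- 0.4  # plays the role of the pmin argument

# Criterion 1: the average (geometric mean) state probability reaches the threshold
by.average <- sapply(probs, function(p) geo.mean(p) >= threshold)

# Criterion 2: every single state probability reaches the threshold
by.state <- sapply(probs, function(p) all(p >= threshold))

print(rbind(by.average, by.state))
```

Sequence A passes the average criterion but fails the per-state criterion because of its single low-probability state; sequence B passes both.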

It is also possible to mine the sequence data for frequent patterns of length ℓ_j < ℓ, regardless of the position in the sequence where they occur. With the output="patterns" argument, the pmine function returns the patterns (as a sequence object) instead of the whole set of distinct sequences containing the patterns. Since the probability of a pattern can differ depending on the context (the previous states), the returned subsequences also contain the context preceding the pattern. For more details, see Gabadinho & Ritschard (2016).
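A call of this kind might look like the following sketch. It rebuilds the pruned tree used in the Examples section and reuses the `pmin=0.6` and `l=6` values of example 2, adding only `output="patterns"`; the object name `patterns` is hypothetical, and the result is not shown because the call has not been checked against this version of the package.

```r
library(TraMineR)  # provides seqdef() and the actcal data
library(PST)       # provides pstree(), prune() and pmine()

# Same data preparation as in the Examples section
data(actcal)
actcal <- actcal[actcal$age00 >= 20 & actcal$age00 < 60, ]
actcal.lab <- c("> 37 hours", "19-36 hours", "1-18 hours", "no work")
actcal.seq <- seqdef(actcal, 13:24, labels = actcal.lab)

# Fit and prune the PST
actcal.pst <- pstree(actcal.seq, nmin = 2, ymin = 0.001)
actcal.pst.C99 <- prune(actcal.pst, gain = "G2", C = qchisq(0.99, 4 - 1) / 2)

# Return the patterns of length 6 themselves rather than the whole sequences
patterns <- pmine(actcal.pst.C99, actcal.seq, pmin = 0.6, l = 6,
                  output = "patterns")
```

Setting `with.prefix=FALSE` in the same call should drop the preceding context from the returned patterns, as described for that argument.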

Value

A state sequence object, that is, an object of class stslist, whose weights are the probability scores of the (sub)sequences.

Author(s)

Alexis Gabadinho

References

Gabadinho, A. & Ritschard, G. (2016). Analyzing State Sequences with Probabilistic Suffix Trees: The PST R Package. Journal of Statistical Software, 72(3), pp. 1-39.

Examples

## activity calendar for year 2000
## from the Swiss Household Panel
## see ?actcal
data(actcal)

## selecting individuals aged 20 to 59
actcal <- actcal[actcal$age00>=20 & actcal$age00 <60,]

## defining a sequence object
actcal.lab <- c("> 37 hours", "19-36 hours", "1-18 hours", "no work")
actcal.seq <- seqdef(actcal,13:24,labels=actcal.lab)

## building a PST
actcal.pst <- pstree(actcal.seq, nmin=2, ymin=0.001)

## pruning
## Cut-off for the 1% level (see ?prune)
C99 <- qchisq(0.99,4-1)/2
actcal.pst.C99 <- prune(actcal.pst, gain="G2", C=C99)

## example 1
pmine(actcal.pst.C99, actcal.seq, pmin=0.4, lag=2)

## example 2: patterns of length 6 having p>=0.6
pmine(actcal.pst.C99, actcal.seq, pmin=0.6, l=6)

[Package PST version 0.94.1 Index]