clickb-package {clickb}    R Documentation

Web Data Analysis by Bayesian Mixture of Markov Models

Description

The R package clickb provides tools for the analysis and classification of web sequences. One way to identify users' activity stages is sequential data analysis, which aims to discover statistically relevant temporal structure in data whose values are delivered in sequences. In the model-based clustering approach, longitudinal data are analysed and a model is trained to capture the underlying process that generates the sequences. Markov processes assume that the sequences follow a parametric distribution that depends on the conditional probability of occurrence of each sequence element and satisfies the Markov property. In other words, these models analyse patterns by taking into account the probabilistic nature of a subject's movements along the sequence, treating the possible realizations of a categorical variable as the states of a Markov chain.
Furthermore, their extension to mixtures of first-order Markov models can be used to identify clusters of sequences generated by the same Markov model, with each cluster representing a subpopulation in the data. Each subpopulation has its own Markov model, so subpopulations may differ in their initial probabilities, their transition probabilities, or both. In the clickstream context, these differences mean that sequences belonging to different clusters describe different browsing behaviour: they display different preferences for the page from which the web path starts or for the pages accessed in the next steps of the website exploration.
Parameters can be estimated with the Expectation-Maximization (EM) algorithm (Baum et al., 1970), as done by Cadez et al. (2003) for clickstream analysis. This package instead uses a Bayesian approach based on an MCMC sampler with Dirichlet priors on the rows of the transition matrices, as proposed by Fruehwirth-Schnatter and Pamminger (2010). This approach overcomes a limitation of the EM algorithm, which may struggle during the M-step if there are no transitions between two states.
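
As a rough illustration of the model described above, the mixture likelihood and the conditional posterior of a transition row can be written as follows. The notation is an assumption introduced here and is not taken from the package: y_i is sequence i, eta_m are the mixture weights, pi_m and A_m are the initial probabilities and transition matrix of component m, n_{i,jk} counts the transitions from state j to state k in sequence i, S_i is the component label of sequence i, and e_{jk} are the Dirichlet hyperparameters.

```latex
% Likelihood of one sequence under a mixture of M first-order Markov models
p(y_i \mid \theta) = \sum_{m=1}^{M} \eta_m \, \pi_{m, y_{i0}}
    \prod_{j=1}^{K} \prod_{k=1}^{K} A_{m,jk}^{\,n_{i,jk}}

% Conditional posterior of row j of A_m given the allocations S,
% under a Dirichlet(e_{j1}, ..., e_{jK}) prior on that row
A_{m,j\cdot} \mid S, y \sim \mathrm{Dirichlet}\bigl(e_{j1} + N_{m,j1}, \ldots, e_{jK} + N_{m,jK}\bigr),
\qquad N_{m,jk} = \sum_{i \,:\, S_i = m} n_{i,jk}
```

Because every e_{jk} is positive, the posterior parameters stay positive even when a transition count N_{m,jk} is zero, which is one way to read the advantage over the EM algorithm's M-step mentioned above.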
The package contains tools to estimate the parameters of mixtures of first-order discrete-time Markov models for categorical responses in a Bayesian framework, identifying clusters of sequences under the assumption that the number of mixture components is fixed. The algorithm is based on Gibbs sampler moves and, at each iteration, assigns the web sequences to clusters according to their posterior probabilities. If simulated data are used or a previous classification is available, the algorithm compares the original cluster labels with the new ones through the Danon similarity index (Danon et al., 2005).
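
The Gibbs moves described above can be sketched in base R as a single sweep: allocate each sequence to a component from its posterior probability, then redraw the parameters from their Dirichlet posteriors. This is a minimal sketch under stated assumptions, not the clickb implementation. All names (rdirichlet1, trans_counts, seqs, eta and so on) are hypothetical, the toy data are random, and the symmetric hyperparameter value of 1 is chosen only for illustration.

```r
# Illustrative single Gibbs sweep for a mixture of first-order Markov models.
# Hypothetical names throughout; this is not the code used inside clickb.
set.seed(1)
K <- 3   # number of states
M <- 2   # number of mixture components

# Draw one probability vector from a Dirichlet distribution via gamma variates
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

# K x K matrix of transition counts for one sequence with states in 1..K
trans_counts <- function(s, K) {
  tab <- table(factor(s[-length(s)], levels = 1:K),
               factor(s[-1], levels = 1:K))
  matrix(as.vector(tab), K, K)
}

# Toy data: ten random sequences of length 8
seqs   <- replicate(10, sample(1:K, 8, replace = TRUE), simplify = FALSE)
counts <- lapply(seqs, trans_counts, K = K)

# Current parameter values (uniform start, so no log(0) can occur below)
A   <- replicate(M, matrix(1 / K, K, K), simplify = FALSE)
ini <- replicate(M, rep(1 / K, K), simplify = FALSE)
eta <- rep(1 / M, M)

# Step 1: sample a component label for each sequence from its posterior
labels <- sapply(seq_along(seqs), function(i) {
  logp <- sapply(1:M, function(m) {
    log(eta[m]) + log(ini[[m]][seqs[[i]][1]]) + sum(counts[[i]] * log(A[[m]]))
  })
  sample(1:M, 1, prob = exp(logp - max(logp)))
})

# Step 2: redraw parameters from their Dirichlet conditional posteriors
prior <- 1   # assumed symmetric Dirichlet hyperparameter
for (m in 1:M) {
  N <- Reduce(`+`, counts[labels == m], matrix(0, K, K))
  A[[m]] <- t(apply(N + prior, 1, rdirichlet1))   # one draw per row
  first <- vapply(seqs[labels == m], function(s) s[1], integer(1))
  ini[[m]] <- rdirichlet1(prior + tabulate(first, nbins = K))
}
eta <- rdirichlet1(prior + tabulate(labels, nbins = M))
```

Repeating the two steps and storing the draws after a burn-in period would give a posterior sample. Structural zeros, such as the constraints used in the Examples section, would need extra handling that this sketch omits.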

Author(s)

Furio Urso furio.urso@unipa.it,
Reza Mohammadi a.mohammadi@uva.nl,
Antonino Abbruzzo antonino.abbruzzo@unipa.it,
Maria Francesca Cracolici mariafrancesca.cracolici@unipa.it

References

Baum LE, Petrie T, Soules G, Weiss N (1970) A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics 41(1), 164–171

Cadez I, Heckerman D, Meek C, Smyth P, White S (2003) Model-based clustering and visualization of navigation patterns on a web site. Data Mining and Knowledge Discovery 7(4), 399–424

Danon L, Diaz-Guilera A, Duch J, Arenas A (2005) Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment 2005(09), P09008

Fruehwirth-Schnatter S, Pamminger C (2010) Model-based clustering of categorical time series. Bayesian Analysis 5(2), 345–368

Examples

# Generate a sequential dataset from a mixture of Markov models 
M <- 3    # number of components
K <- 6    # number of states
# define initial and transition probabilities for each component
ini1<-c(0.15,0.4,0.2, 0.15, 0,0.1)
A1<-matrix(c(0,0.45,0.1,0.15,0.15,0.15,
             0.1,0,0.15,0.15,0.1,0.5,
             0.25,0.2,0,0.2,0.15,0.2,
             0.15,0.2,0.05,0,0.4,0.2,
             0.15,0.05,0.45,0.15,0,0.2,
             0.4,0.15,0.2,0.05,0.2,0),byrow=TRUE,nrow=6)

ini2<-c(0.3,0.2,0.25, 0.15, 0, 0.1)
A2<-matrix(c(0,0.8,0,0,0,0.2,
             0.2,0,0.8,0,0,0,
             0,0.2,0,0.8,0,0,
             0,0,0.2,0,0.8,0,
             0,0,0,0.2,0,0.8,
             0.8,0,0,0,0.2,0),byrow=TRUE,nrow=6)

ini3<-c(0.2,0.1,0.2, 0.1, 0, 0.4)
A3<-matrix(c(0,0.1,0,0,0,0.9,
             0.8,0,0.15,0.05,0,0,
             0,0.9,0,0.1,0,0,
             0.05,0.05,0.8,0,0,0.1,
             0,0,0.05,0.9,0,0.05,
             0.05,0.05,0,0,0.9,0),byrow=TRUE,nrow=6)

trans.prob <- list(A1, A2, A3)
ini.prob <- list(ini1, ini2, ini3)

# sizes i.e. number of sequences in each component
N.sim1<-20
N.sim2<-30
N.sim3<-50

clust.size <- list(N.sim1, N.sim2, N.sim3)

T.range <- c(5, 30)  # minimum and maximum sequence length

data<- sim_seq( M, K, ini.prob, trans.prob, clust.size, T.range)

### Estimate model parameters and identify clusters of sequences

# Set up initial values and hyperparameters (either fixed or random)

iter<-5   # number of iterations for the Gibbs sampler
burn<-0   # number of burn-in iterations

num.cluster <- 3 # number of components
states <- 6 # number of states

ini.constr<-c(1, 1, 1, 1, 0, 1)           # constraints on initial probabilities
trans.constr<-matrix(c(0,1,1,1,1,1,       # constraints on transition probabilities
             1,0,1,1,1,1, 
             1,1,0,1,1,1, 
             1,1,1,0,1,1, 
             1,1,1,1,0,1, 
             1,1,1,1,1,0),byrow=TRUE,nrow=6) 

# Initial parameter values
A.ini <- 1/states*matrix(rep(1, length = (states^2)), 
                         nrow = states, byrow = TRUE, 
                         dimnames = list(as.character(c(1:states))) )
pi.ini <- rep(1/states, length = states)

# Prior distributions' hyperparameters
prior.ini<- prior.transrow <- prior.mixcoef <- 1  

# Run the MCMC to estimate parameters 
MMM_1 <- fit_mixmar(data, iter, burn, num.cluster = num.cluster, states = states,
                     A.ini = A.ini, pi.ini = pi.ini, prior.ini = prior.ini, 
                     prior.transrow = prior.transrow, prior.mixcoef = prior.mixcoef, 
                     ini.constr = ini.constr, trans.constr = trans.constr)
                     
str(MMM_1$pi.list)            # Initial Markov chain probabilities for MCMC
str(MMM_1$A.list)             # Transition Markov chain probabilities for MCMC
str(MMM_1$Sim.index.Danon)    # Danon similarity between two partitions
str(MMM_1$Sim.index.Rand)     # Rand similarity between two partitions
MMM_1$Conf.mat                # Confusion matrix between two partitions
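
To give a feel for what a similarity index between two partitions measures, the sketch below computes the normalized mutual information used by Danon et al. (2005) to compare partitions. It assumes that this is the quantity meant by the Danon index here, and the function name nmi is hypothetical. It is not the code that produces Sim.index.Danon.

```r
# Normalized mutual information between two label vectors (illustrative sketch).
# Assumes both vectors have the same length and that at least one of them
# contains more than one distinct label (otherwise the denominator is zero).
nmi <- function(a, b) {
  n  <- table(a, b)            # confusion matrix between the two partitions
  N  <- sum(n)
  ni <- rowSums(n)             # cluster sizes in the first partition
  nj <- colSums(n)             # cluster sizes in the second partition
  expected <- outer(ni, nj) / N
  nz <- n > 0                  # skip empty cells, where 0 * log(0) is taken as 0
  num <- -2 * sum(n[nz] * log(n[nz] / expected[nz]))
  den <- sum(ni * log(ni / N)) + sum(nj * log(nj / N))
  num / den
}

# Same grouping with permuted labels, and two unrelated groupings
same_up_to_labels <- nmi(c(1, 1, 2, 2, 3, 3), c(2, 2, 3, 3, 1, 1))
unrelated         <- nmi(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

The index does not depend on how the clusters are numbered, which matters when comparing MCMC output with reference labels because mixture components can be relabelled arbitrarily.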


[Package clickb version 0.1 Index]