phyclust {phyclust}    R Documentation

The Main Function of phyclust

Description

The main function of phyclust implements finite mixture models for sequence data, in which the mutation processes are modeled by evolution processes based on Continuous Time Markov Chain theory.

Usage

phyclust(X, K, EMC = .EMC, manual.id = NULL, label = NULL, byrow = TRUE)

Arguments

X

nid/sid matrix with N rows/sequences and L columns/sites.

K

number of clusters.

EMC

EM control.

manual.id

manually input class ids.

label

labels of sequences for semi-supervised clustering.

byrow

advanced option for X, default = TRUE.

Details

X should be a numerical matrix containing sequence data, which can be converted from character codes by code2nid or code2sid.

EMC contains all options used for EM algorithms.

manual.id supplies class ids manually as an initialization. It is used only by the initialization method 'manualMu'.

label indicates the known clusters for labeled sequences. It is a vector of length N with values from 0 to K, where 0 indicates that the cluster is unknown. label = NULL is for unsupervised clustering. Only unsupervised and semi-supervised clustering are implemented.
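As a minimal sketch of how such a label vector might be built, the following base R code uses hypothetical values for N and K and hypothetical indices for the labeled sequences. None of these values come from the package or its datasets.

```r
# Hypothetical sizes: 10 sequences and 3 clusters (illustration only).
N <- 10
K <- 3

# Start with every sequence unlabeled (0 = cluster unknown).
semi.label <- rep(0, N)

# Assume sequences 1 and 2 are known to belong to cluster 1,
# and sequence 7 is known to belong to cluster 2.
semi.label[1:2] <- 1
semi.label[7] <- 2

# Checks matching the description above: length N, values in 0..K.
stopifnot(length(semi.label) == N, all(semi.label %in% 0:K))
```

A vector built this way would be passed as label = semi.label, in the same way as in the Examples section.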

byrow is used in bootstraps to avoid transposing the matrix 'X'. If FALSE, then 'X' should have dimension L\times N.
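The two layouts can be illustrated with a small base R sketch. The matrix values below are hypothetical nid-style codes chosen only for illustration, and the commented phyclust call is an assumption about how the transposed matrix would be supplied, based on the description of byrow above.

```r
# Hypothetical nid-style matrix: N = 4 sequences (rows), L = 6 sites (columns).
N <- 4
L <- 6
X <- matrix(c(0, 1, 2, 3, 0, 1,
              0, 1, 2, 3, 0, 0,
              3, 1, 2, 0, 0, 1,
              3, 1, 2, 0, 1, 1),
            nrow = N, ncol = L, byrow = TRUE)

# Default layout (byrow = TRUE): one sequence per row.
dim(X)

# Transposed layout: one site per row, one sequence per column.
X.t <- t(X)
dim(X.t)

# With the phyclust package loaded, the transposed matrix would
# presumably be supplied as (assumption, not run here):
# ret <- phyclust(X.t, 2, byrow = FALSE)
```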

Value

A list with class phyclust is returned, containing the following elements:

'N.X.org'

number of sequences in the X matrix.

'N.X.unique'

number of unique sequences in the X matrix.

'L'

number of sites, length of sequences, number of columns of the X matrix.

'K'

number of clusters.

'Eta'

proportion of subpopulations, \eta_k, length = K, sum to 1.

'Z.normalized'

posterior probabilities, Z_{nk}, each row sums to 1.

'Mu'

centers of subpopulations, dim = K\times L, each row is a center.

'QA'

Q matrix array, information for the evolution model, a list containing:

'pi' equilibrium probabilities, each row sums to 1.
'kappa' transition and transversion bias.
'Tt' total evolution time, t.
'identifier' identifier for QA.

'logL'

log likelihood values.

'p'

number of parameters.

'bic'

BIC, -2\log L + p \log N.

'aic'

AIC, -2\log L + 2p.

'N.seq.site'

number of segregating sites.

'class.id'

class id for each sequence based on the maximum posterior.

'n.class'

number of sequences in each cluster.

'conv'

convergence information, a list containing:

'eps' relative error.
'error' error if the likelihood decreased.
'flag' convergence state.
'iter' convergence iterations.
'inner.iter' convergence of inner iterations other than EM.
'cm.iter' convergence of CM iterations.
'check.param' parameter states.

'init.procedure'

initialization procedure.

'init.method'

initialization method.

'substitution.model'

substitution model.

'edist.model'

evolution distance model.

'code.type'

code type.

'em.method'

EM algorithm.

'boundary.method'

boundary method.

'label.method'

label method.
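To make the relationship between some of these elements concrete, the following base R sketch recomputes 'class.id', 'n.class', 'bic', and 'aic' from hypothetical inputs. The posterior matrix, log likelihood, and parameter count are made-up values, and taking N in the BIC formula to be the number of sequences is an assumption. This is not the package's internal code.

```r
# Hypothetical posterior matrix for N = 4 sequences and K = 2 clusters.
Z.normalized <- matrix(c(0.9, 0.1,
                         0.8, 0.2,
                         0.3, 0.7,
                         0.6, 0.4),
                       ncol = 2, byrow = TRUE)
N <- nrow(Z.normalized)
K <- ncol(Z.normalized)

# Each row of the posterior matrix sums to 1.
stopifnot(all(abs(rowSums(Z.normalized) - 1) < 1e-12))

# class.id: the cluster with the maximum posterior for each sequence.
class.id <- apply(Z.normalized, 1, which.max)

# n.class: number of sequences assigned to each cluster.
n.class <- tabulate(class.id, nbins = K)

# Hypothetical log likelihood and number of parameters.
logL <- -50
p <- 5

# Information criteria as defined above (N assumed to be the number of sequences).
bic <- -2 * logL + p * log(N)
aic <- -2 * logL + 2 * p
```

Under both criteria, smaller values indicate a preferred model when comparing fits to the same data.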

ToDo(s)

Author(s)

Wei-Chen Chen wccsnow@gmail.com

References

Phylogenetic Clustering Website: https://snoweye.github.io/phyclust/

See Also

.EMC, .EMControl, find.best, phyclust.se, phyclust.se.update.

Examples

library(phyclust, quietly = TRUE)

X <- seq.data.toy$org

set.seed(1234)
(ret.1 <- phyclust(X, 3))

EMC.2 <- .EMC
EMC.2$substitution.model <- "HKY85"
# the same as EMC.2 <- .EMControl(substitution.model = "HKY85")

(ret.2 <- phyclust(X, 3, EMC = EMC.2))

# for semi-supervised clustering
semi.label <- rep(0, nrow(X))
semi.label[1:3] <- 1
(ret.3 <- phyclust(X, 3, EMC = EMC.2, label = semi.label))

[Package phyclust version 0.1-34 Index]