LexCHCca {Xplortext}R Documentation

Chronological Constrained Hierarchical Clustering on Correspondence Analysis Components (LexCHCca)

Description

Chronological constrained agglomerative hierarchical clustering on a corpus of documents

Usage

LexCHCca (object, nb.clust=0, min=2, max=NULL, nb.par=5, 
 graph=TRUE, proba=0.05, cut.test=FALSE, alpha.test =0.05, description=FALSE,
 nb.desc=5, size.desc=80)

Arguments

object

object of LexCA class

nb.clust

number of clusters only if no test (cut.test=FALSE). If 0 (or "click"), the tree is cut at the level the user clicks on. If -1 (or "auto"), the tree is automatically cut at the suggested level. If a (positive) integer, the tree is cut with nb.clust clusters (by default 0)

min

minimum number of clusters. Available only if cut.test=FALSE. (by default 3)

max

maximum number of clusters. Available only if cut.test=FALSE. (by default NULL; then max is computed as the minimum between 10 and the number of documents divided by 2)

nb.par

number of edited paragons (para) and specific documents labels (dist) (by default 5)

graph

if TRUE, graphs are displayed (by default TRUE)

proba

threshold on the p-value used to describe the clusters (by default 0.05)

cut.test

if FALSE (by default), Legendre test is not performed when joining two nodes. This test is used to determine whether two clusters should be joined or not; see details

alpha.test

threshold on the p-value used in selecting aggregation clusters for Legendre test (by default 0.05)

description

if TRUE, description of the clusters by the characteristic words/documents, paragon (para), specific documents (dist) and contextual variables if these latter have been selected in the previous LexCA function (by default FALSE)

nb.desc

number of paragons (para) and specific documents (dist) that are edited when describing the clusters (by default 5)

size.desc

maximum of characters when editing the paragons (para) and specific documents (dist) to describe the clusters (by default 80)

Details

LexCHCca starts from the document coordinates issued from a textual correspondence analysis. The hierarchical tree is built in such a way that only chronological contiguous nodes can be joined. The documents have to be ranked in their chronological order in the source-base (data frame format) before to apply the function (TextData format).

Legendre test allows to determine whether the fusion between two nodes based on their contiguity lead to a heterogenous new node (no homogeneity-between-clusters). If Legendre test is applied (cut.test=TRUE), the number of clusters is the number obtained by the test and nb.clust has not effects.

If no Legendre test is applied (cut.test= FALSE), the number of clusters is determined either a priori or from the constrained hierarchical tree structure.

The object $para contains the distance between each document and the centroid of its class.

The object $dist contains the distance between each document and the centroid of the farthest cluster.

The results of the description of the clusters and graphs are provided.

Value

Returns a list including:

data.clust

the active lexical table used in LexCA plus a new column called Clust_ containing the partition

coord.clust

coordinates table issued from CA plus a new column called weigths and another column called Clust_, corresponds to the partition

centers

coordinates of the gravity centers of the clusters

description

$des.word for description of the clusters of documents by their characteristic words, the paragons (des.doc$para) and specific documents (des.doc$dist) of each cluster; see details

call

list of internal objects. call$t giving the results for the hierarchical tree

dendro

hclust object. This allows for using the dendrogram in other packages

phases

details of the tracking of the agglomerative hierarchical process. In particular, the cut points (joining documents not allowed) can be identified

sum.squares

sum of squares decomposition for documents and clusters

Author(s)

Monica Bécue-Bertaut, Ramón Alvarez-Esteban ramon.alvarez@unileon.es, Josep-Antón Sánchez-Espigares, Belchin Kostov

References

Bécue-Bertaut, M., Kostov, B., Morin, A., & Naro, G. (2014). Rhetorical Strategy in Forensic Speeches: Multidimensional Statistics-Based Methodology. Journal of Classification,31, 85-106. doi:10.1007/s00357-014-9148-9.

Husson F., Lê S., Pagès J. (2017). Exploratory Multivariate Analysis by Example Using R. Chapman & Hall/CRC. doi:10.1201/b21874.

Lebart L. (1978). Programme d'agrégation avec contraintes. Les Cahiers de l'Analyse des Données, 3, pp. 275–288.

Legendre, P. & Legendre, L. (1998), Numerical Ecology (2nd ed.), Amsterdam: Elsevier Science.

Murtagh F. (1985). Multidimensional Clustering Algorithms. Vienna: Physica-Verlag, COMPSTAT Lectures.

See Also

plot.LexCHCca, LexCA

Examples

data(open.question)
res.TD<-TextData(open.question,var.text=c(9,10), var.agg="Age_Group", Fmin=10, Dmin=10, 
        stop.word.tm=TRUE)
res.LexCA<-LexCA(res.TD, graph=FALSE)
res.ccah<-LexCHCca(res.LexCA, nb.clust=4, min=3)

[Package Xplortext version 1.5.3 Index]