rainette {rainette}    R Documentation

Corpus clustering based on the Reinert method - Simple clustering

Description

Clusters the documents of a corpus with the Reinert method, using a single (simple) clustering.

Usage

rainette(
  dtm,
  k = 10,
  min_segment_size = 0,
  doc_id = NULL,
  min_split_members = 5,
  cc_test = 0.3,
  tsj = 3,
  min_members,
  min_uc_size
)

Arguments

dtm

quanteda dfm object of the documents to cluster, usually built from the result of split_segments()

k

maximum number of clusters to compute

min_segment_size

minimum number of forms per document

doc_id

name (character string) of a dtm docvar that identifies the source documents

min_split_members

do not try to split groups with fewer members than this value

cc_test

contingency coefficient value for feature selection

tsj

minimum frequency value for feature selection

min_members

deprecated, use min_split_members instead

min_uc_size

deprecated, use min_segment_size instead
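
As a rough illustration of what the cc_test and tsj thresholds could mean for a single feature, here is a minimal base R sketch. It is not the package's internal code: it assumes that the contingency coefficient is Pearson's C = sqrt(chi2 / (chi2 + N)) computed on a cluster-by-feature table, and that both thresholds act as minimum values. The toy counts and all variable names are hypothetical.

```r
# Hypothetical counts for one feature in two candidate clusters:
# rows = clusters, columns = segments with / without the feature
tab <- matrix(c(30, 5, 70, 95), nrow = 2,
              dimnames = list(c("cluster1", "cluster2"),
                              c("present", "absent")))

cc_test <- 0.3  # same defaults as in the Usage section
tsj <- 3

# Chi-squared statistic without continuity correction
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

# Pearson contingency coefficient (assumed definition, see lead-in)
cc <- sqrt(chi2 / (chi2 + sum(tab)))

# Total frequency of the feature across both clusters
freq <- sum(tab[, "present"])

# The feature passes this sketch's selection if both thresholds are met
keep <- freq >= tsj && cc >= cc_test
keep
```

Raising cc_test in this sketch keeps only features that are more strongly associated with one of the two clusters; raising tsj discards rare features regardless of their association.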

Details

See the references for the original articles on the method. Computations and results may differ noticeably from those described there; see the package vignettes for more details.

The dtm object is automatically converted to boolean.
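
To make the boolean conversion concrete, here is a small base R sketch on a plain count matrix. It only illustrates the idea of replacing counts with presence/absence values; the matrix and its labels are made up, and the package's own conversion operates on a dfm rather than on a base matrix.

```r
# Hypothetical segment-by-feature count matrix
counts <- matrix(c(2, 0, 1,
                   0, 3, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("seg1", "seg2"),
                                 c("liberty", "union", "people")))

# Any positive count becomes 1, zero stays 0
binary <- (counts > 0) * 1
binary
```

On a quanteda dfm, dfm_weight(dtm, scheme = "boolean") performs a comparable conversion, so there should be no need to binarize the dfm before calling rainette().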

If min_segment_size > 0, doc_id must be provided, unless the corpus comes from split_segments(); in that case segment_source is used by default.

Value

The result is a list of both class hclust and rainette. Besides the elements of an hclust object, two more results are available:

References

See Also

split_segments(), rainette2(), cutree_rainette(), rainette_plot(), rainette_explor()

Examples


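library(quanteda)
library(rainette)

# Use the first 10 documents of quanteda's inaugural address corpus
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
# Split each document into shorter segments
corpus <- split_segments(corpus)
# Tokenize, dropping punctuation and English stopwords
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
# Build the document-feature matrix and drop features found in fewer than 3 segments
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 3)
# Compute up to 3 clusters, requiring at least 15 forms per segment
res <- rainette(dtm, k = 3, min_segment_size = 15)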
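
A possible follow-up to the example above is to inspect the returned object. The sketch below repeats the same preparation so that it runs on its own, checks the two classes named in the Value section, and uses cutree_rainette() from the See Also list to get cluster memberships. The choice of k = 2 is arbitrary, and the variable name groups is only illustrative.

```r
library(quanteda)
library(rainette)

# Same preparation as in the example above
corpus <- split_segments(head(data_corpus_inaugural, n = 10))
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm_trim(dfm(tok, tolower = TRUE), min_docfreq = 3)
res <- rainette(dtm, k = 3, min_segment_size = 15)

# The result inherits from both classes named in the Value section
class(res)

# Cluster membership of each segment when cutting the tree into 2 clusters
groups <- cutree_rainette(res, k = 2)
table(groups, useNA = "ifany")
```

Because the result is also an hclust object, it can be cut at any number of clusters up to the k used in the call to rainette().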



[Package rainette version 0.3.1.1 Index]