dtm_remove_tfidf {udpipe}

R Documentation

Remove terms from a Document-Term-Matrix and documents with no terms based on the term frequency inverse document frequency

Description

Remove terms from a Document-Term-Matrix, and documents left with no terms, based on the term frequency inverse document frequency (tfidf). Specify either the maximum number of terms to keep (argument top), a tfidf cutoff (argument cutoff), or a quantile (argument prob).

Usage

dtm_remove_tfidf(dtm, top, cutoff, prob, remove_emptydocs = TRUE)

Arguments

`dtm`	an object returned by `document_term_matrix`
`top`	integer giving the number of terms to keep, namely those with the highest mean tfidf
`cutoff`	numeric cutoff value to keep only terms in `dtm` where the tfidf obtained by `dtm_tfidf` is higher than this value
`prob`	numeric quantile (between 0 and 1) indicating to keep only terms in `dtm` where the tfidf obtained by `dtm_tfidf` is higher than the `prob` quantile of the tfidf values
`remove_emptydocs`	logical indicating whether to remove documents that contain no terms after the term removal is executed. Defaults to `TRUE`.
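
To choose a sensible value for `top`, `cutoff` or `prob`, it can help to look at the per-term tfidf scores first. The following is a minimal sketch, not the package's internal code: it reuses the data preparation from the Examples section and calls `dtm_tfidf` directly. The assumptions that `top` relates to the largest scores and that `prob` is converted into a cutoff through a quantile of these scores are based on the argument descriptions above.

```r
library(udpipe)

# Same data preparation as in the Examples section
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, xpos == "NN")
x <- document_term_frequencies(x[, c("doc_id", "lemma")])
dtm <- document_term_matrix(x)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 10)

# One tfidf score per term in the dtm
scores <- dtm_tfidf(dtm)

# Assumption: `top` relates to the terms with the largest scores
head(sort(scores, decreasing = TRUE), 5)

# Assumption: `prob` is turned into a cutoff via a quantile of the scores
quantile(scores, probs = 0.6)

# The overall distribution of scores helps when picking a `cutoff`
summary(scores)
```

Inspecting `summary(scores)` before filtering makes it easier to judge whether a given `cutoff` would discard most of the vocabulary or hardly any of it.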

Value

a sparse Matrix, as returned by sparseMatrix, in which only terms with high tfidf are kept and, if `remove_emptydocs` is `TRUE`, documents without any remaining terms are removed
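
The effect of `remove_emptydocs` on the returned matrix can be checked by comparing dimensions. This is a small illustrative sketch that reuses the data preparation from the Examples section; the variable names `pruned` and `kept_all` are introduced here for illustration only.

```r
library(udpipe)

# Same data preparation as in the Examples section
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, xpos == "NN")
x <- document_term_frequencies(x[, c("doc_id", "lemma")])
dtm <- document_term_matrix(x)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 10)

# Default: documents left without any term are dropped
pruned <- dtm_remove_tfidf(dtm, top = 50)
# Keep every document, even those with no remaining terms
kept_all <- dtm_remove_tfidf(dtm, top = 50, remove_emptydocs = FALSE)

class(pruned)
dim(pruned)
dim(kept_all)

# Number of documents that were dropped because they became empty
nrow(kept_all) - nrow(pruned)

# A few of the terms that survived the filter
head(colnames(pruned))
```

Keeping empty documents with `remove_emptydocs = FALSE` can be useful when the rows of the result need to stay aligned with other per-document data.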

Examples

library(udpipe)
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, xpos == "NN")
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
dtm <- document_term_matrix(x)
dtm <- dtm_remove_lowfreq(dtm, minfreq = 10)
dim(dtm)

## Keep only the 50 terms with the highest tfidf
x <- dtm_remove_tfidf(dtm, top = 50)
dim(x)
## Same, but keep documents that end up without any terms
x <- dtm_remove_tfidf(dtm, top = 50, remove_emptydocs = FALSE)
dim(x)

## Keep only terms with tfidf above 1.1
x <- dtm_remove_tfidf(dtm, cutoff = 1.1)
dim(x)

## Keep only terms with tfidf above the 60 percent quantile
x <- dtm_remove_tfidf(dtm, prob = 0.6)
dim(x)

[Package udpipe version 0.8.11 Index]