cnlp_utils_tfidf {cleanNLP} | R Documentation |
Construct the TF-IDF Matrix from Annotation or Data Frame
Description
Given annotations, this function returns the term-frequency inverse document frequency (tf-idf) matrix from the extracted lemmas.
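To make the idea concrete, here is a minimal base R sketch of a textbook tf-idf computation on a toy token table. It assumes raw counts for the term frequency and log(N / df) for the inverse document frequency; it is an illustration of the concept only and is not taken from the cleanNLP implementation, whose default weights may give different numbers.

```r
# Toy token table: one row per token, with its document id and lemma.
tokens <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2", "d3"),
  lemma  = c("cat", "sit", "cat", "dog", "sit", "cat"),
  stringsAsFactors = FALSE
)

# Raw term frequencies: documents in rows, lemmas in columns.
tf <- unclass(table(tokens$doc_id, tokens$lemma))

# Document frequency: number of documents that contain each lemma.
df <- colSums(tf > 0)

# Textbook inverse document frequency (an assumption of this sketch).
idf <- log(nrow(tf) / df)

# Multiply every column of tf by the idf of its lemma.
tfidf <- sweep(tf, 2, idf, `*`)
```

Lemmas that occur in few documents receive a larger idf, so their counts are weighted up relative to lemmas that occur almost everywhere.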
Usage
cnlp_utils_tfidf(
  object,
  tf_weight = c("lognorm", "binary", "raw", "dnorm"),
  idf_weight = c("idf", "smooth", "prob", "uniform"),
  min_df = 0.1,
  max_df = 0.9,
  max_features = 10000,
  doc_var = "doc_id",
  token_var = "lemma",
  vocabulary = NULL,
  doc_set = NULL
)

cnlp_utils_tf(
  object,
  tf_weight = "raw",
  idf_weight = "uniform",
  min_df = 0,
  max_df = 1,
  max_features = 10000,
  doc_var = "doc_id",
  token_var = "lemma",
  vocabulary = NULL,
  doc_set = NULL
)
Arguments
object |
a data frame containing an identifier for the document
(set with doc_var) and a token (set with token_var) |
tf_weight |
the weighting scheme for the term frequency matrix.
One of "lognorm", "binary", "raw", or "dnorm" |
idf_weight |
the weighting scheme for the inverse document frequency
matrix. One of "idf", "smooth", "prob", or "uniform" |
min_df |
the minimum proportion of documents a token should be in to be included in the vocabulary |
max_df |
the maximum proportion of documents a token should be in to be included in the vocabulary |
max_features |
the maximum number of tokens in the vocabulary |
doc_var |
character vector. The name of the column in
object that contains the document ids |
token_var |
character vector. The name of the column in
object that contains the tokens |
vocabulary |
character vector. The vocabulary set to use in
constructing the matrices. Will be computed
within the function if set to NULL |
doc_set |
optional character vector of document ids. Useful to
create empty rows in the output matrix for documents
without data in the input. Most users will want to keep
this equal to NULL |
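The interaction of min_df, max_df, and max_features can be sketched in base R as follows. The helper select_vocabulary is hypothetical and not part of cleanNLP. The sketch assumes that when more lemmas pass the bounds than max_features allows, the ones found in the largest share of documents are kept; that tie-breaking rule is an assumption, not something stated above.

```r
# Toy token table: "the" is in every document, "dog" in only one.
tokens <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3", "d3", "d4"),
  lemma  = c("the", "cat", "the", "dog", "the", "cat", "the"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a cleanNLP function): choose a vocabulary
# from document-frequency bounds and a size cap.
select_vocabulary <- function(tokens, min_df, max_df, max_features) {
  # Count each lemma at most once per document.
  pairs  <- unique(tokens[, c("doc_id", "lemma")])
  n_docs <- length(unique(tokens$doc_id))

  # Proportion of documents that contain each lemma.
  prop <- vapply(split(pairs$doc_id, pairs$lemma), length, integer(1)) / n_docs

  # Keep lemmas whose proportion lies within [min_df, max_df].
  keep <- prop[prop >= min_df & prop <= max_df]

  # Assumption: prefer the most widespread lemmas when capping the size.
  keep <- sort(keep, decreasing = TRUE)
  head(names(keep), max_features)
}

vocab_strict <- select_vocabulary(tokens, min_df = 0.3, max_df = 0.9,
                                  max_features = 10)
vocab_capped <- select_vocabulary(tokens, min_df = 0, max_df = 1,
                                  max_features = 2)
```

In the strict call, "the" is dropped for appearing in every document and "dog" for appearing in too few, which is the usual motivation for bounding the document frequency on both sides.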
Value
a sparse matrix with dimnames giving the documents and vocabulary.
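A short usage sketch follows, assuming the cleanNLP package is installed and that a plain data frame with the default doc_id and lemma columns is accepted as object. The toy data are made up for illustration. Because the corpus has only three documents, the document-frequency bounds are relaxed so that no lemma is filtered out.

```r
library(cleanNLP)

# Toy token table using the default column names.
tokens <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2", "d3"),
  lemma  = c("cat", "sit", "dog", "sit", "cat"),
  stringsAsFactors = FALSE
)

# tf-idf matrix; min_df and max_df are widened for this tiny corpus.
tfidf <- cnlp_utils_tfidf(tokens, min_df = 0, max_df = 1)

# Term-frequency matrix; its defaults already keep every lemma.
tf <- cnlp_utils_tf(tokens)

# Documents label the rows and vocabulary terms label the columns.
dim(tfidf)
rownames(tf)
colnames(tf)
```

Passing the column names of one result as vocabulary in a later call is one way to build a matrix for new documents over the same set of terms.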