bind_tf_idf2 {audubon} | R Documentation |
Bind term frequency and inverse document frequency
Description
Calculates and binds the term frequency, inverse document frequency, and TF-IDF of the dataset. This function experimentally supports 3 types of term frequency and 4 types of inverse document frequency, all of which are implemented in the 'RMeCab' package.
Usage
bind_tf_idf2(
tbl,
term = "token",
document = "doc_id",
n = "n",
tf = c("tf", "tf2", "tf3"),
idf = c("idf", "idf2", "idf3", "idf4"),
norm = FALSE,
rmecab_compat = TRUE
)
Arguments
tbl | A tidy text dataset.
term | Column containing terms as string or symbol.
document | Column containing document IDs as string or symbol.
n | Column containing document-term counts as string or symbol.
tf | Method for computing term frequency.
idf | Method for computing inverse document frequency.
norm | Logical; If passed as
rmecab_compat | Logical; If passed as
Details
The type of term frequency can be switched with the tf argument:
- tf is the term frequency (not the raw count of terms).
- tf2 is the logarithmic term frequency, with base 10.
- tf3 is the binary-weighted term frequency.
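The three weightings above can be illustrated with a few lines of base R. This is only a sketch under assumed textbook definitions (count divided by the document total, log10 of the count plus 1, and a 0/1 indicator); the exact formulas used by bind_tf_idf2 and 'RMeCab' may differ. The helper names tf_relative, tf_log10, and tf_binary are hypothetical and are not part of either package.

```r
# Hypothetical helpers; not part of 'audubon' or 'RMeCab'.
# `n` is a numeric vector of term counts within a single document.

# Relative term frequency: each count divided by the document total
# (assumed definition).
tf_relative <- function(n) n / sum(n)

# Logarithmic term frequency with base 10; the +1 offset is an assumption
# that keeps the result finite when a count is 0.
tf_log10 <- function(n) log10(n + 1)

# Binary weighting: 1 if the term occurs at all, 0 otherwise.
tf_binary <- function(n) as.numeric(n > 0)

counts <- c(apple = 3, banana = 1, cherry = 0)
tf_relative(counts)
tf_log10(counts)
tf_binary(counts)
```

Note that only the first variant depends on the document length; the other two are functions of each count alone.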
The type of inverse document frequency can be switched with the idf argument:
- idf is the smoothed inverse document frequency, with base 2. 'Smoothed' here simply means that 1 is added after taking the logarithm.
- idf2 is the global frequency IDF.
- idf3 is the probabilistic IDF, with base 2.
- idf4 is the global entropy, which is not actually an IDF.
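As a rough illustration of how the four global weightings differ, the following base R sketch computes them from a small term-by-document count matrix. The formulas are commonly cited definitions of these weightings and are assumptions here; they are not taken from the 'audubon' or 'RMeCab' sources, and the helper name idf_variants is hypothetical.

```r
# Hypothetical helper; not part of 'audubon' or 'RMeCab'.
# `m` is a term-by-document count matrix (rows = terms, columns = documents).
idf_variants <- function(m) {
  n_docs <- ncol(m)
  df <- rowSums(m > 0)  # number of documents containing each term
  gf <- rowSums(m)      # total count of each term across all documents
  p <- m / gf           # share of each term's occurrences in each document
  plogp <- ifelse(p > 0, p * log2(p), 0)  # treat 0 * log2(0) as 0
  cbind(
    idf  = log2(n_docs / df) + 1,             # base-2 IDF plus 1 (assumed)
    idf2 = gf / df,                           # global frequency IDF (assumed)
    idf3 = log2((n_docs - df) / df),          # probabilistic IDF (assumed)
    idf4 = 1 + rowSums(plogp) / log2(n_docs)  # global entropy (assumed)
  )
}

m <- matrix(
  c(2, 0, 0, 0,
    1, 1, 1, 1,
    3, 1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("rare", "even", "skewed"), paste0("doc", 1:4))
)
idf_variants(m)
```

Under these assumed definitions, a term that occurs in every document (the "even" row) receives a probabilistic IDF of -Inf, so that variant needs care with very common terms.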
Value
A data.frame.
Examples
## Not run:
df <- dplyr::add_count(hiroba, doc_id, token)
bind_tf_idf2(df)
## End(Not run)