dictionary_dtm {chinese.misc} | R Documentation |
Making DTM/TDM for Groups of Words
Description
A dictionary consists of several groups of words. Sometimes what we want is not the frequency of each individual word, but the total frequency of all the words that belong to the same group. Given a dictionary, this function sums up the frequencies for every group of words, so you do not need to do it manually.
Usage
dictionary_dtm(
x,
dictionary,
type = "dtm",
simple_sum = FALSE,
return_dictionary = FALSE,
checks = TRUE
)
Arguments
x |
an object of class DocumentTermMatrix or TermDocumentMatrix, created for example by corp_or_dtm. It can also be a matrix, in which case type must be set (see below).
|
dictionary |
a dictionary telling the function how to group the words. It can be a list, matrix, data.frame or character vector. See Details for how to set this argument. |
type |
if x is a matrix, you have to tell the function whether it represents a document term matrix or a term document matrix. Use a character string starting with "D" or "d" for a document term matrix, or one starting with "T" or "t" for a term document matrix. The default is "dtm". |
simple_sum |
if it is TRUE, a vector is returned in place of the DTM/TDM (see Value). The default is FALSE. |
return_dictionary |
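if TRUE, a list is returned that also contains a named list of the words in each group, so that you can see which words do exist in the DTM/TDM (see Details and Value). The default is FALSE. |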
checks |
The default is TRUE. |
Details
The argument dictionary can be set in different ways:
(1) list: each element represents a group of words. The element should be a character vector; if it is not, the function will try to convert it. However, each element must have length > 0 and contain at least 1 non-NA word.
(2) matrix or data.frame: each entry of the input should be character; if it is not, the function will try to convert it. At least one of the entries should not be NA. Each column (not row) represents a group of words.
(3) character vector: it represents one group of words.
(4) Note: you do not need to worry about duplicated words within a group, because the function counts each word only once. Nor do you need to worry about words in a group that do not exist in the DTM/TDM, because the function simply ignores them. If none of the words in a group exists, the group still appears in the final result, with all of its frequencies equal to 0. By setting return_dictionary = TRUE, you can see which words do exist.
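The rules above can be sketched in base R. This is only an illustration of the described behaviour, not the package's actual code; normalize_dictionary is a hypothetical helper, and terms is a made-up stand-in for the terms of a DTM/TDM.

```r
# Hypothetical helper, NOT part of chinese.misc: turn any accepted
# dictionary form into a list of clean character vectors.
normalize_dictionary <- function(dictionary) {
  if (is.matrix(dictionary)) {
    dictionary <- as.data.frame(dictionary, stringsAsFactors = FALSE)
  }
  if (is.data.frame(dictionary)) {
    groups <- as.list(dictionary)   # each column is one group
  } else if (is.list(dictionary)) {
    groups <- dictionary            # each element is one group
  } else {
    groups <- list(dictionary)      # a single vector is one group
  }
  # Convert to character (this handles factors), drop NA, drop duplicates.
  lapply(groups, function(g) {
    g <- as.character(g)
    unique(g[!is.na(g)])
  })
}

d <- normalize_dictionary(list(
  has_na = c("drink", "eat", NA),
  this_is_factor = factor(c("cake", "pizza")),
  this_is_duplicated = c("cup", "bottle", "cup", "bottle"),
  do_not_exist = c("tiger", "dream")
))

# Assumed stand-in for the terms of a DTM/TDM.
terms <- c("drink", "eat", "cake", "cup", "bottle")

# Keep only the words that exist; a group with none becomes character(0)
# but is still present in the list.
existing <- lapply(d, intersect, terms)
```

Here existing plays the role of the word list you would inspect after setting return_dictionary = TRUE; the exact structure returned by the package may differ.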
Value
If return_dictionary = FALSE, an object of class DocumentTermMatrix or TermDocumentMatrix is returned; if TRUE, a list is returned, whose 1st element is the DTM/TDM and whose 2nd element is a named list of words. However, if simple_sum = TRUE, the DTM/TDM in both of the above cases is replaced by a vector.
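To make the two kinds of result concrete, here is a minimal base R sketch on a plain numeric matrix with documents in rows. group_sums is a hypothetical helper written for this illustration, not the package's implementation, and it assumes that the vector produced with simple_sum = TRUE holds one total per group.

```r
# Hypothetical helper, NOT part of chinese.misc.
# m: numeric matrix, documents in rows and terms in columns.
# groups: named list of character vectors.
group_sums <- function(m, groups, simple_sum = FALSE) {
  out <- matrix(0, nrow = nrow(m), ncol = length(groups),
                dimnames = list(rownames(m), names(groups)))
  for (i in seq_along(groups)) {
    # intersect() drops NA, duplicated and non-existent words
    present <- intersect(groups[[i]], colnames(m))
    # a group with no existing word gives a column of zeros
    out[, i] <- rowSums(m[, present, drop = FALSE])
  }
  # Assumption: simple_sum collapses the result to one total per group.
  if (simple_sum) colSums(out) else out
}

m <- matrix(
  c(1, 1, 0, 0,
    1, 0, 0, 1,
    0, 1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("drink", "eat", "cake", "cup"))
)
groups <- list(
  aa = c("drink", "eat"),
  bb = c("cake", "pizza"),
  zz = c("tiger", "dream")
)

per_doc <- group_sums(m, groups)                     # 3 documents x 3 groups
totals  <- group_sums(m, groups, simple_sum = TRUE)  # one number per group
```

The group zz is kept in both results even though none of its words occurs in m.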
Examples
library(chinese.misc)
x <- c(
  "Hello, what do you want to drink and eat?",
  "drink a bottle of milk",
  "drink a cup of coffee",
  "drink some water",
  "eat a cake",
  "eat a piece of pizza"
)
dtm <- corp_or_dtm(x, from = "v", type = "dtm")
# Use `=` inside list() so that each group gets a name.
D1 <- list(
  aa = c("drink", "eat"),
  bb = c("cake", "pizza"),
  cc = c("cup", "bottle")
)
y1 <- dictionary_dtm(dtm, D1, return_dictionary = TRUE)
#
# NA, duplicated words, non-existent words,
# non-character elements do not affect the
# result.
D2 <- list(
  has_na = c("drink", "eat", NA),
  this_is_factor = factor(c("cake", "pizza")),
  this_is_duplicated = c("cup", "bottle", "cup", "bottle"),
  do_not_exist = c("tiger", "dream")
)
y2 <- dictionary_dtm(dtm, D2, return_dictionary = TRUE)
#
# You can read a data.frame dictionary
# from a csv file.
# Each column represents a group.
D3 <- data.frame(
  aa = c("drink", "eat", NA, NA),
  bb = c("cake", "pizza", NA, NA),
  cc = c("cup", "bottle", NA, NA),
  dd = c("do", "to", "of", "and")
)
y3 <- dictionary_dtm(dtm, D3, simple_sum = TRUE)
#
# If x is a plain matrix, type must be given:
mt <- t(as.matrix(dtm))
y4 <- dictionary_dtm(mt, D3, type = "t", return_dictionary = TRUE)