dictionary_dtm {chinese.misc}    R Documentation

Making DTM/TDM for Groups of Words

Description

A dictionary contains several groups of words. Sometimes what we want is not the frequency of a single word, but the total frequency of all words that belong to the same group. Given a dictionary, this function sums the frequencies of each group of words, so you do not need to do it manually.
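
The idea can be illustrated without the package. The following is only a base R sketch of the grouping idea on a hand-made count matrix, not the function's actual implementation; the matrix `m`, the list `groups` and the helper `sum_groups` are made up for illustration.

```r
# Hypothetical document-term counts: rows are documents, columns are words.
m <- matrix(
  c(1, 1, 0, 0,
    1, 0, 0, 1,
    0, 1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("drink", "eat", "cake", "cup"))
)

# Hypothetical dictionary: each element is one group of words.
groups <- list(action = c("drink", "eat"), thing = c("cake", "cup"))

# For each group, add up the columns of the words that belong to it.
sum_groups <- function(mat, dict) {
  sapply(dict, function(words) {
    rowSums(mat[, colnames(mat) %in% words, drop = FALSE])
  })
}

# Result: one row per document, one column per group.
group_counts <- sum_groups(m, groups)
print(group_counts)
```

Using `drop = FALSE` keeps the selection a matrix even when a group matches one word or none, so `rowSums` still works for such groups.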

Usage

dictionary_dtm(
  x,
  dictionary,
  type = "dtm",
  simple_sum = FALSE,
  return_dictionary = FALSE,
  checks = TRUE
)

Arguments

x

an object of class DocumentTermMatrix or TermDocumentMatrix created by corp_or_dtm, tm::DocumentTermMatrix or tm::TermDocumentMatrix. It can also be a numeric matrix, in which case you have to specify its type; see below.

dictionary

a dictionary telling the function how to group the words. It can be a list, matrix, data.frame or character vector. See Details for how to set this argument.

type

if x is a matrix, you have to say whether it represents a document term matrix or a term document matrix: use a character string starting with "D" or "d" for a document term matrix, or with "T" or "t" for a term document matrix. The default is "dtm".
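
To make the two orientations concrete, here is a small base R sketch. It assumes nothing about how the package parses `type` internally; the helper `orientation` and the matrix `tdm` are hypothetical and only mirror the first-letter rule described above.

```r
# Hypothetical helper: read the orientation from the first letter of `type`.
orientation <- function(type) {
  first <- tolower(substr(type, 1, 1))
  if (first == "d") {
    "dtm"
  } else if (first == "t") {
    "tdm"
  } else {
    stop("type must start with 'D', 'd', 'T' or 't'")
  }
}

# A made-up term-document matrix: rows are words, columns are documents.
tdm <- matrix(
  1:6, nrow = 3,
  dimnames = list(c("drink", "eat", "cake"), c("doc1", "doc2"))
)

# Transposing puts documents in rows, which is the document-term layout.
as_dtm <- if (orientation("T") == "tdm") t(tdm) else tdm
print(dim(as_dtm))
```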

simple_sum

if FALSE (default), a DTM/TDM is returned. If TRUE, you will not see the term frequencies in each text. Instead, a numeric vector is returned, each element of which is the total frequency of the corresponding group of words in the corpus as a whole.
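
The relation between the two kinds of output can be sketched in base R. This is an illustration of the arithmetic only, with a made-up matrix `group_counts`; it is not taken from the package's code.

```r
# Hypothetical group-level counts: rows are documents, columns are groups.
group_counts <- matrix(
  c(2, 0,
    1, 1,
    1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("action", "thing"))
)

# Summing over documents gives one total per group for the whole corpus.
corpus_totals <- colSums(group_counts)
print(corpus_totals)
```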

return_dictionary

if TRUE, a modified dictionary is also returned, which contains only the words that exist in the DTM/TDM. The default is FALSE.
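
What "only contains words that do exist" means can be shown with a short base R sketch. The names `vocabulary`, `dict` and `kept` are hypothetical, and how the package itself treats a group with no matching words is not shown here.

```r
# Hypothetical vocabulary: the terms that occur in the DTM/TDM.
vocabulary <- c("drink", "eat", "cake", "cup")

# Hypothetical dictionary containing some words that are not in the vocabulary.
dict <- list(
  action = c("drink", "eat", "sleep"),
  animal = c("tiger", "dream")
)

# Keep, for each group, only the words that are present in the vocabulary.
kept <- lapply(dict, function(words) intersect(words, vocabulary))
print(kept)
```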

checks

The default is TRUE. This checks whether x and dictionary are valid. For dictionary, if the input is not a list of character vectors, the function will try to convert it. Do not set this to FALSE unless you are sure that your input is valid.
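
The kind of conversion involved can be sketched in base R. This is an assumption about what such a conversion might look like, not the package's actual checking code; `normalize_dictionary` is a hypothetical helper.

```r
# Hypothetical helper: turn a dictionary given as a matrix, data.frame,
# character vector or list into a list of clean character vectors.
normalize_dictionary <- function(dict) {
  if (is.matrix(dict)) {
    # Test for a matrix first: a character matrix is also is.character().
    dict <- as.data.frame(dict, stringsAsFactors = FALSE)
  } else if (is.character(dict) || is.factor(dict)) {
    # A single vector is treated as one group.
    dict <- list(dict)
  }
  # as.list() on a data.frame yields one element per column.
  lapply(as.list(dict), function(g) {
    g <- as.character(g)   # factors and other types become character
    unique(g[!is.na(g)])   # drop NA and duplicated words
  })
}

df <- data.frame(
  aa = c("drink", "eat", NA),
  bb = factor(c("cake", "cake", "pizza")),
  stringsAsFactors = FALSE
)
clean <- normalize_dictionary(df)
print(clean)
```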

Details

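The argument dictionary can be set in different ways: as a list, a matrix, a data.frame or a character vector. As the examples below show, each column of a data.frame represents one group, and NA values, duplicated words, words that do not exist in the DTM/TDM, and non-character elements (such as factors) do not affect the result.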

Value

If return_dictionary = FALSE, an object of class DocumentTermMatrix or TermDocumentMatrix is returned. If TRUE, a list is returned whose first element is the DTM/TDM and whose second element is a named list of words. If simple_sum = TRUE, the DTM/TDM in both cases is replaced by a numeric vector.

Examples

library(chinese.misc)
x <- c(
  "Hello, what do you want to drink and eat?", 
  "drink a bottle of milk", 
  "drink a cup of coffee", 
  "drink some water", 
  "eat a cake", 
  "eat a piece of pizza"
)
dtm <- corp_or_dtm(x, from = "v", type = "dtm")
# Use "=" inside list() so that each group gets a name.
D1 <- list(
  aa = c("drink", "eat"),
  bb = c("cake", "pizza"),
  cc = c("cup", "bottle")
)
y1 <- dictionary_dtm(dtm, D1, return_dictionary = TRUE)
#
# NA, duplicated words, non-existent words, 
# non-character elements do not affect the
# result.
D2 <- list(
  has_na = c("drink", "eat", NA),
  this_is_factor = factor(c("cake", "pizza")),
  this_is_duplicated = c("cup", "bottle", "cup", "bottle"), 
  do_not_exist = c("tiger", "dream")
)
y2 <- dictionary_dtm(dtm, D2, return_dictionary = TRUE)
#
# You can read a data.frame dictionary
# from a csv file.
# Each column represents a group.
D3 <- data.frame(
  aa = c("drink", "eat", NA, NA),
  bb = c("cake", "pizza", NA, NA),
  cc = c("cup", "bottle", NA, NA),
  dd = c("do", "to", "of", "and"),
  stringsAsFactors = FALSE
)
y3 <- dictionary_dtm(dtm, D3, simple_sum = TRUE)
#
# If it is a matrix:
mt <- t(as.matrix(dtm))
y4 <- dictionary_dtm(mt, D3, type = "t", return_dictionary = TRUE)

[Package chinese.misc version 0.2.3 Index]