R: Build a Random Corpus

rancor_builder {text2map}

R Documentation

Build a Random Corpus

Description

rancor_builder() generates a random corpus (rancor) based on user-defined term probabilities and a vocabulary. Users can set the number of documents, as well as the mean, standard deviation, minimum, and maximum of the parent normal distribution from which document lengths (i.e., number of tokens) are randomly sampled. The output is a single document-term matrix. To produce multiple random corpora, use rancors_builder() (note the plural). Term probabilities and vocabulary can come from a user's own corpus or from a pre-compiled frequency list, such as the one derived from the Google Book N-grams corpus.
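The sampling idea described above can be sketched in base R. This is an illustration only, not the text2map implementation: sketch_rancor() is a hypothetical helper, its len_sd argument stands in for the standard deviation that the package exposes as len_var, and clamping lengths to the minimum and maximum (rather than resampling out-of-range draws) is an assumption.

```r
# Illustrative sketch only; NOT how text2map implements rancor_builder().
# sketch_rancor() and len_sd are hypothetical names used for this example.
sketch_rancor <- function(vocab, probs, n_docs = 100L, len_mean = 500,
                          len_sd = 10, len_min = 20L, len_max = 1000L,
                          seed = NULL) {
  if (!is.null(seed)) set.seed(seed)

  # Draw one length per document from a normal distribution,
  # then clamp to [len_min, len_max] (an assumed way to apply the bounds).
  lens <- round(rnorm(n_docs, mean = len_mean, sd = len_sd))
  lens <- pmin(pmax(lens, len_min), len_max)

  # Fill a documents-by-terms count matrix, one multinomial draw per document.
  dtm <- matrix(0L, nrow = n_docs, ncol = length(vocab),
                dimnames = list(paste0("doc_", seq_len(n_docs)), vocab))
  for (i in seq_len(n_docs)) {
    dtm[i, ] <- rmultinom(1, size = lens[i], prob = probs)[, 1]
  }
  dtm
}

# Toy vocabulary with made-up probabilities
toy <- sketch_rancor(c("a", "b", "c"), c(0.5, 0.3, 0.2),
                     n_docs = 5L, seed = 1)
```

Each row of the result sums to that document's sampled length, so rowSums(toy) recovers the lengths and colSums(toy) should be roughly proportional to the supplied probabilities.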

Usage

rancor_builder(
  data,
  vocab,
  probs,
  n_docs = 100L,
  len_mean = 500,
  len_var = 10L,
  len_min = 20L,
  len_max = 1000L,
  seed = NULL
)

Arguments

`data`	Data.frame containing vocabulary and probabilities
`vocab`	Name of the column containing vocabulary
`probs`	Name of the column containing probabilities
`n_docs`	Integer indicating the number of documents to be returned
`len_mean`	Integer indicating the mean document length in the parent normal sampling distribution
`len_var`	Integer indicating the standard deviation of document lengths in the parent normal sampling distribution
`len_min`	Integer indicating the minimum document length allowed when sampling from the parent normal distribution
`len_max`	Integer indicating the maximum document length allowed when sampling from the parent normal distribution
`seed`	Optional seed for reproducibility

Author(s)

Dustin Stoltz and Marshall Taylor

Examples

# create corpus and DTM
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)
## some text preprocessing
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))

dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)

# use colSums to get term frequencies
df <- data.frame(
  terms = colnames(dtm),
  freqs = colSums(dtm)
)
# convert to probabilities
df$probs <- df$freqs / sum(df$freqs)

# create random DTM
rDTM <- df |>
  rancor_builder(terms, probs)

[Package text2map version 0.2.0 Index]