lma_patcat {lingmatch} | R Documentation |
Categorize Texts
Description
Categorize raw texts using a pattern-based dictionary.
Usage
lma_patcat(text, dict = NULL, pattern.weights = "weight",
pattern.categories = "category", bias = NULL, to.lower = TRUE,
return.dtm = FALSE, drop.zeros = FALSE, exclusive = TRUE,
boundary = NULL, fixed = TRUE, globtoregex = FALSE,
name.map = c(intname = "_intercept", term = "term"),
dir = getOption("lingmatch.dict.dir"))
Arguments
text |
A vector of text to be categorized. Texts are padded by 2 spaces, and potentially lowercased. |
dict |
At least a vector of terms (patterns), usually a matrix-like object with columns for terms, categories, and weights. |
pattern.weights |
A vector of weights corresponding to terms in dict, or the name of the column in dict that contains weights. |
pattern.categories |
A vector of category names corresponding to terms in dict, or the name of the column in dict that contains category names. |
bias |
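A constant to add to each category after weighting and summing. Can be a vector with names
corresponding to the unique values in the category column of dict, but is usually extracted from dict based on the intercept term included in each category (identified by the intname entry of name.map). |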
to.lower |
Logical indicating whether text should be converted to lowercase before processing. |
return.dtm |
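Logical; if TRUE, only a document-term matrix is returned, rather than the weighted, summed, and biased category values. |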
drop.zeros |
Logical; if TRUE, categories or terms with no matches are removed. |
exclusive |
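Logical; if FALSE, each dictionary term is searched for in the original text. Otherwise (by default), terms are sorted by length, longer terms are searched for first, and matches are removed from the text, so a matched span cannot be matched again by a later term. |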
boundary |
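A string to add to the beginning and end of each dictionary term. If TRUE, boundary is set to a single space (" "), which avoids pattern matches within words. By default, dictionary terms are left as entered. |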
fixed |
Logical; if FALSE, patterns are treated as regular expressions rather than as literal strings. |
globtoregex |
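Logical; if TRUE, asterisks at the start or end of a term (glob-style wildcards) are converted to their regular-expression equivalents. This also sets fixed to FALSE unless fixed is specified. |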
name.map |
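A named character vector:
intname: the term that identifies category biases (intercepts) within the term list; defaults to "_intercept".
term: the name of the column in dict that contains terms; defaults to "term".
Missing names are added, so values can be specified positionally (e.g., c("_int", "terms")), or only some can be specified by name (e.g., c(term = "patterns")), leaving the rest at their defaults. |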
dir |
Path to a folder in which to look for dict if it is the name of a file to be passed to read.dic(). |
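The effect of boundary on fixed matching can be illustrated with a minimal base R sketch. It is not the package's implementation: count_fixed is a hypothetical helper, and padding the lowercased text with one space on each side is an assumption made for illustration.

```r
# Hypothetical helper: count non-overlapping literal (fixed) matches of a
# pattern in a single string.
count_fixed <- function(text, pattern) {
  hits <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

text <- "I miss the you that was too. I love that you."

# Assumption: lowercase the text and pad it with a space on each side, so that
# words at the very start or end are also surrounded by spaces.
padded <- paste0(" ", tolower(text), " ")

# Without a boundary, "i" also matches inside "miss".
n_anywhere <- count_fixed(padded, "i")

# With a space added to each side of the term, only the standalone word matches.
boundary <- " "
n_words <- count_fixed(padded, paste0(boundary, "i", boundary))

print(c(anywhere = n_anywhere, words = n_words))
```

A space boundary treats punctuation as part of a word ("you." is not " you "), which is why the examples switch to fixed = FALSE with boundary = "\\b" to ignore punctuation.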
Value
A matrix with a row per text and a column per dictionary category, or (when return.dtm = TRUE) a sparse matrix with a row per text and a column per term. Includes a WC attribute with original word counts, and a categories attribute with row indices associated with each category if return.dtm = TRUE.
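How the document-term matrix relates to the category values (weight, sum, then add a bias) can be sketched in base R as follows. The dictionary, counts, and bias values here are made up for illustration, and the code is an assumption about the arithmetic described under Arguments, not the function's actual source.

```r
# Hypothetical dictionary: one row per term, with a category and a weight.
dict <- data.frame(
  term = c("was", "had", "will", "soon"),
  category = c("past", "past", "future", "future"),
  weight = c(1, 0.5, 1, 2)
)

# Hypothetical term counts: a row per text and a column per term.
dtm <- matrix(
  c(2, 1, 0, 0,
    0, 0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("text1", "text2"), dict$term)
)

# Hypothetical constant added to each category after weighting and summing.
bias <- c(past = -0.5, future = 0.25)

# For each category: multiply counts by weights, sum across the category's
# terms, then add that category's bias.
categories <- unique(dict$category)
scores <- vapply(categories, function(category) {
  cols <- which(dict$category == category)
  drop(dtm[, cols, drop = FALSE] %*% dict$weight[cols]) + bias[[category]]
}, numeric(nrow(dtm)))

print(scores)
```

The result has a row per text and a column per category, matching the shape described for the default return value.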
See Also
For applying term-based dictionaries (to a document-term matrix) see lma_termcat().
Other Dictionary functions: dictionary_meta(), download.dict(), lma_termcat(), read.dic(), report_term_matches(), select.dict()
Examples
# example text
text <- c(
  paste(
    "Oh, what youth was! What I had and gave away.",
    "What I took and spent and saw. What I lost. And now? Ruin."
  ),
  paste(
    "God, are you so bored?! You just want what's gone from us all?",
    "I miss the you that was too. I love that you."
  ),
  paste(
    "Tomorrow! Tomorrow--nay, even tonight--you wait, as I am about to change.",
    "Soon I will off to revert. Please wait."
  )
)
# make a document-term matrix with pre-specified terms only
lma_patcat(text, c("bored?!", "i lo", ". "), return.dtm = TRUE)
# get counts of sets of letters
lma_patcat(text, list(c("a", "b", "c"), c("d", "e", "f")))
# same thing with regular expressions
lma_patcat(text, list("[abc]", "[def]"), fixed = FALSE)
# match only words
lma_patcat(text, list("i"), boundary = TRUE)
# match only words, ignoring punctuation
lma_patcat(
  text, c("you", "tomorrow", "was"),
  fixed = FALSE,
  boundary = "\\b", return.dtm = TRUE
)
## Not run:
# read in the temporal orientation lexicon from the World Well-Being Project
tempori <- read.csv(paste0(
  "https://raw.githubusercontent.com/wwbp/lexica/master/",
  "temporal_orientation/temporal_orientation_lexicon.csv"
))
lma_patcat(text, tempori)
# or use the standardized version
tempori_std <- read.dic("wwbp_prospection", dir = "~/Dictionaries")
lma_patcat(text, tempori_std)
## get scores on the same scale by adjusting the standardized values
tempori_std[, -1] <- tempori_std[, -1] / 100 *
  select.dict("wwbp_prospection")$selected[, "original_max"]
lma_patcat(text, tempori_std)[, unique(tempori$category)]
## End(Not run)
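The letter-set examples above give the same counts whether each letter is a fixed term or the set is a single character class. A small base R sketch shows why; count_matches is a hypothetical helper and does not call lma_patcat.

```r
# Hypothetical helper: count non-overlapping matches of a pattern in one
# string, either literally (fixed = TRUE) or as a regular expression.
count_matches <- function(text, pattern, fixed = TRUE) {
  hits <- gregexpr(pattern, text, fixed = fixed)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

text <- tolower("What I lost. And now? Ruin.")

# Fixed matching: count each letter separately, then sum within the set.
n_fixed <- sum(vapply(
  c("a", "b", "c"),
  function(term) count_matches(text, term),
  integer(1)
))

# Regular expression: one character class covering the same letters.
n_regex <- count_matches(text, "[abc]", fixed = FALSE)

print(c(fixed = n_fixed, regex = n_regex))
```

The two counts agree here because single letters cannot overlap one another; with longer, overlapping patterns the order of matching can matter.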