lma_weight {lingmatch} | R Documentation
Document-Term Matrix Weighting
Description
Weight a document-term matrix.
Usage
lma_weight(dtm, weight = "count", normalize = TRUE, wc.complete = TRUE,
log.base = 10, alpha = 1, pois.x = 1L, doc.only = FALSE,
percent = FALSE)
Arguments
dtm |
A matrix with words as column names. |
weight |
A string referring at least partially to one (or a combination; see note) of the available weighting methods: Term weights (applied uniquely to each cell)
Document weights (applied by column)
Alternatively, |
normalize |
Logical: if |
wc.complete |
If the dtm was made with |
log.base |
The base of logs, applied to any weight using |
alpha |
A scaling factor applied to document frequency as part of pointwise mutual
information weighting, or amplify's power ( |
pois.x |
Integer; quantile or probability of the Poisson distribution ( |
doc.only |
Logical: if |
percent |
Logical; if |
Value
A weighted version of dtm, with a type attribute added (attr(dtm, 'type')).
Note
Term weights work to adjust differences in counts within documents, with differences meaning
increasingly more from binary to log to sqrt to count to amplify.
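The ordering of term weights described above can be sketched in base R. This is a minimal illustration that assumes common textbook definitions (a base-10 log with an offset of 1, and an exponent of 1.1 for amplify); these formulas and the names counts, term_sketch, and spread are assumptions for illustration and may not match what lma_weight computes internally.

```r
# Hand-rolled term weights under assumed, commonly used definitions;
# not necessarily the exact formulas lma_weight applies.
counts <- c(1, 2, 5, 20)

term_sketch <- rbind(
  binary  = as.numeric(counts > 0), # presence or absence only
  log     = log10(counts + 1),      # assumed offset of 1 so that zero stays zero
  sqrt    = sqrt(counts),
  count   = counts,                 # raw counts, unchanged
  amplify = counts^1.1              # assumed exponent greater than 1
)
colnames(term_sketch) <- counts

# How much more a count of 20 weighs than a count of 1 under each scheme
spread <- term_sketch[, "20"] / term_sketch[, "1"]
print(round(spread, 2))
```

Under these assumptions the ratio grows from binary through amplify, which is the sense in which count differences "mean more" further along the list.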
Document weights work to treat words differently based on their between-document or overall frequency.
When term frequencies are constant, dpois, idf, ridf, and normal give
less common words increasingly more weight, and dfmax, dfmlog, ppois, df,
dflog, and entropy give less common words increasingly less weight.
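To make the two directions of document weighting concrete, here is a small base R sketch contrasting document frequency with inverse document frequency. It assumes the common definitions df = number of documents containing the term and idf = log10(N / df); the toy matrix and these formulas are illustrative assumptions rather than a statement of lma_weight's internals.

```r
# Toy document-term matrix: 4 documents, 3 terms (illustrative data)
dtm <- matrix(
  c(
    1, 1, 1,
    1, 1, 0,
    1, 0, 0,
    1, 0, 0
  ),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("the", "cat", "quark"))
)

# Assumed definitions: document frequency and inverse document frequency
df <- colSums(dtm > 0)       # larger for more common terms
idf <- log10(nrow(dtm) / df) # larger for less common terms

# Apply a document weight by column, leaving raw counts as the term weight
weighted <- sweep(dtm, 2, idf, "*")
print(rbind(df = df, idf = round(idf, 3)))
```

Here the term that occurs in every document receives an idf of 0 and drops out of the weighted matrix, while the rarest term is weighted most heavily; df orders the same terms the other way around.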
weight can either be a character vector of length two, corresponding to term weight and
document weight (e.g., c('count', 'idf')), or it can be a single string with term and
document weights separated by any of :\*_/; ,- (e.g., 'count-idf').
'tf' is also acceptable for 'count', and 'tfidf' will be parsed as
c('count', 'idf'), though this is a special case.
For weight, term or document weights can be entered individually; term weights alone will
not apply any document weight, and document weights alone will apply a 'count' term weight
(unless doc.only = TRUE, in which case a term-named vector of document weights is returned
instead of a weighted dtm).
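The ways of specifying weight described in this note might be exercised as follows. This is a hedged sketch: the toy matrix and the object names are made up for illustration, the block only runs if the lingmatch package is installed, and the comparisons simply check the equivalences the note describes rather than asserting any particular values.

```r
if (requireNamespace("lingmatch", quietly = TRUE)) {
  library(lingmatch)

  # Illustrative document-term matrix with words as column names
  dtm <- matrix(
    c(
      2, 1, 0,
      1, 0, 3,
      1, 0, 0
    ),
    nrow = 3, byrow = TRUE,
    dimnames = list(NULL, c("the", "cat", "quark"))
  )

  # Three spellings of the same term and document weight combination
  by_vector <- lma_weight(dtm, c("count", "idf"), normalize = FALSE)
  by_string <- lma_weight(dtm, "count-idf", normalize = FALSE)
  by_alias <- lma_weight(dtm, "tfidf", normalize = FALSE)

  # Compare values only, ignoring attributes such as 'type'
  print(all.equal(as.matrix(by_vector), as.matrix(by_string), check.attributes = FALSE))
  print(all.equal(as.matrix(by_vector), as.matrix(by_alias), check.attributes = FALSE))

  # A document weight on its own, returned as a vector rather than a weighted dtm
  idf_only <- lma_weight(dtm, "idf", normalize = FALSE, doc.only = TRUE)
  print(idf_only)
}
```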
Examples
# visualize term and document weights

## term weights
term_weights <- c("binary", "log", "sqrt", "count", "amplify")
Weighted <- sapply(term_weights, function(w) lma_weight(1:20, w, FALSE))
if (require(splot)) splot(Weighted ~ 1:20, labx = "Raw Count", lines = "co")

## document weights
doc_weights <- c(
  "df", "dflog", "dfmax", "dfmlog", "idf", "ridf",
  "normal", "dpois", "ppois", "entropy"
)
weight_range <- function(w, value = 1) {
  m <- diag(20)
  m[upper.tri(m, TRUE)] <- if (is.numeric(value)) {
    value
  } else {
    unlist(lapply(
      1:20, function(v) rep(if (value == "inverted") 21 - v else v, v)
    ))
  }
  lma_weight(m, w, FALSE, doc.only = TRUE)
}

if (require(splot)) {
  category <- rep(c("df", "idf", "normal", "poisson", "entropy"), c(4, 2, 1, 2, 1))
  op <- list(
    laby = "Relative (Scaled) Weight", labx = "Document Frequency",
    leg = "outside", lines = "connected", mv.scale = TRUE, note = FALSE
  )
  splot(
    sapply(doc_weights, weight_range) ~ 1:20,
    options = op, title = "Same Term, Varying Document Frequencies",
    sud = "All term frequencies are 1.",
    colorby = list(category, grade = TRUE)
  )
  splot(
    sapply(doc_weights, weight_range, value = "sequence") ~ 1:20,
    options = op, title = "Term as Document Frequencies",
    sud = "Non-zero terms are the number of non-zero terms.",
    colorby = list(category, grade = TRUE)
  )
  splot(
    sapply(doc_weights, weight_range, value = "inverted") ~ 1:20,
    options = op, title = "Term Opposite of Document Frequencies",
    sud = "Non-zero terms are the number of zero terms + 1.",
    colorby = list(category, grade = TRUE)
  )
}