get_enriched_themes {stoRy} | R Documentation |
Find over-represented themes in a collection
Description
get_enriched_themes() calculates the top m most over-represented (or
enriched) themes in a sub-collection of interest from a background
collection.
Usage
get_enriched_themes(
test_collection,
background_collection = NULL,
top_m = 10,
weights = list(choice = 3, major = 2, minor = 1),
explicit = TRUE,
min_freq = 1,
blacklist = NULL,
metric = c("hgt", "tfidf")
)
Arguments
test_collection |
A |
background_collection |
A If |
top_m |
Maximum number of themes to report. The default is If |
weights |
A list assigning nonnegative weights to choice, major, and
minor theme levels. The default weighting
|
explicit |
Set to |
min_freq |
Drop themes occurring less than this number of times from
the analysis. The default |
blacklist |
A If |
metric |
A character vector specifying the choice of scoring function.
Use |
Details
The test collection of n stories, S[1], ..., S[n], is represented as a
weighted bag-of-words, where each choice theme in story S[j] (j = 1, ..., n)
is counted weights$choice times, each major theme weights$major times, and
each minor theme weights$minor times.
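The weighted counting described above can be sketched in base R. This is only an illustration under stated assumptions: the annotations data frame, its column names, and the toy stories and themes are hypothetical and are not part of the stoRy API, and the package's internal implementation may differ.

```r
# Toy theme annotations: one row per (story, theme) pair.
# The stories, themes, and column names are invented for illustration.
annotations <- data.frame(
  story = c("S1", "S1", "S2", "S2", "S3"),
  theme = c("time travel", "loneliness", "time travel", "hubris", "loneliness"),
  level = c("choice", "minor", "major", "minor", "major"),
  stringsAsFactors = FALSE
)

# Default weights from the Usage section.
weights <- list(choice = 3, major = 2, minor = 1)

# Look up the weight that corresponds to each annotation's level.
annotations$weight <- unname(unlist(weights[annotations$level]))

# k_bar: weighted count of each theme summed over the stories.
k_bar <- tapply(annotations$weight, annotations$theme, sum)

# k: number of distinct stories featuring each theme.
k <- tapply(annotations$story, annotations$theme,
            function(s) length(unique(s)))

# n_bar: sum of all weighted theme counts in the collection.
n_bar <- sum(annotations$weight)
```

Setting a weight to 0, for example minor = 0, removes that theme level from the weighted counts.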
The background collection of N stories, S[1], ..., S[N], is a superset of
the test collection that is likewise represented as a weighted bag-of-words.
Theme enrichment scores are calculated according to the hypergeometric test
by default. Set metric = "tfidf" to use TF-IDF weights for the enrichment
scores.
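As a rough illustration of a hypergeometric enrichment score, the sketch below computes an upper-tail p-value with stats::phyper() and converts it to the negative base 10 logarithm scale. It assumes story-level counts (k, n, K, N) with made-up values. This page does not state which counts or which parameterization get_enriched_themes() uses internally, so the code should be read as a generic version of the test rather than the package's method.

```r
# Hypothetical counts for a single theme (not taken from the package data).
k <- 12    # test collection stories featuring the theme
n <- 50    # stories in the test collection
K <- 40    # background collection stories featuring the theme
N <- 1000  # stories in the background collection

# Upper-tail hypergeometric p-value: the probability of seeing k or more
# theme-bearing stories when n stories are drawn without replacement
# from a background of N stories, K of which feature the theme.
p_value <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

# Enrichment score on the -log10 scale; larger means more over-represented.
score <- -log10(p_value)
```

Under these assumptions, a theme that occurs in the test collection about as often as chance predicts gets a score near zero, while a strongly over-represented theme gets a large score.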
Value
Returns a tibble with up to top_m rows (themes) and 10 columns:
theme_name : | m-th most over-represented theme in the test collection |
k : | Number of test collection stories featuring the theme |
k_bar : | Weighted counts of the theme summed over the test collection stories |
n : | Number of stories in the test collection |
n_bar : | Sum of all weighted counts of test collection themes |
K : | Number of background collection stories featuring the theme |
K_bar : | Weighted counts of the theme summed over the background collection stories |
N : | Number of stories in the background collection |
N_bar : | Sum of all weighted counts of background collection themes |
score : | Either the negative base 10 logarithm of the hypergeometric
test p-value (if metric = "hgt") or the TF-IDF score (if metric = "tfidf") |
References
Mikael Onsjö, Paul Sheridan (2020). Theme Enrichment Analysis: A Statistical Test for Identifying Significantly Enriched Themes in a List of Stories with an Application to the Star Trek Television Franchise. Digital Studies/le Champ Numérique, 10(1), 1. DOI: 10.16995/dscn.316
Examples
## Not run:
# Retrieve the top 10 most enriched themes in "The Twilight Zone" (1959)
# series episodes with all demo version stories as background:
set_lto("demo")
test_collection <- Collection$new(collection_id = "Collection: tvseries: The Twilight Zone (1959)")
result_tbl <- get_enriched_themes(test_collection)
result_tbl
# Run the same analysis on "The Twilight Zone" (1959) series without
# including minor level themes:
result_tbl <- get_enriched_themes(test_collection, weights = list(choice = 1, major = 1, minor = 0))
result_tbl
## End(Not run)