R: Find over-represented themes in a collection

get_enriched_themes {stoRy}

R Documentation

Find over-represented themes in a collection

Description

get_enriched_themes() calculates the top m most over-represented (or enriched) themes in a sub-collection of interest relative to a background collection.

Usage

get_enriched_themes(
  test_collection,
  background_collection = NULL,
  top_m = 10,
  weights = list(choice = 3, major = 2, minor = 1),
  explicit = TRUE,
  min_freq = 1,
  blacklist = NULL,
  metric = c("hgt", "tfidf")
)

Arguments

`test_collection`	A `Collection()` class object of stories to assay for over-represented themes.
`background_collection`	A `Collection()` class object whose stories are a superset of the `test_collection` stories. If `NULL`, the collection of all stories in the actively loaded LTO version is used.
`top_m`	Maximum number of themes to report. The default is `top_m=10`. If `Inf`, all themes occurring at least `min_freq` times in the collection are reported.
`weights`	A list assigning nonnegative weights to choice, major, and minor theme levels. The default weighting `list(choice = 3, major = 2, minor = 1)` counts each choice theme usage three times, each major theme usage twice, and each minor theme usage once. The uniform weighting `list(choice = 1, major = 1, minor = 1)` weights theme usages equally regardless of level. At least one weight must be positive.
`explicit`	Set to `FALSE` to include ancestor themes of the explicit thematic annotations.
`min_freq`	Drop themes occurring fewer than this number of times from the analysis. The default `min_freq=1` results in no themes being discarded.
`blacklist`	A `Themeset()` class object. A themeset containing themes to be dropped from the analysis. If `NULL`, no themes are filtered.
`metric`	A character vector specifying the choice of scoring function. Use `metric = "hgt"` for the hypergeometric test, and `metric = "tfidf"` for term frequency-inverse document frequency. The default specification of `metric = c("hgt", "tfidf")` results in the hypergeometric test being used in the analysis.
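
The behavior of the default `metric = c("hgt", "tfidf")` described above matches the common R idiom of resolving a vector of choices with `match.arg()`. The sketch below illustrates that idiom only; `pick_metric()` is a hypothetical stand-in, and it is an assumption that `get_enriched_themes()` handles the argument this way internally.

```r
# Hypothetical helper illustrating the match.arg() idiom; not the package's code.
pick_metric <- function(metric = c("hgt", "tfidf")) {
  # When `metric` is left at its default, match.arg() returns the first choice.
  match.arg(metric)
}

default_metric <- pick_metric()        # argument omitted
tfidf_metric <- pick_metric("tfidf")   # argument given explicitly
print(default_metric)
print(tfidf_metric)
```

Under this idiom, omitting `metric` selects the first listed choice, and a value outside the listed choices raises an error.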

Details

The test collection of n stories, S[1], ..., S[n], is represented as a weighted bag-of-words, where each choice theme in story S[j] (j = 1, ..., n) is counted weights$choice times, each major theme weights$major times, and each minor theme weights$minor times.

The background collection of N stories, S[1], ..., S[N], is a superset of the test collection that is likewise represented as a weighted bag-of-words.
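
As a rough illustration of the weighted bag-of-words representation, the base R sketch below tallies weighted and unweighted counts for a toy collection. The `annotations` data frame and its contents are invented for this example, and the variable names only mirror the `k`, `k_bar`, `n`, and `n_bar` columns documented under Value; this is not the package's internal implementation.

```r
weights <- list(choice = 3, major = 2, minor = 1)

# Invented toy annotations: one row per (story, theme, level) usage.
annotations <- data.frame(
  story = c("S1", "S1", "S2", "S2", "S3"),
  theme = c("time travel", "loneliness", "time travel", "hubris", "loneliness"),
  level = c("choice", "minor", "major", "minor", "major"),
  stringsAsFactors = FALSE
)

# Look up the weight of each usage from its level.
annotations$weight <- unname(unlist(weights[annotations$level]))

# Collection-wide totals.
n <- length(unique(annotations$story))  # number of stories
n_bar <- sum(annotations$weight)        # sum of all weighted theme counts

# Per-theme totals.
k <- tapply(annotations$story, annotations$theme,
            function(s) length(unique(s)))           # stories featuring the theme
k_bar <- tapply(annotations$weight, annotations$theme, sum)  # weighted count

print(data.frame(theme = names(k), k = as.vector(k), k_bar = as.vector(k_bar)))
```

The same tally applied to the background collection would give the corresponding `K`, `K_bar`, `N`, and `N_bar` quantities.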

Theme enrichment scores are calculated according to the hypergeometric test by default. Set metric = "tfidf" to use TF-IDF weights for the enrichment scores.
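
One plausible reading of the hypergeometric score is the negative base 10 logarithm of an upper-tail hypergeometric p-value computed from the weighted counts. The sketch below uses `stats::phyper()` under that assumption; `hgt_score()` is a hypothetical helper, the counts are invented, and the package's exact formula may differ.

```r
# Assumed form of the score: -log10 of P(X >= k_bar), where X is the weighted
# count of the theme among n_bar usages drawn from N_bar background usages,
# K_bar of which belong to the theme. Hypothetical helper, not the package's code.
hgt_score <- function(k_bar, n_bar, K_bar, N_bar) {
  # phyper(q, ..., lower.tail = FALSE) gives P(X > q), so pass k_bar - 1.
  p_value <- phyper(k_bar - 1, K_bar, N_bar - K_bar, n_bar, lower.tail = FALSE)
  -log10(p_value)
}

# Invented toy counts for a single theme.
score_enriched <- hgt_score(k_bar = 5, n_bar = 9, K_bar = 6, N_bar = 40)
score_typical <- hgt_score(k_bar = 1, n_bar = 9, K_bar = 6, N_bar = 40)
print(c(enriched = score_enriched, typical = score_typical))
```

A theme whose weighted usages are concentrated in the test collection receives a smaller p-value and therefore a larger score.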

Value

Returns a tibble with up to top_m rows (themes) and 10 columns:

`theme_name`:	`m`-th most over-represented theme in the test collection
`k`:	Number of test collection stories featuring the theme
`k_bar`:	Weighted counts of the theme summed over the test collection stories
`n`:	Number of stories in the test collection
`n_bar`:	Sum of all weighted counts of test collection themes
`K`:	Number of background collection stories featuring the theme
`K_bar`:	Weighted counts of the theme summed over the background collection stories
`N`:	Number of stories in the background collection
`N_bar`:	Sum of all weighted counts of background collection themes
`score`:	Either the negative base 10 logarithm of the hypergeometric test p-value (if `metric = "hgt"`) or the TF-IDF score (if `metric = "tfidf"`)

References

Mikael Onsjö, Paul Sheridan (2020). Theme Enrichment Analysis: A Statistical Test for Identifying Significantly Enriched Themes in a List of Stories with an Application to the Star Trek Television Franchise. Digital Studies/le Champ Numérique, 10(1), 1. doi:10.16995/dscn.316

Examples

## Not run: 
# Retrieve the top 10 most enriched themes in "The Twilight Zone" (1959)
# series episodes with all demo version stories as background:
set_lto("demo")
test_collection <- Collection$new(collection_id = "Collection: tvseries: The Twilight Zone (1959)")
result_tbl <- get_enriched_themes(test_collection)
result_tbl

# Run the same analysis on "The Twilight Zone" (1959) series without
# including minor level themes:
result_tbl <- get_enriched_themes(test_collection, weights = list(choice = 1, major = 1, minor = 0))
result_tbl

## End(Not run)

[Package stoRy version 0.2.2 Index]