R: Find clusters of similar stories

get_story_clusters {stoRy}

R Documentation

Find clusters of similar stories

Description

get_story_clusters classifies the stories in a collection according to thematic similarity.

Usage

get_story_clusters(
  collection = NULL,
  weights = list(choice = 3, major = 2, minor = 1),
  explicit = TRUE,
  min_freq = 1,
  min_size = 3,
  blacklist = NULL
)

Arguments

`collection`	A `Collection()` class object. If `NULL`, the collection of all stories in the actively loaded LTO version is used.
`weights`	A list assigning nonnegative weights to choice, major, and minor theme levels. The default weighting `list(choice = 3, major = 2, minor = 1)` counts each choice theme usage three times, each major theme usage twice, and each minor theme usage once. The uniform weighting `list(choice = 1, major = 1, minor = 1)` weights theme usages equally regardless of level. At least one weight must be positive.
`explicit`	Set to `FALSE` to include ancestor themes of the explicit thematic annotations.
`min_freq`	Drop themes occurring fewer than this number of times from the analysis. With the default `min_freq=1`, no themes are discarded.
`min_size`	Minimum cluster size. The default is `min_size=3`.
`blacklist`	A `Themeset()` class object containing themes to be dropped from the analysis. If `NULL`, no themes are filtered.

Details

The input collection of n stories, S[1], ..., S[n], is represented as a weighted bag-of-words, where each choice theme in story S[j] (j = 1, ..., n) is counted weights$choice times, each major theme weights$major times, and each minor theme weights$minor times.
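The weighted counting described above can be illustrated with a small base R sketch. The `annotations` data frame and its story and theme names are made up for illustration; this is not how stoRy stores its data internally, only one way to picture the resulting story-by-theme matrix.

```r
# Hypothetical toy annotations: one row per (story, theme, level) usage.
annotations <- data.frame(
  story = c("S1", "S1", "S1", "S2", "S2"),
  theme = c("time travel", "regret", "nostalgia", "time travel", "mass panic"),
  level = c("choice", "major", "minor", "major", "choice"),
  stringsAsFactors = FALSE
)

# Default weighting from the Usage section.
weights <- list(choice = 3, major = 2, minor = 1)

# Look up the weight of each usage from its theme level.
annotations$weight <- unname(unlist(weights[annotations$level]))

# Sum the weights into a story-by-theme matrix (the weighted bag-of-words).
bow <- xtabs(weight ~ story + theme, data = annotations)
print(bow)
```

In this toy matrix, "time travel" counts three times for S1 (a choice theme) and twice for S2 (a major theme), while themes a story does not use count zero.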

The function classifies the stories according to thematic similarity using the Iterative Signature Algorithm (ISA), a biclustering algorithm, as implemented in the isa2 R package. The clusters are "soft", meaning that a story can appear in multiple clusters.
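To give a feel for what "soft" clusters look like, the following sketch runs the ISA on simulated data from isa2 itself rather than on LTO stories. It assumes the isa2 package is installed and uses its default settings; how `get_story_clusters` prepares its input and post-processes the result is not shown here, and which matrix dimension holds the stories is not stated in this page.

```r
# Sketch only: runs the ISA on simulated data, not on LTO stories.
if (requireNamespace("isa2", quietly = TRUE)) {
  set.seed(123)

  # isa.in.silico() simulates a matrix with planted modules; the first
  # element of the returned list is the data matrix.
  sim <- isa2::isa.in.silico()
  x <- sim[[1]]

  # Run the ISA with its default thresholds and seeds.
  res <- isa2::isa(x)

  # res$rows has one column per bicluster; a nonzero score marks membership.
  row_membership <- res$rows != 0

  # Number of biclusters each row belongs to. Counts above 1 mean that a
  # row appears in more than one bicluster, i.e. the clustering is soft.
  print(table(rowSums(row_membership)))
}
```

Because the ISA starts from random seeds, results can vary between runs unless a seed is set, which is why the Examples section calls `set.seed()` before clustering.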

Install the isa2 package by running the command install.packages("isa2") before calling this function.

Value

Returns a tibble with r rows (story clusters) and 4 columns:

`cluster_id`:	Story cluster integer ID
`stories`:	A tibble of stories comprising the cluster
`themes`:	A tibble of themes common to the clustered stories
`size`:	Number of stories in the cluster

References

Gábor Csárdi, Zoltán Kutalik, Sven Bergmann (2010). Modular analysis of gene expression data with R. Bioinformatics, 26, 1376-7.

Sven Bergmann, Jan Ihmels, Naama Barkai (2003). Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E, 67, 031902.

Gábor Csárdi (2017). isa2: The Iterative Signature Algorithm. R package version 0.3.5. https://cran.r-project.org/package=isa2

Examples

## Not run: 
# Cluster "The Twilight Zone" franchise stories according to thematic
# similarity:
library(stoRy)
library(dplyr)
set_lto("demo")
set.seed(123)
result_tbl <- get_story_clusters()
result_tbl

# Explore a cluster of stories related to traveling back in time:
cluster_id <- 3
pull(result_tbl, stories)[[cluster_id]]
pull(result_tbl, themes)[[cluster_id]]

# Explore a cluster of stories related to mass panics:
cluster_id <- 5
pull(result_tbl, stories)[[cluster_id]]
pull(result_tbl, themes)[[cluster_id]]

# Explore a cluster of stories related to executions:
cluster_id <- 7
pull(result_tbl, stories)[[cluster_id]]
pull(result_tbl, themes)[[cluster_id]]

# Explore a cluster of stories related to space aliens:
cluster_id <- 10
pull(result_tbl, stories)[[cluster_id]]
pull(result_tbl, themes)[[cluster_id]]

# Explore a cluster of stories related to old people wanting to be young:
cluster_id <- 11
pull(result_tbl, stories)[[cluster_id]]
pull(result_tbl, themes)[[cluster_id]]

# Explore a cluster of stories related to wish making:
cluster_id <- 13
pull(result_tbl, stories)[[cluster_id]]
pull(result_tbl, themes)[[cluster_id]]

## End(Not run)

[Package stoRy version 0.2.2 Index]