split_transcript {textshape}    R Documentation

Split a Transcript Style Vector on Delimiter & Coerce to Dataframe

Description

Splits a transcript-style vector (e.g., c("greg: Who me", "sarah: yes you!")) into a name vector and a dialogue vector, which are combined into a data.table. Leading and trailing white space in both columns is stripped.

Usage

split_transcript(
  x,
  delim = ":",
  colnames = c("person", "dialogue"),
  max.delim = 15,
  ...
)

Arguments

x

A transcript-style vector (e.g., c("greg: Who me", "sarah: yes you!")).

delim

The delimiter to split on.

colnames

The column names to use for the data.table output.

max.delim

An integer giving the maximum number of characters that may come before the delimiter. This is useful when a colon is the delimiter but time stamps, which also contain colons, appear in the text.
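The idea behind max.delim can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the helper split_first_delim and its argument max_delim are hypothetical names, and returning NA for the name when no delimiter is found early enough is an assumption made for this sketch only.

```r
# Hypothetical helper (not part of textshape): split each element at its first
# delimiter, but only when at most `max_delim` characters precede it.
split_first_delim <- function(x, delim = ":", max_delim = 15) {
  # Position of the first literal delimiter in each element (-1 if absent)
  pos <- as.integer(regexpr(delim, x, fixed = TRUE))
  # Accept the split only when the delimiter appears early enough
  ok <- pos > 0 & pos <= max_delim + 1
  # Assumption for this sketch: unsplit lines get NA as the name
  person <- ifelse(ok, trimws(substr(x, 1, pos - 1)), NA_character_)
  dialogue <- ifelse(ok, trimws(substr(x, pos + nchar(delim), nchar(x))), trimws(x))
  data.frame(person = person, dialogue = dialogue, stringsAsFactors = FALSE)
}

out <- split_first_delim(c(
  "greg: Who me",
  "sarah: we met at 10:30",        # later colon in the time stamp is left alone
  "Meeting began at 10:30 sharp"   # first colon comes too late to be a speaker tag
))
print(out)
```

In this sketch, only a colon near the start of the line is treated as a speaker delimiter, so a time stamp further into the text does not trigger a split.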

...

Ignored.

Value

Returns a two-column data.table whose columns are named according to colnames.
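As a quick check of the returned structure, the following sketch calls split_transcript with custom column names. It assumes the textshape package is installed; the names "speaker" and "utterance" are illustrative choices, not defaults.

```r
# Runs only if textshape is available; "speaker" and "utterance" are
# illustrative column names chosen for this sketch.
if (requireNamespace("textshape", quietly = TRUE)) {
  out <- textshape::split_transcript(
    c("greg: Who me", "sarah: yes you!"),
    colnames = c("speaker", "utterance")
  )
  print(names(out))  # column names are taken from `colnames`
  print(dim(out))    # one row per input element, two columns
}
```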

Examples

split_transcript(c("greg: Who me", "sarah: yes you!"))

## Not run: 
## 2015 Republican Presidential Primary Debates Example
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, magrittr, xml2)

debates <- c(
    wisconsin = "110908",
    boulder = "110906",
    california = "110756",
    ohio = "110489"
)

lapply(debates, function(x){
    xml2::read_html(paste0("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)) %>%
        rvest::html_nodes("p") %>%
        rvest::html_text() %>%
        textshape::split_index(grep("^[A-Z]+:", .)) %>%
        textshape::combine() %>%
        textshape::split_transcript() %>%
        textshape::split_sentence()
})

## End(Not run)

[Package textshape version 1.7.5 Index]