split_transcript {textshape} | R Documentation |
Split a Transcript Style Vector on Delimiter & Coerce to Dataframe
Description
Split a transcript style vector (e.g., c("greg: Who me", "sarah: yes you!"))
into a name and dialogue vector that is coerced to a data.table.
Leading/trailing white space in the columns is stripped out.
Usage
split_transcript(
  x,
  delim = ":",
  colnames = c("person", "dialogue"),
  max.delim = 15,
  ...
)
Arguments
x |
A transcript style vector (e.g., c("greg: Who me", "sarah: yes you!")). |
delim |
The delimiter to split on. |
colnames |
The column names to use for the data.table output. |
max.delim |
An integer stating how many characters may come before a delimiter is found. This is useful for the case when a colon is the delimiter but time stamps are also found in the text. |
... |
Ignored. |
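The idea behind max.delim can be sketched in base R. The helper below, split_on_early_delim, is hypothetical and is not textshape's implementation; it assumes that a delimiter only counts as a speaker marker when at most max_chars characters precede it, and that lines without such a delimiter get a missing speaker. Both are assumptions made for illustration.

```r
# Hypothetical helper (not part of textshape): split each element on its first
# delimiter, but only when at most `max_chars` characters precede that delimiter.
split_on_early_delim <- function(x, delim = ":", max_chars = 15) {
  # Position of the first literal delimiter in each element (-1 if absent)
  pos <- as.integer(regexpr(delim, x, fixed = TRUE))
  # Assumption: a delimiter found too far into the line is not a speaker marker
  ok <- pos > 0 & (pos - 1) <= max_chars
  person <- ifelse(ok, trimws(substr(x, 1, pos - 1)), NA_character_)
  dialogue <- ifelse(
    ok,
    trimws(substr(x, pos + nchar(delim), nchar(x))),
    trimws(x)
  )
  data.frame(person = person, dialogue = dialogue, stringsAsFactors = FALSE)
}

lines <- c(
  "greg: Meet me at 10:30",
  "The recording started at 10:30 sharp"
)
res <- split_on_early_delim(lines)
print(res)
```

In the second element the first colon belongs to a time stamp and sits well past 15 characters, so the sketch leaves the line unsplit rather than treating "The recording started at 10" as a speaker name.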
Value
Returns a 2 column data.table.
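A minimal usage sketch of the returned object, assuming the textshape package is installed. The column names "speaker" and "utterance" are arbitrary choices for illustration; only the colnames argument shown in Usage is relied on.

```r
# Example input in the "name: dialogue" form described above
transcript <- c("greg: Who me", "sarah: yes you!")

# Guarded so the sketch does nothing when textshape is unavailable
if (requireNamespace("textshape", quietly = TRUE)) {
  out <- textshape::split_transcript(
    transcript,
    colnames = c("speaker", "utterance")  # illustrative names, not defaults
  )
  print(class(out))  # inspect the class of the returned object
  print(names(out))  # inspect the column names that were applied
  print(out)
}
```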
Examples
split_transcript(c("greg: Who me", "sarah: yes you!"))
## Not run:
## 2015 Vice-Presidential Debates Example
if (!require("pacman")) install.packages("pacman")
pacman::p_load(rvest, magrittr, xml2)
debates <- c(
    wisconsin = "110908",
    boulder = "110906",
    california = "110756",
    ohio = "110489"
)
lapply(debates, function(x){
    xml2::read_html(paste0("http://www.presidency.ucsb.edu/ws/index.php?pid=", x)) %>%
        rvest::html_nodes("p") %>%
        rvest::html_text() %>%
        textshape::split_index(grep("^[A-Z]+:", .)) %>%
        textshape::combine() %>%
        textshape::split_transcript() %>%
        textshape::split_sentence()
})
## End(Not run)
[Package textshape version 1.7.5 Index]