sentSplit {qdap} | R Documentation |
Sentence Splitting
Description
sentSplit
- Splits turns of talk into individual sentences (provided
proper punctuation is used). This procedure is usually done as part of the
data read-in and cleaning process.
sentCombine
- Combines sentences by the same grouping variable together.
TOT
- Converts the tot column from sentSplit
to a
turn of talk index (no sentence sub-indexing). Generally for internal use.
sent_detect
- Detect and split sentences on endmark boundaries.
sent_detect_nlp
- Detect and split sentences on endmark boundaries
using openNLP & NLP utilities, matching the behavior of the old version of
the openNLP package's now-removed sentDetect
function.
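The combining step described for sentCombine can be pictured with a small base R sketch. This is not qdap's implementation; the `person` and `text` vectors are hypothetical, and the sketch assumes that "combining" means pasting together consecutive sentences that share the same grouping value.

```r
# Hypothetical speaker labels and already-split sentences (not qdap data)
person <- c("sam", "sam", "greg", "sam", "sam")
text   <- c("Hi.", "How are you?", "Fine.", "Good.", "Bye!")

# Label each run of consecutive rows that share the same speaker
runs   <- rle(person)
run_id <- rep(seq_along(runs$lengths), times = runs$lengths)

# Paste the sentences within each run back into one turn of talk
combined <- vapply(split(text, run_id), paste, character(1), collapse = " ")
combined <- setNames(unname(combined), runs$values)
combined
```

Grouping by run rather than by speaker keeps the two separate turns by "sam" apart, which is the behavior this sketch assumes for a turn-of-talk data set.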
Usage
sentSplit(
dataframe,
text.var,
rm.var = NULL,
endmarks = c("?", ".", "!", "|"),
incomplete.sub = TRUE,
rm.bracket = TRUE,
stem.col = FALSE,
text.place = "right",
verbose = is.global(2),
...
)
sentCombine(text.var, grouping.var = NULL, as.list = FALSE)
TOT(tot)
sent_detect(
text.var,
endmarks = c("?", ".", "!", "|"),
incomplete.sub = TRUE,
rm.bracket = TRUE,
...
)
sent_detect_nlp(text.var, ...)
Arguments
dataframe |
A dataframe that contains the person and text variable. |
text.var |
The text variable. |
rm.var |
An optional character vector of length 1 or 2 naming the variables that are repeated measures (this will restart the "tot" column). |
endmarks |
A character vector of endmarks to split turns of talk into sentences. |
incomplete.sub |
logical. If |
rm.bracket |
logical. If |
stem.col |
logical. If |
text.place |
A character string giving placement location of the text
column. This must be one of the strings |
verbose |
logical. If |
grouping.var |
The grouping variables. Default |
as.list |
logical. If |
tot |
A tot column from a |
... |
Additional options passed to |
Value
sentSplit
- returns a dataframe with turn of talk broken apart
into sentences. Optionally a stemmed version of the text variable may be
returned as well.
sentCombine
- returns a list of vectors with the continuous
sentences by grouping.var pasted together.
TOT
- returns a numeric vector of the turns of talk without
sentence sub-indexing (e.g., 3.2 becomes 3).
sent_detect
- returns a character vector of sentences split on
endmark.
sent_detect_nlp
- returns a character vector of sentences split on
endmark.
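The value returned by TOT (turn index without the sentence sub-index) can be illustrated in base R. This is a minimal sketch, not the package's code; the `tot` vector is hypothetical and assumes the "turn.sentence" form implied by the 3.2 example above.

```r
# Hypothetical tot values in "turn.sentence" form
tot <- c("1.1", "1.2", "2.1", "3.1", "3.2", "3.3")

# Drop everything from the first period onward, then convert to numeric
turn <- as.numeric(sub("\\..*$", "", tot))
turn
```

Working on the character form avoids floating-point surprises that could arise from truncating values such as 3.10 numerically.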
Warning
sentSplit
requires the dialogue (text)
column to be cleaned in a particular way. The data should contain qdap
punctuation marks (c("?", ".", "!", "|")
) at the end of each sentence.
Additionally, extraneous punctuation, such as the periods in abbreviations, should be removed
(see replace_abbreviation
).
Trailing sentences such as I thought I... will be treated as
incomplete and marked with "|"
to denote an incomplete/trailing
sentence.
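To make the cleaning requirement concrete, the following base R sketch shows why endmarks matter. It is a simplified illustration, not what sentSplit does internally: the input string is hypothetical, a run of two or more periods is assumed to signal a trailing sentence, and splitting is assumed to happen on whitespace that follows an endmark.

```r
# Hypothetical turn of talk containing a trailing (incomplete) sentence
x <- "I thought I... Are you sure? Yes. Wait!"

# Replace a run of two or more periods with the incomplete marker "|"
x_marked <- gsub("\\.{2,}", "|", x)

# Split on whitespace that directly follows one of the endmarks ? . ! |
sents <- strsplit(x_marked, "(?<=[?.!|])\\s+", perl = TRUE)[[1]]
sents
```

An uncleaned abbreviation such as "Dr. Smith" would be split after "Dr." by this kind of rule, which is why abbreviations should be handled before splitting.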
Suggestion
It is recommended that the user run check_text
on the text column of the sentSplit
output.
Author(s)
Dason Kurkiewicz and Tyler Rinker <tyler.rinker@gmail.com>.
See Also
bracketX, incomplete_replace, stem2df, TOT
Examples
## Not run:
## `sentSplit` EXAMPLE:
(out <- sentSplit(DATA, "state"))
out %&% check_text() ## check output text
sentSplit(DATA, "state", stem.col = TRUE)
sentSplit(DATA, "state", text.place = "left")
sentSplit(DATA, "state", text.place = "original")
sentSplit(raj, "dialogue")[1:20, ]
## plotting
plot(out)
plot(out, grouping.var = "person")
out2 <- sentSplit(DATA2, "state", rm.var = c("class", "day"))
plot(out2)
plot(out2, grouping.var = "person")
plot(out2, grouping.var = "person", rm.var = "day")
plot(out2, grouping.var = "person", rm.var = c("day", "class"))
## `sentCombine` EXAMPLE:
dat <- sentSplit(DATA, "state")
sentCombine(dat$state, dat$person)
truncdf(sentCombine(dat$state, dat$sex), 50)
## `TOT` EXAMPLE:
dat <- sentSplit(DATA, "state")
TOT(dat$tot)
## `sent_detect`
sent_detect(DATA$state)
## NLP based sentence splitting
sent_detect_nlp(DATA$state)
## End(Not run)