R: Split strings of text 'x' into sentences.

text_to_sentences {ds4psy}

R Documentation

Split strings of text `x` into sentences.

Description

text_to_sentences splits text x (consisting of one or more character strings) into a vector of its constituent sentences.

Usage

text_to_sentences(
  x,
  sep = " ",
  split_delim = "\\.|\\?|!",
  force_delim = FALSE
)

Arguments

`x`	A string of text (required), typically a character vector.
`sep`	A character inserted as separator/delimiter between elements when collapsing multi-element strings of `x`. Default: `sep = " "` (i.e., insert 1 space between elements).
`split_delim`	Sentence delimiters (as regex) used to split the collapsed string of `x` into substrings. Default: `split_delim = "\\.|\\?|!"` (rather than `"[[:punct:]]"`).
`force_delim`	Boolean: Enforce splitting at `split_delim`? If `force_delim = FALSE` (as per default), a standard sentence-splitting pattern is assumed: `split_delim` is followed by one or more blank spaces and a capital letter. If `force_delim = TRUE`, splits at `split_delim` are enforced (without considering spacing or capitalization).

Details

The splits of x occur at the given punctuation marks (provided as a regular expression, default: split_delim = "\\.|\\?|!"). Leading and trailing spaces are removed before returning a vector of the remaining character sequences (i.e., the sentences).
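As a brief illustration of why the default delimiters are written with doubled backslashes, the following base R sketch compares an escaped and an unescaped dot. It uses only strsplit and nchar; the variable names are chosen here for illustration and do not come from the package.

```r
# In R source code, "\\." is a two-character string: a backslash and a dot.
delim <- "\\."
nchar(delim)

# Used as a regex, it matches a literal dot only:
escaped <- strsplit("a.b", split = delim)[[1]]
escaped

# An unescaped "." is a regex wildcard, so every character is treated
# as a delimiter and only empty strings remain:
unescaped <- strsplit("a.b", split = ".")[[1]]
unescaped
```

The same reasoning applies to "\\?", since an unescaped question mark is a regex quantifier rather than a literal character.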

The Boolean argument force_delim distinguishes between two splitting modes:

If force_delim = FALSE (as per default), a standard sentence-splitting pattern is assumed: A sentence delimiter in split_delim must be followed by one or more blank spaces and a capital letter starting the next sentence. Sentence delimiters in split_delim are not removed from the output.
If force_delim = TRUE, the function enforces splits at each delimiter in split_delim. For instance, any dot (i.e., the metacharacter "\\.") is interpreted as a full stop, so that sentences containing dots mid-sentence (e.g., in abbreviations) are split into parts. Sentence delimiters in split_delim are removed from the output.

Internally, text_to_sentences first uses paste to collapse strings (adding sep between elements) and then strsplit to split strings at split_delim.
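The two splitting modes and the paste/strsplit sequence described above can be approximated in base R roughly as follows. This is a minimal sketch, not the package's actual implementation: split_sketch is a hypothetical helper, the lookaround pattern is an assumption about how the "delimiter, spaces, capital letter" rule might be expressed, and it presumes that every alternative in split_delim has a fixed length (as in the default) and that capital letters are ASCII.

```r
# Hypothetical helper (NOT part of ds4psy): approximates the documented behavior.
split_sketch <- function(x, sep = " ", split_delim = "\\.|\\?|!",
                         force_delim = FALSE) {
  # Step 1: collapse all elements of x into one string, inserting sep.
  s <- paste(x, collapse = sep)

  if (force_delim) {
    # Split at every delimiter; strsplit() drops the matched delimiters.
    parts <- strsplit(s, split = split_delim)[[1]]
  } else {
    # Split only at whitespace that follows a delimiter and precedes a
    # capital letter. Lookarounds keep the delimiter in the output.
    # (Assumed pattern; requires perl = TRUE.)
    pattern <- paste0("(?<=", split_delim, ")\\s+(?=[A-Z])")
    parts <- strsplit(s, split = pattern, perl = TRUE)[[1]]
  }

  # Step 2: trim surrounding whitespace and drop empty pieces.
  parts <- trimws(parts)
  parts[parts != ""]
}

x <- c("A first sentence. Exclamation sentence!",
       "Any questions? But etc. can be tricky. A fourth --- and final --- sentence.")
split_sketch(x)                      # delimiters kept, "etc. can" not split
split_sketch(x, force_delim = TRUE)  # delimiters removed, "etc." split off
```

In this sketch, the default mode leaves "But etc. can be tricky." intact because the dot after "etc" is followed by a lowercase letter, whereas the forced mode splits there as well.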

Value

A character vector (of sentences).

Examples

x <- c("A first sentence. Exclamation sentence!", 
       "Any questions? But etc. can be tricky. A fourth --- and final --- sentence.")
text_to_sentences(x)
text_to_sentences(x, force_delim = TRUE)

# Changing split delimiters:
text_to_sentences(x, split_delim = "\\.")  # only split at "."

text_to_sentences("Buy apples, berries, and coconuts.")
text_to_sentences("Buy apples, berries; and coconuts.", 
                  split_delim = ",|;|\\.", force_delim = TRUE)
                  
text_to_sentences(c("123. 456? 789! 007 etc."), force_delim = TRUE)

# Split multi-element strings (w/o punctuation):
e3 <- c("12", "34", "56")
text_to_sentences(e3, sep = " ")  # Default: Collapse strings adding 1 space, but: 
text_to_sentences(e3, sep = ".", force_delim = TRUE)  # insert sep and force split.

# Punctuation within sentences:
text_to_sentences("Dr. who is left intact.")
text_to_sentences("Dr. Who is problematic.")

[Package ds4psy version 1.0.0 Index]
