strcut_loc {tinycodet}    R Documentation

Cut Strings

Description

The strcut_loc() function cuts every string in a character vector around a location range loc, such that every string is cut into the following parts:

  • the sub-string before loc;

  • the sub-string at loc itself;

  • the sub-string after loc.

The location range loc would usually be a matrix with 2 columns, giving the start and end points of some pattern match.
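
To make the three-part cut concrete, here is a minimal base R sketch for a single string. It only illustrates the idea and is not the package's implementation; the names x, loc and parts are chosen for illustration, and the range is assumed to be 1-based and inclusive.

```r
# Illustration only: cut one string around a location range using base R.
x <- "hello world"
loc <- c(5, 7)  # assumed start and end of the middle part (1-based, inclusive)

parts <- c(
  substr(x, 1, loc[1] - 1),        # part before the range
  substr(x, loc[1], loc[2]),       # part at the range
  substr(x, loc[2] + 1, nchar(x))  # part after the range
)
print(parts)

# Nothing is removed: pasting the parts back together restores the string.
stopifnot(identical(paste0(parts, collapse = ""), x))
```

With tinycodet attached, strcut_loc(x, c(5, 7)) would be expected to return the corresponding pieces as a one-row matrix.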

The strcut_brk() function (a wrapper around stri_split_boundaries(..., tokens_only = FALSE)) cuts every string into individual text breaks (like character, word, line, or sentence boundaries).

Usage

strcut_loc(str, loc)

strcut_brk(str, type = "character", tolist = FALSE, n = -1L, ...)

Arguments

str

a string or character vector.

loc

One of the following:

  • the result from the stri_locate_ith function.

  • a matrix of 2 integer columns, with nrow(loc)==length(str), giving the location range of the middle part.

  • a vector of length 2, giving the location range of the middle part.
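
As a rough illustration of the matrix form, the sketch below builds a 2-column location matrix with one row per string using base R only. The variable names are hypothetical, and marking "no match" with NA follows the convention of stringi's locate functions; how strcut_loc() treats such rows is not covered by this sketch.

```r
# Illustration only: build a 2-column location matrix with base R.
str <- c("a-b", "cd-e", "fgh")
m <- regexpr("-", str, fixed = TRUE)  # first match per string; -1 if none

start <- as.integer(m)
end <- start + attr(m, "match.length") - 1L
no_match <- start == -1L
start[no_match] <- NA_integer_        # mark "no match" as NA
end[no_match] <- NA_integer_

loc <- cbind(start = start, end = end)
stopifnot(nrow(loc) == length(str))   # one row per string
print(loc)
```

A matrix like this satisfies nrow(loc) == length(str), so each string gets its own location range, whereas a length-2 vector supplies a single range.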

type

either one of the following:

  • a single string giving the break iterator type (i.e. "character", "line_break", "sentence", "word", or a custom set of ICU break iteration rules).

  • a list with break iteration options, like a list produced by stri_opts_brkiter.

[BOUNDARIES]

tolist

logical, indicating if strcut_brk should return a list (TRUE), or a matrix (FALSE, default).

n

see stri_split_boundaries.

...

additional arguments to be passed to stri_split_boundaries.

Details

The main difference between the strcut_ functions and stri_split / strsplit is that the latter generally remove the delimiter patterns from a string when cutting, whereas the strcut_ functions do not attempt to remove any part of the string by default; they only cut the string into separate pieces. Moreover, the strcut_ functions return a matrix by default.
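
The difference between splitting and cutting can be sketched in base R alone. The snippet below is an illustration under the assumption of a single fixed delimiter, not a description of how the strcut_ functions are implemented; all variable names are hypothetical.

```r
# Illustration only: splitting drops the delimiter, cutting keeps it.
x <- "key=value"

# strsplit() removes the delimiter pattern:
split_pieces <- strsplit(x, "=", fixed = TRUE)[[1]]
print(split_pieces)

# A cut keeps it: combine the non-matched parts with the matched part.
m <- regexpr("=", x, fixed = TRUE)
around <- regmatches(x, m, invert = TRUE)[[1]]  # text before and after the match
cut_pieces <- c(around[1], regmatches(x, m), around[2])
print(cut_pieces)

# Only the cut pieces paste back into the original string.
print(identical(paste0(cut_pieces, collapse = ""), x))
print(identical(paste0(split_pieces, collapse = ""), x))
```

Because a cut discards nothing, its pieces can always be concatenated back into the input, which is not true of the split result.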

Value

For strcut_loc():
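A character matrix with length(str) rows and 3 columns, where for every row i it holds the following:

  • the first column contains the sub-string of str[i] before the location range;

  • the second column contains the sub-string of str[i] at the location range (the middle part);

  • the third column contains the sub-string of str[i] after the location range.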

For strcut_brk(..., tolist = FALSE):
A character matrix with length(str) rows and a number of columns equal to the maximum number of pieces into which any string in str was cut.
Rows with fewer pieces are padded with NA.

For strcut_brk(..., tolist = TRUE):
A list with length(str) elements, where each element is a character vector containing the cut string.
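
The relation between the list form and the matrix form can be sketched as follows. This is a base R illustration with made-up pieces, assuming only what is stated above (one row per string, empty places filled with NA); it is not the package's own code.

```r
# Illustration only: pad a list of pieces into a character matrix with NA.
pieces <- list(c("Spam", ", ", "eggs"), c("bacon"))  # hypothetical cut results
n_col <- max(lengths(pieces))  # widest row decides the number of columns

mat <- do.call(rbind, lapply(pieces, function(p) {
  c(p, rep(NA_character_, n_col - length(p)))  # fill empty places with NA
}))
print(dim(mat))
print(mat)
```

The list form avoids the padding, which may be preferable when the strings are cut into very different numbers of pieces.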

See Also

tinycodet_strings

Examples



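# cut each string around a located pattern match:
x <- rep(paste0(1:10, collapse = ""), 10)
print(x)
loc <- stri_locate_ith(x, 1:10, fixed = as.character(1:10))
strcut_loc(x, loc)
# a vector of length 2 as the location range:
strcut_loc(x, c(5, 5))
# location ranges containing NA:
strcut_loc(x, c(NA, NA))
strcut_loc(x, c(5, NA))
strcut_loc(x, c(NA, 5))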

test <- "The\u00a0above-mentioned    features are very useful. " %s+%
"Spam, spam, eggs, bacon, and spam. 123 456 789"
strcut_brk(test, "line")
strcut_brk(test, "word")
strcut_brk(test, "sentence")
strcut_brk(test)
strcut_brk(test, n = 1)
strcut_brk(test, "line", tolist = TRUE)
strcut_brk(test, "word", tolist = TRUE)
strcut_brk(test, "sentence", tolist = TRUE)

brk <- stringi::stri_opts_brkiter(
  type = "line"
)
strcut_brk(test, brk)


[Package tinycodet version 0.5.0 Index]