substr_ctl {fansi} | R Documentation |
Control Sequence Aware Version of substr
Description
substr_ctl
is a drop-in replacement for substr
. Performance is
slightly slower than substr
, and more so for type = 'width'
. Special
Control Sequences will be included in the substrings to reflect their format
as it was when part of the source string. substr2_ctl
adds the
ability to extract substrings based on grapheme count or display width in
addition to the normal character count, as well as several other options.
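As a brief illustration of why a Control Sequence aware version is useful, the sketch below contrasts base substr with substr_ctl on a styled string. The input string and the variable names are chosen for illustration only, and strip_ctl is used to compare visible text because the exact escapes fansi emits may vary across versions.

```r
library(fansi)

# "red" styled with a red foreground, followed by unstyled " text"
x <- "\033[31mred\033[m text"

# Base substr() counts each character of the escape sequence as a position,
# so positions 1-3 fall inside the sequence itself
base_res <- substr(x, 1, 3)
base_res                      # "\033[3"

# substr_ctl() only counts the non-control characters
ctl_res <- substr_ctl(x, 1, 3)

# Compare the visible text rather than the exact escapes
strip_ctl(ctl_res)            # "red"
```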
Usage
substr_ctl(
x,
start,
stop,
warn = getOption("fansi.warn", TRUE),
term.cap = getOption("fansi.term.cap", dflt_term_cap()),
ctl = "all",
normalize = getOption("fansi.normalize", FALSE),
carry = getOption("fansi.carry", FALSE),
terminate = getOption("fansi.terminate", TRUE)
)
substr2_ctl(
x,
start,
stop,
type = "chars",
round = "start",
tabs.as.spaces = getOption("fansi.tabs.as.spaces", FALSE),
tab.stops = getOption("fansi.tab.stops", 8L),
warn = getOption("fansi.warn", TRUE),
term.cap = getOption("fansi.term.cap", dflt_term_cap()),
ctl = "all",
normalize = getOption("fansi.normalize", FALSE),
carry = getOption("fansi.carry", FALSE),
terminate = getOption("fansi.terminate", TRUE)
)
substr_ctl(
x,
start,
stop,
warn = getOption("fansi.warn", TRUE),
term.cap = getOption("fansi.term.cap", dflt_term_cap()),
ctl = "all",
normalize = getOption("fansi.normalize", FALSE),
carry = getOption("fansi.carry", FALSE),
terminate = getOption("fansi.terminate", TRUE)
) <- value
substr2_ctl(
x,
start,
stop,
type = "chars",
round = "start",
tabs.as.spaces = getOption("fansi.tabs.as.spaces", FALSE),
tab.stops = getOption("fansi.tab.stops", 8L),
warn = getOption("fansi.warn", TRUE),
term.cap = getOption("fansi.term.cap", dflt_term_cap()),
ctl = "all",
normalize = getOption("fansi.normalize", FALSE),
carry = getOption("fansi.carry", FALSE),
terminate = getOption("fansi.terminate", TRUE)
) <- value
Arguments
x |
a character vector or object that can be coerced to such. |
start |
integer. The first element to be extracted or replaced. |
stop |
integer. The last element to be extracted or replaced. |
warn |
TRUE (default) or FALSE, whether to warn when potentially
problematic Control Sequences are encountered. These could cause the
assumptions |
term.cap |
character, a vector of the capabilities of the terminal, can
be any combination of "bright" (SGR codes 90-97, 100-107), "256" (SGR codes
starting with "38;5" or "48;5"), "truecolor" (SGR codes starting with
"38;2" or "48;2"), and "all". "all" behaves as it does for the |
ctl |
character, which Control Sequences should be treated
specially. Special treatment is context dependent, and may include
detecting them and/or computing their display/character width as zero. For
the SGR subset of the ANSI CSI sequences, and OSC hyperlinks,
|
normalize |
TRUE or FALSE (default) whether SGR sequences should be
normalized out such that there is one distinct sequence for each SGR code.
Normalized strings will occupy more space (e.g. "\033[31;42m" becomes
"\033[31m\033[42m"), but will work better with code that assumes each SGR
code will be in its own escape as |
carry |
TRUE, FALSE (default), or a scalar string, controls whether to
interpret the character vector as a "single document" (TRUE or string) or
as independent elements (FALSE). In "single document" mode, active state
at the end of an input element is considered active at the beginning of the
next vector element, simulating what happens with a document with active
state at the end of a line. If FALSE each vector element is interpreted as
if there were no active state when it begins. If character, then the
active state at the end of the |
terminate |
TRUE (default) or FALSE whether substrings should have
active state closed to avoid it bleeding into other strings they may be
prepended onto. This does not stop state from carrying if |
type |
character(1L) partial matching
|
round |
character(1L) partial matching
|
tabs.as.spaces |
FALSE (default) or TRUE, whether to convert tabs to
spaces (and suppress tab related warnings). This can only be set to TRUE if
|
tab.stops |
integer(1:n) indicating position of tab stops to use when converting tabs to spaces. If there are more tabs in a line than defined tab stops the last tab stop is re-used. For the purposes of applying tab stops, each input line is considered a line and the character count begins from the beginning of the input line. |
value |
a character vector or object that can be coerced to such. |
Value
A character vector of the same length and with the same attributes as x (after possible coercion and re-encoding to UTF-8).
Control and Special Sequences
Control Sequences are non-printing characters or sequences of characters.
Special Sequences are a subset of the Control Sequences, and include CSI
SGR sequences which can be used to change rendered appearance of text, and
OSC hyperlinks. See fansi
for details.
Position Semantics
When computing substrings, Normal (non-control) characters are considered to occupy positions in strings, whereas Control Sequences occupy the interstices between them. The string:
"hello-\033[31mworld\033[m!"
is interpreted as:
                  1 1 1
1 2 3 4 5 6 7 8 9 0 1 2
h e l l o -|w o r l d|!
           ^         ^
           \033[31m  \033[m
start
and stop
reference character positions so they never explicitly
select for the interstitial Control Sequences. The latter are implicitly
selected if they appear in interstices after the first character and before
the last. Additionally, because Special Sequences (CSI SGR and OSC
hyperlinks) affect all subsequent characters in a string, any active Special
Sequence, whether opened just before a character or much before, will be
reflected in the state fansi
prepends to the beginning of each substring.
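The position rules above can be checked with a small sketch. It reuses the string from the diagram; the variable names are illustrative, and the comparisons are made on stripped text because the exact sequences used to open and close state are not guaranteed.

```r
library(fansi)

x <- "hello-\033[31mworld\033[m!"

# Control Sequences occupy no positions: only 12 characters are counted
plain <- strip_ctl(x)
nchar(plain)                  # 12

# Positions 1-5 precede the first sequence, so no state is involved
head_part <- substr_ctl(x, 1, 5)

# Positions 8-10 start after the red sequence was opened; the active
# state is still reflected at the beginning of the substring
inner <- substr_ctl(x, 8, 10)
strip_ctl(inner)              # "orl"
has_ctl(inner)                # TRUE: state was prepended by fansi
```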
It is possible to select Control Sequences at the end of a string by
specifying stop
values past the end of the string, although for Special
Sequences this only produces visible results if terminate
is set to
FALSE
. Similarly, it is possible to select Control Sequences preceding
the beginning of a string by specifying start
values less than one,
although as noted earlier this is unnecessary for Special Sequences as
those are output by fansi
before each substring.
Because exact substrings on anything other than character count cannot be
guaranteed (e.g. as a result of multi-byte encodings, or double display-width
characters) substr2_ctl
must make assumptions on how to resolve provided
start
/stop
values that are infeasible and does so via the round
parameter.
If we use "start" as the round
value, then any time the start
value corresponds to the middle of a multi-byte or a wide character, then
that character is included in the substring, while any similar partially
included character via the stop
is left out. The converse is true if we
use "stop" as the round
value. "neither" causes all partial
characters to be dropped irrespective of whether they correspond to start
or
stop
, and "both" causes all of them to be included. See examples.
A number of Normal characters such as combining diacritic marks have
reported width of zero. These are typically displayed overlaid on top of the
preceding glyph, as in the case of "e\u301"
forming "e" with an acute
accent. Unlike Control Sequences, which also have reported width of zero,
fansi
groups zero-width Normal characters with the last preceding
non-zero width Normal character. This is incorrect for some rare
zero-width Normal characters such as prepending marks (see "Output
Stability" and "Graphemes").
Output Stability
Several factors could affect the exact output produced by fansi
functions across versions of fansi
, R
, and/or across systems.
In general it is best not to rely on exact fansi
output, e.g. by
embedding it in tests.
Width and grapheme calculations depend on locale, Unicode database
version, and grapheme processing logic (which is still in development), among
other things. For the most part fansi
(currently) uses the internals of
base::nchar(type='width')
, but there are exceptions and this may change in
the future.
How a particular display format is encoded in Control Sequences is
not guaranteed to be stable across fansi
versions. Additionally, which
Special Sequences are re-encoded vs transcribed untouched may change.
In general we will strive to keep the rendered appearance stable.
To maximize the odds of getting stable output set normalize
to
TRUE
and type
to "chars"
in functions that allow it, and
set term.cap
to a specific set of capabilities.
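One way to follow this advice in a test is to pin the relevant parameters explicitly and assert on the visible text rather than on the exact escapes. The following is a minimal sketch under that assumption; the input string and the chosen term.cap values are illustrative, not a recommendation from the package.

```r
library(fansi)

x <- "\033[31;42mhello world"

# Pin the parameters that influence how state is encoded instead of
# relying on global options
out <- substr2_ctl(
  x, 1, 5,
  type = "chars",
  normalize = TRUE,
  term.cap = c("bright", "256", "truecolor")
)

# Assert on the rendered text only; the escapes themselves may change
# across fansi versions
visible <- strip_ctl(out)
stopifnot(identical(visible, "hello"))
```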
Replacement Functions
Semantics for replacement functions have the additional requirement that the
result appear as if it is the input modified in place between the positions
designated by start
and stop
. terminate
only affects the boundaries
between the original substring and the spliced one, normalize
only affects
the same boundaries, and tabs.as.spaces
only affects value
. In addition, x
must
be ASCII only or marked "UTF-8".
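The "modified in place" requirement can be illustrated by comparing against base R's replacement form of substr on the unstyled text. This is a sketch with illustrative variable names; it checks only the visible characters, not the escapes fansi inserts at the splice boundaries.

```r
library(fansi)

styled <- "\033[42mABC"
plain <- "ABC"

# Replace the second visible character in the styled string
substr_ctl(styled, 2, 2) <- "_"

# Same replacement with base R on the unstyled text
substr(plain, 2, 2) <- "_"

# The visible text matches what base R produces
identical(strip_ctl(styled), plain)   # TRUE
```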
terminate = FALSE
only makes sense in replacement mode if only one of x
or value
contains Control Sequences. fansi
will not account for any
interactions of state in x
and value
.
The carry
parameter causes state to carry within the original string and
the replacement values independently, as if they were columns of text cut
from different pages and pasted together. String values for carry
are
disallowed in replacement mode as it is ambiguous which of x
or value
they would modify (see examples).
When in type = 'width'
mode, it is only guaranteed that the result will be
no wider than the original x
. Narrower strings may result if a mixture
of narrow and wide graphemes cannot be replaced exactly with the same width
value, possibly because the provided start
and stop
values (or the
implicit ones generated for value
) do not align with grapheme boundaries.
Graphemes
fansi
approximates grapheme widths and counts by using heuristics for
grapheme breaks that work for most common graphemes, including emoji
combining sequences. The heuristic is known to work incorrectly with
invalid combining sequences, prepending marks, and sequence interruptors.
fansi
does not provide a full implementation of grapheme break detection to
avoid carrying a copy of the Unicode grapheme breaks table, and also because
the hope is that R will add the feature eventually itself.
The utf8
package provides a
conforming grapheme parsing implementation.
Bidirectional Text
fansi
is unaware of text directionality and operates as if all strings are
left to right (LTR). Using fansi
functions with strings that contain mixed
direction scripts (i.e. both LTR and RTL) may produce undesirable results.
Note
Non-ASCII strings are converted to and returned in UTF-8 encoding. Width calculations will not work properly in R < 3.2.2.
If stop
< start
, the return value is always an empty string.
See Also
?fansi
for details on how Control Sequences are
interpreted, particularly if you are getting unexpected results,
normalize_state
for more details on what the normalize
parameter does,
state_at_end
to compute active state at the end of strings,
close_state
to compute the sequence required to close active state.
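As a sketch of how the last two helpers relate to the terminate parameter, the code below extracts a substring with terminate = FALSE and then closes the state manually. Variable names are illustrative, and the checks only test whether any state remains active, since the exact closing sequence is up to fansi.

```r
library(fansi)

# Leave the background color open at the end of the substring
bleed <- substr_ctl("\033[41mhello", 1, 3, terminate = FALSE)

# Some state is still active at the end of the string
open_state <- state_at_end(bleed)
nzchar(open_state)

# Append the sequence that closes whatever state is active
closed <- paste0(bleed, close_state(bleed))

# No state remains active after closing
nzchar(state_at_end(closed))
```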
Examples
substr_ctl("\033[42mhello\033[m world", 1, 9)
substr_ctl("\033[42mhello\033[m world", 3, 9)
## Positions 2 and 4 are in the middle of the full width W (\uFF37) for
## the `start` and `stop` positions respectively. Use `round`
## to control result:
x <- "\uFF37n\uFF37"
x
substr2_ctl(x, 2, 4, type='width', round='start')
substr2_ctl(x, 2, 4, type='width', round='stop')
substr2_ctl(x, 2, 4, type='width', round='neither')
substr2_ctl(x, 2, 4, type='width', round='both')
## We can specify which escapes are considered special:
substr_ctl("\033[31mhello\tworld", 1, 6, ctl='sgr', warn=FALSE)
substr_ctl("\033[31mhello\tworld", 1, 6, ctl=c('all', 'c0'), warn=FALSE)
## `carry` allows SGR to carry from one element to the next
substr_ctl(c("\033[33mhello", "world"), 1, 3)
substr_ctl(c("\033[33mhello", "world"), 1, 3, carry=TRUE)
substr_ctl(c("\033[33mhello", "world"), 1, 3, carry="\033[44m")
## We can omit the termination
bleed <- substr_ctl(c("\033[41mhello", "world"), 1, 3, terminate=FALSE)
writeLines(bleed) # Style will bleed out of string
end <- "\033[0m\n"
writeLines(end) # Stanch bleeding
## Trailing sequences omitted unless `stop` past end.
substr_ctl("ABC\033[42m", 1, 3, terminate=FALSE)
substr_ctl("ABC\033[42m", 1, 4, terminate=FALSE)
## Replacement functions
x0 <- x1 <- x2 <- x3 <- c("\033[42mABC", "\033[34mDEF")
substr_ctl(x1, 2, 2) <- "_"
substr_ctl(x2, 2, 2) <- "\033[m_"
substr_ctl(x3, 2, 2) <- "\033[45m_"
writeLines(c(x0, end, x1, end, x2, end, x3, end))
## With `carry = TRUE` strings look like original
x0 <- x1 <- x2 <- x3 <- c("\033[42mABC", "\033[34mDEF")
substr_ctl(x0, 2, 2, carry=TRUE) <- "_"
substr_ctl(x1, 2, 2, carry=TRUE) <- "\033[m_"
substr_ctl(x2, 2, 2, carry=TRUE) <- "\033[45m_"
writeLines(c(x0, end, x1, end, x2, end, x3, end))
## Work-around to specify carry strings in replacement mode
x <- c("ABC", "DEF")
val <- "#"
x2 <- c("\033[42m", x)
val2 <- c("\033[45m", rep_len(val, length(x)))
substr_ctl(x2, 2, 2, carry=TRUE) <- val2
(x <- x2[-1])  # drop the leading element that only carried the state