R: Details About Manipulation of Strings Containing Control...

fansi {fansi}

R Documentation

Details About Manipulation of Strings Containing Control Sequences

Description

Counterparts to R string manipulation functions that account for the effects of some ANSI X3.64 (a.k.a. ECMA-48, ISO-6429) control sequences.

Control Characters and Sequences

Control characters and sequences are non-printing inline characters or sequences initiated by them that can be used to modify terminal display and behavior, for example by changing text color or cursor position.

We will refer to X3.64/ECMA-48/ISO-6429 control characters and sequences as "Control Sequences" hereafter.

There are four types of Control Sequences that fansi can treat specially:

"C0" control characters, such as tabs and carriage returns (we include delete in this set, even though technically it is not part of it).
Sequences starting in "ESC[", also known as Control Sequence Introducer (CSI) sequences, of which the Select Graphic Rendition (SGR) sequences used to format terminal output are a subset.
Sequences starting in "ESC]", also known as Operating System Commands (OSC), of which the subset beginning with "8" is used to encode URI based hyperlinks.
Sequences starting in "ESC" and followed by something other than "[" or "]".

Control Sequences starting with ESC are assumed to be two characters long (including the ESC) unless they are of the CSI or OSC variety, in which case their length is computed as per the ECMA-48 specification, with the exception that OSC hyperlinks may be terminated with BEL ("\a") in addition to ST ("ESC\"). fansi handles most common Control Sequences in its parsing algorithms, but it is not a conforming implementation of ECMA-48. For example, there are non-CSI/OSC escape sequences that may be longer than two characters, but fansi will (incorrectly) treat them as if they were two characters long. There are many more unimplemented ECMA-48 specifications.

In theory it is possible to encode CSI sequences with a single byte introducing character in the 0x40-0x5F range instead of the traditional "ESC[". Since this is rare and it conflicts with UTF-8 encoding, fansi does not support it.

Within Control Sequences, fansi further distinguishes CSI SGR and OSC hyperlinks by recording format specification and URIs into string state, and applying the same to any output strings according to the semantics of the functions in use. CSI SGR and OSC hyperlinks are known together as Special Sequences. See the following sections for details.

Additionally, all Control Sequences, whether special or not, do not count as characters, graphemes, or display width. You can cause fansi to treat particular Control Sequences as regular characters with the ctl parameter.

CSI SGR Control Sequences

NOTE: not all displays support CSI SGR sequences; run term_cap_test to see whether your display supports them.

CSI SGR Control Sequences are the subset of CSI sequences that can be used to change text appearance (e.g. color). These sequences begin with "ESC[" and end in "m". fansi interprets these sequences and writes new ones to the output strings in such a way that the original formatting is preserved. In most cases this should be transparent to the user.

Occasionally there may be mismatches between how fansi and a display interpret the CSI SGR sequences, which may produce display artifacts. The most likely source of artifacts are Control Sequences that move the cursor or change the display, or that fansi otherwise fails to interpret, such as:

Unknown SGR substrings.
"C0" control characters like tabs and carriage returns.
Other escape sequences.

Another possible source of problems is that different displays parse and interpret control sequences differently. The common CSI SGR sequences that you are likely to encounter in formatted text tend to be treated consistently, but less common ones are not. fansi tries to hew by the ECMA-48 specification for CSI SGR control sequences, but not all terminals do.

The most likely source of problems will be 24-bit CSI SGR sequences. For example, a 24-bit color sequence such as "ESC[38;2;31;42;4" is a single foreground color to a terminal that supports it, or separate foreground, background, faint, and underline specifications for one that does not. fansi will always interpret the sequences according to ECMA-48, but it will warn you if encountered sequences exceed those specified by the term.cap parameter or the "fansi.term.cap" global option.

fansi will will also warn if it encounters Control Sequences that it cannot interpret. You can turn off warnings via the warn parameter, which can be set globally via the "fansi.warn" option. You can work around "C0" tabs characters by turning them into spaces first with tabs_as_spaces or with the tabs.as.spaces parameter available in some of the fansi functions

fansi interprets CSI SGR sequences in cumulative "Graphic Rendition Combination Mode". This means new SGR sequences add to rather than replace previous ones, although in some cases the effect is the same as replacement (e.g. if you have a color active and pick another one).

OSC Hyperlinks

Operating System Commands are interpreted by terminal emulators typically to engage actions external to the display of text proper, such as setting a window title or changing the active color palette.

Some terminals have added support for associating URIs to text with OSCs in a similar way to anchors in HTML, so fansi interprets them and outputs or terminates them as needed. For example:

"\033]8;;xy.z\033\\LINK\033]8;;\033\\"

Might be interpreted as link to the URI "x.z". To make the encoding pattern clearer, we replace "\033]" with "<OSC>" and "\033\\" with "<ST>" below:

<OSC>8;;URI<ST>LINK TEXT<OSC>8;;<ST>

State Interactions

The cumulative nature of state as specified by SGR or OSC hyperlinks means that unterminated strings that are spliced will interact with each other. By extension, a substring does not inherently contain all the information required to recreate its state as it appeared in the source document. The default fansi configuration terminates extracted substrings and prepends original state to them so they present on a stand-alone basis as they did as part of the original string.

To allow state in substrings to affect subsequent strings set terminate = FALSE, but you will need to manually terminate them or deal with the consequences of not doing so (see "Terminal Quirks").

By default, fansi assumes that each element in an input character vector is independent, but this is incorrect if the input is a single document with each element a line in it. In that situation state from each line should bleed into subsequent ones. Setting carry = TRUE enables the "single document" interpretation.

To most closely approximate what writeLines(x) produces on your terminal, where x is a stateful string, use writeLines(fansi_fun(x, carry=TRUE, terminate=FALSE)). fansi_fun is a stand-in for any of the fansi string manipulation functions. Note that even with a seeming "null-op" such as substr_ctl(x, 1, nchar_ctl(x), carry=TRUE, terminate=FALSE) the output control sequences may not match the input ones, but the output should look the same if displayed to the terminal.

fansi strings will be affected by any active state in strings they are appended to. There are no parameters to control what happens in this case, but fansi provides functions that can help the user get the desired behavior. state_at_end computes the active state the end of a string, which can then be prepended onto the input of fansi functions so that they are aware of the active style at the beginning of the string. Alternatively, one could use close_state(state_at_end(...)) and pre-pend that to the output of fansi functions so they are unaffected by preceding SGR. One could also just prepend "ESC[0m", but in some cases as described in ?normalize_state that is sub-optimal.

If you intend to combine stateful fansi manipulated strings with your own, it may be best to set normalize = TRUE for improved compatibility (see ?normalize_state.)

Terminal Quirks

Some terminals (e.g. OS X terminal, ITerm2) will pre-paint the entirety of a new line with the currently active background before writing the contents of the line. If there is a non-default active background color, any unwritten columns in the new line will keep the prior background color even if the new line changes the background color. To avoid this be sure to use terminate = TRUE or to manually terminate each line with e.g. "ESC[0m". The problem manifests as:

" " = default background
"#" = new background
">" = start new background
"!" = restore default background

+-----------+
| abc\n     |
|>###\n     |
|!abc\n#####| <- trailing "#" after newline are from pre-paint
| abc       |
+-----------+

The simplest way to avoid this problem is to split input strings by any newlines they contain, and use terminate = TRUE (the default). A more complex solution is to pad with spaces to the terminal window width before emitting the newline to ensure the pre-paint is overpainted with the current line's prevailing background color.

Encodings / UTF-8

fansi will convert any non-ASCII strings to UTF-8 before processing them, and fansi functions that return strings will return them encoded in UTF-8. In some cases this will be different to what base R does. For example, substr re-encodes substrings to their original encoding.

Interpretation of UTF-8 strings is intended to be consistent with base R. There are three ways things may not work out exactly as desired:

fansi, despite its best intentions, handles a UTF-8 sequence differently to the way R does.
R incorrectly handles a UTF-8 sequence.
Your display incorrectly handles a UTF-8 sequence.

These issues are most likely to occur with invalid UTF-8 sequences, combining character sequences, and emoji. For example, whether special characters such as emoji are considered one or two wide evolves as software implements newer versions the Unicode databases.

Internally, fansi computes the width of most UTF-8 character sequences outside of the ASCII range using the native R_nchar function. This will cause such characters to be processed slower than ASCII characters. Unlike R (at least as of version 4.1), fansi can account for graphemes.

Because fansi implements its own internal UTF-8 parsing it is possible that you will see results different from those that R produces even on strings without Control Sequences.

Overflow

The maximum length of input character vector elements allowed by fansi is the 32 bit INT_MAX, excluding the terminating NULL. As of R4.1 this is the limit for R character vector elements generally, but is enforced at the C level by fansi nonetheless.

It is possible that during processing strings that are shorter than INT_MAX would become longer than that. fansi checks for that overflow and will stop with an error if that happens. A work-around for this situation is to break up large strings into smaller ones. The limit is on each element of a character vector, not on the vector as a whole. fansi will also error on your system if R_len_t, the R type used to measure string lengths, is less than the processed length of the string.

R < 3.2.2 support

Nominally you can build and run this package in R versions between 3.1.0 and 3.2.1. Things should mostly work, but please be aware we do not run the test suite under versions of R less than 3.2.2. One key degraded capability is width computation of wide-display characters. Under R < 3.2.2 fansi will assume every character is 1 display width. Additionally, fansi may not always report malformed UTF-8 sequences as it usually does. One exception to this is nchar_ctl as that is just a thin wrapper around base::nchar.

[Package fansi version 1.0.6 Index]