R: Decode and Encode HTML Entities

HTMLencode {textutils}

R Documentation

Decode and Encode HTML Entities

Description

Decode and encode HTML entities.

Usage

HTMLdecode(x, named = TRUE, hex = TRUE, decimal = TRUE)
HTMLencode(x, use.iconv = FALSE, encode.only = NULL)
HTMLrm(x, ...)

Arguments

`x`	`HTMLdecode`, `HTMLencode`: a character vector of length one; for `HTMLrm`: a character vector
`use.iconv`	logical. Should conversion via `iconv` be tried from native encoding to UTF-8?
`named`	logical: replace named character references?
`hex`	logical: replace hexadecimal character references?
`decimal`	logical: replace decimal character references?
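`encode.only`	character: the characters to be replaced. If `NULL` (the default), all characters for which named entities exist are replaced.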
`...`	other arguments

Details

HTMLdecode replaces named, hexadecimal and decimal character references as defined by HTML5 (see References) with characters. The resulting character vector is marked as UTF-8 (see Encoding).
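To make the difference between hexadecimal and decimal character references concrete, here is a minimal base-R sketch. The helper decode_numeric_refs is hypothetical and is not part of textutils; it only illustrates how numeric references map to characters and how flags like hex and decimal could switch each kind on or off. It does not handle named references, and it is not the package's implementation.

```r
## Hypothetical helper (not part of textutils): decode numeric character
## references in a single string, using base R only.
decode_numeric_refs <- function(s, hex = TRUE, decimal = TRUE) {
    stopifnot(is.character(s), length(s) == 1L)

    ## find all references of the form &#NNN; or &#xHHH;
    m <- gregexpr("&#([xX][0-9A-Fa-f]+|[0-9]+);", s, perl = TRUE)
    refs <- regmatches(s, m)[[1L]]
    if (!length(refs))
        return(s)

    ## strip the "&#", an optional "x", and the trailing ";"
    is_hex <- grepl("^&#[xX]", refs)
    digits <- gsub("^&#[xX]?|;$", "", refs)

    ## convert the digits to code points, base 16 or base 10
    codes <- integer(length(refs))
    codes[is_hex]  <- strtoi(digits[is_hex],  base = 16L)
    codes[!is_hex] <- strtoi(digits[!is_hex], base = 10L)

    ## replace only those kinds of references that are switched on
    decode <- (is_hex & hex) | (!is_hex & decimal)
    out <- refs
    out[decode] <- vapply(codes[decode], intToUtf8, character(1L))
    regmatches(s, m) <- list(out)
    s
}

decode_numeric_refs("4 &#60; 9 &#x26; 9 &#x3E; 4")
## [1] "4 < 9 & 9 > 4"
decode_numeric_refs("4 &#60; 9 &#x26; 9 &#x3E; 4", hex = FALSE)
## [1] "4 < 9 &#x26; 9 &#x3E; 4"
```

All references are located before any replacement is made, so a character produced by one replacement cannot combine with the following text into a new reference.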

HTMLencode replaces UTF-8-encoded substrings with HTML5 named entities (a.k.a. “named character references”). A semicolon ‘;’ will not be replaced by the entity ‘&semi;’. Other than that, however, HTMLencode is quite thorough in its job: it will replace all characters for which named entities exist, even ‘,’ (by ‘&comma;’) or ‘?’ (by ‘&quest;’). You can restrict the characters to be replaced by specifying encode.only.
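The effect of restricting the replaced characters can be sketched in base R with a small lookup table. Both the table entities (a five-entry excerpt of the HTML5 named entities) and the helper encode_sketch are hypothetical and are not part of textutils; the sketch only illustrates the idea behind encode.only, not how the package implements it.

```r
## Hypothetical excerpt of the HTML5 named entities; the semicolon is
## deliberately left out, in line with the description above.
entities <- c("&" = "&amp;", "<" = "&lt;", ">" = "&gt;",
              "," = "&comma;", "?" = "&quest;")

## Hypothetical helper (not part of textutils): encode a single string,
## optionally restricted to the characters in 'encode.only'.
encode_sketch <- function(s, encode.only = NULL) {
    stopifnot(is.character(s), length(s) == 1L)
    tab <- if (is.null(encode.only))
               entities
           else
               entities[names(entities) %in% encode.only]

    ## work character by character, so that the "&" of an inserted
    ## entity is never encoded a second time
    chars <- strsplit(s, "", fixed = TRUE)[[1L]]
    hit <- chars %in% names(tab)
    chars[hit] <- unname(tab[chars[hit]])
    paste(chars, collapse = "")
}

encode_sketch("Max, Moritz & more")
## [1] "Max&comma; Moritz &amp; more"
encode_sketch("Max, Moritz & more", encode.only = c("&", "<", ">"))
## [1] "Max, Moritz &amp; more"
```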

HTMLrm removes HTML tags. All content between style and head tags is removed, as are comments. Note that each element of x is considered a single HTML document; so a document that is stored as several lines (e.g. as returned by readLines) should first be collapsed into a single string with paste and its collapse argument.
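A short sketch of collapsing a multiline document before passing it to HTMLrm. The sample document in doc_lines is made up for illustration, and the call to HTMLrm is only evaluated if textutils is installed; its result is not shown here.

```r
## A made-up HTML document, stored one line per element, as
## readLines() would return it.
doc_lines <- c("<html>",
               "<head><title>Title</title></head>",
               "<body>",
               "<p>Some <b>text</b></p>",
               "</body>",
               "</html>")

## Collapse into a single string, so that the whole document is
## treated as one element of x.
doc <- paste(doc_lines, collapse = "\n")
length(doc)
## [1] 1

if (requireNamespace("textutils", quietly = TRUE))
    textutils::HTMLrm(doc)
```

Without the collapse step, each of the six lines would be treated as a document of its own.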

Value

A character vector.

Author(s)

Enrico Schumann

References

https://www.w3.org/TR/html5/syntax.html#named-character-references

https://html.spec.whatwg.org/multipage/syntax.html#character-references

Examples

HTMLdecode(c("Max &amp; Moritz", "4 &lt; 9"))
## [1] "Max & Moritz" "4 < 9"

HTMLencode(c("Max & Moritz", "4 < 9"))
## [1] "Max &amp; Moritz" "4 &lt; 9"

HTMLencode("Max, Moritz & more")
## [1] "Max&comma; Moritz &amp; more"
HTMLencode("Max, Moritz & more", encode.only = c("&", "<", ">"))
## [1] "Max, Moritz &amp; more"


HTMLrm("before <a href='http://enricoschumann.net'>LINK</a>  after")
## [1] "before LINK  after"

[Package textutils version 0.4-1 Index]