R: Removes XML/HTML Tags and Umlauts

removeXML {tosca}

R Documentation

Removes XML/HTML Tags and Umlauts

Description

Removes XML tags (removeXML), removes or resolves HTML tags (removeHTML), and converts German umlauts to a standardized form (removeUmlauts).

Usage

removeXML(x)

removeUmlauts(x)

removeHTML(
  x,
  dec = TRUE,
  hex = TRUE,
  entity = TRUE,
  symbolList = c(1:4, 9, 13, 15, 16),
  delete = TRUE,
  symbols = FALSE
)

Arguments

`x`	Character: Vector or list of character vectors.
`dec`	Logical: If `TRUE`, HTML entities in decimal notation (e.g. `&#248;`) are resolved.
`hex`	Logical: If `TRUE`, HTML entities in hexadecimal notation (e.g. `&#x00f8;`) are resolved.
`entity`	Logical: If `TRUE`, HTML entities in named (text) notation (e.g. `&oslash;`) are resolved.
`symbolList`	Numeric: Vector selecting which of the 16 ISO-8859 lists to use (ISO-8859-12 does not exist, so its list is empty).
`delete`	Logical: If `TRUE`, all HTML entities that were not resolved are deleted.
`symbols`	Logical: If `TRUE`, most symbols from ISO-8859 are not resolved (DEC: 32:64, 91:96, 123:126, 160:191, 215, 247, 818, 8194:8222, 8254, 8291, 8364, 8417, 8470).
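
The three entity styles controlled by `dec`, `hex` and `entity` can denote the same character: `&#248;`, `&#x00f8;` and `&oslash;` all stand for the letter ø. The base R sketch below illustrates only the two numeric styles. `resolve_numeric_entities` is a hypothetical helper written for this illustration; it is not tosca's implementation and ignores `entity`, `symbolList`, `delete` and `symbols`.

```r
# Hypothetical helper (not part of tosca): resolve numeric HTML entities
# such as "&#248;" (decimal) or "&#x00f8;" (hexadecimal) using base R only.
resolve_numeric_entities <- function(x, dec = TRUE, hex = TRUE) {
  replace_matches <- function(x, pattern, base, skip) {
    m <- gregexpr(pattern, x, perl = TRUE)
    regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
      # Drop the "&#" / "&#x" prefix and the trailing ";" to get the digits.
      codes <- strtoi(substring(hits, skip, nchar(hits) - 1L), base = base)
      # Convert each code point to its character.
      vapply(codes, intToUtf8, character(1))
    })
    x
  }
  if (hex) x <- replace_matches(x, "&#[xX][0-9A-Fa-f]+;", 16L, 4L)
  if (dec) x <- replace_matches(x, "&#[0-9]+;", 10L, 3L)
  x
}

x <- "&#x00f8; &#248; &oslash;"
resolved_both <- resolve_numeric_entities(x)               # named entity is left as is
resolved_hex  <- resolve_numeric_entities(x, dec = FALSE)  # only "&#x00f8;" is resolved
```

In this sketch the named entity `&oslash;` stays untouched, because resolving it would need a lookup table of entity names, which is presumably the role of the ISO-8859 lists selected by `symbolList`.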

Details

The choice of u.type should take the language of the corpus into account, because in some languages replacing umlauts can change the meaning of a word. To change which columns removeXML is applied to, use the argument xmlAction in readTextmeta.
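
Why replacing umlauts can blur distinctions is easy to see in a small base R sketch. `transliterate_umlauts` is a hypothetical helper, and the mapping to "ae", "oe" and "ue" is an assumption made for this illustration, not necessarily the standardized form that removeUmlauts produces.

```r
# Hypothetical helper (not part of tosca): replace German umlauts by the
# common two-letter transliteration. The mapping is an assumption.
transliterate_umlauts <- function(x) {
  from <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc")
  to   <- c("ae", "oe", "ue", "Ae", "Oe", "Ue")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

transliterate_umlauts("Bl\u00fchende Apfelb\u00e4ume")
# "Bluehende Apfelbaeume"

# Two distinct spellings collapse into one after the replacement:
collapsed <- transliterate_umlauts(c("M\u00fcller", "Mueller"))
length(unique(collapsed))
# 1
```

The replacement is lossy: once two different strings map to the same form, the original spelling cannot be recovered, which is why the language of the corpus matters.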

Value

Adjusted character string or list, depending on input.
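
A function can return a result of the same shape as its input by recursing into lists, as the following base R sketch shows. `strip_tags` is a hypothetical helper, not tosca's code; its simple regular expression is an assumption that only covers plain tags and does not handle comments or CDATA sections.

```r
# Hypothetical helper (not part of tosca): strip anything that looks like
# a tag and return the same shape (character vector or list) as the input.
strip_tags <- function(x) {
  if (is.list(x)) {
    # Recurse so that each list element keeps its position and name.
    return(lapply(x, strip_tags))
  }
  gsub("<[^>]*>", "", x)
}

strip_tags("<text>Some <b>important</b> text</text>")
# "Some important text"

docs <- list(a = c("<p>one</p>", "<p>two</p>"), b = "<i>three</i>")
stripped <- strip_tags(docs)  # a list with elements a and b, tags removed
```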

Examples

xml <- "<text>Some <b>important</b> text</text>"
removeXML(xml)

x <- "&#x00f8; &#248; &oslash;"
removeHTML(x=x, symbolList = 1, dec=TRUE, hex=FALSE, entity=FALSE, delete = FALSE)
removeHTML(x=x, symbolList = c(1,3))

y <- c("Bl\UFChende Apfelb\UE4ume")
removeUmlauts(y)

[Package tosca version 0.3-2 Index]