removeXML {tosca} | R Documentation |
Removes XML/HTML Tags and Umlauts
Description
Removes XML tags (removeXML), removes or resolves HTML tags (removeHTML), and converts German umlauts into a standardized form (removeUmlauts).
Usage
removeXML(x)
removeUmlauts(x)
removeHTML(
x,
dec = TRUE,
hex = TRUE,
entity = TRUE,
symbolList = c(1:4, 9, 13, 15, 16),
delete = TRUE,
symbols = FALSE
)
Arguments
x |
Character: Vector or list of character vectors. |
dec |
Logical: If |
hex |
Logical: If |
entity |
Logical: If |
symbolList |
Numeric: Vector selecting which of the 16 ISO-8859 lists to use (ISO-8859-12 does not exist, so its list is empty). |
delete |
Logical: If |
symbols |
Logical: If |
Details
The choice of u.type should take the language of the corpus into account, because in some languages replacing umlauts can change the meaning of a word.
To change which columns are processed by removeXML, use the argument xmlAction in readTextmeta.
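The point about meaning can be illustrated with a minimal base R sketch. The two helpers below, strip_umlauts and transliterate_umlauts, are hypothetical and written only for this illustration; they are not part of tosca and do not claim to reproduce what removeUmlauts does internally. They assume lower-case German input.

```r
# Hypothetical helper (not from tosca): drops the diaeresis entirely.
strip_umlauts <- function(x) {
  chartr("\u00e4\u00f6\u00fc", "aou", x)
}

# Hypothetical helper (not from tosca): uses the two-letter transliteration.
transliterate_umlauts <- function(x) {
  from <- c("\u00e4", "\u00f6", "\u00fc")
  to   <- c("ae", "oe", "ue")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

word <- "sch\u00f6n"            # German "schoen" (beautiful), written with o-umlaut
strip_umlauts(word)             # collides with "schon" (already)
transliterate_umlauts(word)     # stays distinguishable from "schon"
```

Stripping the diaeresis makes two different German words identical, whereas the two-letter form keeps them apart. Which replacement is safe therefore depends on the language of the corpus.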
Value
Adjusted character string or list, depending on input.
Examples
xml <- "<text>Some <b>important</b> text</text>"
removeXML(xml)
# the same character written as a hexadecimal, decimal, and named HTML entity
x <- "&#x00f8; &#248; &oslash;"
removeHTML(x=x, symbolList = 1, dec=TRUE, hex=FALSE, entity=FALSE, delete = FALSE)
removeHTML(x=x, symbolList = c(1,3))
y <- c("Bl\UFChende Apfelb\UE4ume")
removeUmlauts(y)