delete.markup {stylo} | R Documentation
Delete HTML or XML tags
Description
Function for removing markup tags (e.g. HTML, XML) from a character string. All XML markup is assumed to comply with the TEI guidelines (https://tei-c.org/).
Usage
delete.markup(input.text, markup.type = "plain")
Arguments
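input.text
a character string or character vector containing the markup tags to be deleted.
markup.type
the type of markup to remove. The values that appear on this page are "plain" (the default shown under Usage), "html", "xml" and "xml.drama" (used under Examples).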
Details
This function should be used with care: a document formatted in compliance with the TEI guidelines will be parsed flawlessly, whereas cleaning up an HTML page harvested at random from the web may have side effects, e.g. footers, disclaimers, etc. will not be removed.
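The side effect described above can be illustrated with a minimal base R sketch. It is not the implementation used by delete.markup; the helper strip.tags.sketch and the sample page are hypothetical, and the regular expression simply assumes that anything between "<" and ">" is a tag. It shows why tag removal alone leaves footer text in place: the tags disappear, but the words they enclosed remain.

```r
# Hypothetical helper, NOT part of stylo: remove anything that looks like a tag.
strip.tags.sketch <- function(input.text) {
  # replace every <...> sequence (opening and closing tags) with a space
  no.tags <- gsub("<[^>]+>", " ", input.text)
  # collapse the whitespace left behind and trim both ends
  trimws(gsub("[[:space:]]+", " ", no.tags))
}

# An invented HTML fragment with a footer after the main text
page <- "<html><body><p>Gallia est omnis <i>divisa</i> in partes tres</p><div class=\"footer\">Terms of use</div></body></html>"

cleaned <- strip.tags.sketch(page)
print(cleaned)
# the footer text "Terms of use" is still present in the result
```

Removing such leftovers generally requires knowledge of the page layout, which is why pages collected from the web may need manual inspection after cleaning.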
Author(s)
Maciej Eder, Mike Kestemont
See Also
load.corpus, txt.to.words, txt.to.words.ext, txt.to.features
Examples
delete.markup("Gallia est omnis <i>divisa</i> in partes tres",
markup.type = "html")
delete.markup("Gallia<note>Gallia: Gaul.</note> est omnis
<emph>divisa</emph> in partes tres", markup.type = "xml")
delete.markup("<speaker>Hamlet</speaker>Words, words, words...",
markup.type = "xml.drama")