EuropresseSource {tm.plugin.europresse} | R Documentation
Europresse Source
Description
Construct a source for an input containing a set of articles exported from Europresse in the HTML format.
Usage
EuropresseSource(x, encoding = "UTF-8")
Arguments
x | Either a character string identifying the file, or a connection.
encoding | A character string giving the encoding of x.
Details
This function imports the body of the articles, but also sets several meta-data variables on individual documents:
- datetimestamp: The publication date.
- heading: The title of the article.
- origin: The newspaper the article comes from.
- section: If available, the part of the newspaper containing the article.
- pages: If available, the pages where the article appeared.
Please note that the encoding declared in Europresse HTML files often does not match the one actually used in the text. In that case, you will need to find out the correct encoding and specify it manually via the encoding argument.
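One rough way to pick a value for encoding is to check whether the file's bytes are valid UTF-8 and fall back to another encoding otherwise. The sketch below is only an illustration: guess_encoding is a hypothetical helper that is not part of tm.plugin.europresse, and the "ISO-8859-1" fallback is an assumption that may not match a given export.

```r
# Hypothetical helper (not part of tm.plugin.europresse): report "UTF-8" when
# every line of the file is valid UTF-8, otherwise return a fallback guess.
guess_encoding <- function(path, fallback = "ISO-8859-1") {
  # readLines() does not re-encode its input, so the original bytes are kept
  lines <- readLines(path, warn = FALSE)
  if (all(validUTF8(lines))) "UTF-8" else fallback
}

# Usage with the test file shipped with the package, if it is installed
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("tm.plugin.europresse", quietly = TRUE)) {
  library(tm)
  library(tm.plugin.europresse)
  file <- system.file("texts", "europresse_test2.html",
                      package = "tm.plugin.europresse")
  corpus <- Corpus(EuropresseSource(file, encoding = guess_encoding(file)))
}
```

A file that fails the UTF-8 check is not necessarily ISO-8859-1, so it is still worth inspecting a few accented words in the resulting corpus before relying on the guess.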
Value
An object of class EuropresseSource, which extends the class Source, representing a set of articles from Europresse.
Author(s)
Milan Bouchet-Valat
See Also
readEuropresseHTML2 for the function actually parsing individual articles.
getSources to list available sources.
Examples
library(tm)
file <- system.file("texts", "europresse_test2.html",
                    package = "tm.plugin.europresse")
corpus <- Corpus(EuropresseSource(file))
# See the contents of the documents
inspect(corpus)
# See meta-data associated with first article
meta(corpus[[1]])