lnt_read {LexisNexisTools} | R Documentation |
Read in a LexisNexis file
Description
Read a file from LexisNexis in a supported format and convert it to an object of class LNToutput. Supported formats are TXT, DOC, RTF and PDF files.
Usage
lnt_read(
x,
encoding = "UTF-8",
extract_paragraphs = TRUE,
convert_date = TRUE,
start_keyword = "auto",
end_keyword = "auto",
length_keyword = "auto",
author_keyword = "auto",
exclude_lines = "^LOAD-DATE: |^UPDATE: |^GRAFIK: |^GRAPHIC: |^DATELINE: ",
recursive = FALSE,
file_type = c("txt", "rtf", "doc", "pdf", "docx", "zip"),
remove_cover = TRUE,
remove_classification = TRUE,
verbose = TRUE,
...
)
Arguments
x |
Name(s) of file(s) or one or multiple directories containing files from LexisNexis to be converted. |
encoding |
Encoding to be assumed for input files. Defaults to UTF-8 (the LexisNexis standard value). |
extract_paragraphs |
A logical flag indicating if the returned object will include a third data frame with paragraphs. |
convert_date |
A logical flag indicating whether to try converting the date of each article into Date format. For non-standard dates provided by LexisNexis it might be safer to convert dates afterwards (see lnt_asDate). |
start_keyword |
Is used to indicate the beginning of an article. All articles should have the same number of beginnings, ends and lengths (which indicate the last line of metadata). Use a regular expression such as "\d+ of \d+ DOCUMENTS$" (which would catch, e.g., the format "2 of 100 DOCUMENTS") or "auto" to try all common keywords. Keyword search is case sensitive. |
end_keyword |
Is used to indicate the end of an article. Works the same way as start_keyword. A common regex would be "^LANGUAGE: " which catches language in all caps at the beginning of the line (usually the last line of an article). |
length_keyword |
Is used to indicate the end of the metadata. Works the same way as start_keyword and end_keyword. A common regex would be "^LENGTH: " which catches length in all caps at the beginning of the line (usually the last line of the metadata). |
author_keyword |
A keyword to identify the author(s) in the metadata. |
exclude_lines |
Lines in which these keywords are found are excluded.
Set to |
recursive |
A logical flag indicating whether subdirectories are searched for more files. |
file_type |
File types/extensions to be included in search for files. |
remove_cover |
Logical. Should the cover page be removed? |
remove_classification |
Logical. Should the classification provided by LexisNexis be removed? |
verbose |
A logical flag indicating whether information should be printed to the screen. |
... |
Additional arguments passed on to lnt_asDate. |
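The effect of exclude_lines can be illustrated with base R alone. The following is a minimal sketch, not the package's internal code: it applies the default pattern from the Usage section to a few made-up lines with grepl(). The sample lines and the variable names are assumptions for illustration.

```r
# Default pattern, copied from the Usage section.
exclude_lines <- "^LOAD-DATE: |^UPDATE: |^GRAFIK: |^GRAPHIC: |^DATELINE: "

# Hypothetical input lines (not real LexisNexis output).
lines <- c(
  "LOAD-DATE: January 5, 2010",
  "The article body starts here.",
  "GRAPHIC: Photo of the author",
  "An UPDATE: in the middle of a line is kept."
)

# Keep only lines that do not match any of the keywords.
# Every keyword is anchored with ^, so only the start of a line counts.
kept <- lines[!grepl(exclude_lines, lines)]
print(kept)
```

Because each keyword is anchored at the start of the line, the last sample line survives even though it contains "UPDATE: ".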
Details
The function produces an LNToutput S4 object with two or three data.frames: meta, containing all meta information such as date, author and headline; and articles, containing just the article ID and the text of the articles. When extract_paragraphs is set to TRUE, the output contains a third data.frame, paragraphs, which is similar to articles but with the articles split into paragraphs.
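Since meta and articles share the article ID, they can be combined into one table. The sketch below uses small mock data.frames rather than real lnt_read output; the column names (ID, Date, Headline, Article) are assumptions made for illustration and may differ from the actual columns.

```r
# Mock stand-ins for the meta and articles data.frames.
# Column names are assumed for this example only.
meta <- data.frame(
  ID = 1:2,
  Date = as.Date(c("2010-01-05", "2010-01-06")),
  Headline = c("First headline", "Second headline"),
  stringsAsFactors = FALSE
)
articles <- data.frame(
  ID = 1:2,
  Article = c("Text of the first article.", "Text of the second article."),
  stringsAsFactors = FALSE
)

# Join metadata and article text on the shared ID column.
combined <- merge(meta, articles, by = "ID")
print(combined)
```

With a real LNToutput object, the same merge() call would be applied to the data.frames in the meta and articles slots, after checking their actual column names.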
When left at "auto", the keywords use the following defaults, which should be the standard keywords in all languages used by 'LexisNexis':
* start_keyword = "\d+ of \d+ DOCUMENTS$| Dokument \d+ von \d+$| Document \d+ de \d+$".
* end_keyword = "^LANGUAGE: |^SPRACHE: |^LANGUE: ".
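To see how these keywords delimit articles, the following base R sketch runs the defaults above, plus the "^LENGTH: " example given for length_keyword, over a made-up file of two articles. It is an illustration under stated assumptions, not the package's implementation: the sample lines are hypothetical, and the whitespace inside the start pattern is assumed to be a single space.

```r
# Defaults as listed above (backslashes doubled for R string literals).
start_keyword <- "\\d+ of \\d+ DOCUMENTS$| Dokument \\d+ von \\d+$| Document \\d+ de \\d+$"
end_keyword <- "^LANGUAGE: |^SPRACHE: |^LANGUE: "
# Example regex from the length_keyword argument description.
length_keyword <- "^LENGTH: "

# Hypothetical file content with two articles.
lines <- c(
  "1 of 2 DOCUMENTS",
  "The Daily Example",
  "LENGTH: 120 words",
  "Body of the first article.",
  "LANGUAGE: ENGLISH",
  "2 of 2 DOCUMENTS",
  "The Weekly Example",
  "LENGTH: 95 words",
  "Body of the second article.",
  "LANGUAGE: ENGLISH"
)

# Line numbers at which each keyword matches.
starts <- grep(start_keyword, lines)
lengths <- grep(length_keyword, lines)
ends <- grep(end_keyword, lines)

# Every article needs exactly one beginning, one length and one end.
same_count <- length(starts) == length(lengths) && length(lengths) == length(ends)
print(data.frame(start = starts, length = lengths, end = ends))
```

If the three counts differ for a real file, that suggests the keywords do not fit its format and custom regular expressions may be needed.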
Value
An LNToutput S4 object consisting of data.frames for metadata, articles and, if extract_paragraphs is TRUE, paragraphs.
Author(s)
Johannes B. Gruber
Examples
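# Read the sample file shipped with the package
LNToutput <- lnt_read(lnt_sample(copy = FALSE))
# Extract the data.frames from the S4 object's slots
meta.df <- LNToutput@meta
articles.df <- LNToutput@articles
paragraphs.df <- LNToutput@paragraphs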