speech_build {speech} | R Documentation |
Transform speeches in pdf to data.frame
Description
Extracts the individual speeches of each legislator from a document and returns them as a data.frame.
Usage
speech_build(
file,
add.error.sir = NULL,
rm.error.leg = NULL,
compiler = FALSE,
quality = FALSE,
param = list(char = 6500, drop.page = 2)
)
Arguments
file |
list or character vector specifying the path or URL to a PDF file. It can be one or more files. |
add.error.sir |
character vector. Additional spellings under which the term that orders
the speeches (sir, i.e. the word "SENOR") may be miswritten in the document. By default it is NULL. |
rm.error.leg |
character vector. Legislator names
to be eliminated from the result. By default it is NULL. |
compiler |
logical. Once the conversion from PDF to data frame has been checked,
the data frame needs to be compiled. Compiling unites all the
speeches of each legislator within each document. Because this operation
must be carried out after any corrections have been made, it has to be requested explicitly.
By default it is FALSE. |
quality |
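logical. If TRUE, the variables index_1 and index_2 are added to the result (see Value). By default it is FALSE.
|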
param |
list of length 2 with the values for "character for page" (char) and "drop page non evaluate" (drop.page), respectively. The default values are the median characters of the 8500 documents that make up the speech datasets. |
Details
This function converts PDF documents to a data.frame. The conversion
works by locating the interventions of legislators, which are introduced by
the word "SENOR". Because the quality of PDF files is not always good, it is
recommended to verify that no legislator is omitted when the data.frame is
built. The argument add.error.sir should be used to supply miswritten forms
of the word "SENOR". The function already includes a long list of ways in
which the word "SENOR" may be written in a document, but it cannot cover
every problem that may appear in the future. When the PDF document is a scan
that was processed with OCR, the result should be checked with extra care to
ensure that the conversion was performed correctly.
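The marker-based approach described above can be illustrated with a minimal base R sketch. It is not the package's internal implementation: the sample text and the helper name split_by_marker are hypothetical, chosen only to show why extra spellings of the marker matter when a document has OCR errors.

```r
# Hypothetical helper (not part of the speech package): split a transcript
# into interventions, each starting at an occurrence of a marker word.
# The markers are treated as regular expressions joined by alternation.
split_by_marker <- function(text, markers = "SENOR") {
  pattern <- paste(markers, collapse = "|")
  starts <- as.integer(gregexpr(pattern, text)[[1]])
  if (starts[1] == -1L) return(character(0))  # no marker found
  # Each intervention runs up to the character before the next marker;
  # any text before the first marker is dropped.
  ends <- c(starts[-1] - 1L, nchar(text))
  trimws(substring(text, starts, ends))
}

# Invented sample text; the second marker imitates an OCR error.
text <- "SENOR PEREZ. Pido la palabra. SEf'IOR GOMEZ. Apoyado. SENOR PEREZ. Gracias."

# With the default marker only, the miswritten intervention is absorbed
# into the preceding one, so a legislator is silently omitted.
default_split <- split_by_marker(text)

# Supplying the miswritten variant recovers the missing intervention.
fixed_split <- split_by_marker(text, markers = c("SENOR", "SEf'IOR"))
```

In this toy case the first call yields two interventions and the second yields three, which is the kind of discrepancy that add.error.sir is meant to correct.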
Value
data.frame of class puy with the following variables:
legislator: name of the legislator
speech: speech by the legislator
date: session date
id: file name
legislature: legislature id (period of government)
sex: sex
chamber: chamber to which the document belongs. It can be: Chamber of Representatives, Senate, General Assembly or Permanent Commission.
If quality is TRUE, the following are added:
index_1: index_1
index_2: index_2
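To make the effect of compiler = TRUE more concrete, the following base R sketch unites the speeches of each legislator within each document. It is only an illustration under stated assumptions: the toy data frame is invented and carries just a subset of the variables listed above, and aggregate() stands in for whatever the package does internally.

```r
# Invented toy data with a subset of the output variables.
speeches <- data.frame(
  legislator = c("PEREZ", "GOMEZ", "PEREZ"),
  speech     = c("Pido la palabra.", "Apoyado.", "Gracias."),
  id         = c("doc1.pdf", "doc1.pdf", "doc1.pdf"),
  stringsAsFactors = FALSE
)

# Unite all speeches of each legislator within each document (id),
# pasting them together in their original order.
compiled <- aggregate(
  speech ~ legislator + id,
  data = speeches,
  FUN = function(x) paste(x, collapse = " ")
)
```

The result has one row per legislator and document, so the two interventions of the same legislator end up in a single speech value.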
Examples
# url <- speech::speech_url(chamber = "C", from = "17-09-2019", to = "17-09-2019")
# out <- speech_build(file = url)
# out <- speech_build(file = url, compiler = FALSE,
# quality = TRUE,
# add.error.sir = c("SEf'IOR"),
# rm.error.leg = c("PRtSIDENTE", "SUB", "PRfSlENTE"),
# param = list(char = 6000, drop.page = 3))
# library(magrittr)  # only needed if %>% is not already available
# out <- list.files(pattern = "\\.pdf$") %>% speech_build()
# out <- list.files(pattern = "\\.pdf$") %>%
# speech_build(., compiler = TRUE, param = list(char = 4500, drop.page = 3))