scrape_urls {archiveRetriever} | R Documentation |
scrape_urls: Scraping URLs from the Internet Archive
Description
scrape_urls
scrapes the content of mementos and lower-level web pages stored in the Internet Archive, using XPath expressions by default
Usage
scrape_urls(
Urls,
Paths,
collapse = TRUE,
startnum = 1,
attachto = NULL,
CSS = FALSE,
archiveDate = FALSE,
ignoreErrors = FALSE,
stopatempty = TRUE,
emptylim = 10,
encoding = "UTF-8",
lengthwarning = TRUE
)
Arguments
Urls |
A character vector of mementos from the Internet Archive |
Paths |
A named character vector of the content to be scraped from the memento. Takes XPath expressions by default. |
collapse |
Collapse matching HTML nodes into a single string |
startnum |
Specify the starting number for scraping the URLs. Important when scraping breaks off during the process. |
attachto |
The scraper attaches new content to an existing object in working memory. The object should stem from the same scraping process. |
CSS |
Use CSS selectors as input for the Paths |
archiveDate |
Retrieve the archiving date |
ignoreErrors |
Ignore errors for some URLs and continue scraping |
stopatempty |
Stop if scraping repeatedly returns empty results (see emptylim) |
emptylim |
Specify the number of consecutive URLs that may return no content before scraping breaks off |
encoding |
Specify a default encoding for the homepage. Default is 'UTF-8' |
lengthwarning |
A warning appears when a large number of URLs is supplied. Set to FALSE to disable the default warning. |
Value
This function scrapes the content of mementos or lower-level web pages from the Internet Archive. It returns a tibble including the URLs and the scraped content. However, a memento being stored in the Internet Archive does not guarantee that the information on the page can actually be scraped. As the Internet Archive is an online resource, a request may always fail due to connectivity problems. One easy and obvious solution is to re-run the function.
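The re-run advice can be sketched as a small wrapper. Note that scrape_with_retry, max_tries, and the pause length are illustrative assumptions and not part of the archiveRetriever package:

```r
## Hypothetical retry wrapper around scrape_urls(); not part of archiveRetriever.
## Retries a failed request up to max_tries times before giving up.
scrape_with_retry <- function(Urls, Paths, max_tries = 3, ...) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(
      scrape_urls(Urls = Urls, Paths = Paths, ...),
      error = function(e) NULL
    )
    if (!is.null(result)) return(result)
    Sys.sleep(2)  # brief pause before the next attempt
  }
  stop("scrape_urls() failed after ", max_tries, " attempts")
}
```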
Examples
## Not run:
scrape_urls(
Urls = "https://web.archive.org/web/20201001000859/https://www.nytimes.com/section/politics",
Paths = c(title = "//article/div/h2//text()", teaser = "//article/div/p/text()"),
collapse = FALSE, archiveDate = TRUE)
## End(Not run)
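Equivalently, CSS selectors can be supplied by setting CSS = TRUE. The selectors below are illustrative assumptions and have not been verified against the archived page:

```r
## Not run:
## Hypothetical CSS-selector variant of the example above; selectors are assumptions.
scrape_urls(
Urls = "https://web.archive.org/web/20201001000859/https://www.nytimes.com/section/politics",
Paths = c(title = "article div h2", teaser = "article div p"),
CSS = TRUE, collapse = FALSE)
## End(Not run)
```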