scrape_urls {archiveRetriever}    R Documentation

scrape_urls: Scraping Urls from the Internet Archive

Description

scrape_urls scrapes the content of mementos and lower-level web pages stored in the Internet Archive. The content to scrape is selected with XPath expressions by default.

Usage

scrape_urls(
  Urls,
  Paths,
  collapse = TRUE,
  startnum = 1,
  attachto = NULL,
  CSS = FALSE,
  archiveDate = FALSE,
  ignoreErrors = FALSE,
  stopatempty = TRUE,
  emptylim = 10,
  encoding = "UTF-8",
  lengthwarning = TRUE,
  nonArchive = FALSE
)

Arguments

Urls

A character vector of memento URLs from the Internet Archive.

Paths

A named character vector of expressions selecting the content to be scraped from the memento. Takes XPath expressions by default.

collapse

Logical value indicating whether to collapse matching HTML nodes, or a character string giving a structuring XPath by which matches are collapsed. A structuring XPath can only be used when Paths contains XPath selectors and CSS = FALSE. If an XPath is given, the expressions in Paths refer only to children of the structure selected by collapse.

startnum

The number of the URL at which scraping starts. Useful for resuming when a scraping run breaks off partway through.

attachto

An existing object in working memory to which the scraper attaches the newly scraped content. The object should stem from the same scraping process.

CSS

Logical. If TRUE, the Paths are treated as CSS selectors instead of XPath expressions.

archiveDate

Logical. If TRUE, the archiving date is retrieved.

ignoreErrors

Logical. If TRUE, errors for individual URLs are ignored and scraping proceeds.

stopatempty

Logical. If TRUE, scraping stops if it does not succeed.

emptylim

The number of URLs that may fail to be scraped before scraping breaks off.

encoding

The default encoding for the web page. Defaults to 'UTF-8'.

lengthwarning

Logical. By default, a warning appears when a large number of URLs is passed. Set to FALSE to disable this warning.

nonArchive

Logical. Set to TRUE to use archiveRetriever to scrape web pages outside the Internet Archive. Cannot be used in combination with archiveDate.
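
The idea behind a structuring XPath in collapse can be illustrated outside the package. The following is a minimal sketch that assumes the xml2 package is installed and uses a made-up inline HTML snippet. It is not the implementation used by scrape_urls; it only shows how content paths can be evaluated per structural block so that each block yields one row. The leading "." in the paths is a requirement of this xml2 sketch and not necessarily how scrape_urls expects its Paths.

```r
library(xml2)

# Hypothetical inline page standing in for a scraped web page
page <- read_html(paste0(
  '<div id="answers">',
  '<div class="answer"><p>First</p><p>answer</p><span class="name">Ann</span></div>',
  '<div class="answer"><p>Second answer</p><span class="name">Bob</span></div>',
  '</div>'
))

# Structuring XPath: plays the role of the collapse argument
blocks <- xml_find_all(page, "//div[@id='answers']/div[contains(@class, 'answer')]")

# Content paths, evaluated relative to each structural block
paths <- c(ans = ".//p", aut = ".//span[@class='name']")

# For each block, collapse all matches of each path into one string
rows <- lapply(blocks, function(block) {
  vapply(paths, function(p) {
    paste(xml_text(xml_find_all(block, p)), collapse = " ")
  }, character(1))
})

# One row per structural block, one column per named path
result <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
print(result)
```

Each answer block becomes one row, with the two paragraphs of the first answer collapsed into a single cell.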

Value

This function scrapes the content of mementos or lower-level web pages from the Internet Archive. It returns a tibble including the Urls and the scraped content. However, the fact that a memento is stored in the Internet Archive does not guarantee that the information on the web page can actually be scraped. As the Internet Archive is an internet resource, a request can always fail due to connectivity problems. One easy and obvious solution is to retry the function.
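
Retrying after a failed request can be automated with a small wrapper. The sketch below is one possible approach, not part of archiveRetriever: with_retry, max_tries, and wait are hypothetical names introduced here, and flaky is a stand-in that simulates a request failing twice before succeeding.

```r
# Hypothetical helper: call fun(...) up to max_tries times,
# waiting `wait` seconds between failed attempts
with_retry <- function(fun, ..., max_tries = 3, wait = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(...), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait)
  }
  # All attempts failed: re-signal the last error
  stop(last_error)
}

# Stand-in for a flaky request: fails twice, then succeeds
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("connection problem")
  paste("scraped", x)
}

out <- with_retry(flaky, "page", wait = 0)
print(out)
```

With the package loaded, the same wrapper could presumably be applied as with_retry(scrape_urls, Urls = ..., Paths = ...), since all further arguments are passed through unchanged.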

Examples

## Not run: 
scrape_urls(
  Urls = "https://web.archive.org/web/20201001000859/https://www.nytimes.com/section/politics",
  Paths = c(title = "//article/div/h2//text()", teaser = "//article/div/p/text()"),
  collapse = FALSE,
  archiveDate = TRUE
)

scrape_urls(
  Urls = "https://stackoverflow.com/questions/21167159/css-nth-match-doesnt-work",
  Paths = c(
    ans = "//div[@itemprop='text']/*",
    aut = "//div[@itemprop='author']/span[@itemprop='name']"
  ),
  collapse = "//div[@id='answers']/div[contains(@class, 'answer')]",
  nonArchive = TRUE,
  encoding = "bytes"
)

## End(Not run)

[Package archiveRetriever version 0.4.0 Index]