xscrape {scraEP}    R Documentation
Extract information from webpages into a data.frame, using XPath or CSS queries.
Description
This function transforms an html/xml page (or list of pages) into a data.frame, extracting nodes specified by XPath or CSS queries.
Usage
xscrape(pages,
        col.xpath = ".", row.xpath = "/html",
        col.css = NULL, row.css = NULL,
        collapse = " | ", encoding = NULL,
        page.name = TRUE, nice.text = TRUE,
        parallel = 0,
        engine = c("auto", "XML", "xml2"))
Arguments
pages
an object of class XMLInternalDocument (as returned by XML::htmlParse) or xml_document (as returned by xml2::read_html), a list of such objects, or a character vector of URLs or paths to local html files, which are then downloaded and parsed.
col.xpath
a character vector of XPath queries used for creating the result columns. If the vector is named, these names are given to the columns. The default "." takes the text from the whole of each page or intermediary node (specified by row.xpath or row.css); see the sketch after this table.
row.xpath
a character string, containing an XPath query for creating the result rows. Each node matched by this query (on each page) becomes a row in the resulting data.frame. If not specified, the default "/html" makes the intermediary nodes whole html pages, so that each page becomes a row in the result.
col.css
same as col.xpath, but using CSS queries instead of XPath.
row.css
same as row.xpath, but using CSS queries instead of XPath.
collapse
a character string, containing the separator that will be used when a col.xpath or col.css query matches multiple elements within an intermediary node: the matches are concatenated into a single string, separated by this value.
encoding
a character string (e.g. "UTF-8" or "ISO-8859-1"), containing the encoding parameter that will be used to parse the pages (passed on to XML::htmlParse or xml2::read_html).
page.name
a logical. If TRUE, the result will contain a column indicating the name of the page each row was extracted from.
nice.text
a logical. If TRUE (only possible with the "xml2" engine), the rvest::html_text2 function is used to extract text into the result, often making the text much cleaner. If FALSE, the function runs faster, but the text may be less clean.
parallel
a numeric, indicating the number of cores to use for parallel computation. The default 0 uses all available cores. Parallelization is done over the pages if their number is greater than the number of cores, otherwise over the intermediary nodes. Note that parallelization relies on parallel::mclapply, and is thus not supported on Windows systems.
engine
a character string, indicating the engine to use for data extraction: either "XML", "xml2", or "auto" (default). The default adapts the engine to the class of pages (e.g. "XML" for documents parsed with XML::htmlParse, "xml2" for documents parsed with xml2::read_html).
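As a minimal sketch of the defaults (a hypothetical call, not one of the package's shipped examples): with col.xpath = "." and row.xpath = "/html", each page becomes a single row whose one column contains the full text of the page.

## Hypothetical sketch: all defaults, one row per page, one column
## holding the whole page's text.
txt <- xscrape("https://www.r-project.org")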
Details
If a col.xpath or col.css query designates a full node, only its text is extracted. If it designates an attribute (e.g. ends with '/@href' for weblinks), only the attribute's value is extracted.
If a col.xpath or col.css query matches no element in an intermediary node, the returned value is NA. If it matches multiple elements, they are concatenated into a single character string, separated by collapse.
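A minimal sketch of these two rules (hypothetical inline input; the expected output is noted in the comments):

## Hypothetical sketch: two intermediary nodes, one with two matching
## elements (collapsed with the separator), one with none (NA).
page <- xml2::read_html("<div class='item'><span>a</span><span>b</span></div>
                         <div class='item'><em>no span here</em></div>")
res <- xscrape(page,
               row.xpath = "//div[@class='item']",
               col.xpath = c(spans = ".//span"),
               collapse = " | ",
               parallel = 1)
## Expected: res$spans is c("a | b", NA)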
Value
A data.frame, where each row corresponds to an intermediary node (either a full page or an XML node within a page, as specified by row.xpath or row.css), and each column corresponds to the text extracted by a col.xpath or col.css query.
Author(s)
Julien Boelaert jubo.stats@gmail.com
Examples
## Extract all external links and their titles from a wikipedia page
data(wiki)
wiki.parse <- XML::htmlParse(wiki)
links <- xscrape(wiki.parse,
                 row.xpath = "//a[starts-with(./@href, 'http')]",
                 col.xpath = c(title = ".", link = "./@href"),
                 parallel = 1)
## Not run:
## Convert results from a search for 'R' on duckduckgo.com
## First download the search page
duck <- XML::htmlParse("http://duckduckgo.com/html/?q=R")
## Then run xscrape on the downloaded and parsed page
results <- xscrape(duck,
                   row.xpath = "//div[contains(@class, 'result__body')]",
                   col.xpath = c(title = "./h2",
                                 snippet = ".//*[@class='result__snippet']",
                                 url = ".//a[@class='result__url']/@href"))
## End(Not run)
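## Not run:
## A hypothetical CSS variant of the example above (a sketch, not one of
## the package's shipped examples); CSS selectors cannot target
## attributes, so the url column is omitted here.
results.css <- xscrape("http://duckduckgo.com/html/?q=R",
                       row.css = "div.result__body",
                       col.css = c(title = "h2",
                                   snippet = ".result__snippet"))
## End(Not run)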
## Not run:
## Convert results from a search for 'R' and 'Julia' on duckduckgo.com
## Directly provide the URLs to xscrape
results <- xscrape(c("http://duckduckgo.com/html/?q=R",
                     "http://duckduckgo.com/html/?q=julia"),
                   row.xpath = "//div[contains(@class, 'result__body')]",
                   col.xpath = c(title = "./h2",
                                 snippet = ".//*[@class='result__snippet']",
                                 url = ".//a[@class='result__url']/@href"))
## End(Not run)