html_df {htmldf}    R Documentation

Get a tabular summary of webpage content from a vector of URLs

Description

Given a vector of URLs, html_df() attempts to fetch the HTML of each page. From the HTML, it looks for a page title, RSS feeds, images, embedded social media profile handles and other page metadata. Page language is inferred using the cld3 package, which wraps Google's Compact Language Detector 3.

Usage

html_df(
  urlx,
  max_size = 5e+06,
  wait = 0,
  retry_times = 0,
  time_out = 30,
  show_progress = TRUE,
  keep_source = TRUE,
  chrome_bin = NULL,
  chrome_args = NULL,
  ...
)

Arguments

urlx

A character vector containing URLs. Local files must be prefixed with file://.

max_size

Maximum size in bytes of pages to attempt to parse, defaults to 5000000. This is to avoid reading very large pages that may cause read_html() to hang.

wait

Time in seconds to wait between successive requests. Defaults to 0.

retry_times

Number of times to retry a URL after failure. Defaults to 0.

time_out

Time in seconds to wait for httr::GET() to complete before exiting. Defaults to 30.

show_progress

Logical, defaults to TRUE. Whether to show progress during download.

keep_source

Logical, defaults to TRUE. Whether to retain the page source column in the output tibble. Setting this to FALSE is useful for reducing memory usage when scraping many pages.

chrome_bin

(Optional) Path to a Chromium install, used to run Chrome in headless mode for scraping. Defaults to NULL.

chrome_args

(Optional) Vector of additional command-line arguments to pass to Chrome. Defaults to NULL.

...

Additional arguments passed to httr::GET().
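
The request-control arguments above can be combined in a single call. The following is a minimal sketch only: the specific values (a 1 MB size cap, a 1 second wait, 2 retries, a 10 second timeout) are illustrative assumptions rather than recommendations, and the call is skipped unless htmldf is installed and a connection is available.

```r
# URLs taken from the Examples section of this page.
urlx <- c("https://github.com/alastairrushworth/htmldf",
          "https://alastairrushworth.github.io/")

# Illustrative settings (assumed values, not package recommendations).
args <- list(
  urlx          = urlx,
  max_size      = 1e6,    # skip pages larger than about 1 MB
  wait          = 1,      # pause 1 second between successive requests
  retry_times   = 2,      # retry each failing URL up to twice
  time_out      = 10,     # give up on a request after 10 seconds
  show_progress = FALSE,  # suppress the progress display
  keep_source   = FALSE   # drop the page source to save memory
)

# Only attempt the download when the package and a connection are available.
if (requireNamespace("htmldf", quietly = TRUE) && curl::has_internet()) {
  dl <- do.call(htmldf::html_df, args)
  print(dl$title)
}
```

Collecting the settings in a list and using do.call() is just one way to keep the configuration reusable across calls; passing the arguments to html_df() directly is equivalent.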

Value

A tibble with columns including title, lang, rss, social, code_lang and source (see Examples).

Author(s)

Alastair Rushworth

Examples

# Examples require an internet connection...
urlx <- c("https://github.com/alastairrushworth/htmldf", 
          "https://alastairrushworth.github.io/")
dl   <- html_df(urlx)
# preview the dataframe
head(dl)
# social tags
dl$social
# page titles
dl$title
# page language
dl$lang
# rss feeds
dl$rss
# inferred code language
dl$code_lang
# print the page source
dl$source
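
The urlx argument also accepts local files when they are prefixed with file://. The sketch below writes a small page to a temporary file and builds such a URL; the page content and the variable names tmp and local_url are made up for illustration, and the exact prefix needed on Windows paths is not covered here. The html_df() call is skipped if htmldf is not installed.

```r
# Write a small HTML page to a temporary file (content is illustrative).
tmp <- tempfile(fileext = ".html")
writeLines(c(
  "<html><head><title>Local test page</title></head>",
  "<body><p>Hello from a local file.</p></body></html>"
), tmp)

# Prefix the absolute path with file://, as urlx requires for local files.
local_url <- paste0("file://", normalizePath(tmp, winslash = "/"))

# Only call html_df() when the package is available.
if (requireNamespace("htmldf", quietly = TRUE)) {
  dl_local <- htmldf::html_df(local_url, show_progress = FALSE)
  # page title extracted from the local file
  print(dl_local$title)
}
```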



[Package htmldf version 0.6.0 Index]