tika_fetch {rtika}    R Documentation

Fetch Files with the Content-Type Preserved in the File Extension

Description

On the Internet, Content-Type information is mainly communicated via the server's headers. This is an issue if a file is saved to disk without examining the headers, because the file can end up with a missing or incorrect file extension. For example, a URL ending in a slash (/) can produce a file with the Content-Type of text/html. The same URL might also produce an image/jpeg or application/pdf file. URLs ending in .php or .cfm can produce any Content-Type. The downloaded file will lose the server's declared Content-Type unless it is appended as a file extension. tika_fetch() gets a file from the URL, examines the server headers, and appends the matching file extension from Tika's database.
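The idea of mapping a declared Content-Type to a file extension can be sketched in base R. This is an illustration only, not the code tika_fetch() uses: the helper name ext_for_content_type() and its three-entry lookup table are hypothetical stand-ins for Tika's much larger database.

```r
# Hypothetical helper: map a Content-Type header value to a file extension.
# The lookup table is a tiny stand-in for Tika's database of MIME types.
ext_for_content_type <- function(content_type, default = "") {
  known <- c(
    "text/html"       = ".html",
    "image/jpeg"      = ".jpg",
    "application/pdf" = ".pdf"
  )
  # Drop parameters such as "; charset=UTF-8" and normalise case
  bare <- tolower(trimws(sub(";.*$", "", content_type)))
  ext <- unname(known[bare])
  # Unknown types fall back to the default (no extension)
  ext[is.na(ext)] <- default
  ext
}

# Append the extension to a unique file name in the session temp directory
ext <- ext_for_content_type("text/html; charset=UTF-8")
dest <- paste0(tempfile(tmpdir = tempdir()), ext)
```

Here dest ends in .html even though nothing in the URL itself would have revealed the type.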

Usage

tika_fetch(
  urls,
  download_dir = tempdir(),
  ssl_verifypeer = TRUE,
  retries = 1,
  quiet = TRUE
)

Arguments

urls

Character vector of one or more URLs to be downloaded.

download_dir

Character vector of length one giving the path to the directory in which to save the downloaded files.

ssl_verifypeer

Logical, with a default of TRUE. Some server SSL certificates might not be recognized by the host system. In these rare cases, the user can set this to FALSE to skip verification, provided they understand why the certificate is not recognized.

retries

Integer giving the number of times to retry each URL after a failed download.

quiet

Logical indicating whether download warnings should be suppressed. Defaults to TRUE.

Value

Character vector of the same length and order as the input, giving the paths of the downloaded files. URLs that fail to download are returned as NA.
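Because the result keeps the length and order of the input, failed downloads can be matched back to their URLs. The following is a minimal sketch that uses a hand-written stand-in for the return value rather than a real call; the second URL and the file name are made up for illustration.

```r
urls <- c("https://tika.apache.org/", "https://example.invalid/missing")

# Stand-in for the result of tika_fetch(urls); the path is hypothetical
paths <- c(file.path(tempdir(), "abc123.html"), NA_character_)

# NA marks a URL that could not be downloaded
ok <- !is.na(paths)
fetched <- paths[ok]       # files that are available on disk
failed_urls <- urls[!ok]   # URLs to inspect or retry later
```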

Examples


tika_fetch('https://tika.apache.org/')
# returns the path to a uniquely named file with .html appended to it


[Package rtika version 2.7.0 Index]