dai_async {daiR}    R Documentation

OCR documents asynchronously

Description

Sends files from a Google Cloud Services (GCS) Storage bucket to the GCS Document AI v1 API for asynchronous (offline) processing. The output is delivered to the same bucket as JSON files containing the OCRed text and additional data.

Usage

dai_async(
  files,
  dest_folder = NULL,
  bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
  proj_id = get_project_id(),
  proc_id = Sys.getenv("DAI_PROCESSOR_ID"),
  proc_v = NA,
  skip_rev = "true",
  loc = "eu",
  token = dai_token()
)

Arguments

files

a vector or list of pdf filepaths in a GCS Storage bucket. Filepaths must include all parent bucket folder(s) except the bucket name.

dest_folder

the name of the GCS Storage bucket subfolder where you want the json output

bucket

the name of the GCS Storage bucket where the files to be processed are located

proj_id

a GCS project id

proc_id

a Document AI processor id

proc_v

one of 1) a processor version name, 2) "stable" for the latest processor from the stable channel, or 3) "rc" for the latest processor from the release candidate channel.

skip_rev

whether to skip human review; "true" or "false"

loc

a two-letter region code; "eu" or "us"

token

an access token generated by dai_auth() or another auth function

Details

Requires a GCS access token and some configuration of the .Renviron file; see the package vignettes for details. Currently, a dai_async() call can contain a maximum of 50 files (a multi-page pdf counts as one file). You cannot have more than 5 batch requests or more than 10,000 pages undergoing processing at any one time. The maximum pdf document length is 2,000 pages. With long pdf documents, Document AI divides the JSON output into separate files ('shards') of 20 pages each. If you want longer shards, use dai_tab_async(), which accesses another API endpoint that allows for shards of up to 100 pages.

Value

A list of HTTP responses

Examples

## Not run: 
# with daiR configured on your system, several parameters are automatically provided,
# and you can pass simple calls, such as:
dai_async("my_document.pdf")

# NB: Include all parent bucket folders (but not the bucket name) in the filepath:
dai_async("for_processing/pdfs/my_document.pdf")

# Bulk process by passing a vector of filepaths in the files argument:
dai_async(my_files)

# Specify a bucket subfolder for the json output:
dai_async(my_files, dest_folder = "processed")
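
# A minimal sketch, not part of daiR: if you have more files than the
# per-call maximum of 50 given under Details, you could split the vector
# into chunks first. chunk_files() and many_files are hypothetical names
# used only for illustration.
chunk_files <- function(files, max_per_call = 50) {
  # Assign each file a chunk index (1, 1, ..., 2, 2, ...) and split on it
  split(files, ceiling(seq_along(files) / max_per_call))
}

many_files <- sprintf("for_processing/pdfs/doc_%03d.pdf", 1:120)
batches <- chunk_files(many_files)
lengths(batches)

# Each chunk could then be sent in its own call. Keep the limit of
# 5 simultaneous batch requests in mind before submitting many chunks:
# responses <- lapply(batches, dai_async, dest_folder = "processed")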

## End(Not run)

[Package daiR version 1.0.0 Index]