dai_async {daiR}    R Documentation
OCR documents asynchronously
Description
Sends files from a Google Cloud Services (GCS) Storage bucket to the GCS Document AI v1 API for asynchronous (offline) processing. The output is delivered to the same bucket as JSON files containing the OCRed text and additional data.
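As a rough sketch of how this fits into a workflow (the bucket name and filepaths below are hypothetical, and uploading with googleCloudStorageR::gcs_upload() is just one possible way to get a pdf into the bucket, not something dai_async() does itself):

# upload a local pdf to the bucket (using googleCloudStorageR, configured separately)
googleCloudStorageR::gcs_upload("report.pdf",
                                bucket = "my-bucket",
                                name = "to_process/report.pdf")

# send it to Document AI for asynchronous OCR; the JSON output
# will later appear in the same bucket
resp <- dai_async("to_process/report.pdf", bucket = "my-bucket")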
Usage
dai_async(
  files,
  dest_folder = NULL,
  bucket = Sys.getenv("GCS_DEFAULT_BUCKET"),
  proj_id = get_project_id(),
  proc_id = Sys.getenv("DAI_PROCESSOR_ID"),
  proc_v = NA,
  skip_rev = "true",
  loc = "eu",
  token = dai_token()
)
Arguments
files: a vector or list of pdf filepaths in a GCS Storage bucket. Filepaths must include all parent bucket folder(s) except the bucket name.

dest_folder: the name of the GCS Storage bucket subfolder where you want the json output.

bucket: the name of the GCS Storage bucket where the files to be processed are located.

proj_id: a GCS project id.

proc_id: a Document AI processor id.

proc_v: one of 1) a processor version name, 2) "stable" for the latest processor from the stable channel, or 3) "rc" for the latest processor from the release candidate channel.

skip_rev: whether to skip human review; "true" or "false".

loc: a two-letter region code; "eu" or "us".

token: an access token generated by dai_token() or another token-generating function.
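For illustration, a call with every argument spelled out might look like the sketch below; the bucket, folder, project id and processor id are hypothetical placeholders:

dai_async(files = c("to_process/doc1.pdf", "to_process/doc2.pdf"),
          dest_folder = "processed",
          bucket = "my-bucket",
          proj_id = "my-project-12345",
          proc_id = "a1b2c3d4e5f6",
          proc_v = "stable",
          skip_rev = "true",
          loc = "eu",
          token = dai_token())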
Details
Requires a GCS access token and some configuration of the .Renviron file; see the package vignettes for details. Currently, a dai_async() call can contain a maximum of 50 files (a multi-page pdf counts as one file). You cannot have more than 5 batch requests and 10,000 pages undergoing processing at any one time. Maximum pdf document length is 2,000 pages. With long pdf documents, Document AI divides the JSON output into separate files ('shards') of 20 pages each. If you want longer shards, use dai_tab_async(), which accesses another API endpoint that allows for shards of up to 100 pages.
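Because a single call is capped at 50 files, a larger set of filepaths can be submitted in chunks; a minimal base R sketch (my_files is assumed to be a character vector of bucket filepaths):

# split the filepaths into batches of at most 50 and submit each batch
batches <- split(my_files, ceiling(seq_along(my_files) / 50))
responses <- lapply(batches, dai_async, dest_folder = "processed")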
Value
A list of HTTP responses
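Assuming the elements are httr response objects (the package uses httr for its API calls), their HTTP status can be checked directly, for example:

resps <- dai_async("my_document.pdf")
# 200 indicates the batch request was accepted for processing
sapply(resps, httr::status_code)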
Examples
## Not run:
# with daiR configured on your system, several parameters are automatically provided,
# and you can pass simple calls, such as:
dai_async("my_document.pdf")
# NB: Include all parent bucket folders (but not the bucket name) in the filepath:
dai_async("for_processing/pdfs/my_document.pdf")
# Bulk process by passing a vector of filepaths in the files argument:
dai_async(my_files)
# Specify a bucket subfolder for the json output:
dai_async(my_files, dest_folder = "processed")
## End(Not run)