daiR is an R package for Google Document AI, a
powerful server-based OCR processor with support for over 60 languages.
The package provides an interface for the Document AI API and comes with
additional tools for output file parsing and text reconstruction. See
the daiR website for more details.
Quickly OCR short documents:
## NOT RUN
library(daiR)
# Send the document for synchronous processing
response <- dai_sync("file.pdf")
# Extract the text from the response and print it
text <- text_from_dai_response(response)
cat(text)
Batch process asynchronously via Google Storage:
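## NOT RUN
library(daiR)
library(googleCloudStorageR)
library(purrr)
# Upload the documents to the storage bucket
my_files <- c("file1.pdf", "file2.pdf", "file3.pdf")
map(my_files, gcs_upload)
# Submit them for asynchronous processing
dai_async(my_files)
# Once processing has finished, find the JSON output files in the bucket
contents <- gcs_list_objects()
output_files <- grep("json$", contents$name, value = TRUE)
# Download the output files to a temporary directory
map(output_files, ~ gcs_get_object(.x, saveToDisk = file.path(tempdir(), .x)))
# Extract and print the text from the first output file
sample_text <- text_from_dai_file(file.path(tempdir(), output_files[1]))
cat(sample_text)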
Turn images of tables into R dataframes:
## NOT RUN
response <- dai_sync_tab("tables.pdf")
dfs <- tables_from_dai_response(response)
Google Document AI is a paid service that requires a Google Cloud
account and a Google Storage bucket. I recommend using Mark Edmondson’s
googleCloudStorageR package in combination with daiR.
Install the latest development version from GitHub:
devtools::install_github("hegghammer/daiR")