🎁 Adjust IIIF Print Page Splitting Process to Utilize the Derivative::Rodeo #220

jeremyf · 2023-04-14T14:38:05Z

This follows on the work of #219 and relates to scientist-softserv/adventist_knapsack#406.

Given a FileSet with an original file of a PDF
And that PDF has been handled by the rodeo
When the IIIF Print gem goes to split the PDF
Then we should use the pre-processed rodeo files instead of running any inline splitting

Discussion

When we split a PDF into multiple pages, we likely do not want to fall back to the Hyrax::FileSetDerivativeService. That service is for converting original files. We instead want to utilize the image, extracted text, etc. that the Derivative::Rodeo created.

We also want to consider that we have existing PDF splitting and do not yet want to disrupt that processing. So the strategy is to create a new process that we use to handle split PDFs. We could, in theory, fall back to the existing IIIF Print split processing if the PDF does not have pages in the rodeo.

An assumption is that, for a given file, the rodeo will have either none or all of the constituent pages. That is to say, we should not expect that IIIF Print would create the image and handle OCR for a single page of the PDF.
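
The "none or all" assumption makes the fall-back decision simple: if the rodeo has any pages for the PDF, use them; otherwise run the existing splitter. Below is a minimal sketch of that decision. The class name and both collaborators are hypothetical stand-ins, not part of IIIF Print or the DerivativeRodeo.

```ruby
# Hypothetical sketch; none of these names exist in IIIF Print or the DerivativeRodeo.
class FallbackSplitterSketch
  # @param rodeo_lookup [#call] given a PDF path, returns an Array of pre-processed
  #   page locations (empty when the rodeo has not handled the PDF)
  # @param inline_splitter [#call] given a PDF path, splits the PDF inline and returns
  #   an Array of page locations (stands in for the existing IIIF Print splitter)
  def initialize(rodeo_lookup:, inline_splitter:)
    @rodeo_lookup = rodeo_lookup
    @inline_splitter = inline_splitter
  end

  # @param path [String]
  # @return [Array<String>]
  def call(path)
    pages = @rodeo_lookup.call(path)
    # Assumption from the discussion: the rodeo has either none or all of the pages,
    # so a non-empty result is treated as complete.
    return pages unless pages.empty?

    @inline_splitter.call(path)
  end
end
```

If the assumption turns out not to hold, the non-empty check would need to become a page count comparison.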

By design, we could demand that the rodeo split the PDF and return the constituent pages and their derivatives.

We should also consider that we may not need to wait for all of the splitting jobs. Instead we can create the child work, create a file set, and assign the rodeo files directly. We will likely not want to run the derivatives for the created file set.

2023-05-31 Notes

To leverage the Derivative Rodeo’s PdfSplitGenerator, we need to create a wrapper class in IIIF Print.

The wrapper class should have a .call method that has the following signature:

def self.call(path, file_set:)
end

That will allow us to replace the inner workings of IiifPrint::Jobs::ChildWorksFromPdfJob#split_pdf (see below)

def split_pdf(original_pdf_path, user, child_model)
  # TODO: This is the place to change out the existing service and instead use the derivative
  # rodeo; we will likely need to look at method signatures to tighten this interface.
  image_files = @parent_work.iiif_print_config.pdf_splitter_service.call(original_pdf_path)

With the file_set, we can use the IiifPrint::DerivativeRodeoService.derivative_rodeo_input_uri to create the pre_process/input_uri of the PDF, which we then pass to the PdfSplitGenerator. The output templates will also need to consider how we write the file.

##
# This method "hard-codes" some existing assumptions about the input_uri based on
# implementations for Adventist.  Those are reasonable assumptions but time will tell how
# reasonable.
#
# @param file_set [FileSet]
# @return [String]
def self.derivative_rodeo_input_uri(file_set:)
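
Putting the notes above together, the wrapper might look something like the following. This is a sketch under stated assumptions, not the eventual implementation: `RodeoSplitterSketch`, `input_uri_builder`, and `page_generator` are hypothetical names. In IIIF Print the first collaborator would presumably be IiifPrint::DerivativeRodeoService.derivative_rodeo_input_uri and the second a thin adapter around the PdfSplitGenerator; they are injected here so the sketch does not guess at the generator's constructor.

```ruby
# Hypothetical sketch of the wrapper; only the .call(path, file_set:) signature comes
# from the notes above.
class RodeoSplitterSketch
  class << self
    # Stand-ins for the real collaborators (assumptions, see the lead-in):
    # - input_uri_builder responds to call(file_set:) and returns a String URI
    # - page_generator responds to call(input_uri:, original_path:) and returns an
    #   Array of page file paths
    attr_accessor :input_uri_builder, :page_generator
  end

  # @param path [String] local path to the original PDF
  # @param file_set [#id] the FileSet holding the original PDF
  # @return [Array<String>] paths of the page images
  def self.call(path, file_set:)
    input_uri = input_uri_builder.call(file_set: file_set)
    page_generator.call(input_uri: input_uri, original_path: path)
  end
end
```

With that signature in place, the inner call in split_pdf would only need to pass the file_set along with the path.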


Why as a development dependency? Because the DerivativeRodeo introduces a dependency on Faraday >= 1, and the Valkyrie and ActiveFedora versions that Hyrax 2 and 3 depend on have a Faraday dependency of < 1. I am pushing this up so that I can begin development on the ingest aspect of the Derivative Rodeo. Also to see how this resolves in our CI setup and to see the impact, if any, on downstream implementations of IIIF Print (e.g. Adventist, British Library, ATLA, PALNI/PALCI, UTK, and others). The plan is to determine whether we want to keep this Faraday conflict setup or swap out something else in the underlying DerivativeRodeo. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - #219 - #220

Prior to this commit, if we'd already pre-processed a PDF split, we would again re-process that split (as there was no check for existing pages). With this commit, we check for those pre-processed pages. One critical bit of conversation is that one work might have multiple PDFs uploaded. Therefore, it is important to have those PDFs' pages write to different "sub-directories". I'm putting this here so we can account for that in a test audit of some kind. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220 Co-authored-by: Rob Kaufman <[email protected]> Co-authored-by: Kirk Wang <[email protected]>

Updating a bit of documentation and reworking the filename to account for a work having multiple PDFs. - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220
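
To illustrate the check described in these commits, here is a rough sketch that looks for already-split pages under a per-PDF sub-directory. It is an illustration only: the local filesystem stands in for whatever storage the rodeo uses, the helper name is hypothetical, and the directory layout (`<parent id>/<pdf basename>/pages/<pdf basename>-<page>.jpeg`) is an assumption borrowed from the scenarios discussed in this issue rather than the DerivativeRodeo's actual logic.

```ruby
# Hypothetical helper; it does not reproduce the DerivativeRodeo's own lookup.
#
# @param root [String] directory standing in for the storage root
# @param parent_id [String] ID of the parent work
# @param pdf_filename [String] e.g. "abcd.pdf"
# @return [Array<String>] pre-existing page images, ordered by page number
def preprocessed_page_images(root:, parent_id:, pdf_filename:)
  base = File.basename(pdf_filename, ".*")
  # Each PDF gets its own sub-directory, so two PDFs on one work cannot collide.
  pattern = File.join(root, parent_id, base, "pages", "#{base}-*.jpeg")

  Dir.glob(pattern)
     .reject { |path| path.end_with?(".thumbnail.jpeg") } # skip page thumbnails
     .sort_by { |path| path[/-(\d+)\.jpeg\z/, 1].to_i }   # numeric, so page 10 follows page 2
end
```

An empty result would mean the PDF has not been pre-processed and still needs to be split.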

This commit is a refactor in-place. The primary goal is to allow for passing the file_set to the child works; something that is ideal for the derivative rodeo's interface. This is intended to be a swap-in-place change. That is to say, if we deploy this change and have already enqueued jobs, nothing will fail or break. The past enqueued jobs (with a work) will use the work-based logic but future enqueueings will use file_set. In using the file_set, we also avoid the issue of passing a `nil` as the parent, and thus creating an infinite rescheduling cycle. Related to: - #220

Prior to this commit, we didn't have a spec for the S3 behavior. We now have a test for an S3 Faux Bucket. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220

* 🎁 Adding PDF Split Page Checks: Prior to this commit, if we'd already pre-processed a PDF split, we would again re-process that split (as there was no check for existing pages). With this commit, we check for those pre-processed pages. One critical bit of conversation is that one work might have multiple PDFs uploaded. Therefore, it is important to have those PDFs' pages write to different "sub-directories". I'm putting this here so we can account for that in a test audit of some kind. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220 Co-authored-by: Rob Kaufman <[email protected]> Co-authored-by: Kirk Wang <[email protected]>
* ☑️ Verifying pdf splitter finds pre-existing files: Updating a bit of documentation and reworking the filename to account for a work having multiple PDFs. - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220
* ☑️ Refining globbed_tail_locations for S3: Prior to this commit, we didn't have a spec for the S3 behavior. We now have a test for an S3 Faux Bucket. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220

Co-authored-by: Rob Kaufman <[email protected]> Co-authored-by: Kirk Wang <[email protected]>

Add a new PDF splitter option that wraps the DerivativeRodeo's PdfSplitGenerator. ref #220

jeremyf · 2023-06-02T12:28:39Z

I have written two sets of Gherkin-style scenarios, one for a PDF and one for a TIFF. A challenge we have is that we’re using the same SpaceStone handlers for the images in each of the scenarios; that is, for the extracted page images of the PDF and for the original TIFF.

This is complicated because the output files/directories are different between a PDF and a TIFF. In the case of the images for the PDF, we need to know the parent work ID, the file name, and the page number to correctly associate the generated image with its plain text, Alto XML, and word coordinates JSON. In the case of the original TIFF we are only working from the parent work ID and the file name.

At present the SpaceStone handlers and IIIF Print’s calling of the generators are responsible for correctly choosing the right location; this is done via the output and pre-processing template provided to the generators.

A fundamental challenge is that the DerivativeRodeo is agnostic about templated locations; it provides one set of functions in DerivativeRodeo::Services::ConvertUriViaTemplateService to give downstream implementations a means of assigning where we’re writing the files.

SpaceStone has resolved how it’s handling the different location templates for storing the plain text, Alto XML, and word coordinates derivatives.

Next is to resolve how IIIF Print handles this. We will need to know whether the given FileSet is for a page of a PDF and, when it is, what its page number is.

By convention we’ll have that page number based on how SpaceStone is writing that. That page number will be encoded in the location file name. We will likely want to consider the SpaceStone filename storage.

PDF Scenarios

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate a thumbnail of that PDF into S3
Then it will be stored at s3://host-bucket/1234/abcd/abcd.pdf.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we split the PDF into one JPEG image per page and store in S3
Then the images will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate a thumbnail of each of the page’s images and store in S3
Then the thumbnail images will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.thumbnail.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate an ALTO XML of each of the page’s images and store in S3
Then the ALTO XML will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.alto.xml

Image Scenarios

Given a TIFF with parent id of "1234" and filename of "efgh.tiff"
When we generate a thumbnail of that TIFF into S3
Then it will be stored at s3://host-bucket/1234/efgh/efgh.thumbnail.jpeg

Given a TIFF with parent id of "1234" and filename of "efgh.tiff"
When we generate an ALTO XML of that TIFF into S3
Then the ALTO XML will be stored in s3://host-bucket/1234/efgh/efgh.alto.xml
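
The scenarios above boil down to a few string templates. As a rough sketch (the module and method names are hypothetical, and this is not how SpaceStone or the DerivativeRodeo builds locations), they could be expressed as below. One assumption to flag: the TIFF helper places every derivative under a directory named for the file's basename, following the TIFF thumbnail scenario.

```ruby
# Hypothetical helpers that restate the scenarios as code.
module RodeoLocationSketch
  BUCKET = "s3://host-bucket"

  module_function

  # "abcd.pdf" => "abcd"
  def basename_for(filename)
    File.basename(filename, ".*")
  end

  # Thumbnail of the whole PDF; note the scenario keeps the ".pdf" in the file name.
  def pdf_thumbnail(parent_id:, filename:)
    "#{BUCKET}/#{parent_id}/#{basename_for(filename)}/#{filename}.jpeg"
  end

  # Per-page files: extension is "jpeg", "thumbnail.jpeg", or "alto.xml".
  def pdf_page(parent_id:, filename:, page:, extension: "jpeg")
    base = basename_for(filename)
    "#{BUCKET}/#{parent_id}/#{base}/pages/#{base}-#{page}.#{extension}"
  end

  # Derivatives of a single image (e.g. a TIFF); no page number involved.
  def image_derivative(parent_id:, filename:, extension:)
    base = basename_for(filename)
    "#{BUCKET}/#{parent_id}/#{base}/#{base}.#{extension}"
  end
end
```

The difference between `pdf_page` and `image_derivative` is the crux of the problem described above: the same handler needs to know which of the two shapes applies.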

jeremyf · 2023-06-02T12:34:30Z

Proposal:

In the DerivativeRodeo, we should be setting the output template tail for PDF pages to "#{basename}/pages/#{basename}.page-%d.#{output_extension}". This gives us higher confidence that, when we have just the filename, we can assume it to be a PDF page (and thus helps us find all of the other files associated with the page).

https://github.com/scientist-softserv/derivative_rodeo/blob/2ca92617c29febd6be1e5c0a8c98714d4b6f482e/lib/derivative_rodeo/generators/pdf_split_generator.rb#L32-L34
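
To make the proposal concrete, here is a small sketch of building that tail and of recognizing it later when all we have is a location. The module, method names, and regular expression are hypothetical; only the template string comes from the proposal.

```ruby
# Hypothetical helpers around the proposed tail template.
module PdfPageTailSketch
  # Matches "<basename>/pages/<basename>.page-<number>.<extension>" at the end of a
  # location; the back-reference requires both basenames to agree.
  PAGE_PATTERN = %r{(?<basename>[^/]+)/pages/\k<basename>\.page-(?<page>\d+)\.(?<extension>.+)\z}

  module_function

  # Builds the tail from the proposed template. A basename containing "%" would need
  # escaping before being passed to format; that is ignored in this sketch.
  def tail_for(basename:, page:, output_extension:)
    format("#{basename}/pages/#{basename}.page-%d.#{output_extension}", page)
  end

  # @return [Integer, nil] the page number, or nil when the location is not a PDF page
  def page_number(location)
    match = PAGE_PATTERN.match(location)
    match && Integer(match[:page], 10)
  end
end
```

Because the ".page-" marker and the "pages" directory are both required, a location such as "1234/efgh/efgh.alto.xml" is not mistaken for a page of a PDF.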

This commit adds the logic to take an uploaded PDF and then split that PDF into constituent images. It does not yet account for how we handle the derivatives we generate from the images split off from the PDF. Related to: - #220 Co-authored-by: LaRita Robinson <[email protected]> Co-authored-by: Shana Moore <[email protected]>

Add a new PDF splitter option that wraps the DerivativeRodeo's PdfSplitGenerator. It handles, in theory, PDF splitting and the derivatives generated in the DerivativeRodeo. Related to: - #220 Co-authored-by: LaRita Robinson <[email protected]> Co-authored-by: Shana Moore <[email protected]>

* 🎁 Add derivative_rodeo_splitter: Add a new PDF splitter option that wraps the DerivativeRodeo's PdfSplitGenerator. It handles, in theory, PDF splitting and the derivatives generated in the DerivativeRodeo. Related to: - #220 Co-authored-by: Shana Moore <[email protected]> Co-authored-by: Jeremy Friesen <[email protected]>

This was referenced May 10, 2024

Spike: Identify What Must Change to Leverage Pre-Processing of the Rodeo scientist-softserv/adventist_knapsack#406

Closed

EPIC: Import Optimization with Out of Band Processing scientist-softserv/adventist_knapsack#421

Open

jillpe added the Derivative Rodeo label Apr 14, 2023

jeremyf changed the title ~~Adjust IIIF Print Page Splitting Process to Utilize the Derivative::Rodeo~~ 🎁 Adjust IIIF Print Page Splitting Process to Utilize the Derivative::Rodeo May 22, 2023

jeremyf mentioned this issue May 24, 2023

⚙️ Adding derivative_rodeo as dev dependency #243

Closed

jeremyf mentioned this issue May 25, 2023

🎁 Adding PDF Split Page Checks scientist-softserv/derivative_rodeo#36

Merged

jeremyf self-assigned this May 30, 2023

jeremyf mentioned this issue May 30, 2023

♻️ Preparing Splitter Service Swap for DerivativeRodeo #249

Merged

laritakr added a commit that referenced this issue Jun 1, 2023

🎁 Add derivative_rodeo_splitter

5e6cc41

Add a new PDF splitter option that wraps the DerivativeRodeo's PdfSplitGenerator. ref #220

laritakr mentioned this issue Jun 1, 2023

🎁 Add derivative_rodeo_splitter #250

Merged

laritakr added a commit that referenced this issue Jun 1, 2023

🎁 Add derivative_rodeo_splitter

b969541

Add a new PDF splitter option that wraps the DerivativeRodeo's PdfSplitGenerator. ref #220

jeremyf removed their assignment May 30, 2024
