🎁 Adjust IIIF Print Page Splitting Process to Utilize the Derivative::Rodeo #220

Open
Tracked by #421 ...
jeremyf opened this issue Apr 14, 2023 · 2 comments

Comments

@jeremyf
Contributor

jeremyf commented Apr 14, 2023

This follows on the work of #219 and relates to scientist-softserv/adventist_knapsack#406.

Given a FileSet with an original file of a PDF
And that PDF has been handled by the rodeo
When the IIIF Print gem goes to split the PDF
Then we should use the pre-processed rodeo files instead of running any inline splitting

Discussion

When we split a PDF into multiple pages, we likely do not want to fall back to the Hyrax::FileSetDerivativeService. That service is for converting original files. We instead want to utilize the image, extracted text, etc. that the Derivative::Rodeo created.

We also want to consider that we have existing PDF splitting and do not yet want to disrupt that processing. So the strategy is to create a new process that we use to handle split PDFs. We could, in theory, fall back to the existing IIIF Print split processing if the PDF does not have pages in the rodeo.

An assumption is that, for a given file, the rodeo will have either none or all of the constituent pages. That is to say, we should not expect that IIIF Print would create the image and handle OCR for a single page of the PDF.

By design, we could demand that the rodeo split the PDF and return the constituent pages and their derivatives.

Also worth considering is that we may not need to wait for all of the splitting jobs. Instead we can create the child work, create a file set, and assign the rodeo files directly. We will likely not want to run the derivatives for the created file set.
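
The fall-back strategy and the "none or all" assumption discussed above can be sketched in plain Ruby. This is only an illustration: `SplitStrategy`, `rodeo_page_lookup`, and `inline_splitter` are hypothetical names, and the two lambdas stand in for the rodeo lookup and the existing IIIF Print splitter, neither of which is called here.

```ruby
# Hypothetical sketch: none of these names come from IIIF Print or the
# DerivativeRodeo. The two callables stand in for the real services.
class SplitStrategy
  # @param rodeo_page_lookup [#call] returns pre-processed page locations
  #   for a PDF path; an empty Array means the rodeo has no pages.
  # @param inline_splitter [#call] stands in for the existing splitting.
  def initialize(rodeo_page_lookup:, inline_splitter:)
    @rodeo_page_lookup = rodeo_page_lookup
    @inline_splitter = inline_splitter
  end

  # Relies on the "none or all" assumption: any pages at all means the
  # rodeo has every page, so we never mix the two sources.
  #
  # @return [Array] a two-element Array of the source and the pages
  def call(pdf_path)
    pages = @rodeo_page_lookup.call(pdf_path)
    return [:rodeo, pages] unless pages.empty?

    [:inline, @inline_splitter.call(pdf_path)]
  end
end

# Made-up data for the example.
rodeo_pages = { "abcd.pdf" => ["abcd-page-1.jpeg", "abcd-page-2.jpeg"] }

strategy = SplitStrategy.new(
  rodeo_page_lookup: ->(path) { rodeo_pages.fetch(path, []) },
  inline_splitter: ->(path) { ["inline split of #{path}"] }
)

known = strategy.call("abcd.pdf")
unknown = strategy.call("other.pdf")
```

Keeping both paths behind one object means the existing splitting keeps working untouched for PDFs the rodeo has never seen.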

2023-05-31 Notes

To leverage the Derivative Rodeo’s PdfSplitGenerator, we need to create a wrapper class in IIIF Print.

The wrapper class should have a .call method that has the following signature:

def self.call(path, file_set:)
end

That will allow us to replace the inner workings of IiifPrint::Jobs::ChildWorksFromPdfJob#split_pdf (see below)

def split_pdf(original_pdf_path, user, child_model)
  # TODO: This is the place to change out the existing service and instead use the derivative
  # rodeo; we will likely need to look at method signatures to tighten this interface.
  image_files = @parent_work.iiif_print_config.pdf_splitter_service.call(original_pdf_path)

With the file_set, we can use the IiifPrint::DerivativeRodeoService.derivative_rodeo_input_uri to create the pre_process/input_uri of the PDF, which we then pass to the PdfSplitGenerator. The output templates will also need to account for how we write the file.
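
A minimal sketch of how such a wrapper could be shaped, keeping the `.call(path, file_set:)` signature from these notes. Everything else is an assumption: `RodeoSplitterSketch`, `input_uri_builder`, and `generator_builder` are hypothetical, and the lambdas are stand-ins for `derivative_rodeo_input_uri` and the PdfSplitGenerator, whose real interfaces are not reproduced here.

```ruby
# Sketch only. The two builders are hypothetical injection points so the
# example runs without IIIF Print or the DerivativeRodeo loaded.
class RodeoSplitterSketch
  class << self
    # input_uri_builder: responds to call(file_set:) and returns a String
    # generator_builder: responds to call(input_uri:, local_path:) and
    #   returns an object that responds to #call with the page locations
    attr_accessor :input_uri_builder, :generator_builder
  end

  # Same signature as proposed above, so it can replace the existing
  # pdf_splitter_service once the job passes the file_set along.
  def self.call(path, file_set:)
    input_uri = input_uri_builder.call(file_set: file_set)
    generator = generator_builder.call(input_uri: input_uri, local_path: path)
    generator.call
  end
end

# Stand-ins for the example; the URI layout is made up.
FakeFileSet = Struct.new(:id, :label)

RodeoSplitterSketch.input_uri_builder = lambda do |file_set:|
  "file:///tmp/#{file_set.id}/#{file_set.label}"
end

RodeoSplitterSketch.generator_builder = lambda do |input_uri:, local_path:|
  -> { ["#{input_uri}#page=1 (from #{local_path})"] }
end

pages = RodeoSplitterSketch.call(
  "/tmp/abcd.pdf",
  file_set: FakeFileSet.new("1234", "abcd.pdf")
)
```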

##
# This method "hard-codes" some existing assumptions about the input_uri based on
# implementations for Adventist.  Those are reasonable assumptions but time will tell how
# reasonable.
#
# @param file_set [FileSet]
# @return [String]
def self.derivative_rodeo_input_uri(file_set:)
jeremyf changed the title from "Adjust IIIF Print Page Splitting Process to Utilize the Derivative::Rodeo" to "🎁 Adjust IIIF Print Page Splitting Process to Utilize the Derivative::Rodeo" on May 22, 2023
jeremyf added a commit that referenced this issue May 24, 2023
Why as a development dependency?  Because the DerivativeRodeo introduces
a dependency on Faraday >= 1.  And the Valkyrie and ActiveFedora
versions which Hyrax 2 and 3 depend on have a Faraday dependency of < 1.

I am pushing this up so that I can begin development on the ingest
aspect of the Derivative Rodeo.  Also to see how this resolves in our CI
setup and to see the impact, if any on downstream implementations of
IIIF Print (e.g. Adventist, British Library, ATLA, PALNI/PALCI, UTK, and
others).

The plan is to determine if we want to have this Faraday conflict setup
or if we want to swap out something else in the underlying
DerivativeRodeo.

Related to:

- https://github.com/scientist-softserv/adventist-dl/issues/330
- #219
- #220
jeremyf added a commit to scientist-softserv/derivative_rodeo that referenced this issue May 25, 2023
Prior to this commit, if we'd already pre-processed a PDF split, we
would again re-process that split (as there was no check for existing
pages).

With this commit, we check for those pre-processed pages.

One critical bit of conversation is that one work might have multiple
PDFs uploaded.  Therefore, it is important to have those PDFs' pages
write to different "sub-directories".  I'm putting this here so we can
account for that in a test audit of some kind.

Related to:

- https://github.com/scientist-softserv/adventist-dl/issues/330
- scientist-softserv/iiif_print#220

Co-authored-by: Rob Kaufman <[email protected]>
Co-authored-by: Kirk Wang <[email protected]>
jeremyf added a commit to scientist-softserv/derivative_rodeo that referenced this issue May 26, 2023
Updating a bit of documentation and reworking the filename to account
for a work having multiple PDFs.

- https://github.com/scientist-softserv/adventist-dl/issues/330
- scientist-softserv/iiif_print#220
jeremyf self-assigned this May 30, 2023
jeremyf added a commit that referenced this issue May 30, 2023
This commit is a refactor in-place.  The primary goal is to allow for
passing the file_set to the child works; something that is ideal for the
derivative rodeo's interface.

This is intended to be a swap-in-place change.  That is to say, if we
deploy this change and have already enqueued jobs, nothing will fail nor
break.  The past enqueued jobs (with a work) will use the work based
logic but future enqueueings will use file_set.

In using the file_set, we also avoid the issue of passing a `nil`
as the parent, and thus creating an infinite rescheduling cycle.

Related to:

- #220
jeremyf added a commit to scientist-softserv/derivative_rodeo that referenced this issue May 30, 2023
Prior to this commit, we didn't have a spec for the S3 behavior.  We now
have a test for an S3 Faux Bucket.

Related to:

- https://github.com/scientist-softserv/adventist-dl/issues/330
- scientist-softserv/iiif_print#220
jeremyf added a commit to scientist-softserv/derivative_rodeo that referenced this issue May 30, 2023
* 🎁 Adding PDF Split Page Checks

Prior to this commit, if we'd already pre-processed a PDF split, we
would again re-process that split (as there was no check for existing
pages).

With this commit, we check for those pre-processed pages.

One critical bit of conversation is that one work might have multiple
PDFs uploaded.  Therefore, it is important to have those PDFs' pages
write to different "sub-directories".  I'm putting this here so we can
account for that in a test audit of some kind.

Related to:

- https://github.com/scientist-softserv/adventist-dl/issues/330
- scientist-softserv/iiif_print#220

Co-authored-by: Rob Kaufman <[email protected]>
Co-authored-by: Kirk Wang <[email protected]>

* ☑️ Verifying pdf splitter finds pre-existing files

Updating a bit of documentation and reworking the filename to account
for a work having multiple PDFs.

- https://github.com/scientist-softserv/adventist-dl/issues/330
- scientist-softserv/iiif_print#220

* ☑️ Refining globbed_tail_locations for S3

Prior to this commit, we didn't have a spec for the S3 behavior.  We now
have a test for an S3 Faux Bucket.

Related to:

- https://github.com/scientist-softserv/adventist-dl/issues/330
- scientist-softserv/iiif_print#220

---------

Co-authored-by: Rob Kaufman <[email protected]>
Co-authored-by: Kirk Wang <[email protected]>
laritakr added a commit that referenced this issue Jun 1, 2023
Add a new PDF splitter option that wraps the DerivativeRodeo's PdfSplitGenerator.

ref #220
@jeremyf
Contributor Author

jeremyf commented Jun 2, 2023

I have written two sets of Gherkin-style scenarios, one for a PDF and one for a TIFF. A challenge we have is that we’re using the same SpaceStone handlers for the images in each of the scenarios; that is, for the extracted page images of the PDF and for the original TIFF.

This is complicated because the output files and directories differ between a PDF and a TIFF. In the case of the images for the PDF, we need to know the parent work ID, the file name, and the page number to correctly associate the generated image with its plain text, Alto XML, and word coordinates JSON. In the case of the original TIFF, we are working only from the parent work ID and the file name.

At present the SpaceStone handlers and IIIF Print’s calling of the generators are responsible for correctly choosing the right location; this is done via the output and pre-processing template provided to the generators.

A fundamental challenge is that the DerivativeRodeo is agnostic about templated locations; it provides one set of functions, in DerivativeRodeo::Services::ConvertUriViaTemplateService, to give downstream implementations a means of assigning where we’re writing the files.

SpaceStone has resolved how it’s handling the different location templates for storing the plain text, Alto XML, and word coordinates derivatives.

Next is to resolve how IIIF Print handles this. We will need to know whether the given FileSet is for a page of a PDF and, when it is, what its page number is.

By convention, that page number will follow how SpaceStone writes it: it will be encoded in the location's file name. We will likely want to consider the SpaceStone filename storage.

PDF Scenarios

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate a thumbnail of that PDF into S3
Then it will be stored at s3://host-bucket/1234/abcd/abcd.pdf.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we split the PDF into one JPEG image per page and store in S3
Then the images will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate a thumbnail of each of the page’s images and store in S3
Then the thumbnail images will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.thumbnail.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate an ALTO XML of each of the page’s images and store in S3
Then the ALTO XML will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.alto.xml

Image Scenarios

Given a TIFF with parent id of "1234" and filename of "efgh.tiff"
When we generate a thumbnail of that TIFF into S3
Then it will be stored at s3://host-bucket/1234/efgh/efgh.thumbnail.jpeg

Given a TIFF with parent id of "1234" and filename of "efgh.tiff"
When we generate an ALTO XML of that TIFF into S3
Then the ALTO XML will be stored in s3://host-bucket/1234/efgh/efgh.alto.xml
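
To make the two layouts concrete, here is a small sketch that builds the locations from the scenarios. `ScenarioLocations` and its methods are hypothetical helpers, not SpaceStone or DerivativeRodeo API. It assumes TIFF derivatives sit under a directory named after the TIFF's own basename, as in the thumbnail scenario, and it covers only the page-level and image-level derivatives, not the PDF's own thumbnail.

```ruby
# Hypothetical helper; the bucket and layout mirror the scenarios above
# rather than any API in SpaceStone or the DerivativeRodeo.
module ScenarioLocations
  BUCKET = "s3://host-bucket"

  # Derivative of one page split from a PDF: needs the parent id, the
  # file name, and the page number.
  def self.page_derivative(parent_id:, filename:, page:, suffix:)
    base = File.basename(filename, ".*")
    "#{BUCKET}/#{parent_id}/#{base}/pages/#{base}-#{page}.#{suffix}"
  end

  # Derivative of an original image (e.g. a TIFF): only the parent id
  # and the file name are needed.
  def self.image_derivative(parent_id:, filename:, suffix:)
    base = File.basename(filename, ".*")
    "#{BUCKET}/#{parent_id}/#{base}/#{base}.#{suffix}"
  end
end

page_alto = ScenarioLocations.page_derivative(
  parent_id: "1234", filename: "abcd.pdf", page: 1, suffix: "alto.xml"
)
tiff_thumbnail = ScenarioLocations.image_derivative(
  parent_id: "1234", filename: "efgh.tiff", suffix: "thumbnail.jpeg"
)
```

The extra `page:` argument is the whole difference between the two cases, which is why IIIF Print needs to know whether a FileSet came from a PDF page.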

@jeremyf
Contributor Author

jeremyf commented Jun 2, 2023

Proposal:

In the DerivativeRodeo, we should be setting the output template tail for PDF pages to "#{basename}/pages/#{basename}.page-%d.#{output_extension}". This gives us higher confidence that, when we have only the filename, we can assume it to be a PDF page (and thus find all of the other files associated with the page).

https://github.com/scientist-softserv/derivative_rodeo/blob/2ca92617c29febd6be1e5c0a8c98714d4b6f482e/lib/derivative_rodeo/generators/pdf_split_generator.rb#L32-L34
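
A rough sketch of how the proposed tail could be used in both directions: filling in a page number, and recognizing a PDF page from nothing but its location. The `PageTail` module, its method names, and the regular expression are assumptions for illustration, not code from the DerivativeRodeo.

```ruby
# Sketch only; PageTail is a hypothetical helper.
module PageTail
  # Matches the ".page-<digits>." marker in the final path segment.
  PATTERN = %r{\.page-(\d+)\.[^/]+\z}

  # Builds the proposed template tail; "%d" is left for the page number.
  # Assumes the basename itself contains no "%" characters.
  def self.template(basename:, output_extension:)
    "#{basename}/pages/#{basename}.page-%d.#{output_extension}"
  end

  # @return [Integer, nil] the page number, or nil when the location does
  #   not look like a PDF page.
  def self.page_number(location)
    match = PATTERN.match(location)
    match && Integer(match[1], 10)
  end
end

tail = PageTail.template(basename: "abcd", output_extension: "tiff")
page_location = format(tail, 2)

from_page = PageTail.page_number(page_location)
from_image = PageTail.page_number("efgh/efgh.alto.xml")
```

Because the `.page-` marker appears only in page file names, a `nil` result is a reasonable signal that the FileSet is not a PDF page.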

jeremyf added a commit that referenced this issue Jun 2, 2023
This commit adds the logic to take an uploaded PDF and then split that
PDF into constituent images.  It does not yet account for how we handle
the derivatives we generate from the images split off from the PDF.

Related to:

- #220

Co-authored-by: LaRita Robinson <[email protected]>
Co-authored-by: Shana Moore <[email protected]>
jeremyf pushed a commit that referenced this issue Jun 2, 2023
Add a new PDF splitter option that wraps the DerivativeRodeo's
PdfSplitGenerator.  It handles, in theory, PDF splitting and the
derivatives generated in the DerivativeRodeo.

Related to:

- #220

Co-authored-by: LaRita Robinson <[email protected]>
Co-authored-by: Shana Moore <[email protected]>
jeremyf added a commit that referenced this issue Jun 5, 2023
* 🎁 Add derivative_rodeo_splitter

Add a new PDF splitter option that wraps the DerivativeRodeo's
PdfSplitGenerator.  It handles, in theory, PDF splitting and the
derivatives generated in the DerivativeRodeo.

Related to:

- #220

Co-authored-by: Shana Moore <[email protected]>
Co-authored-by: Jeremy Friesen <[email protected]>
jeremyf removed their assignment May 30, 2024