🎁 Adjust IIIF Print Page Splitting Process to Utilize the Derivative::Rodeo #220
Comments
Why as a development dependency? Because the DerivativeRodeo introduces a dependency on Faraday >= 1, and the Valkyrie and ActiveFedora versions which Hyrax 2 and 3 depend on have a Faraday dependency of < 1. I am pushing this up so that I can begin development on the ingest aspect of the Derivative Rodeo. I also want to see how this resolves in our CI setup and to see the impact, if any, on downstream implementations of IIIF Print (e.g. Adventist, British Library, ATLA, PALNI/PALCI, UTK, and others). The plan is to determine if we want to keep this Faraday conflict setup or if we want to swap out something else in the underlying DerivativeRodeo. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - #219 - #220
Prior to this commit, if we'd already pre-processed a PDF split, we would again re-process that split (as there was no check for existing pages). With this commit, we check for those pre-processed pages. One critical bit of conversation is that one work might have multiple PDFs uploaded. Therefore, it is important to have those PDFs' pages write to different "sub-directories". I'm putting this here so we can account for that in a test audit of some kind. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220 Co-authored-by: Rob Kaufman Co-authored-by: Kirk Wang
Updating a bit of documentation and reworking the filename to account for a work having multiple PDFs. - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220
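As a minimal sketch of the pre-existing page check described above, assuming a local file system and a hypothetical directory layout of `<root>/<work id>/<PDF basename>/pages/`, the lookup could resemble the following. The method name `existing_split_pages`, its parameters, and the `tiff` default are illustrative assumptions, not IIIF Print's actual API.

```ruby
# Hypothetical helper (not part of IIIF Print): find pages already split from ONE PDF.
# Each PDF of a work gets its own sub-directory, named after the PDF's basename,
# so two PDFs uploaded to the same work never share page files.
def existing_split_pages(root:, work_id:, pdf_filename:, extension: "tiff")
  basename = File.basename(pdf_filename, ".*")
  pattern = File.join(root, work_id, basename, "pages", "#{basename}-*.#{extension}")

  # Sort by the numeric page suffix so that page 10 follows page 9, not page 1.
  Dir.glob(pattern).sort_by { |path| path[/-(\d+)\.[^.]+\z/, 1].to_i }
end
```

A caller could skip the split when this returns a non-empty list and only invoke the splitter otherwise.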
This commit is a refactor in-place. The primary goal is to allow for passing the file_set to the child works; something that is ideal for the derivative rodeo's interface. This is intended to be a swap-in-place change. That is to say, if we deploy this change and have already enqueued jobs, nothing will fail or break. The previously enqueued jobs (with a work) will use the work-based logic, but future enqueueings will use file_set. In using the file_set, we also avoid the issue of passing a `nil` as the parent, and thus creating an infinite rescheduling cycle. Related to: - #220
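The swap-in-place idea can be sketched as a job entry point that accepts either kind of argument. Everything below is a hypothetical stand-in: `FakeFileSet`, `FakeWork`, `resolve_parent`, and `perform` are invented for illustration and are not the Hyrax or IIIF Print classes.

```ruby
# Hypothetical stand-ins for a Hyrax file set and work.
FakeFileSet = Struct.new(:id, :parent)
FakeWork = Struct.new(:id)

# Older enqueued jobs carry a work; newer jobs carry a file set.
# Either way we end up with the parent work, or nil when none can be found.
def resolve_parent(candidate)
  case candidate
  when FakeFileSet then candidate.parent
  when FakeWork then candidate
  end
end

def perform(candidate)
  parent = resolve_parent(candidate)
  # Failing loudly here, rather than rescheduling, avoids an endless retry
  # cycle when the parent is nil.
  raise ArgumentError, "no parent work for #{candidate.inspect}" if parent.nil?

  "splitting for work #{parent.id}"
end
```

The design choice is that both old and new payloads resolve to the same parent before any work is done, so a deploy does not invalidate jobs already in the queue.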
Prior to this commit, we didn't have a spec for the S3 behavior. We now have a test for an S3 Faux Bucket. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220
* 🎁 Adding PDF Split Page Checks

Prior to this commit, if we'd already pre-processed a PDF split, we would again re-process that split (as there was no check for existing pages). With this commit, we check for those pre-processed pages. One critical bit of conversation is that one work might have multiple PDFs uploaded. Therefore, it is important to have those PDFs' pages write to different "sub-directories". I'm putting this here so we can account for that in a test audit of some kind. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220

* ☑️ Verifying pdf splitter finds pre-existing files

Updating a bit of documentation and reworking the filename to account for a work having multiple PDFs. - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220

* ☑️ Refining globbed_tail_locations for S3

Prior to this commit, we didn't have a spec for the S3 behavior. We now have a test for an S3 Faux Bucket. Related to: - https://github.com/scientist-softserv/adventist-dl/issues/330 - scientist-softserv/iiif_print#220

Co-authored-by: Rob Kaufman Co-authored-by: Kirk Wang
Add a new PDF splitter option that wraps the DerivativeRodeo's PdfSplitGenerator. ref #220
I have written two sets of Gherkin-style scenarios, one for a PDF and one for a TIFF.

A challenge we have is that we’re using the same SpaceStone handlers for the images in each of the scenarios; that is, the extracted image pages of the PDF and the original TIFF. This is complicated because the output files and directories differ between a PDF and a TIFF.

In the case of the images for the PDF, we need to know the parent work ID, the file name, and the page number to correctly associate the generated image with its plain text, Alto XML, and word coordinates JSON. In the case of the original TIFF, we are only working from the parent work ID and the file name.

At present the SpaceStone handlers and IIIF Print’s calling of the generators are responsible for correctly choosing the right location; this is done via the output and pre-processing template provided to the generators. A fundamental challenge is that the DerivativeRodeo is agnostic about templated locations; it provides one set of functions.

SpaceStone has resolved how it’s handling the different location templates for storing the plain text, Alto XML, and word coordinates derivatives. Next is to resolve how IIIF Print handles this. What we will need to know is whether the given FileSet is for a page of a PDF or not and, when it is from a PDF, what its page number is. By convention, that page number will be based on how SpaceStone writes it: the page number will be encoded in the location file name. We will likely want to consider the SpaceStone filename storage.

PDF Scenarios

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate thumbnail of that PDF into S3
Then it will be stored at s3://host-bucket/1234/abcd/abcd.pdf.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we split the PDF into one JPEG image per page and store in S3
Then the images will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate a thumbnail of each of the page’s images and store in S3
Then the thumbnail images will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.thumbnail.jpeg

Given a 2 page PDF with parent id of "1234" and filename of "abcd.pdf"
When we generate an ALTO XML of each of the page’s images and store in S3
Then the ALTO XML will be stored in s3://host-bucket/1234/abcd/pages/abcd-<page-number>.alto.xml

Image Scenarios

Given a TIFF with parent id of "1234" and filename of "efgh.tiff"
When we generate thumbnail of that TIFF into S3
Then it will be stored at s3://host-bucket/1234/efgh/efgh.thumbnail.jpeg

Given a TIFF with parent id of "1234" and filename of "efgh.tiff"
When we generate an ALTO XML of that TIFF into S3
Then the ALTO XML will be stored in s3://host-bucket/1234/efgh/efgh.alto.xml
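The per-page and per-image location patterns in these scenarios can be expressed as a small pure function, together with the reverse step of reading a page number back out of a location. This is only a sketch: `derivative_location`, `page_number_from`, the `SUFFIXES` table, and the default bucket name are hypothetical, and the whole-PDF thumbnail (which uses a different suffix in the first scenario) is deliberately not covered.

```ruby
# Hypothetical suffix table for the derivative kinds named in the scenarios.
SUFFIXES = { image: "jpeg", thumbnail: "thumbnail.jpeg", alto: "alto.xml" }.freeze

# Build a location from the parent work ID, the file name, and (for PDF pages
# only) the page number. Without a page number the derivative sits beside the
# original; with one it sits under a "pages" directory.
def derivative_location(parent_id:, filename:, derivative:, page: nil, bucket: "host-bucket")
  basename = File.basename(filename, ".*")
  suffix = SUFFIXES.fetch(derivative)
  base = "s3://#{bucket}/#{parent_id}/#{basename}"
  return "#{base}/#{basename}.#{suffix}" if page.nil?

  "#{base}/pages/#{basename}-#{page}.#{suffix}"
end

# Recover the page number encoded in a location's file name.
# Returns nil when the location is not under a "pages" directory.
def page_number_from(location)
  return nil unless location.include?("/pages/")

  match = File.basename(location).match(/-(\d+)\.[a-z.]+\z/)
  match && match[1].to_i
end
```

Keeping both directions in one place means the convention "the page number is encoded in the file name" is defined once, which is the information IIIF Print would need to tell a PDF page from an original image.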
Proposal: In the DerivativeRodeo, we should be setting the output template tail for PDF pages to
This commit adds the logic to take an uploaded PDF and then split that PDF into constituent images. It does not yet account for how we handle the derivatives we generate from the images split off from the PDF. Related to: - #220 Co-authored-by: LaRita Robinson Co-authored-by: Shana Moore
Add a new PDF splitter option that wraps the DerivativeRodeo's PdfSplitGenerator. It handles, in theory, PDF splitting and the derivatives generated in the DerivativeRodeo. Related to: - #220 Co-authored-by: LaRita Robinson Co-authored-by: Shana Moore
* 🎁 Add derivative_rodeo_splitter

Add a new PDF splitter option that wraps the DerivativeRodeo's PdfSplitGenerator. It handles, in theory, PDF splitting and the derivatives generated in the DerivativeRodeo. Related to: - #220 Co-authored-by: Shana Moore Co-authored-by: Jeremy Friesen
This follows on the work of #219 and relates to scientist-softserv/adventist_knapsack#406.
Discussion
When we split a PDF into multiple pages, we likely do not want to fall back to the Hyrax::FileSetDerivativeService. That service is for converting original files. We instead want to utilize the image, extracted text, etc. that the Derivative::Rodeo created.
We also want to consider that we have existing PDF splitting and do not yet want to disrupt that processing. So the strategy is to create a new process that we use to handle split PDFs. We could, in theory, fall back to the existing IIIF Print split processing if the PDF does not have pages in the rodeo.
An assumption is that, for a given file, the rodeo will have either none or all of the constituent pages. That is to say, we should not expect that IIIF Print would create the image and handle OCR for a single page of the PDF.
By design, we could demand that the rodeo split the PDF and return the constituent pages and their derivatives.
Also worth considering is that we may not need to wait for all of the splitting jobs. Instead we can: create the child work, create a file set, and assign the rodeo files directly. We will likely not want to run the derivatives for the created file set.
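The none-or-all assumption and the fall-back idea from this discussion can be sketched as a single decision. The method name, the symbols it returns, and the premise that the expected page count is known up front are all assumptions made for illustration.

```ruby
# Hypothetical decision helper.
# rodeo_pages:         page locations found in the rodeo for one PDF
# expected_page_count: number of pages in that PDF (assumed to be known)
def split_strategy(rodeo_pages:, expected_page_count:)
  # No pages in the rodeo: fall back to the existing IIIF Print splitting.
  return :legacy_split if rodeo_pages.empty?
  # Every page is present: attach the rodeo's files directly.
  return :use_rodeo_pages if rodeo_pages.size == expected_page_count

  # A partial set violates the none-or-all assumption, so surface it.
  raise ArgumentError,
        "rodeo has #{rodeo_pages.size} of #{expected_page_count} pages; expected none or all"
end
```

Raising on a partial set is one possible choice; it keeps IIIF Print from quietly generating images and OCR for a handful of missing pages.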
2023-05-31 Notes
To leverage the Derivative Rodeo’s PdfSplitGenerator, we need to create a wrapper class in IIIF Print.
The wrapper class should have a `.call` method that has the following signature:
That will allow us to replace the inner workings of IiifPrint::Jobs::ChildWorksFromPdfJob#split_pdf (see below)
With the file_set, we can use the IiifPrint::DerivativeRodeoService.derivative_rodeo_input_uri to create the pre_process/input_uri of the PDF, which we then pass to the PdfSplitGenerator. The output templates will also need to consider how we write the file.
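The flow in these notes (file set, then input URI, then split generator) might be wired together roughly as follows. This is a sketch with injected collaborators: the class name `RodeoSplitterSketch` and every keyword argument are hypothetical stand-ins, not the actual IIIF Print or DerivativeRodeo interfaces.

```ruby
# Hypothetical wrapper exposing a single .call entry point.
class RodeoSplitterSketch
  # file_set:          the file set holding the uploaded PDF
  # input_uri_builder: callable turning a file set into the PDF's input URI
  # generator:         callable that splits the PDF and returns page locations
  # output_template:   template describing where each page is written
  def self.call(file_set:, input_uri_builder:, generator:, output_template:)
    input_uri = input_uri_builder.call(file_set)
    generator.call(input_uri: input_uri, output_template: output_template)
  end
end
```

Injecting the URI builder and the generator keeps the wrapper small and lets a spec substitute simple lambdas for the real services.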