s3 assets streaming and conversion? #367

kimyu92 · 2023-05-12T02:47:03Z

Would it be possible to tweak the following example to stream a PDF page by page, without loading the whole PDF into memory?

# Existing approach:
# download from cloud storage, load the whole PDF into memory,
# and then perform the conversion
require "vips"

file_name = "a.pdf"

# The first load is only used to read the page count
pdf = Vips::Image.pdfload(file_name, access: :sequential)
n_pages = pdf.get("n-pages")

(0...n_pages).each do |page_index|
  # Reload the file at the requested page
  page = Vips::Image.pdfload(file_name, access: :sequential, page: page_index)

  page.write_to_file("page_#{page_index}.png", Q: 100)
end

Also, it seems weird that we have to initialize the object again to get to a particular page. Shouldn't there be an API like pdf.get_page(index) that is equivalent to Vips::Image.pdfload(file_name, access: :sequential, page: page_index)?

jcupitt · 2023-05-12T10:32:23Z

Maintainer

Hi @kimyu92,

Unfortunately PDFs put a lot of document information at the end of the file, so you usually need to scan the whole thing before starting. As soon as you call poppler_document_new_from_stream(), the first thing it does is read the whole file.

Perhaps pdfium is less greedy? I've not tested it for this.

libvips will render a page at a time, so the actual rendering process shouldn't need that much memory.
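
Since the loader has to see the whole file anyway, one option is to fetch the object once and then render pages from that in-memory copy one at a time. The following is only a minimal sketch: `each_pdf_page` and `load_page` are hypothetical names, the bytes are assumed to have been fetched already by whatever S3 client is in use, and the loader is passed in as a callable rather than hard-coded.

```ruby
# Hypothetical helper: render a PDF held in memory one page at a time.
# `load_page` is any callable taking (bytes, page_index) and returning an
# image-like object that responds to #get. It is an assumption of this
# sketch, injected so the loop is not tied to one loader.
def each_pdf_page(bytes, load_page)
  # The first load is only used to read the page count from the metadata.
  n_pages = load_page.call(bytes, 0).get("n-pages")

  n_pages.times do |index|
    # Each iteration loads a single page and hands it to the caller.
    yield load_page.call(bytes, index), index
  end
end

# Usage sketch with ruby-vips (assumes the gem is installed and that
# `pdf_bytes` holds the downloaded file):
#   require "vips"
#   load_page = ->(data, i) { Vips::Image.new_from_buffer(data, "", page: i) }
#   each_pdf_page(pdf_bytes, load_page) do |page, i|
#     page.write_to_file("page_#{i}.png")
#   end
```

This does not avoid holding the compressed PDF in memory, but only one rendered page is handed to the block at a time.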

> Also, it seems weird that we have to initialize the object again to get to a particular page.

This is a consequence of the way that libvips handles multipage images -- it represents them as a single very tall, thin image, with the pages joined together vertically (a "toilet-roll" image, sorry). If your PDF has pages that are all the same size (for example, it has no pages in landscape), then you can load the whole PDF in one go and loop over pages without reinitialisation.

Sadly many PDFs are not like this, so to work for all PDFs, where each page can be a different size, you need to reinitialise.

With a PDF where all pages are the same size you can do:

$ irb
irb(main):001:0> require 'vips'
=> true
irb(main):002:0> x = Vips::Image.new_from_file "nipguide.pdf", n: -1
=> #<Image 595x48836 uchar, 4 bands, srgb>
irb(main):003:0> x.get "page-height"
=> 842
irb(main):004:0>

Then you can use crop to pull out pages and libvips will render them to bitmaps on demand.

pyvips has pagesplit() and pagejoin() convenience methods to turn these tall, thin images into arrays of page images. We should probably add them to ruby-vips as well.
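
Until something like that exists in ruby-vips, the crop approach can be sketched by hand. This is a hypothetical `pagesplit`, not the pyvips implementation. It assumes the image responds to `width`, `height`, `get("page-height")` and `crop(left, top, width, height)`. For the 595x48836 image above, with a page height of 842, it would produce 58 crops.

```ruby
# Hypothetical pagesplit: cut a tall "toilet-roll" image into one crop
# per page. Assumes every page has the same height, as reported by the
# "page-height" metadata field.
def pagesplit(image)
  page_height = image.get("page-height")
  unless (image.height % page_height).zero?
    raise ArgumentError, "image height is not a whole number of pages"
  end

  n_pages = image.height / page_height
  Array.new(n_pages) do |index|
    # Pages are stacked vertically, so only the top offset changes.
    image.crop(0, index * page_height, image.width, page_height)
  end
end

# Usage sketch (needs the ruby-vips gem and a PDF with same-size pages):
#   require "vips"
#   x = Vips::Image.new_from_file("nipguide.pdf", n: -1)
#   pagesplit(x).each_with_index do |page, i|
#     page.write_to_file("page_#{i}.png")
#   end
```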

4 replies

kimyu92 May 12, 2023
Author

pagesplit() and pagejoin() would definitely be great additions for PDFs with consistently sized pages. However, I think even adding Vips::Image#get_page to abstract the reinitialization would be a great addition.

# file_name is assumed to be known to the object
def get_page(index)
  Vips::Image.pdfload(file_name, access: :sequential, page: index)
end

Also, is there a way to get file_size from a PDF? I couldn't find one 🤦‍♂️; at least it's not listed in get_fields.

jcupitt May 12, 2023
Maintainer

> is there a way to get file_size from pdf

You can use any file size API. I don't think ruby-vips needs to duplicate this, does it?
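
For example, Ruby's standard library already reports sizes for both files on disk and data held in memory. A small sketch, where the two method names are hypothetical wrappers chosen only for illustration:

```ruby
# Size in bytes of a file on disk, using only the Ruby standard library.
def file_size_on_disk(path)
  File.size(path)
end

# Size in bytes of data already held in memory (for example, a PDF that
# was downloaded into a String).
def buffer_size(bytes)
  bytes.bytesize
end
```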

kimyu92 May 12, 2023
Author

Maybe. 😅 A #get_size method would probably be convenient for both images and PDFs.

I do think we may also want to consider separating PDF instantiation into its own class. Vips::Pdf.load seems more fluent, and instance methods like #pages or #meta would make it more intuitive to use from an OOP standpoint.

jcupitt May 12, 2023
Maintainer

Ah maybe. Though this page stuff works for any multipage format, so it would need to include GIF, WEBP, TIFF, HEIC, AVIF, etc. etc. And reinit is only necessary if the page size changes, so having two APIs is useful.
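
A format-agnostic wrapper along those lines could be sketched as follows. `PagedFile` and its methods are hypothetical and not part of ruby-vips. The sketch assumes the injected loader accepts a `page` option and that the loaded image reports an "n-pages" field, which will not hold for every format.

```ruby
# Hypothetical wrapper that hides the per-page reload behind #page(index).
# `loader` is any callable taking (path, **options) and returning an
# image-like object that responds to #get.
class PagedFile
  def initialize(path, loader)
    @path = path
    @loader = loader
  end

  # Number of pages, read once from the "n-pages" metadata of page 0.
  def page_count
    @page_count ||= @loader.call(@path, page: 0).get("n-pages")
  end

  # Reload the file at a single page, so pages of different sizes work.
  def page(index)
    unless (0...page_count).cover?(index)
      raise IndexError, "page #{index} out of range"
    end

    @loader.call(@path, page: index)
  end

  # Yield each page in order, or return an Enumerator without a block.
  def each_page
    return enum_for(:each_page) unless block_given?

    page_count.times { |index| yield page(index) }
  end
end

# Usage sketch (assumes ruby-vips is installed):
#   require "vips"
#   loader = ->(path, **opts) { Vips::Image.new_from_file(path, **opts) }
#   doc = PagedFile.new("a.pdf", loader)
#   doc.each_page.with_index { |page, i| page.write_to_file("page_#{i}.png") }
```

Because the loader is passed in, the same class would cover any format whose loader accepts `page`, while the single-load crop approach stays available for files whose pages are all the same size.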
