Measure space that content uses in a PDF file. #3904
Replies: 3 comments
All measures are in points, 1 point = 1/72 inch.
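For anyone who wants the conversion spelled out, here is a minimal sketch in plain Python. The function names are made up for illustration and are not part of PyMuPDF; the only facts used are 1 point = 1/72 inch and 1 inch = 2.54 cm.

```python
POINTS_PER_INCH = 72.0   # 1 point = 1/72 inch
CM_PER_INCH = 2.54       # exact by definition


def points_to_inches(points: float) -> float:
    """Convert a length in PDF points to inches."""
    return points / POINTS_PER_INCH


def points_to_cm(points: float) -> float:
    """Convert a length in PDF points to centimeters."""
    return points / POINTS_PER_INCH * CM_PER_INCH


def cm_to_points(cm: float) -> float:
    """Convert a length in centimeters to PDF points."""
    return cm / CM_PER_INCH * POINTS_PER_INCH


if __name__ == "__main__":
    # Height of an A4 page in centimeters, rounded for display.
    print(round(points_to_cm(841.89), 2))
    # A 15 cm tall block expressed in points.
    print(round(cm_to_points(15), 2))
```

So a height measured in points can be turned into centimeters (or back) with one division and one multiplication.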
Oh I see. So if I know my PDFs are A4 size (595.28 points x 841.89 points), my real content space is the total A4 size minus margins, footer and header. Then when iterating over the ... If I come up with some code I'll paste it in the comments.
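The "page size minus margins, header and footer" idea could be sketched like this. It is plain Python with hypothetical names (`ContentArea`, `content_area`), and the 50 point margins in the usage example are placeholder values, not something taken from a real document. It assumes the origin is at the top-left of the page with y growing downward, which is how PyMuPDF reports coordinates.

```python
from dataclasses import dataclass

A4_WIDTH_PT = 595.28
A4_HEIGHT_PT = 841.89


@dataclass(frozen=True)
class ContentArea:
    """Rectangle of the page body, in points (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def content_area(page_width, page_height, left=0.0, right=0.0,
                 header=0.0, footer=0.0) -> ContentArea:
    """Shrink the page rectangle by side margins, header and footer."""
    area = ContentArea(left, header, page_width - right, page_height - footer)
    if area.width <= 0 or area.height <= 0:
        raise ValueError("margins leave no room for content")
    return area


if __name__ == "__main__":
    # Placeholder margins: 50 pt on every side.
    body = content_area(A4_WIDTH_PT, A4_HEIGHT_PT,
                        left=50, right=50, header=50, footer=50)
    print(body.width, body.height)
```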
I did a proof of concept, and it's actually not hard, as long as all you want is to read the text blocks inside a PDF page and calculate what percentage of the content area they occupy. In my case, all I need to know is what percentage of the page height the text blocks occupy. So if I get text blocks
My full proof of concept is here: https://gist.github.com/douglasmiranda/5105e1ee71fecf2bdd923e6bef65f95a
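This is not the code from the gist, just a minimal sketch of the height-percentage calculation described above. It assumes each block is a tuple starting with `(x0, y0, x1, y1, ...)` in points, which is the shape PyMuPDF's `page.get_text("blocks")` returns; the function name and the sample numbers are made up for illustration. Blocks are clipped to the body, so anything sitting in the header or footer is ignored.

```python
def height_fill_percent(blocks, body_top, body_bottom):
    """Percentage of the body height covered by the summed block heights.

    blocks: iterable of tuples whose first four items are x0, y0, x1, y1.
    body_top / body_bottom: y coordinates (points) bounding the page body.
    Assumes a single column, i.e. blocks do not sit side by side.
    """
    body_height = body_bottom - body_top
    if body_height <= 0:
        raise ValueError("body_bottom must be greater than body_top")

    used = 0.0
    for block in blocks:
        y0, y1 = block[1], block[3]
        # Clip the block to the body so header/footer text is not counted.
        top = max(y0, body_top)
        bottom = min(y1, body_bottom)
        if bottom > top:
            used += bottom - top
    return 100.0 * used / body_height


if __name__ == "__main__":
    sample_blocks = [
        (50, 10, 500, 40, "header text"),    # fully inside the header
        (50, 50, 500, 250, "paragraph 1"),
        (50, 300, 500, 500, "paragraph 2"),
    ]
    print(height_fill_percent(sample_blocks, body_top=50, body_bottom=850))
```

Summing heights like this only makes sense for single-column pages; side-by-side blocks would be counted twice.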
As a tl;dr of this essay, I'll just put something I say at the end right here at the beginning:
Hi, so what I'm trying to do is read a PDF file (which could be two-column), identify the blocks of content, and check how much space is filled with content inside the body of that file.
Inspired by these articles:
Following the examples, I checked I can access the rect.height. multi_column being: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/multi_column.py
Which leads me to believe that I can just run through my pages (ignoring header and footer with the parameters of column_boxes) and get the height of each content block (rect); for this example I only care about height. At the end I'll have how much of the page main body (ignoring header and footer) is filled with textual content.
Like, let's say I can measure the first block of content in the first image as being 15 centimeters tall.
Now for the actual question:
Is that value rect.height what I am looking for? If yes, what type of measure is that? Because in that case I would try to convert that unit of measurement to some other I want.
More specifically, if I could just measure how much of the main body of the PDF pages (not including header and footer) is filled with content, I would be happy. Like main_body = 100% of the page space, content fills 85% of main_body.
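For the two-column case, one possible way to get a "main_body = 100%" figure is to merge the vertical extents of the blocks, so two blocks sitting side by side count only once. The sketch below is my own assumption about how to do that, not something taken from multi_column.py: the function names are hypothetical and the rectangles are plain `(x0, y0, x1, y1)` tuples in points.

```python
def merged_height(intervals):
    """Total length covered by the union of (top, bottom) intervals."""
    total = 0.0
    current_top = current_bottom = None
    for top, bottom in sorted(intervals):
        if current_bottom is None or top > current_bottom:
            # Gap found: close the running interval and start a new one.
            if current_bottom is not None:
                total += current_bottom - current_top
            current_top, current_bottom = top, bottom
        else:
            # Overlap (e.g. left and right column): extend the running interval.
            current_bottom = max(current_bottom, bottom)
    if current_bottom is not None:
        total += current_bottom - current_top
    return total


def body_fill_percent(rects, body_top, body_bottom):
    """Percentage of the body height that has content in at least one column."""
    body_height = body_bottom - body_top
    if body_height <= 0:
        raise ValueError("body_bottom must be greater than body_top")
    clipped = []
    for x0, y0, x1, y1 in rects:
        top, bottom = max(y0, body_top), min(y1, body_bottom)
        if bottom > top:
            clipped.append((top, bottom))
    return 100.0 * merged_height(clipped) / body_height


if __name__ == "__main__":
    rects = [
        (50, 100, 290, 500),   # left column
        (305, 100, 545, 300),  # right column, overlaps the left one vertically
        (50, 600, 545, 700),   # full-width block further down
    ]
    print(body_fill_percent(rects, body_top=50, body_bottom=850))
```

This measures how far down the body the content reaches, treating a row as filled if either column has text there. If a half-empty right column should lower the score, comparing summed block areas against the body area would be the alternative.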