I have been working with PDFs using PyMuPDF for some time, but I recently came across tagged PDFs and am having a lot of difficulty parsing the StructTreeNode and understanding the low-level handling involved. My knowledge of tagged PDFs is limited.
So I have a couple of questions:
How do I parse PyMuPDF's StructTreeRoot representation?
Is it possible to dump it into an XML-like structure, so that I can easily build a parser on top of it to extract tables and other important tagged structures?
Can I get bounding boxes for these structures from the StructTreeRoot itself, so that I can link them back to their source location on the PDF page, as I am used to doing with PyMuPDF?
Goal: to convert PDFs to plain text or a JSON structure while making use of the information from the tagging.
My introduction to tagged PDFs was this: https://accessible-pdf.info/en/basics/general/overview-of-the-pdf-tags/#table-elements
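To make the XML-dump question concrete, here is a minimal sketch of the kind of output being asked for. It does not use PyMuPDF at all: it assumes the structure tree has already been read into nested Python dicts. The dict keys (`tag`, `page`, `mcids`, `text`, `kids`), the sample tree, and the helper name `node_to_xml` are all hypothetical and chosen only for illustration. The `mcids` field stands in for the marked-content IDs that tagged PDFs use to tie a structure element to content on a page.

```python
import xml.etree.ElementTree as ET


def node_to_xml(node):
    """Convert one hypothetical structure node (a dict) into an XML element."""
    elem = ET.Element(node["tag"])
    # Optional attributes: page index and marked-content IDs, if present.
    if "page" in node:
        elem.set("page", str(node["page"]))
    if "mcids" in node:
        elem.set("mcids", " ".join(str(m) for m in node["mcids"]))
    if node.get("text"):
        elem.text = node["text"]
    # Recurse into child structure elements.
    for kid in node.get("kids", []):
        elem.append(node_to_xml(kid))
    return elem


# Hypothetical, hand-written tree; a real one would come from the PDF.
sample_tree = {
    "tag": "Document",
    "kids": [
        {"tag": "H1", "page": 0, "mcids": [0], "text": "Title"},
        {
            "tag": "Table",
            "page": 0,
            "kids": [
                {
                    "tag": "TR",
                    "kids": [
                        {"tag": "TH", "mcids": [1], "text": "Name"},
                        {"tag": "TD", "mcids": [2], "text": "Value"},
                    ],
                }
            ],
        },
    ],
}

xml_text = ET.tostring(node_to_xml(sample_tree), encoding="unicode")
print(xml_text)
```

With a dump of this shape, tables could be located with ordinary XML tooling, for example `root.iter("Table")`. Bounding boxes are not part of this sketch; they would have to be resolved separately from the page content that each marked-content ID refers to.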