Tagged pdfs <extract data using StructTreeRoot> #3901

mllife · 2024-09-27T13:01:46Z

I have been working with PDFs using PyMuPDF for some time, but I recently came across tagged PDFs and am having a lot of difficulty parsing the structure tree (StructTreeRoot) and understanding the low-level code that handles it. My knowledge of tagged PDFs is limited.
So, I have a couple of questions:

How do I parse PyMuPDF's representation of the StructTreeRoot?
Is it possible to dump it into an XML-like structure, so that I can easily build a parser on top of it to extract tables and other important tagged structures?
Can I get bounding boxes for these structures from the StructTreeRoot itself, so that I can link them back to their source location on the PDF page, as I am used to doing with PyMuPDF?
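
To make the second question concrete, here is a rough sketch of the kind of XML dump I have in mind. It uses only the Python standard library and does not call PyMuPDF at all. The nested-dict input shape (`tag`, `page`, `mcids`, `text`, `kids`) and the helper name `struct_to_xml` are my own assumptions for illustration, not anything PyMuPDF returns today.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory shape for one structure element (NOT a PyMuPDF type):
# {"tag": "Table", "page": 0, "mcids": [3], "text": "...", "kids": [...]}
def struct_to_xml(node):
    """Convert one nested-dict structure element into an XML element."""
    elem = ET.Element(node["tag"])
    if "page" in node:
        elem.set("page", str(node["page"]))
    if node.get("mcids"):
        # Marked-content IDs that tie the element back to page content
        elem.set("mcids", " ".join(str(m) for m in node["mcids"]))
    if "text" in node:
        elem.text = node["text"]
    for kid in node.get("kids", []):
        elem.append(struct_to_xml(kid))
    return elem

# Made-up sample data standing in for a parsed structure tree
sample = {
    "tag": "Table",
    "page": 0,
    "kids": [
        {"tag": "TR", "kids": [
            {"tag": "TH", "mcids": [1], "text": "Name"},
            {"tag": "TD", "mcids": [2], "text": "Value"},
        ]},
    ],
}

xml_text = ET.tostring(struct_to_xml(sample), encoding="unicode")
print(xml_text)
```

If something like this could be produced from the StructTreeRoot, a table parser would only need to walk the `Table`, `TR`, `TH`, and `TD` elements, and the page number and marked-content IDs might be enough to look up bounding boxes afterwards.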

Goal: convert PDFs to plain text or a JSON structure while making use of the tagging information.
My introduction to tagged PDFs was this: https://accessible-pdf.info/en/basics/general/overview-of-the-pdf-tags/#table-elements
