TypeError: argument of type 'PDFObjRef' is not iterable #1120

Closed
ibecav opened this issue Apr 11, 2024 · 5 comments

ibecav commented Apr 11, 2024

Describe the bug

Like several others (for example, #935), I hit this error when using the module. I encountered it while running an exact copy of your example script for extracting form values (https://github.com/jsvine/pdfplumber?tab=readme-ov-file#extracting-form-values) against the example PDF I am enclosing.

Have you tried repairing the PDF?

Yes. The result is below (I had to laugh, because yes, it really is a PDF file, and it certainly renders correctly on screen):

Traceback (most recent call last):
  File "C:\Users\PowellCh\Desktop\RProjs\production_hai\clogged_pdf_toilet.py", line 4, in <module>
    pdf = pdfplumber.open("example.pdf", repair=True)
  File "C:\Users\PowellCh\AppData\Roaming\Python\Python312\site-packages\pdfplumber\pdf.py", line 95, in open
    return cls(
  File "C:\Users\PowellCh\AppData\Roaming\Python\Python312\site-packages\pdfplumber\pdf.py", line 45, in __init__
    self.doc = PDFDocument(PDFParser(stream), password=password or "")
  File "C:\Users\PowellCh\AppData\Roaming\Python\Python312\site-packages\pdfminer\pdfdocument.py", line 752, in __init__
    raise PDFSyntaxError("No /Root object! - Is this really a PDF?")
pdfminer.pdfparser.PDFSyntaxError: No /Root object! - Is this really a PDF?

Code to reproduce the problem

As stated above, this is a straight copy of one of your examples, run against the example PDF.

import pdfplumber
from pdfplumber.utils.pdfinternals import resolve_and_decode, resolve

pdf = pdfplumber.open("example.pdf", repair=True)

def parse_field_helper(form_data, field, prefix=None):
    """ appends any PDF AcroForm field/value pairs in `field` to provided `form_data` list

        if `field` has child fields, those will be parsed recursively.
    """
    resolved_field = field.resolve()
    field_name = '.'.join(filter(lambda x: x, [prefix, resolve_and_decode(resolved_field.get("T"))]))
    if "Kids" in resolved_field:
        for kid_field in resolved_field["Kids"]:
            parse_field_helper(form_data, kid_field, prefix=field_name)
    if "T" in resolved_field or "TU" in resolved_field:
        # "T" is a field-name, but it's sometimes absent.
        # "TU" is the "alternate field name" and is often more human-readable
        # your PDF may have one, the other, or both.
        alternate_field_name  = resolve_and_decode(resolved_field.get("TU")) if resolved_field.get("TU") else None
        field_value = resolve_and_decode(resolved_field["V"]) if 'V' in resolved_field else None
        form_data.append([field_name, alternate_field_name, field_value])


form_data = []
fields = resolve(pdf.doc.catalog["AcroForm"])["Fields"]
for field in fields:
    parse_field_helper(form_data, field)
    

PDF file

FWIW it's a fillable form pdf created by the CDC and saved locally after filling.

example.pdf

Expected behavior

I expected it to work the same way your example code does. The code does work on other pdf files that aren't of this type.

Actual behavior

Traceback (most recent call last):
  File "C:\Users\PowellCh\Desktop\RProjs\production_hai\clogged_pdf_toilet.py", line 27, in <module>
    for field in fields:
TypeError: 'PDFObjRef' object is not iterable

Screenshots

I can't think of any that would be helpful, but please let me know if you'd like some.

Environment

  • pdfplumber version: [0.11.0]
  • Python version: [Python 3.12.2 (tags/v3.12.2:6abddd9, Feb 6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)] on win32]
  • OS: [Windows - although FWIW same error on a Mac]

Additional context

My apologies in advance if I forgot any details in this issue. I'm new to Python and your excellent module, but have experience in other languages. My current hypothesis, based on reading other issues, is that there is something nonstandard about the PDF itself, but I am hopeful there is a workaround.

@ibecav ibecav added the bug label Apr 11, 2024
@jeremybmerrill
Contributor

Looks like calling resolve() on fields fixes the problem.

Replace fields = resolve(pdf.doc.catalog["AcroForm"])["Fields"] with

fields = resolve(resolve(pdf.doc.catalog["AcroForm"])["Fields"])

and it looks like it works. I think we could modify the example code to do this.
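To illustrate why the extra resolve() helps, here is a minimal sketch that runs without pdfplumber installed. `StubRef` and the local `resolve` function are hypothetical stand-ins written for this example, not pdfminer's or pdfplumber's real classes. The assumption is that in the problem PDF the "Fields" entry is stored as an indirect reference to the list rather than as the list itself.

```python
class StubRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve(obj):
    """Follow one level of indirection; pass plain objects through unchanged."""
    return obj.resolve() if isinstance(obj, StubRef) else obj


# Toy catalog (assumed layout): "AcroForm" is a reference to a dict, and that
# dict's "Fields" entry is itself a reference to the list of fields.
catalog = {"AcroForm": StubRef({"Fields": StubRef(["field_a", "field_b"])})}

# Original example: only the AcroForm entry is resolved, so "Fields" is
# still a reference and cannot be iterated.
unresolved = resolve(catalog["AcroForm"])["Fields"]
try:
    iter(unresolved)
    iterable_without_resolve = True
except TypeError:
    iterable_without_resolve = False

# Suggested fix: resolve the "Fields" entry as well.
fields = resolve(resolve(catalog["AcroForm"])["Fields"])
```

Because the stand-in `resolve` returns non-reference objects unchanged, the extra call is harmless for PDFs that store the list inline, which would be consistent with the original example working on some files.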

@ibecav
Author

ibecav commented Apr 19, 2024

Thank you. I'll try this fix in a little bit. As to changing the example, I'll leave that to your discretion. I'm by no means an expert, but my understanding is that PDFs can be fickle and, as I noted, your example does work as is on some PDFs.

@ibecav
Author

ibecav commented Apr 19, 2024

Thank you, that does indeed seem to resolve the error.

@jsvine
Owner

jsvine commented Apr 19, 2024

Thanks @jeremybmerrill for the solution, and @ibecav for flagging. I've now updated the example code in the README.

@jsvine jsvine closed this as completed Apr 19, 2024
@jeremybmerrill
Contributor

great! I'm by no means an expert either -- all standards-compliant PDFs are alike, but all weird PDFs are weird in their own unique way -- but I do know that calling resolve() at every opportunity seems to make problems disappear.
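One way to picture "calling resolve() at every opportunity" is a helper that walks a nested structure and resolves each reference it meets. The sketch below uses a hypothetical `StubRef` stand-in and a hypothetical `resolve_deep` helper, neither of which is part of pdfplumber or pdfminer. The depth limit is an assumption added here because form fields can point back at their parents, so an unbounded walk might never finish.

```python
class StubRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve_deep(obj, max_depth=10):
    """Resolve references inside nested dicts and lists, up to max_depth levels."""
    if max_depth < 0:
        # Guard against reference cycles (e.g. a child pointing at its parent).
        raise RecursionError("structure is too deep or contains a cycle")
    if isinstance(obj, StubRef):
        return resolve_deep(obj.resolve(), max_depth - 1)
    if isinstance(obj, dict):
        return {k: resolve_deep(v, max_depth - 1) for k, v in obj.items()}
    if isinstance(obj, list):
        return [resolve_deep(v, max_depth - 1) for v in obj]
    return obj


# Toy data: references at several levels, as a "weird" PDF might have.
tree = {
    "AcroForm": StubRef(
        {"Fields": StubRef([StubRef({"T": "name", "V": StubRef("Ada")})])}
    )
}
resolved = resolve_deep(tree)

# A self-referencing structure trips the depth guard instead of looping forever.
cyclic = {}
cyclic["self"] = StubRef(cyclic)
try:
    resolve_deep(cyclic)
    cycle_detected = False
except RecursionError:
    cycle_detected = True
```

For real documents, resolving only the entries you are about to read, as in the README fix, is likely the safer default; a deep walk like this trades speed and cycle safety for convenience.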
