fix TypeError: argument of type 'PSLiteral' is not iterable #883

Closed
wants to merge 4 commits into from

@nathtest nathtest commented May 3, 2023

Pull request

Fix "TypeError: argument of type 'PSLiteral' is not iterable" for PDFs where "W" and "H" are null in obj.

Traceback of the error:

Traceback (most recent call last):
  File "/opt/editik_engine/src/editik_engine/commands_class/engine/engine.py", line 179, in generate_custom_documents
    SplitVpc(self.computed_out_path + file.fic_nom, self.computed_out_path).run()
  File "/opt/editik_engine/src/editik_engine/commands_class/document_generator/split_vpc.py", line 93, in run
    extracted_text = page.extract_text()
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfplumber/page.py", line 260, in extract_text
    return utils.extract_text(self.chars, **kwargs)
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfplumber/container.py", line 48, in chars
    return self.objects.get("char", [])
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfplumber/page.py", line 161, in objects
    self._objects = self.parse_objects()
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfplumber/page.py", line 222, in parse_objects
    for obj in self.iter_layout_objects(self.layout._objs):
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfplumber/page.py", line 110, in layout
    interpreter.process_page(self.page_obj)
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
    self.execute(list_value(streams))
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 933, in execute
    func(*args)
  File "/opt/editik_engine/src/venv3/lib/python3.9/site-packages/pdfminer/pdfinterp.py", line 840, in do_EI
    if 'W' in obj and 'H' in obj:
TypeError: argument of type 'PSLiteral' is not iterable
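
The failing line can be reproduced without pdfminer. The sketch below is only an illustration: FakeLiteral is a hypothetical stand-in for PSLiteral (assumed here to define neither __contains__ nor __iter__), and has_dimensions is a made-up helper that repeats the membership test from the last frame of the traceback.

```python
class FakeLiteral:
    """Hypothetical stand-in for pdfminer's PSLiteral.

    It defines no __contains__, __iter__ or __getitem__, so the `in`
    operator has nothing to fall back on.
    """

    def __init__(self, name):
        self.name = name


def has_dimensions(obj):
    # Same membership test as the failing line in do_EI.
    return 'W' in obj and 'H' in obj


# A dict supports `in`, so the check works as expected.
print(has_dimensions({'W': 10, 'H': 20}))

# An object without container support raises TypeError instead.
try:
    has_dimensions(FakeLiteral('Foo'))
except TypeError as exc:
    print("TypeError:", exc)
```

The first call prints True; the second ends up in the except branch, which matches the exception type seen in the traceback.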

How Has This Been Tested?

I've had a few PDFs at work that could not be read because W and H were present but null.
This is a better way to check whether those args are in obj.

This fix solved the issue for those PDFs and did not impact other PDFs.

We handle more than 10k PDFs per day, so I am confident this is well tested.

Checklist

  • I have read CONTRIBUTING.md.
  • [] I have added a concise human-readable description of the change to CHANGELOG.md.
  • I have tested that this fix is effective or that this feature works.
  • [] I have added docstrings to newly created methods and classes.
  • I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

@pietermarsman
Member

@nathtest Thanks for your time and contribution!

I'm happy to merge this if I can test it on a PDF that shows the issue. Can you share the PDF and the code that you are using?

@pietermarsman
Member

I'm closing this issue because there is no sample PDF to test on.

Also, I suspect this change does not do what it is intended to do. Consider the output of the following code snippet.

class A:
    def __contains__(self, item):
        print("__contains__")

    def __getattr__(self, item):
        print("__getattr__")

hasattr(A(), "a")
"a" in A()

The output is

__getattr__
__contains__

So hasattr uses __getattr__ to test whether a field or method exists, while the in test uses the __contains__ method. Note that the PDFStackT object implements the __contains__ method, but not the __getattr__ method, so your implementation falls back to the default attribute lookup. That checks whether H or W is a field or method on the object, which it never is. Hence, the change no longer raises the error you are experiencing, but it does introduce a new bug.
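
To make the difference concrete, here is a minimal sketch under a few assumptions: that the PR swapped the in test for hasattr (as the explanation above implies), that image parameters behave like a dict, and that FakeLiteral is a hypothetical stand-in for PSLiteral. The isinstance guard is one possible alternative for illustration, not necessarily the fix pdfminer should adopt.

```python
class FakeLiteral:
    """Hypothetical stand-in for PSLiteral: not a container."""


def has_dimensions_hasattr(obj):
    # Assumed shape of the PR's check: looks for attributes, not keys.
    return hasattr(obj, 'W') and hasattr(obj, 'H')


def has_dimensions_guarded(obj):
    # Illustrative alternative: only test membership on dict-like input.
    return isinstance(obj, dict) and 'W' in obj and 'H' in obj


image_params = {'W': 10, 'H': 20}

# hasattr never sees dictionary keys, so valid input is now rejected.
print(has_dimensions_hasattr(image_params))   # False
print(has_dimensions_hasattr(FakeLiteral()))  # False

# The guarded check accepts valid input and skips non-containers quietly.
print(has_dimensions_guarded(image_params))   # True
print(has_dimensions_guarded(FakeLiteral()))  # False
```

With hasattr both calls return False, so the TypeError disappears only because the branch is never taken. The guarded version keeps the branch reachable for dict-like objects while still avoiding the exception.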
