Rewrite PSBaseParser and add an optimized in-memory version #1041

dhdaines · 2024-09-19T03:43:01Z

The buffering in PSBaseParser (and associated code contortions like the mess that was PDFContentParser) was fragile and also unnecessary when doing file I/O: CPython's BufferedReader implementation is considerably faster, so we can just reimplement the "parser" (really a lexer) as a character-based state machine.

But this on its own would actually lead to an overall slowdown, because in reality, most of the time we aren't parsing PDF data from a buffered file, but from a BytesIO wrapped around an in-memory buffer. In this case, the old parser's buffering is redundant but nonetheless faster in practice, since it avoids the overhead of calling BytesIO.read repeatedly.

The obvious solution is to create a separate "parser" (really a lexer) using good old regular expressions the way our ancestors intended, and simply use this one when passed an in-memory buffer.
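To illustrate the idea, here is a minimal sketch of a regular-expression lexer over an in-memory bytes buffer. It is not the code in this PR: the names `TOKEN_RE` and `tokenize` are hypothetical, the token classes are simplified, and literal strings in parentheses are deliberately left out (they need nesting and escape handling).

```python
import re
from typing import Iterator, Tuple

# Hypothetical, simplified token grammar. Alternatives are tried in order,
# so a hex string is tried before "<<", and numbers before generic keywords.
TOKEN_RE = re.compile(
    rb"""
    (?P<comment>%[^\r\n]*)
  | (?P<hexstring><[0-9A-Fa-f\s]*>)
  | (?P<name>/[^\s()<>\[\]{}/%]*)
  | (?P<number>[-+]?(?:\d+\.\d*|\.\d+|\d+))
  | (?P<delim><<|>>|[\[\]{}])
  | (?P<keyword>[^\s()<>\[\]{}/%]+)
    """,
    re.VERBOSE,
)


def tokenize(data: bytes) -> Iterator[Tuple[int, str, bytes]]:
    """Yield (position, kind, raw token) for each token found in data."""
    # finditer scans the whole buffer in C and silently skips bytes that
    # match no alternative (whitespace, and anything this sketch omits).
    for match in TOKEN_RE.finditer(data):
        kind = match.lastgroup
        if kind == "comment":
            continue  # comments are dropped, as a lexer normally would
        yield match.start(), kind, match.group()


if __name__ == "__main__":
    for pos, kind, tok in tokenize(b"<< /Type /Page /Count 3 >>"):
        print(pos, kind, tok)
```

Because the whole buffer is already in memory, the regular expression engine does the scanning and no read calls are needed at all, which is the point of having a separate in-memory implementation.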

This also means that there is a bit less inheritance abuse in the code, since PSStackParser now needs to delegate to the appropriate implementation.

Fixes: #885 and #1025

There were also some details of the PDF parsing that were incorrect. Most notably, hex strings with an odd number of digits are supposed to be padded on the right, in big-endian fashion (i.e. <abcde> is equivalent to <abcde0>), but the existing code treated this as <abcd0e> instead.
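The padding rule for odd-length hex strings can be sketched as follows. This is an illustration only, not the PR's implementation; `decode_hex_string` is a hypothetical helper that takes the bytes between the angle brackets.

```python
def decode_hex_string(body: bytes) -> bytes:
    """Decode the contents of a PDF hex string (angle brackets removed)."""
    # Whitespace inside a hex string is ignored.
    digits = bytes(c for c in body if c not in b" \t\r\n\f\x00")
    # An odd number of digits means the final digit is missing and is
    # taken to be 0, so the padding goes on the right.
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


if __name__ == "__main__":
    print(decode_hex_string(b"abcde"))  # same bytes as decoding b"abcde0"
```

With this rule `<abcde>` decodes to the bytes AB CD E0, whereas the old behaviour described above would have produced AB CD 0E.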

Tested on the usual test suite with nox, profiled with cProfile and time.time.

Checklist

I have read CONTRIBUTING.md.
I have added a concise human-readable description of the change to CHANGELOG.md.
I have tested that this fix is effective or that this feature works.
I have added docstrings to newly created methods and classes.
I have updated the README.md and the readthedocs documentation. Or verified that this is not necessary.

This was referenced Sep 19, 2024

fix: fix the fix to #884 to fix #1025 #1030

Open

[Bug]: Scan time regression in 16.4.3 with --redo-ocr ocrmypdf/OCRmyPDF#1380

Closed

dhdaines force-pushed the new_parser branch 2 times, most recently from 05b7e66 to 55345e3 Compare September 19, 2024 13:25

dhdaines added 2 commits September 19, 2024 09:48

Rewrite PSBaseParser and add an optimized in-memory version

0a1ab08

fix: make sure it is really bytes in font.decode

1bb4cae

dhdaines force-pushed the new_parser branch from 55345e3 to 1bb4cae Compare September 19, 2024 13:48

dhdaines added 2 commits September 19, 2024 10:00

fix: a couple of invalid PDF fuzz cases

4c7d494

fix: match behaviour between PSFile / PSInMemory parser

6e9d73f
