Rewrite PSBaseParser and add an optimized in-memory version #1041
The buffering in `PSBaseParser` (and associated code contortions like the mess that was `PDFContentParser`) was fragile but also unnecessary when doing file I/O: CPython's `BufferedReader` implementation is considerably faster, so we can just reimplement the "parser" (really a lexer) as a character-based state machine.

But this would actually lead to an overall slowdown, because in reality, most of the time we aren't parsing PDF data from a buffered file, but from a `BytesIO` wrapped around an in-memory buffer. In this case, the buffering is redundant but nonetheless faster in practice, since it avoids the overhead of calling `BytesIO.read` repeatedly.

The obvious solution is to create a separate "parser" (really a lexer) using good old regular expressions the way our ancestors intended, and simply use this one when passed an in-memory buffer.
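To illustrate the regular-expression approach for in-memory data, here is a minimal sketch. It is not the code from this PR: `TOKEN_RE`, `lex`, and the token grammar are hypothetical and heavily simplified (no strings, hex strings, or name escapes), assuming only that the whole input is already available as `bytes`.

```python
import re
from typing import Iterator, Tuple

# Simplified, hypothetical token grammar (an assumption for illustration only).
# Whitespace and delimiter sets follow the usual PDF character classes.
TOKEN_RE = re.compile(
    rb"""
    (?P<whitespace>[\x00\t\n\f\r ]+)
  | (?P<comment>%[^\r\n]*)
  | (?P<number>[-+]?(?:\d+\.\d*|\.\d+|\d+))
  | (?P<name>/[^\x00\t\n\f\r ()<>\[\]{}/%]*)
  | (?P<keyword>[^\x00\t\n\f\r ()<>\[\]{}/%]+)
  | (?P<delimiter><<|>>|[\[\]{}])
    """,
    re.VERBOSE,
)


def lex(data: bytes) -> Iterator[Tuple[int, str, bytes]]:
    """Yield (offset, kind, raw_bytes) for each token in an in-memory buffer."""
    pos = 0
    while pos < len(data):
        m = TOKEN_RE.match(data, pos)
        if m is None:
            # Skip a byte this simplified grammar does not cover (e.g. strings).
            pos += 1
            continue
        pos = m.end()
        if m.lastgroup in ("whitespace", "comment"):
            continue
        yield m.start(), m.lastgroup, m.group()
```

Because the regex engine scans the buffer directly, there are no repeated `read` calls at all, which is the property the in-memory path relies on.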
This also means that there is a bit less inheritance abuse in the code, as `PSStackParser` needs to delegate to the appropriate implementation.

Fixes: #885 and #1025
Also there were some details of the PDF parsing that were incorrect. Most notably, hex strings with odd length are supposed to be padded in big-endian fashion (i.e. `<abcde>` is equivalent to `<abcde0>`) but this was not the case in the existing code (which treated this as `<abcd0e>` instead).

Tested on the usual test suite with nox, profiled with cProfile and time.time.
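The odd-length hex string rule mentioned above can be sketched as follows. This is an illustrative helper, not the code from this PR: `decode_hex_string` is a hypothetical name, and it assumes the caller has already stripped the enclosing `<` and `>`.

```python
import re


def decode_hex_string(body: bytes) -> bytes:
    """Decode the contents of a PDF hex string (angle brackets removed)."""
    # Whitespace inside a hex string is ignored.
    digits = re.sub(rb"[\x00\t\n\f\r ]+", b"", body)
    # With an odd number of digits, the missing final digit is taken to be 0,
    # so the padding goes at the end, not before the last digit.
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))
```

With this rule, `decode_hex_string(b"abcde")` and `decode_hex_string(b"abcde0")` give the same bytes, whereas the old behaviour corresponded to decoding `abcd0e`.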
Checklist