Make parsing of text be non-quadratic.
In Python, appending strings is not guaranteed to be constant-time, since they are documented to be immutable. In some corner cases, CPython is able to make these operations constant-time, but reaching into ETree objects is not such a case.

This leads to parse times being quadratic in the size of the text in the input in pathological cases where parsing outputs a large number of adjacent text nodes which must be combined (e.g. HTML-escaped values). Specifically, we expect doubling the size of the input to result in approximately doubling the time to parse; instead, we observe quadratic behavior:

```
In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.99 s ± 269 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
6.7 s ± 242 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
19.5 s ± 1.48 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
```

Switch from appending to the internal `str` to appending text to an array of text chunks, as appends to the array can be done in constant time. Using `bytearray` is a similar solution, but benchmarks slightly worse because the strings must be encoded before being appended.

This improves parsing of text documents noticeably:

```
In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.3 s ± 373 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
3.85 s ± 29.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
8.04 s ± 317 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
```
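
The chunk-list idea described above can be illustrated with a minimal sketch. This is not the code from this commit: `TextAccumulator` and its methods are hypothetical names chosen for illustration, and the sketch only assumes that text arrives as many small `str` pieces that must end up as one string.

```python
class TextAccumulator:
    """Hypothetical helper (not part of html5lib): collects text chunks
    in a list and joins them only when the full text is requested."""

    def __init__(self):
        self._chunks = []

    def append(self, text):
        # list.append is amortized O(1) and never copies the text
        # accumulated so far, unlike `s = s + text` on an immutable str.
        self._chunks.append(text)

    def value(self):
        # A single join is linear in the total length. Collapse the
        # chunks into one so that repeated reads stay cheap.
        if len(self._chunks) > 1:
            self._chunks = ["".join(self._chunks)]
        return self._chunks[0] if self._chunks else ""


acc = TextAccumulator()
for _ in range(5):
    acc.append("<")
result = acc.value()
```

With this shape, appending n chunks costs O(n) in total plus one linear join, whereas repeated `str` concatenation may copy the whole accumulated text on every append.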
Showing 2 changed files with 60 additions and 33 deletions.