Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performace parsing numeric entities #35

Open
GoogleCodeExporter opened this issue Mar 22, 2015 · 8 comments
Open

Poor performace parsing numeric entities #35

GoogleCodeExporter opened this issue Mar 22, 2015 · 8 comments

Comments

@GoogleCodeExporter
Copy link

html5lib seems to take roughly quadratic time to parse when you have a very
long sequence of number entities (like
http://www.archive.org/details/DuckandC1951) - is that a feature or a known
bug or something unexpected? 

(Uh, I think I mean more like exponential than quadratic. In any case, it's
really slow - ~24 seconds for 600KB, ~44 seconds for 800KB, etc)

Original issue reported on code.google.com by [email protected] on 6 May 2007 at 10:59

@GoogleCodeExporter
Copy link
Author

Uhh, I should have noted that was from Philip` on #html-wg

Original comment by [email protected] on 6 May 2007 at 11:03

@GoogleCodeExporter
Copy link
Author

Reproducible with something like:

for i in range(15):
    t = time.time()
    html5lib.HTMLParser().parse('�' * (10000*i))
    print i, time.time()-t

which gives me output like

1 0.723376989365
2 1.57332181931
3 2.78034710884
4 4.29565691948
5 6.53889489174
6 8.71565198898
7 11.5304911137
8 15.7282669544
9 19.5130882263
10 23.3610501289
11 31.4318139553
12 34.7210710049
13 39.443447113
14 46.4562900066

(I was right the first time about it being quadratic). Non-numeric entities 
have the 
same problem. Repeated tags (e.g. "<b></b>" * n) don't have that problem and 
run in 
linear time.

Original comment by [email protected] on 6 May 2007 at 11:32

@GoogleCodeExporter
Copy link
Author

[deleted comment]

1 similar comment
@GoogleCodeExporter
Copy link
Author

[deleted comment]

@GoogleCodeExporter
Copy link
Author

I believe the reason for the poor performance is related to
HTMLInputStream.{char(),unget()}.

HTMLInputStream uses a queue to keep track of characters that
have not yet been consumed.  The qeueu is implemented using a
list.  The insertion/deletion at the head of a list has O(n)
cost (http://www.python.org/doc/2.4/lib/module-collections).
Parsing a long sequence of number entities makes frequent calls
to unget() that insert to head of the queue, results in quadratic
runtime.

To improve this situation, we can use the deque object that is
O(1) for insertion/deletion at both head and tail of the queue.
The attached patch replaces the use of list with deque in
HTMLInputStream, and the following output compares the
performance before and after the patch.


    list               deque

0   0.00147581100464   0.00155997276306
1   1.44161701202      0.882177829742
2   2.9482858181       1.8475689888
3   4.52331590652      2.94394207001
4   6.12853693962      4.09121584892
5   7.86033296585      5.47660899162
6   9.69427108765      6.8242650032
7  11.7957978249       8.55761408806
8  14.7313899994      10.0578951836
9  17.2368850708      11.4750590324
10 20.0438449383      13.7235751152
11 24.4049508572      15.3424620628
12 27.3131408691      19.2072041035
13 31.4682328701      21.3321089745
14 35.9160130024      25.5739672184

Original comment by [email protected] on 15 Sep 2007 at 7:33

Attachments:

@GoogleCodeExporter
Copy link
Author

I checked this patch in with a pure python deque for python<=2.3. Thanks!


Original comment by [email protected] on 19 Sep 2007 at 10:43

@GoogleCodeExporter
Copy link
Author

I don't have a good solution for this for ruby yet, and the problem appears to 
be even worse:

>> 5.times {|i| t = Time.now; HTML5.parse('�' *(10_000 * i)); puts i, 
Time.now - t}
0
0.001261
1
3.637202
2
12.306799
3
26.885312
4
47.649831

Original comment by [email protected] on 2 Oct 2007 at 1:10

  • Added labels: ruby

@GoogleCodeExporter
Copy link
Author

Original comment by [email protected] on 8 Jun 2008 at 9:36

  • Added labels: Port-Ruby
  • Removed labels: ruby

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant