Poor performace parsing numeric entities #35

GoogleCodeExporter · 2015-03-22T13:17:03Z

html5lib seems to take roughly quadratic time to parse when you have a very
long sequence of number entities (like
http://www.archive.org/details/DuckandC1951) - is that a feature or a known
bug or something unexpected? 

(Uh, I think I mean more like exponential than quadratic. In any case, it's
really slow - ~24 seconds for 600KB, ~44 seconds for 800KB, etc)

Original issue reported on code.google.com by [email protected] on 6 May 2007 at 10:59

The text was updated successfully, but these errors were encountered:

GoogleCodeExporter · 2015-03-22T13:17:03Z

Uhh, I should have noted that was from Philip` on #html-wg

Original comment by [email protected] on 6 May 2007 at 11:03

GoogleCodeExporter · 2015-03-22T13:17:03Z

Reproducible with something like:

for i in range(15):
    t = time.time()
    html5lib.HTMLParser().parse('&#0;' * (10000*i))
    print i, time.time()-t

which gives me output like

1 0.723376989365
2 1.57332181931
3 2.78034710884
4 4.29565691948
5 6.53889489174
6 8.71565198898
7 11.5304911137
8 15.7282669544
9 19.5130882263
10 23.3610501289
11 31.4318139553
12 34.7210710049
13 39.443447113
14 46.4562900066

(I was right the first time about it being quadratic). Non-numeric entities 
have the 
same problem. Repeated tags (e.g. "<b></b>" * n) don't have that problem and 
run in 
linear time.

Original comment by [email protected] on 6 May 2007 at 11:32

GoogleCodeExporter · 2015-03-22T13:17:04Z

[deleted comment]

GoogleCodeExporter · 2015-03-22T13:17:04Z

[deleted comment]

GoogleCodeExporter · 2015-03-22T13:17:04Z

I believe the reason for the poor performance is related to
HTMLInputStream.{char(),unget()}.

HTMLInputStream uses a queue to keep track of characters that
have not yet been consumed.  The qeueu is implemented using a
list.  The insertion/deletion at the head of a list has O(n)
cost (http://www.python.org/doc/2.4/lib/module-collections).
Parsing a long sequence of number entities makes frequent calls
to unget() that insert to head of the queue, results in quadratic
runtime.

To improve this situation, we can use the deque object that is
O(1) for insertion/deletion at both head and tail of the queue.
The attached patch replaces the use of list with deque in
HTMLInputStream, and the following output compares the
performance before and after the patch.


    list               deque

0   0.00147581100464   0.00155997276306
1   1.44161701202      0.882177829742
2   2.9482858181       1.8475689888
3   4.52331590652      2.94394207001
4   6.12853693962      4.09121584892
5   7.86033296585      5.47660899162
6   9.69427108765      6.8242650032
7  11.7957978249       8.55761408806
8  14.7313899994      10.0578951836
9  17.2368850708      11.4750590324
10 20.0438449383      13.7235751152
11 24.4049508572      15.3424620628
12 27.3131408691      19.2072041035
13 31.4682328701      21.3321089745
14 35.9160130024      25.5739672184

Original comment by [email protected] on 15 Sep 2007 at 7:33

Attachments:

inputstream.py-patch

GoogleCodeExporter · 2015-03-22T13:17:04Z

I checked this patch in with a pure python deque for python<=2.3. Thanks!

Original comment by [email protected] on 19 Sep 2007 at 10:43

GoogleCodeExporter · 2015-03-22T13:17:04Z

I don't have a good solution for this for ruby yet, and the problem appears to 
be even worse:

>> 5.times {|i| t = Time.now; HTML5.parse('�' *(10_000 * i)); puts i, 
Time.now - t}
0
0.001261
1
3.637202
2
12.306799
3
26.885312
4
47.649831

Original comment by [email protected] on 2 Oct 2007 at 1:10

Added labels: ruby

GoogleCodeExporter · 2015-03-22T13:17:05Z

Original comment by [email protected] on 8 Jun 2008 at 9:36

Added labels: Port-Ruby
Removed labels: ruby

GoogleCodeExporter added Type-Defect Priority-Medium auto-migrated Port-Ruby labels Mar 22, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Poor performace parsing numeric entities #35

Poor performace parsing numeric entities #35

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

Poor performace parsing numeric entities #35

Poor performace parsing numeric entities #35

Comments

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015

GoogleCodeExporter commented Mar 22, 2015