
Implementing bucketed arrays (work in progress)


https://github.com/fizx/parsley/wiki/JSON-Structure#bucketed-arrays

Example page: http://en.wikipedia.org/wiki/Python_(programming_language)

We're interested in extracting the main sections from the page, i.e. groups of H2 titles and the paragraphs, links, etc. that follow them (everything in between logical sections, basically). In Wikipedia (desktop) pages, H2 titles, P paragraphs, etc. are all siblings; there's no hierarchy in the HTML source code, no <section> grouping sub-topics together... Yeah, I know, the mobile version of Wikipedia pages does have logical sections (in our case see http://en.m.wikipedia.org/wiki/Python_(programming_language)) :) But Wikipedia pages seem like a good playground.

...
<h2><span class="mw-headline" id="Naming">Naming</span><span class="mw-editsection">[<a href="/w/index.php?title=Python_(programming_language)&amp;action=edit&amp;section=14" title="Edit section: Naming">edit</a></span></h2>
<p>Python's name is derived from the television series <i><a href="/wiki/Monty_Python%27s_Flying_Circus" title="Monty Python's Flying Circus">Monty Python's Flying Circus</a></i>,
...
and <a href="/wiki/PyGTK" title="PyGTK">PyGTK</a>, which bind <a href="/wiki/Qt_(framework)" title="Qt (framework)">Qt</a> and <a href="/wiki/GTK" title="GTK" class="mw-redirect">GTK</a>, respectively, to Python; and <a href="/wiki/PyPy" title="PyPy">PyPy</a>, a Python implementation written in Python.</p>
...

Let's download the page and parse it with lxml

>>> import lxml.etree
>>> import lxml.html
>>> import urllib2
>>> url = 'http://en.wikipedia.org/wiki/Python_(programming_language)'
>>> ua = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36'
>>> req = urllib2.Request(url)
>>> req.add_header('User-Agent', ua)
>>> root = lxml.etree.parse(urllib2.urlopen(req), parser=lxml.html.HTMLParser()).getroot()
>>> root
<Element html at 0x3397710>

The main content is in div#mw-content-text, and our section boundaries are the H2 tags

>>> import lxml.cssselect
>>> xp_main = lxml.etree.XPath('//div[@id="mw-content-text"]')
>>> css_h2 = lxml.cssselect.CSSSelector('h2')
>>> css_h2.path
u'descendant-or-self::h2'

Define an XPath expression for the elements in between boundaries, i.e. following siblings that are not (and do not contain) an H2

>>> between = lxml.etree.XPath('following-sibling::*[not(%s)]' % css_h2.path)
>>> between.path
u'following-sibling::*[not(descendant-or-self::h2)]'

Loop over all section boundaries and group the sibling elements

>>> xp_main(root)
[<Element div at 0x33c2530>]
>>> main = xp_main(root)[0]
>>> allbuckets = []
>>> for t in css_h2(main):
    bucket = []
    bucket.append(t)
    bucket.extend(between(t))
    allbuckets.append(bucket)
>>> allbuckets
[[<Element h2 at 0x3397530>],
 [<Element h2 at 0x33971d0>,
  <Element table at 0x33a8ef0>,
  <Element div at 0x33a8f50>,
  <Element div at 0x33a8fb0>,
  <Element p at 0x33bf050>,
  ...
  <Element table at 0x33c21d0>,
  <Element table at 0x33c2230>,
  <Element div at 0x33c2290>,
  <Element p at 0x33c22f0>],
  ...
 [<Element h2 at 0x33a8dd0>,
  <Element ul at 0x33c2050>,
  <Element table at 0x33c20b0>,
  <Element ul at 0x33c2110>,
  <Element table at 0x33c2170>,
  <Element table at 0x33c21d0>,
  <Element table at 0x33c2230>,
  <Element div at 0x33c2290>,
  <Element p at 0x33c22f0>],
 [<Element h2 at 0x33a8e30>,
  <Element table at 0x33c20b0>,
  <Element ul at 0x33c2110>,
  <Element table at 0x33c2170>,
  <Element table at 0x33c21d0>,
  <Element table at 0x33c2230>,
  <Element div at 0x33c2290>,
  <Element p at 0x33c22f0>]]

Hm... the first bucket contains only one H2 element. Let's look into what matched:

>>> import pprint
>>> for b in allbuckets[:2]:
        readable = map(lambda e: "<%s: %s>" % (e.tag.upper(), lxml.etree.tostring(e, method="text", encoding=unicode)), b)
        pprint.pprint(readable[:5])
[u'<H2: Contents\n>']
[u'<H2: History[edit]\n>',
 u'<TABLE: \nThis section requires expansion. (March 2013)\n>',
 u'<DIV: \n\n\n\nGuido van Rossum, the creator of Python\n\n\n>',
 u'<DIV: Main article: History of Python\n>',
 u"<P: Python was conceived in the late 1980s[16] and its implementation was started in December 1989[17] by Guido van Rossum at CWI in the Netherlands as a successor to the ABC language (itself inspired by SETL)[18] capable of exception handling and interfacing with the Amoeba operating system.[1] Van Rossum is Python's principal author, and his continuing central role in deciding the direction of Python is reflected in the title given to him by the Python community, Benevolent Dictator for Life (BDFL).\n>"]

So the first bucket contains only the title of the table of contents. Looking at the markup, that H2 is nested inside div#toctitle and has no following siblings:

...
<tr>
<td>
<div id="toctitle">
<h2>Contents</h2>
</div>
<ul>
<li class="toclevel-1 tocsection-1"><a href="#History">
...
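
To double-check this (a quick sanity check, not part of the original session), we can look at the parent of the first H2 that the CSS selector matched; it should be the div#toctitle shown above:

>>> toc_h2 = css_h2(main)[0]
>>> toc_h2.getparent().get('id')
'toctitle'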

Let's try to select the section boundaries better: from div#mw-content-text, only consider direct H2 children, using an XPath expression

>>> xp_h2 = lxml.etree.XPath('h2')
>>> xp_h2.path
u'h2'
>>> between = lxml.etree.XPath('following-sibling::*[not(%s)]' % xp_h2.path)
>>> between.path
u'following-sibling::*[not(h2)]'
>>> allbuckets = []
>>> for t in xp_h2(main):
    bucket = []
    bucket.append(t)
    bucket.extend(between(t))
    allbuckets.append(bucket)
>>> for b in allbuckets[:2]:
        readable = map(lambda e: "<%s: %s>" % (e.tag.upper(), lxml.etree.tostring(e, method="text", encoding=unicode)), b)
        pprint.pprint(readable[:5])
[u'<H2: History[edit]\n>',
 u'<TABLE: \nThis section requires expansion. (March 2013)\n>',
 u'<DIV: \n\n\n\nGuido van Rossum, the creator of Python\n\n\n>']
[u'<H2: Features and philosophy[edit]\n>',
 u'<P: Python is a multi-paradigm programming language: object-oriented programming and structured programming are fully supported, and there are a number of language features which support functional programming and aspect-oriented programming (including by metaprogramming[22] and by magic methods).[23] Many other paradigms are supported using extensions, including design by contract[24][25] and logic programming.[26]\n>',
 u'<P: Python uses dynamic typing and a combination of reference counting and a cycle-detecting garbage collector for memory management. An important feature of Python is dynamic name resolution (late binding), which binds method and variable names during program execution.\n>']

That's better!

What sections did we get?

>>> for b in allbuckets[:-1]:
        readable = map(lambda e: "<%s: %s>" % (e.tag.upper(), lxml.etree.tostring(e, method="text", encoding=unicode)), b)
        pprint.pprint(readable[:1])
[u'<H2: History[edit]\n>']
[u'<H2: Features and philosophy[edit]\n>']
[u'<H2: Syntax and semantics[edit]\n>']
[u'<H2: Libraries[edit]\n>']
[u'<H2: Development environments[edit]\n>']
[u'<H2: Implementations[edit]\n>']
[u'<H2: Development[edit]\n>']
[u'<H2: Naming[edit]\n>']
[u'<H2: Use[edit]\n>']
[u'<H2: Impact[edit]\n>']
[u'<H2: See also[edit]\n>']
[u'<H2: References[edit]\n>']
[u'<H2: Further reading[edit]\n>']

That seems right.

Now let's wrap each bucket's elements in their own <section>

>>> sections = []
>>> for b in allbuckets:
    section = lxml.etree.Element("section")
    section.extend(b)
    sections.append(section)
>>> sections
[<Element section at 0x3914780>,
 <Element section at 0x39147d0>,
 <Element section at 0x3914820>,
 <Element section at 0x3914870>,
 <Element section at 0x39148c0>,
 <Element section at 0x3914910>,
 <Element section at 0x3914960>,
 <Element section at 0x39149b0>,
 <Element section at 0x3914a00>,
 <Element section at 0x3914a50>,
 <Element section at 0x3914aa0>,
 <Element section at 0x3914af0>,
 <Element section at 0x3914b40>,
 <Element section at 0x3914b90>]
>>> map(lambda s: [e.tag for e in s][:5], sections)
[['h2', 'table', 'div', 'div', 'p'],
 ['h2', 'p', 'p', 'p', 'p'],
 ['h2', 'div', 'p', 'h3', 'p'],
 ['h2', 'p', 'p', 'p', 'p'],
 ['h2', 'p', 'p', 'p'],
 ['h2', 'div', 'p', 'p', 'p'],
 ['h2', 'p', 'p', 'p', 'ul'],
 ['h2', 'p', 'p'],
 ['h2', 'div', 'p', 'p', 'p'],
 ['h2', 'p', 'ul', 'p', 'p'],
 ['h2', 'div', 'ul'],
 ['h2', 'div'],
 ['h2', 'ul'],
 ['h2', 'table', 'ul', 'table', 'table']]
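
As a quick sanity check (not in the original session), we can serialize one of the wrapped sections back to HTML; going by the list of titles above, sections[7] should be the "Naming" section quoted at the top (output omitted here):

>>> print lxml.etree.tostring(sections[7], pretty_print=True)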

Now, we should be able to use parslepy with these sections

>>> import parslepy
>>> parsley = {"title": "h2 span.mw-headline", "links(a)": [{"name": ".", "url": "@href"}]}
>>> parselet = parslepy.Parselet(parsley)
>>> map(lambda s: parselet.extract(s), sections)
[{'links': [{'name': u'edit]',
    'url': '/w/index.php?title=Python_(programming_language)&action=edit&section=1'},
   {'name': u'', 'url': '/wiki/File:Wiki_letter_w_cropped.svg'},
   {'name': u'expansion.',
    'url': '//en.wikipedia.org/w/index.php?title=Python_(programming_language)&action=edit'},
   {'name': u'', 'url': '/wiki/File:Guido_van_Rossum.jpg'},
   ...
   {'name': u'backported to the backwards-compatible Python 2.6 and 2.7.',
    'url': '/wiki/Backporting'},
   {'name': u'[21]', 'url': '#cite_note-pep-3000-21'}],
  'title': u'History'},
...
  'title': u'Further reading'},
 {'links': [{'name': u'edit]',
    'url': '/w/index.php?title=Python_(programming_language)&action=edit&section=20'},
   ...
   {'name': u'Comparison with closed source',
    'url': '/wiki/Comparison_of_open_source_and_closed_source'},
   {'name': u'Book:Free and Open Source Software',
    'url': '/wiki/Book:Free_and_Open_Source_Software'},
   {'name': u'Category:Free software', 'url': '/wiki/Category:Free_software'},
   {'name': u'Portal:Free software', 'url': '/wiki/Portal:Free_software'},
   {'name': u'', 'url': '/wiki/Wikipedia:Good_articles'}],
  'title': u'External links'}]
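
Since the bucketed-arrays idea comes from parsley's JSON structure docs (linked at the top), a natural last step (not part of the original session) would be to dump the per-section results to JSON; the output is the same data as above, just JSON-encoded:

>>> import json
>>> extracted = map(lambda s: parselet.extract(s), sections)
>>> print json.dumps(extracted, indent=2)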