User documentation? #489

Open
davaya opened this issue May 6, 2020 · 4 comments
@davaya

davaya commented May 6, 2020

Is there any tutorial documentation for this package? Something that would answer questions like this one, which comes up when parsing an HTML file:

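import html5lib

# fname is the path to an HTML file
with open(fname, 'rb') as f:
    doc = html5lib.parse(f)
# doc is the <html> element, so doc[0] is <head>
for e in list(doc[0]):
    print(e.tag)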

Why does the result look like this?

{http://www.w3.org/1999/xhtml}meta
{http://www.w3.org/1999/xhtml}title
{http://www.w3.org/1999/xhtml}link

A naive user would expect tag to return the literal value contained in the HTML element, not the tag prefixed with a qualifier of some sort. It would be helpful to have a document that explains why the prefix has been injected and how to configure the library to return unadorned tags.

@gsnedders
Member

https://html5lib.readthedocs.io/en/latest/ has docs, though it doesn't answer the why. https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.html5parser.HTMLParser specifically mentions the namespaceHTMLElements arg.

The why is quite simple: it's what the HTML spec says (almost all elements created when parsing HTML are inserted in the HTML namespace); you can see browsers do this via document.documentElement.namespaceURI on this page.
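As a rough illustration of what that prefix is: ElementTree writes a namespaced tag as `{namespace-uri}localname`. The sketch below uses only the standard library, with a small hand-written XHTML string standing in for html5lib's default output. The helper `local_name` and the prefix `h` are names made up for this example, not part of html5lib.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def local_name(tag):
    """Return the tag without a leading '{namespace}' part, if it has one."""
    # Comment nodes carry a non-string tag, so pass those through untouched.
    if not isinstance(tag, str):
        return tag
    return tag.rsplit("}", 1)[-1]


# A small namespaced document standing in for a parsed HTML page.
root = ET.fromstring(
    f'<html xmlns="{XHTML_NS}"><head><meta/><title>t</title></head><body/></html>'
)
head = root[0]

qualified = [e.tag for e in head]           # tags as ElementTree stores them
plain = [local_name(e.tag) for e in head]   # same tags, namespace stripped

# Lookups can also keep the namespace and spell it out through a prefix map.
title = root.find("h:head/h:title", {"h": XHTML_NS})

print(qualified)
print(plain)
print(title.text)
```

Stripping the prefix is only a display convenience; searches on the tree still need either the qualified name or a prefix map like the one above.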

@davaya
Author

davaya commented May 6, 2020

Thanks - creating a parser object explicitly with the proper args gives the desired results. New users reading the overview might benefit from some rationale (pros and cons) for choosing a particular tree type, and a note pointing out that namespacing is one of the options that can be specified.

It could be argued that new users might not be aware of namespacing and should not get it by default, while those who do need it would know enough to opt in.
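For anyone landing here later, a minimal sketch of the opt-out discussed above, assuming the default etree tree builder. `SAMPLE` is a made-up input string; `namespaceHTMLElements` is the argument named in the linked docs.

```python
import html5lib

SAMPLE = (
    "<!DOCTYPE html><html><head><title>Demo</title></head>"
    "<body><p>Hi</p></body></html>"
)

# Default behaviour: elements are placed in the XHTML namespace.
namespaced = html5lib.parse(SAMPLE)

# Opt out for a single call ...
plain = html5lib.parse(SAMPLE, namespaceHTMLElements=False)

# ... or once, on a parser object that can be reused.
parser = html5lib.HTMLParser(namespaceHTMLElements=False)
plain_again = parser.parse(SAMPLE)

print(namespaced.tag)               # qualified tag of the root element
print(plain.tag)                    # unadorned tag of the root element
print([e.tag for e in plain[0]])    # children of <head>
```

With the namespace switched off, plain ElementTree paths such as `plain_again.find('.//p')` work without a prefix map.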

@guettli

guettli commented Jul 9, 2021

I think a basic usage example would be helpful.

Example: parse HTML, replace the innerHTML of all <a> links with "super foo", and then write it out again.

Up to now the docs are just about parsing.

Please add an example of how to process the parsed data.

@guettli

guettli commented Jul 9, 2021

Something like this would be nice to have in the docs:

from xml.etree import ElementTree
from html5lib import HTMLParser

parser = HTMLParser(namespaceHTMLElements=False)

tree = parser.parse('''
  foo
  <h1>Moonlight</h1>
  bar''')

for e in tree.findall('.//h1'):
    e.text = 'Sunshine'

print(ElementTree.tostring(tree))

Source: https://stackoverflow.com/questions/68313619/how-to-replace-the-innerhtml-of-all-h1-tags-with-html5lib
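The `<a>` case mentioned earlier in the thread could look roughly like this. It is a sketch, not an official recipe: the input string is made up, and "replace the innerHTML" is taken to mean dropping every child of the link and setting plain text in its place.

```python
from xml.etree import ElementTree

import html5lib

parser = html5lib.HTMLParser(namespaceHTMLElements=False)
tree = parser.parse(
    '<p>See <a href="/x"><b>old</b> text</a> and <a href="/y">more</a>.</p>'
)

# Collect the links first so the tree is not modified while it is being walked.
for link in list(tree.iter("a")):
    # Remove child elements (their trailing text goes with them) ...
    for child in list(link):
        link.remove(child)
    # ... then set the new content; attributes and link.tail stay untouched.
    link.text = "super foo"

# Write it out again: once as HTML via html5lib, once as XML via ElementTree.
html_out = html5lib.serialize(tree, tree="etree")
xml_out = ElementTree.tostring(tree, encoding="unicode")

print(html_out)
print(xml_out)
```

The two outputs will not be identical: with its default options the html5lib serializer may leave out optional tags and attribute quotes, while `ElementTree.tostring` emits XML syntax.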
