BugFix: html pages containing <?xml version="1.0" encoding="utf-8"?> declarations #151

Open: jkehler wants to merge 1 commit into base: develop

@jkehler jkehler commented Oct 4, 2014

I ran into a problem with HTML pages that start with an XML declaration such as <?xml version="1.0" encoding="utf-8"?>.

It seems the lxml.html parser cannot handle a unicode string that carries an encoding declaration, as noted here: https://stackoverflow.com/questions/15302125/html-encoding-and-lxml-parsing

I have added a regex substitution that removes the declaration from the raw HTML before parsing, and the problem is now resolved. The traceback I was getting before the fix:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "goose/__init__.py", line 56, in extract
    return self.crawl(cc)
  File "goose/__init__.py", line 63, in crawl
    article = crawler.crawl(crawl_candiate)
  File "goose/crawler.py", line 90, in crawl
    doc = self.get_document(raw_html)
  File "goose/crawler.py", line 176, in get_document
    doc = self.parser.fromstring(raw_html)
  File "goose/parsers.py", line 54, in fromstring
    self.doc = lxml.html.fromstring(html)
  File "/home/jeff/.virtualenvs/python-goose/lib/python2.7/site-packages/lxml/html/__init__.py", line 723, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/home/jeff/.virtualenvs/python-goose/lib/python2.7/site-packages/lxml/html/__init__.py", line 613, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 3092, in lxml.etree.fromstring (src/lxml/lxml.etree.c:70691)
  File "parser.pxi", line 1823, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:106654)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
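
For illustration, the kind of regex substitution described above might look roughly like the following. This is a minimal sketch and not the code from the commit: the names `XML_DECLARATION_RE` and `strip_xml_declaration` are hypothetical, and the pattern assumes the declaration appears at the very start of the document (optionally after whitespace).

```python
import re

# Hypothetical pattern: optional leading whitespace, then a
# <?xml ... ?> declaration at the start of the document.
XML_DECLARATION_RE = re.compile(r'^\s*<\?xml[^>]*\?>', re.IGNORECASE)


def strip_xml_declaration(raw_html):
    """Return raw_html with a leading XML declaration removed, if present."""
    # count=1 so only the declaration at the start is touched.
    return XML_DECLARATION_RE.sub('', raw_html, count=1)


page = '<?xml version="1.0" encoding="utf-8"?>\n<html><body><p>hi</p></body></html>'
cleaned = strip_xml_declaration(page)
```

The cleaned string could then be handed to `lxml.html.fromstring` in place of the raw HTML. As the error message itself suggests, passing bytes instead of a unicode string is another way to avoid the exception.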

@jkehler jkehler changed the title Fix issue with html pages containing <?xml> declarations BugFix: html pages containing <?xml version="1.0" encoding="utf-8"?> declarations Oct 5, 2014
@grangier
Owner

Please provide a failing URL.
