Error when parsing an HTML page with character references without a semicolon #83

carloborsoi · 2013-06-28T20:47:10Z

Hi,

I am new to this kind of forum. I ran into a problem when using br.links(), and I think I have a solution.

Where can I publish and discuss this solution? The problem can be fixed directly in _sgmllib_copy.py, but a workaround in _html.py is also possible.

The problem is:
In some HTML pages, _sgmllib_copy.py assumes that some character references (e.g. &#39) are hexadecimal because the text that follows the digits ends with a letter in A-F. They are not hexadecimal, because there is no 'x' at the beginning (e.g. Gustaf&#39Aldo).
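
A minimal sketch of the failure, for illustration only. The pattern below is an assumption (an optional "x" followed by characters from the hexadecimal alphabet), not a copy of the code in _sgmllib_copy.py, and captured_name is a hypothetical helper:

```python
import re

# Assumed pattern, for illustration only: an optional "x" followed by
# one or more characters from the hexadecimal alphabet.
CHARREF = re.compile(r"&#(x?[0-9a-fA-F]+)")

def captured_name(text):
    """Return the name captured from the first character reference, or None."""
    m = CHARREF.search(text)
    return m.group(1) if m else None

# The "A" of "Aldo" is swallowed into the reference name.
name = captured_name("Gustaf&#39Aldo")

# There is no leading "x", so the name is parsed in base 10 and fails.
try:
    int(name, 10)
    parsed = True
except ValueError:
    parsed = False
```

With a pattern like this, the name is no longer a plain decimal number, so the base-10 conversion raises ValueError.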

Solution:
To avoid changing _sgmllib_copy.py, line 315 of _html.py can be changed from:

    if name.startswith("x"):
        name, base = name[1:], 16

to:

    if name.startswith("x"):
        name, base = name[1:], 16
    else:
        name = filter(lambda x: x.isdigit(), name)
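
One possible variant of this workaround, as a sketch only. The filter call keeps every digit in the name, so a name such as "39A1" would become "391", and on Python 3 filter returns an iterator rather than a string. The version below keeps only the leading digits instead. parse_charref_name is a hypothetical helper, not a function in mechanize:

```python
import re

_LEADING_DIGITS = re.compile(r"[0-9]+")

def parse_charref_name(name):
    """Return (code_point, leftover) for a captured reference name.

    leftover is the text that was captured by mistake and belongs
    to the surrounding document, e.g. the "A" in "39A".
    """
    # A leading "x" marks a genuine hexadecimal reference.
    if name[:1] in ("x", "X"):
        return int(name[1:], 16), ""
    # Otherwise only the leading decimal digits are the reference.
    m = _LEADING_DIGITS.match(name)
    if m is None:
        raise ValueError("not a numeric character reference: %r" % name)
    digits = m.group(0)
    return int(digits, 10), name[len(digits):]
```

For the example above, parse_charref_name("39A") gives the code point 39 (the apostrophe) and returns "A" as leftover text, so the caller can decide whether to put it back into the output.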
