Error when parsing an HTML page with character references without a semicolon #83

Open
carloborsoi opened this issue Jun 28, 2013 · 0 comments

Hi,

I am new to this kind of forum. I found a problem when using br.links(), and I think I have a solution.

Where can I publish and discuss this solution? The problem can be fixed directly in _sgmllib_copy.py, but a workaround in _html.py is also possible.

The problem is:
In some HTML pages, _sgmllib_copy.py assumes that some character references (e.g. &#39) are hexadecimal because they end with a letter in A-F. They are not hexadecimal, because there is no 'x' at the beginning (e.g. Gustaf&#39Aldo).

Solution:
To avoid changing _sgmllib_copy.py, it is possible to change _html.py at line 315 from:
    if name.startswith("x"):
        name, base = name[1:], 16
to
    if name.startswith("x"):
        name, base = name[1:], 16
    else:
        name = filter(lambda x: x.isdigit(), name)
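
For anyone who wants to try the idea outside mechanize, here is a minimal standalone sketch of the same workaround. The function name charref_to_text is hypothetical and is not part of mechanize. It assumes that the surrounding code in _html.py converts the name with int(name, base), and it is written for Python 3 (chr, and "".join instead of filter on a string).

```python
import string


def charref_to_text(name):
    """Hypothetical helper: turn the text after '&#' into a character."""
    base = 10
    if name.startswith("x"):
        # Hexadecimal reference such as 'x27'.
        name, base = name[1:], 16
    else:
        # Decimal reference: drop anything that is not an ASCII digit,
        # so a name like '39A' is read as 39.
        name = "".join(ch for ch in name if ch in string.digits)
    # Assumption: the caller converts the cleaned name with int(name, base).
    return chr(int(name, base))


print(charref_to_text("39A"))  # decimal 39, the apostrophe
print(charref_to_text("x27"))  # hexadecimal 27, also the apostrophe
```

Note that this keeps every digit found in the name, as the proposed change does, and it still raises ValueError if no digits remain.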
