
Invalid XML character break docket parsers #348

Open

cgdeboer-toptal opened this issue Oct 28, 2020 · 8 comments

@cgdeboer-toptal
Summary

When a page on PACER (or elsewhere) contains characters that are outside the set of valid XML characters, lxml's html5 parser will fail.

This is not hypothetical: I was scraping a docket at the Ohio Northern Bankruptcy Court (ohnb), and docketreport.parse() failed because of invalid XML characters coming back in the response.
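
As a minimal illustration of the underlying behaviour (not PACER-specific), lxml refuses to store such a character in a tree at all:

from lxml import etree

root = etree.Element("docket")
try:
    # \x03 (END OF TEXT) is outside the XML 1.0 Char production,
    # so lxml rejects it when the text is assigned to the tree.
    root.text = "Case closed\x03"
except ValueError as err:
    print(err)
    # All strings must be XML compatible: Unicode or ASCII,
    # no NULL bytes or control characters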

Tasks

  • update the code in juriscraper/lib/html_utils.py to escape these characters, probably using a regex so we don't lose too much speed (see the sketch after this list).
  • capture the raw response for the docket that failed to parse, and include it in the test suite.
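
A rough sketch of what that html_utils change could look like (the function names here are placeholders, not the existing juriscraper API) is to drop anything outside the XML 1.0 Char production before the text ever reaches the parser:

import re

import html5lib

# Everything outside Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
# [#xE000-#xFFFD] | [#x10000-#x10FFFF] gets removed.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD"
    "\U00010000-\U0010FFFF]+"
)


def strip_bad_xml_chars(text):
    """Remove characters that lxml refuses to store in a tree."""
    return _INVALID_XML_CHARS.sub("", text)


def parse_html5(text):
    """Placeholder helper: clean the response, then parse as before."""
    return html5lib.parse(strip_bad_xml_chars(text), treebuilder="lxml")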

Questions

  • has anyone seen this type of error coming from a PACER scrape? You would have seen an "All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters" traceback bubble up the stack.
  • any opposition to having someone (possibly me) work on a patch to html_utils to guard against this kind of data?
@mlissner changed the title from "Scraper Stability Improvements" to "Invalid XML character break docket parsers" on Oct 28, 2020
@mlissner
Member

Nice find. We've seen this before in other areas, so it's not surprising to see it here too. I did some performance testing on this a while back:

https://stackoverflow.com/a/25920392/64911

The code that's in CL to handle this is:

import re


def filter_invalid_XML_chars(input):
    """XML allows:

       Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

    This strips out everything else.

    See: http://stackoverflow.com/a/25920392/64911
    """
    if isinstance(input, str):
        # Only filter strings; pass everything else through unchanged.
        return re.sub(
            "[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD"
            "\U00010000-\U0010FFFF]+",
            "",
            input,
        )
    else:
        return input
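
For illustration (this part isn't in CL), it just drops the offending bytes from strings and passes everything else through:

>>> filter_invalid_XML_chars("Docket entry\x03 text")
'Docket entry text'
>>> filter_invalid_XML_chars(b"bytes pass through unchanged")
b'bytes pass through unchanged'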

I'd definitely welcome a PR for this.

@cgdeboer-toptal
Author

https://stackoverflow.com/a/25920392/64911

I did some reading on this earlier today before reading your post, and stumbled upon the same SO post... I should have looked more closely at the author. Will work on this over the weekend.

@cgdeboer-toptal
Author

@mlissner thanks for the link. I've been doing a little digging on this and haven't found a solution that works quite yet. I've got a sample text file with the bad payload from the docket. I'm examining the way html5lib parses characters.

I'm still working through this.

Side note: I see the code in CL, but I'm not seeing where it's used anywhere in that repo.

@mlissner
Member

mlissner commented Nov 4, 2020

Weird, yeah, looks like it's not used anymore. I suppose we could delete it since it's easy to find again on StackOverflow.

Do you need help with your progress? Sounds like you're just checking in, but if you're frustrated maybe somebody can take a look.

@johnhawkinson
Contributor

Not just to be contrarian, but I have long been convinced the StackOverflow post does not offer the right solution.
Is there a test case available?

@cgdeboer-toptal
Author

That's sort of what I'm finding @johnhawkinson. I'll post a PR with the failing test case.

@cgdeboer-toptal
Author

cgdeboer-toptal commented Nov 4, 2020

The traceback on this goes back to a character token produced by html5lib, which then attempts to insert the disallowed character into the tree:

{'type': 1, 'data': '\x03'}

PR: #349

  • I've looked at using lxml's XMLParser first, as a way to use its internal XML rules to strip out invalid characters, then stringify and reparse with html5lib to keep the API intact... that doesn't seem to perform well and removes too much data.
  • I've also looked at the solution you provided above, @mlissner.
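
To make the failing case concrete, a regression test could be shaped roughly like the one below; the fixture path is a placeholder for the captured ohnb response, and filter_invalid_XML_chars is the helper from the earlier comment, not necessarily what the PR ends up doing:

import html5lib


def test_docket_with_control_chars_parses():
    # Placeholder fixture: a saved PACER response that contains \x03.
    with open("tests/fixtures/ohnb_bad_chars.html", encoding="utf-8") as f:
        payload = f.read()
    assert "\x03" in payload
    # Without cleaning, html5lib's lxml treebuilder raises
    # ValueError: All strings must be XML compatible: ...
    cleaned = filter_invalid_XML_chars(payload)  # helper from the comment above
    doc = html5lib.parse(cleaned, treebuilder="lxml")
    assert doc is not None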

@mlissner
Member

mlissner commented Nov 4, 2020

Not just to be contrarian, but I have long been convinced the StackOverflow post does not offer the right solution.
Is there a test case available?

Can you elaborate?
