Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Misbehavior of HTML Parsing #349

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

cgdeboer-toptal
Copy link

No description provided.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things:

  1. Let's uncomment the bit of code that does the actual work here.

  2. In your function, can you please provide a docstring? Also, I don't think it needs double underscores. It can be called from anywhere, right? The replacement string is pretty different than the one on Stackoverflow. Any idea why?

  3. In your test, do we need a file or could we just do a unit test for the function you added? If we do need a file, can we make it much, much smaller so that it's clear which part of it is problematic? As it is, Github won't even show me the file in the UI.

@cgdeboer-toptal
Copy link
Author

Response:

  1. It's commented out because it doesn't work yet. Nor does the original example from CL. Sort of WIP MR just to show the failing test. As I commented here Invalid XML character break docket parsers #348 (comment) .

  2. Sure thing. My original line of reasoning here is that the filtering of the invalid XML characters/bytes/whatevers is really only relevant for that one function and has no use outside of parsing the html. Happy to make that a public function with a docstring, once we figure out what to put there.

  3. I added the actual docket content from PACER to help narrow down the issue. Right now, all we know is that the contents of that file (the HTML from PACER) can't be parsed.. once we have a test that passes for that file, we can reduce it's size to contain just the nasty bits.

P.S apologies for accidentally closing and re-opening.

@mlissner
Copy link
Member

mlissner commented Nov 4, 2020

Great, thanks @cgdeboer-toptal. I shouldn't cross streams, but 2.0.3 is pushed, by the way.

@CLAassistant
Copy link

CLAassistant commented Dec 18, 2021

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ flooie
❌ Calvin DeBoer


Calvin DeBoer seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants